## Get data

1. Download SI-NLI from [link](https://www.clarin.si/repository/xmlui/handle/11356/1707).
2. Load libraries.
3. Import the ```train.tsv``` file.

In [None]:
import pandas as pd

df = pd.read_csv('SI-NLI/train.tsv', sep='\t')
df.head()

## Basic statistics

1. How many examples are in the dataframe?
2. How many variables are in the dataframe?
3. Count values in the ```label``` column.
4. Are there any missing values in the data?
5. Count the number of missing values per column.

In [None]:
# 1. How many examples are in the dataframe?
len(df)

In [None]:
# 2. How many variables are in the dataframe?
len(df.columns)

In [None]:
# 3. Count values in the ```label``` column.
df['label'].value_counts()

In [None]:
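# 4. Are there any missing values in the data?
# Built-in any(df.isna()) would iterate over column names and always be True,
# so reduce with pandas instead: first per column, then across columns.
df.isna().any().any()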

In [None]:
# 5. Count the number of missing values per column.
df.isna().sum()
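
The difference between Python's built-in `any` and the pandas reductions is easy to miss, so here is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not part of SI-NLI). Iterating over a DataFrame yields its column labels, which is why the built-in check cannot detect missing values.

```python
import pandas as pd

# Hypothetical toy frame with no missing values at all.
toy = pd.DataFrame({"premise": ["a", "b"], "label": ["neutral", "entailment"]})

# Built-in any() iterates over the column labels (non-empty strings),
# so the result is truthy even though nothing is missing.
builtin_result = any(toy.isna())

# DataFrame.any() reduces per column; the second .any() reduces to one boolean.
pandas_result = bool(toy.isna().any().any())

print(builtin_result, pandas_result)
```

The same pandas reduction can be applied to `df` from the cells above.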

## Subset observations and variables

1. Select ```premise``` column and store it in a list.
2. Print first 3 rows from the first 3 columns.
3. Select ```pair_id```, ```premise```, ```hypothesis```, ```label``` columns and save them into ```train_dataset``` variable.
4. Drop ```pair_id``` column.
5. Convert all column names to uppercase.
6. Replace ```_``` with ```-``` in column names.
7. Select rows that belong to the ```neutral``` label.
8. Select last 30 rows.
9. Select rows with ```hypothesis``` longer than 100 characters.
10. Select rows with ```hypothesis``` longer than 100 characters and belong to the ```neutral``` label.
11. Select the row with the longest ```hypothesis```.
12. Remove rows that contain ```č```, ```š```, ```ž``` in ```premise``` or ```hypothesis```.
13. Remove rows that contain at least one missing value.
14. Remove the column with the most missing values.

In [None]:
# 1. Select premise column and store it in a list
premise_col = df['premise'].to_list()

In [None]:
# 2. Print first 3 rows from the first 3 columns.
df.iloc[:3, [0, 1, 2]]

In [None]:
# 3. Select ```pair_id```, ```premise```, ```hypothesis```, ```label``` columns and save them into ```train_dataset``` variable.
train_dataset = df[['pair_id', 'premise', 'hypothesis', 'label']]

In [None]:
# 4. Drop ```pair_id``` column.
df.drop(columns=['pair_id'])

In [None]:
# 5. Convert all column names to uppercase.
df.columns = [i.upper() for i in df.columns]
df.head()

In [None]:
# 6. Replace ```_``` with ```-``` in column names.
df.columns = [i.replace('_', '-') for i in df.columns]
df.head()

In [None]:
# 7. Select rows that belong to the ```neutral``` label.
df = pd.read_csv('SI-NLI/train.tsv', sep='\t') # reload
df_neutral = df[df['label'] == 'neutral']
df_neutral

In [None]:
# 8. Select last 30 rows.
df.tail(30)

In [None]:
# 9. Select rows with ```hypothesis``` longer than 100 characters.
long_hypo_mask = df['hypothesis'].apply(lambda s: len(s) > 100)
long_hypo = df[long_hypo_mask]

# check
print(long_hypo['hypothesis'].apply(len))

In [None]:
# 10. Select rows with ```hypothesis``` longer than 100 characters and belong to the ```neutral``` label.
long_hypo_mask = df['hypothesis'].apply(lambda s: len(s) > 100)
neutral_label_mask = df['label'] == 'neutral'
final_df = df[long_hypo_mask & neutral_label_mask]
final_df

In [None]:
# 11. Select the row with the longest ```hypothesis```.
df['hypo_len'] = df['hypothesis'].apply(len)
df[df['hypo_len'] == df['hypo_len'].max()]
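
As an alternative that avoids a helper column, the row with the longest text can be looked up with `Series.str.len()` and `idxmax()`. This is only a sketch on a hypothetical toy frame. Note that `idxmax()` returns the first label when several rows tie, whereas the comparison against `max()` above returns all tied rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the hypothesis column.
toy = pd.DataFrame({"hypothesis": ["kratko", "precej daljša poved", "srednje"]})

# Length of every string, then the index label of the largest length.
longest_idx = toy["hypothesis"].str.len().idxmax()

# .loc with a single label returns that row as a Series.
longest_row = toy.loc[longest_idx]
print(longest_row)
```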

In [None]:
# 12. Remove rows that contain ```č```, ```š```, ```ž``` in ```premise``` or ```hypothesis```.
chars = ['č', 'š', 'ž']

def check(s):
    # Return True when the string contains none of the characters (case-insensitive)
    for c in s.lower():
        if c in chars:
            return False
    return True

premise_mask = df['premise'].apply(check)
hypo_mask = df['hypothesis'].apply(check)
df[premise_mask & hypo_mask]
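
The same filter can also be sketched with vectorized string methods instead of a Python-level loop. The toy frame below is hypothetical. The approach lowercases each column and tests it against the character class `[čšž]` with `Series.str.contains`.

```python
import pandas as pd

# Hypothetical toy rows; only the first one is free of č, š and ž.
toy = pd.DataFrame({
    "premise": ["Danes je lep dan.", "Žoga je rdeča.", "Pes teče."],
    "hypothesis": ["Sonce sije.", "Miza je lesena.", "Mama bere."],
})

pattern = "[čšž]"  # regular expression character class

# True where either text column contains one of the characters.
has_chars = (
    toy["premise"].str.lower().str.contains(pattern, regex=True)
    | toy["hypothesis"].str.lower().str.contains(pattern, regex=True)
)

# Keep only the rows without any of the characters.
filtered = toy[~has_chars]
print(filtered)
```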

In [None]:
# 13. Remove rows that contain at least one missing value.
df.dropna(axis=0, how='any') # drops every row, because each row contains at least one missing value
df.dropna(axis=1, how='any') # for comparison: dropping columns instead keeps the fully populated ones

In [None]:
# 14. Remove the column with the most missing values.
col_position = df.isna().sum().argmax()
df.drop(columns=[df.columns[col_position]])

## Create new variables


1. Create integer type variable ```vowel_count_premise``` which stores the number of vowels in a ```premise```. Repeat for ```hypothesis```.
2. Create integer type variable with possible values ```1```, ```2```, ```3``` that counts how many annotations a single example received.
3. Create boolean type variable ```agreement``` which reflects whether all annotators agreed on the label.

In [None]:
# 1. Create integer type variable ```vowel_count_premise``` which stores the number of vowels in a ```premise```. Repeat for ```hypothesis```.
def count_vowels(s):
    # Count the vowels in a string, ignoring case
    n = 0
    for c in s.lower():
        if c in vowels:
            n += 1
    return n

vowels = {'a', 'e', 'i', 'o', 'u'}
df['vowel_count_premise'] = df['premise'].apply(count_vowels)
df['vowel_count_hypothesis'] = df['hypothesis'].apply(count_vowels)
df.head()

In [None]:
# 2. Create integer type variable with possible values ```1```, ```2```, ```3``` that counts how many annotations a single example received.
df['num_of_annotations'] = df[['annotator1_id', 'annotator2_id', 'annotator3_id']].notna().sum(axis=1)
df.head()

In [None]:
# 3. Create boolean type variable ```agreement``` which reflects whether all annotators agreed on the label.
values = []
for idx, row in df[['annotation_1', 'annotation_2', 'annotation_3']].iterrows():
    row = row.dropna()  # ignore missing annotations
    s = set(row)  # distinct labels given to this example
    values.append(len(s) == 1)  # True only when exactly one distinct label remains
df['agreement'] = values
df
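
A loop-free variant of the agreement check can be sketched with `DataFrame.nunique(axis=1)`, which counts distinct non-missing values per row. The toy frame is hypothetical, and its column names simply mirror the ones used in the cell above, so they are an assumption about the file rather than a confirmed schema.

```python
import pandas as pd

# Hypothetical annotations; None marks a missing annotation.
toy = pd.DataFrame({
    "annotation_1": ["neutral", "entailment", "neutral"],
    "annotation_2": ["neutral", "neutral", None],
    "annotation_3": [None, "entailment", None],
})

# nunique ignores missing values by default, so a row agrees
# when exactly one distinct label is left.
agreement = toy.nunique(axis=1) == 1
print(agreement)
```

A row with no annotations at all has zero distinct labels and is therefore marked `False`, matching the behavior of the loop.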

## Combine datasets

1. Import dev and test files.
2. Combine all three splits into one large dataset.
3. What is the average length of ```premise``` per label?
4. How many examples does each split contain?
5. Create a subset that contains exactly the same number of examples per split.

In [None]:
# 1. Import dev and test files.
train = pd.read_csv('SI-NLI/train.tsv', sep='\t')
train['split'] = 'train'  # a scalar is broadcast to every row
dev = pd.read_csv('SI-NLI/dev.tsv', sep='\t')
dev['split'] = 'dev'
test = pd.read_csv('SI-NLI/test.tsv', sep='\t')
test['split'] = 'test'

In [None]:
# 2. Combine all three splits into one large dataset.
df = pd.concat([train, dev, test], ignore_index=True)  # renumber rows so index labels stay unique
df

In [None]:
# 3. What is the average length of ```premise``` per label?
df['premise_length'] = df['premise'].apply(len)
group = df.groupby(by='label')
group['premise_length'].mean()
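
The per-label average can also be computed without adding a column, by grouping the string lengths directly by the label column. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical premises with lengths 4, 2 and 6.
toy = pd.DataFrame({
    "premise": ["abcd", "ab", "abcdef"],
    "label": ["neutral", "neutral", "entailment"],
})

# Group the lengths by the aligned label column and average each group.
mean_len = toy["premise"].str.len().groupby(toy["label"]).mean()
print(mean_len)
```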

In [None]:
# 4. How many examples does each split contain?
df['split'].value_counts()

In [None]:
# 5. Create a subset that contains exactly the same number of examples per split.
train_s = train.sample(n=100)
dev_s = dev.sample(n=100)
test_s = test.sample(n=100)

subset = pd.concat([train_s, dev_s, test_s])
subset['split'].value_counts()
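
If the combined dataframe already has a `split` column, a balanced subset can be drawn in one step with `groupby(...).sample(...)`. The sketch below uses a hypothetical toy frame and takes the size of the smallest split as the per-split sample size. The fixed `random_state` is an added assumption to make the draw reproducible.

```python
import pandas as pd

# Hypothetical combined data: 5 train, 2 dev and 2 test rows.
toy = pd.DataFrame({
    "premise": [f"p{i}" for i in range(9)],
    "split": ["train"] * 5 + ["dev"] * 2 + ["test"] * 2,
})

# The smallest split bounds how many rows every split can contribute.
n_per_split = int(toy["split"].value_counts().min())

# Draw the same number of rows from each split.
balanced = toy.groupby("split").sample(n=n_per_split, random_state=0)
print(balanced["split"].value_counts())
```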

## Save dataframes

1. Save the original dataset to disk in a ```csv``` format.

In [None]:
# 1. Save the original dataset to disk in a ```csv``` format.
df = pd.read_csv('SI-NLI/train.tsv', sep='\t')
df.to_csv('SI-NLI/train.csv', index=False)