{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Get data\n", "\n", "1. Download SI-NLI from [link](https://www.clarin.si/repository/xmlui/handle/11356/1707).\n", "2. Load libraries.\n", "3. Import ```train.tsv``` file." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('SI-NLI/train.tsv', sep='\\t')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Basic statistics\n", "\n", "1. How many examples are in a dataframe?\n", "2. How many variables are in a dataframe?\n", "3. Count values in the ```label``` column.\n", "4. Are there any missing values in the data?\n", "5. Count the number of missing values per column." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 1. How many examples are in a dataframe?\n", "len(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 2. How many variables are in a dataframe?\n", "len(df.columns)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 3. Count values in the ```label``` column.\n", "df['label'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 4. Are there any missing values in the data?\n", "any(df.isna())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 5. 
Count the number of missing values per column.\n", "df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Subset observations and variables\n", "\n", "1. Select the ```premise``` column and store it in a list.\n", "2. Print the first 3 rows of the first 3 columns.\n", "3. Select the ```pair_id```, ```premise```, ```hypothesis```, ```label``` columns and save them into the ```train_dataset``` variable.\n", "4. Drop the ```pair_id``` column.\n", "5. Convert all column names to uppercase.\n", "6. Replace ```_``` with ```-``` in column names.\n", "7. Select rows that belong to the ```neutral``` label.\n", "8. Select the last 30 rows.\n", "9. Select rows with a ```hypothesis``` longer than 100 characters.\n", "10. Select rows with a ```hypothesis``` longer than 100 characters that belong to the ```neutral``` label.\n", "11. Select the row with the longest ```hypothesis```.\n", "12. Remove rows that contain ```č```, ```š```, ```ž``` in ```premise``` or ```hypothesis```.\n", "13. Remove rows that contain at least one missing value.\n", "14. Remove the column with the most missing values." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 1. Select the premise column and store it in a list\n", "premise_col = df['premise'].to_list()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 2. Print the first 3 rows of the first 3 columns.\n", "df.iloc[:3, :3]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 3. 
Select the ```pair_id```, ```premise```, ```hypothesis```, ```label``` columns and save them into the ```train_dataset``` variable.\n", "train_dataset = df[['pair_id', 'premise', 'hypothesis', 'label']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 4. Drop the ```pair_id``` column.\n", "# drop returns a new dataframe; df itself stays unchanged\n", "df.drop(columns=['pair_id'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 5. Convert all column names to uppercase.\n", "df.columns = [i.upper() for i in df.columns]\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 6. Replace ```_``` with ```-``` in column names.\n", "df.columns = [i.replace('_', '-') for i in df.columns]\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 7. Select rows that belong to the ```neutral``` label.\n", "df = pd.read_csv('SI-NLI/train.tsv', sep='\\t') # reload, since the column names were changed above\n", "df_neutral = df[df['label'] == 'neutral']\n", "df_neutral" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 8. Select the last 30 rows.\n", "df.tail(30)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 9. Select rows with a ```hypothesis``` longer than 100 characters.\n", "long_hypo_mask = df['hypothesis'].apply(lambda s: len(s) > 100)\n", "long_hypo = df[long_hypo_mask]\n", "\n", "# check\n", "print(long_hypo['hypothesis'].apply(len))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 10. 
Select rows with a ```hypothesis``` longer than 100 characters that belong to the ```neutral``` label.\n", "long_hypo_mask = df['hypothesis'].apply(lambda s: len(s) > 100)\n", "neutral_label_mask = df['label'] == 'neutral'\n", "final_df = df[long_hypo_mask & neutral_label_mask]\n", "final_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 11. Select the row with the longest ```hypothesis```.\n", "df['hypo_len'] = df['hypothesis'].apply(len)\n", "df[df['hypo_len'] == df['hypo_len'].max()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 12. Remove rows that contain ```č```, ```š```, ```ž``` in ```premise``` or ```hypothesis```.\n", "chars = {'č', 'š', 'ž'}\n", "\n", "def check(s):\n", "    for c in s.lower():\n", "        if c in chars:\n", "            return False\n", "    return True\n", "\n", "premise_mask = df['premise'].apply(check)\n", "hypo_mask = df['hypothesis'].apply(check)\n", "df[premise_mask & hypo_mask]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 13. Remove rows that contain at least one missing value.\n", "df.dropna(axis=0, how='any') # drops every row, since each row contains at least one missing value\n", "df.dropna(axis=1, how='any') # only some columns contain missing values, so dropping those keeps the rows" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 14. Remove the column with the most missing values.\n", "# idxmax returns the label of the column with the most missing values\n", "df.drop(columns=[df.isna().sum().idxmax()])" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Create new variables\n", "\n", "1. Create an integer variable ```vowel_count_premise``` that stores the number of vowels in the ```premise```. Repeat for ```hypothesis```.\n", "2. 
Create an integer variable with possible values ```1```, ```2```, ```3``` that counts how many annotations a single example received.\n", "3. Create a boolean variable ```agreement``` that reflects whether all annotators agreed on the label." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 1. Create an integer variable ```vowel_count_premise``` that stores the number of vowels in the ```premise```. Repeat for ```hypothesis```.\n", "vowels = {'a', 'e', 'i', 'o', 'u'}\n", "\n", "def count_vowels(s):\n", "    n = 0\n", "    for c in s.lower():\n", "        if c in vowels:\n", "            n += 1\n", "    return n\n", "\n", "df['vowel_count_premise'] = df['premise'].apply(count_vowels)\n", "df['vowel_count_hypothesis'] = df['hypothesis'].apply(count_vowels)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 2. Create an integer variable with possible values ```1```, ```2```, ```3``` that counts how many annotations a single example received.\n", "df['num_of_annotations'] = df[['annotator1_id', 'annotator2_id', 'annotator3_id']].notna().sum(axis=1)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 3. Create a boolean variable ```agreement``` that reflects whether all annotators agreed on the label.\n", "values = []\n", "for idx, row in df[['annotation_1', 'annotation_2', 'annotation_3']].iterrows():\n", "    values.append(row.nunique() == 1)  # nunique ignores missing annotations\n", "df['agreement'] = values\n", "df" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Combine datasets\n", "\n", "1. Import the dev and test files.\n", "2. Combine all three splits into one large dataset.\n", "3. 
What is the average length of the ```premise``` per label?\n", "4. How many examples does each split contain?\n", "5. Create a subset that contains exactly the same number of examples per split." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 1. Import the dev and test files.\n", "train = pd.read_csv('SI-NLI/train.tsv', sep='\\t')\n", "train['split'] = 'train'\n", "dev = pd.read_csv('SI-NLI/dev.tsv', sep='\\t')\n", "dev['split'] = 'dev'\n", "test = pd.read_csv('SI-NLI/test.tsv', sep='\\t')\n", "test['split'] = 'test'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 2. Combine all three splits into one large dataset.\n", "df = pd.concat([train, dev, test], ignore_index=True)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 3. What is the average length of the ```premise``` per label?\n", "df['premise_length'] = df['premise'].apply(len)\n", "group = df.groupby(by='label')\n", "group['premise_length'].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 4. How many examples does each split contain?\n", "df['split'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 5. Create a subset that contains exactly the same number of examples per split.\n", "train_s = train.sample(n=100)\n", "dev_s = dev.sample(n=100)\n", "test_s = test.sample(n=100)\n", "\n", "subset = pd.concat([train_s, dev_s, test_s])\n", "subset['split'].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Save dataframes\n", "\n", "1. Save the original dataset to disk in ```csv``` format." 
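The save step can be sanity-checked by writing a file and reading it straight back. A minimal sketch on a tiny synthetic frame (the `check.csv` path and the two-row data are examples, not part of SI-NLI):

```python
import pandas as pd

# Tiny stand-in for the SI-NLI dataframe.
df = pd.DataFrame({'premise': ['a', 'b'], 'label': ['neutral', 'entailment']})

# Write without the index, then read back and compare.
df.to_csv('check.csv', index=False)
df2 = pd.read_csv('check.csv')
assert df.equals(df2)  # the round trip preserves the data
```

`index=False` matters here: without it, `read_csv` would bring the index back as an extra `Unnamed: 0` column and the comparison would fail.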
] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 1. Save the original dataset to disk in a ```csv``` format.\n", "df = pd.read_csv('SI-NLI/train.tsv', sep='\\t')\n", "df.to_csv('SI-NLI/train.csv', index=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }
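The row-wise loop used for annotator agreement above can also be written without `iterrows`. A vectorized sketch on synthetic annotation columns (the column names mirror `annotation_1`..`annotation_3`; the values are made up):

```python
import pandas as pd

# Synthetic annotations: row 0 has full agreement, row 1 has a conflict,
# row 2 was labeled by a single annotator.
ann = pd.DataFrame({
    'annotation_1': ['neutral', 'entailment', 'neutral'],
    'annotation_2': ['neutral', 'contradiction', None],
    'annotation_3': ['neutral', 'entailment', None],
})

# nunique(axis=1) ignores missing values, so rows with fewer annotators
# still count as agreement when all present annotations match.
agreement = ann.nunique(axis=1) == 1
print(agreement.tolist())  # [True, False, True]
```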