{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Get data\n", "\n", "1. Download SI-NLI from [link](https://www.clarin.si/repository/xmlui/handle/11356/1707).\n", "2. Load libraries.\n", "3. Import ```train.tsv``` file." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('SI-NLI/train.tsv', sep='\\t')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Basic statistics\n", "\n", "1. How many examples are in a dataframe?\n", "2. How many variables are in a dataframe?\n", "3. Count values in the ```label``` column.\n", "4. Are there any missing values in the data?\n", "5. Count the number of missing values per column." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 1. How many examples are in a dataframe?\n", "len(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 2. How many variables are in a dataframe?\n", "len(df.columns)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 3. Count values in the ```label``` column.\n", "df['label'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 4. Are there any missing values in the data?\n", "any(df.isna())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 5. 
Count the number of missing values per column.\n", "df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Subset observations and variables\n", "\n", "1. Select the ```premise``` column and store it in a list.\n", "2. Print the first 3 rows of the first 3 columns.\n", "3. Select the ```pair_id```, ```premise```, ```hypothesis```, ```label``` columns and save them into the ```train_dataset``` variable.\n", "4. Drop the ```pair_id``` column.\n", "5. Convert all column names to uppercase.\n", "6. Replace ```_``` with ```-``` in column names.\n", "7. Select rows that belong to the ```neutral``` label.\n", "8. Select the last 30 rows.\n", "9. Select rows with a ```hypothesis``` longer than 100 characters.\n", "10. Select rows with a ```hypothesis``` longer than 100 characters that belong to the ```neutral``` label.\n", "11. Select the row with the longest ```hypothesis```.\n", "12. Remove rows that contain ```č```, ```š```, ```ž``` in ```premise``` or ```hypothesis```.\n", "13. Remove rows that contain at least one missing value.\n", "14. Remove the column with the most missing values." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 1. Select the premise column and store it in a list\n", "premise_col = df['premise'].to_list()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 2. Print the first 3 rows of the first 3 columns.\n", "df.iloc[:3, :3]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 3. 
Select the ```pair_id```, ```premise```, ```hypothesis```, ```label``` columns and save them into the ```train_dataset``` variable.\n", "train_dataset = df[['pair_id', 'premise', 'hypothesis', 'label']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 4. Drop the ```pair_id``` column.\n", "# drop returns a new dataframe; df itself stays unchanged\n", "df.drop(columns=['pair_id'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 5. Convert all column names to uppercase.\n", "df.columns = [i.upper() for i in df.columns]\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 6. Replace ```_``` with ```-``` in column names.\n", "df.columns = [i.replace('_', '-') for i in df.columns]\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 7. Select rows that belong to the ```neutral``` label.\n", "df = pd.read_csv('SI-NLI/train.tsv', sep='\\t') # reload, since the column names were changed above\n", "df_neutral = df[df['label'] == 'neutral']\n", "df_neutral" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 8. Select the last 30 rows.\n", "df.tail(30)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 9. Select rows with a ```hypothesis``` longer than 100 characters.\n", "long_hypo_mask = df['hypothesis'].apply(lambda s: len(s) > 100)\n", "long_hypo = df[long_hypo_mask]\n", "\n", "# check\n", "print(long_hypo['hypothesis'].apply(len))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 10. 
Select rows with a ```hypothesis``` longer than 100 characters that belong to the ```neutral``` label.\n", "long_hypo_mask = df['hypothesis'].apply(lambda s: len(s) > 100)\n", "neutral_label_mask = df['label'] == 'neutral'\n", "final_df = df[long_hypo_mask & neutral_label_mask]\n", "final_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 11. Select the row with the longest ```hypothesis```.\n", "df['hypo_len'] = df['hypothesis'].apply(len)\n", "df[df['hypo_len'] == df['hypo_len'].max()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 12. Remove rows that contain ```č```, ```š```, ```ž``` in ```premise``` or ```hypothesis```.\n", "chars = {'č', 'š', 'ž'}\n", "\n", "def check(s):\n", "    for c in s.lower():\n", "        if c in chars:\n", "            return False\n", "    return True\n", "\n", "premise_mask = df['premise'].apply(check)\n", "hypo_mask = df['hypothesis'].apply(check)\n", "df[premise_mask & hypo_mask]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 13. Remove rows that contain at least one missing value.\n", "df.dropna(axis=0, how='any') # drops every row, since each row contains at least one missing value\n", "df.dropna(axis=1, how='any') # only some columns contain missing values, so dropping those keeps the rows" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 14. Remove the column with the most missing values.\n", "# idxmax returns the label of the column with the most missing values\n", "df.drop(columns=[df.isna().sum().idxmax()])" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Create new variables\n", "\n", "1. Create an integer variable ```vowel_count_premise``` that stores the number of vowels in the ```premise```. Repeat for ```hypothesis```.\n", "2. 
Create an integer variable with possible values ```1```, ```2```, ```3``` that counts how many annotations a single example received.\n", "3. Create a boolean variable ```agreement``` that reflects whether all annotators agreed on the label." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 1. Create an integer variable ```vowel_count_premise``` that stores the number of vowels in the ```premise```. Repeat for ```hypothesis```.\n", "vowels = {'a', 'e', 'i', 'o', 'u'}\n", "\n", "def count_vowels(s):\n", "    n = 0\n", "    for c in s.lower():\n", "        if c in vowels:\n", "            n += 1\n", "    return n\n", "\n", "df['vowel_count_premise'] = df['premise'].apply(count_vowels)\n", "df['vowel_count_hypothesis'] = df['hypothesis'].apply(count_vowels)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 2. Create an integer variable with possible values ```1```, ```2```, ```3``` that counts how many annotations a single example received.\n", "df['num_of_annotations'] = df[['annotator1_id', 'annotator2_id', 'annotator3_id']].notna().sum(axis=1)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 3. Create a boolean variable ```agreement``` that reflects whether all annotators agreed on the label.\n", "values = []\n", "for idx, row in df[['annotation_1', 'annotation_2', 'annotation_3']].iterrows():\n", "    values.append(row.nunique() == 1)  # nunique ignores missing annotations\n", "df['agreement'] = values\n", "df" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Combine datasets\n", "\n", "1. Import the dev and test files.\n", "2. Combine all three splits into one large dataset.\n", "3. 
What is the average length of the ```premise``` per label?\n", "4. How many examples does each split contain?\n", "5. Create a subset that contains exactly the same number of examples per split." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 1. Import the dev and test files.\n", "train = pd.read_csv('SI-NLI/train.tsv', sep='\\t')\n", "train['split'] = 'train'\n", "dev = pd.read_csv('SI-NLI/dev.tsv', sep='\\t')\n", "dev['split'] = 'dev'\n", "test = pd.read_csv('SI-NLI/test.tsv', sep='\\t')\n", "test['split'] = 'test'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 2. Combine all three splits into one large dataset.\n", "df = pd.concat([train, dev, test], ignore_index=True)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 3. What is the average length of the ```premise``` per label?\n", "df['premise_length'] = df['premise'].apply(len)\n", "group = df.groupby(by='label')\n", "group['premise_length'].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 4. How many examples does each split contain?\n", "df['split'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 5. Create a subset that contains exactly the same number of examples per split.\n", "train_s = train.sample(n=100)\n", "dev_s = dev.sample(n=100)\n", "test_s = test.sample(n=100)\n", "\n", "subset = pd.concat([train_s, dev_s, test_s])\n", "subset['split'].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Save dataframes\n", "\n", "1. Save the original dataset to disk in ```csv``` format." 
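The save step can be sanity-checked by writing a file and reading it straight back. A minimal sketch on a tiny synthetic frame (the `check.csv` path and the two-row data are examples, not part of SI-NLI):

```python
import pandas as pd

# Tiny stand-in for the SI-NLI dataframe.
df = pd.DataFrame({'premise': ['a', 'b'], 'label': ['neutral', 'entailment']})

# Write without the index, then read back and compare.
df.to_csv('check.csv', index=False)
df2 = pd.read_csv('check.csv')
assert df.equals(df2)  # the round trip preserves the data
```

`index=False` matters here: without it, `read_csv` would bring the index back as an extra `Unnamed: 0` column and the comparison would fail.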
] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# 1. Save the original dataset to disk in a ```csv``` format.\n", "df = pd.read_csv('SI-NLI/train.tsv', sep='\\t')\n", "df.to_csv('SI-NLI/train.csv', index=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }
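The row-wise loop used for annotator agreement above can also be written without `iterrows`. A vectorized sketch on synthetic annotation columns (the column names mirror `annotation_1`..`annotation_3`; the values are made up):

```python
import pandas as pd

# Synthetic annotations: row 0 has full agreement, row 1 has a conflict,
# row 2 was labeled by a single annotator.
ann = pd.DataFrame({
    'annotation_1': ['neutral', 'entailment', 'neutral'],
    'annotation_2': ['neutral', 'contradiction', None],
    'annotation_3': ['neutral', 'entailment', None],
})

# nunique(axis=1) ignores missing values, so rows with fewer annotators
# still count as agreement when all present annotations match.
agreement = ann.nunique(axis=1) == 1
print(agreement.tolist())  # [True, False, True]
```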