# Data Wrangling Exercises


## Introduction

Data wrangling is the process of cleaning, transforming, and organizing data to make it more suitable for analysis. It is a critical step in any data analysis project, as it helps ensure that the data is accurate, consistent, and complete.

These exercises provide practice in data wrangling using a real-world dataset: the Slovenian Natural Language Inference dataset (SI-NLI). It contains pairs of texts (a premise and a hypothesis), each labeled as entailment, contradiction, or neutral.

The exercises cover a range of data wrangling techniques, including importing data, performing basic statistics, subsetting observations and variables, creating new variables, grouping data, and combining datasets.

## Get data

1. Download SI-NLI from the [CLARIN.SI repository](https://www.clarin.si/repository/xmlui/handle/11356/1707).
2. Load libraries.
3. Import the ```train.tsv``` file.
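
The loading steps above might look like the following minimal sketch. It assumes pandas as the dataframe library (the exercises do not name one) and uses a small made-up, tab-separated sample in place of the real ```train.tsv``` so that it runs without the download; `load_tsv` is a hypothetical helper name.

```python
import io

import pandas as pd

# Toy stand-in for train.tsv: made-up rows, NOT real SI-NLI data.
# Only the column names mentioned in the exercises are used here.
SAMPLE_TSV = (
    "pair_id\tpremise\thypothesis\tlabel\n"
    "P1\tPada dež.\tZunaj je mokro.\tentailment\n"
    "P2\tPada dež.\tSije sonce.\tcontradiction\n"
    "P3\tPada dež.\tJutri bo sneg.\tneutral\n"
)


def load_tsv(source):
    """Read a tab-separated file (path or file-like object) into a dataframe."""
    return pd.read_csv(source, sep="\t")


# With the real download this would be: df = load_tsv("train.tsv")
df = load_tsv(io.StringIO(SAMPLE_TSV))
print(df.head())
```

If the text fields of the real file contain stray double quotes, passing `quoting=csv.QUOTE_NONE` to `pd.read_csv` is one option worth trying.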

## Basic statistics

1. How many examples are in the dataframe?
2. How many variables are in the dataframe?
3. Count values in the ```label``` column.
4. Are there any missing values in the data?
5. Count the number of missing values per column.
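
One possible way to answer the five questions above with pandas (an assumed library choice) is sketched below. The dataframe is a made-up toy example, not real SI-NLI data, and the variable names are illustrative.

```python
import pandas as pd

# Toy dataframe (made-up rows, not real SI-NLI data); None marks a missing value.
df = pd.DataFrame(
    {
        "pair_id": ["P1", "P2", "P3", "P4"],
        "premise": ["Pada dež.", "Pada dež.", None, "Sije sonce."],
        "hypothesis": ["Zunaj je mokro.", "Sije sonce.", "Jutri bo sneg.", None],
        "label": ["entailment", "contradiction", "neutral", "neutral"],
    }
)

# 1-2: shape is (number of rows, number of columns)
n_rows, n_cols = df.shape

# 3: frequency of each value in the label column
label_counts = df["label"].value_counts()

# 4: True if any cell in the dataframe is missing
has_missing = bool(df.isna().any().any())

# 5: number of missing values in each column
missing_per_column = df.isna().sum()

print(n_rows, n_cols)
print(label_counts)
print(has_missing)
print(missing_per_column)
```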

## Subset observations and variables

1. Select the ```premise``` column and store it in a list.
2. Print the first 3 rows of the first 3 columns.
3. Select the ```pair_id```, ```premise```, ```hypothesis```, and ```label``` columns and save them in a ```train_dataset``` variable.
4. Drop the ```pair_id``` column.
5. Convert all column names to uppercase.
6. Replace ```_``` with ```-``` in column names.
7. Select rows that belong to the ```neutral``` label.
8. Select the last 30 rows.
9. Select rows with ```hypothesis``` longer than 100 characters.
10. Select rows that have a ```hypothesis``` longer than 100 characters and belong to the ```neutral``` label.
11. Select the row with the longest ```hypothesis```.
12. Remove rows that contain ```č```, ```š```, or ```ž``` in ```premise``` or ```hypothesis```.
13. Remove rows that contain at least one missing value.
14. Remove the column with the most missing values.
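
The subsetting tasks above could be approached as in the following sketch, which assumes pandas and runs on a made-up toy dataframe rather than the real data. The ```annotation1``` column is a hypothetical extra column included only so that one column has the most missing values. Task 5 is read here as applying to column names, and task 12 as matching any one of the three lowercase letters; both are interpretations.

```python
import pandas as pd

# Toy dataframe (made-up rows, not real SI-NLI data).
# "annotation1" is a hypothetical column used to illustrate missing values.
df = pd.DataFrame(
    {
        "pair_id": ["P1", "P2", "P3", "P4"],
        "premise": ["Pada dež.", "Pes spi.", "Otrok bere knjigo.", None],
        "hypothesis": ["Zunaj je mokro.", "Maček spi.", "x" * 120, "Nekdo poje."],
        "annotation1": ["entailment", None, None, "neutral"],
        "label": ["entailment", "contradiction", "neutral", "neutral"],
    }
)

premise_list = df["premise"].tolist()                               # 1
top_left = df.iloc[:3, :3]                                          # 2
print(top_left)
train_dataset = df[["pair_id", "premise", "hypothesis", "label"]]   # 3
no_id = train_dataset.drop(columns="pair_id")                       # 4

# 5-6: rename on a copy so the original column names stay usable below
renamed = train_dataset.copy()
renamed.columns = renamed.columns.str.upper().str.replace("_", "-", regex=False)

neutral = df[df["label"] == "neutral"]                              # 7
last_rows = df.tail(30)                                             # 8 (all rows if fewer than 30)

hyp_len = df["hypothesis"].str.len()
long_hyp = df[hyp_len > 100]                                        # 9
long_neutral = df[(hyp_len > 100) & (df["label"] == "neutral")]     # 10
longest = df.loc[[hyp_len.idxmax()]]                                # 11

# 12: drop rows where either text column contains č, š or ž
# (lowercase only; missing text is treated as "no match")
pattern = "[čšž]"
has_char = (
    df["premise"].str.contains(pattern, na=False)
    | df["hypothesis"].str.contains(pattern, na=False)
)
without_chars = df[~has_char]

complete_rows = df.dropna()                                         # 13
most_missing = df.isna().sum().idxmax()                             # 14
fewer_columns = df.drop(columns=most_missing)
```

None of these calls modifies `df` in place; each result is stored under a new name so the steps can be run in any order.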

## Create new variables

1. Create an integer variable ```vowel_count_premise``` that stores the number of vowels in the ```premise```. Repeat for ```hypothesis```.
2. Create an integer variable with possible values ```1```, ```2```, ```3``` that counts how many annotations each example received.
3. Create a boolean variable ```agreement``` that reflects whether all annotators agreed on the label.
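
A rough sketch of these three derived columns follows, assuming pandas and a made-up toy dataframe. The per-annotator columns ```annotation1```, ```annotation2```, ```annotation3``` and the new column names ```vowel_count_hypothesis``` and ```annotation_count``` are assumptions for illustration; check the real file for the actual column names. Vowels are taken to be a, e, i, o, u in either case, and "agreement" is interpreted as all non-missing annotations being identical.

```python
import pandas as pd

# Toy dataframe (made-up rows, not real SI-NLI data).
# The annotation1..annotation3 column names are assumptions.
df = pd.DataFrame(
    {
        "premise": ["Pada dež.", "Pes spi.", "Otrok bere."],
        "hypothesis": ["Zunaj je mokro.", "Maček spi.", "Nekdo bere."],
        "annotation1": ["entailment", "contradiction", "neutral"],
        "annotation2": ["entailment", "neutral", None],
        "annotation3": ["entailment", None, None],
    }
)

VOWELS = "[aeiouAEIOU]"  # assumed vowel set
ann_cols = ["annotation1", "annotation2", "annotation3"]

# 1: count vowel matches per text; missing text is treated as empty
for col in ["premise", "hypothesis"]:
    df[f"vowel_count_{col}"] = df[col].fillna("").str.count(VOWELS).astype(int)

# 2: number of non-missing annotations per row (1, 2 or 3 in this toy data)
df["annotation_count"] = df[ann_cols].notna().sum(axis=1).astype(int)

# 3: True when all non-missing annotations in a row are the same value
df["agreement"] = df[ann_cols].nunique(axis=1) == 1

print(df[["vowel_count_premise", "vowel_count_hypothesis",
          "annotation_count", "agreement"]])
```

Under this interpretation a row with a single annotation counts as agreement; a stricter definition could additionally require `annotation_count > 1`.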

## Save dataframes

1. Save the original dataset to disk in ```csv``` format.
...
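
Saving a dataframe as CSV might be sketched as below, assuming pandas. The data is a made-up toy example, the file name ```train.csv``` is hypothetical, and a temporary directory is used so the sketch leaves no files behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataframe (made-up rows, not real SI-NLI data).
df = pd.DataFrame(
    {
        "pair_id": ["P1", "P2"],
        "premise": ["Pada dež.", "Pes spi."],
        "hypothesis": ["Zunaj je mokro.", "Maček spi."],
        "label": ["entailment", "contradiction"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "train.csv"  # hypothetical file name

    # index=False keeps the row index out of the file
    df.to_csv(out_path, index=False)

    # Read the file back to check that the round trip preserved the data
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

Writing without the index avoids an extra unnamed column appearing when the file is read back.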