### Task 1: Entity Recognition and Replacement
- **Objective:** Recognize named entities in the IMDb dataset (such as person names and locations) and, where necessary, replace them with generic labels to anonymize the text.
- **Instructions:**
  - Utilize Stanza's entity recognition capabilities to identify named entities within the text.
  - Replace recognized named entities with generic labels or placeholders to ensure anonymity in the text.
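
The replacement step can be sketched independently of the NER model. The example below is a minimal illustration, assuming each entity is available as a `(start, end, label)` character span; Stanza entities expose comparable information through `start_char`, `end_char`, and `type`. The helper name `replace_spans` and the sample spans are hypothetical. Replacing by offset, rather than by string search, avoids touching unrelated text that happens to contain the same characters.

```python
def replace_spans(text, spans):
    """Replace each (start, end, label) character span with '[LABEL]'.

    Spans are applied from right to left so that earlier offsets stay valid
    after each substitution.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


# Hypothetical spans, in the shape an NER model might report them.
review = "Al Pacino shines, although Alaska looks cold."
spans = [(0, 9, "PERSON"), (27, 33, "GPE")]
print(replace_spans(review, spans))
```

Note that the word "although" is left intact, which a plain `str.replace` on a short entity string such as "Al" would not guarantee.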

### Task 2: Removal of HTML Tags or URLs
- **Objective:** Enhance the preprocessing function to remove HTML tags and URLs from the text of the IMDb dataset.
- **Instructions:**
  - Update the preprocessing function to remove HTML tags (e.g., `<tag>`) and URLs using regular expressions.
  - Confirm that the tags and URLs are actually removed, so the cleaned text contains only the review content.
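
One possible shape for this cleanup is sketched below. It is an illustration rather than the required solution: the function name `strip_markup`, the exact patterns, and the sample string are assumptions. Tags are replaced with a space instead of an empty string, because line-break tags such as `<br />` often sit directly between two sentences and removing them outright would glue the neighbouring words together.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                # anything between < and >
URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # http://, https:// or bare www.
SPACE_RE = re.compile(r"\s+")


def strip_markup(text):
    text = TAG_RE.sub(" ", text)    # tags become a space so words do not merge
    text = html.unescape(text)      # decode entities, e.g. &amp; -> &
    text = URL_RE.sub(" ", text)    # drop URLs
    return SPACE_RE.sub(" ", text).strip()  # collapse leftover whitespace


sample = "Great film!<br /><br />See https://example.com/review &amp; enjoy."
print(strip_markup(sample))
```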

### Task 3: Comparison of Pre-trained Word Embeddings
- **Objective:** Download several pre-trained word embeddings (e.g., FastText, Word2Vec, GloVe) and evaluate their performance against custom-trained embeddings on the IMDb dataset.
- **Instructions:**
  - Download pre-trained word embeddings such as FastText, Word2Vec, and GloVe from the NLPL Vector Repository (http://vectors.nlpl.eu/repository/).
  - Compare these pre-trained embeddings with our custom-trained embeddings on the IMDb dataset.
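
To make the comparison concrete, it helps to see what a downloaded embedding file contains. The sketch below assumes a model stored in the plain word2vec text format (a header line with the vocabulary size and dimensionality, then one word and its vector per line); whether a particular download uses this format, and how its tokens are written, should be checked per model. The helper names and the toy data are hypothetical, and a real file would be opened from disk instead of `io.StringIO`.

```python
import io


def load_text_vectors(handle):
    """Read word vectors from a word2vec-style text stream into a dict."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Toy stand-in for a downloaded model file (3 words, 2 dimensions).
toy_file = io.StringIO("3 2\ngood 1.0 0.0\nbad -1.0 0.0\nfilm 0.0 1.0\n")
vectors, dim = load_text_vectors(toy_file)
print(average_vector(["good", "film", "unseen"], vectors, dim))
```

Out-of-vocabulary tokens are simply ignored here, so documents of any length map to a vector of the same dimensionality, which is what lets different embedding sets feed the same downstream classifier.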

In [27]:
from datasets import load_dataset
import pandas as pd

# Download the IMDb dataset
imdb_dataset = load_dataset('imdb')

# Select 20,000 shuffled examples from the training split
data = pd.DataFrame(imdb_dataset['train'].shuffle(seed=42).select(range(20000)))

# Task 1

In [26]:
import stanza

# Initialize Stanza pipeline with NER
stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')

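def anonymize_entities(text):
    doc = nlp(text)
    # Replace from the last entity to the first so character offsets stay valid
    for entity in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        # Use the entity type (e.g. PERSON, GPE) as the generic label
        text = text[:entity.start_char] + f'[{entity.type}]' + text[entity.end_char:]
    return text

# Example usage on a small sample (running NER over all 20,000 reviews is slow)
anonymized_text = [anonymize_entities(text) for text in data['text'].head(5)]
print(anonymized_text)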

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 24.0MB/s]                    
2024-01-03 15:02:36 INFO: Downloading default packages for language: en (English) ...
2024-01-03 15:02:38 INFO: File exists: /Users/azagar/stanza_resources/en/default.zip
2024-01-03 15:02:43 INFO: Finished downloading models and saved to /Users/azagar/stanza_resources.
2024-01-03 15:02:43 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 26.6MB/s]                    
2024-01-03 15:02:44 INFO: Loading these models for language: en (English):
| Processor | Package          |
--------------------------------
| tokenize  | combined         |
| ner       | ontonotes_charlm |

2024-01-03 15:02

# Task 2

In [None]:
import re

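def preprocess_text(text):
    # Replace HTML tags with a space so neighbouring words do not merge
    text = re.sub(r'<[^>]+>', ' ', text)
    # Remove URLs (http://, https://, or bare www.)
    text = re.sub(r'(?:https?://|www\.)\S+', ' ', text)
    # Collapse the whitespace left behind by the removals
    text = re.sub(r'\s+', ' ', text).strip()
    # Add other preprocessing steps here if needed
    return text

# Example usage: one URL inside a tag and one bare URL
sample_text = "Check out this link: <a href='http://example.com'>Example</a> or visit http://example.com/page"
clean_text = preprocess_text(sample_text)
print(clean_text)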

# Task 3

In [28]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

# Download necessary resources (if not already downloaded)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize Lemmatizer and Stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

# Function to preprocess text
# Note: this redefines preprocess_text from Task 2, so HTML/URL removal is not applied here
def preprocess_text(text):
    # Lowercase the text and tokenize it into words
    words = word_tokenize(text.lower())

    # Keep alphabetic tokens only (drops punctuation and numbers)
    words = [word for word in words if word.isalpha()]

    # Remove stopwords
    words = [word for word in words if word not in stop_words]

    # Lemmatization
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

    # Stemming (uncomment if you want to use stemming instead)
    # stemmed_words = [stemmer.stem(word) for word in words]

    # Join the words back into a string
    return ' '.join(lemmatized_words)


# Apply preprocessing
data['clean_text'] = data['text'].apply(preprocess_text)
# Check preprocessed first instance
data['clean_text'][0]

[nltk_data] Downloading package punkt to /Users/azagar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/azagar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/azagar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


'relation fortier profiler fact police series violent crime profiler look crispy fortier look classic profiler plot quite simple fortier plot far complicated fortier look like prime suspect spot similarity main character weak weirdo clairvoyance people like compare judge evaluate enjoying funny thing people writing fortier look american hand arguing prefer american series maybe language spirit think series english american way actor really good funny acting superficial'

In [29]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data[['clean_text', 'text']], data['label'], test_size=0.2, random_state=42)

In [30]:
import gensim.downloader as api

# Download embeddings -> https://github.com/piskvorky/gensim-data
word2vec_model = api.load("word2vec-google-news-300")
# word2vec_model = api.load("glove-twitter-25")
    

In [31]:
import numpy as np

def document_vector(word2vec_model, doc):
    # remove out-of-vocabulary words
    doc = [word for word in doc if word in word2vec_model.key_to_index]
    if len(doc) == 0:
        # Match the model's dimensionality instead of hard-coding 300
        return np.zeros(word2vec_model.vector_size)
    return np.mean(word2vec_model[doc], axis=0)

In [32]:
# Build averaged Word2Vec document vectors for the training and test sets
X_train_w2v = [document_vector(word2vec_model, text.lower().split()) for text in X_train['clean_text']]
X_test_w2v = [document_vector(word2vec_model, text.lower().split()) for text in X_test['clean_text']]

In [33]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# Logistic Regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_train_w2v, y_train)
logistic_predictions = logistic_model.predict(X_test_w2v)
logistic_accuracy = accuracy_score(y_test, logistic_predictions)
print("Logistic Regression Accuracy:", logistic_accuracy)

# Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(X_train_w2v, y_train)
rf_predictions = rf_model.predict(X_test_w2v)
rf_accuracy = accuracy_score(y_test, rf_predictions)
print("Random Forest Accuracy:", rf_accuracy)

Logistic Regression Accuracy: 0.856
Random Forest Accuracy: 0.81375
