Topic: Text mining | Data Mining (HSE)

Text mining is a field of machine learning with some specifics for dealing with text. Because texts are complex objects, we need to transform them into a numeric structure before analysis.

Preprocessing. Preprocessing is the first and a very important step, and we can never skip it. For example, say that we have the sentence:

"I like working with computers."

Now, we can transform it in the following way:

transform to lowercase: "i like working with computers."
split into analytical units (called tokens): "i", "like", "working", "with", "computers", "."
normalize words (transform them into base words): "i", "like", "work", "with", "computer", "."
filter stopwords, punctuation: "i", "like", "work", "computer"
filter by document frequency: remove words that appear in more than X % or fewer than Y % of documents

Some tokenization procedures already discard punctuation; otherwise we have to remove it manually. Filtering by document frequency can be relative (a word should appear in a certain percentage of documents) or absolute (a word should appear in a certain number of documents).
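The steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the stopword list, the suffix-stripping `naive_normalize` helper (a stand-in for a real stemmer or lemmatizer), and the function names are all assumptions made for this example. The stopword list deliberately omits "i" so that the output matches the example sentence.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption; real lists are far longer).
# "i" is deliberately left out so the result matches the example in the text.
STOPWORDS = {"with", "the", "a", "an", "and", "of", "to"}


def naive_normalize(token):
    """Crude suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, normalize, then drop stopwords and punctuation."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = [naive_normalize(t) for t in tokens]
    return [t for t in tokens if t.isalnum() and t not in STOPWORDS]


def filter_by_document_frequency(docs, min_df=0.0, max_df=1.0):
    """Keep tokens whose relative document frequency lies in [min_df, max_df]."""
    n_docs = len(docs)
    df = Counter(token for doc in docs for token in set(doc))
    keep = {t for t, count in df.items() if min_df <= count / n_docs <= max_df}
    return [[t for t in doc if t in keep] for doc in docs]


print(preprocess("I like working with computers."))
# ['i', 'like', 'work', 'computer']
```

Here the document-frequency filter is relative: `min_df` and `max_df` are fractions of the corpus. An absolute variant would compare the raw document counts instead.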

Bag of Words. The second step is Bag of Words, which transforms text (the prepared tokens) into document vectors. A simple way to do this is to count the words, but a more elegant approach is term frequency-inverse document frequency (TF-IDF), which decreases the weight of words that appear frequently across all documents and increases the weight of those that are significant for a small number of documents.

$$\mathrm{TF} = \text{occurrences of word in doc}, \;\; \mathrm{IDF} = \log\frac{\text{number of docs}}{\text{docs that contain word}}$$

The TF-IDF measure is the product of the two, $$\mathrm{TFIDF} = \mathrm{TF}\times \mathrm{IDF}.$$
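As a minimal sketch, the formulas above can be computed directly from tokenized documents. The function name `tf_idf` and the toy corpus are assumptions for illustration. The sketch uses the raw count as TF and the natural logarithm for IDF; libraries often apply smoothing or normalization on top of this, which is not shown here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: TF * IDF} dict per document (docs are token lists)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw occurrences of each word in this document
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors


# Toy corpus (hypothetical), already preprocessed into tokens.
docs = [
    ["like", "work", "computer"],
    ["like", "computer", "game"],
    ["like", "cook"],
]
for vector in tf_idf(docs):
    print(vector)
```

In this toy corpus "like" appears in every document, so its IDF is log(3/3) = 0 and it drops out of every vector, while "cook", which appears in only one document, gets the highest weight, log(3).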

Afterward we can do clustering, classification or any other analysis we wish.

Word Enrichment. Word enrichment is a nice way to inspect a data subset. It finds the words that appear significantly more often in the subset than in the entire corpus.
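The text does not say which statistical test is used. One common choice for this kind of question is the hypergeometric test, so the sketch below assumes it: for each word, it asks how likely it is to see the word in at least k of the n subset documents if those documents were drawn at random from a corpus of N documents, K of which contain the word. The function names and the toy data are hypothetical.

```python
from math import comb


def hypergeom_p_value(k, n, K, N):
    """P(X >= k) when drawing n documents from N, of which K contain the word."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def enriched_words(subset, corpus):
    """Return (word, p-value) pairs for words in the subset, most enriched first.

    Assumes `subset` is a list of documents that are also part of `corpus`;
    each document is a list of tokens.
    """
    n, N = len(subset), len(corpus)
    results = []
    for word in {w for doc in subset for w in doc}:
        k = sum(word in doc for doc in subset)
        K = sum(word in doc for doc in corpus)
        results.append((word, hypergeom_p_value(k, n, K, N)))
    return sorted(results, key=lambda pair: pair[1])


# Toy corpus (hypothetical): "gpu" occurs only in the three subset documents.
subset = [["the", "gpu"], ["the", "gpu"], ["the", "gpu"]]
corpus = subset + [["the"]] * 7
print(enriched_words(subset, corpus))
```

A small p-value marks a word as characteristic of the subset. In practice the p-values would also be corrected for multiple testing, which this sketch leaves out.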

Sentiment Analysis. Sentiment analysis is a popular approach for analyzing user opinions, product reviews, and so on. The simplest methods are lexicon-based: a predefined list marks words as positive or negative, and the method counts the occurrences of each kind in a text and sums them, typically adding 1 for each positive word and subtracting 1 for each negative one. The sum is the final sentiment score.
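A lexicon-based scorer can be sketched in a few lines. The two word lists below are tiny, made-up examples (real lexicons contain thousands of entries), and the scoring rule of +1 per positive word and -1 per negative word is an assumption about how the counts are combined.

```python
# Tiny illustrative lexicons (assumptions; real lexicons are much larger).
POSITIVE = {"good", "great", "love", "like", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}


def sentiment_score(tokens):
    """Add 1 for every positive token and subtract 1 for every negative one."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


print(sentiment_score(["i", "love", "this", "great", "phone"]))
# 2
print(sentiment_score(["terrible", "battery", "but", "good", "screen"]))
# 0
```

A score above zero suggests a positive text and a score below zero a negative one. This simple count ignores negation ("not good") and word order, which is a known limitation of lexicon-based methods.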
