Text mining
Text mining is a field of machine learning with some techniques specific to dealing with text. Because texts are complex objects, we need to transform them into a numeric structure before the analysis.
Preprocessing. Preprocessing is the first step, and an important one that we can never skip. For example, say that we have the sentence:
"I like working with computers."
Now, we can transform it in the following way:
- transform to lowercase: "i like working with computers."
- split into analytical units (called tokens): "i", "like", "working", "with", "computers", "."
- normalize words (transform them into base words): "i", "like", "work", "with", "computer", "."
- filter stopwords, punctuation: "i", "like", "work", "computer"
- filter by document frequency: remove words that appear in more than X % or fewer than Y % of documents
Some tokenization procedures already discard punctuation; otherwise we have to remove it ourselves. Filtering by document frequency can be relative (a word should appear in a certain percentage of documents) or absolute (a word should appear in a certain number of documents).
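The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the stopword list is a tiny made-up one, `normalize` is a crude suffix stripper standing in for a real stemmer or lemmatizer, and the function names and the `min_df`/`max_df` parameters are hypothetical, not taken from any particular library.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"with", "the", "a", "an", "and", "of"}

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def normalize(token):
    """Crude suffix stripping, a stand-in for a real stemmer or lemmatizer."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def preprocess(text):
    """Lowercase, tokenize, normalize, then drop stopwords and punctuation."""
    tokens = [normalize(t) for t in tokenize(text)]
    return [t for t in tokens if t.isalnum() and t not in STOPWORDS]

def filter_by_document_frequency(docs, min_df=0.0, max_df=1.0):
    """Relative filtering: keep tokens whose share of documents
    lies between min_df and max_df (both inclusive)."""
    n_docs = len(docs)
    # Count each token once per document it appears in.
    df = Counter(token for doc in docs for token in set(doc))
    keep = {t for t, count in df.items() if min_df <= count / n_docs <= max_df}
    return [[t for t in doc if t in keep] for doc in docs]

print(preprocess("I like working with computers."))
# ['i', 'like', 'work', 'computer']
```

For absolute filtering, the same idea applies with the raw document count `count` compared against thresholds instead of the ratio `count / n_docs`.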
Bag of Words. The second step is Bag of Words, which transforms text (that is, the prepared tokens) into document vectors. A simple way to do it is to count the words, but a more elegant approach is term frequency - inverse document frequency (TF-IDF), which decreases the weight of words that appear frequently across all documents and increases the weight of those that are significant for a small number of documents.
$$\mathrm{TF} = \mathrm{occurrences\ of\ word\ in\ doc}, \;\; \mathrm{IDF} = \log\frac{\mathrm{number\ of\ docs}}{\mathrm{docs\ that\ contain\ word}}$$
The TF-IDF measure is the product of the two, $$\mathrm{TFIDF} = \mathrm{TF}\times \mathrm{IDF}.$$
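As a minimal sketch of the two formulas above, the following computes TF-IDF vectors for a toy corpus of already preprocessed documents. The function name `tf_idf` and the example documents are made up for illustration, and the natural logarithm is an assumption, since the formula does not fix the base. Libraries often use smoothed or normalized variants, so their numbers may differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {token: weight} dict per document, with
    TF = raw count of the token in the document and
    IDF = log(number of docs / docs that contain the token)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

docs = [
    ["i", "like", "work", "computer"],
    ["i", "like", "music"],
    ["computer", "play", "music", "music"],
]
for vector in tf_idf(docs):
    print(vector)
```

A word that appears in every document gets IDF = log(1) = 0, so it contributes nothing to any document vector, which is the down-weighting the text describes.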
Afterward we can do clustering, classification or any other analysis we wish.
Word Enrichment. Word enrichment is a nice way to inspect a data subset. It computes those words that are significant for a subset compared to the entire corpus.
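The text does not say which statistic is used to decide that a word is significant. One common choice, assumed here purely for illustration, is a hypergeometric test: given how many documents in the corpus contain a word, how likely is it that a random subset of the same size would contain the word at least as often as the selected subset does? The function name and the toy corpus below are hypothetical.

```python
from math import comb

def enrichment_p_value(corpus, subset, word):
    """Hypergeometric tail probability that a random subset of the same
    size contains `word` in at least as many documents as `subset` does.
    Assumes `subset` is drawn from `corpus`; documents are token lists."""
    n_total = len(corpus)
    n_with_word = sum(word in doc for doc in corpus)
    n_subset = len(subset)
    k = sum(word in doc for doc in subset)
    upper = min(n_subset, n_with_word)
    tail = sum(
        comb(n_with_word, i) * comb(n_total - n_with_word, n_subset - i)
        for i in range(k, upper + 1)
    )
    return tail / comb(n_total, n_subset)

corpus = [
    ["i", "like", "work", "computer"],
    ["i", "like", "computer", "game"],
    ["i", "like", "music"],
    ["i", "like", "book"],
    ["i", "like", "walk"],
    ["i", "like", "tea"],
]
subset = corpus[:2]

print(enrichment_p_value(corpus, subset, "computer"))  # 1/15, about 0.067
print(enrichment_p_value(corpus, subset, "like"))      # 1.0
```

A small p-value ("computer") suggests the word is characteristic of the subset, while a word found everywhere ("like") is not enriched at all.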
Sentiment Analysis. Sentiment analysis is a popular approach for analyzing user opinions, product reviews, and so on. The simplest methods are lexicon-based: a list in the background defines positive and negative words, and the method counts the occurrences of each and sums them. The sum is the final sentiment score.
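A lexicon-based score can be sketched in a few lines. The two word lists here are tiny made-up examples, not a real sentiment lexicon, and counting each positive word as +1 and each negative word as -1 is one plausible reading of "counts the occurrences of each and sums them".

```python
# Hypothetical mini-lexicon; real lexicons contain thousands of entries.
POSITIVE = {"good", "great", "like", "love"}
NEGATIVE = {"bad", "awful", "hate", "slow"}

def sentiment_score(tokens):
    """Sum +1 for every positive token and -1 for every negative token."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)

print(sentiment_score(["i", "like", "this", "great", "computer"]))  # 2
print(sentiment_score(["slow", "and", "awful", "but", "i", "love", "it"]))  # -1
```

A score above zero indicates mostly positive wording, below zero mostly negative, and zero either a balance or no lexicon words at all.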
- Comprehensive overview of text mining techniques and algorithms. [obligatory]
- Why regular expressions can be very helpful. [optional read]
- Why using TF-IDF is a good idea. [technical, interesting read]