
Applying Machine Learning to Sentiment Analysis and Topic Modeling

This notebook explores two topics from Natural Language Processing. The first is sentiment analysis, where we use machine learning to classify documents based on their positive or negative sentiment. This is followed by topic modeling, where we extract the main topics from those documents.

We will be working with the IMDB movie reviews data set containing 50,000 reviews.

Topics covered:

  - data cleaning and processing
  - feature extraction from text
  - training a classifier on positive and negative sentiment
  - topic modeling with LDA

This notebook is based on code and material from the excellent book by S. Raschka, Machine Learning with PyTorch and Scikit-Learn.

import numpy as np
import pandas as pd
import tarfile

#import os
#import sys

from tqdm import tqdm
from pathlib import Path
p = Path.cwd()

1. Data Cleaning and Preprocessing

The IMDB dataset was produced by Andrew Maas and others (Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).) and contains 50,000 polar movie reviews, labeled either positive or negative.

The data can be downloaded from here.

with tarfile.open('data/aclImdb_v1.tar.gz', 'r:gz') as tar:
    tar.extractall(path='data')

basepath = p/'data'/'aclImdb'
labels = {'pos': 1, 'neg': 0}

pbar = tqdm(range(50000))
rows = []

for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = basepath/s/l
        for file in path.iterdir():
            with open(file, 'r', encoding='utf-8') as infile:
                txt = infile.read()
            # collect rows in a plain list; building the DataFrame once at the end
            # avoids the deprecated (and slow) DataFrame.append inside the loop
            rows.append([txt, labels[l]])

            pbar.update()

df = pd.DataFrame(rows, columns=['review', 'sentiment'])
100%|███████████████████████████████████████████████████████████▉| 49971/50000 [02:01<00:00, 258.76it/s]
df.head()
review sentiment
0 Based on an actual story, John Boorman shows t... 1
1 This is a gem. As a Film Four production - the... 1
2 I really like this show. It has drama, romance... 1
3 This is the best 3-D experience Disney has at ... 1
4 Of the Korean movies I've seen, only three had... 1
df.sentiment.value_counts()
1    25000
0    25000
Name: sentiment, dtype: int64
# shuffle the rows, then reset the index so positional and label slicing agree
df = df.reindex(np.random.permutation(df.index)).reset_index(drop=True)

# save for later
df.to_csv(p/'data'/'imdb_review_data.csv', index=False, encoding='utf-8')
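If the kernel is restarted later, the saved CSV can be reloaded instead of re-parsing the archive; a small convenience sketch:

# reload the shuffled reviews from disk
df = pd.read_csv(p/'data'/'imdb_review_data.csv', encoding='utf-8')
df.shape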

2. The Bag-of-Words Model

Before text data can be passed to a machine learning or deep learning model, it needs to be converted into numerical form. The bag-of-words model allows us to do just this by representing text as feature vectors. The model can be summarised as follows...

  1. create a vocabulary of unique tokens (words) from the entire set of documents
  2. construct a feature vector for each document that contains the frequency count of words as they appear in each particular document.

These feature vectors are usually very sparse (containing mainly zeros), since each document contains only a small subset of the words in the overall vocabulary.

2.1 From Words to Feature Vectors

Scikit-learn implements the CountVectorizer class, which takes an array of text data (documents or sentences) and constructs the bag-of-words model for us.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
docs = np.array([
    'the sun is shining',
    'the weather is sweet',
    'the sun is shining, the weather is sweet',
    'and one and one is two'
])

bag = vectorizer.fit_transform(docs)
# list of unique words with integer indices 
# ie, sort alphabetically then assign index

vectorizer.vocabulary_
{'the': 6,
 'sun': 4,
 'is': 1,
 'shining': 3,
 'weather': 8,
 'sweet': 5,
 'and': 0,
 'one': 2,
 'two': 7}
# let's sort these for convenience
sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1])
[('and', 0),
 ('is', 1),
 ('one', 2),
 ('shining', 3),
 ('sun', 4),
 ('sweet', 5),
 ('the', 6),
 ('two', 7),
 ('weather', 8)]

Let's look at the feature vectors...

bag.toarray()
array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 1],
       [0, 2, 0, 1, 1, 1, 2, 0, 1],
       [2, 1, 2, 0, 0, 0, 0, 1, 0]])

Each index position in a feature vector corresponds to a word in the sorted vocabulary, and the value at that position is the frequency of that word in the document. For example...

Looking at the last row ([2, 1, 2, 0, 0, 0, 0, 1, 0]), the word and sits at index position 0, and its entry is 2 because it appears twice in that particular sentence ('and one and one is two').

The values in these feature vectors are also called the raw term frequencies: $tf(t,d)$, the number of times a term $t$ appears in a document $d$.
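For example, the raw term frequency of a given word can be read straight off the count matrix via the vocabulary mapping:

# tf('and', d) for the last toy document, 'and one and one is two'
idx = vectorizer.vocabulary_['and']
bag.toarray()[-1, idx]   # -> 2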

2.2 Assessing word relevancy via term frequency-inverse document frequency (tfidf)

Often, when analysing text data, the same word will appear across both classes (in our context, this means the same word appears in both positive and negative reviews). These words often don't contain useful or discriminatory information. The tf-idf technique can be used to downweight such frequently occurring words.

tf-idf can be defined as the product of the term frequency and the inverse document frequency, $\text{tf-idf}(t,d) = tf(t,d) \times idf(t,d)$, where the inverse document frequency is calculated as

$$idf(t,d) = \log\frac{n_d}{1+df(d,t)} $$

where $n_d$ is the total document count and $df(d,t)$ is the number of documents $d$ that contain the term $t$. The $\log$ is used to ensure that low document frequencies are not given too much weight.
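As a quick worked example using the toy corpus from section 2.1 (four documents, with "is" occurring in every one of them), the textbook formula above gives, taking the natural logarithm,

$$idf(\text{"is"}) = \ln\frac{4}{1+4} \approx -0.22 $$

a slightly negative weight; this is one reason the scikit-learn implementation used below modifies the formula.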

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)

tfidf.fit_transform(vectorizer.fit_transform(docs)).toarray()
array([[0.        , 0.37632116, 0.        , 0.56855566, 0.56855566,
        0.        , 0.46029481, 0.        , 0.        ],
       [0.        , 0.37632116, 0.        , 0.        , 0.        ,
        0.56855566, 0.46029481, 0.        , 0.56855566],
       [0.        , 0.4574528 , 0.        , 0.3455657 , 0.3455657 ,
        0.3455657 , 0.55953044, 0.        , 0.3455657 ],
       [0.65680405, 0.1713738 , 0.65680405, 0.        , 0.        ,
        0.        , 0.        , 0.32840203, 0.        ]])

The word "is" appears in all 4 documents. We can see that tf-idf has downweighted its importance; this is evident in the 4th document, where it has a relatively low weight (0.171).

The scikit-learn implementation differs slightly from the formula above: with smooth_idf=True it computes $idf(t,d) = \log\frac{1+n_d}{1+df(d,t)} + 1$, so the $\log$ term becomes $\log(1) = 0$ for a term that occurs in every document, rather than going negative.

TfidfTransformer also normalises the tf-idfs directly by applying L2 normalisation, which scales each feature vector to unit length. This makes the feature values comparable across documents of different lengths.

This can be verified like so...

v = tfidf.fit_transform(vectorizer.fit_transform(docs)).toarray()
np.linalg.norm(v[0])
1.0
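As a further check, here is a minimal sketch that reproduces the 0.171 value for "is" in the last toy document by hand, assuming scikit-learn's smoothed idf formula above followed by L2 normalisation:

import numpy as np

# raw term counts for the last toy document: 'and one and one is two'
# vocabulary order: and, is, one, shining, sun, sweet, the, two, weather
tf = np.array([2, 1, 2, 0, 0, 0, 0, 1, 0], dtype=float)

# how many of the 4 toy documents each vocabulary word appears in
df_counts = np.array([1, 4, 1, 2, 2, 2, 3, 1, 2], dtype=float)
n_docs = 4

# smoothed idf as used by scikit-learn: ln((1 + n_d) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + df_counts)) + 1

tfidf_raw = tf * idf
tfidf_l2 = tfidf_raw / np.linalg.norm(tfidf_raw)

np.round(tfidf_l2, 3)   # index 1 ('is') comes out at roughly 0.171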

3. Cleaning Text Data

  1. remove punctuation and html markup
  2. tokenisation
  3. removing stop words

The above steps are pretty typical in an NLP pipeline. There are different approaches to each of them; for example, for neural nets I've seen encoding strategies where things like capitals, HTML tags and unknown words are replaced with special tokens, which allows the model to capture information that may (or may not) be useful.

3.1 Stripping Punctuation & html

# source: this code comes straight from the book!
# https://sebastianraschka.com/books/

import re

def preprocessor(text):
    # strip html tags, collect emoticons, lower-case and remove punctuation
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text
s = df.loc[37720, 'review'][:50]
s
'WARNING: REVIEW CONTAINS MILD SPOILERS<br /><br />'
preprocessor(s)
'warning review contains mild spoilers'
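If we wanted to clean the whole corpus up front, the preprocessor could be applied column-wise; a small sketch (applied to a copy here, since the rest of the notebook works with the raw review text and lets the vectorisers deal with it):

# optionally strip html and punctuation from every review before vectorising
cleaned = df['review'].apply(preprocessor)
cleaned.head()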

3.2 Tokenisation

Tokenisation is the process of splitting a document into individual elements (tokens). There are different strategies for doing this, e.g. word tokenisation or sentence tokenisation. On top of this are other techniques like word stemming, the process of transforming a word into its root form, e.g. running -> run.

The NLTK library is one of many with tools to help with stemming and lemmatisation.

def tokeniser(text):
    return text.split()
tokeniser('runners like running')
['runners', 'like', 'running']
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def tokeniser_stemmer(text):
    return [stemmer.stem(word) for word in text.split()]
tokeniser_stemmer('runners like running')
['runner', 'like', 'run']
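NLTK also offers lemmatisation, which maps words to valid dictionary forms rather than chopped-off stems. A minimal sketch, assuming the wordnet corpus is downloaded and (for illustration) treating every token as a verb via pos='v':

import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def tokeniser_lemmatiser(text):
    # pos='v' treats each token as a verb, so 'running' -> 'run'
    return [lemmatizer.lemmatize(word, pos='v') for word in text.split()]

tokeniser_lemmatiser('runners like running')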

3.3 Stop word removal

Stop words are words that are extremely common and likely carry no useful or discriminatory information. Again, in the world of deep learning this is debatable; you should consider whether the task requires stop word removal and ultimately assess model performance to decide whether it is necessary.

import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/devindearaujo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

True
from nltk.corpus import stopwords

stop = stopwords.words('english')

s = 'a runner likes running and runs a lot'

[w for w in tokeniser_stemmer(s) if w not in stop]
['runner', 'like', 'run', 'run', 'lot']

4. Document Classification via Logistic Regression

Classify movie reviews using logistic regression, employing all of the preprocessing steps discussed above.

# use grid search to find optimal model params
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# combines TfidfTransformer & CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# train/test split: first 25,000 shuffled reviews for training, the rest for testing
# (iloc avoids the off-by-one overlap that label-based .loc slicing would introduce)
X_train, X_test = df['review'].iloc[:25000].values, df['review'].iloc[25000:].values
y_train, y_test = df['sentiment'].iloc[:25000].values, df['sentiment'].iloc[25000:].values

4.1 Finding optimal model params via GridSearchCV

The model parameters available to tune with grid search...

lr = LogisticRegression(solver='liblinear')
lr.get_params()
{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}
tfidf = TfidfVectorizer(
    strip_accents=None,
    lowercase=False,
    preprocessor=None
)

# param grid
param_grid = [
    {
        'vect__ngram_range': [(1,1)],
        'vect__stop_words': [None],
        'vect__tokenizer': [tokeniser, tokeniser_stemmer],
        'clf__penalty': ['l2'],
        'clf__C': [1., 10.]
    },
    {
        'vect__ngram_range': [(1,1)],
        'vect__stop_words': [stop, None],
        'vect__tokenizer': [tokeniser],
        'vect__use_idf': [False],
        'vect__norm': [None],
        'clf__penalty': ['l2'],
        'clf__C': [1., 10.]   
    }
]

# pipeline
lr_tfidf = Pipeline([
    ('vect', tfidf),
    ('clf', LogisticRegression(solver='liblinear'))
])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5,
                          verbose=2, n_jobs=-1)

gs_lr_tfidf.fit(X_train, y_train)
Fitting 5 folds for each of 8 candidates, totalling 40 fits

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect',
                                        TfidfVectorizer(lowercase=False)),
                                       ('clf',
                                        LogisticRegression(solver='liblinear'))]),
             n_jobs=-1,
             param_grid=[{'clf__C': [1.0, 10.0], 'clf__penalty': ['l2'],
                          'vect__ngram_range': [(1, 1)],
                          'vect__stop_words': [None],
                          'vect__tokenizer': [<function tokeniser at 0x11831fdc0>,
                                              <function tokeniser_stemmer at 0x118344310>]},
                         {...
                          'vect__stop_words': [['i', 'me', 'my', 'myself', 'we',
                                                'our', 'ours', 'ourselves',
                                                'you', "you're", "you've",
                                                "you'll", "you'd", 'your',
                                                'yours', 'yourself',
                                                'yourselves', 'he', 'him',
                                                'his', 'himself', 'she',
                                                "she's", 'her', 'hers',
                                                'herself', 'it', "it's", 'its',
                                                'itself', ...],
                                               None],
                          'vect__tokenizer': [<function tokeniser at 0x11831fdc0>],
                          'vect__use_idf': [False]}],
             scoring='accuracy', verbose=2)

gs_lr_tfidf.best_params_
{'clf__C': 10.0,
 'clf__penalty': 'l2',
 'vect__ngram_range': (1, 1),
 'vect__stop_words': None,
 'vect__tokenizer': <function __main__.tokeniser(text)>}
print(f'Average CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')
Average CV Accuracy: 0.888

Using the best estimator, check the classification accuracy on the test set.

clf = gs_lr_tfidf.best_estimator_

print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')
Test Accuracy: 0.893
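Accuracy alone can hide class-specific errors; a quick sketch of per-class precision and recall for the best estimator, using scikit-learn's classification_report:

from sklearn.metrics import classification_report

# per-class precision/recall/F1 on the held-out reviews
print(classification_report(y_test, clf.predict(X_test),
                            target_names=['negative', 'positive']))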

4.2 Updating the Pipeline with best parameters

The results demonstrate that the logistic regression model can predict whether a movie review is positive or negative with roughly 89% accuracy. Using the best parameters, we retrain the logistic regression model: the lr_tfidf pipeline can be updated using the set_params method, passing in the best_params_ from gs_lr_tfidf.

# create inference pipeline &
# update tfidf and logistic regression params

inf_pl = lr_tfidf.set_params(**gs_lr_tfidf.best_params_)

# refit with best params
inf_pl.fit(X_train, y_train)

# score on training
inf_pl.score(X_train, y_train)
0.9967112810707457
print(f"Test Accuracy: {inf_pl.score(X_test, y_test):.3f}")
Test Accuracy: 0.893

# check on some random text
s = np.array(["""terminator 2 was a horrible movie. the effects were good, \n
              but i just couldn't get onboard with Robert Patrick's character"""])

inf_pl.predict(s)
array([0])
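The pipeline can also return class probabilities rather than a hard label; for the same snippet:

# probability of negative (column 0) vs positive (column 1) sentiment
inf_pl.predict_proba(s)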

5. Topic Modeling with Latent Dirichlet Allocation (LDA)

Broadly speaking, topic modeling describes a method for assigning topics to unlabelled documents, for example categorising a large text corpus of newspaper articles or wiki pages. It can also be considered a clustering task: assigning a label to similar sets of items, where here the items are documents.

LDA is not to be confused with the matrix decomposition method Linear Discriminant Analysis, which is also abbreviated to LDA.

Latent Dirichlet Allocation (LDA) is a generative probabilistic model that aims to find groups of words that frequently appear together across a corpus of documents. It works on the assumption that each document is a mixture of topics and each topic is a distribution over words; the groups of words that frequently appear together form the topics.

The input to an LDA model is the bag-of-words matrix. Given this, LDA decomposes it into two new matrices:

  - a document-to-topic matrix
  - a word-to-topic matrix

The decomposition works in such a way that we can reconstruct the original matrix (with the lowest possible error) by multiplying the two latent feature matrices together. The downside of LDA is that the number of topics is a hyperparameter that must be specified manually beforehand.
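As a minimal sketch of what those two matrices look like, re-using the toy docs array from section 2.1 and an arbitrary choice of 2 topics:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_bow = CountVectorizer().fit_transform(docs)    # 4 documents x 9 vocabulary words

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = toy_lda.fit_transform(toy_bow)         # document-to-topic matrix, shape (4, 2)
topic_word = toy_lda.components_                   # word-to-topic matrix, shape (2, 9)

doc_topic.shape, topic_word.shape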

5.1 Bag-of-words on Movie Reviews

Fit a bag-of-words model using CountVectorizer on the movie reviews data. We exclude words that appear too frequently across documents by setting max_df to 10%. The dimensionality of the dataset can be controlled using the max_features argument; here 5000 is chosen arbitrarily.

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(
    stop_words='english',
    max_df=.1,
    max_features=5000
)

X = vect.fit_transform(df['review'].values)

5.2 Fitting the LDA model

With a total of 10 topics...

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_components=10,
    random_state=123,
    learning_method='batch',
    n_jobs=-1
)

X_topics = lda.fit_transform(X)
lda.components_.shape
(10, 5000)
n_top_words = 6
feature_names = vect.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {(topic_idx + 1)} : ')
    print(' '.join([feature_names[i] 
                   for i in topic.argsort()\
                   [:-n_top_words -1:-1]]))
Topic 1 : 
worst minutes script awful stupid terrible
Topic 2 : 
family mother father children girl women
Topic 3 : 
war american dvd music tv history
Topic 4 : 
human audience cinema art sense feel
Topic 5 : 
police guy car dead murder goes
Topic 6 : 
horror house sex blood girl woman
Topic 7 : 
role performance comedy actor plays performances
Topic 8 : 
series episode episodes tv season original
Topic 9 : 
book version original read effects fi
Topic 10 : 
action fight guy guys fun cool

Based on the most important words for each topic, we can make a general assumption about what each topic represents...

  1. generally terrible movie reviews
  2. movies about families
  3. history/war movies
  4. art/arthouse movies
  5. crime films
  6. horror films
  7. comedy films
  8. tv series or shows
  9. movies based on books
  10. action movies

To confirm our assumptions, we can print out sections of reviews from particular categories, say horror films (topic 6) and comedies (topic 7).

horror_idx = X_topics[:, 5].argsort()[::-1] # sort descending

for iter_idx, movie_idx in enumerate(horror_idx[:3]):
    print(f'\nHorror movie #{(iter_idx + 1)}:')
    print(df['review'].iloc[movie_idx][:300], '...')

Horror movie #1:
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac ...

Horror movie #2:
Before I talk about the ending of this film I will talk about the plot. Some dude named Gerald breaks his engagement to Kitty and runs off to Craven Castle in Scotland. After several months Kitty and her aunt venture off to Scottland. Arriving at Craven Castle Kitty finds that Gerald has aged and he ...

Horror movie #3:
This film marked the end of the "serious" Universal Monsters era (Abbott and Costello meet up with the monsters later in "Abbott and Costello Meet Frankentstein"). It was a somewhat desparate, yet fun attempt to revive the classic monsters of the Wolf Man, Frankenstein's monster, and Dracula one "la ...

comedy_idx = X_topics[:, 6].argsort()[::-1] # sort descending

for iter_idx, movie_idx in enumerate(comedy_idx[:3]):
    print(f'\nComedy movie #{(iter_idx + 1)}:')
    print(df['review'].iloc[movie_idx][:300], '...')

Comedy movie #1:
From producer/writer/Golden Globe nominated director James L. Brooks (Terms of Endearment, As Good as It Gets) this is a really good satirical comedy film showing behind the scenes in the life of a news reporter/anchor/journalist or producer might be like. Basically Jane Craig (Oscar and Golden Glob ...

Comedy movie #2:
THE SUNSHINE BOYS was the hilarious 1975 screen adaptation of Neil Simon's play about a retired vaudevillian team, played by Walter Matthau and George Burns, who had a very bitter breakup and have been asked to reunite one more time for a television special or something like that. The problem is tha ...

Comedy movie #3:
As far as I know the real guy that the main actor is playing saw his performance and said it was an outstanding portrayal, I'd agree with him. This is a fantastic film about a quite gifted boy/man with a special body part helping him. Oscar and BAFTA winning, and Golden Globe nominated Daniel Day-Le ...

Conclusion

In this notebook we saw how even a vanilla implementation of document classification with logistic regression can accurately predict whether a review is positive or negative. Following this, we saw that LDA is an effective method for grouping documents by topics extracted from the raw text.