This is a simple implementation of a Naive Bayes classifier for review rating. The classifier uses Bayes' rule to compute the probability of a piece of text belonging to each class, then picks the class with the highest probability.

Let $w_i$ be the token at position $i$ in a sentence. For a piece of text with $n$ tokens we want the probability $P(c | w_1 w_2 ... w_n)$. Using Bayes' rule (we can drop the evidence term since it is the same for every class), and assuming the tokens are conditionally independent given the class, we have $$ \begin{aligned} P(c | w_1 w_2 ... w_n) & \propto P(c) P( w_1 w_2 ... w_n | c) \\ & = P(c) P( w_1 | c) P (w_2 | c) ... P( w_n | c) \\ & = P(c) \prod_{i=1}^{n} P( w_i | c) \end{aligned} $$ Since the logarithm is monotonic, maximizing this product is equivalent to maximizing the sum $$ \log P(c) + \sum_{i=1}^{n} \log P( w_i | c), $$ which is what the code below computes to avoid numerical underflow.

This model doesn't directly use the probability $P(c | w_1 w_2 ... w_n)$ but rather uses the likelihood $P( w_1 w_2 ... w_n | c)$ and prior $P(c)$. It therefore belongs to the family of generative models.

We can use counts to approximate those probabilities.

$$ P(w_i | c) = \frac{count(w_i, c)}{\sum_{w} count(w, c)} $$

with $count(w_i, c)$ being the number of times the token $w_i$ appears in documents of class $c$, and the denominator the total number of tokens observed in that class. For the prior, $$ P(c) = \frac{count(c)}{|\mathcal{D}|} $$ with $count(c)$ the number of documents in that class and $|\mathcal{D}|$ the total number of documents in the dataset.
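For example (with made-up numbers), if the token "great" appeared 40 times across the positive reviews and the positive reviews contained 1,000 tokens in total, then $P(great | positive) = 40/1000 = 0.04$; and if 1,500 of the 3,000 documents were positive, $P(positive) = 1500/3000 = 0.5$.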

We will try the classifier on review data, combining labeled reviews from Amazon, IMDb and Yelp.

We will use the nltk Python package for preprocessing, specifically the punkt tokenizer and the WordNet lemmatizer.

from nltk import word_tokenize
import nltk
import numpy as np
import re

nltk.download('punkt')
nltk.download('wordnet')

Now we have the code for the class. The notable methods here are:

  • _train: collects the statistics (counts) we need for the classifier
  • _predict: uses those statistics to make predictions
  • preprocess: takes in a sentence and performs tokenization, punctuation removal and lemmatization

I also added a method to load and save the model.

Here is the initializer for our naive classifier:

class naive_classifier:
    def __init__(self, classes):
        """
        Initialization of the object. We create the dictionaries that will hold the priors and likelihoods.
        """
        self.trained = False
        self.classes = classes
        self.nclasses = len(self.classes)

        self.likelihoods = {c: dict() for c in range(self.nclasses)}
        self.priors = [0 for i in range(self.nclasses)]
        self.vocabulary = []

It only takes classes as an argument: a list of the classes to predict, like ["positive", "negative"] for instance. The trained boolean will be set to true once training is done, and nclasses is the number of classes. The likelihoods variable is a dictionary containing one inner dictionary per class. Each inner dictionary holds the probabilities of the words given that class, $P( w_i | c)$ in the equations above. It would look something like this:

{
    "positive": {
        "bad": 0.007, # probability of the word 'bad' being in a postive review
        "great": 0.401, # probability of the word 'great' being in a postive review
        ...
        "good": 0.576, # probability of the word 'good' being in a postive review
    },
    "negative": {
        "bad": 0.72, # probability of the word 'bad' being in a negative review
        "great": 0.0001, # probability of the word 'great' being in a negative review
        ...
        "good": 0.016, # probability of the word 'good' being in a negative review
    }
}

The priors variable holds the prior probability of each class, $P(c)$ in the equations above. Note that in the actual implementation the outer keys of likelihoods are class indices rather than class names, and the stored values are log probabilities; the example above uses names and raw probabilities for readability.

I will skip the read method. You can check it out by running the notebook. Just go to the top of the page and click on "open in colab". You can also use "view on github".

The first method we are interested in is the preprocessing method.

def preprocess(self, sentence):
    """
    Preprocesses a sentence: tokenizes it, removes punctuation tokens,
    then lemmatizes each token using the WordNet lemmatizer.
    """
    import string
    from nltk.stem import WordNetLemmatizer

    wordnet_lemmatizer = WordNetLemmatizer()
    words = word_tokenize(sentence)
    tokens = []
    for word in words:
        # Skip pure punctuation tokens and keep the lemma of everything else.
        if word not in string.punctuation:
            tokens.append(wordnet_lemmatizer.lemmatize(word))
    return tokens

It takes in a string. NLTK's tokenizer breaks the sentence into a list of tokens; we then drop punctuation tokens and replace each remaining token by its lemma using the WordNet lemmatizer.
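As a quick sanity check, here is a hypothetical call (the exact output depends on the installed punkt and WordNet data):

clf = naive_classifier(classes=["positive", "negative"])

# Tokenize, drop punctuation, and lemmatize; the output shown is indicative only.
print(clf.preprocess("The batteries were dead on arrival."))
# e.g. ['The', 'battery', 'were', 'dead', 'on', 'arrival']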

Let's look at the training function:

def _train(self, corpus):
    """
    Collects counts from the corpus and turns them into log priors and
    Laplace-smoothed log likelihoods.
    """
    classCounts = [0 for i in range(self.nclasses)]
    ndoc = len(corpus)
    wordCounts = {c: dict() for c in range(self.nclasses)}

    # First pass: count documents per class and token occurrences per class.
    for document in corpus:
        review = document[0]
        label = document[-1]
        classCounts[label] += 1
        for word in review:
            wordCounts[label][word] = wordCounts[label].get(word, 0) + 1

    # Log priors and the vocabulary shared across all classes.
    for index in range(self.nclasses):
        self.priors[index] = np.log(classCounts[index] / ndoc)
        self.vocabulary += list(wordCounts[index].keys())
    self.vocabulary = set(self.vocabulary)
    print("Vocabulary size: ", len(self.vocabulary))

    # Log likelihoods with add-one (Laplace) smoothing. The denominator is the
    # total number of tokens seen in the class plus the number of distinct
    # tokens seen in that class.
    for index in range(self.nclasses):
        denominator = sum(wordCounts[index].values()) + len(wordCounts[index])
        for word in self.vocabulary:
            # Words never seen in this class get a smoothed count of one.
            numerator = wordCounts[index].get(word, 0) + 1
            self.likelihoods[index][word] = np.log(numerator / denominator)

The first loop collects, in a dictionary called wordCounts, the number of documents per class and the number of times each token appears in each class. The second loop computes the log prior for each class and builds the vocabulary, the combined set of tokens from all the classes. The last loop then uses these statistics to compute the Laplace-smoothed log likelihood of each word within each class.
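In equation form, writing $V_c$ for the set of distinct tokens seen in class $c$ (my notation, not the code's), the estimate computed by the last loop is $$ \hat{P}(w_i | c) = \frac{count(w_i, c) + 1}{\sum_{w \in V_c} count(w, c) + |V_c|} $$ The $+1$ in the numerator keeps the log likelihood finite for words that never appear in a given class.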

Now let's look at the predict function.

def _predict(self, sentence):
    """
    Takes a tokenized sentence and returns the index of the most likely class.
    """
    sumc = dict()
    for c in range(self.nclasses):
        # Start from the log prior, then add the log likelihood of every
        # known token; tokens outside the vocabulary are ignored.
        sumc[c] = self.priors[c]
        for word in sentence:
            if word in self.vocabulary:
                sumc[c] += self.likelihoods[c][word]
    # Return the class with the highest log score.
    return max(sumc, key=sumc.get)

In this method, we loop through the classes and compute the value of $\log P(c) + \sum_{i=1}^{n} \log P( w_i | c)$. We store the scores in a dictionary called sumc with the classes as keys: we start from the class log prior, then add the log likelihood of each token for that class, skipping tokens that are not in the vocabulary. The last line returns the key with the maximum value, which is the predicted class.
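As a toy illustration of the scoring, with made-up numbers that are not taken from the trained model:

import numpy as np

# Made-up log priors and log likelihoods for a two-class, two-word toy example.
priors = {"positive": np.log(0.5), "negative": np.log(0.5)}
likelihoods = {
    "positive": {"good": np.log(0.05), "bad": np.log(0.005)},
    "negative": {"good": np.log(0.002), "bad": np.log(0.04)},
}

sentence = ["good", "bad"]
scores = {c: priors[c] + sum(likelihoods[c][w] for w in sentence) for c in priors}
print(scores)                       # 'positive' gets the higher (less negative) score
print(max(scores, key=scores.get))  # -> 'positive'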

We train and test the classifier. We can see that even a naive classifier can achieve a relatively good accuracy on simple problems.

classifier = naive_classifier(classes = ["positive", "negative"])
classifier.train(["./amazon_cells_labelled.txt",
                  "./imdb_labelled.txt",
                  "./yelp_labelled.txt"],
                 test=True,
                 split_ratio=0.2)
reading:  ./amazon_cells_labelled.txt
reading:  ./imdb_labelled.txt
reading:  ./yelp_labelled.txt
Vocabulary size:  4371
2400  training items
600  testing items
Training done
Train accuracy:  0.93875
Test accuracy:  0.83

We can now test the classifier on new text and see the result. The predict method returns the numerical label, which in this dataset is 1 for positive and 0 for negative. Here is an example that belongs to the positive class.

test_text1 = "Mushoku Tensei is the greatest light novel series ever! It has great world building, foreshadowing and character development"

classifier.predict(test_text1)
1

Here is one that belongs to the negative class

test_text2 = "The new iPhone was not satisfactory. The innovation is not there anymore and prices are still through the roof"

classifier.predict(test_text2)
0