Sentiment Analysis using NBC
Goal:
To learn about Naive Bayes classification and to measure the accuracy of the resulting model.

Basic Definitions:
Naive Bayes assumption used:
Every feature (word) is assumed to be independent of the other features given the class.
Formulae used:
1. P(A|B) = (P(B|A) * P(A)) / P(B)   (Bayes' rule)
2. P(A|B) = P(A and B) / P(B)   (conditional probability)
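To illustrate how Bayes' rule drives the classifier, below is a minimal sketch of scoring a sentence per class in Python. The word counts, priors, and function names are made-up examples for illustration and are not taken from the notebook.

import math

# Toy per-class word counts and priors (illustrative values only).
word_counts = {
    "pos": {"wonderful": 12, "boring": 1},
    "neg": {"wonderful": 1, "boring": 9},
}
class_priors = {"pos": 0.5, "neg": 0.5}  # P(class)

def class_score(words, label):
    # log P(class) + sum of log P(word | class), i.e. Bayes' rule with the
    # naive independence assumption; the denominator P(B) is the same for
    # every class and can be ignored when only comparing scores.
    total = sum(word_counts[label].values())
    score = math.log(class_priors[label])
    for w in words:
        count = word_counts[label].get(w, 0)
        if count == 0:
            return float("-inf")  # zero counts break the product; smoothing fixes this
        score += math.log(count / total)
    return score

def predict(sentence):
    words = sentence.lower().split()
    return max(("pos", "neg"), key=lambda c: class_score(words, c))

print(predict("wonderful"))  # pos
print(predict("boring"))     # neg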
Smoothing:
It can be carried out using the Laplace (add-one) correction. It helps in avoiding the zero probability values that arise when a word never occurs with a class in the training data.
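Below is a minimal sketch of an add-one smoothed likelihood, assuming per-class word counts have already been collected; the function and variable names are illustrative and not the ones used in the notebook.

def smoothed_likelihood(word, label, word_counts, vocab_size):
    # P(word | class) with add-one (Laplace) smoothing: every count is
    # incremented by 1 and the denominator grows by the vocabulary size,
    # so unseen words get a small non-zero probability instead of zero.
    count = word_counts[label].get(word, 0)
    total = sum(word_counts[label].values())
    return (count + 1) / (total + vocab_size)

This smoothed value replaces the raw count / total ratio in the per-class score above, so a single unseen word no longer forces the whole class score to negative infinity.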
Advantages of using Naive Bayes:
1. It is robust to noise.
2. It can handle missing values.
3. It is robust to irrelevant attributes.
Experimental observations and analysis:
1. Cleaning the data by removing punctuation helped in reaching an accuracy score of around 55.4% (a preprocessing sketch is given after this list).
2. Applying smoothing increased the accuracy of the model by around 4%; the probability values that were initially becoming zero received small non-zero values instead.
3. Filtering words based on their class-conditional probabilities further improved the accuracy.
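The sketch below shows the kind of punctuation cleaning described in observation 1, assuming the imdb_labelled.txt format of one sentence, a tab, and a 0/1 label per line; the exact cleaning steps in the notebook may differ.

import re

def clean(text):
    # Lowercase the text and replace punctuation / special characters with spaces.
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

# imdb_labelled.txt stores one "sentence<TAB>label" pair per line,
# where the label is 0 (negative) or 1 (positive).
def load_dataset(path="imdb_labelled.txt"):
    sentences, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if "\t" not in line:
                continue
            sentence, label = line.rsplit("\t", 1)
            sentences.append(clean(sentence).split())
            labels.append(int(label))
    return sentences, labels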
Top 10 words that predict the negative class are as follows:
['poor', 'lines', 'wasted', "can't", 'worst', 'annoying', 'these', 'dialogue',
'stupid', 'awful']
Top 10 words that predict the positive class are as follows:
['interesting', 'however', 'brilliant', 'wonderful', 'liked', 'actually',
'played', 'job', 'makes', 'family']
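One way such lists can be produced is to rank the vocabulary by the log-ratio of the smoothed class-conditional probabilities, as in the sketch below; this ranking criterion is an assumption and not necessarily the exact one used in the notebook.

import math

def top_predictive_words(word_counts, vocab, label, other, n=10):
    # Rank words by log P(word | label) - log P(word | other),
    # using add-one smoothed likelihoods for both classes.
    total_label = sum(word_counts[label].values())
    total_other = sum(word_counts[other].values())
    vocab_size = len(vocab)

    def score(word):
        p_label = (word_counts[label].get(word, 0) + 1) / (total_label + vocab_size)
        p_other = (word_counts[other].get(word, 0) + 1) / (total_other + vocab_size)
        return math.log(p_label / p_other)

    return sorted(vocab, key=score, reverse=True)[:n]

# e.g. top_predictive_words(word_counts, vocab, "neg", "pos") returns the
# 10 words that most strongly indicate the negative class.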
Code:
GitHub link for the code:
https://github.com/swatidamele/Jupyter-Notebooks/blob/main/Damele_03.ipynb
GitHub link for the dataset file "imdb_labelled.txt" used in the code:
https://github.com/swatidamele/Jupyter-Notebooks/blob/main/imdb_labelled.txt
Screenshots of code:
Challenges faced:
Inconsistencies in the data format initially made it difficult to get correct predictions. Steps such as removing special characters helped in getting better results.
References:
https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set
https://machinelearningmastery.com/k-fold-cross-validation/
Contribution:
Handled variations in the data format by removing unnecessary parts of the content before computing the probability values. The accuracy of the model was further improved by filtering special characters out of the dataset and applying smoothing.












