Sentiment Analysis using NBC

Goal:

To learn about Naive Bayes classification and calculate the accuracy of the model.

Basic Definitions:

Naive Bayes assumption used

The features (here, the words of a review) are assumed to be conditionally independent of one another given the class.
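
As a minimal sketch of what this assumption buys: the likelihood of a whole sentence under a class reduces to a product of per-word probabilities (a sum in log space). The probability table and the names `word_probs` and `log_likelihood` are invented for illustration and are not taken from the notebook.

```python
import math

# Hypothetical values of P(word | class); made-up numbers for illustration only.
word_probs = {
    "pos": {"wonderful": 0.20, "movie": 0.50, "awful": 0.01},
    "neg": {"wonderful": 0.02, "movie": 0.50, "awful": 0.25},
}

def log_likelihood(words, label):
    """Sum of log P(word | label): the log of the product under independence."""
    return sum(math.log(word_probs[label][w]) for w in words)

sentence = ["wonderful", "movie"]
scores = {label: log_likelihood(sentence, label) for label in word_probs}

# With equal class priors, the class with the highest likelihood wins.
best = max(scores, key=scores.get)
print(best, scores)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.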

Formulae used:

1. Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
2. Conditional probability: P(A|B) = P(A and B) / P(B)
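
The two formulae can be checked against each other with a small worked example. This is only a sketch: the events and all the numbers are made up, with A standing for "the review is positive" and B for "the review contains the word 'wonderful'".

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Made-up numbers for illustration.
p_a = 0.5               # P(A): prior probability of a positive review
p_b_given_a = 0.20      # P(B|A): 'wonderful' appears in a positive review
p_b_given_not_a = 0.02  # P(B|not A): 'wonderful' appears in a negative review

# Law of total probability gives P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Formula 1: Bayes' theorem.
p_a_given_b = posterior(p_b_given_a, p_a, p_b)

# Formula 2: P(A and B) / P(B), where P(A and B) = P(B|A) * P(A).
p_a_and_b = p_b_given_a * p_a
p_a_given_b_alt = p_a_and_b / p_b

print(p_a_given_b, p_a_given_b_alt)
```

Both routes give the same posterior, since formula 1 is formula 2 with P(A and B) expanded as P(B|A) * P(A).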

Smoothing

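It can be carried out using Laplace (add-one) smoothing, which adds 1 to every word count before the probabilities are computed. This avoids the zero probabilities that otherwise appear when a word never occurs with a class in the training data.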
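
A minimal sketch of add-one smoothing for word probabilities follows. The function name `smoothed_prob`, the `alpha` parameter, and the toy counts are assumptions made for illustration; the notebook may organise this differently.

```python
from collections import Counter

def smoothed_prob(word, counts, vocab_size, alpha=1):
    """P(word | class) with add-alpha smoothing (alpha=1 is Laplace smoothing)."""
    total = sum(counts.values())
    # Counter returns 0 for unseen words, so the numerator is never zero.
    return (counts[word] + alpha) / (total + alpha * vocab_size)

# Toy word counts for the negative class; 'wonderful' was never seen with it.
neg_counts = Counter({"awful": 3, "boring": 1})
vocab = {"awful", "boring", "wonderful"}

p_seen = smoothed_prob("awful", neg_counts, len(vocab))        # (3 + 1) / (4 + 3)
p_unseen = smoothed_prob("wonderful", neg_counts, len(vocab))  # (0 + 1) / (4 + 3)
print(p_seen, p_unseen)
```

Adding `alpha * vocab_size` to the denominator keeps the smoothed probabilities summing to 1 over the vocabulary.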

Advantages of using Naive Bayes:

1. It is robust to isolated noise points.
2. It can handle missing values.
3. It is robust to irrelevant attributes.


Experimental observations and analysis:

1. Cleaning the data by removing punctuation helped reach an accuracy of around 55.4%.

2. Applying smoothing increased the accuracy of the model by around 4%. Probabilities that had previously been zero received a small non-zero value after smoothing.

3. Filtering words based on their probability then improved the accuracy further.

Top 10 words that predict negative class are as follows: ['poor', 'lines', 'wasted', "can't", 'worst', 'annoying', 'these', 'dialogue', 'stupid', 'awful']

Top 10 words that predict positive class are as follows: ['interesting', 'however', 'brilliant', 'wonderful', 'liked', 'actually', 'played', 'job', 'makes', 'family']
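
One plausible way to obtain such a ranking is to sort the vocabulary by the ratio of smoothed class probabilities, P(word | class) / P(word | other class). This is only a sketch on a tiny made-up corpus; the notebook may use a different criterion, and the names `prob` and `top_words` are hypothetical.

```python
from collections import Counter

# Tiny made-up corpus of (text, label) pairs; 1 = positive, 0 = negative.
reviews = [
    ("wonderful acting and a brilliant story", 1),
    ("brilliant and wonderful", 1),
    ("awful plot and stupid dialogue", 0),
    ("awful and annoying", 0),
]

# Count how often each word occurs in each class.
counts = {0: Counter(), 1: Counter()}
for text, label in reviews:
    counts[label].update(text.split())

vocab = set(counts[0]) | set(counts[1])

def prob(word, label):
    """P(word | label) with add-one smoothing."""
    total = sum(counts[label].values())
    return (counts[label][word] + 1) / (total + len(vocab))

def top_words(label, k=2):
    """The k words whose probability ratio most favours `label`."""
    other = 1 - label
    ranked = sorted(sorted(vocab),
                    key=lambda w: prob(w, label) / prob(w, other),
                    reverse=True)
    return ranked[:k]

print("positive:", top_words(1))
print("negative:", top_words(0, k=1))
```

Words that occur equally often in both classes, such as "and" here, get a ratio near 1 and fall to the middle of the ranking.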


Code:

Github link for code:

https://github.com/swatidamele/Jupyter-Notebooks/blob/main/Damele_03.ipynb

GitHub link for the dataset file "imdb_labelled.txt" used in the code:

https://github.com/swatidamele/Jupyter-Notebooks/blob/main/imdb_labelled.txt

Challenges faced: 

Inconsistencies in the data format made it difficult to get correct predictions initially. Cleaning steps such as removing special characters helped in getting better results.
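
A minimal sketch of this kind of cleaning is shown below. It assumes each line of the dataset holds a sentence and a 0/1 label separated by a tab, and it keeps apostrophes so that words such as "can't" survive; both points are assumptions, and the helper names `clean` and `parse_line` are hypothetical rather than taken from the notebook.

```python
import string

# Strip all ASCII punctuation except the apostrophe (assumption: keep "can't").
PUNCT_TO_REMOVE = string.punctuation.replace("'", "")
STRIP_TABLE = str.maketrans("", "", PUNCT_TO_REMOVE)

def clean(text):
    """Lowercase the text and remove the punctuation listed above."""
    return text.lower().translate(STRIP_TABLE)

def parse_line(line):
    """Split one 'sentence<TAB>label' line into (tokens, int label)."""
    sentence, label = line.rstrip("\n").rsplit("\t", 1)
    return clean(sentence).split(), int(label)

tokens, label = parse_line("A very, very slow-moving movie.\t0\n")
print(tokens, label)
```

Splitting from the right with `rsplit("\t", 1)` keeps the parse intact even if a sentence itself contains a tab.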


Contribution: 

Handled variations in the data format by removing the unnecessary parts of the content so that proper values could be extracted. The accuracy of the model was also improved by filtering special characters out of the dataset and applying smoothing.
