Sentiment Analysis using NBC
Goal:
To learn about Naive Bayes classification and to measure the accuracy of the resulting model.

Basic Definitions:
Naive Bayes assumption used:
Every feature (word) is assumed to be independent of the other features given the class.
Formulae used:
1. P(A|B) = (P(B|A) * P(A)) / P(B)   (Bayes' rule)
2. P(A|B) = P(A and B) / P(B)   (conditional probability)
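To illustrate how Bayes' rule drives the classifier, below is a minimal sketch of scoring a sentence per class in Python. The word counts, priors, and function names are made-up examples for illustration and are not taken from the notebook.

import math

# Toy per-class word counts and priors (illustrative values only).
word_counts = {
    "pos": {"wonderful": 12, "boring": 1},
    "neg": {"wonderful": 1, "boring": 9},
}
class_priors = {"pos": 0.5, "neg": 0.5}  # P(class)

def class_score(words, label):
    # log P(class) + sum of log P(word | class), i.e. Bayes' rule with the
    # naive independence assumption; the denominator P(B) is the same for
    # every class and can be ignored when only comparing scores.
    total = sum(word_counts[label].values())
    score = math.log(class_priors[label])
    for w in words:
        count = word_counts[label].get(w, 0)
        if count == 0:
            return float("-inf")  # zero counts break the product; smoothing fixes this
        score += math.log(count / total)
    return score

def predict(sentence):
    words = sentence.lower().split()
    return max(("pos", "neg"), key=lambda c: class_score(words, c))

print(predict("wonderful"))  # pos
print(predict("boring"))     # neg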
Smoothing:
It can be carried out using the Laplace (add-one) correction. It helps in avoiding the zero probability values that arise when a word never occurs with a class in the training data.
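Below is a minimal sketch of an add-one smoothed likelihood, assuming per-class word counts have already been collected; the function and variable names are illustrative and not the ones used in the notebook.

def smoothed_likelihood(word, label, word_counts, vocab_size):
    # P(word | class) with add-one (Laplace) smoothing: every count is
    # incremented by 1 and the denominator grows by the vocabulary size,
    # so unseen words get a small non-zero probability instead of zero.
    count = word_counts[label].get(word, 0)
    total = sum(word_counts[label].values())
    return (count + 1) / (total + vocab_size)

This smoothed value replaces the raw count / total ratio in the per-class score above, so a single unseen word no longer forces the whole class score to negative infinity.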
Advantages of using Naive Bayes:
1. It is robust to noise.
2. It can handle missing values.
3. It is robust to irrelevant attributes.
Experimental observations and analysis:
1. Cleaning the data by removing punctuation helped in reaching an accuracy score of around 55.4% (a preprocessing sketch is given after this list).
2. Applying smoothing increased the accuracy of the model by around 4%; the probability values that were initially becoming zero received small non-zero values instead.
3. Filtering words based on their class-conditional probabilities further improved the accuracy.
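The sketch below shows the kind of punctuation cleaning described in observation 1, assuming the imdb_labelled.txt format of one sentence, a tab, and a 0/1 label per line; the exact cleaning steps in the notebook may differ.

import re

def clean(text):
    # Lowercase the text and replace punctuation / special characters with spaces.
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

# imdb_labelled.txt stores one "sentence<TAB>label" pair per line,
# where the label is 0 (negative) or 1 (positive).
def load_dataset(path="imdb_labelled.txt"):
    sentences, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if "\t" not in line:
                continue
            sentence, label = line.rsplit("\t", 1)
            sentences.append(clean(sentence).split())
            labels.append(int(label))
    return sentences, labels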
Top 10 words that predict the negative class are as follows:
['poor', 'lines', 'wasted', "can't", 'worst', 'annoying', 'these', 'dialogue',
'stupid', 'awful']
Top 10 words that predict the positive class are as follows:
['interesting', 'however', 'brilliant', 'wonderful', 'liked', 'actually',
'played', 'job', 'makes', 'family']
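One way such lists can be produced is to rank the vocabulary by the log-ratio of the smoothed class-conditional probabilities, as in the sketch below; this ranking criterion is an assumption and not necessarily the exact one used in the notebook.

import math

def top_predictive_words(word_counts, vocab, label, other, n=10):
    # Rank words by log P(word | label) - log P(word | other),
    # using add-one smoothed likelihoods for both classes.
    total_label = sum(word_counts[label].values())
    total_other = sum(word_counts[other].values())
    vocab_size = len(vocab)

    def score(word):
        p_label = (word_counts[label].get(word, 0) + 1) / (total_label + vocab_size)
        p_other = (word_counts[other].get(word, 0) + 1) / (total_other + vocab_size)
        return math.log(p_label / p_other)

    return sorted(vocab, key=score, reverse=True)[:n]

# e.g. top_predictive_words(word_counts, vocab, "neg", "pos") returns the
# 10 words that most strongly indicate the negative class.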
Code:
GitHub link for the code:
https://github.com/swatidamele/Jupyter-Notebooks/blob/main/Damele_03.ipynb
GitHub link for the dataset file "imdb_labelled.txt" used in the code:
https://github.com/swatidamele/Jupyter-Notebooks/blob/main/imdb_labelled.txt
Screenshots of code:
Challenges faced:
Inconsistencies in the data format initially made it difficult to get correct predictions. Steps such as removing special characters helped in getting better results.
References:
https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set
https://machinelearningmastery.com/k-fold-cross-validation/
Contribution:
Handled variations in the data format by removing unnecessary parts of the content before computing the probability values. The accuracy of the model was further improved by filtering special characters out of the dataset and applying smoothing.












