Posts

Twitter Data Analysis

Goal: Collect who-follows-whom data from the Twitter app and group users by the number of users they follow. For each group, calculate the number of users belonging to that group.
Basics:
MapReduce: A MapReduce program is composed of a map procedure, which performs filtering and sorting, and a reduce method, which performs a summary operation. It is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce framework (or system) is usually composed of three operations (or steps):
Map: each worker node applies the map function to its local data and writes the output to temporary storage. A master node ensures that only one copy of the redundant input data is processed.
Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to...
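The map-shuffle-reduce steps above can be simulated in plain Python. This is a minimal single-machine sketch (the edge list is hypothetical toy data, not the actual Twitter data set): a first map/reduce pass counts how many users each person follows, and a second pass groups users by that count.

```python
from collections import defaultdict

# Toy edge list of (follower, followee) pairs -- illustrative stand-in
# for data collected from Twitter
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "b"), ("d", "c"), ("d", "e")]

# Map: emit (follower, 1) for every edge
mapped = [(follower, 1) for follower, _ in edges]

# Shuffle: group emitted values by key
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce: sum per key -> how many users each person follows
follow_counts = {user: sum(vals) for user, vals in shuffled.items()}

# Second pass: group users by their follow count and
# count how many users fall in each group
count_groups = defaultdict(int)
for _, count in follow_counts.items():
    count_groups[count] += 1

print(follow_counts)        # {'a': 2, 'b': 1, 'd': 3}
print(dict(count_groups))   # {2: 1, 1: 1, 3: 1}
```

On a real cluster each of these stages would run in parallel across worker nodes; the in-memory dictionaries here stand in for the framework's distributed shuffle.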

Garbage Classification

Goal: Garbage classification using an image of the product, at the user level directly.
Algorithm used: A ResNet CNN model performs the image classification.
Optimizers used: Adam and Adamax
Evaluation score: 0.97
Future perspective: This system can be extended to a single bin at public places that changes internally based on the type of item being disposed of, capturing an image of the item and switching the bin to be used accordingly.
Advantages: It would expand the classification of products to be recycled to the user level directly, which in turn reduces the manual effort required during the recycling process. A user unaware of the category might mistakenly place the wrong type of item in a bin; automating the step avoids these manual errors and reduces the effort at the back end later.
Experimental observations and analysis: Implemented the model in the following ways: MobileNetV2 with sigmoid, MobileNetV2 with softmax, ResNet with sig...

Sentiment Analysis using NBC

Goal: To learn about Naive Bayes classification and calculate the accuracy of the model.
Basic definitions:
Naive Bayes assumption used: every feature is conditionally independent of the others, given the class.
Formulae used:
1. P(A|B) = (P(B|A) * P(A)) / P(B)
2. P(A|B) = P(A and B) / P(B)
Smoothing: It can be carried out using the Laplace (add-1) correction. It helps avoid zero probability values caused by insufficient data.
Advantages of using Naive Bayes:
1. It is robust to noise.
2. It can handle missing values.
3. It is robust to irrelevant attributes.
Experimental observations and analysis:
1. Cleaning the data by removing punctuation helped in getting a good accuracy score of around 55.4%.
2. Applying smoothing increased the model's accuracy by around 4%; probability values that were initially becoming zero received some value after smoothing.
3. Filtering words based on their probability helped in getting better accuracy results further. Top 10 words that...
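The classifier and the add-1 smoothing described above can be sketched in a few lines of plain Python. This is a minimal illustration on hypothetical toy sentences, not the actual data set from the post: class priors and smoothed word likelihoods are combined in log space, and the `+ 1` / `+ len(vocab)` terms are the Laplace correction that keeps unseen words from zeroing out a class.

```python
import math
from collections import defaultdict

# Hypothetical labelled toy corpus for illustration
train = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("awful plot terrible acting", "neg"),
]

# Count words per class and build the vocabulary
word_counts = defaultdict(lambda: defaultdict(int))
class_counts = defaultdict(int)
vocab = set()
for text, label in train:
    class_counts[label] += 1
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def predict(text):
    best_label, best_logprob = None, float("-inf")
    for label in class_counts:
        # log prior: P(class)
        logprob = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in text.split():
            # Laplace (add-1) smoothing avoids zero probabilities
            # for words unseen in this class
            logprob += math.log(
                (word_counts[label][word] + 1) / (total + len(vocab))
            )
        if logprob > best_logprob:
            best_label, best_logprob = label, logprob
    return best_label

print(predict("great plot"))        # pos
print(predict("terrible acting"))   # neg
```

Without the `+ 1` terms, a single word absent from a class's training text would force that class's probability to zero, which is exactly the failure mode the post's smoothing observation describes.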

Study - Concept of overfitting using higher order linear regression

Goal: To study the concept of overfitting using variations in higher-order linear regression.
Basic definitions: A model's performance can suffer from two deficiencies: underfitting and overfitting.
Underfitting: It occurs when our model's capacity is too low for the complexity of the available data. It normally shows large error on both the training and test sets.
Overfitting: It occurs when our model's capacity is too high relative to the available data. It normally shows a small training error but a large test error.
It may seem appealing when we see all the data points being covered in the overfitting case, but it is not desirable: what matters is the generalization ability of our model, not fitting every training point. So, what is the correct way to go? We need to find a state that lies between overfitting and underfitting. Let's analyze it further in this blog.
Gaussian noise: These are th...
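The underfitting/overfitting trade-off above can be demonstrated with a small NumPy experiment. This is a hedged sketch under assumed settings (a sine target with Gaussian noise, 10 training points, polynomial degrees 1, 3, and 9 -- none of these come from the post itself): as the degree grows, training error keeps shrinking, while test error against the clean target eventually worsens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed true function; training targets get Gaussian noise added
f = lambda x: np.sin(2 * np.pi * x)
x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0, 1, 50)
y_train = f(x_train) + rng.normal(0.0, 0.2, x_train.shape)
y_test = f(x_test)  # evaluate against the noiseless target

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

Because the higher-degree basis contains the lower-degree one, training error can only decrease with degree; the degree-9 fit passes through all 10 noisy points, yet its test error reveals that it has memorized the noise rather than the underlying function.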