Study - Concept of overfitting using higher order linear regression

Goal:

To study the concept of overfitting using variations in higher order linear regression.

Basic Definitions:


There are some deficiencies that a model's performance might face. They include the concepts of underfitting and overfitting.


Underfitting: It occurs when our model's capacity is too low for the complexity of the available data. It normally shows a large training error as well as a large test error.


Overfitting: It occurs when our model's capacity is too high relative to the amount of available data. It normally shows a small training error but a large test error.


At first glance it seems hard to believe that a curve covering all the data points is undesirable, as happens with overfitting. But what matters here is the generalization ability of the model, not merely fitting the data points. So what is the correct way to go? We need to find a state that lies between overfitting and underfitting. Let's analyze it further in this blog.


Gaussian noise: These are deviations added to the target variable away from its true value, drawn from a Gaussian distribution. This is done to mimic a real-world dataset, which is never free of a noise component.
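As an illustration, noisy targets of this kind can be generated as follows (a sketch assuming numpy; the sine curve, sample size, and noise level are illustrative choices, not necessarily those of the original experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Inputs and a clean underlying target (a sine curve is assumed here)
x = np.sort(rng.uniform(0, 1, 20))
y_clean = np.sin(2 * np.pi * x)

# Add zero-mean Gaussian noise so the data resembles a real-world
# dataset, which is never free of a noise component
y = y_clean + rng.normal(loc=0.0, scale=0.2, size=x.shape)
```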

Regularization: It is a technique that modifies the learning algorithm in such a way that it generalizes better. It tunes the fitted function by adding an extra penalty term to the error function. This improves the model's performance on test data and helps avoid overfitting.
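In the Ridge regression used later in this post, the penalty is the squared L2 norm of the weights, scaled by lambda (a sketch in standard notation; w denotes the model weights and phi the polynomial features):

```latex
E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(y_n - \mathbf{w}^{\top}\boldsymbol{\phi}(x_n)\right)^2 + \frac{\lambda}{2}\lVert \mathbf{w} \rVert^2
```

A larger lambda penalizes large weights more heavily, pulling the model away from overfitting.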


Experimental observations and analysis:


The following graphs were generated for degrees 0 to 9 of the higher-order linear regression model:


For degree 0:
For degree 1:


For degree 2:



For degree 3:



For degree 4:



For degree 5:



For degree 6:


For degree 7:


For degree 8:


For degree 9:






Analyzing how the fitted curves interpolate the data, the lower-degree curves do not pass through the data points, while curves of increasing degree cover more and more of them. From the plots alone, the 9th-order curve seems to be the best fit, but this is actually where overfitting occurs. The plot of training and test error helped identify the overfitting in this situation.


The following graph was obtained using the training and test data from the same model:



It can be seen that the training error keeps moving toward zero, while the test error starts increasing after degree 3, which indicates overfitting. The degree just before the test error begins to rise is a good candidate for the best-fit model, since overfitting is characterized by a decreasing training error and an increasing test error even as the curve passes through all the plotted data points.

The following weight changes were observed while fitting the model:
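A sketch of how such weight values can be computed, assuming numpy; exploding coefficient magnitudes at high degree are a telltale sign of overfitting (the data-generation choices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 10))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)

# Largest coefficient magnitude for each degree: as the degree grows,
# the fitted weights tend to explode, which signals overfitting
for degree in range(10):
    coeffs = np.polyfit(x, y, degree)
    print(f"degree {degree}: max|w| = {np.max(np.abs(coeffs)):.2f}")
```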


To further improve the performance of the model, Ridge regression was applied, and the following graphs were plotted for degree 9 with different values of lambda.

For Lambda = 1:




For Lambda = 1/10:




For Lambda = 1/100:




For Lambda = 1/1000:




For Lambda = 1/10000:



For Lambda = 1/100000:

On plotting the training and test errors obtained after applying the Ridge model against the logarithmic value of lambda, the following observation was made.
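The Ridge fits above can be sketched with a closed-form numpy implementation (degree 9 and the lambda values follow the experiment; the data generation and helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = np.sort(rng.uniform(0, 1, 10))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)

# Degree-9 polynomial design matrix (bias column included)
Phi = np.vander(x_train, 10, increasing=True)

# Closed-form Ridge solution: w = (Phi^T Phi + lambda * I)^-1 Phi^T y
def ridge_fit(Phi, y, lam):
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

# Larger lambda shrinks the weights; smaller lambda approaches
# the plain least-squares (overfitting) solution
for lam in [1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]:
    w = ridge_fit(Phi, y_train, lam)
    print(f"lambda = {lam:g}: ||w|| = {np.linalg.norm(w):.3f}")
```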





Code:
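A minimal, runnable sketch of the experiment described above (noisy data, least-squares polynomial fits of degrees 0 to 9, train and test RMSE), assuming numpy; all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, rng):
    # Noisy samples from an underlying sine curve
    x = np.sort(rng.uniform(0, 1, n))
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, y

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

x_train, y_train = make_data(10, rng)
x_test, y_test = make_data(100, rng)

train_err, test_err = [], []
for degree in range(10):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    train_err.append(rmse(y_train, np.polyval(coeffs, x_train)))
    test_err.append(rmse(y_test, np.polyval(coeffs, x_test)))

# Training RMSE falls toward zero as the degree rises,
# while test RMSE eventually starts to increase (overfitting)
```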


Challenges faced:


Handling the variation in the randomly generated data when different methods were applied was difficult, as the values shoot up to very large magnitudes almost immediately. Normalization helped a lot in handling this in a better manner.
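A sketch of the normalization step mentioned here, assuming numpy and standardization to zero mean and unit variance:

```python
import numpy as np

def standardize(X):
    # Rescale each feature column to zero mean and unit variance,
    # keeping high-degree polynomial terms from blowing up numerically
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # guard against constant columns
    return (X - mean) / std

# Degree-9 polynomial features on a wide input range
X = np.vander(np.linspace(0, 10, 50), 10, increasing=True)
X_scaled = standardize(X[:, 1:])  # skip the constant bias column
```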



References:

https://tcoil.info/data-interpolation-in-python-and-scipy/
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
https://towardsdatascience.com/polynomial-regression-bbe8b9d97491
https://www.analyticsvidhya.com/blog/2020/03/polynomial-regression-python/
https://www.geeksforgeeks.org/python-program-to-print-the-dictionary-in-table-format/
https://en.wikipedia.org/wiki/Overfitting
https://towardsdatascience.com/regularization-an-important-concept-in-machine-learning-5891628907ea
https://towardsdatascience.com/what-are-overfitting-and-underfitting-in-machine-learning-a96b30864690


Contribution: 


The pattern of the root mean square error with respect to the degree of the regression model can be an important factor for detecting overfitting in a model, and hence for avoiding it even when the curve fits the plotted data points perfectly. In our case, the root mean square error increased drastically after degree 8, so a model of degree 8 would be the best fit here.





