Study - Concept of overfitting using higher order linear regression
Goal:
To study the concept of overfitting by varying the degree of a higher-order linear regression model.
Basic Definitions:
A model's performance can suffer from two common deficiencies: underfitting and overfitting.
Underfitting: It occurs when the model's capacity is too low to capture the structure in the available data. It typically shows a large training error as well as a large test error.
Overfitting: It occurs when the model's capacity is too high relative to the amount of available data, so the model starts fitting the noise in the training set. It typically shows a small training error but a large test error.
At first glance it seems hard to believe that a curve covering all the data points is undesirable, yet that is exactly what overfitting looks like. What matters is how well the model generalizes to unseen data, not how closely it fits the training points. So what is the correct way to go? We need to find a state that lies in between overfitting and underfitting. Let's analyze it further in this blog.
Regularization: It is a technique that modifies the learning algorithm so that it generalizes better. It tunes the fitted function by adding an extra penalty term to the error function, which discourages overly large coefficients. This improves the model's performance on test data and helps avoid overfitting.
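As a minimal sketch of the penalty idea, the snippet below fits the same degree-9 polynomial features with and without an L2 (ridge) penalty and compares coefficient magnitudes. The synthetic sine data and the alpha value are illustrative assumptions, not part of the original experiment:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Degree-9 polynomial features, as in the highest-order model above.
X = PolynomialFeatures(degree=9).fit_transform(x.reshape(-1, 1))

plain = LinearRegression().fit(X, y)          # no penalty
ridge = Ridge(alpha=1e-3).fit(X, y)           # extra L2 penalty on coefficients

# The penalty shrinks the wildly large coefficients of the unregularized fit.
print("max |coef| without penalty:", np.abs(plain.coef_).max())
print("max |coef| with penalty:   ", np.abs(ridge.coef_).max())
```

The penalty makes the fitted curve smoother, which is exactly the generalization behavior described above.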
Experimental observations and analysis:
The following graphs were generated for degrees 0 to 9 of the higher-order linear regression model:
For degree 0:
Analyzing how the fitted curves interpolate the data, the lower-degree curves do not pass through the data points, while the higher-degree curves cover more and more of them. Looking at the plots alone, the 9th-order curve seems to be the best fit, but this is actually the case where overfitting occurs. The graph of test and train error helped in identifying the overfitting in this situation.
The following graph was obtained using the test and train data from the same model:
Code:
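The original code listing is not reproduced here, so below is a minimal sketch of the experiment described above: fitting polynomials of degree 0 to 9 and comparing train and test RMSE. The synthetic noisy-sine dataset and the train/test split sizes are assumptions for illustration, not the author's original data:

```python
import numpy as np

rng = np.random.RandomState(1)

# Synthetic data: a noisy sine curve, split into train and test sets.
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.25, size=x.shape)
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def rmse(pred, actual):
    """Root mean square error between predictions and targets."""
    return np.sqrt(np.mean((pred - actual) ** 2))

# Fit each degree and report train vs. test error.
for degree in range(10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = rmse(np.polyval(coeffs, x_train), y_train)
    test_err = rmse(np.polyval(coeffs, x_test), y_test)
    print(f"degree {degree}: train RMSE {train_err:.3f}, test RMSE {test_err:.3f}")
```

Training error can only decrease as the degree grows, while test error eventually rises again, which is the overfitting signature discussed above.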
Challenges faced:
References:
https://tcoil.info/data-interpolation-in-python-and-scipy/
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
https://towardsdatascience.com/polynomial-regression-bbe8b9d97491
https://www.analyticsvidhya.com/blog/2020/03/polynomial-regression-python/
https://www.geeksforgeeks.org/python-program-to-print-the-dictionary-in-table-format/
https://en.wikipedia.org/wiki/Overfitting
https://towardsdatascience.com/regularization-an-important-concept-in-machine-learning-5891628907ea
https://towardsdatascience.com/what-are-overfitting-and-underfitting-in-machine-learning-a96b30864690
Contribution:
The pattern of the root mean square error with respect to the degree of the regression model can be one of the important factors for detecting overfitting in a model, and hence for avoiding it even when the model fits the plotted data points perfectly. In our case, the root mean square error increased drastically after degree 8. Therefore, in this case a model of degree 8 would be the best fit.