Polynomial regression: the absence of a perfect model to get a job
This is my first submission to #datacatedweekly and possibly my first article. I am currently on a job hunt in analytics, and in this article I will frame that hunt as a polynomial regression problem. Gone are the times when you were considered for an interview based on a single factor. Today, numerous factors play a significant role in landing an interview.
Consider all the events from my personal and professional life as data points that can be plotted against time. I am interested in a model that covers the significant aspects of my life and lands me a job in the data domain. Now, what are the points that this model needs to cover?
- Would just my name, qualifications and interest in the role do? Sounds too simple, too linear. If I do get hired with just that, the model's results are biased.
- I apply with a resume, a quadratic polynomial: it presents my academia, professional experience, certifications and volunteering work. But there is more to me. This curve is under-fitting. What about all the hours I put into online courses, hackathons, networking and interacting with the community?
- Ah, LinkedIn. My polynomial of kth order. I have all my skills, projects and certifications on display, a rich community that I follow, people I worked with for recommendations, and also jobs that I can apply to. This is a good model for a recruiter to notice me. But is it enough?
Maybe it's not, but let's say a model of nth order (n > k > 2) houses my life's entire fingerprint and presents it to the recruiter: yes, even my embarrassing dance moves, epic cooking disasters, falls from my bike, awkward dating experiences and so on. Oh no! Too much information, a lot of noise. The model is too complex. This is an example of over-fitting.
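This underfit-to-overfit progression is easy to see in code. Below is a minimal sketch in NumPy, where a noisy sine curve stands in for my life's data (a made-up dataset, purely for illustration): training error keeps shrinking as the polynomial degree grows, even though the high-degree fit is mostly memorizing noise.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Toy "life" data: a smooth underlying trend plus noise, sampled over time.
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

def train_rmse(degree):
    # Least-squares polynomial fit; error measured on the same training points.
    p = Polynomial.fit(x, y, degree)
    return np.sqrt(np.mean((p(x) - y) ** 2))

# The "name only", "resume" and "everything including the noise" models.
for d in (1, 2, 9):
    print(f"degree {d}: training RMSE = {train_rmse(d):.3f}")
```

The training error alone is misleading: it can only go down as the degree increases, which is exactly why a model that "passes through everything" looks deceptively good.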
To prevent over-fitting, we can add more training samples so that the algorithm doesn't learn the noise in the system and generalizes better.
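To see why more data helps, here is a small hypothetical experiment: the same high-order polynomial is fit first on a handful of samples and then on many, and its error is measured against the noise-free underlying curve. With few samples the flexible model chases the noise; with many, the noise averages out.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def true_curve_rmse(n_train, degree=9):
    # Fit on n_train noisy samples, then score against the clean underlying
    # trend on a dense grid (possible only because this data is synthetic).
    x = rng.uniform(0, 1, n_train)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=n_train)
    p = Polynomial.fit(x, y, degree)
    x_test = np.linspace(0, 1, 200)
    return np.sqrt(np.mean((p(x_test) - np.sin(2 * np.pi * x_test)) ** 2))

small = true_curve_rmse(15)    # degree-9 fit on 15 points: over-fits badly
large = true_curve_rmse(500)   # same degree on 500 points: noise averages out
print(f"error with  15 samples: {small:.3f}")
print(f"error with 500 samples: {large:.3f}")
```

The model's capacity hasn't changed at all; only the amount of data has, yet the error against the true curve drops sharply.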
How do I choose a best-fit model? Let me understand the bias vs. variance trade-off.
Wikipedia states, “… bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (under-fitting).”
Intuitively, high bias means the algorithm is more likely to make wrong assumptions: it tends to learn the wrong things by not considering all the information in the data (under-fitting, just like my resume). A resume, like linear regression, is easy to understand but not flexible enough to learn the underlying competency in my data.
Wikipedia states, “… variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (over-fitting).”
Variance refers to an algorithm's sensitivity to specific noise in the data, just like all the irrelevant information that will not help me be a star candidate in the eyes of a recruiter. High variance means the model passes through most of my life's data points and results in over-fitting.
Neither high bias nor high variance (under-fitting and over-fitting, as observed above) makes a good model. Ideally, a machine learning model should have both low bias and low variance, but in practice reducing one tends to increase the other. Hence, to achieve a good model that captures my present strengths as well as unseen competency for the job, a trade-off is made.
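The trade-off can be made concrete with a held-out validation set (again on synthetic data, as a sketch): training error keeps falling as the degree grows, while validation error falls and then rises again, and the degree that minimizes validation error is the compromise between bias and variance.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)

# Noisy samples of an underlying trend, split into train and validation sets.
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.25, size=x.shape)
x_tr, y_tr = x[:40], y[:40]
x_va, y_va = x[40:], y[40:]

def rmse(p, xs, ys):
    return np.sqrt(np.mean((p(xs) - ys) ** 2))

# (train RMSE, validation RMSE) for each candidate degree.
errors = {}
for degree in range(1, 13):
    p = Polynomial.fit(x_tr, y_tr, degree)
    errors[degree] = (rmse(p, x_tr, y_tr), rmse(p, x_va, y_va))

best = min(errors, key=lambda d: errors[d][1])
print(f"best degree by validation error: {best}")
```

The linear model (high bias) and the degree-12 model (high variance) both lose to something in between; that middle degree is the trade-off in action.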
By: Sanathkumar Idurkar