Let the data speak!!

January 8, 2019 DATAcated Challenge

“Nearly half a century after computers entered the mainstream, the data has begun to accumulate to the point where we can blend and use it!”

Almost 5 exabytes of data are created every two days. Now that computing has been made easy, learn how to use that data!

Hypothesis: First and foremost, identify the purpose of using the data.

This purpose sets your direction; let's call it the hypothesis.

Outliers: There is too much data available, and not all of it will be useful to you every time. You have to clean up the data based on your need. You don't pack suits for a beach party! You also have to remove outliers: data points that sit far from the rest and will give you skewed results.
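One common way to spot outliers is the interquartile range (IQR) rule. The sketch below is only an illustration: the function name `remove_outliers_iqr`, the multiplier `k=1.5` and the sample `spend` values are assumptions, not something prescribed here.

```python
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical daily spend figures with one extreme value.
spend = [12, 14, 15, 13, 16, 14, 15, 250]
cleaned = remove_outliers_iqr(spend)
print(cleaned)
```

Whether a flagged point is a true error or a rare but real event is still a judgment call, so it is worth looking at what gets removed.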

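Variable selection: Identify the variables to be used. You should understand and prevent a problem called multicollinearity, where two or more predictor variables are strongly correlated with each other, so their individual effects become hard to separate.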
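A simple first check for multicollinearity is to look for pairs of variables that are almost perfectly correlated. This is a rough sketch: the `0.9` threshold, the helper names and the toy columns are all assumptions, and pairwise correlation will not catch every case (a variance inflation factor check is more thorough).

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def collinear_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical data: the same height recorded in two units, plus age.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 21, 45, 29, 38],
}
flagged = collinear_pairs(columns)
print(flagged)
```

Here the two height columns carry the same information, so keeping just one of them would be enough.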

Not only that, treat the missing values too! You can drop incomplete records, or fill the gaps with a typical value such as the median.
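One simple missing value treatment is to fill the gaps with the median of the values you do have. The sketch below assumes missing entries are stored as `None`; the function name and the sample ages are made up for illustration.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical ages with two missing entries.
ages = [25, None, 31, 40, None, 28]
filled = impute_median(ages)
print(filled)
```

The median is less sensitive to extreme values than the mean, which is why it is often a reasonable default for this step.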

Analyse: When you have all the data ready, split it into train, test and validation datasets (for example 70%, 15% and 15% respectively, so that the shares add up to 100%). Understanding the problems of over-fitting and under-fitting is very important at this stage.
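A split like this can be sketched in a few lines. The example assumes shares of 70% train, 15% validation and 15% test, chosen so that they sum to 100%; the function name, the seed and the stand-in rows are illustrative.

```python
import random

def split_dataset(rows, train=0.70, validation=0.15, seed=42):
    """Shuffle rows, then cut them into train, validation and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * validation)
    train_part = rows[:n_train]
    val_part = rows[n_train:n_train + n_val]
    test_part = rows[n_train + n_val:]  # whatever is left over
    return train_part, val_part, test_part

# Hypothetical dataset of 100 records, represented by their ids.
train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

A model that scores far better on the train part than on the validation part is probably over-fitting; one that scores poorly on both is probably under-fitting.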

Model (algorithm) selection:

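It’s not so simple!! When there are many algorithms, you can’t just use the one you are comfortable with. Model selection doesn’t have one basic rule. As a beginner, one has to understand the importance of the p-value: it tells you whether an effect the model finds, such as a variable's coefficient, is statistically significant (conventionally p < 0.05). Try running multiple models on your data, keep the ones whose variables are significant, and among those choose the model that performs best on the validation data.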
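To make the p-value concrete: it is the probability of seeing a relationship at least as strong as the one observed if there were really no relationship at all. The sketch below estimates it with a permutation test, which is a swapped-in technique chosen because it needs only the standard library. The function names and the `ad_spend` and `sales` figures are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of random shufflings of y that correlate with x at least
    as strongly as the real pairing does (two-sided)."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical data: sales rise with ad spend, plus some noise.
ad_spend = list(range(1, 21))
sales = [2 * v + (1.5 if v % 2 else -1.5) for v in ad_spend]
p = permutation_p_value(ad_spend, sales)
print(p < 0.05)
```

A small p-value says the relationship is unlikely to be chance; it does not say how accurate the model's predictions are, so it is best read alongside a score on held-out data.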

Monitoring: When the market is so dynamic, you cannot rely on older data. Keep a continuous check on the model, keep running multiple models and try to improve the accuracy.
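A continuous check can be as simple as comparing accuracy on recent data with the accuracy measured when the model was built. This is a minimal sketch: the `tolerance` of 0.05, the baseline of 0.90 and the sample predictions are all assumed values.

```python
def accuracy(predictions, actuals):
    """Fraction of predictions that match the actual outcomes."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)

def needs_retraining(baseline_accuracy, predictions, actuals, tolerance=0.05):
    """Flag the model when recent accuracy falls more than
    `tolerance` below the baseline."""
    recent = accuracy(predictions, actuals)
    return recent < baseline_accuracy - tolerance, recent

# Hypothetical figures: accuracy at launch vs. last week's outcomes.
baseline = 0.90
recent_preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
recent_actuals = [1, 0, 0, 1, 1, 1, 0, 1, 1, 0]
retrain, recent = needs_retraining(baseline, recent_preds, recent_actuals)
print(retrain, recent)
```

When the flag is raised, that is the cue to refresh the training data and rerun the candidate models.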

Keep up with the market!! There is always something new to learn!

By: Gowtham Prapullakumar


