K-means Clustering? Think first day of high school.

It’s your first day at a new high school. You enter the building and you’re excited, nervous and, more importantly, alone. You’re just one of many students who are feeling exactly the same.

A couple of days in, you start having your first mini side talks here and there. You test the waters with the person next to you in Chemistry, but there’s no way you’re hitting it off with him. You enjoy a good laugh with the girl behind you in English, so you think “Awesome; got a lunch buddy for today!” You go to lunch and realize “Mmmm, not really my jam.” And so on and so forth, until, three weeks in, you find yourself hitting it off with an awesome group of computer geeks and you feel like you’re finally fitting in.

So what about the k-means clustering thing I wanted to talk about? Well, you know the feeling I just described? K-means is an algorithm that takes a number of sample data points (like yourself and all the other new kids in high school) and tries to find a central point, called a centroid, for each group of data samples that have similar features (students with similar tastes and hobbies). Once all data samples have been assigned to their groups, any new data point can be labeled directly with the name of the group it fits best, by measuring its distance to each group’s central point and picking the closest one.

You enter high school new, unlabeled, a loner sample amidst the groups. After some tests (looking for groups with similar likes/dislikes), you find yourself fitting in, officially labeled as part of the team you hit it off with the most! P.S. What’s the “k”? Think of the Nerd. Geek. Princess. Loser. Brain. Prude. Popular. (Not that I’m a fan of this ‘system’.) The “k” is the number of groups into which the data samples will be divided/clustered.
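For the curious, here is roughly what that looks like in code. This is only a minimal sketch in plain Python, not a production implementation: the students, their two made-up features (hours of coding and hours of sports per week), the function names `nearest` and `k_means`, and the random starting centers are all assumptions chosen for illustration.

```python
import math
import random


def nearest(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def k_means(points, k, iterations=100, seed=0):
    """Alternate between assigning points to centers and moving the centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k random students
    for _ in range(iterations):
        # Assignment step: every student joins the group with the nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: each center moves to the average of its group
        # (an empty group keeps its old center).
        new_centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:  # nothing moved, so the groups are stable
            break
        centers = new_centers
    return centers


# Hypothetical students: (hours coding per week, hours of sports per week)
students = [(9, 1), (8, 2), (10, 1), (1, 9), (2, 8), (1, 10)]
centers = k_means(students, k=2)

# A new kid walks in and gets the label of the closest group center.
new_kid = (9, 2)
label = nearest(new_kid, centers)
print("group centers:", centers)
print("new kid joins group", label)
```

The group numbers themselves carry no meaning; which group ends up as 0 or 1 depends on the random start. What matters is that the new kid lands in the same group as the students most similar to them.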

This is k-means in a simple form. Disclaimer: I in no way support the labeling of humans, and I do not support bullying or name-calling. This is just an experience I think many of us can relate to, and one that can help explain a tech-y data science algorithm to anyone.

By: Reem Mahmoud