Category: Data Science

Autopsy: Hierarchical Clustering

What do normal people do when they want to rank a bunch of cities according to some features? Import the data on an Excel sheet, calculate a composite score across the features, and sort according to the composite score. What do crazy people do?

Why, a clustering, of course.

I had about 20 features of 100 cities, and I wanted to put them into groups according to similarities across the 20 features. I loaded the data into R and did a hierarchical clustering analysis. It works like this. The algorithm loops through all the observations, defines some sort of dissimilarity measure between each pair of observations, and spits out a tree diagram with a fancy name: a dendrogram!


Looks pretty intimidating. I felt very sophisticated. But I wanted the names of the cities at the end of the tree, and R just wouldn’t listen. What to do? After a few hours:


Hierarchical clustering is a type of “unsupervised learning,” which is what you do when you have no idea what kind of pattern or grouping you are looking for. It’s also a “bottom-up” agglomerative clustering. It starts by comparing individual observations, and works its way up until every observation is covered.

In terms of a tree diagram, it starts from the “leaves” level and works its way up the branches, combining branches up to the trunk of the “tree.” The algorithm usually measures dissimilarity in terms of “Euclidean distance,” which is a fancy way of measuring the distance between two sets of points in an abstract space.

OK. So I found that New York is really similar to Los Angeles and Chicago. All this work was basically useless, but it was kind of fun.

Python Random Forest Classification

Yesterday, I learned to use random forest classification in Python at a workshop hosted by NYC Women in Machine Learning & Data Science, facilitated by data scientists from OnDeck.

Here is the solution file from the instructor.

Justin Law, one of the data scientists from OnDeck, said that this analysis is a very simple one compared to the ones he deals with at work on a daily basis. Still, it took more than three hours just to read through and understand.

Machine learning is a wonderful black box you can do a lot of damage with, even if you have no idea why things are happening in there.

Here’s another render of random forest classification. Once you’ve figured out the steps and the motions, hopefully you’ll understand some of the theories down the road.

© 2017

Theme by Anders NorenUp ↑