What do normal people do when they want to rank a bunch of cities according to some features? Import the data on an Excel sheet, calculate a composite score across the features, and sort according to the composite score. What do crazy people do?

Why, a clustering, of course.

I had about 20 features of 100 cities, and I wanted to put them into groups according to similarities across the 20 features. I loaded the data into R and did a hierarchical clustering analysis. It works like this. The algorithm loops through all the observations, defines some sort of dissimilarity measure between each pair of observations, and spits out a tree diagram with a fancy name: a dendrogram!


Looks pretty intimidating. I felt very sophisticated. But I wanted the names of the cities at the end of the tree, and R just wouldn’t listen. What to do? After a few hours:


Hierarchical clustering is a type of “unsupervised learning,” which is what you do when you have no idea what kind of pattern or grouping you are looking for. It’s also a “bottom-up” agglomerative clustering. It starts by comparing individual observations, and works its way up until every observation is covered.

In terms of a tree diagram, it starts from the “leaves” level and works its way up the branches, combining branches up to the trunk of the “tree.” The algorithm usually measures dissimilarity in terms of “Euclidean distance,” which is a fancy way of measuring the distance between two sets of points in an abstract space.

OK. So I found that New York is really similar to Los Angeles and Chicago. All this work was basically useless, but it was kind of fun.