Month: June 2016

Autopsy: Hierarchical Clustering

What do normal people do when they want to rank a bunch of cities according to some features? Import the data into an Excel sheet, calculate a composite score across the features, and sort according to the composite score. What do crazy people do?

Why, a clustering, of course.

I had about 20 features of 100 cities, and I wanted to put them into groups according to similarities across the 20 features. I loaded the data into R and did a hierarchical clustering analysis. It works like this. The algorithm computes a dissimilarity measure between each pair of observations, successively merges the most similar ones, and spits out a tree diagram with a fancy name: a dendrogram!


Looks pretty intimidating. I felt very sophisticated. But I wanted the names of the cities at the end of the tree, and R just wouldn’t listen. What to do? After a few hours:


Hierarchical clustering is a type of “unsupervised learning,” which is what you do when you have no idea what kind of pattern or grouping you are looking for. It’s also a “bottom-up” agglomerative clustering. It starts by comparing individual observations, and works its way up, merging groups until every observation belongs to a single cluster.

In terms of a tree diagram, it starts from the “leaves” level and works its way up the branches, combining branches up to the trunk of the “tree.” The algorithm usually measures dissimilarity in terms of “Euclidean distance,” which is a fancy way of measuring the distance between two points in an abstract feature space.
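The original analysis was done in R, but the same pipeline looks like this in Python with SciPy. The city names and features below are made-up stand-ins (the real analysis had ~100 cities and 20 features); the `labels` argument is what puts the city names at the leaves of the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical stand-in data: 5 cities x 3 features
cities = ["New York", "Los Angeles", "Chicago", "Boston", "Houston"]
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Bottom-up (agglomerative) clustering on pairwise Euclidean distances
Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree into 2 flat clusters
clusters = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(cities, clusters)))

# `labels=cities` attaches the city names to the leaves of the tree
# dendrogram(Z, labels=cities)  # uncomment with matplotlib to plot
```

The `linkage` matrix `Z` records each merge the algorithm performs, and `fcluster` cuts the tree at whatever level gives the number of groups you want.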

OK. So I found that New York is really similar to Los Angeles and Chicago. All this work was basically useless, but it was kind of fun.

Louder, Faster, Harder, Stronger: The Loudness War

If you’re in the music industry or if you nerd out on music, you’ve probably heard of the Loudness War.

The concept is fairly intuitive: the louder the song, the more likely people will hear it, especially when they aren’t paying active attention. Being loud helps a song get noticed when it’s on the radio during your cab ride or at the grocery store. If your song is louder than your rival’s, it gets the listener’s attention, and people are more likely to remember it the next time they are streaming or buying music.

The tradeoff is that when you make a song louder, you sacrifice sound quality. In the process of boosting the overall loudness of a song, all the sounds are made louder, including the softer ones. The contrast between loud and soft sounds is lost, along with the nuanced interactions between different instruments and the vocals. Basically, it sounds crappier and more boring.

Why do record companies commit this atrocity? Because it sells. Music analytics company Next Big Sound analyzed the audio features of 32,310 musical tracks by 751 artists, and found that loudness is positively correlated with sales, other factors held constant.


In this contour plot, the x-axis represents loudness, and the y-axis represents sales. The weird multi-layer shape in the plot is the “contour,” which tells you where the various tracks stand in their loudness-sales relationships. The color represents the density: the blue bit indicates that a lot of tracks are clustered in that area on the graph.

Most of the songs in the sample tend to be on the loud side, as the center of the blue bit sits near the right end of the x-axis. Most songs really don’t sell that much, so a large cluster of the tracks sits low on the y-axis. The tracks that turn out to be mega hits are at the upper right corner of the plot, and they are all on the loud side.

What does this tell us? Loudness is correlated with higher sales. The relationship isn’t linear, though; it looks closer to log-linear (exponential).
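“Log-linear” here just means sales grow roughly exponentially with loudness, so a straight line fits log(sales) against loudness. A minimal sketch on entirely synthetic numbers (not the Next Big Sound data) shows the fitting step:

```python
import numpy as np

# Synthetic example: sales grow exponentially with loudness (in dB)
loudness = np.linspace(-20, -5, 50)             # made-up loudness range, dB
true_sales = np.exp(0.3 * loudness + 10)        # assumed exponential relationship
rng = np.random.default_rng(1)
sales = true_sales * rng.lognormal(0, 0.1, 50)  # multiplicative noise

# Fitting a straight line to log(sales) recovers the exponential's slope
slope, intercept = np.polyfit(loudness, np.log(sales), 1)
print(slope)
```

If the fitted line on the log scale is a good fit, the raw relationship is exponential, which matches the shape of the contour plot.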

As a consumer, I’ve adapted. I’ll just listen to the shitty loud songs at the club. At home, I’ll put on some good old classical on a curated sound system.

U.S. Mass Shooting Fatalities Since 2014


And therefore never send to know for whom the bell tolls;
It tolls for thee.

– John Donne, No Man Is An Island

This chart depicts the number of fatalities in mass shootings in the U.S. from 2014 to 2016. You can see clearly that the toll from the latest shooting in Orlando, Florida far exceeds even the combined toll of several past shootings.

Frank Bruni sums it up pretty well. This isn’t an attack on a minority subset of a population, but an attack on the “bedrock” of our society: the very idea of democracy, acceptance, and diversity.

How many of these incidents are still going to happen before something is done? Sadly, maybe quite a few. “And to actively do nothing is a decision as well.”

You can’t really say “I’m glad this didn’t/doesn’t happen where I live.” Just because it didn’t, doesn’t mean it couldn’t.

First they came for the Socialists, and I did not speak out—
Because I was not a Socialist.

Then they came for the Trade Unionists, and I did not speak out—
Because I was not a Trade Unionist.

Then they came for the Jews, and I did not speak out—
Because I was not a Jew.

Then they came for me—and there was no one left to speak for me.
– Martin Niemöller

This is a work in progress, and the code is on GitHub. The data source is Gun Violence Archive.

Python Random Forest Classification

Yesterday, I learned to use random forest classification in Python at a workshop hosted by NYC Women in Machine Learning & Data Science, facilitated by data scientists from OnDeck.

Here is the solution file from the instructor.

Justin Law, one of the data scientists from OnDeck, said that this analysis is a very simple one compared to the ones he deals with at work on a daily basis. Still, it took more than three hours just to read through and understand.

Machine learning is a wonderful black box you can do a lot of damage with, even if you have no idea why things are happening in there.

Here’s another walkthrough of random forest classification. Once you’ve figured out the steps and the motions, hopefully you’ll understand some of the theory down the road.
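For reference, a minimal random forest classification in Python looks like the following. This uses scikit-learn’s built-in iris dataset, not the workshop’s data, just to show the motions end to end:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy example on the built-in iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An ensemble of 100 decision trees, each fit on a bootstrap sample
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))  # accuracy on held-out data
```

The whole “black box” is three calls: construct, fit, score. Understanding why the forest of randomized trees beats a single tree is the part that takes the three hours.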

Python Prank

Yesterday, I was chatting on Slack with fellow RC members about object-oriented programming and the Python language, when Paul Gowder brought up a prank he had written. It’s supposed to create a security hole and suppress errors, so it becomes impossible to find bugs.

I started analyzing it line by line, with the help of Paul, Leo Torres and Sean Martin.

Line 3: class foo(str) declares foo to be a subclass of str, which means foo will do everything str does, plus anything else you add to it.

Line 4: Here, we add the __call__ functionality to class foo(str).

Line 6: We are calling exec on self, which interprets the string as executable python code.

Line 7: The except Exception part would suppress any errors we might get.

Line 10: Adding str = foo replaces the standard implementation of str with our new foo. This line is important for the code to be malware, because everything created with str() will actually be a foo(), which means that now your strings created with str() are callable. If they’re called, they’re executed as Python code.

Line 18: If we do evil(), we end up running exec 'print "EVIL"', which is interpreted as the python code print "EVIL", which then just prints EVIL.

In a nutshell, anything that gets converted to a string with str() is turned into a function that you can call. One little typo, entering the name of a string variable rather than a function, and you’ve just executed whatever random code is contained in the string.

The stack trace you get won’t give you any obvious indication that what you called was a string. It’ll just throw errors related to whatever it is that you put in the string. Or, if there’s something that will actually run in the string, then it’ll just execute, and God only knows what happens then.

If our application is reading a value for username = input(), and the user inputs not a username but malicious code, this code ends up being run. Also, the except Exception part would suppress any errors we get from calling nonsense that way.
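Putting the pieces together: the original snippet isn’t reproduced here, but a reconstruction based on the description above might look like this. The line numbers won’t match the original, and this version uses Python 3’s `exec()` function rather than Python 2’s `exec` statement:

```python
class foo(str):
    def __call__(self):
        try:
            exec(self)   # run the string's own contents as Python code
        except Exception:
            pass         # swallow any error, leaving no trace

str = foo  # shadow the built-in: every str(...) now returns a callable foo

evil = str('print("EVIL")')
evil()  # the "string" executes as code and prints EVIL
```

After the `str = foo` line, any string built with `str()` is secretly a function, and one stray pair of parentheses is enough to execute it.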

It kind of felt like music composition.
