Category: Visualization

Data Visualization Workflow

Extracted from an OpenVis presentation by Mike Bostock.

1. Prototypes should emphasize speed over polish.
Identify the intent of the prototype.
What hypothesis are you testing?
It needn’t look good, or even have labels.
Make just enough to evaluate the idea.
Then decide whether to go straight or turn.

2. Transition from exploring to refining near deadline.
Choose tools that facilitate transition to final product.

3. Clean as you go.
Be ruthless about deleting code.
You are a chef and the git repo is your kitchen.
You try twenty recipes before deciding on one.
Do you really want to clean up all the mess at the end?

4. Make your process reproducible.
Make a build-system that provides machine-readable documentation.
Accelerate the use of parts from previous projects.

5. Try bad ideas deliberately.
You can’t evaluate a visualization absent the data.
Don’t get too attached to your current favorite.
Don’t get stuck at local maximum; go down to go up.

Autopsy: Hierarchical Clustering

What do normal people do when they want to rank a bunch of cities according to some features? Import the data on an Excel sheet, calculate a composite score across the features, and sort according to the composite score. What do crazy people do?

Why, a clustering, of course.

I had about 20 features of 100 cities, and I wanted to put them into groups according to similarities across the 20 features. I loaded the data into R and did a hierarchical clustering analysis. It works like this. The algorithm loops through all the observations, defines some sort of dissimilarity measure between each pair of observations, and spits out a tree diagram with a fancy name: a dendrogram!

Rplot01

Looks pretty intimidating. I felt very sophisticated. But I wanted the names of the cities at the end of the tree, and R just wouldn’t listen. What to do? After a few hours:

cluster

Hierarchical clustering is a type of “unsupervised learning,” which is what you do when you have no idea what kind of pattern or grouping you are looking for. It’s also a “bottom-up” agglomerative clustering. It starts by comparing individual observations, and works its way up until every observation is covered.

In terms of a tree diagram, it starts from the “leaves” level and works its way up the branches, combining branches up to the trunk of the “tree.” The algorithm usually measures dissimilarity in terms of “Euclidean distance,” which is a fancy way of measuring the distance between two sets of points in an abstract space.

OK. So I found that New York is really similar to Los Angeles and Chicago. All this work was basically useless, but it was kind of fun.

U.S. Mass Shooting Fatalities Since 2014

shootings

And therefore never send to know for whom the bell tolls;
It tolls for thee.

– John Donne, No Man Is An Island

This chart depicts the number of fatalities in mass shootings in the U.S. from 2014 to 2016. You can see clearly that the toll from the latest shooting in Orlando, Florida far outnumbers even the sum of a number of past shootings.

Frank Bruni sums it up pretty well. This isn’t an attack on a minority subset of a population, but an attack on the “bedrock” of our society: the very idea of democracy, acceptance, and diversity.

How many of these incidents are still going to happen before something is done? Sadly, maybe quite a few. “And to actively do nothing is a decision as well.”

You can’t really say “I’m glad this didn’t/doesn’t happen where I live.” Just because it didn’t, doesn’t mean it couldn’t.

First they came for the Socialists, and I did not speak out—
Because I was not a Socialist.

Then they came for the Trade Unionists, and I did not speak out—
Because I was not a Trade Unionist.

Then they came for the Jews, and I did not speak out—
Because I was not a Jew.

Then they came for me—and there was no one left to speak for me.
– Martin Niemöller

This is a work in progress, and the code is on GitHub. The data source is Gun Violence Archive.

© 2017

Theme by Anders NorenUp ↑