Getting comfortable with the Unix pipe
27 Sep 2016Files ain’t going away anytime soon, son.
If you’re a developer, engineer, technical person who’s expected to know something about coding, you’re gonna be working with files. You might as well learn how to get glean some information from it without busting open excel, lest you want to be mistaken for a business person.
Pipe it like it’s hot
The Unix pipe is the weapon I use to get a quick picture of what’s going on in a file, whether it’s a large data file [if you’re into Big Data, you can leave] that I need to read in, or it’s some output that my system has created. The weapon is really plumbing between different unix commands that read from stdin and output to stdout. Unfortunately you can’t hurt anyone with this pipe.
Examples are good, I hear
Download this impatient_data.csv that I got from data.gov if you want to follow along. We’re gonna do some cool stuff with it. This is basically data that describes each patient’s diagnosis, along with the hospital discharge data – most importantly, the location and the money.
Here are the columns of the csv file, brought to you by head
:
Let’s see how many lines are in the file.
Ha! 163,000 lines! Good luck doing anything in Excel.
Let’s say I want to know how many of these charges there are in Jersey, my home state, the best state. And then I want to know what are the unique diagnoses DRG Definition) in Jersey.
Whoa ho ho there! Lemme break that down
grep ',NJ,' impatient_data.csv
I used grep
to filter for the charges in New Jersey. In fact, I padded ‘NJ’ with commas in my command to filter out any extraneous ‘NJ’s that showed up in other columns. There’s about 5000 data points from the Dirty Jersey.
cut -d',' -f 1
I used cut
to cut out a single column or field from the output of the grep
command. Specifically, the first field of the output, which is the diagnosis description.
sort | uniq
Well, I wanted a list of all the unique charges that happened so I’m gonna use uniq
obviously. But first you have to sort.
So together that becomes:
grep ',NJ,' impatient_data.csv | cut -d',' -f 1 | sort | uniq
This gives you all the unique Diagnosis-related groups (DRG).
Pro tip: use less
to verify the output of your piped commands
What if we want to figure out the top 10 most frequently occuring Diagnoses in Joisey? Well, that’s easy.
Interesting… Acute myocardial infarctions happen a LOT. I’ve never heard of them, but a quick Google search shows that they’re in fact Heart Attacks. What’s more surprising are all these nutrition disorders. Hmph. I should really eat my vegetables. Let’s move on to a extremely personal question of mine..
Should I retire in Florida or not?
As a 25 year old man, this question begs to be answered. Let’s find the average Total costs of hospital charges in New York and Florida.
Well, I save an average of $1800 on Heart Failure related conditions, but then again, I’ll be in Florida. That’s a tough one.
I hope you learned as much about Unix piping as I did from this Health data set. Thanks Obama.