4 - Top Secret Public Data
Some exciting news: I have been taken on for my first data science project… in The Real World. And I couldn’t have asked for a more exciting project.
A few weeks ago, Jeff Asher, FiveThirtyEight.com’s crime reporter, wrote an article entitled “The First FBI Crime Report Issued Under Trump Is Missing A Ton Of Info”. In it, Jeff describes the changes between the 2015 and 2016 editions of the FBI’s Crime in the United States report, which is the main go-to for researchers wanting quantitative information on crime. Basically, there are a ton of tables missing this year.
For example, consider a researcher who wants to know what % of crimes were committed by strangers in 2015. The researcher (in this case me) could easily download the table from the website into Excel.
As you can see, I was able to immediately put in an Excel formula to calculate that a mere 20% of known crimes were committed by strangers. Time count: 30s.
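For anyone who prefers code to spreadsheets, the same arithmetic can be sketched in Python. This is only an illustration: I am assuming "known crimes" means every incident except those where the relationship is unknown, and the labels and sample list below are made up for the example, not FBI figures.

```python
from collections import Counter


def stranger_share(relationships):
    """Return the % of incidents with a known relationship that involved a stranger.

    Assumption: "known" means every incident whose label is not "unknown".
    """
    counts = Counter(relationships)
    known = sum(counts.values()) - counts["unknown"]
    if known == 0:
        raise ValueError("no incidents with a known relationship")
    return 100 * counts["stranger"] / known


# Made-up labels purely for illustration (not real data).
example = (
    ["stranger"] * 2
    + ["acquaintance"] * 4
    + ["family"] * 2
    + ["unknown"] * 2
)
share = stranger_share(example)
print(f"{share:.1f}% of known incidents involved a stranger")
```

Dropping the unknowns from the denominator is a choice, not a given. Dividing by all incidents instead would give a lower figure.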
What if a researcher wanted to know what % of crimes were committed by strangers in 2016?
Firstly, they would have to request the raw text file from the FBI - hardly like taking candy from a baby.
OK - we’ve done the hard work for you. Here is the Expanded Homicide Data file on our project GitHub page (it is public data). Here is a little taster:
That goes on for 22,486 lines.
Now they have two main options.
1). If they lack programming skills, they will have to scour the 22,486 lines of this mind-boggling text file, which my supervisor says “looks like it was created using Fortran in the 70s”, manually counting the ST (stranger) and UN (unknown) codes in the REL column.
Time count: Sorry Mom, I can’t make it home for Thanksgiving Weekend, spending it with my data.
2a). If they have programming skills, they will try to write a function to convert the data into a more modern, normalized form, so that they can then calculate the statistics.
Time count: a few days.
2b). If they have programming skills, but lack time, they will enlist a budding data scientist to do this for them.
And I’m happy to say that 2b is the option chosen by my supervisor, a former Palantir employee who is heading up the project along with Jeff.
I have been assigned the task of writing a Python program to convert the text file into a JSON file, which contains a dictionary for each crime event. The team hope to then be able to use this to recreate the tables so that crime researchers across the land can have their Thanksgiving back. They then hope to replicate this for other missing crime data besides homicides.
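To give a flavour of what that conversion involves, here is a minimal sketch of turning fixed-width text lines into a list of dictionaries and dumping them as JSON. It is not the project's actual code. The field names, the column offsets in `LAYOUT`, and the sample lines are all hypothetical; the only column the real file is known (from this post) to contain is REL, and its true position would have to come from the file's documentation.

```python
import json

# Hypothetical fixed-width layout: (field name, start offset, end offset).
# The real file's columns will differ; only REL is named in the post.
LAYOUT = [
    ("state", 0, 2),
    ("year", 2, 6),
    ("weapon", 6, 8),
    ("rel", 8, 10),
]


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-width line into a dictionary of stripped strings."""
    return {name: line[start:end].strip() for name, start, end in layout}


def convert(lines, layout=LAYOUT):
    """Turn raw lines into one dictionary per record, skipping blank lines."""
    return [parse_line(line, layout) for line in lines if line.strip()]


# Made-up sample lines that match the hypothetical layout above.
sample = ["NY201612ST", "CA201620UN", "", "TX201611HU"]

records = convert(sample)
as_json = json.dumps(records, indent=2)
print(as_json)
```

With a real file, `sample` would be replaced by the open file handle, and `as_json` would be written to disk. Keeping the layout in one table means a corrected offset only has to be changed in one place.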
UPDATE: After the media - led by Jeff & co. at 538 - diligently called out the FBI for this, the Bureau pulled its finger out and released the tables, meaning my own diligent efforts cleaning the data went unused :'(.