Journalism by Algorithm

February 24, 2014

This term, I am taking “Future of News and Participatory Media” (MAS.700), a course taught by Ethan Zuckerman in the MIT Media Lab's Center for Civic Media. For one of the class assignments, each of us is paired with another classmate and asked to find as much information about him or her online as possible, then to follow up this research with an interview.

For this assignment, I decided to envision what the full “data science” pipeline for gathering, pre-processing, analyzing, and visualizing data about a classmate would involve. Which parts of this process are easy, and which are difficult? How much could be automated and done by machines, and which steps still require humans?

A company, government agency, or perhaps even a well-funded media organization might be able to devote entire teams and many human-hours to such an effort, and to developing software to automate and scale certain steps in the process. As just one person, I did some of the steps manually and built early-stage prototypes of different pieces of the pipeline for the purposes of illustration.

Main Insights

  1. Given the scope of this assignment, the most time-efficient thing to do was to use my own human expertise to gather information about my classmate and develop insights by integrating different data sources. However, if the assignment were to gather data on 10, 100, or 1000 people, it would be worth investing in automated tools for web crawling, natural language processing, and translation (see the sketch after this list). Although these tasks are non-trivial to automate well, they are not technically impossible -- it’s worth noting, for instance, how search engines have already made this assignment much easier than it would have been two decades ago.
  2. Facebook’s Graph Search reveals a substantial amount of information about people, even if they are not connected to you as a friend or mutual friend on Facebook. In particular, the photos “liked by” a person seem highly revealing of many different kinds of preferences.
  3. While there is a lot of information available on the Internet about a person, its diffuse, unstructured nature still makes it time-consuming to compile. It seems that any person or entity that relies on personal data, therefore, would be willing to invest substantially in the technical capacity to collect and organize this information. Better yet, it seems very worthwhile to convince people to give their information willingly, in structured ways, to avoid the collection problem altogether.
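
To make the first point a bit more concrete, here is a rough sketch of the kind of automated processing that would pay off at larger scales -- in this case, pulling names, organizations, and places out of raw text. It assumes the spaCy library and its small English model are installed, and the sample text is invented; it is an illustration, not part of my actual workflow.

# Rough illustration of automated entity extraction, the kind of step worth
# automating when profiling many people rather than one.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# Invented sample text standing in for a crawled web page.
sample_text = (
    "Jane Doe studied computer science at Example University "
    "and now works at a research lab in Cambridge."
)

doc = nlp(sample_text)
for ent in doc.ents:
    # Prints entities with labels like PERSON, ORG, and GPE.
    print(ent.text, ent.label_)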

Pipeline

Data Gathering

OUTPUT: A set of pages with interesting content about my classmate.

I used several sources, including standard web search engines and Facebook’s Graph Search (discussed above).
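
As a rough sketch of how this page-gathering step could be scripted at larger scales, the snippet below fetches a list of URLs and saves their visible text to disk. It assumes the requests and beautifulsoup4 packages are installed; the URLs are placeholders rather than the actual sources I used.

# Sketch of a "fetch and save pages" step; the URLs are placeholders.
import requests
from bs4 import BeautifulSoup

urls = [
    "https://example.com/profile-page",
    "https://example.com/news-article",
]

for i, url in enumerate(urls):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # Strip markup so only the visible text is kept for later review.
    text = BeautifulSoup(response.text, "html.parser").get_text(separator="\n")
    with open("page_%02d.txt" % i, "w") as f:
        f.write(text)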

Data Structuring/Organizing

OUTPUT: A set of spreadsheets containing facts about my classmate.

I organized the data that I found into a series of spreadsheets containing biographical information, key events, relevant institutions, and group affiliations. Each “factoid” became a row in a table.
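
To give a sense of the structure (with invented values), each row looked something like the sketch below. The column names match the fields that appear in the SQL queries later in this post; everything else is made up for illustration.

# Illustrative only: one "factoid" per row, written out as a CSV file.
import csv

factoids = [
    # (fact_category, fact_label, fact_value, confidence)
    ("personal_info", "hometown", "Springfield", "high"),
    ("personal_info", "undergrad_major", "???", "low"),  # "???" marks a missing fact
    ("career", "current_employer", "Example Lab", "high"),
]

with open("bio.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["fact_category", "fact_label", "fact_value", "confidence"])
    writer.writerows(factoids)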

Data Cleaning/Pre-Processing

OUTPUT: A relational database containing the spreadsheets.

I exported the spreadsheets into comma-separated value (CSV) files and used a script to import them into a relational database. Given the small size of this project, I used SQLite, which stores the entire database as a single file. The code I used can be found in this code repository.
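
For readers curious what such a script might look like, here is a minimal sketch (not the actual code from the repository): it loads every CSV file in the current directory into a SQLite table named after the file. The database file name is hypothetical.

# Minimal CSV-to-SQLite import: one table per CSV file, columns taken from
# the header row. Fine for trusted local files; not hardened against
# arbitrary input.
import csv
import glob
import os
import sqlite3

conn = sqlite3.connect("classmate.db")  # hypothetical database file name

for path in glob.glob("*.csv"):
    table = os.path.splitext(os.path.basename(path))[0]  # e.g. bio.csv -> bio
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = ", ".join(header)
        placeholders = ", ".join("?" * len(header))
        conn.execute("CREATE TABLE IF NOT EXISTS %s (%s)" % (table, columns))
        conn.executemany(
            "INSERT INTO %s (%s) VALUES (%s)" % (table, columns, placeholders),
            reader,
        )

conn.commit()
conn.close()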

Synthesis and Reporting

OUTPUT: Some analysis of the database contents.

Making the data available in a relational database allows us to filter, aggregate, or join different sources of information. Here are a few examples (I’ve omitted the actual query results for privacy reasons):

sqlite> SELECT fact_label,fact_value FROM bio WHERE fact_category='personal_info';

This query finds all “personal_info” facts in the “bio” table. It returns information about the person’s name, background, and career.

sqlite> SELECT fact_category, fact_label, fact_value, confidence FROM bio WHERE confidence='low' OR fact_value='???';

I encoded a “confidence” field in all tables in the database to classify facts as “high” or “low” confidence. I also used “???” as a symbol for facts that were missing from the database. A low-confidence classification might mean that a reporter should spend some effort verifying it.

sqlite> SELECT COUNT(*) FROM groups WHERE type='facebook';

A basic aggregation method is counting items with a certain field value. In this case, we can determine the number of Facebook group memberships that my classmate has.

sqlite> SELECT education.institution_name, position, num_students, founded FROM education, universities WHERE education.institution_name=universities.institution_name;

This is a “join” operation in SQL: it pulls in extra information (the number of students and year of founding) about the educational institutions that my classmate attended. A query like this one would be useful if we had other sources of data (like information about groups or universities) to combine with the information we know about the person.
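
Finally, the same queries can be run from a script, which is how this synthesis step could feed into automated reporting or visualization. The sketch below uses Python’s built-in sqlite3 module; the table and column names follow the queries above, and the database file name is again hypothetical.

# Run a couple of the queries above programmatically and print a tiny summary.
import sqlite3

conn = sqlite3.connect("classmate.db")

# Facts a reporter should double-check before an interview.
to_verify = conn.execute(
    "SELECT fact_category, fact_label, fact_value FROM bio "
    "WHERE confidence='low' OR fact_value='???'"
).fetchall()

# Simple aggregate: number of Facebook group memberships.
(num_groups,) = conn.execute(
    "SELECT COUNT(*) FROM groups WHERE type='facebook'"
).fetchone()

print("Facts to verify:", len(to_verify))
print("Facebook group memberships:", num_groups)

conn.close()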



Tags: data science