February 24, 2014
This term, I am taking “Future of News and Participatory Media” (MAS.700), a course taught by Ethan Zuckerman in the MIT Media Lab's Center for Civic Media. In one of the class assignments, we are assigned another classmate and told to find as much information online about him or her as possible, then to follow this research with an interview.
For this assignment, I decided to envision what the full “data science” pipeline for gathering, pre-processing, analyzing, and visualizing data about a classmate would involve. What parts of this process are easy and difficult? How much could be automated and done by machines, and which still require humans?
A company, government agency, or perhaps even a well-funded media organization might be able devote entire teams and many human-hours to such an effort, and to developing software to automate and scale certain steps in the process. As just one person, I did some of the steps manually and built early-stage prototypes of different pieces of the pipeline for the purposes of illustration.
OUTPUT: A set of pages with interesting content about my classmate.
I used the following sources:
OUTPUT: A set of spreadsheets containing facts about my classmate.
I organized the data that I found into a series of spreadsheets containing biographical information, key events, relevant institutions, and group affiliations. Each “factoid” took on a row in a table.
OUTPUT: A relational database containing the spreadsheets.
I exported the spreadsheets into comma-separated value (CSV) files and used a script to import them into a relational database. Given the small size of this project, I used SQLite, which stores the entire database as a single file. The code I used can be found in this code repository.
OUTPUT: Some analysis of the database contents.
Making the data available in a relational database allows us to filter, aggregate, or join different sources of information. Here are a few examples (I’ve omitted the actual results of the query for privacy reasons):
sqlite> SELECT fact_label,fact_value FROM bio WHERE fact_category='personal_info';
This query finds all “personal_info” facts from the “bio” table. It results in information about the person’s name, background, and career.
sqlite> SELECT fact_category, fact_label, fact_value, confidence FROM bio WHERE confidence='low' OR fact_value='???';
I encoded a “confidence” field in all tables in the database to classify facts as “high” or “low” confidence. I also used “???” as a symbol for facts that were missing from the database. A low-confidence classification might mean that a reporter should spend some effort verifying it.
sqlite> SELECT COUNT(*) FROM groups WHERE type='facebook';
A basic aggregation method is counting items with a certain field value. In this case, we can determine the number of Facebook group memberships that my classmate has.
sqlite> SELECT education.institution_name, position, num_students, founded FROM education, universities WHERE education.institution_name=universities.institution_name;
This is a “join” operation in SQL that involves getting extra information (the number of students and year of founding) from the educational institutions that my classmate attended. A query like this one would be useful if we had other sources of data (like information about groups or universities) to combine with the information we know about the person.