Keyword finder: automatic keyword extraction from text

Lavanya Sharan, Insight Data Science Fellow

in collaboration with URX

[Slides] [Github]
PROBLEM

For my project, I partnered with URX, a San Francisco-based startup that serves advertisements on mobile devices. The problem URX wants to solve is the following: how can advertisers be matched to content providers in a context-specific way? For example, if the content consists of a news article about Kendrick Lamar, a hip-hop musician, a matching ad would link to Kendrick Lamar's latest album. One way to accomplish this matching is to extract keywords from the news article, use those keywords to search a database of advertisers, and then serve the best matching ad.

My project focused on the keyword extraction step, and I built a prototype keyword extractor for URX. Given a block of text as input, my algorithm identifies keywords that describe what the text is about. The deliverables for my project were: (i) an algorithm for keyword extraction; and (ii) Python scripts to implement the algorithm.

SOLUTION
Pipeline
Figure 1: Stages of the algorithm

My algorithm for keyword extraction consists of three stages, similar to existing systems for keyword extraction [Jean-Louis, 2014]. The first stage involves generating candidate keywords. These include named entities, noun phrases, and frequent trigrams.
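As a rough illustration of the candidate-generation stage, the sketch below extracts frequent trigrams using only the Python standard library. The function name `frequent_trigrams`, the regex tokenizer, and the `min_count` cutoff are assumptions made for this example, not the project's actual code; named entities and noun phrases would need an NLP toolkit and are not shown.

```python
import re
from collections import Counter

def frequent_trigrams(text, min_count=2):
    """Return word trigrams that occur at least `min_count` times in `text`."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    # Trigrams may span sentence boundaries, since punctuation is discarded.
    tokens = re.findall(r"[a-z']+", text.lower())
    trigrams = zip(tokens, tokens[1:], tokens[2:])
    counts = Counter(" ".join(t) for t in trigrams)
    return [t for t, c in counts.most_common() if c >= min_count]

sample = ("Kendrick Lamar opened for the Yeezus tour. "
          "Fans said the Yeezus tour was the best show of the year.")
print(frequent_trigrams(sample))
```

In this toy input only "the yeezus tour" repeats, so it is the only trigram that would be passed on as a candidate.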

The second stage involves generating features for each candidate keyword. These features capture the frequency of a word (term frequency, TF-IDF score, Wikipedia frequency), the structure of a word (term length, capitalization), the type of word (named entity, noun phrase, trigram), and the relationship between the word and the input text (first occurrence in text, distance between occurrences in text). These 10 features are normalized to lie between 0 and 1, and are passed on to the next stage.
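To make the feature stage more concrete, here is a minimal sketch that computes four of the features for a single candidate and scales each to the range 0 to 1. The function name `candidate_features`, the `max_len` cap, and the specific normalizations are assumptions for illustration; the source does not say how each feature was normalized, and the TF-IDF and Wikipedia-frequency features are omitted because they need external corpora.

```python
def candidate_features(candidate, text, max_len=30):
    """Compute a few [0, 1]-scaled features for one candidate keyword."""
    lower_text = text.lower()
    lower_cand = candidate.lower()
    n_words = max(len(text.split()), 1)

    count = lower_text.count(lower_cand)   # case-insensitive substring matches
    first = lower_text.find(lower_cand)    # character offset, or -1 if absent

    return {
        # Occurrences per word of text, clipped to 1.
        "term_frequency": min(count / n_words, 1.0),
        # Character length relative to an assumed maximum of `max_len`.
        "term_length": min(len(candidate) / max_len, 1.0),
        # 1 if the candidate starts with an uppercase letter.
        "capitalized": 1.0 if candidate[:1].isupper() else 0.0,
        # Relative position of the first occurrence (1.0 if never found).
        "first_occurrence": first / max(len(text), 1) if first >= 0 else 1.0,
    }

text = "Kendrick Lamar is a rapper. Lamar tours often."
print(candidate_features("Lamar", text))
```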

The final stage consists of a keyword classifier. For each candidate keyword, the classifier examines the 10 associated features, and assigns a probability, p, of the candidate being a keyword. If p is high enough (e.g., > 0.75), the candidate is selected as a keyword. Specifically, the classifier I used was a logistic regression model, and it was trained on the Crowd500 dataset.
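The classification step can be sketched as follows: a logistic regression model maps a feature vector to a probability p via the logistic function, and candidates with p above a threshold are kept. The weights, bias, and three-feature vectors below are made-up values for illustration (the real model uses 10 features with weights learned from Crowd500), and the helper names are hypothetical.

```python
import math

def keyword_probability(features, weights, bias):
    """Logistic regression score: p = 1 / (1 + exp(-(w . x + b)))."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def select_keywords(candidates, weights, bias, threshold=0.75):
    """Keep candidates whose predicted probability exceeds `threshold`."""
    return [name for name, feats in candidates
            if keyword_probability(feats, weights, bias) > threshold]

# Hypothetical weights and features (3 features instead of 10, for brevity).
weights = [2.0, 1.5, -1.0]
bias = -1.0
candidates = [
    ("kendrick lamar", [1.0, 1.0, 0.0]),  # z = 2.5, p is about 0.92
    ("armchair",       [0.1, 0.0, 0.9]),  # z = -1.7, p is about 0.15
]
print(select_keywords(candidates, weights, bias))
```

With these illustrative numbers only "kendrick lamar" clears the 0.75 threshold.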

EVALUATION
Top-15 keyword extraction results
Figure 2: Top 15 keywords were extracted for each document in the Crowd500 test set, using the evaluation methodology of [Jean-Louis, 2014]. These keywords were compared to the human-annotated keywords, and the results are presented here in terms of precision, recall, and F1-scores. For comparison, results are also presented for a random word extractor and the AlchemyAPI keyword extractor. Error bars indicate standard error of the mean.

To evaluate the performance of my algorithm, I used a labeled dataset: Crowd500 by [Marujo, 2012]. Crowd500 consists of 500 news articles gathered from the web, and for each article, it contains a set of keywords that were identified by human annotators. This dataset is split into two sets: training (450 articles) and test (50 articles).

To train the logistic regression model, I used the Crowd500 training set as follows. For each article in the training set, I selected all keywords associated with the article (positive examples), and an equal number of non-keywords (negative examples). For each example word, I extracted the 10 features described above.
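The balanced sampling described above might look like the sketch below. It assumes that negatives are drawn at random from an article's candidate words that are not annotated keywords; the source does not specify how non-keywords were chosen, so the function `balanced_examples`, its `seed` parameter, and the truncation when too few negatives exist are all illustrative assumptions.

```python
import random

def balanced_examples(candidates, keywords, seed=0):
    """Return (word, label) pairs with equally many positives (1) and negatives (0)."""
    keyword_set = set(keywords)
    positives = list(keywords)
    # Assumption: any candidate that is not an annotated keyword can be a negative.
    negatives = [w for w in candidates if w not in keyword_set]
    k = min(len(positives), len(negatives))
    rng = random.Random(seed)              # fixed seed for reproducibility
    sampled_negatives = rng.sample(negatives, k)
    return [(w, 1) for w in positives[:k]] + [(w, 0) for w in sampled_negatives]

article_candidates = ["kendrick lamar", "hip-hop", "armchair", "hoodie",
                      "brooklyn", "media", "sports"]
article_keywords = ["kendrick lamar", "hip-hop"]
print(balanced_examples(article_candidates, article_keywords))
```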

In this manner, I was able to generate a training dataset for my keyword classifier, consisting of 51,976 examples, half positive and half negative. To test the classifier, I generated a testing dataset in a similar manner, consisting of 4,882 examples, half positive and half negative. The logistic regression model achieved cross-validated accuracy of 65% on the training dataset and 65% on the testing dataset (chance = 50%).

Once the keyword classifier has been trained, the performance of the full algorithm (i.e., all three stages) can be evaluated using the [Jean-Louis, 2014] methodology. As shown in Figure 2, the performance at extracting the top 15 keywords in a document is well above chance (Random) and comparable to, if not better than, that of a state-of-the-art keyword extractor (AlchemyAPI).
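For a single document, the top-k comparison against human annotations can be sketched as below. This is a simplified exact-match version written for illustration; the precise matching rules of the [Jean-Louis, 2014] methodology are not given here, and the function names and sample scores are hypothetical.

```python
def top_k(scored, k=15):
    """Return the k highest-probability keywords from (keyword, probability) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [keyword for keyword, _ in ranked[:k]]

def precision_recall_f1(extracted, gold):
    """Exact-match precision, recall, and F1 between two keyword collections."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical classifier output and annotations for one document.
scored = [("kendrick lamar", 0.9), ("hip-hop", 0.8),
          ("armchair", 0.6), ("brooklyn", 0.3)]
gold = ["kendrick lamar", "hip-hop", "kanye west"]
print(precision_recall_f1(top_k(scored, k=3), gold))
```

Averaging these per-document scores over the test set, along with the standard error of the mean, would give numbers of the kind plotted in Figure 2.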

Precision vs. recall (left) and F1-score vs. number of keywords (right)
Figure 3: (Left) Precision vs. recall curves obtained by sweeping a probability threshold on the output of the keyword classifier (i.e., if probability greater than threshold, accept as keyword). (Right) F1-scores as a function of the number of top keywords selected from a document. The black dashed line indicates the top-15 keyword comparison shown in Figure 2. Spread of the blue and red lines indicates standard error of the mean, taken across all documents in the Crowd500 test set.

Figure 3 reveals that the performance of my algorithm is superior to that of AlchemyAPI for a range of parameters, whether it is the probability threshold for the keyword classifier (Figure 3, left) or the number of top keywords chosen for evaluation (Figure 3, right). That said, it is worth noting that my algorithm was trained on the Crowd500 training set, whereas the AlchemyAPI keyword extractor (presumably) was not. Additionally, AlchemyAPI excels at returning keyphrases rather than keywords, which my evaluation procedure ignores.
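A precision vs. recall curve of this kind comes from sweeping the probability threshold, which can be sketched as follows. The probabilities and labels are invented for illustration, `pr_curve` is a hypothetical helper, and reporting a precision of 1.0 when nothing is accepted is a convention chosen for this sketch.

```python
def pr_curve(probs, labels, thresholds):
    """Return (threshold, precision, recall) for each threshold in `thresholds`."""
    n_pos = sum(labels)
    points = []
    for t in thresholds:
        accepted = [p > t for p in probs]   # accept as keyword if p > threshold
        tp = sum(1 for a, y in zip(accepted, labels) if a and y)
        n_accepted = sum(accepted)
        # Convention (an assumption): precision is 1.0 when nothing is accepted.
        precision = tp / n_accepted if n_accepted else 1.0
        recall = tp / n_pos if n_pos else 0.0
        points.append((t, precision, recall))
    return points

# Hypothetical classifier probabilities and gold labels (1 = annotated keyword).
probs = [0.9, 0.8, 0.6, 0.3]
labels = [1, 1, 0, 1]
for t, precision, recall in pr_curve(probs, labels, [0.25, 0.5, 0.75]):
    print(f"threshold={t:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy data, raising the threshold trades recall for precision, which is the trade-off a precision vs. recall plot visualizes.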

EXAMPLE
Best ad found using keyword search
Figure 5: Search for the query 'kendrick lamar hiphops oldschool' in a database of advertisers yields Kendrick Lamar's touring schedule on bandsintown.com. This advertisement is a successful match to the original news article about Kendrick Lamar.

The goal of the project was to provide URX with a prototype keyword extractor, so as to better serve advertisements. To assess if my algorithm satisfies this goal, I ran it on the New York Times profile about Kendrick Lamar. The keywords identified by my algorithm are marked in yellow. Note that the algorithm picks up on the artist's name and musical genre, which is sufficient to serve a matching ad (Figure 5).

"Kendrick Lamar, Hip-hop's Newest Old-School Star
By Dave Turner
Last updated: February 3rd, 2015

'Everybody just wants to have fun, be with the scene,' Kendrick Lamar said when we met in his cramped quarters inside the Barclays Center in Brooklyn last fall. 'Certain people get backstage, people that you would never expect...You ain't with the media! You ain't into music! You ain't into sports! You're just here.' The rapper, now 27, had just finished his set as the opening act on this stretch of Kanye West's Yeezus tour, and he was sitting low in an armchair in his trademark black hoodie surrounded by exactly those people. 'Hey man, thank you again, appreciate the access back'"...

ACKNOWLEDGMENTS

I am grateful to Ben Bowles, Joe Isaacson, and Greg Bowyer of URX for advice and support.