Have you ever wondered how u/autotldr produces these oftentimes really neat summaries of articles automatically? Well, let’s find out!

For those of you who’re unfamiliar with it: u/autotldr is Reddit’s automatic text summary bot. It attempts to create a summary of an article (news article, scientific article etc.) by extracting sentences from the original article that, together, get the gist of it, reducing the original content by at least 70 percent (and it also extracts the top 5 keywords from the article).

However, AutoTLDR does not create the summary itself but calls SMMRY, a web service for automatic text summarization. The question then is: how do THEY do it?
We can find a brief overview of their core summarization algorithm on their website which boils down to:

  1. Split text into individual words and score them based on their popularity.
  2. Split text into individual sentences and rank them by the sum of their word scores from point 1.
  3. Return X of the most highly ranked sentences in the same order in which they appear in the original text.

The important piece of information that’s missing here is how exactly SMMRY calculates the score of a word. This sparked my curiosity, so I decided to come up with my own little automatic text summarization algorithm.

Note: The code for the “algorithm” from this post can be found here.

What’s our goal?

Our goal is to create an algorithm for automatic text summarization based on the outline given by SMMRY:

  1. Come up with a data model for words and sentences.
  2. Measure the importance of a word as some kind of word score.
  3. Rank sentences by the sum of their words’ scores.
  4. Return X of the most highly ranked sentences in chronological order, where X is the number of sentences whose score is at least 25 percent above the average sentence score.

The last two points are essentially the same as with the SMMRY algorithm. The only twist here is that I didn’t want to specify a fixed number of sentences for our summary but let the script choose them dynamically.

As for the word score, I decided to keep things simple and use term frequency as a measure of the importance of a word. This is more or less inspired by algorithms such as TF-IDF, which also assume that if a word appears often in a document, it’s probably relevant to the meaning of the text.
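To make the idea concrete, here’s a minimal sketch of term frequency as a word score; the function name is mine and chosen purely for illustration, the actual scoring code comes later:

from collections import Counter
from typing import Dict, List

def term_frequency(words: List[str]) -> Dict[str, int]:
    """Score each word by the number of times it occurs in the text."""
    return dict(Counter(words))

# term_frequency(['the', 'fox', 'saw', 'the', 'fox'])
# -> {'the': 2, 'fox': 2, 'saw': 1}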

Also, please note that the purpose of this isn’t to cobble together something that’s meant to compete with SMMRY and the like; this is merely an attempt at exploring how far we can get with a very simple algorithm for automatic text summarization, using automatically created summaries as a reference to compare our results against.

But with that out of the way and a plan at hand, let’s get started!

Getting data

The first step is getting data we can work with. For this I headed over to r/worldnews and downloaded the following 3 articles1 and their TL;DRs:

  1. Taiwan to change passport, fed up with confusion with China
  2. Emmanuel Macron refuses to condemn Charlie Hebdo cartoons of Prophet Mohammad
  3. U.S. refuses to join global COVID-19 vaccine effort because it is “led by the WHO”

I copied the texts relatively verbatim into text files which can be found in summapy/articles and their respective autotldr texts in summapy/smmry_tldr.

The data model

Since our goal is to score words and sentences I decided to represent a text as a list of (enumerated) sentences and a sentence as a list of tuples of word forms:

text            -> list of (enumerated) sentences  
sentence        -> list of tuples of word forms
word form tuple -> (text_repr, surface_repr, underlying_repr)

text_repr refers to the form of a token as it appears in the text, including punctuation, lowercase / uppercase etc.

Removing punctuation and making it lowercase (if it’s not an abbreviation) yields the surface representation (surface_repr) of the word.2

The last step is removing “grammatical” (e.g. phonological and morphological) features such as singular vs. plural, tense etc.; this yields the last form, which I called the underlying representation (underlying_repr) of a word.
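In Python terms, the data model could look roughly like this; the NamedTuple and the type aliases are just a sketch of mine, and the actual code carries a bit more information per token (as we’ll see later):

from typing import List, NamedTuple

class WordForm(NamedTuple):
    text_repr: str        # token as it appears in the text, e.g. 'foxes!'
    surface_repr: str     # punctuation removed and lowercased, e.g. 'foxes'
    underlying_repr: str  # "grammatical" features stripped, e.g. 'fox'

Sentence = List[WordForm]
Text = List[Sentence]  # sentences are enumerated by their position in this list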

I’ll go into how and where each form is used further down the line but first, let’s concern ourselves with text processing as a whole.

Text processing

Since most articles consist of paragraphs and paragraphs consist of sentences, the first step is splitting up the text into a list of paragraphs. That’s easy because paragraphs are separated by (at least) two newlines:

from typing import List

def split_into_paragraphs(text: str) -> List[str]:
    paragraphs = []
    for paragraph in text.split('\n\n'):
        paragraph = paragraph.strip()
        if not paragraph:
            # more than two consecutive newlines produce empty chunks - skip them
            continue
        if paragraph[-1] in ['.', '?', '!', '"']:
            paragraphs.append(paragraph)
        else:
            # make sure every paragraph ends in a sentence delimiter
            paragraphs.append(paragraph + '.')
    return paragraphs

I also decided to end all paragraphs with a full stop if they don’t already end in a sentence delimiter (i.e. a full stop, question mark, exclamation mark or closing quotation mark).

Now, a bigger problem, also mentioned in SMMRY’s algorithm description, was deciding whether a period signals the end of a sentence or belongs to an abbreviation. Since I didn’t want to spend too much time here, I settled on a simple approach: first, I got a list of common abbreviations in English and then defined an abbreviation as follows:

A word is an abbreviation if at least one of the following holds:

  1. it is in the list of common English abbreviations
  2. it is made up of only uppercase letters
  3. it contains only uppercase letters and dots and ends with a period
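A minimal sketch of that check could look like this; the function name and the ASCII-only regexes are my own simplifications, the real code may well differ in the details:

import re
from typing import Set

def is_abbreviation(token: str, common_abbr: Set[str]) -> bool:
    return (token in common_abbr                               # 1. known common abbreviation
            or re.fullmatch(r'[A-Z]+', token) is not None      # 2. only uppercase letters
            or re.fullmatch(r'[A-Z.]+\.', token) is not None)  # 3. uppercase letters and dots, ending in a period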

That worked surprisingly well; the only case I came across where it failed was in headlines where all words are either fully uppercase (SCREAMING headlines) or capitalized (Like This Here):

import re
import unicodedata
from typing import List, Set

def split_into_sentences(paragraphs: List[str], common_abbr: Set[str]) -> List[List[str]]:
    sentences = []
    for paragraph in paragraphs:
        tokens = re.split(r'[\n ]', paragraph)
        sentence = []
        t1, t2 = '', ''
        for i in range(0, len(tokens) - 1):
            t1 = tokens[i]
            t2 = tokens[i + 1]
            if t1[-1] in ['!', '?'] or t1[-2:] in ['."', '!"', '?"', '".', '"!', '"?']:
                # case: unambiguous sentence delimiters: !, ?, .", !", ?", "., "!, "?
                sentences.append(sentence + [t1])
                sentence = []
            elif t1 in common_abbr:
                # case: common abbreviation - they only appear mid-sentence so no new sentence here
                sentence.append(t1)
            elif t1.endswith('.') and unicodedata.category(t2[0]) == 'Lu':
                # case: t1 ends with a period and t2 starts with an uppercase letter,
                # so t1 most likely marks the end of a sentence
                sentences.append(sentence + [t1])
                sentence = []
            else:
                # base case: no new sentence - just append
                sentence.append(t1)
        # whatever is left (plus the paragraph's last token) forms the final sentence
        sentences.append(sentence + [t2])
    return sentences
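To tie the two functions together, usage would look roughly like this; the file paths and the format of the abbreviation list are placeholders of mine:

# placeholder paths - point them at an article and your abbreviation list
with open('article.txt', encoding='utf-8') as f:
    text = f.read()
with open('abbreviations.txt', encoding='utf-8') as f:
    common_abbr = {line.strip() for line in f if line.strip()}

paragraphs = split_into_paragraphs(text)
sentences = split_into_sentences(paragraphs, common_abbr)  # list of lists of tokens in text_repr form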

The result of our transformations up to this point is a list of lists of tokens in the text_repr form, so the next step is to find their surface and underlying representations. We’ll use surface_repr later to determine the form of a word when it appears as a top keyword, while underlying_repr is used to calculate the frequency of a word (across the whole text, by the way).

I mentioned earlier that I use term frequency to determine how relevant a word is for a text. The problem with this approach is that the term frequency of words that are very common in general is also very high, even if they’re completely irrelevant to the meaning of a text. TF-IDF counters this by checking a whole lot of other texts, making a list of words with high frequencies across all of them and then drastically lowering the scores of these very common words. But we only have one text! The solution here comes in the form of a list of stop words: a list containing some of the most common words in a language (English, in our case) which we’ll use to sort out terms bearing little relevance for a specific text. I decided to use MySQL’s list of stop words (which they use in their full-text search) since it seems pretty extensive.
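Loading and applying the stop word list is as simple as it sounds; a rough sketch (the file name stopwords.txt is just a placeholder for wherever you keep MySQL’s list):

def load_stopwords(path: str = 'stopwords.txt') -> set:
    """Read one stop word per line into a set for fast membership checks."""
    with open(path, encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}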

Going from text_repr to the surface representation is pretty straightforward: remove all sorts of punctuation and make the rest lowercase (unless it’s an abbreviation). Now we’re left with terms such as “fox” and “foxes”, which we’d like to have the same underlying representation so that they’re counted as the same word when we calculate term frequencies. To make this work I used the Snowball stemmer for English, provided by PyStemmer, on all words save for stop words:

>>> from Stemmer import Stemmer
>>> stemmer = Stemmer('english')
>>> stemmer.stemWord('fox')
'fox'
>>> stemmer.stemWord('foxes')
'fox'
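Putting the two steps together, the mapping from text_repr to surface_repr to underlying_repr could be sketched roughly like this; normalize_token is my own helper name and the punctuation handling is simplified compared to the actual code:

import string
from Stemmer import Stemmer

stemmer = Stemmer('english')

def normalize_token(text_repr: str, stopwords: set, is_abbr: bool = False) -> tuple:
    """Return the (text_repr, surface_repr, underlying_repr) tuple for a token."""
    surface_repr = text_repr.strip(string.punctuation)
    if not is_abbr:
        surface_repr = surface_repr.lower()
    # stop words are not stemmed - they keep their surface form as underlying form
    underlying_repr = surface_repr if surface_repr in stopwords else stemmer.stemWord(surface_repr)
    return (text_repr, surface_repr, underlying_repr)

# normalize_token('foxes!', set()) -> ('foxes!', 'foxes', 'fox')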

Scores, Ranks and top keywords

With the text processing done, we can now go about scoring and ranking. Note that our original text was transformed into a list of lists of tuples like this:

"Two big, blue foxes! And two golden ducks." -> [
  [('"Two',    'two',    'two'), 
   ('big,',    'big',    'big'), 
   ('blue',    'blue',   'blue'), 
   ('foxes!',  'foxes',  'fox')],
  [('And',     'and',    'and'),
   ('two',     'two',    'two'),
   ('golden',  'golden', 'golden'),
   ('ducks."', 'ducks',  'duck')]
]

Calculating the term frequencies - and therefore the ‘score’ - is pretty simple:

# conceptual version: 'stopwords' and 'term_frequency' are assumed to exist
def score(w):
    if w not in stopwords:
        return term_frequency[w]
    return 0

But if you were to inspect the code for this, you’d probably notice something odd:

from collections import Counter
from typing import Dict, List, Tuple

def score_terms(sentences: List[List[Tuple]]) -> Dict:
    # collect, per (underlying representation, tag), every surface form occurring in the text
    term_frequencies = dict()
    for sentence in sentences:
        # in the actual code a token is a 5-tuple: position, text_repr, surface_repr,
        # underlying_repr and a tag (e.g. STOPWORD_TAG for stop words)
        for pos, tr, sr, ur, tag in sentence:
            if tag == STOPWORD_TAG:
                continue
            if (ur, tag) in term_frequencies:
                term_frequencies[(ur, tag)].append(sr)
            else:
                term_frequencies[(ur, tag)] = [sr]

    scores = dict()
    for term, tag in term_frequencies.keys():
        tf = term_frequencies[(term, tag)]
        score = len(tf)  # term frequency = number of occurrences
        # the most common surface form will represent the term in the keyword list
        surface_term, _ = Counter(tf).most_common(1)[0]
        scores[term] = (score, tag, surface_term)

    return scores

What gets returned here isn’t just (the underlying representation of) the term and its frequency but the term mapped to a tuple comprising its score (aka term frequency) and its surface representation (yes, and a ‘tag’, which you can ignore). The reason for this is the list of top keywords: the underlying representation of a word isn’t necessarily a “proper” word - “city”, for instance, would become “citi”. Our top keywords list would therefore contain a lot of strange words, so I decided to represent a term by its most common form used in the original text.

The next step is again pretty straightforward: for each sentence, replace each term by its score and then sum it all up. The result is then the score of the sentence.
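In code, that step could look something like this sketch, which assumes the scores dict returned by score_terms above and the 5-tuple token format (the helper name is mine):

def score_sentence(sentence, scored_terms) -> int:
    """Sum the scores of all terms in a sentence; terms without a score (e.g. stop words) count as 0."""
    return sum(scored_terms.get(ur, (0,))[0] for _, _, _, ur, _ in sentence)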

And the last step is to order sentences by their scores (in descending order), then select all sentences whose scores are at least 25 percent above the average score of all sentences and then return them in the order they appeared in the original text.
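And a sketch of the selection step, assuming we have a list of (index, score, sentence) triples; again, the exact shapes and names are mine:

def select_sentences(scored_sentences):
    """Keep sentences scoring at least 25 percent above average, in their original order."""
    average = sum(score for _, score, _ in scored_sentences) / len(scored_sentences)
    selected = [(idx, sentence) for idx, score, sentence in scored_sentences if score >= 1.25 * average]
    return [sentence for _, sentence in sorted(selected)]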

“But what about keywords?” Glad you asked! As mentioned, we also want a list of the top 5 keywords, so we go back to our list of scored words, order them by their scores (again, descending order) and then simply return the top 5 keywords:

from typing import Dict, List, Tuple

def get_top_keywords(scored_terms: Dict, top_n=5) -> List[Tuple[int, str]]:
    # sort by score (item[0]) in descending order, keep the top n terms and return their surface forms
    ranked_terms = sorted(scored_terms.values(), key=lambda item: item[0], reverse=True)[:top_n]
    ranked_terms = [term for _, _, term in ranked_terms]
    return [(i+1, term) for i, term in enumerate(ranked_terms)]

Et voilà! We’ve summarized an article.

Results!

Now it’s time to apply our algorithm to the three news articles and compare our results to those from autoTLDR / SMMRY.

I won’t go over all articles and their results one-by-one but rather put a detailed description of the evaluation process and evaluations per article here.

Our simple algorithm turned out to perform pretty well, especially with keyword extraction, where we outperformed SMMRY in one case, were on par in another and did worse in only one case.

We also did very well in terms of information extraction, where our algorithm managed to reliably extract sentences containing all the relevant information. But, and this is in my opinion the biggest downside of our algorithm: it only managed to capture all the necessary information by extracting a lot more content than necessary, which inflates our TL;DRs to the point where they aren’t significantly shorter than the original articles.

Nevertheless, I’m pretty happy with the results so far, especially considering that this was just a lazy weekend project! (Well, a bit more than “just a weekend”, to be honest.) As always, all the code and data used in this project can be found on GitHub.

Thanks for reading!


  1. The only reason I chose those articles was because they were the three newest articles posted in this sub at the time. I didn’t (and still don’t) care what the articles are about.↩︎

  2. Yes, I’m aware of the fact that “surface representation” and “underlying representation” are actual terms in phonology and morphology with a slightly different meaning.↩︎