Open Source Text Processing Project: Sumy

Sumy: Automatic text summarizer

Project Website:

Github Link: https://github.com/miso-belica/sumy

Description

A simple library and command line utility for extracting summaries from HTML pages or plain text. The package also contains a simple evaluation framework for text summaries. Implemented summarization methods:

Luhn – heuristic method, reference
Edmundson – heuristic method with previous statistical research, reference
Latent Semantic Analysis, LSA – one of the algorithms from http://scholar.google.com/citations?user=0fTuW_YAAAAJ&hl=en (I think the author is using more advanced algorithms now). Steinberger, J. and Ježek, K. Using latent semantic analysis in text summarization and summary evaluation. In Proceedings ISIM '04. 2004. pp. 93-100.
LexRank – Unsupervised approach inspired by the PageRank and HITS algorithms, reference
TextRank – a combination of a few resources that I found on the internet. I really don't remember the sources; probably Wikipedia and some papers on the first page of Google 🙂
SumBasic – Method that is often used as a baseline in the literature. Source: Read about SumBasic
KL-Sum – Method that greedily adds sentences to a summary as long as doing so decreases the KL divergence. Source: Read about KL-Sum
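
To make the KL-Sum description above more concrete, here is a minimal Python 3 sketch of the greedy idea: keep adding the sentence that brings the summary's word distribution closest to the document's, and stop when no sentence lowers the divergence. This is not Sumy's own code. The function names (tokenize, kl_divergence, kl_sum), the regex tokenizer and the additive smoothing constant are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", sentence.lower())


def kl_divergence(doc_counts, summary_counts, smoothing=0.01):
    """KL(P_document || Q_summary) over the document vocabulary.

    Q is additively smoothed (an assumption of this sketch) so that words
    missing from the summary do not cause a division by zero.
    """
    doc_total = sum(doc_counts.values())
    summary_total = sum(summary_counts.values()) + smoothing * len(doc_counts)
    divergence = 0.0
    for word, count in doc_counts.items():
        p = count / doc_total
        q = (summary_counts[word] + smoothing) / summary_total
        divergence += p * math.log(p / q)
    return divergence


def kl_sum(sentences, max_sentences):
    """Greedily pick sentences while each pick lowers the KL divergence."""
    tokens = [tokenize(s) for s in sentences]
    doc_counts = Counter(word for sent in tokens for word in sent)
    chosen = []
    summary_counts = Counter()
    best = float("inf")
    while len(chosen) < max_sentences:
        candidate = None
        for i, sent in enumerate(tokens):
            if i in chosen or not sent:
                continue
            score = kl_divergence(doc_counts, summary_counts + Counter(sent))
            if score < best:
                best, candidate = score, i
        if candidate is None:  # no remaining sentence lowers the divergence
            break
        chosen.append(candidate)
        summary_counts.update(tokens[candidate])
    # Return the chosen sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling kl_sum(list_of_sentences, 3) returns at most three sentences. It may return fewer, because the loop stops as soon as adding any remaining sentence would increase the divergence.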
Installation
Make sure you have Python 2.7/3.3+ and pip (Windows, Linux) installed. The preferred way is simply to run:

$ [sudo] pip install sumy
Or, for the latest development version:

$ [sudo] pip install git+https://github.com/miso-belica/sumy.git
Usage
Sumy contains a command line utility for quick summarization of documents.

$ sumy lex-rank --length=10 --url=http://en.wikipedia.org/wiki/Automatic_summarization # what's summarization?
$ sumy luhn --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy edmundson --language=czech --length=3% --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy --help # for more info
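
The commands above pass the summary length either as a sentence count (10) or as a percentage of the document (3%). The sketch below shows one plausible way such a value could be turned into a number of sentences. The helper name resolve_length and its rounding rule (round down, but keep at least one sentence) are assumptions for illustration, not a description of Sumy's internals.

```python
def resolve_length(spec, total_sentences):
    """Turn a length spec such as "10" or "3%" into a sentence count.

    Hypothetical helper: the rounding behaviour is an assumption.
    """
    spec = str(spec).strip()
    if spec.endswith("%"):
        percentage = int(spec[:-1])
        # Assumed rule: round down, but keep at least one sentence.
        count = max(1, total_sentences * percentage // 100)
    else:
        count = int(spec)
    # Never ask for more sentences than the document has.
    return min(count, total_sentences)


print(resolve_length("10", 200))  # absolute count
print(resolve_length("3%", 200))  # percentage of the document
```

Under these assumptions a percentage is convenient when documents vary a lot in size, while an absolute count gives summaries of predictable length.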
The evaluation methods for a given summarization method can be run with the commands below, which compare its output against a reference summary:

$ sumy_eval lex-rank reference_summary.txt --url=http://en.wikipedia.org/wiki/Automatic_summarization
$ sumy_eval lsa reference_summary.txt --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy_eval edmundson reference_summary.txt --language=czech --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy_eval --help # for more info
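
For readers unfamiliar with how an extractive summary is scored against a reference summary, the sketch below computes sentence co-selection precision, recall and F-score, a common family of measures for this task. It only illustrates the idea. The text above does not say which measures sumy_eval reports, and the function name co_selection_scores is an assumption.

```python
def co_selection_scores(evaluated, reference):
    """Return (precision, recall, f_score) for two lists of sentences.

    Sentences are compared by exact string equality, which is a
    simplifying assumption of this sketch.
    """
    evaluated_set, reference_set = set(evaluated), set(reference)
    if not evaluated_set or not reference_set:
        raise ValueError("both summaries must contain at least one sentence")
    common = len(evaluated_set & reference_set)
    precision = common / len(evaluated_set)  # share of picked sentences that are right
    recall = common / len(reference_set)     # share of reference sentences that were found
    if common == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


generated = ["Sentence A.", "Sentence B.", "Sentence C."]
reference = ["Sentence B.", "Sentence C.", "Sentence D.", "Sentence E."]
print(co_selection_scores(generated, reference))
```

Here two of the three generated sentences also appear in the four-sentence reference, so precision is 2/3, recall is 1/2 and the F-score is 4/7.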

