Open Source Text Processing Project: The Porter Stemming Algorithm

Project Website:

Github Link: None

Description

This is the ‘official’ home page for distribution of the Porter Stemming Algorithm, written and maintained by its author, Martin Porter.

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
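As a rough illustration of what removing an inflexional ending means, the sketch below applies only Step 1a of the published algorithm (the plural rules SSES → SS, IES → I, SS → SS, S → nothing) to lower-case words. It is not the full stemmer and is not the definitive ANSI C version; the function name step_1a is a hypothetical name chosen for this example.

```python
def step_1a(word: str) -> str:
    """Apply only Step 1a (plural endings) of the published algorithm.

    Illustrative sketch: assumes `word` is already lower-case. The full
    algorithm has several further steps that are not shown here.
    """
    # Rules are tried in order, so the longest matching suffix wins.
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

In a term normalisation setting, the same function would be applied to both the indexed terms and the query terms, so that, for example, ‘cats’ and ‘cat’ are treated as the same term.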

History

The original stemming algorithm paper was written in 1979 in the Computer Laboratory, Cambridge (England), as part of a larger IR project, and appeared as Chapter 6 of the final project report:
C.J. van Rijsbergen, S.E. Robertson and M.F. Porter, 1980. New models in probabilistic information retrieval. London: British Library. (British Library Research and Development Report, no. 5587).
With van Rijsbergen’s encouragement, it was also published in:
M.F. Porter, 1980. An algorithm for suffix stripping. Program, 14(3), pp 130–137.
It has since been reprinted in:
Karen Sparck Jones and Peter Willett, 1997. Readings in Information Retrieval. San Francisco: Morgan Kaufmann. ISBN 1-55860-454-4.
The original stemmer was written in BCPL, a language once popular, but now defunct. For the first few years after 1980 it was distributed in its BCPL form, via the medium of punched paper tape. Versions in other languages soon began to appear, and by 1999 it was being widely used, quoted and adapted. Unfortunately there were numerous variations in functionality among these versions, and this web page was set up primarily to ‘put the record straight’ and establish a definitive version for distribution.

The ANSI C version that heads the table below is exactly equivalent to the original BCPL version. The BCPL version did, however, differ in three minor points from the published algorithm and these are clearly marked in the downloadable ANSI C version. They are discussed further below.

This ANSI C version may be regarded as definitive, in that it now acts as a better definition of the algorithm than the original published paper.

