Open Source Text Processing Project: Text-NSP

NSP: The Ngram Statistics Package

Project Website:

Github Link: None

Description

The Ngram Statistics Package (NSP) is a collection of Perl modules that aid in analyzing Ngrams in text files. We define an Ngram as a sequence of ‘n’ tokens that occur within a window of at least ‘n’ tokens in the text; what constitutes a “token” can be defined by the user.
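To make the window-based definition concrete, here is a minimal Python sketch (not NSP code) that counts Ngrams over a token list. The function name count_ngrams and the whitespace tokenization are illustrative assumptions. The sketch also assumes that the first token of each Ngram is fixed at the start of the window and that the remaining tokens are drawn, in order, from the rest of the window.

```python
from collections import Counter
from itertools import combinations


def count_ngrams(tokens, n=2, window=None):
    """Count ordered n-token sequences whose tokens fall inside a sliding window."""
    if window is None:
        window = n  # default: the n tokens must be contiguous
    if window < n:
        raise ValueError("window must be at least n")
    counts = Counter()
    for start in range(len(tokens)):
        # The first token is fixed; the remaining n - 1 tokens are picked,
        # preserving order, from the following window - 1 positions.
        following = tokens[start + 1 : start + window]
        for rest in combinations(following, n - 1):
            counts[(tokens[start],) + rest] += 1
    return counts


# Hypothetical input: tokens here are simply whitespace-separated words.
tokens = "the cat sat on the mat".split()
contiguous = count_ngrams(tokens, n=2)          # window of exactly 2 tokens
windowed = count_ngrams(tokens, n=2, window=3)  # one intervening token allowed
```

With a window of 3, the pair ("the", "sat") is counted even though the two words are not adjacent; with the default window it is not.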

NSP.pm is a stub that doesn’t have any real functionality. It serves as a top level module in the hierarchy and allows us to group the Text::NSP::Count and Text::NSP::Measures modules.

The modules under Text::NSP::Measures implement measures of association that are used to evaluate whether the words in an Ngram co-occur purely by chance or whether their co-occurrence is statistically significant. Each measure computes a numerical score for an Ngram. This score can be used to decide whether there is enough evidence to reject the null hypothesis (that the Ngram is not statistically significant) for that Ngram.
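As an illustration of what such a score looks like, the following Python sketch computes the log-likelihood ratio (G^2) for a bigram from a 2x2 contingency table. It is a standalone sketch, not the Text::NSP::Measures implementation. The parameter names are assumptions chosen for this example: n11 is the joint frequency of the bigram, n1p the frequency of the first word in first position, np1 the frequency of the second word in second position, and npp the total number of bigrams.

```python
import math


def log_likelihood(n11, n1p, np1, npp):
    """G^2 log-likelihood ratio for a 2x2 bigram contingency table."""
    # Remaining observed cells and marginals, derived from the inputs.
    n12 = n1p - n11
    n21 = np1 - n11
    n22 = npp - n1p - np1 + n11
    n2p = npp - n1p
    np2 = npp - np1
    observed = [n11, n12, n21, n22]
    if npp <= 0 or min(observed) < 0:
        raise ValueError("inconsistent contingency table")
    # Expected counts under the null hypothesis of independence.
    expected = [
        n1p * np1 / npp,
        n1p * np2 / npp,
        n2p * np1 / npp,
        n2p * np2 / npp,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


# Observed counts equal expected counts, so the score is zero.
independent = log_likelihood(n11=10, n1p=20, np1=50, npp=100)
# The two words only ever occur together, so the score is large.
associated = log_likelihood(n11=10, n1p=10, np1=10, npp=100)
```

A score near zero gives no evidence against independence. Under the usual chi-squared approximation with one degree of freedom, a score above roughly 3.84 would reject the null hypothesis at the 0.05 level.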

To use one of the measures, you can either use the program statistic.pl provided under the utils directory or write your own driver program. statistic.pl takes as input a list of Ngrams with their frequencies (in the format output by count.pl) and runs a user-selected statistical measure of association to compute a score for each Ngram. The Ngrams are then output, along with their scores, in descending order of score. For help on using utils/statistic.pl, please refer to its perldoc (perldoc utils/statistic.pl).
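The score-then-rank workflow described above can be sketched in a few lines of Python. This is not statistic.pl and does not parse the count.pl format; it works on in-memory tuples, uses the Dice coefficient as the example measure, and the names dice, rank_bigrams, and the sample counts are all made up for illustration.

```python
def dice(n11, n1p, np1):
    """Dice coefficient for a bigram: 2 * joint count / sum of marginal counts."""
    denominator = n1p + np1
    if denominator == 0:
        return 0.0
    return 2.0 * n11 / denominator


def rank_bigrams(rows):
    """Score each (word1, word2, n11, n1p, np1) row and sort by descending score."""
    scored = [(w1, w2, dice(n11, n1p, np1)) for w1, w2, n11, n1p, np1 in rows]
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Hypothetical frequency data: joint count, then the two marginal counts.
rows = [
    ("of", "the", 40, 200, 400),
    ("new", "york", 30, 40, 35),
    ("the", "cat", 5, 400, 20),
]
ranked = rank_bigrams(rows)
for w1, w2, score in ranked:
    print(f"{w1} {w2} {score:.4f}")
```

Swapping in a different measure only requires replacing the scoring function, which mirrors the idea of a user-selected measure.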

If you are writing your own driver program, a basic usage example is provided under SYNOPSIS in the package's perldoc. For further clarification, please refer to the documentation of Text::NSP::Measures (perldoc Text::NSP::Measures).

