LMOSS: Lightweight Metrics of Semantic Similarity

LMOSS calculates pointwise mutual information (PMI) scores between any pair of words in a corpus. (It can also be used to calculate a variant of PMI that takes local n-gram information into account.) When trained on a large corpus, rank-ordered PMI scores correlate quite well with rank-ordered human similarity judgments (e.g., if trained on Wikipedia, PMI will rank "car" and "automobile" as quite similar, and "car" and "humidifier" as not-so-similar). LMOSS offers an interface for PMI scores similar to the official web interface for Latent Semantic Analysis, allowing matrix comparison, one-to-many comparison, pairwise comparison, and forced-choice comparison (useful for testing PMI on standardized synonymy tests like TOEFL). You'll need your own corpus to train it on--one of the advantages of PMI is that it's fast and easy to train on your own domain-specific corpora.
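For readers who have not met PMI before, here is a minimal sketch of the textbook definition in Python. It is not LMOSS's own code: the function names (pmi, forced_choice) and the single "total" used to turn counts into probabilities are assumptions made for illustration, and LMOSS may estimate probabilities differently.

```python
import math

def pmi(pair_count, count_x, count_y, total):
    """Textbook PMI: log2( p(x, y) / (p(x) * p(y)) ).

    Probabilities are estimated as count / total (an assumption of this sketch).
    """
    if pair_count == 0:
        return float("-inf")  # the pair was never observed together
    p_xy = pair_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

def forced_choice(scores):
    """Given {candidate: PMI with the probe word}, pick the highest-scoring candidate."""
    return max(scores, key=scores.get)
```

A forced-choice test item then reduces to scoring each candidate against the probe word and keeping the best one, e.g. forced_choice({"automobile": 3.1, "humidifier": -0.4}) picks "automobile" (the scores here are made up).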

For general use, numerous very large corpora are available for free on the web (e.g. WebBase or Project Gutenberg). Linguistics departments generally maintain collections of corpora as well.

IMPORTANT: If your corpus is not already lowercase and punctuation-free, you'll almost definitely want to check the box that reads "Convert to lower case, ignore punctuation" when training your model. Otherwise, commas, periods, etc. will be considered parts of words (meaning that strings like "That", "that" and "that," will all be considered completely separate words).
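If you would rather clean a corpus yourself before training, the following Python sketch shows the general idea. It only approximates what the checkbox does: the normalize function is hypothetical, it strips ASCII punctuation only, and LMOSS's exact rules (for instance, how it treats apostrophes) are not documented here.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(line):
    """Lowercase a line, drop ASCII punctuation, and split on whitespace."""
    return line.lower().translate(_PUNCT_TABLE).split()
```

With this, "That", "that" and "that," all collapse to the single token "that". Note that apostrophes are removed too, so "it's" becomes "its".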

You can download the program here (Windows only; works on Vista and XP). If it won't run at all, you may need to first download and install the .NET Redistributables v. 3.5 from Microsoft.

LMOSS allows PMI to be trained using window sizes of a user-specified length. Counting co-occurrences within documents rather than windows can be simulated by setting the window size to a number of words larger than any document in the corpus, as LMOSS does not allow text windows to simultaneously include text from two adjacent documents. LMOSS treats each line of a corpus as a separate document.
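The windowing behavior described above can be sketched in Python as follows. This is an illustration under stated assumptions, not LMOSS's implementation: the function name count_cooccurrences is made up, a "window" of n words is taken to mean that two words co-occur when they are fewer than n positions apart, and pairs are counted without regard to order.

```python
from collections import Counter

def count_cooccurrences(lines, window):
    """Count word and unordered-pair frequencies, one document per line.

    Two words co-occur if they fall within the same `window`-word span of a
    single line; a window never crosses a line (document) boundary.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for line in lines:
        tokens = line.split()
        word_counts.update(tokens)
        for i, left in enumerate(tokens):
            # Look ahead at most window - 1 tokens, staying inside this line.
            for right in tokens[i + 1 : i + window]:
                if left != right:
                    pair_counts[tuple(sorted((left, right)))] += 1
    return word_counts, pair_counts
```

Setting window to a value larger than the longest line makes every pair of distinct words in a line co-occur, which gives the document-level counting mentioned above.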



PMI/ORDER is a variant of PMI that takes local n-gram information (e.g. positional information about surrounding words) into account, in addition to co-occurrence information. LMOSS is primarily intended for training plain-vanilla PMI measures on your own corpora, but if you want to, you can also try out PMI/ORDER to see how it does. We found modest improvements when incorporating positional information about surrounding words.

To calculate the similarity of two terms (say car and truck) using PMI/ORDER, LMOSS first calculates PMI(car, truck) using the standard PMI formula (dropping the log, as we're interested in relative similarity ratings). Then it adds to this value

q * (# of times word w appears immediately before car) * (# of times word w appears immediately before truck) / [(freq w * freq car) * (freq w * freq truck)]

for each word w in the lexicon. q is a weighting parameter. That means if a particular word often appears just before car, and the same word often appears just before truck, car and truck get a similarity boost. It does this not only for the words appearing immediately before truck and car, but for a window of -m to +m different positions (m is specified by the user), with each position weighted by how close it is to the word.
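To make the positional term concrete, here is a rough Python sketch of the boost that gets added to the log-free PMI value. Several details are assumptions: the function names position_counts and order_boost are invented for this example, and the 1/|k| weight for a position k words away is only one plausible reading of "weighted by how close it is to the word", since the exact weighting is not given above.

```python
from collections import Counter, defaultdict

def position_counts(lines, m):
    """Return (freq, pos), where pos[k][(w, target)] counts how often word w
    appears at offset k from target, for k in -m..-1 and 1..m."""
    freq = Counter()
    pos = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        freq.update(tokens)
        for i, target in enumerate(tokens):
            for k in range(-m, m + 1):
                j = i + k
                if k != 0 and 0 <= j < len(tokens):
                    pos[k][(tokens[j], target)] += 1
    return freq, pos

def order_boost(a, b, freq, pos, q, m):
    """Sum the positional term over every offset and every word w in the lexicon."""
    total = 0.0
    for k in range(-m, m + 1):
        if k == 0:
            continue
        weight = 1.0 / abs(k)  # ASSUMPTION: nearer positions count for more
        for w in freq:
            n_a = pos[k][(w, a)]
            n_b = pos[k][(w, b)]
            if n_a and n_b:
                denom = (freq[w] * freq[a]) * (freq[w] * freq[b])
                total += q * weight * (n_a * n_b) / denom
    return total
```

For example, in a toy corpus where "the" precedes both car and truck and "stopped" follows both, each shared neighbor contributes to the boost for (car, truck), while (car, humidifier) gets no boost unless the two words share neighbors at the same positions.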