dirichlet-topic.pl 1.0


This package is a simple way to find words that are indicative of a certain text genre. It uses smoothing, so there is no need to define stop-words or remove low-frequency words.

It is licensed under the Apache License, Version 2.0, and can be used for personal, research, or commercial purposes. If you use it for something or find it interesting, I'd appreciate it if you provided a link to the original source and told me what you did!

Download it here: dirichlet-topic.pl Ver. 1.0

I've also prepared an example data set that includes articles from Wikipedia on several subjects (IT devices, sports, and countries).

Usage

It comes with three scripts:

combine-counts.pl
Take two or more files, and combine the counts of each word into a single file.
Usage: ./combine-counts.pl file1.txt file2.txt ... > wordcount.txt
dirichlet-topic.pl
Convert a word-count file into a file indicating P(t|w) where w is a word, and t is a topic.
Usage: ./dirichlet-topic.pl < wordcount.txt > wordprob.txt
find-best.pl
Find the n words that are most representative of each topic.
Usage: ./find-best.pl 100 < wordprob.txt > best.txt
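
To make the first and last steps concrete, here is a minimal in-memory sketch in Python. It is not the Perl code shipped in the package. The function names combine_counts and find_best, the whitespace tokenization, the one-text-per-topic input, and the nested-dictionary layout are all assumptions made for illustration, since the actual file formats are not described here. The probability step in the middle is left as an input to find_best.

```python
from collections import Counter
import heapq


def combine_counts(texts_by_topic):
    """Count whitespace-separated tokens per topic.

    texts_by_topic: {topic: text} (hypothetical input layout).
    Returns {word: {topic: count}}.
    """
    counts = {}
    for topic, text in texts_by_topic.items():
        for word, n in Counter(text.split()).items():
            counts.setdefault(word, {})[topic] = n
    return counts


def find_best(word_probs, n):
    """Pick the n words with the highest P(t|w) for each topic.

    word_probs: {word: {topic: probability}} (hypothetical layout).
    Returns {topic: [words, best first]}.
    """
    topics = {t for probs in word_probs.values() for t in probs}
    best = {}
    for t in sorted(topics):
        # Words with no entry for this topic are ranked as probability 0.
        best[t] = heapq.nlargest(
            n, word_probs, key=lambda w: word_probs[w].get(t, 0.0)
        )
    return best
```

For example, combine_counts({"sports": "ball goal ball", "it": "cpu ball"}) records that "ball" appears twice under "sports" and once under "it", and find_best(probs, 100) mirrors the role of ./find-best.pl 100.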

Details

The scripts find indicative words by estimating P(t|w), where t is a topic and w is a particular word. To keep rare words from receiving high probabilities, the estimate is smoothed with a Dirichlet prior:

P(t|w) = (c(w,t) + P(t)*alpha) / (c(w) + alpha)

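Here c(w) is the count of word w, c(w,t) is the count of word w in topic t, P(t) is the overall probability of topic t, and alpha controls how strongly the estimate is pulled toward P(t).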
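
The formula above can be sketched in a few lines of Python. This is an illustration, not the package's Perl implementation. The function name smoothed_topic_probs, the default alpha of 10, and the choice to estimate P(t) as each topic's share of all counted tokens are assumptions, so treat them as placeholders.

```python
def smoothed_topic_probs(counts, alpha=10.0):
    """Apply P(t|w) = (c(w,t) + P(t)*alpha) / (c(w) + alpha).

    counts: {word: {topic: c(w,t)}} (hypothetical layout).
    Returns {word: {topic: P(t|w)}}.
    """
    # Assumption: P(t) is the fraction of all tokens that belong to topic t.
    topic_totals = {}
    for per_topic in counts.values():
        for t, c in per_topic.items():
            topic_totals[t] = topic_totals.get(t, 0) + c
    total = sum(topic_totals.values())
    prior = {t: c / total for t, c in topic_totals.items()}

    probs = {}
    for w, per_topic in counts.items():
        c_w = sum(per_topic.values())  # c(w): count of w over all topics
        probs[w] = {
            t: (per_topic.get(t, 0) + prior[t] * alpha) / (c_w + alpha)
            for t in prior
        }
    return probs


counts = {
    "goal": {"sports": 1},           # rare word, seen once
    "the": {"sports": 50, "it": 50}, # frequent word, spread evenly
    "cpu": {"it": 9},                # moderately frequent, one topic only
}
probs = smoothed_topic_probs(counts)
```

Without smoothing (alpha = 0), "goal" would get P(sports|goal) = 1 from a single occurrence. With alpha > 0 its estimate stays close to the prior P(sports), while "cpu", with nine occurrences in one topic, moves further from the prior. This is why rare words do not need to be filtered out beforehand.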

For more details about the method and how to estimate alpha, take a look here.