process corpus for lda
In a blog post I wrote about the Python package lda (see here), I used the pre-processed data included with the lda package for the example. I have since received many questions about the document-term matrix, the titles, and the vocabulary: where do they come from? This gist uses the textmining package to (hopefully) help answer these questions.
Install textmining package
To install textmining use pip (create a virtual environment first, if you'd like):
$ pip install textmining
Run script from command line
The script can be run from the command line in the usual way:
$ python lda_textmine_ex.py
The output should look like:
**These are the 'documents', making up our 'corpus':
document 1: John and Bob are brothers.
document 2: John went to the store. The store was closed.
document 3: Bob went to the store too.
-- In real applications, these 'documents' might be read from files, websites, etc.

**These are the 'document titles':
title 1: Brothers.
title 2: John to the store.
title 3: Bob to the store.
-- In real applications, these 'titles' might be the file name, the story title, webpage title, etc.

** The textmining packages is one tool for creating the 'document-term' matrix, 'vocabulary', etc.
You can write your own, if needed.

** Output produced by the textmining package...

* The 'document-term' matrix
type(X): <type 'numpy.ndarray'>
shape: (3, 12)
X:
[[1 0 1 0 1 0 1 1 0 0 0 0]
 [0 2 0 1 0 1 0 1 1 1 2 0]
 [0 1 0 1 0 0 1 0 0 1 1 1]]
-- Notice there are 3 rows, for 3 'documents' and 12 columns, for 12 'vocabulary' words
-- The number of rows and columns depends on the number of documents and number of unique words in -all- documents

* The 'vocabulary':
type(vocab): <type 'tuple'>
len(vocab): 12
vocab: ('and', 'the', 'brothers', 'to', 'are', 'closed', 'bob', 'john', 'was', 'went', 'store', 'too')
-- These are the 12 words in the vocabulary
-- Often common 'stop' words, like 'and', 'the', 'to', etc are filtered out -before- creating the document-term matrix and vocab

* Again, the 'titles' for this 'corpus':
type(titles): <type 'tuple'>
len(titles): 3
titles: ('Brothers.', 'John to the store.', 'Bob to the store.')
Hopefully this gives a sense of how a set of documents (a corpus) relates to the document-term matrix, the vocabulary, and the titles mentioned in the original post.
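The output above mentions that you can write your own tool if needed. As a rough illustration, here is a minimal sketch that builds a document-term matrix and vocabulary for the same three documents using only the Python standard library. It is not the code in lda_textmine_ex.py and does not use textmining; the helper names tokenize and build_doc_term_matrix, the regex-based tokenization, and the sorted vocabulary order are my own assumptions for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical helper: lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_doc_term_matrix(docs, stop_words=()):
    """Hypothetical helper: return (X, vocab).

    X[i][j] is the number of times vocab[j] occurs in docs[i].
    Words listed in stop_words are dropped before counting.
    """
    stop = set(stop_words)
    # One word-count table per document.
    counts = [Counter(w for w in tokenize(doc) if w not in stop) for doc in docs]
    # Vocabulary: every unique word across all documents.
    # Sorting is an assumption made here to get a stable column order.
    vocab = tuple(sorted(set().union(*counts)))
    # A Counter returns 0 for words a document does not contain.
    X = [[c[word] for word in vocab] for c in counts]
    return X, vocab


docs = [
    "John and Bob are brothers.",
    "John went to the store. The store was closed.",
    "Bob went to the store too.",
]
titles = ("Brothers.", "John to the store.", "Bob to the store.")

X, vocab = build_doc_term_matrix(docs)
print("shape:", (len(X), len(vocab)))
print("vocab:", vocab)
for title, row in zip(titles, X):
    print(title, row)

# Same corpus with a few common stop words filtered out first.
X_filtered, vocab_filtered = build_doc_term_matrix(docs, stop_words=("and", "the", "to"))
print("filtered vocab:", vocab_filtered)
```

Because the vocabulary is sorted here, the columns appear in a different order than in the textmining output, but each word has the same count in each document. X is a plain list of lists; if you want a NumPy array like the one shown above, numpy.array(X) should do the conversion.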