@johnlaudun
Created January 31, 2017 19:06
What I've been working on for the past few days is in preparation for attempting a topic model using the more established LDA instead of NMF, to see how well they compare -- with the understanding that, since there is rarely a one-to-one matchup within either method, there will be no such match across them.
Because LDA does not filter out common words on its own, the way the NMF method does, you have to start with a stoplist. I know we can begin with Blei's and a few other established lists, but I would also like to be able to compare those against our own results. My first thought was to build a dictionary of words and their frequency within the corpus. For convenience's sake, I am using the NLTK.
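As a starting point for that comparison, here is a minimal sketch of loading an established stoplist, assuming NLTK's stopwords corpus has already been downloaded (Blei's list would instead be read from a plain-text file):
```python
# Load NLTK's established English stoplist for later comparison with the
# corpus-derived frequencies (assumes nltk.download('stopwords') has been run).
from nltk.corpus import stopwords

established_stops = set(stopwords.words('english'))
print(len(established_stops), "words in NLTK's English stoplist")
```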
Just as a record of what I've done, here's the usual code for loading the talks from the CSV with everything in it:
```python
import pandas
import re
# Get all talks in a list & then into one string
colnames = ['author', 'title', 'date', 'length', 'text']
df = pandas.read_csv('../data/talks-v1b.csv', names=colnames)
talks = df.text.tolist()
alltalks = " ".join(str(item) for item in talks)  # str() works around the float (NaN) values in talks
# Clean out all punctuation except apostrophes
all_words = re.sub(r"[^\w\d'\s]+", '', alltalks).lower()
```
We still need to identify which talks have floats for values and determine what impact, if any, it has on the project.
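One way to do that, sketched against the DataFrame loaded above (the column names are the ones defined in `colnames`):
```python
# Flag the rows whose `text` value is a float -- typically NaN from an
# empty cell -- rather than a string, so we can see what is missing.
float_rows = df[df['text'].apply(lambda value: isinstance(value, float))]
print(len(float_rows), "talks have non-string text values")
print(float_rows[['author', 'title']])
```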
```python
import nltk
tt_tokens = nltk.word_tokenize(all_words)
tt_freq = {}
for word in tt_tokens:
    try:
        tt_freq[word] += 1
    except KeyError:
        tt_freq[word] = 1
```
Using this method, the dictionary has 63,426 entries. Most of those are going to be single-occurrence items or named entities, but I do think it's worth looking at them, as well as at the high-frequency words that may not be part of established stopword lists: I think it will be important to note those words which are specifically common to TED Talks.
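A sketch of both checks, assuming the `tt_freq` dictionary built above and NLTK's stopwords corpus; the frequency cutoff of 5,000 is arbitrary, purely for illustration:
```python
# Count the single-occurrence words and surface high-frequency words
# that NLTK's English stoplist does not cover.
from nltk.corpus import stopwords

established = set(stopwords.words('english'))

hapaxes = [word for word, count in tt_freq.items() if count == 1]
print(len(hapaxes), "words occur only once")

# High-frequency words missing from the established list (cutoff is arbitrary)
ted_common = sorted(
    ((count, word) for word, count in tt_freq.items()
     if count > 5000 and word not in established),
    reverse=True)
print(ted_common[:20])
```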
I converted the dictionary to a list of tuples in order to be able to sort it -- I see that there are ways to sort a dictionary directly in Python, but this is the way I know. Looking at the most common words, I saw that NLTK didn't get rid of punctuation: I cleared this up by removing punctuation earlier in the process, keeping the contractions (words with apostrophes), which the NLTK tokenizer does not respect.
**N.B.** I tried doing this simply with a regular expression that split on whitespace, but I am still seeing contractions split into different words.
```python
# Convert the dictionary into a sortable list of (count, word) tuples
tt_freq_list = [(val, key) for key, val in tt_freq.items()]
tt_freq_list.sort(reverse=True)
tt_freq_list[0:20]
[(210294, 'the'),
(151163, 'and'),
(126887, 'to'),
(116155, 'of'),
(106547, 'a'),
(96375, 'that'),
(83740, 'i'),
(78986, 'in'),
(75643, 'it'),
(71766, 'you'),
(68573, 'we'),
(65295, 'is'),
(56535, "'s"),
(49889, 'this'),
(37525, 'so'),
(33424, 'they'),
(32231, 'was'),
(30067, 'for'),
(28869, 'are'),
(28245, 'have')]
```
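For comparison only, the same ranking can be had from `collections.Counter`; this is a sketch, not a change to the workflow above, and note that `most_common()` returns `(word, count)` pairs rather than the `(count, word)` tuples shown here:
```python
# The same frequency ranking via collections.Counter, which sorts by
# count directly through most_common().
from collections import Counter

tt_counter = Counter(tt_tokens)
print(tt_counter.most_common(20))
```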
Keeping the apostrophes proved to be harder than I thought -- so I tried going a "pure Python" route and splitting only on whitespace, trying both of the following:
```python
word_list = re.split(r'\s+', all_words)
word_list = all_words.split()
```
I still got: ` (56535, "'s"),`. (The good news is that the counts match.)
Okay, good news. The NLTK whitespace tokenizer works:
```python
from nltk.tokenize import WhitespaceTokenizer
white_words = WhitespaceTokenizer().tokenize(all_words)
```
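A quick sanity check that contractions survive, using a made-up sample string:
```python
# Contractions stay intact under whitespace tokenization.
from nltk.tokenize import WhitespaceTokenizer

print(WhitespaceTokenizer().tokenize("it's what we're counting"))
# ["it's", 'what', "we're", 'counting']
```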
I tried using scikit-learn's `CountVectorizer`, but it requires a list of strings, not one string, **and** it does not like that some of the texts are floats. So we'll save dealing with that for when we look at this corpus as a corpus and not as one giant collection of words.
```python
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(talks)
# Raises:
# ValueError: np.nan is an invalid document, expected byte or unicode string.
```
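One possible workaround, sketched here on the assumption that the rows with missing text can simply be dropped and the rest cast to strings rather than repaired:
```python
# Drop the rows whose text is missing (NaN) and cast the remainder to
# strings before handing the documents to CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

clean_talks = df['text'].dropna().astype(str).tolist()
count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(clean_talks)
print(word_counts.shape)  # (number of documents, vocabulary size)
```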
The final, working script of the day produces the output we want:
```python
# Tokenize on whitespace
from nltk.tokenize import WhitespaceTokenizer
tt_tokens = WhitespaceTokenizer().tokenize(all_words)
# Build a dictionary of words and their frequency in the corpus
tt_freq = {}
for word in tt_tokens:
    try:
        tt_freq[word] += 1
    except KeyError:
        tt_freq[word] = 1
# Build a list of tuples, sort, and see some results
tt_freq_list = [(val, key) for key, val in tt_freq.items()]
tt_freq_list.sort(reverse=True)
tt_freq_list[0:20]
```