@djokester
Created August 7, 2018 08:09
from nltk import word_tokenize
from nltk.corpus import stopwords
from gensim import models
from gensim.models.doc2vec import TaggedDocument
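# Build the stop-word set once, rather than reloading the list for every token.
# The NLTK tokenizer and stop-word data must already be downloaded.
STOP_WORDS = set(stopwords.words('english'))


# Normalize a paragraph: tokenize, lowercase, keep alphabetic tokens, drop stop words.
def normalize(string):
    tokens = word_tokenize(string)
    tokens = [word.lower() for word in tokens if word.isalpha()]
    return [w for w in tokens if w not in STOP_WORDS]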
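
# The loop further down expects topic_list and question_list to exist already, but the gist
# does not show how they are built. The block below is a minimal sketch of one way to get
# that shape, assuming the raw data is a list of (topic, question) pairs. The names
# raw_pairs and group_by_topic and the sample questions are hypothetical, not part of the
# original gist.

```python
from collections import defaultdict

# Hypothetical raw data: (topic tag, question text) pairs.
raw_pairs = [
    ("python", "How do I reverse a list?"),
    ("gensim", "How do I train a Doc2Vec model?"),
    ("python", "What is a list comprehension?"),
]


def group_by_topic(pairs):
    """Group question strings under their topic tag, keeping first-seen topic order."""
    grouped = defaultdict(list)
    for topic, question in pairs:
        grouped[topic].append(question)
    topics = list(grouped)                      # one entry per distinct topic
    questions = [grouped[t] for t in topics]    # questions[i] belongs to topics[i]
    return topics, questions


topic_list, question_list = group_by_topic(raw_pairs)
```

# With this layout, question_list[i] holds every question string filed under topic_list[i],
# which is the pairing the loop relies on.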

# topic_list and question_list are expected to be defined beforehand:
# question_list[i] holds the question strings that belong to topic_list[i].
# For each topic:
#   1. Join its questions into a single paragraph.
#   2. Normalize the paragraph.
#   3. Wrap the tokens and the topic tag in a gensim TaggedDocument.
#   4. Append the result to docs.
docs = []
for topic, questions in zip(topic_list, question_list):
    paragraph = " ".join(questions)
    tokens = normalize(paragraph)
    docs.append(TaggedDocument(words=tokens, tags=[topic]))