@gautamkrishnar
Created December 27, 2016 08:54
Script to generate a comma-separated values (.csv) file from a given input file containing normal English words and Unicode characters...
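from nltk.corpus import wordnet  # Natural Language Toolkit (http://www.nltk.org/)
import codecs


def check(word):
    """Return 0 if WordNet has an entry for the word (treated as English), else 1."""
    if not wordnet.synsets(word):
        return 1
    else:
        return 0


a = 0  # Running count of English words found

if __name__ == '__main__':
    out = codecs.open('csvout.txt', 'w', "utf8")  # Output file
    with codecs.open("Dictionary.txt", "r", "utf8") as f:  # Input file
        for line in f:
            words = line.split(" ")
            flag = 0  # 0: nothing written yet, 1: inside first field, 2: inside second field
            for word in words:
                test = check(word)
                if test == 0:
                    if flag == 0:
                        flag = 1
                        out.write('"')  # Open the first (English) field
                    out.write(word)
                    a += 1
                    print("Word found:" + str(a))
                if test == 1:
                    if flag == 1:
                        flag = 2
                        out.write('","')  # Close the first field, open the second
                    out.write(word.replace("\n", ""))
            out.write('"\n')  # Close the row
    out.close()
    print("Done....")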
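The same splitting idea can also be sketched with Python's standard csv module, which takes care of quoting. This is only a sketch under assumptions, not the script above: `split_row`, `convert`, and the `is_english` callable are hypothetical names introduced here (with NLTK, `is_english` could be `lambda w: bool(wordnet.synsets(w))`). Unlike the script, it joins words with single spaces, skips blank lines, and puts everything from the first non-English word onward into the second field.

```python
import csv
import io


def split_row(line, is_english):
    """Split one input line into (english_part, other_part).

    `is_english` is a hypothetical callable(word) -> bool supplied by the caller.
    """
    english, other = [], []
    for word in line.split():
        # After the first non-English word, all remaining words go to the second field
        if not other and is_english(word):
            english.append(word)
        else:
            other.append(word)
    return " ".join(english), " ".join(other)


def convert(lines, out, is_english):
    """Write one fully quoted CSV row per non-blank input line."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in lines:
        if line.strip():
            writer.writerow(split_row(line, is_english))


if __name__ == "__main__":
    # Stand-in word list so the sketch runs without NLTK installed
    known = {"good", "morning"}
    buf = io.StringIO()
    convert(["good morning XYZ\n"], buf, lambda w: w.lower() in known)
    print(buf.getvalue(), end="")
```

When writing to a real file instead of a StringIO, opening it with `open("csvout.txt", "w", encoding="utf8", newline="")` keeps the csv module in control of line endings.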

gautamkrishnar commented Dec 27, 2016

Usage instructions

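  • Download and install nltk using pip:
pip install nltk
  • Download the WordNet corpus, which wordnet.synsets() needs:
python -m nltk.downloader wordnet
  • Put the input file Dictionary.txt in the working directory and run the script; the output is written to csvout.txt:
python english-words.py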