Irfan Hanandra irfanandratama

  • Singaraja District Court
  • Bali
irfanandratama / senttoknize.py
Last active April 12, 2018 03:45
Tokenizer with Python
import re

# Split text into sentences
def senttoken():
    kalimat = input()  # append .lower() to apply case folding at the same time
    # Split on whitespace that follows ., ? or !, skipping abbreviations such as "e.g." or "Mr."
    kalimat = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s', kalimat)
    print(kalimat)
    return kalimat
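To see what the pattern does without typing into `input()`, here is a minimal sketch that wraps the same regular expression in a function taking the text as an argument. The name `split_sentences` and the sample sentence are assumptions made for illustration; they are not part of the original gist.

```python
import re

# Same pattern as in senttoken(): split on whitespace that follows
# '.', '?' or '!', unless the preceding text looks like an abbreviation.
SENTENCE_BOUNDARY = re.compile(
    r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
)

def split_sentences(text):
    """Hypothetical helper: return the sentences found in `text`."""
    return SENTENCE_BOUNDARY.split(text)

sentences = split_sentences("Mr. Smith went home. He slept! Did you see him?")
print(sentences)
# ['Mr. Smith went home.', 'He slept!', 'Did you see him?']
```

The space after "Mr." is not treated as a boundary because the lookbehind `(?<![A-Z][a-z]\.)` rejects it. Longer abbreviations such as "Prof." are not covered by the pattern and would still be split.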
irfanandratama / TF-IDF
Created April 12, 2018 03:29
TF-IDF representation with Python
# The text must already have been word-tokenized.
def tf(sudahDiTokenize):  # Term Frequency
    wordlist = sudahDiTokenize
    # flat_list = [item for sublist in wordlist for item in sublist]  # if using normalized tf
    # jumkata = len(flat_list)  # if using normalized tf
    wordfreq = {}
    # Count how often each token occurs across all token lists
    for w in wordlist:
        for o in w:
            wordfreq[o] = wordfreq.get(o, 0) + 1