@chokkan
Created October 10, 2012 13:42
Feature Extractor for Spam Filtering
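import sys
import os
import gzip

def ngram(T, n):
    # Build features such as '2gram=free_money' from each run of n tokens.
    return ['%dgram=%s' % (n, '_'.join(T[i:i+n])) for i in range(len(T)-n+1)]

def process(fo, fi, label):
    # Collect unigram and bigram features from every line of one message.
    F = []
    for line in fi:
        fields = line.strip('\n').split(' ')
        F += ngram(fields, 1)
        F += ngram(fields, 2)
    # Write one instance per file: the label followed by its features.
    fo.write('%s ' % label)
    # Replace ':' in feature names with a placeholder (presumably because
    # ':' is reserved in the output format).
    fo.write('%s' % ' '.join([f.replace(':', '__COLON__') for f in F]))
    fo.write('\n')

if __name__ == '__main__':
    fo = sys.stdout
    for src in sys.argv[1:]:
        # Files whose names start with 'spmsg' are labeled +1; all others -1.
        label = '+1' if os.path.basename(src).startswith('spmsg') else '-1'
        # Open in text mode so that lines are str, not bytes, under Python 3.
        fi = gzip.open(src, 'rt') if src.endswith('.gz') else open(src)
        with fi:
            process(fo, fi, label)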
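
To make the output format concrete, the following is a minimal sketch that runs the same unigram and bigram extraction on an in-memory message instead of a file. The two-line message text and the use of `io.StringIO` are assumptions made for illustration and are not part of the gist. The functions are repeated here (with longer variable names) so that the snippet runs on its own.

```python
import io

def ngram(tokens, n):
    # Same construction as in the script: 'Ngram=' plus n tokens joined by '_'.
    return ['%dgram=%s' % (n, '_'.join(tokens[i:i + n]))
            for i in range(len(tokens) - n + 1)]

def process(fo, fi, label):
    # One output line per input stream: label, then all features.
    features = []
    for line in fi:
        fields = line.strip('\n').split(' ')
        features += ngram(fields, 1)
        features += ngram(fields, 2)
    fo.write('%s ' % label)
    fo.write(' '.join(f.replace(':', '__COLON__') for f in features))
    fo.write('\n')

# Hypothetical message, invented for this example.
message = io.StringIO('Subject: free money\nclick here\n')
out = io.StringIO()
process(out, message, '+1')
result = out.getvalue()
print(result, end='')
```

Bigrams are built within each line only, so no feature spans a line break, and every `:` in a token is rewritten to `__COLON__` before the line is written.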