@netom
Created January 20, 2016 21:30
Testing marisa-trie performance with 2 million filenames (read from find.txt; the file list is not included)
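#!/usr/bin/env python
#-*- coding: UTF-8 -*-
import marisa_trie

def uread(f):
    # Lazily decode each raw line as UTF-8, replacing invalid bytes.
    # The trailing newline is kept as part of each entry.
    for line in f:
        yield line.decode('utf8', 'replace')

ds = []
with open('find.txt', 'rb') as f:
    print('building data structure...')
    # Enable exactly one of the three alternatives to compare them:
    #ds = uread(f) # Use raw generator, lazy read
    #ds = list(uread(f)) # Build a list
    ds = marisa_trie.Trie(uread(f)) # Build a trie

    # Count inside the with block so the lazy generator variant can still read the file.
    print('counting... ')
    c = 0
    for e in ds:
        c += 1

print('done: ' + str(c))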
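
The script above only prints a count, so any timing has to come from an external tool. As a rough sketch of how the build and iteration phases could be timed separately, the helper below wraps both steps with `time.perf_counter` (Python 3). The name `timed_count` and the sample paths are hypothetical and not part of the original gist; the demo uses `list` and `set` so that it runs without marisa-trie installed.

```python
import time


def timed_count(build, lines):
    """Build a container from `lines` using `build`, then count its items.

    Returns a tuple (count, build_seconds, iterate_seconds).
    """
    start = time.perf_counter()
    container = build(lines)           # e.g. list, set, or marisa_trie.Trie
    built = time.perf_counter()
    count = sum(1 for _ in container)  # same work as the counting loop in the script
    done = time.perf_counter()
    return count, built - start, done - built


if __name__ == '__main__':
    # Hypothetical sample data standing in for the lines of find.txt.
    sample = ['/usr/bin/python\n', '/usr/bin/perl\n', '/usr/bin/python\n']
    for name, build in [('list', list), ('set', set)]:
        count, build_s, iter_s = timed_count(build, sample)
        print('%s: %d items, build %.6fs, iterate %.6fs'
              % (name, count, build_s, iter_s))
```

Passing `marisa_trie.Trie` as `build` together with `uread(f)` as `lines` should time the trie variant in the same way. Like a `set`, a trie stores each distinct key once, so its count can be lower than the list count if find.txt contains duplicate lines.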