@bittlingmayer
Last active March 4, 2019 22:56
fastText pre-trained vectors preprocessing [moved to ftio.wiki.preproc - pip install ftio / https://github.com/SignalN/ftio]
# See https://github.com/facebookresearch/fastText/blob/master/get-wikimedia.sh
#
# From https://github.com/facebookresearch/fastText/issues/161:
#
# We now have a script called 'get-wikimedia.sh', that you can use to download and
# process a recent wikipedia dump of any language. This script applies the preprocessing
# we used to create the published word vectors.
#
# The parameters we used to build the word vectors are the default skip-gram settings,
# except with a dimensionality of 300 as indicated on the top of the list of word
# vectors (we now understand that this could be more visible).
# See also: known issues with the original script https://github.com/facebookresearch/fastText/issues/281, which unfortunately we must re-implement here.
'''
sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
-e 's/"/ " /g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
-e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
-e 's/«/ /g' | tr 0-9 " "
'''
import subprocess

# One sed expression per rule, in the same order as the shell command above.
SUBEXES = [
    "s/’/'/g", "s/′/'/g", "s/''/ /g", "s/'/ ' /g", 's/“/"/g', 's/”/"/g',
    's/"/ " /g', "s/\\./ \\. /g", "s/<br \\/>/ /g", "s/, / , /g",
    "s/(/ ( /g", "s/)/ ) /g", "s/\\!/ \\! /g", "s/\\?/ \\? /g",
    "s/\\;/ /g", "s/\\:/ /g", "s/-/ - /g", "s/=/ /g", "s/=/ /g",
    "s/*/ /g", "s/|/ /g", "s/«/ /g",
]

def __normalize_text(s):
    # Pipe the text through sed once per expression: slow, but easy to compare with the script.
    for subex in SUBEXES:
        s = subprocess.check_output(['sed', subex], input=s.encode()).decode("utf-8")
    return s
# Program to filter Wikipedia XML dumps to "clean" text consisting only of lowercase
# letters (a-z, converted from A-Z), and spaces (never consecutive).
# All other characters are converted to spaces. Only text which normally appears
# in the web browser is displayed. Tables are removed. Image captions are
# preserved. Links are converted to normal text. Digits are spelled out.
# *** Modified to not spell digits or throw away non-ASCII characters ***
# Written by Matt Mahoney, June 10, 2006. This program is released to the public domain.
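def __spaces(s):
    # Collapse runs of whitespace into single spaces and trim both ends.
    return ' '.join(s.split())

def __digits(s):
    # Replace each digit with a space, as `tr 0-9 " "` does in the shell command above,
    # so that 'a1b' becomes 'a b' rather than 'ab'.
    return ''.join(' ' if c.isdigit() else c for c in s)

def preproc(s):
    # Collapse spaces last so that the gaps left by digits are squeezed too.
    return __spaces(__digits(__normalize_text(s.lower())))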
# Example output:
#
# >>> preproc("Г. Шмидт, можно сказать «Давай давай!»?")
# 'г . шмидт , можно сказать давай давай ! » ?'
# >>> preproc('It won 1st place in the 3D film contest.')
# 'it won st place in the d film contest .'
@poppingtonic

Should it remove numbers like this?

# >>> preproc('It won 1st place in the 3D film contest.')
# 'it won st place in the d film contest .'

@bittlingmayer
Author

bittlingmayer commented Aug 28, 2017

@poppingtonic Unfortunately, yes. The models were trained like that so now there is no vector for tokens like '3d'.

See facebookresearch/fastText#281

@jfilter

jfilter commented May 9, 2018

There is one ']' too many in the SUBEXES line.

@abeer-khan

I'm using Fasttext's preprocessing method: https://gist.github.com/bi…/7139a6a75ba0dbbc3a06325394ae3a13
My text is in English.

It takes a very long time (about 500s for 1000 documents, none of which is very long). Is this normal?

@bittlingmayer
Author

@jfilter Thanks, fixed.

@bittlingmayer
Author

@abeerunscore96

It really depends on the size of the documents. Is each document a list of rows, or by "document" do you mean a single row?

0.5s per document is a lot, but it could be normal for a large document. If that is the time for a single row, something is definitely wrong.

@bittlingmayer
Author

Have you profiled to understand which section is taking a long time? Are you able to share the documents or one of the documents?

This code prioritises readability over performance, to more easily keep parity with the bash script.
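
One likely contributor to the slowdown is that `__normalize_text` starts a separate `sed` process for every expression, so each call pays for 22 process launches. A minimal pure-Python sketch of the same rules is below. The names `REPLACEMENTS`, `normalize_text_py` and `preproc_py` are hypothetical and not part of the gist or of ftio. The sketch assumes every sed rule is meant as a literal, global substitution, and it follows the shell command in padding double quotes with spaces and turning digits into spaces.

```python
# Hypothetical pure-Python counterpart to the sed rules; not the gist's own code.
# Assumption: every sed expression is a literal, global, left-to-right substitution.
REPLACEMENTS = [
    ("’", "'"), ("′", "'"), ("''", " "), ("'", " ' "),
    ("“", '"'), ("”", '"'), ('"', ' " '), (".", " . "),
    ("<br />", " "), (", ", " , "), ("(", " ( "), (")", " ) "),
    ("!", " ! "), ("?", " ? "), (";", " "), (":", " "),
    ("-", " - "), ("=", " "), ("*", " "), ("|", " "), ("«", " "),
]


def normalize_text_py(s):
    # Apply the substitutions in the same order as the shell command.
    for old, new in REPLACEMENTS:
        s = s.replace(old, new)
    return s


def preproc_py(s):
    s = normalize_text_py(s.lower())
    # Digits become spaces (the shell command uses `tr 0-9 " "`).
    s = "".join(" " if c.isdigit() else c for c in s)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(s.split())
```

On the two examples in the gist this gives the same strings as shown there, but parity with `sed` on arbitrary input (multi-line text, unusual escapes) is not guaranteed and should be checked against the shell script before relying on it.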

@bittlingmayer
Author

Dear future commenters:

Open an issue at github.com/SignalN/ftio, the repo where this gist is now maintained.

GitHub dashboard and notifications apparently do not include comments on gists.
