Skip to content

Instantly share code, notes, and snippets.

@bittlingmayer
bittlingmayer / ft_wiki_preproc.py
Last active March 4, 2019 22:56
fastText pre-trained vectors preprocessing [moved to ftio.wiki.preproc - pip install ftio / https://github.com/SignalN/ftio]
# See https://github.com/facebookresearch/fastText/blob/master/get-wikimedia.sh
#
# From https://github.com/facebookresearch/fastText/issues/161:
#
# We now have a script called 'get-wikimedia.sh', that you can use to download and
# process a recent wikipedia dump of any language. This script applies the preprocessing
# we used to create the published word vectors.
#
# The parameters we used to build the word vectors are the default skip-gram settings,
# except with a dimensionality of 300 as indicated on the top of the list of word
@kohlmeier
kohlmeier / compressed_features.py
Last active March 2, 2024 18:08
Example of computing compressed features. NOTE: If you want to want to create such features consistently across process, you will need to persist the random components. Easy enough, but I've written the code for that, too, here: https://github.com/Khan/analytics/blob/master/map_reduce/py/random_features.py
import collections
import numpy as np
class CompressedFeatures:
def __init__(self, num_features=50):
self.random_components = collections.defaultdict(
self._generate_component)
self.num_features = num_features
@endolith
endolith / Has weird right-to-left characters.txt
Last active April 7, 2024 01:38
Unicode kaomoji smileys emoticons emoji
ּ_בּ
בּ_בּ
טּ_טּ
כּ‗כּ
לּ_לּ
מּ_מּ
סּ_סּ
תּ_תּ
٩(×̯×)۶
٩(̾●̮̮̃̾•̃̾)۶