@dtrizna
Last active September 22, 2022 14:13
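import re

from nltk.tokenize import WordPunctTokenizer
from sklearn.feature_extraction.text import HashingVectorizer

# Splits text into runs of word characters and runs of punctuation, e.g. "-c" -> "-", "c".
wpt = WordPunctTokenizer()

hvwpt = HashingVectorizer(
    # Replace IPv4-like strings with one placeholder so all addresses share a token.
    preprocessor=lambda x: re.sub(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}", "_IPADDRESS_", x),
    tokenizer=wpt.tokenize,
    token_pattern=None,  # not used because a custom tokenizer is supplied
    lowercase=False,  # keep the original case of the commands
    ngram_range=(1, 2),  # unigrams and bigrams
    n_features=2**18,
)

# raw_commands is assumed to be an iterable of command strings defined elsewhere.
X = {}
X["HashingVectorizer"] = hvwpt.fit_transform(raw_commands)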
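
To see what the vectorizer actually hashes, the following is a minimal standard-library sketch of the steps before hashing: IP masking, tokenization, and n-gram generation. It is not part of the original gist. The `TOKEN_PATTERN` regex is assumed to match the behaviour of NLTK's `WordPunctTokenizer`, and the names `mask_ips`, `tokenize`, `ngrams`, and the example command are hypothetical, introduced only for illustration.

```python
import re

# Same IPv4-like pattern as the preprocessor in the gist.
IP_PATTERN = re.compile(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}")
# Assumption: mirrors WordPunctTokenizer (word runs or punctuation runs).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def mask_ips(command):
    """Replace every IPv4-like substring with the _IPADDRESS_ placeholder."""
    return IP_PATTERN.sub("_IPADDRESS_", command)


def tokenize(command):
    """Split a command into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(command)


def ngrams(tokens, lo=1, hi=2):
    """Build space-joined n-grams for every n in [lo, hi]."""
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


# Hypothetical example command, not taken from any dataset.
example = "ping -c 4 192.168.1.10"
masked = mask_ips(example)
tokens = tokenize(masked)
features = ngrams(tokens)

print(masked)
print(tokens)
print(features)
```

Each string in `features` corresponds to one term that a hashing vectorizer with `ngram_range=(1, 2)` would map to a column index. Because the underscore counts as a word character, `_IPADDRESS_` survives tokenization as a single token.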