@titomus
Created May 23, 2022 08:55
Tokenize js
function tokenize(txt) {
  // Split the text into sentences (one per line) so that generation has several starting points
  const tokens = [];
  // Drop empty and whitespace-only lines so they do not produce empty sentences
  const sentences = txt.split(/\n/).filter((x) => x.trim());
  // Tokenize each sentence by splitting it into whitespace-delimited words
  for (let i = 0; i < sentences.length; i++) {
    // Insert a START marker before each sentence
    tokens.push("START");
    // match() returns null when nothing matches, so fall back to an empty array
    const tks = sentences[i].match(/\S+/g) || [];
    tokens.push(...tks);
  }
  return tokens;
}
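
The comments mention that the START markers give generation a few starting points. The gist does not show the generation step, so the following is only a minimal sketch of one way such a token stream could be consumed, assuming a simple first-order Markov chain. The helper name `buildChain` and the sample tokens are hypothetical and not part of the original snippet.

```javascript
// Hypothetical helper (not part of the gist): map each token to the list of
// tokens observed immediately after it in the stream.
function buildChain(tokens) {
  // A Map avoids clashes with Object.prototype keys such as "constructor"
  const chain = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    const current = tokens[i];
    const next = tokens[i + 1];
    // START opens a new sentence, so it is never recorded as a follower
    if (next === "START") continue;
    if (!chain.has(current)) chain.set(current, []);
    chain.get(current).push(next);
  }
  return chain;
}

// Token stream in the shape tokenize() produces for "the cat sleeps\nthe dog barks"
const sampleTokens = ["START", "the", "cat", "sleeps", "START", "the", "dog", "barks"];
const chain = buildChain(sampleTokens);

console.log(chain.get("START")); // words that can open a sentence
console.log(chain.get("the"));   // words seen after "the"
```

Under this assumption, the followers of START are exactly the words that began a line of the input, so a generator could pick one of them at random to begin each new sentence.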