@titomus
Created May 23, 2022 08:55
Tokenize js
function tokenize(txt) {
  // Split the text into lines (one sentence per line) so that generation has several starting points.
  const tokens = [];
  const sentences = txt.split(/\n/).filter((x) => x);
  // Tokenize each sentence by splitting it on whitespace.
  for (let i = 0; i < sentences.length; i++) {
    // match() returns null for a whitespace-only line, so fall back to an empty array.
    const tks = sentences[i].match(/\S+/g) || [];
    if (tks.length === 0) continue;
    // Insert a START marker at the beginning of each sentence.
    tokens.push("START");
    for (const token of tks) tokens.push(token);
  }
  return tokens;
}
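The comments mention that the START markers serve as starting points for generation, but the gist does not show the generator itself. As a rough sketch of how such a token array might be consumed, assuming a first-order Markov-style model (an assumption, not something the gist confirms), the hypothetical helper `buildTransitions` below records which tokens follow each token. The `sample` array is written out by hand in the shape `tokenize` returns for a two-line input.

```javascript
// Hypothetical helper, not part of the original gist: builds a
// first-order transition table (token -> list of following tokens).
function buildTransitions(tokens) {
  const transitions = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    const current = tokens[i];
    const next = tokens[i + 1];
    // A START marker opens a new sentence, so it is never recorded as a successor.
    if (next === "START") continue;
    if (!transitions.has(current)) transitions.set(current, []);
    transitions.get(current).push(next);
  }
  return transitions;
}

// Token array in the shape tokenize("the cat sat\nthe dog ran") returns.
const sample = ["START", "the", "cat", "sat", "START", "the", "dog", "ran"];
const table = buildTransitions(sample);
console.log(table.get("START")); // [ 'the', 'the' ]
console.log(table.get("the"));   // [ 'cat', 'dog' ]
```

Under this assumption, the successors of START are the possible first words of a generated sentence, which is why one marker is inserted per line.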