A compact tokenizer written in JavaScript.
/*
 * Tiny tokenizer
 *
 * - Accepts a subject string, an object of regular expressions for parsing,
 *   and an optional type name for text that no expression matches
 *   (defaults to 'unknown')
 * - Returns an array of token objects
 *
 * tokenize('this is text.', { word:/\w+/, whitespace:/\s+/, punctuation:/[^\w\s]/ }, 'invalid');
 * result => [{ token: "this", type: "word", matches: [] }, { token: " ", type: "whitespace", matches: [] }, { token: "is", type: "word", matches: [] }, ... ]
 *
 */
function tokenize ( s, parsers, deftok ) {
  var m, r, t, tokens = [];
  while ( s ) {
    t = null;
    m = s.length;
    for ( var key in parsers ) {
      // reset state so regexes with the g or y flag always search from the start
      parsers[ key ].lastIndex = 0;
      r = parsers[ key ].exec( s );
      // try to choose the best match if there are several
      // where "best" is the closest to the current starting point;
      // empty matches are skipped so the loop always consumes input
      if ( r && r[ 0 ].length && ( r.index < m ) ) {
        t = {
          token: r[ 0 ],
          type: key,
          matches: r.slice( 1 )
        };
        m = r.index;
      }
    }
    if ( m ) {
      // there is text between the last token and the currently
      // matched token - push that out as default or "unknown"
      tokens.push({
        token : s.slice( 0, m ),
        type : deftok || 'unknown'
      });
    }
    if ( t ) {
      // push current token onto sequence
      tokens.push( t );
    }
    s = s.slice( m + (t ? t.token.length : 0) );
  }
  return tokens;
}
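
A small usage sketch may help show what the function returns. It includes a condensed copy of `tokenize()` so that it runs on its own; the rule names (`number`, `word`, `space`, `string`), the sample inputs, and the `'other'` default type are made up for illustration and are not part of the gist.

```javascript
// Condensed copy of the gist's tokenize(), repeated so this example is self-contained.
function tokenize(s, parsers, deftok) {
  var m, r, t, tokens = [];
  while (s) {
    t = null;
    m = s.length;
    for (var key in parsers) {
      r = parsers[key].exec(s);
      // keep the match that starts closest to the beginning of s
      if (r && r.index < m) {
        t = { token: r[0], type: key, matches: r.slice(1) };
        m = r.index;
      }
    }
    // text before the chosen match gets the default type
    if (m) {
      tokens.push({ token: s.substr(0, m), type: deftok || 'unknown' });
    }
    if (t) {
      tokens.push(t);
    }
    s = s.substr(m + (t ? t.token.length : 0));
  }
  return tokens;
}

// Hypothetical rule set: '=' matches none of these, so it falls back to 'other'.
var rules = { number: /\d+/, word: /[a-z]+/i, space: /\s+/ };
var tokens = tokenize('x = 42', rules, 'other');
tokens.forEach(function (tok) {
  console.log(tok.type + ': ' + JSON.stringify(tok.token));
});
// word: "x"
// space: " "
// other: "="
// space: " "
// number: "42"

// Capture groups from the winning regex end up in `matches`.
var quoted = tokenize('say "hi"', { string: /"([^"]*)"/, word: /\w+/, space: /\s+/ });
console.log(quoted[2].type, quoted[2].matches); // string [ 'hi' ]
```

Because the comparison is a strict `r.index < m`, two rules that match at the same position are resolved in favor of whichever one the `for...in` loop visits first, which in current engines is the one listed first in the object. Concatenating every `token` in order reproduces the input string.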

@klappy klappy commented Sep 16, 2019

Thanks @borgar for making this available all these years!

We've used this approach as the basis for our JavaScript tokenizer for the past couple of years.

@borgar borgar (owner, author) commented Sep 17, 2019

Wow, awesome! I think I only ever used this to syntax highlight some code on a webpage. I thought it was useful enough to post and maybe come back to later, so it's great to hear that someone did. 😄

@ImMaax ImMaax commented Jun 1, 2020

Thanks for making this public, @borgar! Really helps with a small project I'm currently experimenting with!

@VlatkoStojkoski VlatkoStojkoski commented Nov 6, 2020

@borgar, dude u have no idea how much time you've saved me by making this public
