@borgar
Created June 24, 2010 12:33
A compact tokenizer written in JavaScript.
/*
 * Tiny tokenizer
 *
 * - Accepts a subject string and an object of regular expressions for parsing
 * - Returns an array of token objects
 * - Text that no expression matches is emitted with the type given as the
 *   third argument (defaults to 'unknown')
 *
 * tokenize('this is text.', { word:/\w+/, whitespace:/\s+/, punctuation:/[^\w\s]/ }, 'invalid');
 * result => [{ token: "this", type: "word", matches: [] }, { token: " ", type: "whitespace", matches: [] }, { token: "is", type: "word", matches: [] }, ... ]
 *
 */
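function tokenize ( s, parsers, deftok ) {
  var m, r, t, tokens = [];
  while ( s ) {
    t = null;
    m = s.length;
    for ( var key in parsers ) {
      // skip properties inherited through the prototype chain
      if ( !Object.prototype.hasOwnProperty.call( parsers, key ) ) { continue; }
      // reset state left behind by expressions using the g or y flag
      parsers[ key ].lastIndex = 0;
      r = parsers[ key ].exec( s );
      // try to choose the best match if there are several
      // where "best" is the closest to the current starting point;
      // on a tie the parser listed first wins, and empty matches are
      // ignored because they would never consume any input
      if ( r && r[ 0 ].length && ( r.index < m ) ) {
        t = {
          token: r[ 0 ],
          type: key,
          matches: r.slice( 1 )
        };
        m = r.index;
      }
    }
    if ( m ) {
      // there is text between last token and currently
      // matched token - push that out as default or "unknown"
      tokens.push({
        token : s.slice( 0, m ),
        type  : deftok || 'unknown'
      });
    }
    if ( t ) {
      // push current token onto sequence
      tokens.push( t );
    }
    // drop the skipped text and the matched token from the subject
    s = s.slice( m + ( t ? t.token.length : 0 ) );
  }
  return tokens;
}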
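
A usage sketch may help show how the "closest match wins" rule and the default token type interact. The grammar below (number, identifier, operator, whitespace) and the input string are made up for illustration and are not part of the gist. A condensed copy of tokenize() is included only so the snippet runs on its own.

```javascript
// Condensed copy of tokenize() from the gist, so this sketch is standalone.
function tokenize(s, parsers, deftok) {
  var m, r, t, tokens = [];
  while (s) {
    t = null;
    m = s.length;
    for (var key in parsers) {
      r = parsers[key].exec(s);
      // keep the match that starts closest to the beginning of s
      if (r && r.index < m) {
        t = { token: r[0], type: key, matches: r.slice(1) };
        m = r.index;
      }
    }
    if (m) {
      // unmatched text before the token gets the default type
      tokens.push({ token: s.slice(0, m), type: deftok || 'unknown' });
    }
    if (t) {
      tokens.push(t);
    }
    s = s.slice(m + (t ? t.token.length : 0));
  }
  return tokens;
}

// Hypothetical grammar for a toy expression language (an assumption,
// chosen only to exercise the tokenizer).
var parsers = {
  number: /\d+(?:\.\d+)?/,
  identifier: /[A-Za-z_]\w*/,
  operator: /[+\-*\/=]/,
  whitespace: /\s+/
};

var tokens = tokenize('x = 3.5 * y2 # ?', parsers, 'invalid');

// Drop whitespace tokens to see only the meaningful ones.
var significant = tokens.filter(function (tok) {
  return tok.type !== 'whitespace';
});

significant.forEach(function (tok) {
  console.log(tok.type + ': ' + JSON.stringify(tok.token));
});
// identifier: "x"
// operator: "="
// number: "3.5"
// operator: "*"
// identifier: "y2"
// invalid: "#"
// invalid: "?"
```

Note that "y2" comes out as a single identifier even though the number expression also matches the "2" inside it: the identifier match starts earlier in the string, so it wins. The "#" and "?" characters match none of the expressions and are therefore reported with the default type passed as the third argument.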

klappy commented Sep 16, 2019

Thanks @borgar for making this available all these years!

We've used this approach as the basis for our JavaScript tokenizer for the past couple of years.

borgar commented Sep 17, 2019

Wow, awesome! I think I only ever used this to syntax highlight some code on a webpage. I thought it was useful enough to post and maybe come back to later, so it's great to hear that someone did. 😄

ImMaax commented Jun 1, 2020

Thanks for making this public, @borgar! Really helps with a small project I'm currently experimenting with!
