@nirlanka
Last active April 24, 2020 03:31
Simple tokenizer examples
`<div>
Some text here
<div>
<h3>Lorem <a href="./abc/def.html">ipsum</a></h3>
<p>Dolor sit</p>
<p>amet</p>
</div>`
.split(/([<>\s="\/])/g)
.map(s=>s.trim())
.filter(Boolean);
/*
Result:
["<", "div", ">", "Some", "text", "here", "<", "div", ">", "<",
"h3", ">", "Lorem", "<", "a", "href", "=", "\"", ".", "/", "abc",
"/", "def.html", "\"", ">", "ipsum", "<", "/", "a", ">", "<", "/",
"h3", ">", "<", "p", ">", "Dolor", "sit", "<", "/", "p", ">", "<",
"p", ">", "amet", "<", "/", "p", ">", "<", "/", "div", ">"]
*/
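// String.raw keeps the backslashes in \"to\" as literal characters;
// a plain template literal would turn \" into " before the split runs.
String.raw`pkg bar {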
func foo() {
a = 1;
d = false; #lorem ipsum#
e = {
x: 1.35, #dolor sit#
y: 2.2,
f: "\"to\""
}
};
b = "abc";
c = foo();
}`
.split(/([\s(){}=;".,#])/g)
.map(s=>s.trim())
.filter(Boolean);
/*
Result:
["pkg", "bar", "{", "func", "foo", "(", ")", "{",
"a", "=", "1", ";", "d", "=", "false", ";", "#",
"lorem", "ipsum", "#", "e", "=", "{", "x:", "1",
".", "35", ",", "#", "dolor", "sit", "#", "y:",
"2", ".", "2", ",", "f:", "\"", "\\", "\"", "to",
"\\", "\"", "\"", "}", "}", ";", "b", "=", "\"",
"abc", "\"", ";", "c", "=", "foo", "(", ")", ";",
"}"]
*/
@nirlanka
Author

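Known issue 1: Runs of spaces and line breaks inside string literals are lost, because every whitespace character is a delimiter and .map(s=>s.trim()) then reduces each whitespace token to an empty string that .filter(Boolean) drops.
Possible solution/idea: Remove the .map(s=>s.trim()) step. Every whitespace character then survives as its own token, so a later pass can use context (inside or outside a string) to decide which ones to keep.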
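
A rough sketch of that idea, not something from the gist itself: the function names tokenizeKeepingSpaces and joinStrings are made up for illustration, and the escape handling assumes a quote is escaped whenever the text collected so far ends in a backslash (so an escaped backslash right before a closing quote would be misread).

```javascript
// Hypothetical helpers; names and escape rule are assumptions, not the gist's code.

// Same split as the second example, but without trim(), so each whitespace
// character survives as its own token.
function tokenizeKeepingSpaces(source) {
  return source
    .split(/([\s(){}=;".,#])/)
    .filter(Boolean); // only removes the empty strings produced by split
}

// Second pass: glue everything between a pair of double quotes back into one
// string token, and drop whitespace tokens that fall outside strings.
function joinStrings(tokens) {
  const out = [];
  let current = null; // text of the string literal being collected, or null
  for (const tok of tokens) {
    if (current === null) {
      if (tok === '"') current = '"';            // opening quote
      else if (tok.trim() !== '') out.push(tok); // skip whitespace outside strings
    } else {
      const escaped = current.endsWith('\\');    // simplistic escape check
      current += tok;
      if (tok === '"' && !escaped) {             // closing quote
        out.push(current);
        current = null;
      }
    }
  }
  if (current !== null) out.push(current);       // unterminated string: keep what we have
  return out;
}

const tokens = joinStrings(tokenizeKeepingSpaces('b = "a  b\n c";'));
// tokens: ['b', '=', '"a  b\n c"', ';']
```

The double space and the line break inside the quotes come through unchanged, while the spaces around = are still discarded.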
