@gorango
Created November 18, 2017 10:42
Concise, efficient sentence tokenizer for Latin-script languages, built on a single regular expression. The author describes it as 99% reliable.
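function sentencesArray (text) {
  return text
    // Append "|" after each boundary: a run of 2+ lowercase letters, a capitalized
    // word of 4+ letters, or a digit, then terminator(s), then whitespace or a double quote.
    .replace(/([\s,]?[\d,-]?([A-Z][a-z]{3,}|[a-z]{2,}|[0-9])[.?!…\n]+([\s\n"]))/g, '$1|')
    // Split on the marker. Caveat: a literal "|" in the input also splits here.
    .split('|')
    // Trim each piece and drop empty ones.
    .map(s => s.trim())
    .filter(s => s)
}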
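
A short usage sketch may help show what the pattern does and does not catch. The sample sentences below are illustrative and not part of the gist, and the function is repeated only so the snippet runs on its own. The comments give the results I would expect from reading the regex: short abbreviations such as "Dr." are skipped, but the lowercase tail of "Mrs." is long enough to trigger a split, and because the character classes are ASCII-only, a sentence ending in an accented letter is not detected.

```javascript
// Copy of the tokenizer so this snippet is self-contained.
function sentencesArray (text) {
  return text
    .replace(/([\s,]?[\d,-]?([A-Z][a-z]{3,}|[a-z]{2,}|[0-9])[.?!…\n]+([\s\n"]))/g, '$1|')
    .split('|')
    .map(s => s.trim())
    .filter(s => s)
}

// Plain sentences split at each terminator that is followed by whitespace.
const plain = sentencesArray('Hello world. This is a test! Is it working? Yes.')
// -> ['Hello world.', 'This is a test!', 'Is it working?', 'Yes.']

// "Dr." and "p.m." are too short to match the word part of the pattern, so no split.
const abbreviations = sentencesArray('Dr. Smith arrived at 5 p.m. yesterday.')
// -> ['Dr. Smith arrived at 5 p.m. yesterday.']

// "rs" in "Mrs." satisfies [a-z]{2,}, so a boundary is inserted by mistake.
const falseSplit = sentencesArray('Mrs. Jones left.')
// -> ['Mrs.', 'Jones left.']

// [a-z] does not include "é", so "allé." is not treated as a sentence end.
const accented = sentencesArray('Il est allé. Puis il est parti.')
// -> ['Il est allé. Puis il est parti.']

console.log(plain, abbreviations, falseSplit, accented)
```

If accented text matters, one possible adjustment (an assumption, not something the gist does) would be to widen the letter classes with Unicode property escapes such as `\p{Ll}` and `\p{Lu}` together with the `u` flag.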