Skip to content

Instantly share code, notes, and snippets.

@ncalm
Last active June 15, 2023 09:43
  • Star 1 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
Star You must be signed in to star a gist
Save ncalm/bb252d098d5aada0b6c9740ce03cdc82 to your computer and use it in GitHub Desktop.
This Excel LAMBDA function converts a text string into n-grams
/*
NGRAMS
Converts a text string to n-grams
Inputs:
- text (string): the text string to convert
- n (int): the number of elements to output in each n-gram
- strict (bool): whether to output incomplete n-grams at the end of the array
Returns:
An array with 1 + n columns and wordcount rows if strict = FALSE,
or (wordcount - n + 1) rows if strict = TRUE. Each n-gram is on a separate
row. The first column contains the original text in each row.
Example 1:
text: "Hello this is a text string"
n: 3
strict: FALSE
NGRAMS:
{
"Hello this is a text string","Hello","this","is";
"Hello this is a text string","this","is","a";
"Hello this is a text string","is","a","text";
"Hello this is a text string","a","text","string";
"Hello this is a text string","text","string",#REF!;
"Hello this is a text string","string",#REF!,#REF!
}
Example 2:
text: "Hello this is a text string"
n: 3
strict: TRUE
NGRAMS:
{
"Hello this is a text string","Hello","this","is";
"Hello this is a text string","this","is","a";
"Hello this is a text string","is","a","text";
"Hello this is a text string","a","text","string";
}
*/
NGRAMS = LAMBDA(text,n,strict,
LET(
words,IFERROR(TEXTSPLITXML(text," "),""),
wordcount,ROWS(words),
witherrors,MAKEARRAY(
wordcount,
n+1,
LAMBDA(r,c,
IF(
c=1,
text,
INDEX(
LET(
ind,MAKEARRAY(wordcount,n,LAMBDA(r,c,r+c-1)),
INDEX(words,ind)
),
r,
c-1
)
)
)
),
IF(
strict,
FILTER(
witherrors,
BYROW(witherrors,LAMBDA(a,SUM(N(ISERROR(a)))))=0
),
witherrors
)
)
);
TEXTSPLITXML =LAMBDA(text, delim, FILTERXML("<x><y>" & SUBSTITUTE(text, delim, "</y><y>") & "</y></x>", "//y"));
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment