Skip to content

Instantly share code, notes, and snippets.

@calebwaldner
Last active May 12, 2021 15:48
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save calebwaldner/9c784ba7952d0b4f7c57d93f19ceaad1 to your computer and use it in GitHub Desktop.
Save calebwaldner/9c784ba7952d0b4f7c57d93f19ceaad1 to your computer and use it in GitHub Desktop.
A BigQuery function that accepts an array of similar strings and returns the string that occurs most within array.
CREATE TEMP FUNCTION GetMainStringValue(arr ARRAY<STRING>) RETURNS STRING AS (
ARRAY(
SELECT list FROM (
SELECT list,
count(*) count,
FROM UNNEST(arr) list
GROUP BY list
ORDER BY count DESC
)
)[SAFE_OFFSET(0)]
);
SELECT GetMainStringValue(["CoolWords", "CoolerWords ", "CoolWord", "CoolWard", "CoolWord", "CoolWord"]) AS string_with_most_occurrences
@calebwaldner
Copy link
Author

A BigQuery function that accepts an array of similar strings and returns the string that occurs most within array. Intended for use with SOUNDEX() function to facilitate users inputing the similar words but with some possible spelling errors; GetMainStringValue would accept that array of user values and output what seems the be the main intended value (assuming the more users spelled the words correctly than did not).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment