@kipcole9
Last active March 2, 2023 06:48
Sanitize a string using [unicode_set](https://hex.pm/packages/unicode_set)
defmodule Sanitize do
  # Unicode sets are defined at https://unicode-org.github.io/icu/userguide/strings/unicodeset.html
  require Unicode.Set

  # Defines a guard that matches the intersection of alphanumerics and the Latin script, plus
  # the space and underscore characters. The set is resolved at compile time into an integer
  # expression, so the guard is acceptably performant at runtime.
  # Note: ASCII digits carry Script=Common rather than Latin in Unicode, so they would be
  # expected to fall outside the intersection and be stripped.
  defguard latin_alphanum(c) when Unicode.Set.match?(c, "[[:Alnum:]&[:script=Latin:][_\\ ]]")

  # Base case: nothing left to scan.
  def sanitize_string(<<"">>), do: ""

  # Keep a code point that satisfies the guard, then recurse on the rest.
  def sanitize_string(<<c::utf8, rest::binary>>) when latin_alphanum(c),
    do: <<c::utf8, sanitize_string(rest)::binary>>

  # Drop any other valid UTF-8 code point.
  def sanitize_string(<<_c::utf8, rest::binary>>), do: sanitize_string(rest)
end
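
The clauses above only match valid UTF-8, so a binary containing an invalid byte would raise a `FunctionClauseError`. The following is a minimal sketch of one way to tolerate such input, not part of the original gist. It assumes a hypothetical module name `SanitizeSketch` and a hypothetical stand-in guard `ascii_letter_or_sep/1` (ASCII letters, underscore and space only) so that it runs without the `unicode_set` dependency; in real use the `Unicode.Set`-based guard would take its place.

```elixir
defmodule SanitizeSketch do
  # Hypothetical stand-in for the Unicode.Set-based guard: ASCII letters,
  # underscore and space only, so this sketch needs no external dependency.
  defguard ascii_letter_or_sep(c)
           when c in ?a..?z or c in ?A..?Z or c == ?_ or c == ?\s

  # Base case: nothing left to scan.
  def sanitize_string(<<>>), do: ""

  # Keep a code point accepted by the guard.
  def sanitize_string(<<c::utf8, rest::binary>>) when ascii_letter_or_sep(c),
    do: <<c::utf8, sanitize_string(rest)::binary>>

  # Drop any other valid UTF-8 code point.
  def sanitize_string(<<_c::utf8, rest::binary>>), do: sanitize_string(rest)

  # Extra clause (an assumption, not in the original): skip one byte that is
  # not part of a valid UTF-8 sequence instead of raising.
  def sanitize_string(<<_byte, rest::binary>>), do: sanitize_string(rest)
end

IO.inspect(SanitizeSketch.sanitize_string("this is a ๓ thai _char !!!"))
# => "this is a  thai _char "
IO.inspect(SanitizeSketch.sanitize_string(<<"ab", 0xFF, "c">>))
# => "abc"
```

The final clause is tried only after both `::utf8` patterns have failed, so valid text is handled exactly as before and only stray bytes are discarded one at a time.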
kipcole9 commented Mar 2, 2023

Example

iex> Sanitize.sanitize_string("this is a ๓ thai _char !!!")
"this is a  thai _char "
