@allanbatista
Created May 13, 2020 16:15
BigQuery function to normalize HTML into formatted plain text.
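CREATE OR REPLACE FUNCTION `project-id.dataset-id.html_to_text`(text STRING) AS (
  -- Patterns are applied from the innermost call outward.
  TRIM(
    REGEXP_REPLACE(
      REGEXP_REPLACE(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              LOWER(text),  -- lowercases the whole input, text content included
              r"<br>|<br\/>|<br\s+\/>|<\/p>", "\n"),  -- 1. line breaks and paragraph ends become newlines
            r"<[^>]*>", ""),  -- 2. strip all remaining tags
          r"\n{2,}", "\n\n"),  -- 3. cap runs of newlines at one blank line
        r"^\s+", ""),  -- 4. drop leading whitespace
      r"\n\s+", "\n")  -- 5. drop whitespace after each newline; \s also matches \n, so blank lines collapse too
  )
);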
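
A minimal usage sketch, assuming the function above has already been created and that the placeholder `project-id` and `dataset-id` have been replaced with real identifiers. The sample HTML string and the `plain_text` alias are illustrative only and are not part of the original gist.

```sql
-- Call the UDF on a small HTML fragment.
-- Assumes `project-id.dataset-id.html_to_text` exists in your project.
SELECT
  `project-id.dataset-id.html_to_text`(
    '<p>Hello <b>World</b></p><br/><p>Second  line</p>'
  ) AS plain_text;

-- plain_text is "hello world", a single newline, then "second  line".
-- The text is lowercased, and the blank line produced by </p><br/> is
-- collapsed to one newline because the final pattern \n\s+ also matches \n.
-- Spaces inside a line (the double space in "second  line") are kept.
```

Because of the last replacement, paragraphs end up separated by a single newline rather than a blank line. If case should be preserved, one option is to drop `LOWER` and prefix the tag patterns with the case-insensitive flag `(?i)`.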