@aialenti
Last active September 20, 2020 14:06
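# Imports and session setup (in the pyspark shell, `spark` already exists)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract

spark = SparkSession.builder.getOrCreate()

# Read the source table in Parquet format
sales_table = spark.read.parquet("./data/sales_parquet")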
# Equivalent Spark SQL, for reference
'''
SELECT DISTINCT REGEXP_EXTRACT(bill_raw_text, '(ab[cd]{2,4})|(aa[abcde]{1,2})', 0) AS extracted_pattern
FROM sales_table
WHERE REGEXP_EXTRACT(bill_raw_text, '(ab[cd]{2,4})|(aa[abcde]{1,2})', 0) <> ''
'''
sales_table_execution_plan = sales_table.select(
    # The last argument selects the capture group; 0 returns the entire match
    regexp_extract(col("bill_raw_text"), "(ab[cd]{2,4})|(aa[abcde]{1,2})", 0).alias("extracted_pattern")
).where(col("extracted_pattern") != "").distinct()

# Show up to 100 rows without truncating column values
sales_table_execution_plan.show(100, False)
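
To see what the group index changes, the same pattern can be tried without Spark. This is only a rough sketch using Python's `re` module: `extract` is a hypothetical stand-in for `regexp_extract` (first match, chosen group, empty string when nothing matches), the sample strings are made up, and Spark uses Java regular expressions, which are assumed to behave the same way for this simple pattern.

```python
import re

PATTERN = r"(ab[cd]{2,4})|(aa[abcde]{1,2})"


def extract(text, idx):
    """Hypothetical stand-in for regexp_extract, not Spark's implementation."""
    match = re.search(PATTERN, text)
    if match is None:
        # No match at all: mimic the empty string that the query filters out
        return ""
    # A group that did not take part in the match is None in Python
    return match.group(idx) or ""


# Made-up sample strings, for illustration only
samples = ["xx abcd yy", "zz aab", "no match here"]
for sample in samples:
    # Group 0 is the whole match; groups 1 and 2 are the two alternatives
    print(sample, [extract(sample, idx) for idx in (0, 1, 2)])
```

Group 0 is non-empty whenever either alternative matches, which is why the query above uses it; groups 1 and 2 are each non-empty only for their own alternative.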