@semyont
Created August 3, 2016 10:48
Regex for extracting log data
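from pyspark.sql.functions import regexp_extract
# base_df is assumed to hold one raw access-log line per row in a string column named 'value'.
split_df = base_df.select(
    # Host: first whitespace-delimited token (the trailing space is not captured).
    regexp_extract('value', r'^(\S+)', 1).alias('host'),
    # Timestamp inside [...], dd/Mon/yyyy:HH:mm:ss plus a UTC offset of either sign.
    regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})]', 1).alias('timestamp'),
    # Request path: the token between the HTTP method and the protocol.
    regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
    # Status code: first token after the last double quote on the line.
    regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
    # Content size: run of digits at the very end of the line.
    regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_df.show(truncate=False)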
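
The patterns can be sanity-checked without a Spark session. Below is a minimal sketch using Python's `re` module. It assumes that `re` and the Java regex engine behind `regexp_extract` treat these particular patterns the same way. The sample log line, the `PATTERNS` dict and the `extract` helper are hypothetical names and data made up for illustration; they are not part of the gist.

```python
import re

# Hypothetical access-log line (made up for illustration, not real data).
SAMPLE = ('example.host.net - - [03/Aug/2016:10:48:00 -0400] '
          '"GET /index.html HTTP/1.0" 200 1839')

# One pattern per output column; capture group 1 is the extracted value.
PATTERNS = {
    'host': r'^(\S+)',
    'timestamp': r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})]',
    'path': r'^.*"\w+\s+([^\s]+)\s+HTTP.*"',
    'status': r'^.*"\s+([^\s]+)',
    'content_size': r'^.*\s+(\d+)$',
}


def extract(line):
    """Return group 1 of each pattern, or '' when the pattern does not match
    (the same fallback regexp_extract uses)."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, line)
        result[name] = match.group(1) if match else ''
    return result


fields = extract(SAMPLE)
for name, value in fields.items():
    print(f'{name}: {value!r}')

# A host pattern that keeps \s inside the group also captures the space
# that follows the host name.
host_with_space = re.search(r'^([^\s]+\s)', SAMPLE).group(1)
print(repr(host_with_space))
```

Note that the values come back as strings here; in the Spark version the `.cast('integer')` calls handle the conversion for `status` and `content_size`.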