@guangningyu
Last active October 9, 2018 10:42
Apply filters defined in a YAML file to a PySpark DataFrame
#!/usr/bin/env python
import yaml

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# create a sample DataFrame
df = sqlContext.createDataFrame([
    ("Mary", 15),
    ("John", 18),
    ("Alex", 30),
], ["name", "age"])

# read filter rules from a YAML file, e.g. test.yaml containing:
# - 'age > 15 or name != "Mary"'
# - 'name != "Alex"'
with open('test.yaml') as f:
    rules = yaml.safe_load(f)  # safe_load avoids executing arbitrary YAML tags

# apply each rule as a SQL filter expression
for rule in rules:
    df = df.filter(rule)

print(df.collect())
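Since each `df.filter(...)` call accepts a SQL expression string, the rules could alternatively be combined into a single predicate and applied in one filter call. A minimal sketch of that idea in plain Python (the `combine_rules` helper is hypothetical, not part of the original gist; no Spark is needed to see the combined string):

```python
def combine_rules(rules):
    """Join rule strings into one SQL predicate.

    Each rule is parenthesized so operator precedence inside a rule
    (e.g. an 'or') is preserved when the rules are AND-ed together.
    """
    return " AND ".join("({})".format(r) for r in rules)

rules = [
    'age > 15 or name != "Mary"',
    'name != "Alex"',
]
combined = combine_rules(rules)
# combined == '(age > 15 or name != "Mary") AND (name != "Alex")'
# Then a single call suffices: df = df.filter(combined)
```

Both forms produce the same result, since chained `filter` calls are implicitly AND-ed.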