@mkaranasou
Last active October 4, 2019 13:48
Read a txt file with pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F
conf = SparkConf()
# Optional: give the driver enough memory for the size of the file being read,
# so that the job does not fail with an out-of-memory (OOM) error.
conf.set('spark.driver.memory', '6G')
spark = SparkSession.builder \
    .config(conf=conf) \
    .appName('Homework-App') \
    .getOrCreate()
# Each line of the file becomes one row in a single string column named 'value'.
df = spark.read.text('full/path/to/file.txt')
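# Flag the lines that contain the (case-sensitive) substring 'big data'.
df = df.withColumn('has_big_data', F.when(F.col('value').contains('big data'), True).otherwise(False))
# Count the flagged lines; a boolean column can be used directly as the filter.
result = df.select('value').where(F.col('has_big_data')).count()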
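For a small file, the count can be cross-checked without Spark. The following is only a sketch: the helper name `count_lines_containing` and the sample lines are hypothetical, not part of the original gist. It assumes the same case-sensitive substring match that `contains('big data')` performs, applied line by line.

```python
import os
import tempfile


def count_lines_containing(path, needle='big data'):
    """Count the lines of a text file that contain `needle` (case-sensitive)."""
    with open(path, encoding='utf-8') as handle:
        return sum(1 for line in handle if needle in line)


# Hypothetical sample data, written to a temporary file for illustration only.
sample_lines = [
    'big data is everywhere',
    'small data too',
    'Big Data (capitalised, so it does not match)',
    'more big data',
]
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('\n'.join(sample_lines))
    sample_path = tmp.name

sample_count = count_lines_containing(sample_path)
os.remove(sample_path)  # clean up the temporary file
print(sample_count)
```

Running the same helper on the real file path should give the same number as `result`, provided the file is small enough to read on one machine.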