@psenger
Forked from ari-vedant-jain/Unzipping using Python & Pyspark
Last active February 8, 2023 15:24
[Unzipping using Python & Pyspark] #Python #Spark #Pyspark
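# Using Python
import os
import zipfile

# Recreate the archive's directory entries (names ending in '/') under the
# current working directory. Note: this does not extract the files themselves.
with zipfile.ZipFile('/databricks/driver/D-Dfiles.zip') as z:
    for f in z.namelist():
        if f.endswith('/'):
            os.makedirs(f, exist_ok=True)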
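The snippet above only creates the directories listed in the archive. To also write the files to disk, the standard library's `ZipFile.extractall` can be used. The following is a minimal sketch, not the gist author's method: the helper name `unzip_to`, the temporary directory, and the sample member `logs/well_1.las` are all hypothetical and exist only to keep the demo self-contained.

```python
import os
import tempfile
import zipfile


def unzip_to(zip_path, target_dir):
    """Extract every member of zip_path under target_dir; return the member names.

    unzip_to is a hypothetical helper name used for illustration only.
    """
    with zipfile.ZipFile(zip_path) as z:
        # extractall creates any missing intermediate directories.
        z.extractall(target_dir)
        return z.namelist()


# Self-contained demo: build a tiny archive in a temp directory, then extract it.
# The member name and contents below are made up for the example.
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as z:
    z.writestr("logs/well_1.las", "sample contents")

extracted = unzip_to(demo_zip, os.path.join(demo_dir, "out"))
print(extracted)
```

On Databricks, `zip_path` would presumably be the driver-local path used above, and `target_dir` any writable driver directory.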
# Reading zipped folder data in Pyspark
import io
import zipfile

def zip_extract(x):
    # x is a (path, content) pair from sc.binaryFiles; x[1] holds the raw zip bytes.
    in_memory_data = io.BytesIO(x[1])
    with zipfile.ZipFile(in_memory_data, "r") as file_obj:
        # Map each member name to its uncompressed bytes.
        return {name: file_obj.read(name) for name in file_obj.namelist()}

# sc is the SparkContext that Databricks notebooks provide.
zips = sc.binaryFiles("dbfs:/mnt/vedant-demo/ONG/data/las_raw/D-Dfiles.zip")
files_data = zips.map(zip_extract)
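Because each element produced by `sc.binaryFiles` is a `(path, bytes)` pair, `zip_extract` can be exercised locally without a cluster. The sketch below restates the function and feeds it an archive built in memory. The record path, the member names, and the UTF-8 decoding step are assumptions made for the demo, not details from the gist.

```python
import io
import zipfile


def zip_extract(x):
    """x mimics one element of sc.binaryFiles: a (path, content_bytes) pair."""
    in_memory_data = io.BytesIO(x[1])
    with zipfile.ZipFile(in_memory_data, "r") as file_obj:
        # Map each member name to its uncompressed bytes.
        return {name: file_obj.read(name) for name in file_obj.namelist()}


# Build a small archive in memory (hypothetical file names and contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("a.txt", "alpha")
    z.writestr("sub/b.txt", "beta")

# Hypothetical record shaped like what binaryFiles would hand to the mapper.
record = ("dbfs:/hypothetical/demo.zip", buf.getvalue())
result = zip_extract(record)

# Values are bytes; decoding as UTF-8 is an assumption about the file contents.
decoded = {name: data.decode("utf-8") for name, data in result.items()}
print(decoded)
```

Each mapped element is therefore one dictionary of name-to-bytes per archive. On a cluster, something like `files_data.flatMap(lambda d: d.items())` should turn that into `(name, bytes)` pairs that are easier to inspect, though this has not been run against the data referenced in the gist.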

zwxxx121 commented Feb 8, 2023

Thank you for sharing. Do you know how to view the files in the unzipped folder after running your code? I tried files_data.collect() and saveAsTextFile, but both raise errors.
