Forked from ari-vedant-jain/Unzipping using Python & Pyspark
Last active February 8, 2023 15:24
[Unzipping using Python & Pyspark] #Python #Spark #Pyspark
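# Using Python
import os, zipfile

z = zipfile.ZipFile('/databricks/driver/D-Dfiles.zip')
for f in z.namelist():
    # Names ending in '/' are directory entries inside the archive
    if f.endswith('/'):
        # Recreate the directory relative to the current working directory;
        # exist_ok avoids an error if it already exists.
        # Note: this loop only rebuilds the directory tree, it does not write file entries.
        os.makedirs(f, exist_ok=True)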
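The loop above only recreates the archive's directory entries. To write the files as well, a minimal sketch could rely on `ZipFile.extractall`. The helper name `unzip_to`, the target directory, and the demo archive are all hypothetical; the demo builds a small throwaway zip so it can run outside Databricks.

```python
import os
import tempfile
import zipfile

def unzip_to(zip_path, target_dir):
    """Extract every entry of zip_path under target_dir; return the file entry names."""
    with zipfile.ZipFile(zip_path) as z:
        z.extractall(target_dir)  # creates intermediate directories as needed
        return [n for n in z.namelist() if not n.endswith('/')]

# Demo on a throwaway archive (made-up contents, not the D-Dfiles.zip data)
workdir = tempfile.mkdtemp()
demo_zip = os.path.join(workdir, 'demo.zip')
with zipfile.ZipFile(demo_zip, 'w') as z:
    z.writestr('logs/a.las', 'first file')
    z.writestr('logs/b.las', 'second file')

extracted = unzip_to(demo_zip, os.path.join(workdir, 'out'))
print(extracted)
```

Passing an explicit target directory keeps the output out of the current working directory, which the original loop writes into.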
# Reading zipped folder data in Pyspark
import zipfile
import io

def zip_extract(x):
    # x is a (path, content) pair from sc.binaryFiles; x[1] holds the raw zip bytes
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = file_obj.namelist()
    # Map each archive entry name to its uncompressed bytes
    return {name: file_obj.read(name) for name in files}

# `sc` is assumed to be an existing SparkContext (Databricks notebooks predefine it)
zips = sc.binaryFiles("dbfs:/mnt/vedant-demo/ONG/data/las_raw/D-Dfiles.zip")
files_data = zips.map(zip_extract)
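Because `zip_extract` only needs a `(path, bytes)` pair, its behaviour can be checked without a cluster. The sketch below feeds it an archive built in memory. The sample entry names and the `dbfs:/hypothetical/...` path are made up, and `zip_extract_pairs` is a hypothetical variant (not part of the original gist) that skips directory entries and returns `(name, bytes)` tuples.

```python
import io
import zipfile

def zip_extract(x):
    # x mimics one sc.binaryFiles record: (path, raw zip bytes)
    with zipfile.ZipFile(io.BytesIO(x[1]), "r") as file_obj:
        return {name: file_obj.read(name) for name in file_obj.namelist()}

def zip_extract_pairs(x):
    # Hypothetical variant: one (entry name, bytes) tuple per file, directories skipped
    return [(n, b) for n, b in zip_extract(x).items() if not n.endswith('/')]

# Build a small archive in memory (made-up contents)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("las_raw/", "")
    z.writestr("las_raw/well1.las", "depth,gr\n100,45\n")

record = ("dbfs:/hypothetical/demo.zip", buf.getvalue())
pairs = zip_extract_pairs(record)
for name, data in pairs:
    print(name, len(data), "bytes")
```

On a cluster, the same function could presumably be passed to `flatMap` in place of `map` to get one record per file. The values are raw bytes, so they may need decoding before any text-oriented output.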
Thank you for sharing. Do you know how to view the files in the unzipped folder after running your code? I tried files_data.collect() and saveAsTextFile, but both raise errors.