Using Python's Multiprocessing Managers to share a queue between hosts
g-clef / multiprocessingmanagers.md, created June 2, 2020

To share a queue between hosts with Python, you can use a SyncManager. One machine acts as the manager/owner of the queue, and multiple other machines connect to it to pull jobs. You can also share multiprocessing Events (for example, a "shutdown" Event to tell all the workers on the queue to stop).

First, make an empty class that subclasses multiprocessing.managers.SyncManager:

```python
import multiprocessing.managers

class CentralManager(multiprocessing.managers.SyncManager):
    """A SyncManager subclass that synchronizes shared objects across multiple hosts."""
    pass
```
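
The gist cuts off here, so what follows is a minimal sketch of the two remaining pieces, not g-clef's original code: the registered names (`get_job_queue`, `get_shutdown_event`), the address, and the authkey are all placeholder assumptions. On the owning machine, register callables that return the shared objects, then serve them:

```python
import queue
import threading

# Server side (the machine that owns the queue). A plain queue.Queue and
# threading.Event are enough here, because the manager serves them to
# remote clients through proxies.
job_queue = queue.Queue()
shutdown_event = threading.Event()

CentralManager.register("get_job_queue", callable=lambda: job_queue)
CentralManager.register("get_shutdown_event", callable=lambda: shutdown_event)

manager = CentralManager(address=("0.0.0.0", 50000), authkey=b"secret")
server = manager.get_server()
server.serve_forever()  # blocks, serving the queue and event to clients
```

On each worker machine, register the same names (no callables needed) and connect; the hostname below is a placeholder:

```python
import queue

CentralManager.register("get_job_queue")
CentralManager.register("get_shutdown_event")

manager = CentralManager(address=("manager-host.example.com", 50000),
                         authkey=b"secret")
manager.connect()

jobs = manager.get_job_queue()
shutdown = manager.get_shutdown_event()

while not shutdown.is_set():
    try:
        job = jobs.get(timeout=5)  # time out so the shutdown Event is rechecked
    except queue.Empty:
        continue
    # ... process the job ...
```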
Reading zipped folder data in Pyspark

```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import zipfile

from pyspark.sql import SparkSession

# Using Python: unzip the archive on the Databricks driver, recreating
# the directory structure from the zip entries and extracting the files.
z = zipfile.ZipFile('/databricks/driver/D-Dfiles.zip')
for f in z.namelist():
    if f.endswith('/'):
        os.makedirs(f, exist_ok=True)
    else:
        z.extract(f)
```
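
Once the archive is unzipped on the driver, Spark can read the extracted files. A minimal follow-on sketch; the CSV format, header option, and extraction path are assumptions, so adjust them to whatever D-Dfiles.zip actually contains:

```python
# Read the extracted folder with Spark. The "file:" prefix points Spark
# at the driver's local filesystem rather than DBFS.
spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("file:/databricks/driver/D-Dfiles/",
                    header=True, inferSchema=True)
df.show(5)
```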