
Mark Cutajar (markcutajar)

💭 One cannot step twice in the same river - Heraclitus
markcutajar / fxbeam-v1-input-types.py
Last active March 15, 2021 20:45
FxBeam: processing different input types
# Snippet assumes these module-level imports
import json
import apache_beam as beam

if self.input_file_type == 'csv':
    # CSV rows need parsing against the configured column headers
    rows = rows | 'Convert CSV to tick elements' >> beam.ParDo(
        ParseDataRows(),
        headers=self.input_columns
    )
else:
    # JSON lines can be decoded directly into dicts
    rows = rows | 'Convert JSON to tick elements' >> beam.Map(json.loads)
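The `ParseDataRows` transform is referenced but not shown in this gist. A minimal, standard-library-only sketch of the per-row parsing logic such a DoFn might wrap (the function name and behaviour here are assumptions, not the gist's actual implementation):

```python
import csv
import io

def parse_data_row(line, headers):
    """Turn one raw CSV line into a dict keyed by the given headers.

    Hypothetical stand-in for what a ParseDataRows DoFn might do per
    element; the gist does not show its real implementation.
    """
    # csv.reader copes with quoted fields that a naive str.split() would not
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(headers, values))

# Example: a forex tick row with a timestamp and bid/ask prices
tick = parse_data_row('20210315 204500,1.1912,1.1914',
                      ['timestamp', 'bid', 'ask'])
# tick == {'timestamp': '20210315 204500', 'bid': '1.1912', 'ask': '1.1914'}
```

Inside a real DoFn, `process` would simply `yield` this dict for each incoming line.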
markcutajar / fxbeam-v1-read-data.py
Last active March 15, 2021 20:46
FxBeam: reading data from text
# Snippet assumes: from apache_beam.io import ReadFromText
# _compression is set elsewhere in the pipeline setup
rows = self.pipeline | 'Read data file' >> ReadFromText(
    self.input_file_pattern,
    compression_type=_compression
)
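The `_compression` value is set elsewhere in the gist. A minimal sketch of how it might be derived from the file pattern; the helper name is an assumption, and the string values mirror Beam's `CompressionTypes` constants, which are plain strings ('gzip', 'bzip2', 'uncompressed', 'auto'):

```python
def compression_for(file_pattern):
    """Hypothetical helper choosing a compression_type for ReadFromText.

    Beam can also auto-detect with 'auto', so this explicit mapping is
    just one way the gist's _compression might be produced.
    """
    if file_pattern.endswith('.gz'):
        return 'gzip'
    if file_pattern.endswith('.bz2'):
        return 'bzip2'
    return 'uncompressed'

_compression = compression_for('ticks-2021-03.csv.gz')
# _compression == 'gzip'
```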
markcutajar / beam-spark-notes.md
Last active October 13, 2020 18:51
Running an Apache Beam pipeline in a Windows Local Spark Cluster


This file documents my attempt to run an Apache Beam pipeline on a Windows Spark setup. Unlike many other techies out there, I am a firm Microsoft fan. I believe they build quality products, and in the years since Satya Nadella took over as CEO we have seen massive strides in Microsoft's products.

To that point, all my machines at home are Windows based, including my processing machine, which I mostly use to run models or occasionally play a few games. A couple of weeks ago I decided to build a simple Apache Beam pipeline to process forex tick data into windowed candlestick form. This could easily be done in pandas if the data were smaller, but with a 7 GB tick-data file we would have to split it into day or perhaps week segments and process them one by one. To avoid the simple way out, I decided to attempt to build a Beam pipeline, as it is promised to pr