
Mark Cutajar (markcutajar)

💭 One cannot step twice in the same river - Heraclitus
markcutajar / fxbeam-v1-input-types.py
Last active March 15, 2021 20:45
FxBeam: processing different input types
# Snippet assumes these module-level imports
import json
import apache_beam as beam

if self.input_file_type == 'csv':
    # CSV rows need parsing against the configured column headers
    rows = rows | 'Convert CSV to tick elements' >> beam.ParDo(
        ParseDataRows(),
        headers=self.input_columns
    )
else:
    # JSON lines can be decoded directly into dicts
    rows = rows | 'Convert JSON to tick elements' >> beam.Map(json.loads)
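The `ParseDataRows` transform is referenced but not shown in this gist. A minimal, standard-library-only sketch of the per-row parsing logic such a DoFn might wrap (the function name and behaviour here are assumptions, not the gist's actual implementation):

```python
import csv
import io

def parse_data_row(line, headers):
    """Turn one raw CSV line into a dict keyed by the given headers.

    Hypothetical stand-in for what a ParseDataRows DoFn might do per
    element; the gist does not show its real implementation.
    """
    # csv.reader copes with quoted fields that a naive str.split() would not
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(headers, values))

# Example: a forex tick row with a timestamp and bid/ask prices
tick = parse_data_row('20210315 204500,1.1912,1.1914',
                      ['timestamp', 'bid', 'ask'])
# tick == {'timestamp': '20210315 204500', 'bid': '1.1912', 'ask': '1.1914'}
```

Inside a real DoFn, `process` would simply `yield` this dict for each incoming line.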
markcutajar / fxbeam-v1-read-data.py
Last active March 15, 2021 20:46
FxBeam: reading data from text
# Snippet assumes: from apache_beam.io import ReadFromText
# _compression is set elsewhere in the pipeline setup
rows = self.pipeline | 'Read data file' >> ReadFromText(
    self.input_file_pattern,
    compression_type=_compression
)
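The `_compression` value is set elsewhere in the gist. A minimal sketch of how it might be derived from the file pattern; the helper name is an assumption, and the string values mirror Beam's `CompressionTypes` constants, which are plain strings ('gzip', 'bzip2', 'uncompressed', 'auto'):

```python
def compression_for(file_pattern):
    """Hypothetical helper choosing a compression_type for ReadFromText.

    Beam can also auto-detect with 'auto', so this explicit mapping is
    just one way the gist's _compression might be produced.
    """
    if file_pattern.endswith('.gz'):
        return 'gzip'
    if file_pattern.endswith('.bz2'):
        return 'bzip2'
    return 'uncompressed'

_compression = compression_for('ticks-2021-03.csv.gz')
# _compression == 'gzip'
```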
markcutajar / beam-spark-notes.md
Last active October 13, 2020 18:51
Running an Apache Beam pipeline in a Windows Local Spark Cluster


This file documents my attempt to run an Apache Beam pipeline on a Windows Spark setup. Unlike many other techies out there, I am a firm Microsoft fan. I believe they build quality products, and in the years since Satya Nadella took over as CEO we have seen massive strides in Microsoft's products.

To that point, all my machines at home are Windows based, including my processing machine, which I mostly use to run models or occasionally play a few games. A couple of weeks ago I decided to build a simple Apache Beam pipeline to process forex tick data into windowed candlestick form. This could easily be done in pandas if the data were smaller, but with a 7 GB tick-data file we would have to split it into day or perhaps week segments and process them one by one. To avoid the simple way out, I decided to attempt to build a Beam pipeline, as it is promised to pr