- `DataFeed` class: tracks each datafeed.
- `LogSummary` class: tracks the log for each datafeed (part of the `DataFeed` object).
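A minimal sketch of how these two classes could fit together; all fields and methods here are assumptions for illustration, not the repo's current API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LogSummary:
    """Collects log lines for one datafeed run (fields are assumptions)."""
    entries: list[str] = field(default_factory=list)

    def record(self, message: str) -> None:
        self.entries.append(f"{datetime.now(timezone.utc).isoformat()} {message}")


@dataclass
class DataFeed:
    """Tracks one datafeed (e.g. "avent", "bos") through the pipeline stages."""
    name: str
    stage: str = "raw"
    log_summary: LogSummary = field(default_factory=LogSummary)

    def advance(self, next_stage: str) -> None:
        # Record the transition in this feed's own log summary.
        self.log_summary.record(f"{self.name}: {self.stage} -> {next_stage}")
        self.stage = next_stage
```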
- Instead of having code spread across `core/generic`, `core/stages`, and the Python files inside them, move all the class-related abstractions into a single file, `datafeed.py`, and the "transformations" & "actions" level code into a single folder (divided by stage if need be).
- For the datafeeds, each one, named after its feed (e.g. "avent", "bos"), gets a package in the `{root_dir}/datafeeds` directory. If custom logic is present for that feed, a file named after the stage is created there and the logic written in it (see the dispatch sketch after the tree below).
- All pure functions (functions that return the same output for the same input) should go in the `utils` package.
```
{root_dir}
|_ commands (no longer needed? unless a CLI tool is going to be built)
|_ core
|____ __init__.py
|____ datafeed.py
|____ log_summary.py
|_ datafeeds
|____ bos
|________ __init__.py
|________ normalise.py
|_ datafeeds_lambda
|_ docker
|_ legacy
|_ terraform
|_ tests
|_ utils
```
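As an illustration of this layout, stage dispatch could resolve a feed's custom module dynamically and fall back to shared logic when none exists. This is only a sketch: the `datafeeds.<feed>.<stage>` module convention and the `run`/`default_stage` names are assumptions, not existing code.

```python
import importlib
from types import ModuleType
from typing import Optional


def load_stage_module(feed: str, stage: str) -> Optional[ModuleType]:
    """Return datafeeds.<feed>.<stage> if the feed ships custom logic, else None."""
    try:
        return importlib.import_module(f"datafeeds.{feed}.{stage}")
    except ModuleNotFoundError:
        return None


def run_stage(feed: str, stage: str, payload: dict) -> dict:
    """Use the feed's custom stage module if present, otherwise the shared default."""
    module = load_stage_module(feed, stage)
    if module is not None:
        return module.run(payload)        # e.g. datafeeds/bos/normalise.py (assumed entry point)
    return default_stage(stage, payload)  # hypothetical shared implementation


def default_stage(stage: str, payload: dict) -> dict:
    # Placeholder for the shared per-stage behaviour.
    return payload
```

With this convention, adding custom `normalise` logic for "bos" is just a matter of creating `datafeeds/bos/normalise.py` with the agreed entry-point function.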
- Since the config contains a lot of repetition, we can use smart templating: hold the most-repeated settings in one base block and reference those values from each feed's config.
```
base_feed {
    names {
        raw = "raw"
        preprocess = "preprocess"
        normalised = "normalised"
        postprocess = "postprocess"
    }
    source {
        type = "sftp"
    }
    preprocess {
        s3_raw_path = "datafeed-etl-pipeline/development/datafeeds/"
        s3_preprocess_path = "datafeed-etl-pipeline/development/datafeeds/"
    }
    ...
}
```
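If the config is loaded into Python dicts, the same idea can be sketched as a deep merge of the shared base block with per-feed overrides; the `merge_config` helper and the override values below are hypothetical, not the repo's actual config API:

```python
from copy import deepcopy

# Shared block mirroring base_feed above (values taken from the example config).
BASE_FEED = {
    "names": {"raw": "raw", "preprocess": "preprocess",
              "normalised": "normalised", "postprocess": "postprocess"},
    "source": {"type": "sftp"},
}


def merge_config(base: dict, overrides: dict) -> dict:
    """Recursively overlay feed-specific values on the shared base block."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# A feed only states what differs from base_feed (hypothetical override).
bos_config = merge_config(BASE_FEED, {"source": {"host": "sftp.bos.example"}})
```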
- A markdown file for each datafeed tracking the status of custom logic no longer seems necessary.
- Whenever dictionaries or other complex data structures are passed around, improve the docstrings by adding the structure of the data to the docstring (for example, `process_efinity_datafeed`, line 56); see the sketch below.
- Add unit tests, or at least integration + functionality tests.
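A sketch of what such a docstring could look like; the parameter shape and field names are illustrative only, not the actual structure the repo passes around:

```python
def process_efinity_datafeed(record: dict) -> dict:
    """Process a single efinity record.

    Args:
        record: dict with the structure (illustrative fields):
            {
                "feed_name": str,    # e.g. "efinity"
                "stage": str,        # "raw" | "preprocess" | ...
                "rows": list[dict],  # parsed rows from the source file
            }

    Returns:
        The record dict with "rows" transformed for the next stage.
    """
    ...
```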
P.S.: This doc takes a software-engineering perspective while re-thinking the implementation of the `datafeeds-pipeline` repo, not a business perspective.