@gopigof
Last active February 2, 2021 08:26

Class Abstractions

  • DataFeed class: tracks each datafeed.
  • LogSummary class: tracks the log for each datafeed (held as part of the DataFeed object).
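
A minimal sketch of how the two abstractions above could fit together, assuming Python dataclasses. The field names (name, current_stage, entries) and the methods record and advance are hypothetical and only illustrate that each DataFeed owns its own LogSummary; the stage names are borrowed from the config block in this note.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class LogSummary:
    """Collects log entries for one datafeed (hypothetical fields)."""

    entries: List[str] = field(default_factory=list)

    def record(self, stage: str, message: str) -> None:
        # Prefix each entry with a UTC timestamp and the stage it belongs to.
        timestamp = datetime.now(timezone.utc).isoformat()
        self.entries.append(f"{timestamp} [{stage}] {message}")


@dataclass
class DataFeed:
    """Tracks a single datafeed and owns its LogSummary."""

    name: str
    current_stage: str = "raw"
    # default_factory gives every DataFeed its own LogSummary instance.
    log: LogSummary = field(default_factory=LogSummary)

    def advance(self, stage: str) -> None:
        # Log the transition, then move the feed to the new stage.
        self.log.record(stage, f"{self.name}: {self.current_stage} -> {stage}")
        self.current_stage = stage


feed = DataFeed("bos")
feed.advance("preprocess")
feed.advance("normalised")
```

Keeping the log inside the DataFeed object means the code for each stage only needs to receive the feed itself, not a separate logging handle.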

New Code Directory Style

  • Instead of spreading code across core/generic, core/stages, and the Python files inside them, move all class-related abstractions into a single file, datafeed.py, and move the "transformations" and "actions" level code into a single folder (split by stage if need be).
  • Each datafeed gets a package named after the feed (e.g. "avent", "bos") in the {root_dir}/datafeeds directory. If a feed has custom logic for a stage, that logic goes in a file named after the stage inside the feed's package.
  • All pure functions (functions that always return the same output for the same input and have no side effects) should go in the utils package.
{root_dir}
|_ commands (no longer needed? unless a CLI tool is going to be built)
|_ core
|____ __init__.py
|____ datafeed.py
|____ log_summary.py
|_ datafeeds
|____ bos
|________ __init__.py
|________ normalise.py
|_ datafeeds_lambda
|_ docker
|_ legacy
|_ terraform
|_ tests
|_ utils
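
With this layout, the pipeline could look up a feed's custom logic by convention. The following is a rough sketch, not the repo's actual mechanism: it assumes each stage file (e.g. datafeeds/bos/normalise.py) exposes a function called run, and that a generic implementation is used when the file is absent. The names resolve_stage, default_normalise and run are hypothetical.

```python
import importlib
from typing import Callable, List

Stage = Callable[[List[str]], List[str]]


def default_normalise(records: List[str]) -> List[str]:
    """Generic stage used when a feed has no custom logic (placeholder)."""
    return [record.strip() for record in records]


def resolve_stage(feed: str, stage: str, default: Stage,
                  package: str = "datafeeds") -> Stage:
    """Return the feed's custom stage function, or `default` if none exists."""
    module_name = f"{package}.{feed}.{stage}"
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        missing = exc.name or ""
        # Fall back only when the stage module (or one of its parent
        # packages) is missing; re-raise if the custom module exists but
        # fails while importing some other dependency.
        if module_name == missing or module_name.startswith(missing + "."):
            return default
        raise
    # A stage file without a `run` function also falls back to the default.
    return getattr(module, "run", default)
```

A call such as resolve_stage("bos", "normalise", default_normalise) would then return the custom function if datafeeds/bos/normalise.py defines run, and default_normalise otherwise.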

Other changes

  • The config contains a lot of repetition, so we can use smart templating: hold the most frequently repeated data in one block and reference those values from there.
base_feed {
	names {
		raw = "raw"
		preprocess = "preprocess"
		normalised = "normalised"
		postprocess = "postprocess"
	}
	source {
		type = "sftp"
	}
	preprocess {
		s3_raw_path = "datafeed-etl-pipeline/development/datafeeds/"
		s3_preprocess_path = "datafeed-etl-pipeline/development/datafeeds/"
	}
	...
}
  • A markdown file for each datafeed tracking the status of custom logic no longer seems necessary.
  • Whenever dictionaries or other complex data structures are passed around, improve the docstrings of the functions involved by documenting the structure of the data (for example: process_efinity_datafeed : 56).
  • Add unit tests, or at least integration and functionality tests.
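
The templating and docstring suggestions above can be sketched together in Python. This is an illustration under assumptions, not the repo's code: it treats the base_feed block as a plain dictionary and merges per-feed overrides on top of it, with the expected structure spelled out in the docstring. The function name build_feed_config and the "host" override are hypothetical. If the config format already offers its own include or substitution mechanism, that may make a helper like this unnecessary.

```python
from copy import deepcopy
from typing import Any, Dict


def build_feed_config(base: Dict[str, Any],
                      overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a feed config: `base` with `overrides` merged on top.

    Expected structure of both arguments (keys mirror the base_feed
    block; any key may be omitted from `overrides`):

        {
            "names": {"raw": str, "preprocess": str, ...},
            "source": {"type": str, ...},
            "preprocess": {"s3_raw_path": str, "s3_preprocess_path": str},
        }

    Nested dicts are merged key by key; any other value in `overrides`
    replaces the value in `base`. Neither input is modified.
    """
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = build_feed_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


base_feed = {
    "names": {"raw": "raw", "preprocess": "preprocess"},
    "source": {"type": "sftp"},
}
# Hypothetical per-feed override: only the values that differ are listed.
bos_feed = build_feed_config(base_feed, {"source": {"host": "sftp.example.com"}})
```

Because the function is pure, it would fit the utils package proposed earlier, and it is straightforward to cover with unit tests.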

P.S.: This doc re-thinks the implementation of the datafeeds-pipeline repo from a software engineering perspective rather than a business perspective.
