- `DataFeed` class: tracks each datafeed.
- `LogSummary` class: tracks the log for each datafeed (part of the `DataFeed` object).
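A minimal sketch of how these two classes could fit together; all fields and methods here are assumptions for illustration, not the repo's current API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LogSummary:
    """Collects log lines for one datafeed run (fields are assumptions)."""
    entries: list[str] = field(default_factory=list)

    def record(self, message: str) -> None:
        self.entries.append(f"{datetime.now(timezone.utc).isoformat()} {message}")


@dataclass
class DataFeed:
    """Tracks one datafeed (e.g. "avent", "bos") through the pipeline stages."""
    name: str
    stage: str = "raw"
    log_summary: LogSummary = field(default_factory=LogSummary)

    def advance(self, next_stage: str) -> None:
        # Record the transition in this feed's own log summary.
        self.log_summary.record(f"{self.name}: {self.stage} -> {next_stage}")
        self.stage = next_stage
```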
- Instead of having code spread across `core/generic`, `core/stages`, and the Python files inside them, move all the class-related abstractions into a single file, `datafeed.py`, and the "transformations" & "actions" level code into a single folder (divided by stage if need be).
- For the datafeeds, each one, named after its feed (e.g. "avent", "bos"), gets a package in the `{root_dir}/datafeeds` directory. If custom logic is present for that feed, a file named after the stage is created there and the logic written in it (see the dispatch sketch after the tree below).
- All pure functions (functions that return the same output for the same input) should go in the `utils` package.
```
{root_dir}
|_ commands (no longer needed? unless a CLI tool is going to be built)
|_ core
|____ __init__.py
|____ datafeed.py
|____ log_summary.py
|_ datafeeds
|____ bos
|________ __init__.py
|________ normalise.py
|_ datafeeds_lambda
|_ docker
|_ legacy
|_ terraform
|_ tests
|_ utils
```
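As an illustration of this layout, stage dispatch could resolve a feed's custom module dynamically and fall back to shared logic when none exists. This is only a sketch: the `datafeeds.<feed>.<stage>` module convention and the `run`/`default_stage` names are assumptions, not existing code.

```python
import importlib
from types import ModuleType
from typing import Optional


def load_stage_module(feed: str, stage: str) -> Optional[ModuleType]:
    """Return datafeeds.<feed>.<stage> if the feed ships custom logic, else None."""
    try:
        return importlib.import_module(f"datafeeds.{feed}.{stage}")
    except ModuleNotFoundError:
        return None


def run_stage(feed: str, stage: str, payload: dict) -> dict:
    """Use the feed's custom stage module if present, otherwise the shared default."""
    module = load_stage_module(feed, stage)
    if module is not None:
        return module.run(payload)        # e.g. datafeeds/bos/normalise.py (assumed entry point)
    return default_stage(stage, payload)  # hypothetical shared implementation


def default_stage(stage: str, payload: dict) -> dict:
    # Placeholder for the shared per-stage behaviour.
    return payload
```

With this convention, adding custom `normalise` logic for "bos" is just a matter of creating `datafeeds/bos/normalise.py` with the agreed entry-point function.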
- Since the config contains a lot of repetition, we can use smart templating: hold the most-repeated settings in one base block and reference those values from each feed's config.
```
base_feed {
    names {
        raw = "raw"
        preprocess = "preprocess"
        normalised = "normalised"
        postprocess = "postprocess"
    }
    source {
        type = "sftp"
    }
    preprocess {
        s3_raw_path = "datafeed-etl-pipeline/development/datafeeds/"
        s3_preprocess_path = "datafeed-etl-pipeline/development/datafeeds/"
    }
    ...
}
```
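If the config is loaded into Python dicts, the same idea can be sketched as a deep merge of the shared base block with per-feed overrides; the `merge_config` helper and the override values below are hypothetical, not the repo's actual config API:

```python
from copy import deepcopy

# Shared block mirroring base_feed above (values taken from the example config).
BASE_FEED = {
    "names": {"raw": "raw", "preprocess": "preprocess",
              "normalised": "normalised", "postprocess": "postprocess"},
    "source": {"type": "sftp"},
}


def merge_config(base: dict, overrides: dict) -> dict:
    """Recursively overlay feed-specific values on the shared base block."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# A feed only states what differs from base_feed (hypothetical override).
bos_config = merge_config(BASE_FEED, {"source": {"host": "sftp.bos.example"}})
```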
- A markdown file for each datafeed tracking the status of custom logic no longer seems necessary.
- Whenever dictionaries or other complex data structures are passed around, improve the docstrings by adding the structure of the data to the docstring (for example, `process_efinity_datafeed`, line 56); see the sketch below.
- Add unit tests, or at least integration + functionality tests.
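A sketch of what such a docstring could look like; the parameter shape and field names are illustrative only, not the actual structure the repo passes around:

```python
def process_efinity_datafeed(record: dict) -> dict:
    """Process a single efinity record.

    Args:
        record: dict with the structure (illustrative fields):
            {
                "feed_name": str,    # e.g. "efinity"
                "stage": str,        # "raw" | "preprocess" | ...
                "rows": list[dict],  # parsed rows from the source file
            }

    Returns:
        The record dict with "rows" transformed for the next stage.
    """
    ...
```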
P.S.: This doc takes a software-engineering perspective while re-thinking the implementation of the `datafeeds-pipeline` repo, not a business perspective.