Lessons learned from recent years
Lessons learned
I. Software engineering
1. Automate everything; have CI/CD and test automation from the beginning
2. Security shall not be an afterthought (have TLS and avoid plain text passwords from the start)
3. Components shall be stateless; state shall be extracted and kept in an external store (key-value store or database)
4. Favor open source and off-the-shelf solutions instead of building proprietary solutions
5. Avoid hard-coding dependencies; inject them via command line parameters, configuration files, environment variables, etc.
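
Point 5 can be sketched roughly as follows. This is a minimal illustration, not a prescribed layout: the setting names (`db_url`, `batch_size`), the environment variables (`APP_DB_URL`, `APP_BATCH_SIZE`), and the default values are all hypothetical. It assumes a precedence of command line over environment over built-in default.

```python
import argparse
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class Settings:
    db_url: str      # hypothetical setting: location of the external state store
    batch_size: int  # hypothetical setting: records handled per batch


def load_settings(argv: Optional[Sequence[str]] = None,
                  env: Optional[Mapping[str, str]] = None) -> Settings:
    """Resolve settings with precedence: command line > environment > default."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # The environment value (or a built-in fallback) becomes the argparse
    # default, so an explicit flag on the command line always wins.
    parser.add_argument("--db-url",
                        default=env.get("APP_DB_URL", "sqlite:///local.db"))
    parser.add_argument("--batch-size", type=int,
                        default=int(env.get("APP_BATCH_SIZE", "100")))
    args = parser.parse_args(argv)
    return Settings(db_url=args.db_url, batch_size=args.batch_size)
```

Calling `load_settings()` with no arguments reads the real command line and environment; passing `argv` and `env` explicitly keeps the function easy to test, since nothing is read from global state.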
II. Data engineering
1. Have a strategy to handle delayed records
2. Use decent scheduling software to manage job dependencies and handle retries
3. Make data processing jobs idempotent
4. Code the data pipeline in UTC
5. Have a data cleansing job as the first job in the data pipelines
6. Divide and conquer instead of growing one shared monster main script
7. Reading a large amount of data and filtering/projecting out unneeded fields is often cheaper than customized data logging, collection, and transportation
8. Hadoop is better at handling a few large files than many small files
9. Related information shall be logged together to avoid expensive joins in later processing
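
Points 1, 3, and 4 of this list fit together, and one possible shape is sketched below. The record fields (`ts`, `user`, `amount`) and the one-JSON-file-per-day layout are assumptions made for illustration only. The job buckets records by UTC day and overwrites that day's output rather than appending to it, so a retry, or a re-run after delayed records arrive, produces the same file instead of double-counting.

```python
import json
import os
import tempfile
from datetime import datetime, timedelta, timezone
from pathlib import Path


def utc_day(ts: datetime) -> str:
    """Return the UTC calendar day (YYYY-MM-DD) of a timezone-aware timestamp."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamp: attach a timezone before processing")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def run_daily_job(records, out_dir: Path, day: str) -> Path:
    """Sum `amount` per user for one UTC day, replacing any earlier output."""
    totals = {}
    for rec in records:
        if utc_day(rec["ts"]) == day:
            totals[rec["user"]] = totals.get(rec["user"], 0) + rec["amount"]

    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{day}.json"
    # Write to a temporary file in the same directory, then rename it over
    # the target, so readers never see a half-written result.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(totals, f, sort_keys=True)
    os.replace(tmp_path, target)
    return target
```

With this shape, handling a delayed record amounts to re-running `run_daily_job` for the affected day once the record is available in the input. A scheduler can also retry the job freely, because each run fully replaces the previous output for that day.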