PostMortem/Root Cause Analysis for AES EDI | JIRA Issue (AESEDI-53447)
Prasanjit Singh Operations Team
Customer data was reported as not sent from AES EDI. The investigation showed that the file with the data was sent; however, it was not processed because of an issue with the AES CIS service.
About 486,000 records were affected, as was the EDI-to-CIS monitoring service.
Sending files with a large number of records while the patching script was running disrupted data processing. This issue with the AES CIS service then escalated: data was not processed, which led to missing records.
A large number of files were not processed.
Reloading the AES CIS monitoring service allowed us to spot the missed records that had not been discovered automatically. Following this, the data file was resent, which resolved the issue.
The customer created a Jira ticket to alert us to this failure. Please refer to Jira issue AESEDI-53447.
|Write a monitoring policy to detect missing records||prevent||Prasanjit Singh||DONE|
|Monitor the data ingesters and processors (ETL)||prevent||Prasanjit Singh||TODO (Jira issue AESCIS-38263)|
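A monitoring policy for missing records could be built around a simple reconciliation between what AES EDI sent and what AES CIS processed. The sketch below is illustrative only: the function name and the idea that both sides expose record IDs as strings are assumptions, not a description of the policy that was actually written.

```python
from typing import Iterable, List


def find_missing_records(sent_ids: Iterable[str],
                         processed_ids: Iterable[str]) -> List[str]:
    """Return IDs that were sent but never processed.

    Hypothetical helper: assumes both systems can list record IDs.
    The result keeps the order in which records were sent and
    reports each missing ID only once.
    """
    processed = set(processed_ids)
    already_reported = set()
    missing = []
    for record_id in sent_ids:
        if record_id not in processed and record_id not in already_reported:
            missing.append(record_id)
            already_reported.add(record_id)
    return missing


# Example: three records sent, one never reached the processed side.
missing = find_missing_records(["r1", "r2", "r3"], ["r1", "r3"])
if missing:
    print(f"{len(missing)} record(s) missing: {missing}")
```

A check like this, run on a schedule after each file transfer, would flag a gap without waiting for a customer ticket.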
- More monitoring plugins and modules are needed to watch this critical part of our infrastructure.
- Slack notifications should be added to alert the team whenever a data discrepancy is detected, so that such occurrences are prevented in future.
- Patching operations should not be executed while data processing is in progress at AES EDI.
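One possible way to enforce the last point is a shared lock that both the data processing job and the patching script must hold before they start. The following is a minimal sketch under stated assumptions: the lock path, the `exclusive_job` helper and the `ProcessingInProgress` exception are hypothetical names, and both jobs are assumed to run on a host that shares the same filesystem.

```python
import os
from contextlib import contextmanager


class ProcessingInProgress(RuntimeError):
    """Raised when another job already holds the lock."""


@contextmanager
def exclusive_job(lock_path: str):
    """Hold a lock file for the duration of a job (hypothetical helper).

    O_CREAT | O_EXCL makes creation atomic: if the file already
    exists, os.open fails instead of silently reusing it.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise ProcessingInProgress(f"lock is held: {lock_path}")
    try:
        # Record the owner PID so a stale lock can be investigated.
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))
        yield
    finally:
        os.remove(lock_path)
```

The data processing job would wrap its run in `with exclusive_job(path):`, and the patching script would do the same with the same path, exiting early on `ProcessingInProgress`. A job that crashes without cleanup leaves a stale lock behind, so a real deployment would also need a way to detect and clear it.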
2019-06-07 (all times UTC)
|11:56||Missing files discovered|
|12:00||AES CIS monitoring module restarted|
|12:15||Data processing of the record files started|
|13:00||Data processing of all 486,000 records completed|