ianhancock-riotinto

1. Design Purpose

This design creates secure, automated, and scalable processes for ingesting external data into the Lakehouse, aligned with the defined user stories. It provides strict security controls, automated processing, malware and compliance checks, quarantining of files that fail those checks, and safe transfer of validated data.
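The check-then-route behaviour described above can be sketched as follows. This is a minimal illustration only: the `ScanResult` fields, the `Destination` names, and the `route_file` function are hypothetical and do not come from the design itself, which does not define a scan-result schema.

```python
from dataclasses import dataclass
from enum import Enum


class Destination(Enum):
    # Hypothetical destination labels, used only for this illustration.
    QUARANTINE = "quarantine"
    LAKEHOUSE = "lakehouse"


@dataclass(frozen=True)
class ScanResult:
    # Hypothetical fields; the design does not specify a scan-result schema.
    malware_clean: bool
    compliance_passed: bool


def route_file(result: ScanResult) -> Destination:
    """Send a file onward only if every check passed; otherwise quarantine it."""
    if result.malware_clean and result.compliance_passed:
        return Destination.LAKEHOUSE
    return Destination.QUARANTINE
```

Defaulting to quarantine when any check fails keeps the routing fail-closed, which matches the stated goal of transferring only validated data.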

2. Scope

In Scope:

  • Creation of a secure S3 landing bucket with enforced access control and encryption.
  • AWS Transfer Family and ingestion API for controlled external data delivery.

High-Level Design Document: Machine Utilization Data Analysis with AWS

1. Purpose

This document outlines the high-level architecture for analyzing industrial machine utilization data using AWS cloud services. The key objective is reliable, structured, and secure data ingestion and processing, which supports decision-making through better visibility and predictive analytics.

2. In Scope and Out of Scope

In Scope

The Key Decision Document (KDD) chose Option 2: support both methods for Change Data Capture (CDC), namely Microsoft CDC and Microsoft Transactional Replication. Based on that decision, the following changes are recommended for the Data Ingestion and CDC sections of the High-Level Design (HLD).

Data Sources & Ingestion Section Updates:

Current State (HLD, pages 11-13):

  • Primarily focused on data ingestion through CDC from databases, utilizing standard methods.
  • Preference towards using Qlik Replicate as the ingestion tool.

Recommended Changes:

Here's detailed feedback aligned with Big Andy's responsibilities, goals, and pain points:

Strengths of the Design:

  • High Availability: Clear consideration of high availability through multi-AZ deployment of both EM and Replicate nodes, addressing critical business continuity.
  • Automation: The use of API-driven interactions and automated recovery significantly improves reliability and reduces downtime.
  • Scalability and Performance: Recognizes scalability via adding nodes horizontally and using enhanced AWS networking, beneficial for mining operations requiring timely data.

User Story: Secure S3 Landing Zone

Description: As a Data Platform Engineer, I want to create a dedicated S3 bucket in the isolated EDIZ account enforcing strict access controls (bucket policies with service-linked roles) and encryption, so that incoming files are securely received and isolated prior to processing.

Acceptance Criteria:

  • Bucket created in isolated EDIZ AWS account
  • Access controlled via bucket policies and service-linked roles
  • Encryption at rest enabled (AES-256 or AWS KMS)
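
As a rough illustration of the encryption criterion, the sketch below builds a bucket policy document that denies non-TLS requests and denies uploads that do not request server-side encryption with AES-256 or AWS KMS. The function name `build_landing_bucket_policy` and the bucket name are assumptions made for this example. The statements that would grant access to the specific service-linked roles are omitted, because the user story does not name those roles.

```python
import json


def build_landing_bucket_policy(bucket_name: str) -> dict:
    """Build a deny-by-default policy sketch for a landing bucket.

    The bucket name is supplied by the caller; nothing here is taken
    from the actual EDIZ account configuration.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    object_arn = f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that does not use TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, object_arn],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Reject uploads whose encryption header is missing or is
                # neither AES256 nor aws:kms.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": object_arn,
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": ["AES256", "aws:kms"]
                    }
                },
            },
        ],
    }
```

Calling `json.dumps(build_landing_bucket_policy("example-landing-bucket"), indent=2)` yields a document that could be attached as a bucket policy. Note that the second statement also rejects uploads that rely solely on the bucket's default encryption without sending the header, so whether to include it is a design choice.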