Data Engineering problem from a LinkedIn post

Original LinkedIn discussion: here

I am creating this document for brainstorming and future reference.

Problem Description

As a data engineer, how do you solve this problem:

  • You have billions of records of JSON documents, which could be complex and nested.
  • You often need to run analytic queries on particular fields that traverse a large number of records. What is the best approach to architect this?

You might already have an Elasticsearch or MongoDB setup to look up and query the data, but it is getting slow and expensive to run these queries.

  • If you choose to parse and load the data into a columnar DB like BigQuery, Athena, or Snowflake, then how do you manage the parsing workload, and how do you manage uncertainty in the incoming schema?

My Approach

Here is how I would approach this problem:

Assumptions

  • a record is a typical JSON document with some fixed fields and some varying/nested members
  • a document is 1 KB on average (approx. 1,000 bytes)
  • a billion documents will be 1,000,000,000 * 1,000 bytes, which is approx. 1 TB of data (see the quick check after this list)
  • typically we have some knowledge of the fields we put in the SELECT and WHERE parts of the query, i.e. we know what the queried fields mean before running the query, right?
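A quick back-of-the-envelope check of that size estimate in Python (the 1 KB average is an assumption, not a measurement):

avg_doc_size_bytes = 1_000         # assumed average size of one JSON document
num_docs = 1_000_000_000           # one billion documents

total_bytes = avg_doc_size_bytes * num_docs
print(total_bytes / 10**12, "TB")  # -> 1.0 TB (decimal terabytes)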

Problems to solve

  • queries should run fast
    • (how fast? assume a baseline of a couple of seconds to start with)
  • cost-effective
    • (depends on the workload and the actual cost at the moment; consider out-of-core processing)
  • how to parse the documents to extract query results?
    • (pre-process the data, to avoid parsing at the time of querying data)
  • incoming data can be inconsistent with the schema
    • (solved using schema versioning)

Approach

  • First, define a strict schema, say S1 (this will be the first version of your schema).
  • You need two queues Q1 and Q2 (or channels, or streams, or whatever you call them).
  • When you parse an incoming document and there are no parse failures and it matches your schema, send it to Q1, i.e. the success queue.
  • All other failed documents go to Q2 (the failure queue). Some also call it a dead-letter queue.
  • Anything coming from Q1 gets stored into your DATABASE with schema S1.
  • Anything coming from Q2 gets dumped into JSONL-formatted log files, which also contain the schema version. A sketch of this parse-and-route step follows this list.
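Here is a minimal sketch of the parse-and-route step, assuming Python with the jsonschema package for validation. The schema contents, field names, and queue objects (anything with a .put() method, e.g. Kafka producer wrappers, SQS clients, or a plain queue.Queue in tests) are placeholders, not part of the original problem:

import json
import traceback
from datetime import datetime, timezone

from jsonschema import ValidationError, validate  # assumes `pip install jsonschema`

# Placeholder strict schema S1: only the fields we understand and intend to query.
SCHEMA_S1 = {
    "type": "object",
    "required": ["documentId", "createdAt", "amount"],
    "properties": {
        "documentId": {"type": "string"},
        "createdAt": {"type": "string"},
        "amount": {"type": "number"},
    },
}

def route_document(raw: str, schema: dict, schema_version: str, q1_success, q2_failure) -> None:
    """Parse one raw JSON document; send it to Q1 on success, to Q2 (dead-letter) on failure."""
    doc = None
    try:
        doc = json.loads(raw)                  # parse failure -> Q2
        validate(instance=doc, schema=schema)  # schema mismatch -> Q2
        q1_success.put(doc)
    except (json.JSONDecodeError, ValidationError) as exc:
        # Failure record shaped like the example shown below.
        q2_failure.put({
            "documentId": doc.get("documentId") if isinstance(doc, dict) else None,
            "source": raw,
            "schema": schema_version,
            "cause_of_failure": {"reason": str(exc), "stack_trace": traceback.format_exc()},
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })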

A typical record of a failed document:

{
	"documentId": "d3",
	"source": "<the original document, stored verbatim>",
	"schema": "S2",
	"cause_of_failure": {"reason": "a bug or schema mismatch or network issue etc.", "stack_trace": ""},
	"timestamp": "<time of failure>"
}
  • Grep and manually analyze the failed records coming from Q2. An LLM-based code-assist tool may be helpful in this step.
  • Update your schema S1 to a new version S2. Also fix the code for the parsing exceptions.
  • Re-deploy your pipeline with the updated code and the new schema S2. Re-drive the failed documents from Q2 into your system, along with the new incoming data.

This will be an iterative process. However, you can iterate quickly if you start with a small subset of the data, using CI/CD and appropriate tests.
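As an illustration, here is a hedged sketch of the re-drive step: it reads the JSONL dead-letter file and replays each original document against the updated schema S2, reusing the route_document helper from the sketch above (the file path, the contents of S2, and the queue objects are all assumptions):

import json

# Hypothetical widened schema S2, informed by the failure analysis
# (e.g. `amount` sometimes arrives as a string).
SCHEMA_S2 = {
    "type": "object",
    "required": ["documentId", "createdAt"],
    "properties": {
        "documentId": {"type": "string"},
        "createdAt": {"type": "string"},
        "amount": {"type": ["number", "string"]},
    },
}

def redrive_dead_letter(path: str, q1_success, q2_failure) -> None:
    """Replay every record from a JSONL dead-letter file against schema S2."""
    with open(path) as fh:
        for line in fh:
            failure_record = json.loads(line)
            original = failure_record["source"]  # the original document, stored verbatim
            route_document(original, SCHEMA_S2, "S2", q1_success, q2_failure)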

Database

You now have

  • a strict schema with proper columns and types
  • maybe indexed
  • an optimized schema in a columnar format

This can be stored in the following ways:

  • just plain CSV files
  • Parquet files or the Arrow format (a Parquet sketch follows below)
  • or some RDBMS
  • for multi-user, concurrent queries, use hosted solutions which support low-latency querying: BigQuery, Snowflake, etc.

It is very likely that at this point the data is much smaller than 1 TB (because it is a subset of the full schema).
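To illustrate the Parquet option, here is a minimal sketch that writes validated records to a Parquet dataset partitioned by schema version (pandas with the pyarrow engine is assumed, and the column names are the hypothetical ones from the earlier sketches):

import pandas as pd  # assumes `pip install pandas pyarrow`

# Validated records coming off Q1, already flattened to the strict schema.
records = [
    {"documentId": "d1", "createdAt": "2024-06-01T10:00:00Z", "amount": 12.5, "schema_version": "S1"},
    {"documentId": "d2", "createdAt": "2024-06-01T10:01:00Z", "amount": 7.0, "schema_version": "S2"},
]

df = pd.DataFrame.from_records(records)

# Writes e.g. warehouse/records/schema_version=S1/<part>.parquet
df.to_parquet("warehouse/records", partition_cols=["schema_version"], index=False)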

These tools (DuckDB, Polars, and Vaex, for example) can work on TBs of data on a single machine.
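For instance, a hedged sketch of an analytic query run directly over the Parquet files with DuckDB (the paths and field names are the same hypothetical ones as above):

import duckdb  # assumes `pip install duckdb`

# DuckDB scans the Parquet files in place; no separate warehouse is needed.
con = duckdb.connect()  # in-memory database

result = con.execute(
    """
    SELECT schema_version,
           count(*)    AS docs,
           sum(amount) AS total_amount
    FROM read_parquet('warehouse/records/*/*.parquet', hive_partitioning = true)
    GROUP BY schema_version
    ORDER BY schema_version
    """
).fetchdf()

print(result)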

Key Ideas

  • Understand the data first; this will give us the initial strict schema, i.e. a subset of fields from the incoming documents. We can only run meaningful analytics queries on fields whose meaning we understand.
  • Transform unstructured data into a columnar / tabular format. Use columnar data-file formats, indexes, materialized views, and appropriate data types to make queries run faster.
  • 1 TB of data can be queried on a single machine and doesn't really need those expensive services. We can run such queries using tools like DuckDB, Polars, Vaex, etc. (a Polars sketch follows this list).
  • If the number of concurrent queries is large or there are many users running queries, then we may need proper DB services, whether on the cloud, self-hosted, or on-premise.
  • We should pick stream processing for real-time data, and batch processing for mostly-static data.
  • In case our data is sourced from an event-driven system, we can also have a field that tells us the order of events and the event-schema version. Maybe we only need change-sets to rebuild a final version of the event data. Not sure of this from the problem description above.
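A sketch of that single-machine, out-of-core route with Polars: lazily scan the Parquet files and run a streaming aggregation (a reasonably recent Polars version is assumed, and the paths and column names are again the hypothetical ones used above):

import polars as pl  # assumes `pip install polars`

# Lazily scan the partitioned Parquet dataset; nothing is read until .collect().
result = (
    pl.scan_parquet("warehouse/records/**/*.parquet")
    .with_columns(pl.col("createdAt").str.slice(0, 10).alias("day"))
    .group_by("day")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .sort("day")
    .collect(streaming=True)  # streaming / out-of-core execution for data larger than RAM
)
print(result)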