Data Analytics/Science Terms and Concepts Cheatsheet

Structured Data

Data is organized to fit a schema. Think of tables, which organize data into rows and columns.

Unstructured Data

Data is unorganized and lacks a schema. Imagine a collection of HTML documents containing text and images, not organized in any consistent way.

ETL - Extract Transform Load

  • Extract data from the source for transfer to the target storage
  • Transform the data during the transfer via appropriate logic
  • Load the transformed data into the target for analysis

Pro: Data can be immediately analyzed once loaded

Con: Transformations can slow down the transfer of data, making real-time analysis difficult

ELT - Extract Load Transform

  • Extract data from the source for transfer to the target storage
  • Load the untransformed data into the target
  • Transform the data in the target when it is needed for analysis

Pro: Can be faster for real-time data analysis

Con: Can require more storage space and compute resources, though cloud providers have largely improved/alleviated this constraint
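A minimal sketch of the ETL pattern in Python using only the standard library; the CSV file name, column names, and SQLite target are all assumptions made for illustration.

```python
import csv
import sqlite3

# Hypothetical ETL job: extract rows from a CSV export, transform them in
# flight, then load the cleaned rows into a SQLite table for analysis.

def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    for row in rows:
        # Example transformation: tidy the name and cast the numeric field.
        yield (row["name"].strip().title(), float(row["amount"]))

def load(rows, db_path="analytics.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("sales_export.csv")))
```

An ELT version of the same job would swap the last two steps: load the raw rows into the target first, then run the transformation as SQL inside the target when the data is needed.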

Data Warehouse

Platform for storing large amounts of structured data for analysis. Data must be structured before loading, which can make ingesting large flows of incoming data difficult and cumbersome.

DW Strategies

  • Inmon: Everything is cleaned and normalized upfront into a single source of truth, which is then distributed in department-level marts. The most common strategy, but queries are slow due to normalization, and sharing data between departments is difficult without creating extra marts.

  • Kimball: Start with what the data should look like to the end user (the marts), then curate the data as needed with that in mind. May involve more duplication of data and a more complex setup, but yields faster queries and more robust availability of data. Used less often because existing sources usually already tend toward Inmon.

  • Data Vault: Used for long-term historical storage because it tracks all changes, and each change holds a reference to its source as a value.

ACID (Atomicity, Consistency, Isolation, Durability)

Characteristics of database transactions that make the data transactional (all or nothing; no partial completion creating broken data)
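A minimal sketch of atomicity using Python's built-in sqlite3 module; the accounts table and transfer amounts are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 0.0)])
con.commit()

# Transfer funds atomically: either both updates commit, or neither does.
try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
        con.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
except sqlite3.Error:
    pass  # a rollback leaves the data consistent, with no partial transfer

print(con.execute("SELECT * FROM accounts").fetchall())
```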

Data Lake

Platform for storing large amounts of structured and unstructured data for analysis. Since data does not have to be prepared beforehand, it is much faster for loading real-time data.

Data can be Schema on Read, so schemas aren't enforced on write the way they are in traditional databases (see the sketch after the list of table formats below).

How the data is stored in the data lake is determined by a table format, which can help provide ACID compliance, quality governance, etc.:

  • Apache Iceberg (from Netflix)
  • Apache Hudi (From Uber)
  • Delta Lake (From Databricks)
  • Spark Tables
  • Hive Tables
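A minimal schema-on-read sketch with PySpark; the S3 path is hypothetical, and the snippet assumes pyspark is installed and the storage connector is configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Schema on read: nothing was enforced when these JSON files landed in the
# lake; Spark infers a schema at query time instead.
df = spark.read.json("s3a://my-lake/raw/events/")  # hypothetical path
df.printSchema()

df.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()
```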

OLAP - Online Analytics Processing

Processes for Business Intelligence, often using large aggregated data with complex queries. Not backed up as regularly, since the data is completely recalculated each time. Used by internal analysis professionals.

OLTP - Online Transaction Processing

Processes for day-to-day business operations that result in small changes to data with simpler queries, requiring more frequent backups (to preserve the integrity of data like bank withdrawals). Used by customers and frontline staff.

Query Languages

Languages for expressing queries to a database.

  • SQL (Structured Query Language): The most popular and standard language, used for a variety of relational databases and data platforms. `SELECT * FROM dogs;`
  • HQL (Hive Query Language): Used for SQL-like queries over Hadoop/Spark data lakes.
  • CQL (Cypher Query Language): Language used for the Neo4j graph database. `MATCH (tom:Person {name: 'Tom Hanks'}) RETURN tom`
  • CQL (Cassandra Query Language): SQL-like language used for Cassandra DB.
  • AQL (Arango Query Language): Language used for the ArangoDB graph database.
  • The document database MongoDB uses a very JavaScript-like syntax for forming queries: `db.collection.find({})`

Normalization

Eliminating redundancy in data by breaking parts of it out into smaller related tables.
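A toy illustration in Python with pandas (assumed available); the orders/customers columns are invented to show one denormalized table split into two related ones.

```python
import pandas as pd

# Denormalized: the customer's city is repeated on every order row.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer": ["Ann", "Ann", "Bob"],
    "city": ["Austin", "Austin", "Boston"],
    "total": [10.0, 25.0, 40.0],
})

# Normalized: break the repeated customer attributes into their own table
# and keep only a key in the orders table.
customers = orders[["customer", "city"]].drop_duplicates().reset_index(drop=True)
customers["customer_id"] = customers.index
orders_norm = orders.merge(customers, on=["customer", "city"])[
    ["order_id", "customer_id", "total"]
]
```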

Hadoop vs Spark

Hadoop is made of three main parts

  • HDFS: Allows you to store files among a cluster of computers
  • YARN: (Yet Another Resource Negotiator): Helps orchestrate resources for cluster operations
  • MapReduce: For processing data; relatively slow and difficult to code, writes to disk (handles only batch jobs); can support jobs written in Java, C++, Python, and R

Spark is a data processing framework that can replace MapReduce and work with different storage layers like HDFS. More expensive than using MapReduce, as the infrastructure needs lots of memory. Can support jobs written in Scala, Java, Python, and R.

  • Spark loads data into memory and processes for much faster speed
  • Can facilitate Batch, Streaming and Graph jobs

Hadoop Terminology

  • Name node: Node that tracks all the segments of data and which nodes they are stored on.

Spark Terminology

  • RDD (Resilient Distributed Dataset): the entry point for a Spark job, created via the SparkContext (see the sketch below).
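A minimal RDD word-count sketch in PySpark; the HDFS path is hypothetical, and the snippet assumes pyspark is installed.

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

# The RDD is created from the SparkContext; transformations are lazy and run
# in memory across the cluster when an action is called.
counts = (
    sc.textFile("hdfs:///data/books/")  # hypothetical HDFS path
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)
print(counts.take(10))  # the action that actually triggers the job
```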

Parquet Files

Binary columnar format meant to be the sweet spot between plain file storage (CSV/JSON) and using a database.

  • Databases can be much faster to query but much more expensive to maintain
  • Files like CSV/JSON are cheaper, but they lack a schema, are slow to query, and can't use Spark's pushdown feature

Parquet is a file representation of data that includes the schema, is fast, and can take advantage of some of Spark's advanced features.
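A minimal sketch with pandas (which relies on pyarrow for Parquet; both assumed installed); the dogs data is invented.

```python
import pandas as pd

df = pd.DataFrame({"breed": ["lab", "pug"], "weight_kg": [30.5, 8.2]})

# Parquet stores the schema (column names and types) inside the file itself.
df.to_parquet("dogs.parquet")

# Column pruning: only the requested column is read from disk, the kind of
# pushdown a CSV can't offer.
weights = pd.read_parquet("dogs.parquet", columns=["weight_kg"])
print(weights)
```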

Apache Arrow

A standard format for organizing data in memory. This format organizes the data by column instead of by row for faster, more efficient querying. (For example, if I filter data by the values in a particular column, the engine doesn't have to traverse all the data in each row, increasing speed.)
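A minimal sketch using the pyarrow library (assumed installed); the table contents are invented.

```python
import pyarrow as pa
import pyarrow.compute as pc

# Arrow lays the data out column by column in memory, so an aggregation over
# one column never touches the bytes of the others.
table = pa.table({"city": ["Austin", "Boston", "Austin"],
                  "sales": [10.0, 25.0, 40.0]})
print(pc.sum(table["sales"]))  # scans only the contiguous "sales" buffer
```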

Apache Arrow Flight is an RPC interface for moving Arrow data between systems at high speed, positioned as a modern alternative to the antiquated JDBC/ODBC connectors.

Apache Drill

A SQL query engine that can handle the JSON data format, which allows SQL to be used on flat, complex, schema-less, and schema-defined data. SQL all the things.

Hive Tables

A way of organizing data sources and data files within Hadoop/Spark to make it possible to run SQL queries on top of them. In Hadoop, HQL queries get turned into MapReduce jobs.
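A minimal sketch with PySpark's Hive support (assumes a Spark build with Hive enabled and a reachable metastore); the dogs table is invented.

```python
from pyspark.sql import SparkSession

# The table's metadata is registered in the Hive metastore, so later SQL
# queries can find the underlying data files.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS dogs (name STRING, breed STRING)")
spark.sql("INSERT INTO dogs VALUES ('Rex', 'lab')")
spark.sql("SELECT breed, count(*) FROM dogs GROUP BY breed").show()
```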

Apache Iceberg

Table format that addresses many of the pain points of Hive tables (renaming fields, supporting time travel, non-atomic changes). Instead of saving data in a directory tree, it saves lists of data files in a table-like structure with additional metadata called snapshots. Snapshots track changes in state from one snapshot to the next (kind of like git).
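A minimal time-travel sketch in Spark SQL, assuming a `spark` session already configured with the Iceberg runtime and a catalog named demo; the catalog, table, and snapshot id are all invented for illustration.

```python
# Assumes `spark` is a SparkSession configured with the Iceberg extensions
# and a catalog named "demo"; table name and snapshot id are invented.
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 1234567890123456789").show()
spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2022-01-01 00:00:00'").show()

# Every write creates a new snapshot; Iceberg exposes them as a metadata table.
spark.sql("SELECT snapshot_id, committed_at, operation "
          "FROM demo.db.events.snapshots").show()
```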

Project Nessie

Literally git for data lakes.

Dremio - Data Lakehouse

Dremio Docker Container

A Data Lake Engine that combines the benefits of a Data Lake and a Data Warehouse. With Dremio you don't have to move data from its current location (cloud data stores, CSV files, SQL databases, and certain non-SQL databases like MongoDB). With Dremio we can query and transform our data from several sources and quickly create datasets that can be used for BI dashboards and other purposes.

Benefits

  • Keep data where it is, in the format it's in
  • Free up data engineers and empower frontline data consumers to curate data
  • Control what data different consumers have access to
  • Use reflections to speed up high priority queries
  • Scale dynamically with Dremio Cloud
  • BOTTOM LINE: decrease overall cost, increase speed of data analysis

Terminology

  • PDS (Physical Data Set): data that is "promoted" from a configured external data source (database, files, etc.). This data is immutable but can be used to create Virtual Data Sets.
  • VDS (Virtual Data Set): a transformable data set created from either a PDS or other VDSs. A VDS is essentially the data set that results from a SQL query, and it can then be the subject of another query.
  • Spaces: semantic layers for organizing your VDSs (all PDSs are associated with the home space ONLY); spaces other than the home space can be shared with others.
  • Folder: unit for organizing data in a space or source; useful for setting up semantic layers.
  • Reflections: a query or part of a query that is saved as an intermediate Parquet file to speed up future queries. Reflections can be:
    • Managed (Dremio determines which reflections to build, based on designated queries/VDSs)
    • User Defined (user scripts from Spark or elsewhere, represented in SQL, to help define the reflection)
    • Refreshed on a set schedule, incrementally, or with machine-learning patterns

With SQL on Dremio, not only can we query the data:

  • we can convert the type of a column
  • pull nested data out into its own column
  • rename columns

None of these changes modify the original raw data; they help define Virtual Data Sets that can be shared through the semantic layer, giving people the data they need, how they need it, without making copies (a sketch follows below).
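A hedged sketch of that curation flow; `run_sql` is a hypothetical helper standing in for whatever Dremio client you use (JDBC/ODBC/Arrow Flight), and the space, table, and column names are invented.

```python
# run_sql is a hypothetical stand-in for a Dremio client call (JDBC/ODBC/
# Arrow Flight); the space, table, and column names are invented.
def run_sql(query: str) -> None:
    raise NotImplementedError("wire this to your Dremio client")

# Cast a column and rename another without touching the raw PDS underneath;
# the result is saved as a VDS in a shareable space.
run_sql("""
    CREATE VDS marketing.clean_orders AS
    SELECT
        CAST(order_id AS INTEGER) AS order_id,
        total_amt                 AS total
    FROM lake.raw.orders
""")
```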

Apache Calcite

SQL parser and query-optimization framework.

Relational Caching

Video

Things that determine speed

  • Distance (Disk, Memory, CPU Cache)
  • Format Ready for Processing (Columnar vs Row Format, levels of compression)
  • Relevancy (How similar is the dataset to the question being asked)

Caching Techniques

  • In-memory file pinning (load data into memory, process in memory)
  • Columnar disk caching (save the file to disk in a columnar format, e.g. Parquet)
  • In-memory block cache (save useful data in "hot" blocks of memory)
  • Near-CPU data caching (save the data in the CPU cache, faster than memory)
  • Cube relational caching (maintain sub-datasets that may be more relevant to the question)
  • Arbitrary relational caching (maintain different sub-pieces that can satisfy parts of the question faster)