@misho-kr
Last active May 1, 2024 06:41
Summary of "Introduction to NoSQL" from DataCamp.com

NoSQL databases have seen a surge in popularity due to their ability to handle large volumes of unstructured data. Learning to leverage NoSQL tools such as column-oriented, document, key-value, and graph databases allows data engineers, analysts, and scientists to interact with a wider breadth of data. These NoSQL databases bring additional functionality to data storage and retrieval that isn’t available in traditional relational databases.

By Jake Roach, Senior AI Engineer

Introduction to NoSQL databases

Basics of NoSQL databases. Identify the pros and cons of using column-oriented, document, key-value, and graph databases to make data available to downstream processes and data consumers.

  • Traditional relational data stores (RDBMS)
    • Organize data in tables, using columns and rows
    • Leverage SQL to manage and manipulate data
    • Enforce integrity through constraints on databases and tables
  • NoSQL stands for "not only SQL"
    • Allows for a wider range of data to be persisted and accessed
    • More flexible with regard to schema; captures data as it changes
    • Better scaling and performance
  • NoSQL data stores
    • Column-oriented - Snowflake
    • Document - Postgres
    • Key-value - Redis
    • Graph
import snowflake.connector

conn = snowflake.connector.connect(
  user="<user>",
  password="<password>",
  account="<account_identifier>",
  database="<database_name>",
  schema="<schema_name>",
  warehouse="<warehouse_name>"
)

# Build a query in a string (or multi-line string)
query = """
SELECT title, price
FROM books
WHERE price < 50.00;
"""

# Execute the query, print the results
results = conn.cursor().execute(query).fetch_pandas_all()
print(results)
import sqlalchemy

# Create a connection string, and an engine
connection_string = "postgresql+psycopg2://<user>:<password>@<host>:<port>/<database>"
db_engine = sqlalchemy.create_engine(connection_string)

import pandas as pd

# Build the query
query = """
SELECT
  books -> 'title' AS title,
  books -> 'price' AS price
FROM data_science_resources;
"""

# Execute the query
result = pd.read_sql(query, db_engine)
print(result)

Column-oriented Databases

Snowflake tables, fundamentals of micro-partitioning and data clustering to optimize query performance with query pruning. Common table expressions, materialized and non-materialized views to streamline analytics workflows. Semi-structured data with Snowflake's VARIANT types.

  • Row-oriented vs. column-oriented databases
    • Column-oriented databases are used for analytics workflows
    • Perform well when loading, updating, or deleting data in bulk
COPY INTO books
FROM 'file://data_science_books.csv'
FILE_FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
);

CREATE OR REPLACE TABLE premium_books AS
SELECT * FROM books
WHERE price > 50.00;
  • Micro-partitioning data with Snowflake
    • Creates smaller "chunks" of rows, stored in columnar format
    • Stores metadata about each partition
    • Query pruning to reduce the amount of data accessed
    • Efficient execution of DML (data manipulation language)
  • Data clustering with Snowflake
    • Organizing or grouping similar data points together
    • Automatically performed during data load
    • Decreasing data accessed during execution
    • Improved query performance
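As a sketch, a clustering key can also be declared explicitly in Snowflake with CLUSTER BY (the table and column names below are illustrative, not from the course):

```sql
-- Define a clustering key at table-creation time
CREATE TABLE books (
  title STRING,
  author STRING,
  price NUMBER(10, 2)
) CLUSTER BY (author);

-- Or add a clustering key to an existing table
ALTER TABLE books CLUSTER BY (price);
```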
  • Common table expressions (CTEs)
    • Named sub-queries/temporary tables, defined using the WITH keyword
    • Creates an object that can later be queried
    • Reduce the amount of data that is being queried and/or JOIN'ed
    • More modular, easier to troubleshoot
WITH premium_books AS (
  SELECT title, author, avg_reviews
  FROM books
  WHERE price > 25.00
)
SELECT author, MIN(avg_reviews) AS min_avg_reviews, MAX(avg_reviews) AS max_avg_reviews
FROM premium_books
GROUP BY author;
  • Views with Snowflake
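The materialized and non-materialized views mentioned above can be sketched as follows (view names are illustrative):

```sql
-- Non-materialized view: the query runs each time the view is selected
CREATE VIEW premium_books_view AS
SELECT title, author, avg_reviews
FROM books
WHERE price > 25.00;

-- Materialized view: results are precomputed and kept current by Snowflake
CREATE MATERIALIZED VIEW premium_books_mv AS
SELECT title, author, avg_reviews
FROM books
WHERE price > 25.00;
```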
  • Semi-structured data in Snowflake
    • OBJECT is similar to dictionaries in Python
    • ARRAY is similar to lists in Python
    • VARIANT type stores semi-structured data in a single column
    • Bracket and dot notations
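A minimal sketch of the bracket and dot notations against a VARIANT column (assuming a table raw_books with a VARIANT column v; these names are illustrative):

```sql
SELECT
  v:title::STRING        AS title,       -- colon notation for a top-level field
  v:author.name::STRING  AS author_name, -- dot notation for nested fields
  v['price']::NUMBER     AS price        -- bracket notation
FROM raw_books;
```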

Document Databases

Postgres JSON and the intricacies of semi-structured data. Extract and transform top-level and nested document data using built-in operators and functions.

  • JSON and JSONB in Postgres
  • Turning tabular data into JSON format
  • Extracting keys from JSON
  • Querying JSON data with Postgres
    • -> operator takes a key, returns field as JSON
    • ->> operator takes a key, returns field as text
  • Finding the type of data stored in JSON objects
INSERT INTO students (school, age, address, parent_meta) VALUES (
  'GP', 18, 'U', '{"guardian": "mother", ... "P2": "at_home"}}'
);

COPY students FROM 'students.csv' DELIMITER ',' CSV HEADER;

SELECT row_to_json(row(
  school, age, address
))
FROM students;

SELECT json_object_keys(parent_meta)
FROM students;

SELECT
  parent_meta -> 'guardian' AS guardian,
  parent_meta ->> 'status' AS status
FROM students;

SELECT
  parent_meta -> 'jobs' ->> 'P1' AS jobs_P1,
  parent_meta -> 'jobs' ->> 'P2' AS jobs_P2
FROM students;

SELECT
  parent_meta -> 'educations' ->> 0,
  parent_meta -> 'educations' ->> 1
FROM students;

SELECT json_typeof(parent_meta -> 'jobs')
FROM students;
  • Advanced Postgres JSON query techniques
    • #> and #>> operators
    • json_extract_path and json_extract_path_text functions
SELECT
  parent_meta #> '{jobs}' AS jobs,
  parent_meta #> '{jobs, P1}' AS jobs_P1,
  parent_meta #> '{jobs, income}' AS income,
  parent_meta #>> '{jobs, P2}' AS jobs_P2
FROM students;

SELECT
  json_extract_path(parent_meta, 'jobs') AS jobs,
  json_extract_path(parent_meta, 'jobs', 'P1') AS jobs_P1,
  json_extract_path(parent_meta, 'jobs', 'income') AS income,
  json_extract_path_text(parent_meta, 'jobs', 'P2') AS jobs_P2
FROM students;
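The distinction between -> (returns JSON) and ->> (returns text), and the path-extraction operators, can be mimicked in plain Python with the json module (a rough analogue for intuition only, not Postgres itself; the document below is illustrative):

```python
import json

# A parent_meta document shaped like the one used in the queries above
parent_meta = json.loads(
    '{"guardian": "mother", "jobs": {"P1": "teacher", "P2": "at_home"}}'
)

def arrow(doc, key):
    """Rough analogue of ->: returns the field as a JSON value (Python object)."""
    return doc[key]

def arrow_text(doc, key):
    """Rough analogue of ->>: returns the field serialized as text."""
    value = doc[key]
    return value if isinstance(value, str) else json.dumps(value)

def extract_path(doc, *keys):
    """Rough analogue of #> / json_extract_path: walk a sequence of keys."""
    for key in keys:
        doc = doc[key]
    return doc

print(arrow(parent_meta, "jobs"))               # still a JSON object (a dict)
print(arrow_text(parent_meta, "guardian"))      # plain text: mother
print(extract_path(parent_meta, "jobs", "P2"))  # at_home
```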

Key-value and Graph Databases

Fundamentals of key-value databases. Common use cases for key-value data; using Redis and Python to read and write data. Basic overview of graph databases.

  • A NoSQL database that uses simple key-value pairs (similar to dictionaries in Python) to store data
    • Can be searched by key, not the value
    • Store values of type string, hash, or lists
    • Performant for simple read and writes of data
    • Data is stored in memory, instead of on disk
  • Commonly used in web applications, for:
    • Session management
    • Caching frequently accessed data
    • Track user preferences and behavior
  • Key-value databases can act as sources for analytics platforms such as Snowflake and Redshift
# Import redis, make a connection
import redis

r = redis.Redis(...)

# Store key-value pairs
r.set("username", "JDoe")
r.set("age", 27)
r.set("username", "BSmith")  # overwrites the previous value of "username"

# Retrieve the key-value pair
username = r.get("username")

# Store a dictionary using .hset()
r.hset(
  "shopping_cart",
  mapping={
    "item_id": "1003",
    "quantity": 2,
    "price": 79.99
  }
)

# Retrieve the dictionary
r.hgetall("shopping_cart")
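The caching use case above can be illustrated without a running Redis server by a tiny in-memory analogue of the SETEX command (class and key names, and the TTL value, are illustrative):

```python
import time

class TinyCache:
    """A minimal key-value cache with per-key expiry, mimicking Redis SETEX."""

    def __init__(self):
        self._store = {}

    def setex(self, key, ttl_seconds, value):
        # Store the value together with the time at which it expires
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # evict the stale entry
            return None
        return value

cache = TinyCache()
cache.setex("session:42", 60, "JDoe")
print(cache.get("session:42"))  # JDoe (still fresh)
print(cache.get("missing"))     # None
```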
  • Graph databases
    • NoSQL data stores that persist data in a network of nodes and edges
    • Each node represents an entity
    • Each edge represents a relationship between those entities
  • Graph databases are commonly used for
    • Social networks
    • Recommendation engines
    • Fraud detection
    • Studying patterns and relationships
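The node-and-edge model described above can be sketched in plain Python as an adjacency structure (a toy model of the concept, not a real graph database; entity and relationship names are illustrative):

```python
# Nodes are entities; edges are labeled relationships between them
nodes = {"alice", "bob", "carol"}
edges = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("alice", "follows", "carol"),
]

def neighbors(node, relationship):
    """Return all entities that `node` points at via `relationship`."""
    return {dst for src, rel, dst in edges if src == node and rel == relationship}

def followers(node):
    """Traverse edges in the other direction: who follows `node`?"""
    return {src for src, rel, dst in edges if dst == node and rel == "follows"}

print(neighbors("alice", "follows"))  # the entities alice follows
print(followers("carol"))             # the entities that follow carol
```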