@misho-kr
Last active May 1, 2024 06:41
Summary of "Introduction to NoSQL" from DataCamp.com

NoSQL databases have seen a surge in popularity due to their ability to handle large volumes of unstructured data. Learning to leverage NoSQL tools such as column-oriented, document, key-value, and graph databases allows data engineers, analysts, and scientists to interact with a wider breadth of data. These NoSQL databases bring additional functionality to data storage and retrieval that isn’t available in traditional relational databases.

By Jake Roach, Senior AI Engineer

Introduction to NoSQL databases

Basics of NoSQL databases. Identify the pros and cons of using column-oriented, document, key-value, and graph databases to make data available to downstream processes and data consumers.

  • Traditional relational data stores (RDBMS)
    • Organize data in tables, using columns and rows
    • Leverage SQL to manage and manipulate data
    • Enforce integrity through constraints on databases and tables
  • NoSQL stands for "not only SQL"
    • Allows for a wider range of data to be persisted and accessed
    • More flexible with regard to schema; captures data as it changes
    • Better scaling and performance
  • NoSQL data stores
    • Column-oriented - Snowflake
    • Document - Postgres
    • Key-value - Redis
    • Graph
import snowflake.connector

conn = snowflake.connector.connect(
  user="<user>",
  password="<password>",
  account="<account_identifier>",
  database="<database_name>",
  schema="<schema_name>",
  warehouse="<warehouse_name>"
)

# Build a query in a string (or multi-line string)
query = """
SELECT title, price
FROM books
WHERE price < 50.00;
"""

# Execute the query, print the results
results = conn.cursor().execute(query).fetch_pandas_all()
print(results)
import sqlalchemy

# Create a connection string, and an engine
connection_string = "postgresql+psycopg2://<user>:<password>@<host>:<port>/<database>"
db_engine = sqlalchemy.create_engine(connection_string)

import pandas as pd

# Build the query
query = """
SELECT
  books -> 'title' AS title,
  books -> 'price' AS price
FROM data_science_resources;
"""

# Execute the query
result = pd.read_sql(query, db_engine)
print(result)

Column-oriented Databases

Snowflake tables, fundamentals of micro-partitioning and data clustering to optimize query performance with query pruning. Common table expressions, materialized and non-materialized views to streamline analytics workflows. Semi-structured data with Snowflake's VARIANT types.

  • Row-oriented vs. column-oriented databases
    • Column-oriented databases are used for analytics workflows
    • Perform well when loading, updating, or deleting data in bulk
COPY INTO books
FROM 'file://data_science_books.csv'
FILE_FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
);

CREATE OR REPLACE TABLE premium_books AS
SELECT * FROM books
WHERE price > 50.00;
  • Micro-partitioning data with Snowflake
    • Creates smaller "chunks" of rows, stored in columnar format
    • Stores metadata about each partition
    • Query pruning to reduce the amount of data accessed
    • Efficient execution of DML (data manipulation language)
  • Data clustering with Snowflake
    • Organizing or grouping similar data points together
    • Automatically performed during data load
    • Decreasing data accessed during execution
    • Improved query performance
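As a sketch, a clustering key can also be declared explicitly in Snowflake with CLUSTER BY (the table and column names below are illustrative, not from the course):

```sql
-- Define a clustering key at table-creation time
CREATE TABLE books (
  title STRING,
  author STRING,
  price NUMBER(10, 2)
) CLUSTER BY (author);

-- Or add a clustering key to an existing table
ALTER TABLE books CLUSTER BY (price);
```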
  • Common table expressions (CTEs)
    • Named sub-queries/temporary tables, defined using the WITH keyword
    • Creates an object that can later be queried
    • Reduce the amount of data that is being queried and/or JOIN'ed
    • More modular, easier to troubleshoot
WITH premium_books AS (
  SELECT title, author, avg_reviews
  FROM books
  WHERE price > 25.00
)
SELECT author, MIN(avg_reviews) AS min_avg_reviews, MAX(avg_reviews) AS max_avg_reviews
FROM premium_books
GROUP BY author;
  • Views with Snowflake
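The materialized and non-materialized views mentioned above can be sketched as follows (view names are illustrative):

```sql
-- Non-materialized view: the query runs each time the view is selected
CREATE VIEW premium_books_view AS
SELECT title, author, avg_reviews
FROM books
WHERE price > 25.00;

-- Materialized view: results are precomputed and kept current by Snowflake
CREATE MATERIALIZED VIEW premium_books_mv AS
SELECT title, author, avg_reviews
FROM books
WHERE price > 25.00;
```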
  • Semi-structured data in Snowflake
    • OBJECT is similar to dictionaries in Python
    • ARRAY is similar to lists in Python
    • VARIANT type stores semi-structured data in a single column
    • Bracket and dot notations
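A minimal sketch of the bracket and dot notations against a VARIANT column (assuming a table raw_books with a VARIANT column v; these names are illustrative):

```sql
SELECT
  v:title::STRING        AS title,       -- colon notation for a top-level field
  v:author.name::STRING  AS author_name, -- dot notation for nested fields
  v['price']::NUMBER     AS price        -- bracket notation
FROM raw_books;
```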

Document Databases

Postgres JSON and the intricacies of semi-structured data. Extract and transform top-level and nested document data using built-in operators and functions.

  • JSON and JSONB in Postgres
  • Turning tabular data into JSON format
  • Extracting keys from JSON
  • Querying JSON data with Postgres
    • -> operator takes a key, returns field as JSON
    • ->> operator takes a key, returns field as text
  • Finding the type of data stored in JSON objects
INSERT INTO students (school, age, address, parent_meta) VALUES (
  'GP', 18, 'U', '{"guardian": "mother", ... "P2": "at_home"}}'
);

COPY students FROM 'students.csv' DELIMITER ',' CSV HEADER;

SELECT row_to_json(row(
  school, age, address
))
FROM students;

SELECT json_object_keys(parent_meta)
FROM students;

SELECT
  parent_meta -> 'guardian' AS guardian,
  parent_meta ->> 'status' AS status
FROM students;

SELECT
  parent_meta -> 'jobs' ->> 'P1' AS jobs_P1,
  parent_meta -> 'jobs' ->> 'P2' AS jobs_P2
FROM students;

SELECT
  parent_meta -> 'educations' ->> 0,
  parent_meta -> 'educations' ->> 1
FROM students;

SELECT json_typeof(parent_meta -> 'jobs')
FROM students;
  • Advanced Postgres JSON query techniques
    • #> and #>> operators
    • json_extract_path and json_extract_path_text functions
SELECT
  parent_meta #> '{jobs}' AS jobs,
  parent_meta #> '{jobs, P1}' AS jobs_P1,
  parent_meta #> '{jobs, income}' AS income,
  parent_meta #>> '{jobs, P2}' AS jobs_P2
FROM students;

SELECT
  json_extract_path(parent_meta, 'jobs') AS jobs,
  json_extract_path(parent_meta, 'jobs', 'P1') AS jobs_P1,
  json_extract_path(parent_meta, 'jobs', 'income') AS income,
  json_extract_path_text(parent_meta, 'jobs', 'P2') AS jobs_P2
FROM students;
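The distinction between -> (returns JSON) and ->> (returns text), and the path-extraction operators, can be mimicked in plain Python with the json module (a rough analogue for intuition only, not Postgres itself; the document below is illustrative):

```python
import json

# A parent_meta document shaped like the one used in the queries above
parent_meta = json.loads(
    '{"guardian": "mother", "jobs": {"P1": "teacher", "P2": "at_home"}}'
)

def arrow(doc, key):
    """Rough analogue of ->: returns the field as a JSON value (Python object)."""
    return doc[key]

def arrow_text(doc, key):
    """Rough analogue of ->>: returns the field serialized as text."""
    value = doc[key]
    return value if isinstance(value, str) else json.dumps(value)

def extract_path(doc, *keys):
    """Rough analogue of #> / json_extract_path: walk a sequence of keys."""
    for key in keys:
        doc = doc[key]
    return doc

print(arrow(parent_meta, "jobs"))               # still a JSON object (a dict)
print(arrow_text(parent_meta, "guardian"))      # plain text: mother
print(extract_path(parent_meta, "jobs", "P2"))  # at_home
```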

Key-value and Graph Databases

Fundamentals of key-value databases. Common use cases for key-value data; using Redis and Python to read and write data. Basic overview of graph databases.

  • A NoSQL database that uses simple key-value pairs (similar to dictionaries in Python) to store data
    • Can be searched by key, not the value
    • Store values of type string, hash, or lists
    • Performant for simple read and writes of data
    • Data is stored in memory, instead of on disk
  • Commonly used in web applications, for:
    • Session management
    • Caching frequently accessed data
    • Track user preferences and behavior
  • Key-value databases can act as sources for analytics platforms such as Snowflake and Redshift
# Import redis, make a connection
import redis

r = redis.Redis(...)

# Store key-value pairs
r.set("username", "JDoe")
r.set("age", 27)
r.set("username", "BSmith")  # overwrites the previous value of "username"

# Retrieve the key-value pair
username = r.get("username")

# Store a dictionary using .hset()
r.hset(
  "shopping_cart",
  mapping={
    "item_id": "1003",
    "quantity": 2,
    "price": 79.99
  }
)

# Retrieve the dictionary
r.hgetall("shopping_cart")
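The caching use case above can be illustrated without a running Redis server by a tiny in-memory analogue of the SETEX command (class and key names, and the TTL value, are illustrative):

```python
import time

class TinyCache:
    """A minimal key-value cache with per-key expiry, mimicking Redis SETEX."""

    def __init__(self):
        self._store = {}

    def setex(self, key, ttl_seconds, value):
        # Store the value together with the time at which it expires
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # evict the stale entry
            return None
        return value

cache = TinyCache()
cache.setex("session:42", 60, "JDoe")
print(cache.get("session:42"))  # JDoe (still fresh)
print(cache.get("missing"))     # None
```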
  • Graph databases
    • NoSQL data stores that persist data in a network of nodes and edges
    • Each node represents an entity
    • Each edge represents a relationship between those entities
  • Graph databases are commonly used for
    • Social networks
    • Recommendation engines
    • Fraud detection
    • Studying patterns and relationships
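The node-and-edge model described above can be sketched in plain Python as an adjacency structure (a toy model of the concept, not a real graph database; entity and relationship names are illustrative):

```python
# Nodes are entities; edges are labeled relationships between them
nodes = {"alice", "bob", "carol"}
edges = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("alice", "follows", "carol"),
]

def neighbors(node, relationship):
    """Return all entities that `node` points at via `relationship`."""
    return {dst for src, rel, dst in edges if src == node and rel == relationship}

def followers(node):
    """Traverse edges in the other direction: who follows `node`?"""
    return {src for src, rel, dst in edges if dst == node and rel == "follows"}

print(neighbors("alice", "follows"))  # the entities alice follows
print(followers("carol"))             # the entities that follow carol
```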