@tomfun
Last active December 6, 2023 14:09
Article Introduction: Navigating Database Technologies for Hierarchical and Social Network Data Management

Overview

In today's digital landscape, efficient data management is pivotal, especially when dealing with complex structures like trees and graphs. These structures, foundational in computer science, dictate how data is stored, accessed, and manipulated across various platforms, from social networks to organizational systems. This article provides a comprehensive exploration of tree and graph structures and their implementation in different database systems, assessing their effectiveness in contexts ranging from hierarchical data management in traditional databases to sophisticated relationship mapping in graph databases.

Tree and Graph: A Comparative Study of Data Structures

Tree Structure
  • Characteristics: Trees epitomize a hierarchical model, with each node (except the root) having one clear parent, forming a unique path from the root.
  • Use Cases: Ideal for organizational charts, file systems, and taxonomies.
  • Limitations: The single parent restriction and scalability issues can hinder handling complex datasets.
Graph Structure
  • Characteristics: Graphs represent a network model without hierarchical constraints, allowing multiple and varied node relationships.
  • Use Cases: Best suited for social networks, geographic information systems, and network analysis.
  • Limitations: The complexity and resource intensity of graph operations can be challenging.
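
In relational terms, the two models map to different schemas: a tree fits an adjacency list with a single parent_id column, while a graph needs a separate edge table. A minimal sketch (table and column names are illustrative, not from any particular system):

```sql
-- Tree: adjacency list, each node has at most one parent
CREATE TABLE tree_nodes (
    id        BIGINT PRIMARY KEY,
    parent_id BIGINT REFERENCES tree_nodes(id),  -- NULL for the root
    label     TEXT NOT NULL
);

-- Graph: nodes plus a separate edge table, any node may link to any other
CREATE TABLE graph_nodes (
    id    BIGINT PRIMARY KEY,
    label TEXT NOT NULL
);

CREATE TABLE graph_edges (
    from_id BIGINT NOT NULL REFERENCES graph_nodes(id),
    to_id   BIGINT NOT NULL REFERENCES graph_nodes(id),
    PRIMARY KEY (from_id, to_id)
);
```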

Graph Databases

Neo4j. Renowned for its robustness in cloud environments, Neo4j leads in graph database technology, but its community edition limits horizontal scaling.

Microsoft Azure Cosmos DB. Mirroring Neo4j's capabilities, it shines in cloud-based deployment, especially for mature projects.

OrientDB. Now community-driven, OrientDB supports cluster nodes efficiently with moderate resource requirements: 4 GB of RAM is enough to run a cluster node.

TerminusDB, TypeDB, and others. These databases bring unique features to the table, with considerations like community support and licensing playing a crucial role in selection, though they are far less widely adopted.

PostgreSQL for Comment Handling in Social Networks

Table Structure

To manage comments in a social network context, we can use a table structure in PostgreSQL like this:

CREATE TABLE comments (
    id UUID PRIMARY KEY,
    post_id UUID NOT NULL REFERENCES posts(id),
    user_id UUID NOT NULL REFERENCES users(id),
    parent_comment_id UUID REFERENCES comments(id),
    content TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP
);
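
The queries below filter by post_id and walk parent_comment_id, so supporting indexes help (index names are illustrative, not part of the original schema):

```sql
-- speeds up "all comments of a post" with keyset pagination
CREATE INDEX idx_comments_post_created ON comments (post_id, created_at DESC);

-- speeds up walking up and down the reply chain
CREATE INDEX idx_comments_parent ON comments (parent_comment_id);
```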

Retrieving Comments and Their Parents

We can retrieve a comment along with all its parent comments using recursive queries. Here's an example that limits the recursion depth to prevent infinite loops (_id and _max_recursion are placeholders for the starting comment id and the depth limit):

WITH RECURSIVE recursive_cte AS (
    SELECT
        id,
        parent_comment_id,
        1 AS level
    FROM
        comments
    WHERE
        id = _id

    UNION ALL

    SELECT
        c.id,
        c.parent_comment_id,
        rec.level + 1 AS level
    FROM
        recursive_cte rec
        INNER JOIN comments c ON rec.parent_comment_id = c.id
    WHERE
        rec.level < _max_recursion
)
SELECT r.id, r.parent_comment_id, r.level
FROM recursive_cte r;
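
One way to package the query above so the placeholders become real parameters is a SQL function (a sketch; the function name and defaults are illustrative):

```sql
CREATE OR REPLACE FUNCTION comment_ancestors(_id uuid, _max_recursion int DEFAULT 100)
RETURNS TABLE (id uuid, parent_comment_id uuid, level int)
LANGUAGE sql STABLE AS
$$
WITH RECURSIVE recursive_cte AS (
    -- anchor: the starting comment itself
    SELECT c.id, c.parent_comment_id, 1 AS level
    FROM comments c
    WHERE c.id = _id

    UNION ALL

    -- step: follow parent_comment_id upward, bounded by the depth limit
    SELECT c.id, c.parent_comment_id, rec.level + 1
    FROM recursive_cte rec
    JOIN comments c ON rec.parent_comment_id = c.id
    WHERE rec.level < _max_recursion
)
SELECT * FROM recursive_cte;
$$;

-- usage:
-- SELECT * FROM comment_ancestors(some_comment_id, 50);
```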

Keyset Pagination for Comments

To load comments page by page:

SELECT * FROM comments
WHERE post_id = $post_id AND created_at < $last_created_at
ORDER BY created_at DESC
LIMIT $page_size; -- $page_size: number of comments per page

This approach uses the created_at field to paginate through comments, effectively implementing keyset pagination.
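
Note that created_at alone is not unique: two comments with the same timestamp can be skipped or repeated across pages. A common refinement is a row-value comparison with id as a tie-breaker (a sketch; the $-prefixed names are client-supplied parameters, as above):

```sql
SELECT * FROM comments
WHERE post_id = $post_id
  AND (created_at, id) < ($last_created_at, $last_id)
ORDER BY created_at DESC, id DESC
LIMIT $page_size;
```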

Partitioning Strategies

  • Range Partitioning: Comments can be partitioned based on the creation date (e.g., by month or year).
  • List Partitioning: Partitioning can be done based on post_id, where each partition contains comments for a subset of posts.
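
Range partitioning by creation date could look like the sketch below. Note that in PostgreSQL the partition key must be part of the primary key, so the schema differs slightly from the comments table defined above (table names are illustrative):

```sql
CREATE TABLE comments_partitioned (
    id UUID NOT NULL,
    post_id UUID NOT NULL,
    user_id UUID NOT NULL,
    parent_comment_id UUID,
    content TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP,
    PRIMARY KEY (id, created_at)          -- partition key must be in the PK
) PARTITION BY RANGE (created_at);

-- one partition per month
CREATE TABLE comments_2023_12 PARTITION OF comments_partitioned
    FOR VALUES FROM ('2023-12-01') TO ('2024-01-01');
```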

Materialized View for Popular Comments

For identifying popular comments, a materialized view can be used:

CREATE MATERIALIZED VIEW popular_comments AS
SELECT c.*, COUNT(l.id) AS likes_count
FROM comments c
LEFT JOIN likes l ON l.comment_id = c.id
GROUP BY c.id
ORDER BY likes_count DESC;
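
A materialized view is a snapshot and has to be refreshed; with a unique index in place it can be refreshed without blocking readers (a sketch; the index name is illustrative):

```sql
-- required for REFRESH ... CONCURRENTLY
CREATE UNIQUE INDEX idx_popular_comments_id ON popular_comments (id);

-- run periodically, e.g. from a cron job:
REFRESH MATERIALIZED VIEW CONCURRENTLY popular_comments;
```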

Scaling Strategies

Vertical Scaling: Enhancing server hardware capabilities.

Horizontal Scaling: Implementing read replicas, data sharding, and using partitioning.

Optimization Techniques: Query optimization, efficient schema design, and appropriate indexing.

Experiment with PostgreSQL ltree

To assess PostgreSQL's capabilities in handling and querying hierarchical data, a comprehensive test was conducted. This test involved creating a large dataset using the ltree extension in PostgreSQL, followed by executing a series of queries to evaluate performance. Four key scripts were used in this experiment, each serving a distinct purpose.

Script 1: Data Generation

The first script focused on generating a substantial volume of hierarchical data. It created a table big_tree_test with an ltree[] column to represent hierarchical paths. Key points of this script include:

  • Table and Index Creation: Setting up a table with ltree[] and creating a B-tree index on this column.
  • Data Insertion: Inserting a large volume of rows to simulate a complex hierarchical structure.

Performance Metrics:

  • Rows Generated: 21,510,000
  • Time Taken: Approximately 45 minutes
  • Memory Usage: About 5.859 GiB
  • CPU Utilization: Utilized only one core, indicating an area for potential optimization.

A shortened version of the script (_root_rows, _rows_per_group and the loop variable i are defined in the full version below):

create or replace function pg_temp.random_int(_min int, _max int) returns int language sql as
'select floor(random()* (_max - _min + 1) + _min)::int;';

insert into big_tree_test
select
    t1.id as id,
    null as parent_id,
    '#' || t1.id || ' root' || t1.r
from (
     select id, pg_temp.random_int(1, _root_rows) r
     from generate_series(1, _root_rows) id
 ) t1;
-- after that we can create comments with `parent_id` from generated ones

insert into big_tree_test
select
    t1.id as id,
    p as parent_id,
    ' l' || i || ' ' || t1.id || ' sub' || t1.r
from (
     select _root_rows + _rows_per_group * 1 + id as id,
            pg_temp.random_int(1, _root_rows + _rows_per_group * 1) p, -- random parent_id
            pg_temp.random_int(1, _rows_per_group) r -- just random number
     from generate_series(1, _rows_per_group) id
 ) t1;

The full SQL, including generation of the ltree path:

do
language plpgsql
$$
    declare
        _root_rows int = 1500000;
        _rows_per_group int = 200;
        _max_group int = 100000;
        _very_deep int = 10000;
    begin
        drop table if exists big_tree_test;

        CREATE TABLE big_tree_test (
                                       id serial PRIMARY KEY,
                                       parent_id int NULL,
                                       content text NOT NULL,
                                       permissions ltree[]
        );

        create or replace function pg_temp.random_int(_min int, _max int) returns int language sql as
        'select floor(random()* (_max - _min + 1) + _min)::int;';

        insert into big_tree_test
        select
                t1.id as id,
                null as parent_id,
                '#' || t1.id || ' root' || t1.r,
                cast((concat('{"n', t1.id,'"}')) as ltree[]) as permissions
        from (
                 select id, pg_temp.random_int(1, _root_rows) r
                 from generate_series(1, _root_rows) id
             ) t1;
        -- the most readable is to use already created *page* as parents
        -- after that use 2 pages, 3, ... else we would have null as parent_id
        FOR i IN 0..(_max_group - 1) LOOP
                insert into big_tree_test
                select
                        t1.id as id,
                        p as parent_id,
                        ' l' || i || ' ' || t1.id || ' sub' || t1.r,
                        cast((
                            SELECT
                                array_append('{}'::ltree[], concat(ss.permissions[1], '.s_', i, '_', t1.id)::ltree)
                            FROM big_tree_test ss
                            WHERE id = t1.p
                        ) as ltree[]) AS permissions
                from (
                         select _root_rows + _rows_per_group * i + id as id,
                                pg_temp.random_int(1, _root_rows + _rows_per_group * i) p,
                                pg_temp.random_int(1, _rows_per_group) r
                         from generate_series(1, _rows_per_group) id
                     ) t1;
        END LOOP;
        -- to create very deep comments we have to use parent_id of the last created
        FOR i IN 1..(_very_deep) LOOP
                insert into big_tree_test
                select
                        t1.id as id,
                        p as parent_id,
                        ' dd' || i || ' ' || t1.id || ' sub' || t1.r,
                        cast((
                            SELECT
                                array_append('{}'::ltree[], concat(ss.permissions[1], '.s_', i, '_', t1.id)::ltree)
                            FROM big_tree_test ss
                            WHERE id = t1.p
                        ) as ltree[]) AS permissions
                from (
                         select _root_rows + _rows_per_group * _max_group + i as id,
                                pg_temp.random_int(_root_rows + _rows_per_group * _max_group + i * 9 / 10, _root_rows + _rows_per_group * _max_group + i - 1) p,
                                pg_temp.random_int(1, _rows_per_group) r
                     ) t1;
        END LOOP;

    end;
$$;

Indexes

-- a poor fit: the B-tree index is usable only for whole-array equality checks
CREATE INDEX idx_big_tree_test_permissions ON big_tree_test USING btree(permissions);
-- the right choice: GiST supports the ltree pattern and ancestry operators
CREATE INDEX idx_big_tree_gist_permissions ON big_tree_test USING gist(permissions);

Size of data with indexes: 9.9 GB
Without the B-tree index: 6.4 GB
Without the GiST index as well: 4.9 GB

Creating the GiST index:

  • Time: 6m 37s
  • CPU: 1 core
  • RAM: up to 2.2 GB

Script 2: Full Scan Query

This script performed a full table scan using the nlevel function on the permissions column.

Performance Metrics:

  • Execution Time: 1 minute and 11 seconds
  • This time reflects the duration for a full scan without the benefit of an optimized index for the operation.

SELECT
    *,
    nlevel(t1.permissions[1]) as level
FROM big_tree_test t1
ORDER BY level DESC;

Here and everywhere in the article: psql (16.1 (Debian 16.1-1.pgdg120+1))

 Gather Merge  (cost=3422605.51..5514001.53 rows=17925000 width=136)
   Workers Planned: 2
   ->  Sort  (cost=3421605.49..3444011.74 rows=8962500 width=136)
         Sort Key: (nlevel(permissions[1])) DESC
         ->  Parallel Seq Scan on big_tree_test t1  (cost=0.00..548625.25 rows=8962500 width=136)
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
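
If depth were queried often, an expression index on nlevel could avoid the full scan for filtered queries (an untested sketch, not part of the original experiment; the unsorted ORDER BY above would still need to visit every row):

```sql
-- index the depth of the first path in permissions
CREATE INDEX idx_big_tree_test_level ON big_tree_test ((nlevel(permissions[1])));

-- lets the planner use the index for depth filters such as:
-- SELECT * FROM big_tree_test WHERE nlevel(permissions[1]) > 5;
```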

Script 3: Select Query with Pattern Matching

The third script executed a query using the ~ operator for pattern matching against the permissions column.

Performance Metrics:

  • Rows Fetched: 10,000
  • First Run Time: 19 seconds
  • Subsequent Run Time: 5 seconds
  • The improved performance on the second run demonstrates PostgreSQL's efficiency in caching and query optimization.

SELECT
    COUNT(*)
FROM big_tree_test t1
WHERE '*.s_16340_4768122.*' ~ t1.permissions;

 Aggregate  (cost=8265.99..8266.00 rows=1 width=8)
   ->  Bitmap Heap Scan on big_tree_test t1  (cost=101.09..8260.61 rows=2151 width=0)
         Recheck Cond: ('*.s_16340_4768122.*'::lquery ~ permissions)
         ->  Bitmap Index Scan on idx_big_tree_gist_permissions  (cost=0.00..100.55 rows=2151 width=0)
               Index Cond: (permissions ~ '*.s_16340_4768122.*'::lquery)

https://explain.dalibo.com/plan/c6bg7g1f2cbacgd3

  1. Recheck Cond ('*.s_16340_4768122.*'::lquery ~ t1.permissions): This indicates that during the bitmap heap scan, the database needed to recheck the rows fetched by the index scan against the condition '*.s_16340_4768122.*'::lquery ~ t1.permissions. This recheck is necessary because the bitmap index scan is lossy: it might fetch rows that seem to match based on the index but actually do not meet the condition upon rechecking.

  2. Rows Removed by Index Recheck (9,602,264): This is the number of rows that were initially identified by the index scan as potential matches but were later discarded upon rechecking the condition. It shows that the initial index scan fetched a lot of rows that did not actually meet the query condition.

https://wiki.postgresql.org/wiki/Bitmap_Indexes#General

Script 4: Select Query with Tree Operator

The final script used the @> operator to fetch specific hierarchical data.

Performance Metrics:

  • Rows Fetched: 10,000
  • Execution Time: Approximately 200 milliseconds
  • This query's speed showcases the effectiveness of GiST indexes in handling ltree operations.

SELECT
    *
FROM big_tree_test t1
WHERE 'n1258534.s_13709_4241826.s_16340_4768122.s_99999_21500000' @> t1.permissions;

 Bitmap Heap Scan on big_tree_test t1  (cost=101.09..8260.61 rows=2151 width=132)
   Recheck Cond: ('n1258534.s_13709_4241826.s_16340_4768122.s_99999_21500000'::ltree @> permissions)
   ->  Bitmap Index Scan on idx_big_tree_gist_permissions  (cost=0.00..100.55 rows=2151 width=0)
         Index Cond: (permissions <@ 'n1258534.s_13709_4241826.s_16340_4768122.s_99999_21500000'::ltree)

https://explain.dalibo.com/plan/5g3a7g5eb172gb04

@tomfun commented Dec 5, 2023


Practical ltree:
https://www.cybertec-postgresql.com/en/postgresql-speeding-up-recursive-queries-and-hierarchic-data/

A slightly different task, creating another structure (a graph with loops):
vb-consulting/blog#4
vb-consulting/blog#5

An idea for the future :) Generating test workloads on Neo4j:
https://medium.com/neo4j/generating-test-workloads-on-neo4j-97acc71298e6
https://guides.neo4j.com/sandbox/twitter/index.html
https://db-engines.com/en/article/Graph+DBMS
https://en.wikipedia.org/wiki/Graph_database

Instead of a conclusion:
https://stackoverflow.com/questions/13046442/comparison-of-relational-databases-and-graph-databases

  • A relational database is much faster when operating on huge numbers of records. In a graph database, each record has to be examined individually during a query in order to determine the structure of the data, while this is known ahead of time in a relational database.
  • Relational databases use less storage space, because they don't have to store all of those relationships.

Storing all of the relationships at the individual-record level only makes sense if there is going to be a lot of variation in the relationships; otherwise you are just duplicating the same things over and over. This means that graph databases are well-suited to irregular, complex structures. But in the real world, most databases require regular, relatively simple structures. This is why relational databases predominate.

@tomfun commented Dec 6, 2023

Use Cases in "Ero-like" Comment System

  1. Reading Comments on Posts (User's Own and Others'):

    • Challenge: Efficiently retrieve all comments under a post, potentially in large numbers.
    • Solution: Use Common Table Expressions (CTEs) with recursive queries to fetch nested comments. Implement pagination to manage large volumes of comments. Utilize indexes on columns like post_id and parent_comment_id to speed up retrieval.
  2. Viewing Replies to a User's Comment:

    • Challenge: Fetch not only direct replies but also nested replies at varying levels.
    • Solution: Recursive CTEs enable fetching replies and their nested replies. Partitioning the comments table based on post_id or creation date can improve performance for large datasets.
  3. Displaying Popular Comments:

    • Challenge: Identifying and prioritizing comments based on popularity metrics like likes or replies.
    • Solution: Use a combination of aggregate functions and window functions to rank comments based on popularity metrics.
  4. Searching Text in Comments:

    • Challenge: Implement a full-text search within comments.
    • Solution: Leverage PostgreSQL's full-text search capabilities, utilizing GIN (Generalized Inverted Index) to index text content for efficient searching.
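
For case 4, the GIN-backed full-text search could be set up roughly like this (a sketch; the english text-search configuration and index name are assumptions):

```sql
-- index the tsvector of the comment text for fast full-text lookups
CREATE INDEX idx_comments_content_fts
    ON comments USING gin (to_tsvector('english', content));

-- search; the same to_tsvector expression lets the planner use the index
SELECT * FROM comments
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'graph databases');
```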
