@robson-koji
Last active June 14, 2023 14:11
PL/SQL - PL/PgSQL

PL/pgSQL

PL/pgSQL is a loadable procedural language for the PostgreSQL database system.

ETL

Extract Transform Load

We can build an entire ETL process in PL/pgSQL to handle data already stored in our databases. For this scenario it is among the fastest and most reliable approaches. It's fast because the data never leaves the database engine: as long as the related tables are indexed, we only have to write and run code in the engine to retrieve the information. Even during the transformation and loading steps, the engine handles pointers and indexes very efficiently.

Since we are working with data inside the database, we can operate on large datasets without the costs of transferring data over the network or holding it in client memory. Instead of processing rows one at a time, we can perform set operations over large datasets. To do this, we will use SQL functions.
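As a minimal sketch of this set-based style (the table and column names are hypothetical, loosely inspired by the use case below), a single statement processes the whole dataset inside the engine instead of fetching rows to the client and inserting them one at a time:

```sql
-- Set-based load: one statement aggregates and writes the whole dataset
-- inside the engine, instead of a client-side loop of per-row inserts.
INSERT INTO species_summary (species_id, image_count)
SELECT species_id, count(*)
FROM annotations
GROUP BY species_id;
```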

User-Defined Functions (UDFs)

PL/pgSQL allows you to write procedural code directly in the database, enabling you to create complex stored procedures, functions, and triggers. It supports variables, control structures (such as loops and conditionals), exception handling, and more. This procedural approach enhances the expressiveness and flexibility of your database logic.
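To illustrate those building blocks, here is a minimal PL/pgSQL function showing a variable declaration, a conditional, and exception handling. The table and column names are hypothetical:

```sql
-- Hypothetical example: count images for a given camera.
CREATE OR REPLACE FUNCTION count_images(p_camera_id integer)
RETURNS integer AS $$
DECLARE
    v_total integer := 0;                    -- variable declaration
BEGIN
    SELECT count(*) INTO v_total
    FROM images
    WHERE camera_id = p_camera_id;

    IF v_total = 0 THEN                      -- conditional
        RAISE NOTICE 'No images for camera %', p_camera_id;
    END IF;

    RETURN v_total;
EXCEPTION
    WHEN undefined_table THEN                -- exception handling
        RAISE NOTICE 'Table "images" does not exist';
        RETURN -1;
END;
$$ LANGUAGE plpgsql;
```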

PL/pgSQL code is executed within the PostgreSQL server, which eliminates the overhead of network communication between the application and the database. This results in faster execution times and improved performance for database operations. PL/pgSQL also leverages the optimizer of PostgreSQL, allowing it to optimize query plans and provide efficient execution.

PL/pgSQL supports the creation of reusable code modules in the form of functions and procedures. This promotes code modularization and enables code reuse across different parts of your application or within the database itself. This modular approach enhances code maintainability, readability, and reduces duplication.

ETL Architecture

An abstract overview of an ETL using PL/pgSQL and User-Defined Functions (UDFs)

Extract

Individual functions access the database directly to extract data from specific tables. Some transformation already happens here, mainly through JOIN, GROUP BY, and WHERE filters.

Since we are using UDFs, the functions materialize their results as temporary tables for use by the next layer.
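One way to realize an Extract-layer UDF is a function that joins and filters the source tables and materializes the result into a temporary table. This is a sketch with hypothetical table and column names:

```sql
-- Hypothetical Extract-layer function: joins source tables and materializes
-- the result into a temporary table for the Transform layer to consume.
CREATE OR REPLACE FUNCTION extract_species()
RETURNS void AS $$
BEGIN
    DROP TABLE IF EXISTS tmp_species;
    CREATE TEMP TABLE tmp_species AS
    SELECT s.id AS species_id,
           s.name,
           count(a.id) AS annotation_count
    FROM species s
    LEFT JOIN annotations a ON a.species_id = s.id
    WHERE s.active                           -- WHERE filter
    GROUP BY s.id, s.name;                   -- GROUP BY aggregation
END;
$$ LANGUAGE plpgsql;
```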

Transform

This layer controls the flow of data and applies more complex transformations to the data previously retrieved into the temporary tables.

In this layer, the SQL functions don't have direct access to database tables; their main responsibility is to make the data suitable for reuse. This is an important concept that will become clear in the next layer, where data is loaded.
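A Transform-layer function, under that constraint, reads only the temporary tables produced by the Extract layer, never the base tables. A sketch, again with hypothetical names:

```sql
-- Hypothetical Transform-layer function: reads only the temporary table
-- produced by the Extract layer and derives a reusable, enriched dataset.
CREATE OR REPLACE FUNCTION transform_species()
RETURNS void AS $$
BEGIN
    DROP TABLE IF EXISTS tmp_species_ranked;
    CREATE TEMP TABLE tmp_species_ranked AS
    SELECT species_id,
           name,
           annotation_count,
           rank() OVER (ORDER BY annotation_count DESC) AS popularity_rank
    FROM tmp_species;                        -- temp table, not a base table
END;
$$ LANGUAGE plpgsql;
```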

Load

Wrapper functions to be used by a front-end application: a collection of SQL functions that can be embedded in applications to serve specific purposes.

This layer assembles the functions provided by the previous layer. The functions provided here are SQL APIs for specific purposes that can be reused in different front-end technologies, but they are not meant to be reused for different business needs. The output information, the dataset, should be in a usable state.
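A Load-layer wrapper can then orchestrate the lower layers and return a ready-to-consume result set. The sketch below assumes hypothetical Extract/Transform functions and temp-table columns (`integer` id, `text` name):

```sql
-- Hypothetical Load-layer wrapper: calls the Extract and Transform layers,
-- then returns the final dataset as a result set for a front-end client.
CREATE OR REPLACE FUNCTION api_species_report()
RETURNS TABLE (species_id integer, name text, popularity_rank bigint) AS $$
BEGIN
    PERFORM extract_species();               -- Extract layer
    PERFORM transform_species();             -- Transform layer
    RETURN QUERY
    SELECT r.species_id, r.name, r.popularity_rank
    FROM tmp_species_ranked r
    ORDER BY r.popularity_rank;
END;
$$ LANGUAGE plpgsql;
```

Because the wrapper only assembles lower-layer functions, a different client (say, a file export) can reuse the same Extract/Transform code behind a different wrapper.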

USE CASE

I will illustrate the usage of PL/pgSQL with an application I developed for WildePod, a wildlife research project: https://wildepod.org

We have a series of Django apps that encompass all the features of the system. Some apps are related to administrative tasks, while others map directly to database entities and embrace the core business of the system: annotating images collected in the natural environment to identify animal species. You can see an overview of the schema << LINK TO THE SCHEMA GIST >>

Extract / Transform

The Extract and Transform stages of the ETL run for each database entity / Django app, each in a dedicated folder. Each folder contains two files: extract.sql and transform.sql.

extract.sql

The Extract layer is directly related to the schema/model.

transform.sql

The Transform layer serves as a middleware to handle data provided by functions and temporary tables from the Extract Layer. The output of the Transform layer can be reused by different loaders, each serving a specific purpose.


Load

The Load stage of the ETL is encapsulated in a folder called API. This layer serves the clients, and we can have as many clients as needed. It is closely related to applications in general, for instance:

  • Export Files
  • Web client
  • Mobile client (app/web)
  • Data exploration

Folder structure use case

+ ETL + 
      |
      +-- entities + 
      |            |
      |            +-- annotator +
      |            |             +-- extract.sql
      |            |
      |            +-- bounding_box +
      |            |                +-- extract.sql
      |            |
      |            +-- category +
      |            |            +-- extract.sql
      |            |            +-- transform.sql
      |            |
      |            +-- image +
      |            |         +-- extract.sql
      |            |
      |            +-- species +
      |                        +-- extract.sql
      |                        +-- transform.sql
      |                        +-- test.sql
      |             
      +-- api + 
              |
              +-- export +
                         +-- portal.sql

Note that an entity does not provide a transform.sql when one is not needed.

Tests can be written for each entity as unit tests, an integration test can be performed on the API folder, or more specific tests can target entity relationships.

Code Reusability

This is a scalable structure in which the effort to add new functionality grows sublinearly.

When you write code for the Load layer, you can reuse code already written for the Extract and Transform layers, so expanding the ETL structure requires less and less new code over time.

When the Extract and Transform layers are designed to be reusable and modular, the effort required to write code for the Load layer does not increase proportionally as the ETL expands to handle new structures or datasets, because the existing extraction and transformation code can be leveraged for new load operations.

Basic Usage

  1. Connect to the database instance.
  2. Store functions in the database instance.
  3. Test functions.
  4. Use the functions in your code as needed (Python in my case).
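The steps above can be sketched in a psql session. The file paths follow the folder structure shown earlier, and the function name is hypothetical:

```sql
-- 1. Connect:  psql -h <host> -U <user> -d <database>
-- 2. Store the function definitions in the database instance:
--      \i etl/entities/species/extract.sql
--      \i etl/api/export/portal.sql
-- 3./4. Test and use the stored Load-layer function from any client:
SELECT * FROM portal_export();   -- hypothetical API function name
```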

SQL code example

You can check my SQL code in the Git repository.
