Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
This gist is a simple guide of effective storing 1.4 billion email and password pairs in PostgreSQL for fast searching and reducing required storage space

What would you need:

Hardware requirements

The following configuration ensures normal operation of the database

requirements
CPU count 8
Memory 32G
Disk storage space > 75G

Postgres 9.6.6 Docker container creation

Dockerfile for Postgres 9.6.6 (postgres/Dockerfile)

FROM postgres:9.6.6

 USER root

 # Install cstore_fdw dependencies
 RUN apt-get update && apt-get install -y \
   apt-utils \
   build-essential \
   git \
   protobuf-c-compiler \
   libprotobuf-c0-dev \
   postgresql-server-dev-9.6

 # Install cstore_fdw
 WORKDIR /opt/
 RUN git clone https://github.com/citusdata/cstore_fdw
 WORKDIR /opt/cstore_fdw

 RUN PATH=/usr/local/pgsql/bin/:$PATH make ; \
   PATH=/usr/local/pgsql/bin/:$PATH make install

Configuration file for Postgres 9.6.6 (postgres/postgresql.conf)

listen_addresses = '0.0.0.0'
shared_preload_libraries = 'cstore_fdw'
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 4GB
maintenance_work_mem = 2GB
min_wal_size = 4GB
max_wal_size = 8GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 4

Docker Compose file (docker-compose.yml)

version: "2.1"
 
 services:
   postgres:
     build: ./postgres/
     restart: always
     environment:
       - POSTGRES_USER=postgres
       - POSTGRES_PASSWORD=seacret_password
     volumes:
       - ./postgres/docker-persistence:/var/lib/postgresql/data
       - ./postgres/postgresql.conf:/var/lib/postgresql/data/postgresql.conf
       - ../accounts-leaks/BreachCompilation/data:/opt/BreachCompilation
     ports:
       - 127.0.0.1:5432:5432
 
 networks:
   default:

Now you can run Postgres 9.6.6 Docker container by command:

docker-compose up

Database creation

Create database accounts by command in Postgres 9.6.6 Docker container console from user postgres:

createdb accounts

Then you should activate cstore_fdw extention by command:

echo "CREATE EXTENSION cstore_fdw ; CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;" | psql accounts

Now create table breachcompilation in database accounts and activate cstore_fdw extention by command:

echo "CREATE FOREIGN TABLE breachcompilation ( 
    email           text, 
    password        text 

) 

SERVER cstore_server 
OPTIONS(compression 'pglz');"  | psql accounts

Import Breach Compilation files from container's filesystem to table breachcompilation by command:

for file in $(find /opt/BreachCompilation -type f)
do
    echo "COPY breachcompilation(email, password) FROM PROGRAM 'cat ${file} | grep -E -o \"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}+[\:\;][A-Za-z0-9!@#$%^&*()_+]+$\" | sed \"s/\;\|\:/,/\" | sed \"s/.*,/\L&/\" ' WITH (format csv, delimiter E',');" | psql accounts
done

Breach Compilation is a collection of several data sources, some of which have invalid data. Therefore, during import, you must filter the input stream with some regular expression.

Done!

Enjoy, but remember some limitations caused by cstore_fdw extension: "don't support updating table using DELETE, and UPDATE commands. We also don't support single row inserts."

Useful links:

@metalerk

This comment has been minimized.

Copy link

metalerk commented Feb 18, 2019

Did you find a useful login?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.