@omarish
Last active December 8, 2019 02:44
Quick primer on loading a large CSV file into a Vertica database.

Load a Large CSV into Vertica

Here's an efficient way to load a dataset into Vertica: split it into multiple pieces, then parallelize the load across several connections.

Note that this only makes sense if your Vertica cluster is a single node. If it's running multiple nodes, there are more efficient ways of doing this.

For this example, the large CSV file will be called large_file.csv. If your file is under 1GB, it probably makes sense to load it using a single COPY command.
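If you do take the single-command route, the whole load is one statement from vsql; the table name here is a placeholder:

```
COPY my_table FROM LOCAL 'large_file.csv' DELIMITER ',' DIRECT;
```

`FROM LOCAL` streams the file from the client machine; if the file already sits on the Vertica node itself, plain `FROM` avoids the extra network hop.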

1. Split the File

Split your CSV file into multiple parts, like this:

$ mkdir -p /tmp/parts/
$ split -C 500m large_file.csv /tmp/parts/large_file.csv- 

This splits the CSV file into smaller files of at most 500MB each. Because `-C` splits on line boundaries, no row is ever cut in half.
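You can sanity-check that splitting preserves every row on a throwaway file first (the sample paths below are arbitrary):

```shell
# Build a 1,000-row sample CSV, split it, and confirm the total row count survives.
seq 1 1000 | sed 's/^/value,/' > /tmp/sample.csv
mkdir -p /tmp/sample_parts
rm -f /tmp/sample_parts/*
split -C 4k /tmp/sample.csv /tmp/sample_parts/sample.csv-
cat /tmp/sample_parts/* | wc -l   # row total across all parts matches the original
```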

2. Load with Python

vertica-python bindings

Install Uber's vertica-python bindings (Everybody loves Uber except for when they do surge pricing):

$ pip install vertica-python

Create the Loading Script

Create a file called loader.py with these contents:

import os
import queue
import threading

from vertica_python import connect

conn_dict = {
    'host': '127.0.0.1',
    'port': 5433,
    'user': '<< username >>',
    'password': '<< password >>',
    'database': '<< dbname >>',
}


class ThreadLoader(threading.Thread):
    def __init__(self, work_queue):
        threading.Thread.__init__(self)
        self.queue = work_queue
        # Daemon threads let the program exit once the queue is drained;
        # otherwise the workers block forever on queue.get().
        self.daemon = True

    def run(self):
        # Each thread gets its own connection; connections are not thread-safe.
        conn = connect(**conn_dict)
        cursor = conn.cursor()
        while True:
            filename = self.queue.get()
            print("Loading %s..." % filename)
            cursor.execute("COPY << table_name >> FROM '%s' DELIMITER ',' DIRECT" % filename)
            print(cursor.fetchall())
            self.queue.task_done()


if __name__ == '__main__':
    dirname = '/tmp/parts/'  # Change this to the directory where the split files reside.
    work_queue = queue.Queue()

    # Increase or decrease the range below to change the number of loader threads.
    for i in range(8):
        t = ThreadLoader(work_queue)
        t.start()

    for filepath in os.listdir(dirname):
        print(filepath)
        work_queue.put(os.path.join(dirname, filepath))

    # Block until every queued file has been loaded.
    work_queue.join()

You'll need to replace the items in << double angle brackets >> with your own values.
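The worker-queue pattern in the script can be exercised without a database. A minimal sketch, with a thread-safe list standing in for the COPY call (all names here are illustrative):

```python
import queue
import threading

results = []
lock = threading.Lock()

def worker(q):
    # Pull items off the shared queue until the main thread exits.
    while True:
        item = q.get()
        with lock:
            results.append(item.upper())  # stand-in for the COPY statement
        q.task_done()

q = queue.Queue()
for _ in range(4):
    # Daemon threads don't keep the interpreter alive after q.join() returns.
    t = threading.Thread(target=worker, args=(q,), daemon=True)
    t.start()

for name in ["part-aa", "part-ab", "part-ac"]:
    q.put(name)

q.join()  # blocks until every item has been marked task_done()
print(sorted(results))  # ['PART-AA', 'PART-AB', 'PART-AC']
```

`q.join()` is what makes the pattern safe: the main thread only proceeds once every queued item has been processed, regardless of how the work was spread across threads.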

Run It!

$ python loader.py

That's all it took for me - leave a comment if you need a hand getting this to work and I'll be happy to help.
