@omarish
Last active December 8, 2019 02:44
Quick primer on loading a large CSV file into a Vertica database.

Load a Large CSV into Vertica

Here's an efficient way to load a dataset into Vertica: split it into multiple pieces, then parallelize the load across several connections.

Note that this only makes sense if your Vertica cluster is a single node. If it's running multiple nodes, there are more efficient ways of doing this.

For this example, the large CSV file will be called large_file.csv. If your file is under 1GB, it probably makes sense to load it using a single COPY command.
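If you do take the single-command route, the whole load is one statement from vsql; the table name here is a placeholder:

```
COPY my_table FROM LOCAL 'large_file.csv' DELIMITER ',' DIRECT;
```

`FROM LOCAL` streams the file from the client machine; if the file already sits on the Vertica node itself, plain `FROM` avoids the extra network hop.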

1. Split the File

Split your CSV file into multiple parts, like this:

$ mkdir -p /tmp/parts/
$ split -C 500m large_file.csv /tmp/parts/large_file.csv- 

This splits the CSV file into smaller files of at most 500MB each. Because `-C` splits on line boundaries, no row is ever cut in half.
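You can sanity-check that splitting preserves every row on a throwaway file first (the sample paths below are arbitrary):

```shell
# Build a 1,000-row sample CSV, split it, and confirm the total row count survives.
seq 1 1000 | sed 's/^/value,/' > /tmp/sample.csv
mkdir -p /tmp/sample_parts
rm -f /tmp/sample_parts/*
split -C 4k /tmp/sample.csv /tmp/sample_parts/sample.csv-
cat /tmp/sample_parts/* | wc -l   # row total across all parts matches the original
```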

2. Load with Python

vertica-python bindings

Install Uber's vertica-python bindings (Everybody loves Uber except for when they do surge pricing):

$ pip install vertica-python

Create the Loading Script

Create a file called loader.py with these contents:

import os
import queue
import threading

from vertica_python import connect

conn_dict = {
    'host': '127.0.0.1',
    'port': 5433,
    'user': '<< username >>',
    'password': '<< password >>',
    'database': '<< dbname >>',
}


class ThreadLoader(threading.Thread):
    def __init__(self, work_queue):
        threading.Thread.__init__(self)
        self.queue = work_queue
        # Daemon threads let the program exit once the queue is drained;
        # otherwise the workers block forever on queue.get().
        self.daemon = True

    def run(self):
        # Each thread gets its own connection; connections are not thread-safe.
        conn = connect(**conn_dict)
        cursor = conn.cursor()
        while True:
            filename = self.queue.get()
            print("Loading %s..." % filename)
            cursor.execute("COPY << table_name >> FROM '%s' DELIMITER ',' DIRECT" % filename)
            print(cursor.fetchall())
            self.queue.task_done()


if __name__ == '__main__':
    dirname = '/tmp/parts/'  # Change this to the directory where the split files reside.
    work_queue = queue.Queue()

    # Increase or decrease the range below to change the number of loader threads.
    for i in range(8):
        t = ThreadLoader(work_queue)
        t.start()

    for filepath in os.listdir(dirname):
        print(filepath)
        work_queue.put(os.path.join(dirname, filepath))

    # Block until every queued file has been loaded.
    work_queue.join()

You'll need to replace the items in << double angle brackets >> with your own values.
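The worker-queue pattern in the script can be exercised without a database. A minimal sketch, with a thread-safe list standing in for the COPY call (all names here are illustrative):

```python
import queue
import threading

results = []
lock = threading.Lock()

def worker(q):
    # Pull items off the shared queue until the main thread exits.
    while True:
        item = q.get()
        with lock:
            results.append(item.upper())  # stand-in for the COPY statement
        q.task_done()

q = queue.Queue()
for _ in range(4):
    # Daemon threads don't keep the interpreter alive after q.join() returns.
    t = threading.Thread(target=worker, args=(q,), daemon=True)
    t.start()

for name in ["part-aa", "part-ab", "part-ac"]:
    q.put(name)

q.join()  # blocks until every item has been marked task_done()
print(sorted(results))  # ['PART-AA', 'PART-AB', 'PART-AC']
```

`q.join()` is what makes the pattern safe: the main thread only proceeds once every queued item has been processed, regardless of how the work was spread across threads.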

Run It!

$ python loader.py

That's all it took for me - leave a comment if you need a hand getting this to work and I'll be happy to help.
