
@kelvintaywl
Last active May 9, 2024 11:39
Python Script to split CSV files into smaller files based on number of lines
import csv
import sys
import os
# example usage: python split.py example.csv 200
# above command would split the `example.csv` into smaller CSV files of 200 rows each (with header included)
# if example.csv has 401 rows for instance, this creates 3 files in same directory:
# - `example_1.csv` (row 1 - 200)
# - `example_2.csv` (row 201 - 400)
# - `example_3.csv` (row 401)
CURRENT_DIR = os.path.dirname(os.path.realpath(__file__))
filename = sys.argv[1]
full_file_path = os.path.join(CURRENT_DIR, filename)
file_name = os.path.splitext(full_file_path)[0]
rows_per_csv = int(sys.argv[2]) if len(sys.argv) > 2 else 5000
with open(filename) as infile:
    reader = csv.DictReader(infile)
    header = reader.fieldnames
    rows = [row for row in reader]

pages = []
row_count = len(rows)
start_index = 0
# here, we slice the total rows into pages, each page holding [rows_per_csv] rows
while start_index < row_count:
    pages.append(rows[start_index: start_index + rows_per_csv])
    start_index += rows_per_csv
for i, page in enumerate(pages):
    with open('{}_{}.csv'.format(file_name, i + 1), 'w+') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=header)
        writer.writeheader()
        for row in page:
            writer.writerow(row)
print('DONE splitting {} into {} files'.format(filename, len(pages)))
@Bannerman

Thanks! That just saved me a few hours!

@Jingyuanisxzs

Thank you!

@kevin25

kevin25 commented Feb 27, 2020

There is no error and no output on macOS.

@NguyenDa18

Just what I needed! Thank you! Added it in the same directory as my CSVs, ran the script with the split size and split CSVs were generated beautifully.

@gitclub-rk

Why do the output files have double quotes around values after the split?

@svercillo

Dude this is so beautiful, this is why I love GitHub :)

@justinTM

thanks Kelvin

@theacemada

Life saver

@badbeef

badbeef commented Nov 19, 2021

No offense. I wonder why this program needs the csv module?

Isn't a csv file made up of a schema line and data lines in a text file? In which case, the program just needs to split a csv file by reading and writing lines instead of importing and exporting csv data.

@kelvintaywl
Author

@badbeef

that is a great question!
In a simple CSV file, it is probably true that you can get away with manipulating lines without using the csv module.

However, think about the case where one of the values for a column is a multi-line string.

For example:

title,body
"Zen of Python, By Tim Peters","Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!"

If we did not use the csv module, we would need to parse the quoting ourselves to "know" exactly where a column value ends in a case like the one above.
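To make the point concrete, here is a minimal sketch (not part of the gist; the data string is a shortened version of the example above) showing that csv.reader parses the quoted multi-line body as a single record, while naive line counting overshoots:

```python
import csv
import io

# Shortened version of the example above: one record whose "body"
# value spans two physical lines inside quotes.
data = ('title,body\n'
        '"Zen of Python, By Tim Peters","Beautiful is better than ugly.\n'
        'Explicit is better than implicit."\n')

rows = list(csv.reader(io.StringIO(data)))
print(len(rows))               # 2: header row + one record
print(len(data.splitlines()))  # 3: naive line splitting sees three "rows"
```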

Hope this answers your question!

Also, thank you for all the kind comments from everyone :)

This was something I wrote a long time ago (when Python 2.7 was still the "main" runtime version).
I'm glad to see that this is still useful, and working for you folks! 🍺

@pbp0881

pbp0881 commented Apr 29, 2022

This is great code!! It works perfectly fine. But I have a file which is 4 GB, and when I run this code over it, it gives me the following error:

Traceback (most recent call last):
  File "****\Split_CSV_sys.py", line 23, in <module>
    rows = [row for row in reader]
  File "C:\Program Files\Python39\lib\csv.py", line 111, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)

would really appreciate if there's a fix for this!!

I tried running it over other files smaller than this and it works just fine.
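For what it's worth, a common workaround for this particular _csv.Error (a sketch, not from the thread) is to raise the csv module's per-field size limit before reading; sys.maxsize can overflow the underlying C long on some platforms, so back off until a value is accepted:

```python
import csv
import sys

# Raise the parser's per-field limit (the default is 131072 bytes).
limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)
        break
    except OverflowError:
        # sys.maxsize exceeds a C long on some platforms; shrink and retry.
        limit = int(limit / 10)
```

Run this once before constructing the reader, and fields larger than the default limit will parse instead of raising.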

@badbeef

badbeef commented Apr 29, 2022

Simple version but will work for large files:

title = None
LINES_PER_FILE = 4000
lines_written = 0
file_no = 1
g = None
with open('test.txt', 'r') as f:
    for line in f:
        if title is None:
            # first line is the header; it is re-written at the top of each part
            title = line
            continue
        if lines_written == 0:
            g = open('test%03d.txt' % file_no, 'w')
            g.write(title)
        g.write(line)
        lines_written += 1
        if lines_written >= LINES_PER_FILE:
            g.close()
            file_no += 1
            lines_written = 0
if g is not None and not g.closed:
    g.close()

@jon-hedgerows

Simple version but will work for large files:

except that it doesn't work for perfectly valid csv files with multi-line fields.
