@kelvintaywl
Last active May 9, 2024 11:39
Python Script to split CSV files into smaller files based on number of lines
import csv
import sys
import os

# example usage: python split.py example.csv 200
# above command would split `example.csv` into smaller CSV files of 200 rows each (with header included)
# if example.csv has 401 rows, for instance, this creates 3 files in the same directory:
# - `example_1.csv` (rows 1 - 200)
# - `example_2.csv` (rows 201 - 400)
# - `example_3.csv` (row 401)

CURRENT_DIR = os.path.dirname(os.path.realpath(__file__))
filename = sys.argv[1]
full_file_path = os.path.join(CURRENT_DIR, filename)
file_name = os.path.splitext(full_file_path)[0]
rows_per_csv = int(sys.argv[2]) if len(sys.argv) > 2 else 5000

with open(filename) as infile:
    reader = csv.DictReader(infile)
    header = reader.fieldnames
    rows = [row for row in reader]

pages = []
row_count = len(rows)
start_index = 0

# here, we slice the total rows into pages, each page having [rows_per_csv] rows
while start_index < row_count:
    pages.append(rows[start_index:start_index + rows_per_csv])
    start_index += rows_per_csv

for i, page in enumerate(pages):
    with open('{}_{}.csv'.format(file_name, i + 1), 'w+') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=header)
        writer.writeheader()
        for row in page:
            writer.writerow(row)

print('DONE splitting {} into {} files'.format(filename, len(pages)))
@pbp0881

pbp0881 commented Apr 29, 2022

This is great code! It works perfectly fine. But I have a file which is 4 GB, and when I run this code over it I get the following error:

Traceback (most recent call last):
  File "****\Split_CSV_sys.py", line 23, in <module>
    rows = [row for row in reader]
  File "C:\Program Files\Python39\lib\csv.py", line 111, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)

I would really appreciate it if there's a fix for this!

I tried running it over other files smaller than this one and it works just fine.
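A possible fix for the error above, for what it's worth: the csv module caps each field at 131072 bytes by default, and `csv.field_size_limit()` raises that cap. Something like this before the file is read (the shrink loop is there because `sys.maxsize` can overflow the underlying C long on some platforms):

```python
import csv
import sys

# Raise the csv module's per-field size cap (default 131072 bytes).
# sys.maxsize may overflow the C long on some platforms, so shrink it
# until the call is accepted.
max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        max_int = int(max_int / 10)
```

After this runs, `reader` should no longer raise `_csv.Error` on oversized fields.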

@badbeef

badbeef commented Apr 29, 2022

Simple version but will work for large files:

title = None
LINES_PER_FILE = 4000
lines_written = 0
file_no = 1
g = None
with open('test.txt', 'r') as f:
    for line in f:
        if title is None:
            title = line  # remember the header line
            continue
        if not lines_written:
            g = open('test%03d.txt' % file_no, 'w')
            g.write(title)  # repeat the header in every chunk
        g.write(line)
        lines_written += 1
        if lines_written >= LINES_PER_FILE:
            g.close()
            file_no += 1
            lines_written = 0
    if g is not None and not g.closed:
        g.close()

@jon-hedgerows

Simple version but will work for large files:

Except that it doesn't work for perfectly valid CSV files with multi-line fields.
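A sketch combining both ideas: stream through the file with `csv.reader` so quoted multi-line fields are parsed correctly, without loading everything into memory. The function name `split_csv` and the file names are illustrative, not part of the original gist:

```python
import csv
import os

def split_csv(path, rows_per_file=5000):
    """Stream `path` into numbered chunks of at most `rows_per_file`
    data rows each, repeating the header in every chunk.
    Returns the number of files written."""
    base, ext = os.path.splitext(path)
    with open(path, newline='') as infile:
        reader = csv.reader(infile)  # handles quoted multi-line fields
        header = next(reader)
        file_no = 0
        writer = None
        outfile = None
        for i, row in enumerate(reader):
            if i % rows_per_file == 0:
                # start a new chunk file
                if outfile:
                    outfile.close()
                file_no += 1
                outfile = open('{}_{}{}'.format(base, file_no, ext),
                               'w', newline='')
                writer = csv.writer(outfile)
                writer.writerow(header)
            writer.writerow(row)
        if outfile:
            outfile.close()
    return file_no
```

Because rows are written as they are read, memory use stays constant regardless of file size.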
