@bennadel
Created March 25, 2014 00:03
Breaking Enormous CSV Files Into Smaller CSV Files
<!---
Set the maximum number of lines per smaller data file. The
last file created may contain fewer lines than this.
--->
<cfset REQUEST.MaxLines = 1000 />
<!--- Get the path to the HUGE CSV data file. --->
<cfset REQUEST.DataFilePath = ExpandPath( "./data.txt" ) />
<!---
Make sure the output directory for the smaller files exists;
the file write below will fail if it does not.
--->
<cfif NOT DirectoryExists( ExpandPath( "./small/" ) )>
<cfdirectory action="create" directory="#ExpandPath( './small/' )#" />
</cfif>
<!---
Create a line number reader. In typical Java fashion, we
get this by wrapping the file in several layers of utility
classes.
--->
<cfset REQUEST.LineReader = CreateObject( "java", "java.io.LineNumberReader" ).Init(
CreateObject( "java", "java.io.FileReader" ).Init(
CreateObject( "java", "java.io.File" ).Init(
REQUEST.DataFilePath
)
)
) />
<!---
This is a string buffer for building the smaller CSV data
files. The string buffer lets us append each line cheaply
and build the final string once, instead of creating a new
string for every concatenation.
--->
<cfset REQUEST.CSVData = CreateObject(
"java",
"java.lang.StringBuffer"
).Init() />
<!--- Read the first line of data. --->
<cfset REQUEST.LineData = REQUEST.LineReader.ReadLine() />
<!---
Continue while we still have lines to read. Once we
run out of lines to read, the LineReader will return
null. That will cause the key, "LineData," to be
removed from its parent scope, REQUEST.
--->
<cfloop condition="StructKeyExists( REQUEST, 'LineData' )">
<!--- Get the line number for this iteration. --->
<cfset REQUEST.LineNumber = REQUEST.LineReader.GetLineNumber() />
<!---
Add this line of data to the string buffer. Be sure
to add new lines as the line reader strips out the
new line / carriage return data.
--->
<cfset REQUEST.CSVData.Append(
REQUEST.LineData & Chr( 13 ) & Chr( 10 )
) />
<!--- Read the next line. --->
<cfset REQUEST.LineData = REQUEST.LineReader.ReadLine() />
<!---
Check to see if our buffer is big enough. For this demo,
we will be creating files that are 1000 lines or fewer.
At this point, we might have 1000 lines, or we might not
have ANY lines left to read. If we do not have any lines
left, then the LineData variable will no longer exist.
--->
<cfif (
(NOT (REQUEST.LineNumber MOD REQUEST.MaxLines)) OR
(NOT StructKeyExists( REQUEST, "LineData" ))
)>
<!---
Get the CSV file name for this smaller file. The
file name is based on the line number of the last
line that was added to the buffer.
--->
<cfset REQUEST.SmallFilePath = ExpandPath(
"./small/" &
NumberFormat( REQUEST.LineNumber, '0000000000' ) &
".txt"
) />
<!---
We have 1000 lines, or we have reached the end of the
file. Output the CSV buffer to the smaller data file.
For ease of use, the file name is based on the line
numbers (see just above).
--->
<cffile
action="WRITE"
file="#REQUEST.SmallFilePath#"
output="#REQUEST.CSVData.ToString()#"
addnewline="false"
fixnewline="true"
/>
<!---
Create a new string buffer to be used with the
next CSV file.
--->
<cfset REQUEST.CSVData = CreateObject(
"java",
"java.lang.StringBuffer"
).Init() />
</cfif>
</cfloop>
<!---
Close the line reader to release the underlying file handle
now that all lines have been read.
--->
<cfset REQUEST.LineReader.Close() />