@bennadel
Created March 25, 2014 12:10
Parsing CSV Data With An Input Stream And A Finite State Machine
<cfcomponent
output="false"
hint="I listen for CSV parser events and compile an array of arrays.">
<cffunction
name="init"
access="public"
returntype="any"
output="false"
hint="I initialize this component.">
<!--- Set up the data. --->
<cfset variables.csvData = [] />
<!--- Return this object reference. --->
<cfreturn this />
</cffunction>
<cffunction
name="getData"
access="public"
returntype="array"
output="false"
hint="I return the current data collection.">
<cfreturn duplicate( variables.csvData ) />
</cffunction>
<cffunction
name="handleEvent"
access="public"
returntype="any"
output="false"
hint="I listen for and then respond to events published by a CSV parser.">
<!--- Define arguments. --->
<cfargument
name="eventType"
type="string"
required="true"
hint="I am the type of event being raised."
/>
<cfargument
name="eventData"
type="string"
required="false"
default=""
hint="I am the (optional) data being published along with the CSV parsing event."
/>
<!---
<cffile
action="append"
file="#expandPath( './log.txt' )#"
output="#arguments.eventType# [#arguments.eventData#]"
addnewline="true"
/>
--->
<!--- Check to see what kind of event we have. --->
<cfif (arguments.eventType eq "startRow")>
<!--- Push a new row onto the data. --->
<cfset arrayAppend(
variables.csvData,
arrayNew( 1 )
) />
<cfelseif (arguments.eventType eq "endField")>
<!--- Push this field onto the latest row. --->
<cfset arrayAppend(
variables.csvData[ arrayLen( variables.csvData ) ],
arguments.eventData
) />
</cfif>
<!--- Return this object reference. --->
<cfreturn this />
</cffunction>
</cfcomponent>
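<!---
A minimal sketch, not part of the original gist: a second,
hypothetical handler (the file name RowCounter.cfc and the
getRowCount() method are assumptions) that tallies rows as the
events stream by instead of storing the field values. It uses
the same handleEvent() signature as the handler above, so it
should be usable anywhere that handler is.
--->
<cfcomponent
output="false"
hint="I listen for CSV parser events and count the rows (hypothetical example).">
<cffunction
name="init"
access="public"
returntype="any"
output="false"
hint="I initialize this component.">
<!--- Start with no rows seen. --->
<cfset variables.rowCount = 0 />
<!--- Return this object reference. --->
<cfreturn this />
</cffunction>
<cffunction
name="getRowCount"
access="public"
returntype="numeric"
output="false"
hint="I return the number of rows seen so far.">
<cfreturn variables.rowCount />
</cffunction>
<cffunction
name="handleEvent"
access="public"
returntype="any"
output="false"
hint="I respond to events published by a CSV parser.">
<!--- Define arguments. --->
<cfargument name="eventType" type="string" required="true" />
<cfargument name="eventData" type="string" required="false" default="" />
<!---
Count on startRow, the same event that the array-building
handler uses to begin a new row. All other events are
ignored, so no field data is held in memory.
--->
<cfif (arguments.eventType eq "startRow")>
<cfset variables.rowCount = (variables.rowCount + 1) />
</cfif>
<!--- Return this object reference. --->
<cfreturn this />
</cffunction>
</cfcomponent>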
<!---
Get the file path to the CSV data file that we will be reading
in with an Input Stream (so as not to have to read the whole
file at one time).
--->
<cfset filePath = expandPath( "./widgets.csv" ) />
<!---
Create our handler. This must have one method - handleEvent() -
which can respond to events published by the CSV parser.
--->
<cfset handler = createObject( "component", "Handler" ).init() />
<!--- Create our CSV data evented parser. --->
<cfset parser = createObject( "component", "CSVParser" ).init(
filePath,
handler
) />
<!--- Output the result. --->
<cfdump
var="#handler.getData()#"
label="CSV Data"
/>
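<!---
A minimal sketch, not part of the original gist: walk the array
of arrays returned by getData() and print one line per CSV row.
The variable names (rows, row) are illustrative and the pipe
delimiter is an arbitrary choice for display.
--->
<cfset rows = handler.getData() />
<cfoutput>
<cfloop array="#rows#" index="row">
<!--- Each row is itself an array of field values. --->
#htmlEditFormat( arrayToList( row, "|" ) )#<br />
</cfloop>
</cfoutput>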
<cfcomponent
output="false"
hint="I parse a CSV file using a buffered input reader. Rather than parsing the entire file at one time, events are published as aspects of the file are read.">
<!---
This finite state machine is used to parse Comma Separated
Values. The available states are:
- Pre-Data (first state - only used once)
- Between Fields
- Non-Quoted Value
- Quoted Value
- Escaped Value
- Carriage Return
- New Line
--->
<cffunction
name="init"
access="public"
returntype="any"
output="false"
hint="I initialize this component instance.">
<!--- Define arguments. --->
<cfargument
name="filePath"
type="string"
required="true"
hint="I am the file path to the CSV data."
/>
<cfargument
name="handler"
type="any"
required="true"
hint="I am the object that listens for CSV parsing events. Only one method is required: handleEvent()."
/>
<!--- Store the file path. --->
<cfset variables.filePath = arguments.filePath />
<!--- Store the handler for the parsing events. --->
<cfset variables.handler = arguments.handler />
<!---
Store a buffered input stream to the given file path.
This will allow us to optimize the input process while,
at the same time, not having to hold the entire file
in memory at any given time.
--->
<cfset variables.inputStream = createObject( "java", "java.io.BufferedInputStream" ).init(
createObject( "java", "java.io.FileInputStream" ).init(
javaCast( "string", variables.filePath )
)
) />
<!---
I am the current value buffer. As we are building field
values up, a character at a time, we will need a place
to hold them before we publish a field event.
--->
<cfset variables.fieldBuffer = [] />
<!---
I am the current state. In our case, a state is
represented by a parser that can take one character
at a time. To begin with, we will put the state into a
pre-data state (this is the only time that it will be
used in order to see if any data is in the document).
--->
<cfset variables.state = this.inPreData />
<!--- Start the actual CSV input stream parsing. --->
<cfset this.parse() />
<!--- Return this object reference. --->
<cfreturn this />
</cffunction>
<cffunction
name="parse"
access="public"
returntype="any"
output="false"
hint="I perform the actual parsing of the CSV input stream.">
<!--- Define the local scope. --->
<cfset var local = {} />
<!---
Even if the document has no data, we will, at the very
least, start and end the document.
--->
<cfset this.publish( "startDocument" ) />
<!--- Read the first character in the CSV input stream. --->
<cfset local.nextByte = variables.inputStream.read() />
<!---
The input stream will be providing a single byte at a
time. It will continue doing this until it hits the end
of the stream, at which point, it will return -1.
--->
<cfloop condition="(local.nextByte neq -1)">
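<!---
Get the character version of the byte. NOTE: Since each byte
is converted on its own, this assumes a single-byte character
encoding; multi-byte characters (such as non-ASCII UTF-8)
will not be decoded correctly.
--->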
<cfset local.nextCharacter = chr( local.nextByte ) />
<!---
Pass the character off to the current state. When the
state looks at the character, it will (potentially)
announce events and then return the state to which we
should transition.
--->
<cfset variables.state = variables.state( local.nextCharacter ) />
<!--- Read the next byte. --->
<cfset local.nextByte = variables.inputStream.read() />
</cfloop>
<!---
Now that the document has ended, we need to pass it onto
the current state so that it can wrap it up appropriately
(or fail if the End-of-File is in an inappropriate place).
At this point, we don't care about storing the resultant
state since we are done parsing.
NOTE: We are using EOT (end of transmission) to denote
the "End of File" since we can use RegEx to find that.
--->
<cfset variables.state( chr( 4 ) ) />
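<!--- Close the input stream now that we are done reading. --->
<cfset variables.inputStream.close() />
<!--- End the document. --->
<cfset this.publish( "endDocument" ) />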
<!--- Return this object reference. --->
<cfreturn this />
</cffunction>
<cffunction
name="publish"
access="public"
returntype="any"
output="false"
hint="I publish the given event with the given data.">
<!--- Define arguments. --->
<cfargument
name="eventType"
type="string"
required="true"
hint="I am the event type. Possible types are: startDocument, startRow, startField, endField, endRow, endDocument."
/>
<cfargument
name="eventData"
type="string"
required="false"
hint="I am the optional data to announce with the event."
/>
<!---
For our purposes, we'll just pass the invocation
arguments along to the event handler.
--->
<cfset variables.handler.handleEvent(
argumentCollection = arguments
) />
<!--- Return this object reference for method chaining. --->
<cfreturn this />
</cffunction>
<cffunction
name="inBetweenFields"
access="public"
returntype="any"
output="false"
hint="I handle the next character when the parser has just passed a field delimiter (comma).">
<!--- Define arguments. --->
<cfargument
name="nextCharacter"
type="string"
required="true"
hint="I am the next character in the input stream."
/>
<!--- Define the local scope. --->
<cfset var local = {} />
<!--- Field character. --->
<cfif reFind( "[^\r\n,""\x04]", arguments.nextCharacter )>
<!--- Start the new field. --->
<cfset this.publish( "startField" ) />
<!---
Add the current character to the field buffer (as we
begin to build up the field value).
--->
<cfset arrayAppend(
variables.fieldBuffer,
arguments.nextCharacter
) />
<!--- Move to the non-quoted value. --->
<cfreturn this.inNonQuotedValue />
<!--- Comma. --->
<cfelseif (arguments.nextCharacter eq ",")>
<!--- Start and end an empty field. --->
<cfset this.publish( "startField" ) />
<cfset this.publish( "endField", "" ) />
<!--- Move to in between fields. --->
<cfreturn this.inBetweenFields />
<!--- Carriage return. --->
<cfelseif reFind( "\r", arguments.nextCharacter )>
<!--- Start and end an empty field. --->
<cfset this.publish( "startField" ) />
<cfset this.publish( "endField", "" ) />
<!--- End the row. --->
<cfset this.publish( "endRow" ) />
<!--- Move to the carriage return. --->
<cfreturn this.inCarriageReturn />
<!--- New line. --->
<cfelseif reFind( "\n", arguments.nextCharacter )>
<!--- Start and end an empty field. --->
<cfset this.publish( "startField" ) />
<cfset this.publish( "endField", "" ) />
<!--- End the row. --->
<cfset this.publish( "endRow" ) />
<!--- Move to the new line. --->
<cfreturn this.inNewLine />
<!--- Double Quote. --->
<cfelseif (arguments.nextCharacter eq """")>
<!--- Start the new field. --->
<cfset this.publish( "startField" ) />
<!--- Move to the quoted value. --->
<cfreturn this.inQuotedValue />
<!--- End of Transmission. --->
<cfelseif reFind( "\x04", arguments.nextCharacter )>
<!--- Start and end an empty field. --->
<cfset this.publish( "startField" ) />
<cfset this.publish( "endField", "" ) />
<!--- End the row. --->
<cfset this.publish( "endRow" ) />
<cfelse>
<!---
If we made it this far, this state has been put
into an invalid state / transition.
--->
<cfthrow
type="InvalidStateTransition"
message="inBetweenFields[#arguments.nextCharacter#]"
/>
</cfif>
</cffunction>
<cffunction
name="inCarriageReturn"
access="public"
returntype="any"
output="false"
hint="I handle the next character when the previous character was a carriage return.">
<!--- Define arguments. --->
<cfargument
name="nextCharacter"
type="string"
required="true"
hint="I am the next character in the input stream."
/>
<!--- Define the local scope. --->
<cfset var local = {} />
<!--- New line. --->
<cfif reFind( "\n", arguments.nextCharacter )>
<!--- Move to the new line. --->
<cfreturn this.inNewLine />
<!--- Carriage return. --->
<cfelseif reFind( "\r", arguments.nextCharacter )>
<!--- Start and end an empty row. --->
<cfset this.publish( "startRow" ) />
<cfset this.publish( "endRow" ) />
<!--- Move to the carriage return. --->
<cfreturn this.inCarriageReturn />
<!--- Field character. --->
<cfelseif reFind( "[^\r\n,""\x04]", arguments.nextCharacter )>
<!--- Start the next row. --->
<cfset this.publish( "startRow" ) />
<!--- Start the next field. --->
<cfset this.publish( "startField" ) />
<!--- Add the current character to the field buffer. --->
<cfset arrayAppend(
variables.fieldBuffer,
arguments.nextCharacter
) />
<!--- Move to the non-quoted value. --->
<cfreturn this.inNonQuotedValue />
<!--- Comma. --->
<cfelseif (arguments.nextCharacter eq ",")>
<!--- Start the new row. --->
<cfset this.publish( "startRow" ) />
<!--- Start and end the empty field. --->
<cfset this.publish( "startField" ) />
<cfset this.publish( "endField", "" ) />
<!--- Move to in between fields. --->
<cfreturn this.inBetweenFields />
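<!--- Double quote. --->
<cfelseif (arguments.nextCharacter eq """")>
<!--- Start the new row. --->
<cfset this.publish( "startRow" ) />
<!--- Start the next field. --->
<cfset this.publish( "startField" ) />
<!--- Move to the quoted value. --->
<cfreturn this.inQuotedValue />
<!--- End of Transmission. --->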
<cfelseif reFind( "\x04", arguments.nextCharacter )>
<!--- Already ended the row - nothing to publish. --->
<cfelse>
<!---
If we made it this far, this state has been put
into an invalid state / transition.
--->
<cfthrow
type="InvalidStateTransition"
message="inCarriageReturn[#arguments.nextCharacter#]"
/>
</cfif>
</cffunction>
<cffunction
name="inEscapedValue"
access="public"
returntype="any"
output="false"
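hint="I handle the next character when the previous character was a double quote inside a quoted value (either an escape or the closing quote).">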
<!--- Define arguments. --->
<cfargument
name="nextCharacter"
type="string"
required="true"
hint="I am the next character in the input stream."
/>
<!--- Define the local scope. --->
<cfset var local = {} />
<!--- Double-escaped quote. --->
<cfif (arguments.nextCharacter eq """")>
<!---
This is just an embedded quote. Add it to the
field buffer. We don't have to worry about the
previous double-quote as it was only used to
escape this one.
--->
<cfset arrayAppend(
variables.fieldBuffer,
arguments.nextCharacter
) />
<!--- Return back to the quoted value. --->
<cfreturn this.inQuotedValue />
<!--- Comma. --->
<cfelseif (arguments.nextCharacter eq ",")>
<!---
The previous quote was actually the end of the
previous field. End the previous field.
--->
<cfset this.publish(
"endField",
arrayToList( variables.fieldBuffer, "" )
) />
<!--- Clear the field buffer. --->
<cfset variables.fieldBuffer = [] />
<!--- Move to in between fields. --->
<cfreturn this.inBetweenFields />
<!--- Carriage return. --->
<cfelseif reFind( "\r", arguments.nextCharacter )>
<!---
The previous quote was actually the end of the
previous field. End the current field.
--->
<cfset this.publish(
"endField",
arrayToList( variables.fieldBuffer, "" )
) />
<!--- Clear the field buffer. --->
<cfset variables.fieldBuffer = [] />
<!--- End the row. --->
<cfset this.publish( "endRow" ) />
<!--- Move to the carriage return. --->
<cfreturn this.inCarriageReturn />
<!--- New line. --->
<cfelseif reFind( "\n", arguments.nextCharacter )>
<!---
The previous quote was actually the end of the
previous field. End the current field.
--->
<cfset this.publish(
"endField",
arrayToList( variables.fieldBuffer, "" )
) />
<!--- Clear the field buffer. --->
<cfset variables.fieldBuffer = [] />
<!--- End the row. --->
<cfset this.publish( "endRow" ) />
<!--- Move to the new line. --->
<cfreturn this.inNewLine />
<!--- End of Transmission. --->
<cfelseif reFind( "\x04", arguments.nextCharacter )>
<!---
The previous quote was actually the end of the
previous field. End the current field.
--->
<cfset this.publish(
"endField",
arrayToList( variables.fieldBuffer, "" )
) />
<!--- Clear the field buffer. --->
<cfset variables.fieldBuffer = [] />
<!--- End the row. --->
<cfset this.publish( "endRow" ) />
<cfelse>
<!---
If we made it this far, this state has been put
into an invalid state / transition.
--->
<cfthrow
type="InvalidStateTransition"
message="inEscapedValue[#arguments.nextCharacter#]"
/>
</cfif>
</cffunction>
<cffunction
name="inNewLine"
access="public"
returntype="any"
output="false"
hint="I handle the next character when the previous character was a new line.">
<!--- Define arguments. --->
<cfargument
name="nextCharacter"
type="string"
required="true"
hint="I am the next character in the input stream."
/>
<!--- Define the local scope. --->
<cfset var local = {} />
<!--- Carriage return. --->
<cfif reFind( "\r", arguments.nextCharacter )>
<!--- Start and end an empty row. --->
<cfset this.publish( "startRow" ) />
<cfset this.publish( "endRow" ) />
<!--- Move to the carriage return. --->
<cfreturn this.inCarriageReturn />
<!--- New line. --->
<cfelseif reFind( "\n", arguments.nextCharacter )>
<!--- Start and end an empty row. --->
<cfset this.publish( "startRow" ) />
<cfset this.publish( "endRow" ) />
<!--- Move to the new line. --->
<cfreturn this.inNewLine />
<!--- Field character. --->
<cfelseif reFind( "[^\r\n,""\x04]", arguments.nextCharacter )>
<!--- Start the next row. --->
<cfset this.publish( "startRow" ) />
<!--- Start the next field. --->
<cfset this.publish( "startField" ) />
<!--- Add the current character to the field buffer. --->
<cfset arrayAppend(
variables.fieldBuffer,
arguments.nextCharacter
) />
<!--- Move to the non-quoted value. --->
<cfreturn this.inNonQuotedValue />
<!--- Comma. --->
<cfelseif (arguments.nextCharacter eq ",")>
<!--- Start the new row. --->
<cfset this.publish( "startRow" ) />
<!--- Start and end the empty field. --->
<cfset this.publish( "startField" ) />
<cfset this.publish( "endField", "" ) />
<!--- Move to in between fields. --->
<cfreturn this.inBetweenFields />
<!--- Double-quote. --->
<cfelseif (arguments.nextCharacter eq """")>
<!--- Start the new row. --->
<cfset this.publish( "startRow" ) />
<!--- Start the next field. --->
<cfset this.publish( "startField" ) />
<!--- Move to quoted value. --->
<cfreturn this.inQuotedValue />
<!--- End of Transmission. --->
<cfelseif reFind( "\x04", arguments.nextCharacter )>
<!--- Already ended the row, nothing left to do. --->
<cfelse>
<!---
If we made it this far, this state has been put
into an invalid state / transition.
--->
<cfthrow
type="InvalidStateTransition"
message="inNewLine[#arguments.nextCharacter#]"
/>
</cfif>
</cffunction>
<cffunction
name="inNonQuotedValue"
access="public"
returntype="any"
output="false"
hint="I handle the next character while inside a non-quoted field value.">
<!--- Define arguments. --->
<cfargument
name="nextCharacter"
type="string"
required="true"
hint="I am the next character in the input stream."
/>
<!--- Define the local scope. --->
<cfset var local = {} />
<!--- Field character. --->
<cfif reFind( "[^\r\n,""\x04]", arguments.nextCharacter )>
<!--- Add the current character to the field buffer. --->
<cfset arrayAppend(
variables.fieldBuffer,
arguments.nextCharacter
) />
<!--- Move to the non-quoted value. --->
<cfreturn this.inNonQuotedValue />
<!--- Comma. --->
<cfelseif (arguments.nextCharacter eq ",")>
<!--- End the current field. --->
<cfset this.publish(
"endField",
arrayToList( variables.fieldBuffer, "" )
) />
<!--- Clear the field buffer. --->
<cfset variables.fieldBuffer = [] />
<!--- Move to in between fields. --->
<cfreturn this.inBetweenFields />
<!--- Carriage return. --->
<cfelseif reFind( "\r", arguments.nextCharacter )>
<!--- End the current field. --->
<cfset this.publish(
"endField",
arrayToList( variables.fieldBuffer, "" )
) />
<!--- Clear the field buffer. --->
<cfset variables.fieldBuffer = [] />
<!--- End the row. --->
<cfset this.publish( "endRow" ) />
<!--- Move to the carriage return. --->
<cfreturn this.inCarriageReturn />
<!--- New line. --->
<cfelseif reFind( "\n", arguments.nextCharacter )>
<!--- End the current field. --->
<cfset this.publish(
"endField",
arrayToList( variables.fieldBuffer, "" )
) />
<!--- Clear the field buffer. --->
<cfset variables.fieldBuffer = [] />
<!--- End the row. --->
<cfset this.publish( "endRow" ) />
<!--- Move to the new line. --->
<cfreturn this.inNewLine />
<!--- End of Transmission. --->
<cfelseif reFind( "\x04", arguments.nextCharacter )>
<!--- End the current field. --->
<cfset this.publish(
"endField",
arrayToList( variables.fieldBuffer, "" )
) />
<!--- Clear the field buffer. --->
<cfset variables.fieldBuffer = [] />
<!--- End the row. --->
<cfset this.publish( "endRow" ) />
<cfelse>
<!---
If we made it this far, this state has been put
into an invalid state / transition.
--->
<cfthrow
type="InvalidStateTransition"
message="inNonQuotedValue[#arguments.nextCharacter#]"
/>
</cfif>
</cffunction>
<cffunction
name="inPreData"
access="public"
returntype="any"
output="false"
hint="I handle the very first character of the input (the initial state).">
<!--- Define arguments. --->
<cfargument
name="nextCharacter"
type="string"
required="true"
hint="I am the next character in the input stream."
/>
<!--- Define the local scope. --->
<cfset var local = {} />
<!--- Comma. --->
<cfif (arguments.nextCharacter eq ",")>
<!--- Start the current row. --->
<cfset this.publish( "startRow" ) />
<!--- Start and end an empty field. --->
<cfset this.publish( "startField" ) />
<cfset this.publish( "endField", "" ) />
<!--- Move to in between fields. --->
<cfreturn this.inBetweenFields />
<!--- Carriage return. --->
<cfelseif reFind( "\r", arguments.nextCharacter )>
<!--- Start and end the row. --->
<cfset this.publish( "startRow" ) />
<cfset this.publish( "endRow" ) />
<!--- Move to the carriage return. --->
<cfreturn this.inCarriageReturn />
<!--- New line. --->
<cfelseif reFind( "\n", arguments.nextCharacter )>
<!--- Start and end the row. --->
<cfset this.publish( "startRow" ) />
<cfset this.publish( "endRow" ) />
<!--- Move to the new line. --->
<cfreturn this.inNewLine />
<!--- Double Quote. --->
<cfelseif (arguments.nextCharacter eq """")>
<!--- Start the first row. --->
<cfset this.publish( "startRow" ) />
<!--- Start the new field. --->
<cfset this.publish( "startField" ) />
<!--- Move to the quoted value. --->
<cfreturn this.inQuotedValue />
<!--- Field character. --->
<cfelseif reFind( "[^\r\n,""\x04]", arguments.nextCharacter )>
<!--- Start the first row. --->
<cfset this.publish( "startRow" ) />
<!--- Start the new field. --->
<cfset this.publish( "startField" ) />
<!---
Add the current character to the field buffer (as we
begin to build up the field value).
--->
<cfset arrayAppend(
variables.fieldBuffer,
arguments.nextCharacter
) />
<!--- Move to the non-quoted value. --->
<cfreturn this.inNonQuotedValue />
<!--- End of Transmission. --->
<cfelseif reFind( "\x04", arguments.nextCharacter )>
<!--- This file had no data, nothing left to do. --->
<cfelse>
<!---
If we made it this far, this state has been put
into an invalid state / transition.
--->
<cfthrow
type="InvalidStateTransition"
message="inPreData[#arguments.nextCharacter#]"
/>
</cfif>
</cffunction>
<cffunction
name="inQuotedValue"
access="public"
returntype="any"
output="false"
hint="I handle the next character while inside a quoted field value.">
<!--- Define arguments. --->
<cfargument
name="nextCharacter"
type="string"
required="true"
hint="I am the next character in the input stream."
/>
<!--- Define the local scope. --->
<cfset var local = {} />
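<!--- End of Transmission (the quoted value was never closed). --->
<cfif reFind( "\x04", arguments.nextCharacter )>
<cfthrow
type="InvalidStateTransition"
message="inQuotedValue[EOT]"
/>
<!--- Non-double-quote. --->
<cfelseif (arguments.nextCharacter neq """")>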
<!--- Add the current character to the field buffer. --->
<cfset arrayAppend(
variables.fieldBuffer,
arguments.nextCharacter
) />
<!--- Move to the quoted value. --->
<cfreturn this.inQuotedValue />
<!--- Double quote. --->
<cfelseif (arguments.nextCharacter eq """")>
<!---
Not sure if this quote is an escaped quote or is the
end of this quoted field. Move to the escaped state
for further testing.
--->
<cfreturn this.inEscapedValue />
<cfelse>
<!---
If we made it this far, this state has been put
into an invalid state / transition.
--->
<cfthrow
type="InvalidStateTransition"
message="inQuotedValue[#arguments.nextCharacter#]"
/>
</cfif>
</cffunction>
</cfcomponent>
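<!---
A minimal sketch, not part of the original gist: a hypothetical
handler (the file name EventLog.cfc and the getEvents() method
are assumptions) that records every event in order, which makes
the state transitions of the parser above easier to follow.

As a worked example, consider a file with exactly two lines,
separated by a single new line and with no trailing new line:

a,b
"c ""d""",e

Tracing the states above by hand, the recorded events should be:

startDocument[], startRow[], startField[], endField[a],
startField[], endField[b], endRow[], startRow[], startField[],
endField[c "d"], startField[], endField[e], endRow[],
endDocument[]
--->
<cfcomponent
output="false"
hint="I listen for CSV parser events and record them in order (hypothetical example).">
<cffunction
name="init"
access="public"
returntype="any"
output="false"
hint="I initialize this component.">
<!--- Set up the event log. --->
<cfset variables.events = [] />
<!--- Return this object reference. --->
<cfreturn this />
</cffunction>
<cffunction
name="getEvents"
access="public"
returntype="array"
output="false"
hint="I return a copy of the recorded events.">
<cfreturn duplicate( variables.events ) />
</cffunction>
<cffunction
name="handleEvent"
access="public"
returntype="any"
output="false"
hint="I record each event published by a CSV parser.">
<!--- Define arguments. --->
<cfargument name="eventType" type="string" required="true" />
<cfargument name="eventData" type="string" required="false" default="" />
<!--- Record the event as "type[data]". --->
<cfset arrayAppend(
variables.events,
"#arguments.eventType#[#arguments.eventData#]"
) />
<!--- Return this object reference. --->
<cfreturn this />
</cffunction>
</cfcomponent>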