Skip to content

Instantly share code, notes, and snippets.

@kurt-mueller-osumc
Last active December 31, 2021 17:36
Show Gist options
  • Star 8 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save kurt-mueller-osumc/0dffb4eb30b7dd7f8c2682f241d72a2b to your computer and use it in GitHub Desktop.
Save kurt-mueller-osumc/0dffb4eb30b7dd7f8c2682f241d72a2b to your computer and use it in GitHub Desktop.
ParsingCSVsWithFsharp.ipynb
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# How I parse & validate CSVs with FSharp"
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## Introduction\n",
"\n",
"I use F# for common everyday tasks - like parsing/validating CSVs, XMLs, and JSONs, and making http api requests. In my day job I wrangle a lot of data - moving data from one location, transforming it into various formats, and loading it into another. F# is great for this - it has an expressive yet concise syntax similar to Ruby/Python, built-in types like `Option` and `Result`, and unique language features like `Active Patterns` and to parse and validate data. I also make heavy use of its discriminated unions, records, and units of measure to model that data as well as computation expressions to help compose those parsers/validators together. F# has a solid feature set that clicks with me and I have yet to experience in any other language.\n",
"\n",
"This article is about a basic, everyday task: parsing and validating CSVs with F#. This means reading CSVs from disk with the `FSharp.Data` library, making extensive use of F#'s `Result` type to validate CSV fields, active patterns and computation expressions to compose results together, and modeling the application's domain using F#'s algebraic data types."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"## An example CSV: Patient Vital Statuses\n",
"\n",
"I currently work as a software engineer at the James Cancer Hospital & Solove Research Institute at the Ohio State University Medical Center. One of the data sources I read in with F# are patient's vital statuses."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"### Modeling the Domain\n",
"\n",
"A patient's vital status can be one of three options: they are either alive, dead, or lost to followup. If they're dead, they have a year of death, an age of death, as well as a cause of death. F# is excellent at succinctly modeling this domain:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"type VitalStatus =\n",
" | Alive\n",
" | Dead of DeathInfo\n",
" | ``Lost to follow-up``\n",
"and DeathInfo = {\n",
" YearOfDeath: YearOfDeath\n",
" AgeAtDeath: AgeAtDeath\n",
" CauseOfDeath: CauseOfDeath\n",
"}\n",
"and YearOfDeath = YearOfDeath of year : uint\n",
"and AgeAtDeath = AgeAtDeath of age : uint\n",
"and CauseOfDeath =\n",
" | ``Due to Cancer``\n",
" | ``Due to Other Cause``\n",
" | ``Probably Due to Cancer``\n",
" | ``Unknown Cause``\n",
" | ``Unlikely Due to Cancer``\n",
"\n",
"type PatientVitalStatus = {\n",
" PatientId: PatientId\n",
" VitalStatus: VitalStatus\n",
"}\n",
"and PatientId = PatientId of id: string"
],
"outputs": [],
"execution_count": 7,
"metadata": {
"dotnet_interactive": {
"language": "fsharp"
},
"execution": {
"iopub.status.busy": "2021-07-29T21:05:37.936Z",
"iopub.execute_input": "2021-07-29T21:05:37.941Z",
"iopub.status.idle": "2021-07-29T21:05:38.026Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"Each instance of `PatientVitalStatus` must have a patient id and a vital status. That vital status can either be `Alive` or `Dead`. If the `VitalStatus` is `Dead`, there must be `DeathInfo` associated with it: the `YearOfDeath`, `AgeAtDeath`, and `CauseOfDeath`. All *valid* rows in the \"vital statuses\" csv will be turned into instances of the `PatientVitalStatus` type. There is no way to create this type that is not a valid instance of a patient's vital status.\n",
"\n",
"I like to use \"single case discriminated unions types\" for the `YearOfDeath` and `AgeAtDeath` to signify to other code readers that instead of just everyday, unremmarkable unsigned integers, they instead have semantic, distinct meanings. If you'd like to read more about single cased union types and the rationale behind using them, the venerable Scott Wlaschin has an excellent blog post on them, [here](https://fsharpforfunandprofit.com/posts/designing-with-types-single-case-dus/).\n",
"\n",
"Getting back to validating csvs with F#, below are two instances of valid patient statuses. Note that it is *impossible* to create an invalid vital status."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"let deadPatient = {\n",
" PatientId = PatientId \"A000001\"\n",
" VitalStatus = Dead {\n",
" YearOfDeath = YearOfDeath 2015u\n",
" AgeAtDeath = AgeAtDeath 65u\n",
" CauseOfDeath = ``Due to Cancer``\n",
" }\n",
"}\n",
"\n",
"let alivePatient = {\n",
" PatientId = PatientId \"A000002\"\n",
" VitalStatus = Alive\n",
"}\n",
"\n",
"[ deadPatient; alivePatient ]"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 2,
"data": {
"text/html": "<table><thead><tr><th><i>index</i></th><th>PatientId</th><th>VitalStatus</th></tr></thead><tbody><tr><td>0</td><td><div class=\"dni-plaintext\">PatientId &quot;A000001&quot;</div></td><td><div class=\"dni-plaintext\">Dead { YearOfDeath = YearOfDeath 2015u\n AgeAtDeath = AgeAtDeath 65u\n CauseOfDeath = Due to Cancer }</div></td></tr><tr><td>1</td><td><div class=\"dni-plaintext\">PatientId &quot;A000002&quot;</div></td><td><div class=\"dni-plaintext\">Alive</div></td></tr></tbody></table>"
},
"metadata": {}
}
],
"execution_count": 2,
"metadata": {
"dotnet_interactive": {
"language": "fsharp"
},
"execution": {
"iopub.status.busy": "2021-07-29T21:05:27.180Z",
"iopub.execute_input": "2021-07-29T21:05:27.188Z",
"iopub.status.idle": "2021-07-29T21:05:27.627Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Validating Inputs\n",
"\n",
"Each row in the vital statuses csv represents an input to be parsed and validated. If the row is valid, it'll be turned into patient vital status. If the row is invalid, a validation error for the row will be returned.\n",
"\n",
"Below are examples of rows found within the csv. The first few rows show examples of valid patient vital statuses. The last few rows show examples of invalid patient vital statuses."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"// Headers: Patient ID, Vital Status, Year of Death, Age at Death, Cause of Death\n",
"let vitalStatusRows = [\n",
" // Valid inputs\n",
" [\"A000001\"; \"Dead\"; \"2017\"; \"60\"; \"Due to Cancer\"] // patient died at the age of 60.0 due to cancer\n",
" [\"A000002\"; \"Alive\"; \"\"; \"\"; \"\"] // patient is still alive\n",
" [\"A000003\"; \"Lost to follow-up\"; \"\"; \"\"; \"\"] // patient lost to follow-up\n",
"\n",
" // Invalid inputs\n",
" [\"A000004\"; \"Alive\"; \"2015\"; \"45\"; \"Due to Cancer\"] // patient is alive yet has a year, age, and cause of death\n",
" [\"A000005\"; \"Dead\"; \"\"; \"\"; \"\"] // patient dead yet has no year, age, or cause of death\n",
"]\n",
"\n",
"vitalStatusRows"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 3,
"data": {
"text/html": "<table><thead><tr><th><i>index</i></th><th>value</th></tr></thead><tbody><tr><td>0</td><td><div class=\"dni-plaintext\">[ A000001, Dead, 2017, 60, Due to Cancer ]</div></td></tr><tr><td>1</td><td><div class=\"dni-plaintext\">[ A000002, Alive, , , ]</div></td></tr><tr><td>2</td><td><div class=\"dni-plaintext\">[ A000003, Lost to follow-up, , , ]</div></td></tr><tr><td>3</td><td><div class=\"dni-plaintext\">[ A000004, Alive, 2015, 45, Due to Cancer ]</div></td></tr><tr><td>4</td><td><div class=\"dni-plaintext\">[ A000005, Dead, , , ]</div></td></tr></tbody></table>"
},
"metadata": {}
}
],
"execution_count": 3,
"metadata": {
"dotnet_interactive": {
"language": "fsharp"
},
"execution": {
"iopub.status.busy": "2021-07-29T21:05:27.638Z",
"iopub.execute_input": "2021-07-29T21:05:27.644Z",
"iopub.status.idle": "2021-07-29T21:05:27.662Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"Each vital status row can't (and shouldn't) be read directly into a `PatientVitalStatus`. Again, each row must first be validated and only if the row is valid should it be turned into a `PatientVitalStatus`. If the row is not valid, a validation error must be returned.\n",
"\n",
"Below is an example a type I'd use to represent each row in the csv... the unvalidated input for a patient vital status."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"type PatientVitalStatusInput = {\n",
" PatientId: string\n",
" Status: string\n",
" AgeAtDeath: string\n",
" YearOfDeath: string\n",
" CauseOfDeath: string\n",
"}\n",
"\n",
"let patientVitalStatusInputs =\n",
" vitalStatusRows\n",
" |> List.map (fun [patientId; vitalStatus; yearOfDeath; ageAtDeath; causeOfDeath] ->\n",
" {\n",
" PatientId = patientId\n",
" Status = vitalStatus\n",
" YearOfDeath = yearOfDeath\n",
" AgeAtDeath = ageAtDeath\n",
" CauseOfDeath = causeOfDeath\n",
" }\n",
" )\n",
"\n",
"patientVitalStatusInputs"
],
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"\n",
"input.fsx (11,22)-(11,85) typecheck warning Incomplete pattern matches on this expression. For example, the value '[_;_;_;_;_;_]' may indicate a case not covered by the pattern(s).\n",
"\n"
]
},
{
"output_type": "execute_result",
"execution_count": 4,
"data": {
"text/html": "<table><thead><tr><th><i>index</i></th><th>PatientId</th><th>Status</th><th>AgeAtDeath</th><th>YearOfDeath</th><th>CauseOfDeath</th></tr></thead><tbody><tr><td>0</td><td><div class=\"dni-plaintext\">A000001</div></td><td><div class=\"dni-plaintext\">Dead</div></td><td><div class=\"dni-plaintext\">60</div></td><td><div class=\"dni-plaintext\">2017</div></td><td><div class=\"dni-plaintext\">Due to Cancer</div></td></tr><tr><td>1</td><td><div class=\"dni-plaintext\">A000002</div></td><td><div class=\"dni-plaintext\">Alive</div></td><td></td><td></td><td></td></tr><tr><td>2</td><td><div class=\"dni-plaintext\">A000003</div></td><td><div class=\"dni-plaintext\">Lost to follow-up</div></td><td></td><td></td><td></td></tr><tr><td>3</td><td><div class=\"dni-plaintext\">A000004</div></td><td><div class=\"dni-plaintext\">Alive</div></td><td><div class=\"dni-plaintext\">45</div></td><td><div class=\"dni-plaintext\">2015</div></td><td><div class=\"dni-plaintext\">Due to Cancer</div></td></tr><tr><td>4</td><td><div class=\"dni-plaintext\">A000005</div></td><td><div class=\"dni-plaintext\">Dead</div></td><td></td><td></td><td></td></tr></tbody></table>"
},
"metadata": {}
}
],
"execution_count": 4,
"metadata": {
"dotnet_interactive": {
"language": "fsharp"
},
"execution": {
"iopub.status.busy": "2021-07-29T21:05:27.672Z",
"iopub.execute_input": "2021-07-29T21:05:27.678Z",
"iopub.status.idle": "2021-07-29T21:05:27.917Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"### Validating one input\n",
"\n",
"Now that we have unvalidated inputs, we must write our validation business logic in the form of code. Each row in our csv has at least one attribute, all stringly typed. For now, we'll concentrate only one attribute, the `AgeAtDeath`. Below is code to validate that attribute, include a helper function to try to parse unsigned integers."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"module UnsignedInteger =\n",
" /// Try to parse strings into unsigned integers\n",
" ///\n",
" /// tryParse \"64\" = Some 64u\n",
" /// tryParse \"foo\" = None\n",
" let tryParse (input: string) : uint option =\n",
" match System.UInt32.TryParse(input) with\n",
" | (true, num) -> Some num\n",
" | _ -> None\n",
"\n",
"module AgeAtDeath =\n",
" // Validate that age of death is a parseable, unsigned integer.\n",
" //\n",
" // validate \"64\" = Ok (AgeAtDath 64u)\n",
" // validate \"foo\" = Error \"Not a valid age of death: foo\"\n",
" let validate (ageAtDeath: string) : Result<AgeAtDeath, string> =\n",
" match UnsignedInteger.tryParse ageAtDeath with\n",
" | Some age -> Ok (AgeAtDeath age)\n",
" | _ -> Error $\"Not a valid age of death: {ageAtDeath}\"\n",
"\n",
"// iterate through list and validate each element\n",
"[ \"64\"; \"foo\" ] |> List.map AgeAtDeath.validate"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 5,
"data": {
"text/html": "<table><thead><tr><th><i>index</i></th><th>ResultValue</th><th>ErrorValue</th></tr></thead><tbody><tr><td>0</td><td><div class=\"dni-plaintext\">AgeAtDeath 64u</div></td><td><div class=\"dni-plaintext\">&lt;null&gt;</div></td></tr><tr><td>1</td><td><div class=\"dni-plaintext\">&lt;null&gt;</div></td><td><div class=\"dni-plaintext\">Not a valid age of death: foo</div></td></tr></tbody></table>"
},
"metadata": {}
}
],
"execution_count": 5,
"metadata": {
"dotnet_interactive": {
"language": "fsharp"
},
"execution": {
"iopub.status.busy": "2021-07-29T21:05:27.929Z",
"iopub.execute_input": "2021-07-29T21:05:27.934Z",
"iopub.status.idle": "2021-07-29T21:05:28.198Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"The `UnsignedInteger.tryParse` function takes in a string and returns an optional unsigned integer. If the string is a parseable unsigned integer, it will return the parsed number wrapped in the `Some` type. If the number is not a parseable unsigned integer, it will return `None`.\n",
"\n",
"The `AgeAtDeath.validate` function takes in a string, attempts to parse the string into an unsigned integer, and, if successful, creates an `AgeAtDeath` with the newly parsed age.\n",
"\n",
"Since '64' is parseable as an integer, it ends up becoming a valid `AgeAtDeath`. Since 'foo' is not parseable as an integer, it ends up becoming an error message."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"### Composing Validations\n",
"\n",
"A valid `AgeAtDeath` isn't the only valid attribute needed for a `VitalStatus`. A patient with a `VitalStatus` of `Alive` but with a valid `AgeAtDeath` is still invalid overall. Each row's attributes must be looked at together as a whole to determine if the row is a valid vital status. Below are some example validations, written in the form of `Active Expressions`, and the `PatientVitalStatus.validate` function that uses those active expressions in a match expression."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"let (|AliveStatus|DeadStatus|LostStatus|InvalidStatus|) input =\n",
" match input with\n",
" | \"Alive\" -> AliveStatus\n",
" | \"Dead\" -> DeadStatus\n",
" | \"Lost to follow-up\" -> LostStatus\n",
" | _ -> InvalidStatus\n",
"\n",
"let (|ValidAgeAtDeath|_|) = UnsignedInteger.tryParse >> Option.map AgeAtDeath\n",
"let (|ValidYearOfDeath|_|) = UnsignedInteger.tryParse >> Option.map YearOfDeath\n",
"let (|ValidCauseOfDeath|_|) causeOfDeath =\n",
" match causeOfDeath with\n",
" | \"Due to Cancer\" -> Some ``Due to Cancer``\n",
" | \"Due to Other Cause\" -> Some ``Due to Other Cause``\n",
" | \"Probably Due to Cancer\" -> Some ``Probably Due to Cancer``\n",
" | \"Unknown Cause\" -> Some ``Unknown Cause``\n",
" | \"Unlikely Due to Cancer\" -> Some ``Unlikely Due to Cancer``\n",
" | _ -> None\n",
"\n",
"module PatientVitalStatus =\n",
" let validate (input: PatientVitalStatusInput) =\n",
" match (input.Status, input.AgeAtDeath, input.YearOfDeath, input.CauseOfDeath) with\n",
" // Valid vital status: patient is dead with a valid age at death, year of death, and cause of death\n",
" | (DeadStatus, ValidAgeAtDeath age, ValidYearOfDeath year, ValidCauseOfDeath cause) ->\n",
" Ok {\n",
" PatientId = PatientId input.PatientId\n",
" VitalStatus = Dead {\n",
" AgeAtDeath = age\n",
" YearOfDeath = year\n",
" CauseOfDeath = cause\n",
" }\n",
" }\n",
" // Valid vital status: patient is alive with blank age at death, year of death, and cause of death\n",
" | (AliveStatus, \"\", \"\", \"\") ->\n",
" Ok {\n",
" PatientId = PatientId input.PatientId\n",
" VitalStatus = Alive\n",
" }\n",
" // Valid vital status: patient is lost to follow-up with blank age at death, year of death, and cause of death\n",
" | (LostStatus, \"\", \"\", \"\") ->\n",
" Ok {\n",
" PatientId = PatientId input.PatientId\n",
" VitalStatus = ``Lost to follow-up``\n",
" }\n",
" // Invalid vital status\n",
" | _ ->\n",
" Error $\"Invalid vital status: {input}\"\n",
"\n",
"patientVitalStatusInputs\n",
"|> List.map PatientVitalStatus.validate"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 6,
"data": {
"text/html": "<table><thead><tr><th><i>index</i></th><th>ResultValue</th><th>ErrorValue</th></tr></thead><tbody><tr><td>0</td><td><div class=\"dni-plaintext\">{ PatientId = PatientId &quot;A000001&quot;\n VitalStatus = Dead { YearOfDeath = YearOfDeath 2017u\n AgeAtDeath = AgeAtDeath 60u\n CauseOfDeath = Due to Cancer } }</div></td><td><div class=\"dni-plaintext\">&lt;null&gt;</div></td></tr><tr><td>1</td><td><div class=\"dni-plaintext\">{ PatientId = PatientId &quot;A000002&quot;\n VitalStatus = Alive }</div></td><td><div class=\"dni-plaintext\">&lt;null&gt;</div></td></tr><tr><td>2</td><td><div class=\"dni-plaintext\">{ PatientId = PatientId &quot;A000003&quot;\n VitalStatus = Lost to follow-up }</div></td><td><div class=\"dni-plaintext\">&lt;null&gt;</div></td></tr><tr><td>3</td><td><div class=\"dni-plaintext\">&lt;null&gt;</div></td><td><div class=\"dni-plaintext\">Invalid vital status: { PatientId = &quot;A000004&quot;\n Status = &quot;Alive&quot;\n AgeAtDeath = &quot;45&quot;\n YearOfDeath = &quot;2015&quot;\n CauseOfDeath = &quot;Due to Cancer&quot; }</div></td></tr><tr><td>4</td><td><div class=\"dni-plaintext\">&lt;null&gt;</div></td><td><div class=\"dni-plaintext\">Invalid vital status: { PatientId = &quot;A000005&quot;\n Status = &quot;Dead&quot;\n AgeAtDeath = &quot;&quot;\n YearOfDeath = &quot;&quot;\n CauseOfDeath = &quot;&quot; }</div></td></tr></tbody></table>"
},
"metadata": {}
}
],
"execution_count": 6,
"metadata": {
"dotnet_interactive": {
"language": "fsharp"
},
"execution": {
"iopub.status.busy": "2021-07-29T21:05:28.208Z",
"iopub.execute_input": "2021-07-29T21:05:28.214Z",
"iopub.status.idle": "2021-07-29T21:05:28.279Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Conclusion\n",
"\n",
"F# is a great language for everyday business tasks, like parsing and validating rows in a CSV. In this post, we saw a brief example of how I use F# to read in patient vital statuses and validate each status.\n",
"\n",
"In my next post, I'll address how I read in XML and JSON files (F# XML Provider and `Thoth.Json.Net` respectively), as well as how I use the awesome `FsToolkit.Errorhandling` library to compose validations together."
],
"metadata": {}
}
],
"metadata": {
"kernelspec": {
"display_name": ".NET (C#)",
"language": "C#",
"name": ".net-csharp"
},
"language_info": {
"name": "C#",
"version": "4.5",
"mimetype": "text/x-fsharp",
"file_extension": ".fs",
"pygments_lexer": "fsharp"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment