@jmcastagnetto
Last active April 25, 2024 14:22
Example showing the truncation in rio::import() with malformed CSV files
library(rio)
packageVersion("rio")
# [1] ‘1.0.1’
d1 <- import("datos_abiertos_vigilancia_dengue.csv")
# Warning message:
# In (function (input = "", file = NULL, text = NULL, cmd = NULL, :
# Stopped early on line 87871. Expected 14 fields but found 16. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<PIURA,TALARA,PARIѐAS,ENACE I\,II\,III,DENGUE SIN SEÑALES DE ALARMA,2009,9,A97.0,31,200701,2007010008,23,A,F>>
nrow(d1)
# [1] 87869
library(readr)
packageVersion("readr")
# [1] ‘2.1.5’
d2 <- read_csv("datos_abiertos_vigilancia_dengue.csv")
# Rows: 501692 Columns: 14
# ── Column specification ─────────────────────────────────────────────────────────────
# Delimiter: ","
# chr (11): departamento, provincia, distrito, localidad, enfermedad, diagnostic, d...
# dbl (3): ano, semana, edad
#
# ℹ Use `spec()` to retrieve the full column specification for this data.
# ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Warning message:
# One or more parsing issues, call `problems()` on your data frame for details, e.g.:
# dat <- vroom(...)
# problems(dat)
nrow(d2)
# [1] 501692
@jmcastagnetto
Author

I tested that on my Linux and Windows partitions, using R 4.3.3 in both OSes and the same versions of the packages. The error is the same in all cases.

@schochastics

This shows the parsing errors reported by readr:

file <- "datos_abiertos_vigilancia_dengue.csv"
a <- rio::import(file,fill=TRUE)
#> Warning in (function (input = "", file = NULL, text = NULL, cmd = NULL, :
#> Stopped early on line 87871. Expected 14 fields but found 16. Consider
#> fill=TRUE and comment.char=. First discarded non-empty line:
#> <<PIURA,TALARA,PARIѐAS,ENACE I\,II\,III,DENGUE SIN SEÑALES DE
#> ALARMA,2009,9,A97.0,31,200701,2007010008,23,A,F>>
b <- readr::read_csv(file)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> Rows: 501692 Columns: 14
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (11): departamento, provincia, distrito, localidad, enfermedad, diagnost...
#> dbl  (3): ano, semana, edad
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
b[87869:87871, ]
#> # A tibble: 3 × 14
#>   departamento provincia distrito localidad   enfermedad   ano semana diagnostic
#>   <chr>        <chr>     <chr>    <chr>       <chr>      <dbl>  <dbl> <chr>     
#> 1 PIURA        TALARA    PARIѐAS  "TALARA"    "DENGUE S…  2009      9 A97.0     
#> 2 PIURA        TALARA    PARIѐAS  "ENACE I\\" "II\\"        NA     NA 2009      
#> 3 PIURA        TALARA    PARIѐAS  "TALARA"    "DENGUE S…  2009      9 A97.0     
#> # ℹ 6 more variables: diresa <chr>, ubigeo <chr>, localcod <chr>, edad <dbl>,
#> #   tipo_edad <chr>, sexo <chr>

Created on 2024-04-25 with reprex v2.1.0
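
To find every line that has an unexpected number of fields, and not only the first one reported in the warnings, one option is to count the delimiters per line in base R. The following is a minimal sketch under stated assumptions: the four sample lines are hypothetical stand-ins for the real file (written in ASCII for portability), and the count ignores quoting, so it only makes sense for files like this one that do not quote their fields.

```r
# Hypothetical sample standing in for the real file (ASCII-only for portability).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "distrito,localidad",
  "PARINAS,TALARA",
  "PARINAS,ENACE I\\,II\\,III",
  "PARINAS,TALARA"
), tmp)

lines <- readLines(tmp)

# Fields per line = number of commas + 1 (this ignores quoting and escaping).
n_fields <- lengths(regmatches(lines, gregexpr(",", lines, fixed = TRUE))) + 1L

# Treat the header's field count as the expected one.
bad <- which(n_fields != n_fields[1])

bad         # line numbers with an unexpected number of fields
lines[bad]  # the offending lines themselves
```

On the real file, `tmp` would be replaced by the path to the CSV; `bad` then lists the line numbers to inspect.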

@chainsawriot

chainsawriot commented Apr 25, 2024

@jmcastagnetto If readr::read_csv() behaves the way you want, then use it. As stated in fread()'s help file: "fread is for regular delimited files; i.e., where every row has the same number of columns." This is real-life (bad) data after all, and teaching students to try different tools is not a bad idea. But as @schochastics said, one can argue about whether readr::read_csv() is actually doing the right thing.

Perhaps you already know the issue, but I'll reiterate it anyway. To extract the essence of the problem:

data.table::fread("distrito,localidad\nPARIѐAS,TALARA\nPARIѐAS,PARIѐAS,ENACE I\\,II\\,III\nPARIѐAS,TALARA\n")

The file uses "," as the separator, but at the same time uses "\," (backslash-comma) to try to escape "," in cases like "ENACE I,II,III". AFAIK, fread's sep parameter takes a single character rather than a regex, so there is no way to tell it "a comma, but not a backslash-comma".
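
For what it's worth, outside of fread the "comma, but not backslash-comma" rule can be written as a negative lookbehind when the regex engine supports it, as base R does with `perl = TRUE`. A minimal sketch, assuming a hypothetical ASCII line modeled on the problematic row:

```r
# Hypothetical line modeled on the problematic row (ASCII-only for portability).
line <- "PARINAS,ENACE I\\,II\\,III,2009"

# Split on commas that are NOT preceded by a backslash (negative lookbehind).
fields <- strsplit(line, "(?<!\\\\),", perl = TRUE)[[1]]

# Drop the escaping backslash so the field reads naturally.
fields <- gsub("\\,", ",", fields, fixed = TRUE)

fields
```

This only illustrates the splitting rule on one line. Note that `strsplit()` drops a trailing empty field, so it is not a drop-in CSV parser.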

What I would do usually is to just apply some standard Unix things:

grep '\\,' datos_abiertos_vigilancia_dengue.csv
## replace \, with space
sed -i 's/\\,/ /g' datos_abiertos_vigilancia_dengue.csv
Rscript -e "nrow(rio::import('datos_abiertos_vigilancia_dengue.csv'))"

I think this is perhaps the "correct" way; at least it does not hide the problem the way readr::read_csv()'s default does. But I would also agree that it is difficult to teach beginners to do these things. It's also OS-dependent.

If one prefers a pure R solution that uses nothing but rio and base functions:

path <- "datos_abiertos_vigilancia_dengue.csv"
## basically readLines, but faster
raw <- rio::import(path, sep = "", header = FALSE)[[1]]

raw[87871] ## show the students what the problem is

mraw <- gsub("\\\\,", " ", raw)
mraw[87871] ## how it solved the problem

writeLines(mraw, "modified_data.csv")

nrow(rio::import("modified_data.csv"))
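
Replacing `\,` with a space changes the stored value (the locality becomes "ENACE I II III"). If the commas matter, one possible variant is to mask the escaped commas with a placeholder, parse, and then restore them. This is only a sketch using base R: the sample lines are hypothetical (ASCII-only), and the placeholder string `@@COMMA@@` is an assumption that must not occur anywhere in the data, which the code checks before going on.

```r
# Hypothetical sample lines standing in for the real file's contents.
lines <- c(
  "distrito,localidad,ano",
  "PARINAS,TALARA,2009",
  "PARINAS,ENACE I\\,II\\,III,2009"
)

# Assumed placeholder; abort if it already appears in the data.
placeholder <- "@@COMMA@@"
stopifnot(!any(grepl(placeholder, lines, fixed = TRUE)))

# Mask every backslash-comma so the remaining commas are all real separators.
masked <- gsub("\\,", placeholder, lines, fixed = TRUE)

# Parse the now-regular CSV text.
df <- read.csv(text = masked, stringsAsFactors = FALSE)

# Put a plain comma back wherever the placeholder ended up (character columns only).
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(x) gsub(placeholder, ",", x, fixed = TRUE))

df$localidad
```

With the real file, `lines` would come from reading the file line by line (as `raw` does above), and the masked lines could be written out with `writeLines()` and imported with the reader of choice before restoring the commas.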

@jmcastagnetto
Author

jmcastagnetto commented Apr 25, 2024

Thanks for your comments @chainsawriot. In the thread https://masto.machlis.com/@smach/112325794467872813 I mentioned that {rio} has this specific requirement for well-formed input data. So even though it provides a very uniform API for reading, this requirement is counterproductive when teaching beginners in R (and programming), in particular if the participants are using real-life data (warts and all). The instructors have to spend time explaining why the issues appear and why {rio} cannot be used, even though the aim was to teach participants how to use {rio} to easily read any input data they encounter.

This happened to us when we were teaching people who work in the analysis of health surveillance data. Previously we had used readr::read_csv(), vroom::vroom(), base read.csv(), etc., but some of the instructors wanted to make things easier for the participants, so we changed all the material to use rio::import(). The session went fine when showing an example with "clean" data; the problems started when participants had to practice in class with the data they brought from their daily work. We had to hold an extra session to explain the issues and the workarounds.

Bottom line: it distracted from the aim of teaching how to get data into R, and was confusing to the participants.
