@jmcastagnetto
Last active April 25, 2024 14:22
Example showing the truncation in rio::import() with malformed CSV files
library(rio)
packageVersion("rio")
# [1] ‘1.0.1’
d1 <- import("datos_abiertos_vigilancia_dengue.csv")
# Warning message:
# In (function (input = "", file = NULL, text = NULL, cmd = NULL, :
# Stopped early on line 87871. Expected 14 fields but found 16. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<PIURA,TALARA,PARIѐAS,ENACE I\,II\,III,DENGUE SIN SEÑALES DE ALARMA,2009,9,A97.0,31,200701,2007010008,23,A,F>>
nrow(d1)
# [1] 87869
library(readr)
packageVersion("readr")
# [1] ‘2.1.5’
d2 <- read_csv("datos_abiertos_vigilancia_dengue.csv")
# Rows: 501692 Columns: 14
# ── Column specification ─────────────────────────────────────────────────────────────
# Delimiter: ","
# chr (11): departamento, provincia, distrito, localidad, enfermedad, diagnostic, d...
# dbl (3): ano, semana, edad
#
# ℹ Use `spec()` to retrieve the full column specification for this data.
# ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Warning message:
# One or more parsing issues, call `problems()` on your data frame for details, e.g.:
# dat <- vroom(...)
# problems(dat)
nrow(d2)
# [1] 501692

jmcastagnetto commented Apr 25, 2024

Thanks for your comments, @chainsawriot. In the thread https://masto.machlis.com/@smach/112325794467872813 I mentioned that {rio} requires well-formed input data. It provides a very uniform API for reading, but when it is used to teach beginners in R (and programming), this requirement is counterproductive, particularly if the participants are using real-life data (warts and all). The instructors end up spending time explaining why the issues appear and why {rio} cannot be used, even though the aim was to teach them how to use {rio} to easily read any input data they will encounter.

This happened to us when we were teaching people who work on the analysis of health surveillance data. Previously we had used readr::read_csv(), vroom::vroom(), base read.csv(), etc., but some of the instructors wanted to make things easier for the participants, so we changed all the material to use rio::import(). The session went fine when showing an example with "clean" data; the problems started when the participants had to practice in class with the data they brought from their daily work. We had to hold an extra session to explain the issues and the workarounds.
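
One way to show participants where a file is malformed before importing it is to count the fields on each line. The sketch below is only an illustration under stated assumptions: the four-line CSV is a hypothetical miniature of the dengue file (one row with an unquoted comma inside a field), and it uses base R's `count.fields()` rather than anything from {rio} or {readr}.

```r
# Hypothetical miniature of the problem: line 3 has an unquoted comma
# inside a field, so it splits into 4 fields instead of 3.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "departamento,localidad,ano",
  "PIURA,TALARA,2009",
  "PIURA,ENACE I,II,2009",
  "LIMA,LIMA,2010"
), tmp)

# Count the fields on every line (base R, package "utils").
# Only double quotes are treated as quoting, and "#" is not a comment.
n_fields <- count.fields(tmp, sep = ",", quote = "\"", comment.char = "")

# Lines whose field count differs from the header line.
# Blank lines are skipped by default, so indices assume there are none.
bad_lines <- which(n_fields != n_fields[1])
bad_lines

# Show the raw text of the offending lines for manual inspection.
readLines(tmp)[bad_lines]
```

Run against a real file, the same two steps give a list of line numbers to inspect or repair before choosing a reader.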

Bottom line: it distracted from the aim of teaching how to get data into R, and was confusing to the participants.
