@knbknb
Forked from chasemc/skimr.r
Last active September 22, 2022 10:26
R: data.table::fread's ability to use unix-cmdtool to "stream in" data
# see: https://www.youtube.com/watch?v=RYhwZW6ofbI&t=6s
# R tip #3: use pipe connections
# by Jim Hester
library(data.table) # for files < ~10GB
library(skimr) # another summary()
# filter a file: keep only lines containing the word UNK
# base R only, no third-party dependencies
# header = FALSE because grep drops the header line unless it also matches
temp3 <- read.csv(pipe("grep -w UNK bigfile.csv"), header = FALSE)
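# Sketch (not part of the original gist): one possible way to keep the header
# row when filtering with grep is to emit line 1 separately with head.
# Assumptions: a Unix-like shell with head and grep on the PATH; the demo file,
# its columns, and the demo_* names are hypothetical and only for illustration.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("id,code", "1,UNK", "2,ABC", "3,UNK"), demo_file)
# shQuote() protects the temporary path from the shell
demo_cmd <- sprintf(
  "(head -n 1 %s; grep -w UNK %s)",
  shQuote(demo_file), shQuote(demo_file)
)
# the header line comes first, so the default header = TRUE applies
demo <- read.csv(pipe(demo_cmd))
unlink(demo_file)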
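# prefilter a 30-column table, streamed directly from the internet
# extremely fast: head ends the pipeline after the first 20 lines
# cmd = makes it explicit that the string is a shell command, not a file name
temp2 <- fread(
  cmd = "curl -s https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt | head -n 20",
  sep = "\t",
  header = TRUE
)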
skim(temp2)
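# same source, base R only (no third-party dependencies)
# slow: reads the entire file, with no prefiltering
# the file is tab-separated, so sep = "\t" is needed
temp4 <- read.csv(
  url("https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt"),
  sep = "\t",
  header = TRUE
)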
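# base R with a shell prefilter: perl keeps line 1 and every line
# matching Arabidopsis (case-insensitive)
temp5 <- read.csv(
  pipe("curl -s https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt |
    perl -MEnglish -ne 'print if $INPUT_LINE_NUMBER == 1 || /Arabidopsis/i'"),
  sep = "\t",
  header = TRUE
)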
# Help / Documentation
?base::connections
??pipe
# pipe()