@benmarwick
Last active March 23, 2022 02:29
Convert a folder of text files into a single CSV file with one column for the file names and one column of the text of the file. A function in R.
# test it by creating some small text files to run the function on
txt <- c("here is", "some text", "to test", "this function with", "'including a leading quote", '"and another leading quote')
# make text files
dir.create("testdir")
for (i in seq_along(txt)) {
  writeLines(txt[i], paste0("testdir/outfile-", i, ".txt"))
}
# run the function and then look in the CSV file that is produced.
txt2csv("testdir", "theoutfile")
#' Compiling several text files into a single CSV file
#'
#' Convert a folder of text files into a single CSV file
#' with one column for the file names and one column of the
#' text of the file. A function in R.
#'
#' To use this function for the first time, run this next line:
#' install.packages("devtools")
#' Thereafter you just need to load the function
#' from GitHub like so, with these two lines:
#' library(devtools) # Windows users need Rtools installed, Mac users need Xcode installed
#' source_url("https://gist.github.com/benmarwick/9265414/raw/text2csv.R")
#'
#' Here's how to set the arguments to the function:
#'
#' mydir is the full path of the folder that contains your txt files,
#' for example "C:/Downloads/mytextfiles". Note that it must have
#' quote marks around it and use forward slashes, which are not the
#' default on Windows.
#'
#' mycsvfilename is the name that you want your CSV file to
#' have. It must have quote marks around it, but must not
#' include the .csv extension at the end.
#'
#' A full example, assuming you've sourced the
#' function from github already:
#'
#' txt2csv("C:/Downloads/mytextfiles", "mybigcsvfile")
#'
#' and after a moment you'll get a message in the R console
#' saying 'Your CSV file is called mybigcsvfile.csv and
#' can be found in C:/Downloads/mytextfiles'
txt2csv <- function(mydir, mycsvfilename){
  starting_dir <- getwd()
  # Get the names of all the txt files (and only txt files).
  # pattern is a regular expression, so anchor it to the .txt extension
  myfiles <- list.files(mydir, full.names = TRUE, pattern = "\\.txt$")
  # Read the actual contents of the text files into R and rearrange a little.
  # create a list of character vectors, one vector of lines per file
  mytxts <- lapply(myfiles, readLines)
  # collapse the lines of each file into a single string to make one
  # character vector where each item in the vector
  # is a single text file
  mytxts1lines <- vapply(mytxts, paste, FUN.VALUE = character(1), collapse = " ")
  # make a dataframe with the file names and texts
  mytxtsdf <- data.frame(filename = basename(myfiles), # just use filename as text identifier
                         fulltext = mytxts1lines) # full text character vectors in col 2
  # Now write them all into a single CSV file, one txt file per row
  setwd(mydir) # make sure the CSV goes into the dir where the txt files are
  # write the CSV file...
  write.table(mytxtsdf, file = paste0(mycsvfilename, ".csv"), sep = ",", row.names = FALSE, col.names = FALSE, quote = FALSE)
  # now check your folder to see the csv file
  message(paste0("Your CSV file is called ", paste0(mycsvfilename, ".csv"), ' and can be found in ', getwd()))
  # return original working directory
  setwd(starting_dir)
}
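
Because the function writes with `quote = FALSE`, a text that contains a comma or a double quote is written unescaped, and a CSV reader may split or misread that row. The following is a minimal sketch of one alternative, assuming quoted output (and a header row) is acceptable for your downstream tools. `write.csv()` quotes character fields and doubles embedded double quotes by default, so the table should survive a round trip through `read.csv()`. The data frame and the temporary file are made up for illustration and are not part of the gist.

```r
# Assumed example data: texts containing a comma and a leading double quote.
mytxtsdf <- data.frame(
  filename = c("outfile-1.txt", "outfile-2.txt"),
  fulltext = c("some text, with a comma", "\"and another leading quote"),
  stringsAsFactors = FALSE
)

# Temporary output file, used only for this illustration.
csv_path <- tempfile(fileext = ".csv")

# write.csv() quotes character fields and doubles embedded double quotes.
# Unlike the write.table() call in txt2csv, it also writes a header row.
write.csv(mytxtsdf, file = csv_path, row.names = FALSE)

# Read the file back and compare the texts with the originals.
roundtrip <- read.csv(csv_path, stringsAsFactors = FALSE)
texts_survived <- identical(roundtrip$fulltext, mytxtsdf$fulltext)
```

If `texts_survived` is `TRUE`, the quoting kept each text in a single field.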

Terelet commented May 31, 2018

Update: this worked when running the modified script in the elarkin fork. Extremely useful - Thank you both!

Hi,

this is running for my 3 .txt files, but giving the error message below. My text files have thousands of lines each and I just need line 6 (or even the first 6 lines concatenated) from each file. Is there something I can change in the code to make this work?

Error in data.frame(filename = basename(myfiles), fulltext = mytxts1lines) :
arguments imply differing number of rows: 3, 3548

Thanks, Dom
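
For the case described above, where only line 6 (or the first six lines) of each file is wanted, a minimal sketch follows. It assumes every file has at least six lines. The helper name `read_first_lines`, the demo directory, and the file names are hypothetical and not part of the gist. `readLines()` accepts an `n` argument that caps how many lines are read, so each file yields exactly one string and the two columns end up the same length.

```r
# Hypothetical helper (not part of the original gist): read at most `n`
# lines from one file and collapse them into a single string.
read_first_lines <- function(path, n = 6) {
  lines <- readLines(path, n = n, warn = FALSE)
  paste(lines, collapse = " ")
}

# Build two small demo files in a temporary directory (assumed data).
demo_dir <- file.path(tempdir(), "first-lines-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(paste("line", 1:10), file.path(demo_dir, "a.txt"))
writeLines(paste("row", 1:8), file.path(demo_dir, "b.txt"))

myfiles <- list.files(demo_dir, full.names = TRUE, pattern = "\\.txt$")

# First six lines of each file, concatenated: one string per file.
first_six <- vapply(myfiles, read_first_lines, character(1), USE.NAMES = FALSE)

# Only line 6 of each file (assumes each file has at least six lines).
line_six <- vapply(myfiles,
                   function(f) readLines(f, n = 6, warn = FALSE)[6],
                   character(1), USE.NAMES = FALSE)

# Both columns now have one entry per file, so data.frame() succeeds.
result <- data.frame(filename = basename(myfiles), fulltext = first_six,
                     stringsAsFactors = FALSE)
```

Swapping `first_six` for `line_six` in the `data.frame()` call keeps only the sixth line of each file.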

@alessiolevis

I have the same problem as Terelet.

Error in data.frame(filename = basename(myfiles), fulltext = mytxts1lines) :
arguments imply differing number of rows: 4, 529

Do you have a solution for that?

Thanks for your work.

alessio


HedvigS commented Feb 8, 2019

Thanks, but also I have the same problem as @alessiolevis :/
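
The "differing number of rows" error reported in this thread appears when the per-file results of `readLines()` are flattened with `unlist()`: the result has one element per line, not one per file, so it no longer lines up with the vector of file names. The sketch below reproduces the mismatch with made-up in-memory data (no files are read) and shows one possible remedy, collapsing each file's lines into a single string first. The variable names `flattened`, `one_per_file`, and `filenames` are illustrative only.

```r
# Two in-memory "files" standing in for the output of lapply(myfiles, readLines)
# (assumed example data: the first file has three lines, the second has two).
mytxts <- list(c("first line", "second line", "third line"),
               c("only", "two lines"))
filenames <- c("a.txt", "b.txt")

# unlist() flattens to one element per LINE, so two files give five elements.
# Pairing that with two file names is what makes data.frame() stop.
flattened <- unlist(mytxts)
n_flattened <- length(flattened)

# Collapsing each file's lines first gives one element per FILE.
one_per_file <- vapply(mytxts, paste, FUN.VALUE = character(1), collapse = " ")
n_collapsed <- length(one_per_file)

# The lengths now match, so the data frame has one row per file.
mytxtsdf <- data.frame(filename = filenames, fulltext = one_per_file,
                       stringsAsFactors = FALSE)
```

Joining with a space keeps each text on a single row when the table is later written with one file per line.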
