Skip to content

Instantly share code, notes, and snippets.

@dannguyen
Last active August 27, 2020 07:00
Show Gist options
  • Save dannguyen/376c69380f99d1eeca7ad6e890f96259 to your computer and use it in GitHub Desktop.
Save dannguyen/376c69380f99d1eeca7ad6e890f96259 to your computer and use it in GitHub Desktop.
How to install and use xsv to split a large CSV file (Windows)

How to use xsv (in Windows) to split up a CSV file too big for Excel

I wrote these instructions on how to install and use xsv – a powerful CSV-handling command-line tool, because someone asked how to deal with a data file that was too big to open in Excel or even Notepad. I didn't know how familiar the person was with installing/running downloadable .exe files or with Powershell, so I've tried to include some general instructions that hopefully are useful to even novices.

This mini-guide is not at all meant to be exhaustive as it basically shows just one of xsv's many useful functions. But if you're new to the idea of using command-line tools to do things, hopefully this can be a friendly intro to it.


Here's an example of a CSV that, at 3 million rows, is too big for Excel to open: https://burntsushi.net/stuff/worldcitiespop.csv

If you want to skip the tedious/convoluted setup info, jump to the section titled Using xsv split to see the easy xsv one-liner.

Or to save you a click, here it is:

.\xsv.exe split -s 900000 outputdir worldcitiespop.csv

Installation instructions

Caveat: if these instructions seem dumb, it's because I'm dumb and haven't really spent a lot of time learning Win10 beyond the bare minimum to do what little I do in Windows. Obviously do what you know to be better

Go to the xsv releases/downloads page to find a list of zipped, downloadable binaries:

https://github.com/BurntSushi/xsv/releases/tag/0.13.0

I'm not sure what the differences are between the different versions, but I picked the first pc-windows binary – xsv-0.13.0-i686-pc-windows-gnu.zip, and it worked just fine on my macOS bootcamp install of Windows 10 Home edition:

https://github.com/BurntSushi/xsv/releases/download/0.13.0/xsv-0.13.0-i686-pc-windows-gnu.zip

Simple "installation" of xsv.exe for the purposes of this example

Download and open the zip file. You'll find that it contains a single executable file named xsv.exe. Extract that into your Downloads folder.

Note: I actually recommend making a subfolder in your Downloads folder to do things in. For the example and screenshots I show in this guide, I made a subfolder with path Downloads\foo and put xsv.exe and the example CSV file into it.

The rest of this guide assumes you too have a folder named Downloads\foo, so substitute accordingly

Better/alternative installation process

The proper way to install xsv.exe so that it's easily usable on your system is to put it in your system path, e.g. C:\Windows\System32. I'll assume you don't know or care about doing that for now, but you can read more info about it on this StackOverflow answer

The xsv homepage also mentions the possibility of installing cargo (Rust's package manager), which then makes installing xsv as easy as running this Powershell command:

cargo install xsv

(I have no idea how easy it is to install cargo on Windows, but it works pretty well on the Mac side!)

For the purposes of this guide, I'll assume you've installed xsv the "simple" (i.e. limited) way.

Use Powershell to run xsv.exe

So double-clicking xsv.exe to run it won't work, as xsv is a command-line application, which means you need to run it from a command line interface (i.e CLI). Assuming you're on a relatively modern version of Windows, you should have a built-in CLI called Powershell.

The easiest way to get to Powershell is to open your Start Menu and start typing powersh in the search bar.

Opening it should result in a window and text-based interface that looks like this:

image

Assuming you installed xsv.exe the "simple" way, and xsv.exe is in Downloads\foo (or whatever you named your subfolder), you now have to change to that directory in Powershell, i.e. make Downloads\foo your "working directory". Do this via the cd (i.e. "change directory") command.

This should work:

cd ~\Downloads\foo

image

Now finally we can run xsv.exe:

.\xsv.exe

And you should get output that looks like:

image

If that's what you see, that means xsv.exe "works" on your machine, and now you can use it to split your data file.

Using xsv's subcommands

For the purposes of this example, let's use a publicly available large CSV file. Click and download the following file into your working directory alongside xsv.exe – e.g. in Downloads\foo

https://burntsushi.net/stuff/worldcitiespop.csv (150MB/3 million rows)

(I'm assuming the file was downloaded and saved as worldcitiespop.csv)

As we saw when invoking .\xsv.exe, xsv.exe doesn't do anything when run by itself, other than print out a big help message. In general, it expects to be run with a subcommand – e.g. something like count, headers, split – followed by the name of an input file, e.g. worldcitiespop.csv

Trying out count and headers

So, running the count subcommand like this:

.\xsv.exe count worldcitiespop.csv

Will result in xsv printing out the total number of rows:

image

The following use of headers will list the column names (i.e. the first row):

.\xsv.exe headers worldcitiespop.csv

image

count and headers are simple subcommands to inspect a data file. You don't need to run them before running split, but it's a good idea to, as it'll at least verify you have a properly formatted CSV file.

Using xsv split

Without getting into the details of xsv's options, or how command-line things work in general, it's enough to say that the split subcommand at a minimum requires a few extra parameters from us, namely:

  • the number of records per split file
  • a subdirectory to dump the split files (the original input file, e.g. worldcitiespop.csv, will be unaltered)

You can read the details in the xsv split documentation. But it's probably easy enough to show you an example command that will just work:

.\xsv.exe split -s 900000 outputdir worldcitiespop.csv

The above command tells xsv to split worldcitiespop.csv file into smaller files – each no bigger than 900,000 records – and to put those smaller files into a new subdirectory named outputdir.

If it works (and it took less than a second to complete on my machine), this is what the relevant files in your working directory (e.g. Downloads\foo) should look like:

image

And this is what that outputdir subfolder should contain: 4 smaller CSV files, each named after the record number that the file begins with:

outputdir list

In Excel, you'll find that each of those files has the exact same header and structure:

image

The end

And that's the minimum info needed on how to use xsv to split a big file into smaller chunks. However, that's only one of xsv's very useful features. And if you're a relatively experienced data wrangler, you'll find that xsv has a great list of features that can likely replace a lot of common tasks that you've been doing in clunky SQL/pandas/R.

-30-

(ignore the list of full size images in this Gistfile below, I have no idea if I can just delete them without deleting them from the actual guide text)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment