Skip to content

Instantly share code, notes, and snippets.

@cardasac
Last active August 10, 2020 22:49
Show Gist options
  • Save cardasac/e3f6368381a1853ad360fae766ddc88c to your computer and use it in GitHub Desktop.
Save cardasac/e3f6368381a1853ad360fae766ddc88c to your computer and use it in GitHub Desktop.
Learn in2csv by practice with some simple setup and following the notes.

Instructions

All the links to the csv to practice using in2csv library are located in the url_list.txt.

For simplicity please use the get_files.sh script which downloads the required files for you.

Totally inspired by the course on Datacamp Data Processing in Shell.

#!/bin/bash
#
# Download some files that you can use to practice in2csv commands.
readonly FILE_NAMES=("Section" "Student" "School" "StudentEnrollment" "TeacherRoster" "Teacher")
readonly URL="https://raw.githubusercontent.com/OfficeDev/O365-EDU-Tools/master/CSV Samples/SDS Format/100 Users/"
readonly FOLDER_NAME="data"
#######################################
# Download school related files from Microsoft repository.
# Globals:
# None
# Arguments:
# None
#######################################
download_school_files() {
for val in "${FILE_NAMES[@]}"; do
wget -qbc --limit-rate=2500k -P "$FOLDER_NAME" "$URL/$val.csv"
done
}
#######################################
# Download anything else you would like to download that is placed inside the url_list.txt file.
# Globals:
# None
# Arguments:
# None
#######################################
download_list_files() {
wget -qbc -P "$FOLDER_NAME" -i "url_list.txt"
}
entire_setup() {
download_school_files && download_list_files && find data/
}
if entire_setup; then
printf "\u2714 All your files are downloaded! You are ready to start learning :)\n"
else
printf "\u2718 Something went wrong :(\n"
fi

Data Processing In Shell

curl

  • Short for Client for URLs
  • Is a unix command line tool
  • transfers data to and from a server
  • Used to download data from HTTP(S) sites and FTP servers

man curl to check installation and manual.

curl --help for checking help on specific commands.

-O to save the file with its original name curl -O <url> . -o for new file name curl -o newfilename.txt <url>.

Can use a wildcard * to download multiple files curl -O <url.com/datafilename*.txt>.

Can also using Globbing Parser curl -O https://website.com/filename[001-100].txt. To download the 10th file use curl -O https://website.com/filename[001-100:10].txt.

Preemptive Troubleshooting

curl has two particularly useful option flags in case of timeouts during download:

  • -L redirects HTTP URL if a 300 error code occurs
  • -C resumes a previous file transfer if it times out before completion, use - after the flag to make it automatically assume where to re-download.

Putting everything together curl -O -C -L - https://website.com/filename[001-100].txt.

wget

  • Derives from World Wide Web and get
  • Native to Linux but compatible for all operating systems
  • Used to download data from HTTP(S) and FTP
  • Better than curl at downloading multiple files recursively

Check installation with which wget. Check manual with wget --help.

Unique option flags to wget:

  • -b go to background immediately after startup
  • -q turn off output
  • -c resume broken download (continue getting a partially downloaded file)

wget -bqc https://website.com/file.txt

cat wget-log to check the download status and incase any file goes amiss.

Advanced Downloading With wget

Can store a list of downloads in a file like url_list.txt. Then download all with -i flag. Put other flags before -i.

Can set download limit rate with --limit-rate.

wget --limit-rate={rate}k {file_location}

`wget --limit-rate=200k -i url_list.txt

Setting constraints for small files by setting a mandatory pause time in seconds between file downloads with --wait.

wget --wait=2.5 -i url_list.txt

curl VS wget

curl advantages:

  • Can be used for downloading and uploading files from 20+ protocols
  • Easier to install

wget advantages:_

  • Has many built-in functionalities for handling multiple file downloads
  • Can handle various file formats for download (file directory, HTML pages)

csvkit

  • Is a suite of command-line tools
  • Is developed in Python by Wireservice
  • Offers data processing and cleaning capabilities on CSV files
  • Has data capabilities that rival Python, R, and SQL

Install with pip install csvkit or upgrade with pip install --upgrade csvkit.

This package does not have a man page, but instead it has a html based documentation.

Can only process csv files. This is just the suite, use the below commands to convert and process files into csv and analyze them.

in2csv

This module is part of the suite and it does include a man page in2csv --help / -h.

in2csv SpotifyData.xlsx > SpotifyData.csv

Not doing a redirecting arrow (output >) will only print the first page in2csv SpotifyData.xlsx.

Use --name or -n to print all sheet names.

Use --sheet option followed by sheet name to convert it.

in2csv World_Bank_CO2.xlsx --sheet "About" > World_Bank_CO2.csv

in2csv does not print logs to console. Run ls for sanity check.

csvlook

Renders a CSV to the command line in a Mardown-compatible, fixed-width format. Use this to preview files

csvlook --max-rows 10 "World_Bank_CO2.csv

csvstat

Prints descriptive summary statistics on all columns in CSV (mean, median, unique value counts).

Filtering Data

Can create a subset of the original data file by:

  1. Filtering the data by column
  2. Filtering the data by row

csvcut filters data using column name or position.

csvgrep filters data by row value through exact match, pattern matching or even regex.

csvcut

Use --names or -n to print all column names/headers.

csvcut -n World_Bank_CO2.csv

Or return the first column in the data by position.

csvcut -c 1 World_Bank_CO2.csv

Or return the first column in the data by name.

csvcut -c "Year" World_Bank_CO2.csv

Return the second and third column in the data by position.

csvcut -c 2,3 World_Bank_CO2.csv

Can do the same with name.

csvcut -c "Region","Year" World_Bank_CO2.csv

Be careful not to introduce spaces between the column names or position.

csvgrep

  • Filters by row using exact match or regex fuzzy matching
  • Must be paired with one of these options:
    • -m followed by the exact row value to filter
    • -r followed by a regex pattern
    • -f followed by the path to a file

Find in World_Bank_CO2.csv where Country Name = Aruba

csvgrep -c "Country Name" -m Aruba World_Bank_CO2.csv

Similar can be done with position rather than the name.

csvgrep -c 4 -m 1969 World_Bank_CO2.csv

https://mkt.tableau.com/Public/Datasets/World_Bank_CO2.xlsx
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment