cardasac/come-and-learn-in2csv.md

## come-and-learn-in2csv.md

      
    Raw
  

              come-and-learn-in2csv.md
            
          
    Instructions

All the links to the csv to practice using in2csv library are located in the url_list.txt.
For simplicity please use the get_files.sh script which downloads the required files for you.
Totally inspired by the course on Datacamp Data Processing in Shell.

  
## get_files.sh
#!/bin/bash
#
# Download some files that you can use to practice in2csv commands.

readonly FILE_NAMES=("Section" "Student" "School" "StudentEnrollment" "TeacherRoster" "Teacher")
readonly URL="https://raw.githubusercontent.com/OfficeDev/O365-EDU-Tools/master/CSV Samples/SDS Format/100 Users/"
readonly FOLDER_NAME="data"

#######################################
# Download school related files from Microsoft repository.
# Globals:
#   None
# Arguments:
#   None
#######################################
download_school_files() {
  for val in "${FILE_NAMES[@]}"; do
    wget -qbc --limit-rate=2500k -P "$FOLDER_NAME" "$URL/$val.csv"
  done
}

#######################################
# Download anything else you would like to download that is placed inside the url_list.txt file.
# Globals:
#   None
# Arguments:
#   None
#######################################
download_list_files() {
  wget -qbc -P "$FOLDER_NAME" -i "url_list.txt"
}

entire_setup() {
  download_school_files && download_list_files && find data/
}

if entire_setup; then
  printf "\u2714 All your files are downloaded! You are ready to start learning :)\n"
else
  printf "\u2718 Something went wrong :(\n"
fi

## notes.md

      
    Raw
  

              notes.md
            
          
    Data Processing In Shell

curl


Short for Client for URLs
Is a unix command line tool
transfers data to and from a server
Used to download data from HTTP(S) sites and FTP servers

man curl to check installation and manual.
curl --help for checking help on specific commands.
-O to save the file with its original name curl -O <url> . -o for new file name curl -o newfilename.txt <url>.
Can use a wildcard * to download multiple files curl -O <url.com/datafilename*.txt>.
Can also using Globbing Parser curl -O https://website.com/filename[001-100].txt. To download the 10th file use curl -O https://website.com/filename[001-100:10].txt.
Preemptive Troubleshooting

curl has two particularly useful option flags in case of timeouts during download:

-L redirects HTTP URL if a 300 error code occurs
-C resumes a previous file transfer if it times out before completion, use - after the flag to make it automatically assume where to re-download.

Putting everything together curl -O -C -L - https://website.com/filename[001-100].txt.
wget


Derives from World Wide Web and get
Native to Linux but compatible for all operating systems
Used to download data from HTTP(S) and FTP
Better than curl at downloading multiple files recursively

Check installation with which wget. Check manual with wget --help.
Unique option flags to wget:

-b go to background immediately after startup
-q turn off output
-c resume broken download (continue getting a partially downloaded file)

wget -bqc https://website.com/file.txt
cat wget-log to check the download status and incase any file goes amiss.
Advanced Downloading With wget

Can store a list of downloads in a file like url_list.txt. Then download all with -i flag. Put other flags before -i.
Can set download limit rate with --limit-rate.
wget --limit-rate={rate}k {file_location}
`wget --limit-rate=200k -i url_list.txt
Setting constraints for small files by setting a mandatory pause time in seconds between file downloads with --wait.
wget --wait=2.5 -i url_list.txt
curl VS wget

curl advantages:

Can be used for downloading and uploading files from 20+ protocols
Easier to install

wget advantages:_

Has many built-in functionalities for handling multiple file downloads
Can handle various file formats for download (file directory, HTML pages)

csvkit


Is a suite of command-line tools
Is developed in Python by Wireservice
Offers data processing and cleaning capabilities on CSV files
Has data capabilities that rival Python, R, and SQL

Install with pip install csvkit or upgrade with pip install --upgrade csvkit.
This package does not have a man page, but instead it has a html based documentation.
Can only process csv files. This is just the suite, use the below commands to convert and process files into csv and analyze them.
in2csv

This module is part of the suite and it does include a man page in2csv --help / -h.
in2csv SpotifyData.xlsx > SpotifyData.csv
Not doing a redirecting arrow (output >) will only print the first page in2csv SpotifyData.xlsx.
Use --name or -n to print all sheet names.
Use --sheet option followed by sheet name to convert it.
in2csv World_Bank_CO2.xlsx --sheet "About" > World_Bank_CO2.csv
in2csv does not print logs to console. Run ls for sanity check.
csvlook

Renders a CSV to the command line in a Mardown-compatible, fixed-width format. Use this to preview files
csvlook --max-rows 10 "World_Bank_CO2.csv
csvstat

Prints descriptive summary statistics on all columns in CSV (mean, median, unique value counts).
Filtering Data

Can create a subset of the original data file by:

Filtering the data by column
Filtering the data by row

csvcut filters data using column name or position.
csvgrep filters data by row value through exact match, pattern matching or even regex.
csvcut

Use --names or -n to print all column names/headers.
csvcut -n World_Bank_CO2.csv
Or return the first column in the data by position.
csvcut -c 1 World_Bank_CO2.csv
Or return the first column in the data by name.
csvcut -c "Year" World_Bank_CO2.csv
Return the second and third column in the data by position.
csvcut -c 2,3 World_Bank_CO2.csv
Can do the same with name.
csvcut -c "Region","Year" World_Bank_CO2.csv

Be careful not to introduce spaces between the column names or position.

csvgrep


Filters by row using exact match or regex fuzzy matching
Must be paired with one of these options:

-m followed by the exact row value to filter
-r followed by a regex pattern
-f followed by the path to a file


Find in World_Bank_CO2.csv where Country Name = Aruba
csvgrep -c "Country Name" -m Aruba World_Bank_CO2.csv
Similar can be done with position rather than the name.
csvgrep -c 4 -m 1969 World_Bank_CO2.csv

  
## url_list.txt
https://mkt.tableau.com/Public/Datasets/World_Bank_CO2.xlsx
	#!/bin/bash
	#
	# Download some files that you can use to practice in2csv commands.

	readonly FILE_NAMES=("Section" "Student" "School" "StudentEnrollment" "TeacherRoster" "Teacher")
	readonly URL="https://raw.githubusercontent.com/OfficeDev/O365-EDU-Tools/master/CSV Samples/SDS Format/100 Users/"
	readonly FOLDER_NAME="data"

	#######################################
	# Download school related files from Microsoft repository.
	# Globals:
	# None
	# Arguments:
	# None
	#######################################
	download_school_files() {
	for val in "${FILE_NAMES[@]}"; do
	wget -qbc --limit-rate=2500k -P "$FOLDER_NAME" "$URL/$val.csv"
	done
	}

	#######################################
	# Download anything else you would like to download that is placed inside the url_list.txt file.
	# Globals:
	# None
	# Arguments:
	# None
	#######################################
	download_list_files() {
	wget -qbc -P "$FOLDER_NAME" -i "url_list.txt"
	}

	entire_setup() {
	download_school_files && download_list_files && find data/
	}

	if entire_setup; then
	printf "\u2714 All your files are downloaded! You are ready to start learning :)\n"
	else
	printf "\u2718 Something went wrong :(\n"
	fi