@ohjho
Last active January 19, 2024 22:31
text file handling edition

Find Differences in two Text Files

grep -Fxvf file1 file2

Flags mean:

-F, --fixed-strings
              Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.    
-x, --line-regexp
              Select only those matches that exactly match the whole line.
-v, --invert-match
              Invert the sense of matching, to select non-matching lines.
-f FILE, --file=FILE
              Obtain patterns from FILE, one per line.  The empty file contains zero patterns, and therefore matches nothing.

Reference
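
A small sketch of the command above on two throwaway files (the file names and contents are made up for illustration). Note that the result is one-directional: it lists the lines of the second file that do not appear in the first, so swap the arguments to see the other side.

```shell
# Work in a scratch directory so nothing real is touched
tmpdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmpdir/file1"
printf 'banana\ncherry\ndate\n' > "$tmpdir/file2"

# Lines in file2 that are not in file1
only_in_file2=$(grep -Fxvf "$tmpdir/file1" "$tmpdir/file2")
# Lines in file1 that are not in file2
only_in_file1=$(grep -Fxvf "$tmpdir/file2" "$tmpdir/file1")

echo "only in file2: $only_in_file2"
echo "only in file1: $only_in_file1"
```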

Get only lines that are {x} characters long

# Get lines that are exactly 5 characters long (reads from stdin)
grep -x '.\{5\}'

# To run the operation on a file
grep -x '.\{5\}' input_file.txt

Get line x to line y of a file

We can use a combination of head and tail. For example, to get lines 50 to 100 of a file (51 lines in total):

head -n 100 input_file.txt | tail -n 51
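
As an alternative sketch, sed can select a line range directly, which avoids the counting needed with head and tail. The example uses seq to generate a numbered stand-in file (an assumption for the demo); the 100q part makes sed stop reading once line 100 has been printed.

```shell
tmpfile=$(mktemp)
seq 1 200 > "$tmpfile"

# Print lines 50 through 100 (inclusive), then quit at line 100
selected=$(sed -n '50,100p;100q' "$tmpfile")

first_line=$(printf '%s\n' "$selected" | head -n 1)
last_line=$(printf '%s\n' "$selected" | tail -n 1)
line_count=$(printf '%s\n' "$selected" | wc -l | tr -d ' ')
echo "first=$first_line last=$last_line count=$line_count"
rm -f "$tmpfile"
```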

How many lines in a file?

wc -l < input_file.txt

Copying a lot of files (argument list too long)

for i in source_dir/images/*.jpg ; do cp "$i" target_dir/images/; done
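
Another way to sidestep the limit, sketched here with find: the shell never expands the glob, so the argument list stays short. The directory names mirror the ones above but are created inside a scratch directory for the demo, and the file names are made up. -maxdepth is an extension supported by both GNU and BSD find.

```shell
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/source_dir/images" "$tmpdir/target_dir/images"
touch "$tmpdir/source_dir/images/a.jpg" \
      "$tmpdir/source_dir/images/b 2.jpg" \
      "$tmpdir/source_dir/images/notes.txt"

# find hands each matching file to cp itself, one file per cp call
find "$tmpdir/source_dir/images" -maxdepth 1 -name '*.jpg' \
  -exec cp {} "$tmpdir/target_dir/images/" \;

copied=$(ls "$tmpdir/target_dir/images" | wc -l | tr -d ' ')
echo "copied $copied files"
```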

Removing Extra Whitespace Characters from CSV

Sometimes a CSV's line breaks include an extra \r (carriage return). To remove these before passing the file to other CLI tools such as xargs on Linux, run the following instead of cat (ref):

sed $'s/\r$//' your_file.csv
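
The $'...' quoting above is a bash/zsh feature. As a rough sketch for plain sh, tr can do a similar job; note that it deletes every carriage return in the file, not only those at the end of a line. The sample CSV content is made up for the demo.

```shell
tmpfile=$(mktemp)
# A two-line CSV with Windows-style (CRLF) line endings
printf 'id,name\r\n1,foo\r\n' > "$tmpfile"

# Byte counts before and after stripping the carriage returns
bytes_before=$(wc -c < "$tmpfile" | tr -d ' ')
bytes_after=$(tr -d '\r' < "$tmpfile" | wc -c | tr -d ' ')

echo "before=$bytes_before after=$bytes_after"
rm -f "$tmpfile"
```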

use xargs to pipe each line into a command

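For example, to run a Python script once per image in target_dir/:

ls target_dir/*.jpg | xargs -I{} python script.py --image {}

The -I{} flag gives you control over where exactly to place the piped value, and it makes xargs run the command once per input line. Because ls target_dir/*.jpg already prints each path with the target_dir/ prefix, {} is used as-is. -n 1 is not needed here: GNU xargs treats -n and -I as mutually exclusive.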
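
Parsing ls output breaks on file names that contain spaces or newlines. A minimal sketch of a safer variant uses find -print0 with xargs -0, which separate names with NUL bytes. Here echo stands in for the real command, and the file names are made up for the demo.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/cat.jpg" "$tmpdir/my dog.jpg"

# -print0 and -0 pass NUL-separated names, so "my dog.jpg" stays one argument.
# echo is a placeholder for something like: python script.py --image {}
output=$(find "$tmpdir" -name '*.jpg' -print0 | xargs -0 -I{} echo "processing {}")

printf '%s\n' "$output"
```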

examples

move random files to a different directory

use case: generating a train-test split ref: link

ls | shuf -n 10 | xargs -I{} mv {} path-to-new-folder

Post All Images in a Directory to an API

use case: load testing an API

ls | xargs -I{} curl -X 'POST' 'http://your.api.url/apply?model_name=FancyModel' -H 'accept: image/jpeg' -H 'Content-Type: multipart/form-data' -F 'file_obj=@{};type=image/jpeg' -o path_to_output_folder/{}
  • the output will be saved with the same filename as the input but in the specified output directory (using the -o flag)
  • append 2>&1 | tee path_to_your_log_file.log to save a log of each of your API calls while still seeing it in the terminal. Hat tip to this stackoverflow post
  • to time the whole command, just add time at the beginning

Check if a JSON file contains a specific string

use case: retry an API call if the JSON response was Time-out or Error

grep -Ril "Error 500" path_to_output_folder/ | cut -c12- | rev | cut -c6- | rev | xargs -I{} curl -X 'POST' 'http://your.api.url/apply?model_name=FancyModel' -H 'accept: image/jpeg' -H 'Content-Type: multipart/form-data' -F 'file_obj=@{};type=image/jpeg' -o path_to_output_folder/{}.json
  • grep -Ril: finds files in the given directory whose content contains the search pattern (-R recursive, -i case-insensitive, -l print only file names). In this case the pattern is Error 500 (it could also be Time-out or Internal Server Error). This returns a list of file paths. For more, see this stackoverflow post
  • cut -c12-: drops the first 11 characters of each returned path (the directory prefix); adjust the number to the length of your own path
  • rev | cut -c6- | rev: a hack that uses cut to remove the last 5 characters, here the .json extension
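
The fixed character offsets passed to cut only fit one specific directory name length. As a hedged alternative, basename can strip both the directory and the .json suffix regardless of path length. The sketch below uses made-up response files in a scratch directory and stops at producing the list of names to retry.

```shell
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/path_to_output_folder"
printf '{"detail": "Error 500"}\n' > "$tmpdir/path_to_output_folder/img_001.jpg.json"
printf '{"detail": "ok"}\n' > "$tmpdir/path_to_output_folder/img_002.jpg.json"

# List the failed responses, then reduce each path to the original image name
failed=$(grep -Ril "Error 500" "$tmpdir/path_to_output_folder/" |
  while IFS= read -r f; do basename "$f" .json; done)

echo "to retry: $failed"
```

The resulting names can then be piped into xargs -I{} exactly as in the command above.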

Join two (or more) CSV together

assuming that all the CSVs have the same columns:

cat first.csv <(tail -n +2 second.csv) <(tail -n +2 third.csv) > all.csv
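
The <( ... ) process substitution above needs bash or zsh. A sketch of the same join with awk, which also scales to any number of files without repeating tail: it skips the first line of every file except the first one. The sample CSV contents are made up for the demo.

```shell
tmpdir=$(mktemp -d)
printf 'id,name\n1,a\n' > "$tmpdir/first.csv"
printf 'id,name\n2,b\n' > "$tmpdir/second.csv"
printf 'id,name\n3,c\n' > "$tmpdir/third.csv"

# FNR is the line number within the current file, NR counts across all files:
# a header line is kept only when it is the very first line read
awk 'FNR == 1 && NR != 1 { next } { print }' \
  "$tmpdir/first.csv" "$tmpdir/second.csv" "$tmpdir/third.csv" > "$tmpdir/all.csv"

cat "$tmpdir/all.csv"
```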

More CSV commands

Selectively Show Columns using cut

This command will show only column 3 of the input.csv:

cut -f3 -d, input.csv

while this command will NOT show column 3:

cut --complement -f 3 -d, input.csv
  • note that --complement is not available in the macOS version of cut; for a workaround, keep reading below...

And finally, this command will show columns 1 to 2 and column 4 onwards (inclusive), which has the same effect as dropping column 3:

cut -f1-2,4- -d, input.csv
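
A quick sketch of the two portable forms on a single made-up CSV row, to show what each field selection keeps:

```shell
row='a,b,c,d,e'

# Keep only column 3
only_third=$(printf '%s\n' "$row" | cut -f3 -d,)
# Keep everything except column 3 (works without --complement)
without_third=$(printf '%s\n' "$row" | cut -f1-2,4- -d,)

echo "only column 3: $only_third"
echo "without column 3: $without_third"
```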

Selecting Lines from a CSV using sed

for example:

sed -n '1p;1001,2001p' path/to/your.csv > path/to/your_subset.csv

This prints the 1st line (the CSV header) and lines 1001 to 2001 (i.e. the 1000th to 2000th data rows) to your_subset.csv

Replacing the first line (i.e. header) of your CSV/ any text file

for example, when your script is expecting a slightly different column name in your dataframe:

sed -i '1cCol1,Col2,NewCol3' dataframe.csv
  • -i: makes the change to the CSV in place (see original solution here)
  • 1c: replaces (changes) line 1 with the text that follows
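
The one-line 1c form and a bare -i are GNU sed conventions; BSD/macOS sed expects a different syntax for both. A portable sketch that avoids sed altogether writes the new header plus the remaining lines to a temporary file and then swaps it in. The sample file content is made up for the demo.

```shell
tmpdir=$(mktemp -d)
printf 'Col1,Col2,Col3\n1,2,3\n' > "$tmpdir/dataframe.csv"

# Write the new header, then everything from line 2 onwards, to a temp file
{
  echo 'Col1,Col2,NewCol3'
  tail -n +2 "$tmpdir/dataframe.csv"
} > "$tmpdir/dataframe.csv.tmp"

# Replace the original only once the new file is complete
mv "$tmpdir/dataframe.csv.tmp" "$tmpdir/dataframe.csv"

cat "$tmpdir/dataframe.csv"
```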

jq resources

wget tips

  • use -O if you want to specify the output filename (see reference)
  • use -P if you want to download the file to a specific directory

working with tar and Zip files

to create a tar archive:

cd target/dir/
tar -czvf logs_archive.tar.gz *

to create a zip file, it's easiest to put everything you want to zip in a folder first (e.g. ~/dir_to_zip/), then run:

zip -r package.zip ~/dir_to_zip

to extract a tar archive:

tar -xvf filename.tar.gz -C /target/dir/
  • -v: verbose; optional here
  • -C: also optional; specifies the directory to extract (x) the archive file (f) into, instead of the current working directory
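
A minimal round-trip sketch using a scratch directory (the logs and restored directory names are made up). It shows that -C also works when creating an archive, so no cd is needed, and that -t lists the contents without extracting anything.

```shell
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/logs" "$tmpdir/restored"
echo 'hello' > "$tmpdir/logs/app.log"

# -C switches directory before adding files, so the archive can live
# outside the directory being archived
tar -czf "$tmpdir/logs_archive.tar.gz" -C "$tmpdir/logs" .

# -t lists the archive contents without extracting
tar -tzf "$tmpdir/logs_archive.tar.gz"

# Extract into a different directory
tar -xzf "$tmpdir/logs_archive.tar.gz" -C "$tmpdir/restored"
restored=$(cat "$tmpdir/restored/app.log")
echo "restored content: $restored"
```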

for zipped files:

unzip file.zip -d destination_folder
  • -d is optional; without it, the contents are extracted into the current directory

Saving Terminal outputs to a file

there are about 10 ways, as described in detail here, but my favorite is:

command |& tee output.txt

which will capture both the standard output and standard error streams into output.txt while still showing them in the terminal (this variation will overwrite output.txt if it exists). |& is bash shorthand for 2>&1 |
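
A short sketch of two variations: the portable 2>&1 | spelling, which also works in plain sh, and tee -a, which appends instead of overwriting. The noisy function is a hypothetical stand-in for any command that writes to both streams.

```shell
tmpdir=$(mktemp -d)
log="$tmpdir/output.txt"

# Stand-in command that writes one line to stdout and one to stderr
noisy() { echo 'to stdout'; echo 'to stderr' >&2; }

# Portable form: merge stderr into stdout, then tee (overwrites the log)
noisy 2>&1 | tee "$log"
# -a appends to the existing log instead of overwriting it
noisy 2>&1 | tee -a "$log"

line_count=$(wc -l < "$log" | tr -d ' ')
echo "lines in log: $line_count"
```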

What's my IP Address?

curl ifconfig.me
curl checkip.amazonaws.com