@jsoma
Last active February 18, 2024 16:17
I promise the command line is fun! It can be awful, sure, but also fun.

The command line is fun, I promise!

Let's do some fun stuff on the command line! This is a lot of "figuring out what to do" as opposed to "applying skills we learned in class." It will definitely make you feel uncomfortable and like you don't know anything, but that's okay!

Do these in any order you want. Be sure to check out the very last one: it's crazy.

A general tip: When you're searching around on the internet, pip install has a hundred ways of being talked about. Whether it's python3 -m pip install, pipx install, or pip3 install, anything that vaguely looks like that can probably just be substituted with pip install.

ChatGPT can be really helpful for figuring out the specific command-line flags and arguments you need to get your CLI tools operating how you want them to. Unless you're using ffmpeg and convert every day of your life, memorizing exactly how these command-line tools work is prrrrrobably not the best use of your brainpower.

Section 1: Becoming independently wealthy by selling posters

The Colors of Motion used to charge like five zillion dollars for movie prints, but I guess whoever sold their prints mysteriously shut down in the past couple months???

Let's see if we can become billionaires doing the same thing. This process leverages three sets of command-line tools:

  • youtube-dl to download a youtube video
  • ffmpeg to extract the frames
  • imagemagick to combine and resize the images

Your mission: create a time-series color swatch like the Colors of Motion visuals with the title of the movie on top.

Someone on reddit made a suggestion about how to build these visuals yourself, which was then posted as a gist (a tiny code snippet). Seems like a good starting point, although their versions have no titles and are horizontal.

Hints and tips

  • You want to use yt-dlp to download a couple youtube videos (or find a "real" movie to download on the internet archive or something). youtube-dl is the original of course but yt-dlp is definitely better these days.
  • I'm not sure what the default youtube download format is: you probably want to download as an mp4, not a webm.
  • ffmpeg is ffmpeg, while convert and mogrify are both imagemagick. The command to use yt-dlp is just yt-dlp (and for the original youtube-dl, it's youtube-dl).
  • ffmpeg and imagemagick can be real pains to install if they don't immediately install correctly. They're very useful, though, so it's worth it: ffmpeg does everything with video, and imagemagick can do anything with images (although I guess it's less professionally useful and more fun-stuff useful). If you get errors when installing, it probably isn't your fault. Post in #foundations or #helpme and I or someone else can help debug. If you're on OS X, you should probably find instructions about installing them with homebrew (this is a general rule with installing software!)
  • <output folder> means put the name of the folder you want the screenshots to go into.
  • You know how cd [foldername] means cd whatever, and not cd [whatever]? In the same way, you'll be getting rid of the < and > when you're creating the output folder.
  • You might need to use mkdir to make a new directory before the ffmpeg command, as ffmpeg won't save into a directory that doesn't exist.
  • If you get a "there are too many arguments!!!" error when using convert, it's because you have too many screenshots for convert to process. Maybe delete your screenshots folder and tell ffmpeg to not take every single frame. ChatGPT can also help you customize the instruction to only take every nth frame, one frame per 15 seconds, etc.
  • Stacking horizontally vs. vertically uses slightly different imagemagick commands.
  • I'm assuming that adding a title (or a border, if you're feeling fancy) must be included in someone's list of "top 10 useful imagemagick examples" or the like somewhere on the internet.
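If you want a rough map of how the pieces above might fit together, here's one possible sketch, not the one true answer. Every file and folder name in it (movie.mp4, frames, swatch.png, poster.png) is a placeholder made up for this example, and so are the poster size, the font size and the one-frame-per-15-seconds rate. Shrinking each frame to a single pixel is just one way to get its rough average color.

```shell
# Sketch of the poster pipeline as one shell function.
# All file names, sizes and rates here are assumptions for illustration.
make_poster() {
    url="$1"      # video URL to download
    title="$2"    # text to print on top of the poster

    # 1. Download the video, asking for an mp4 rather than a webm.
    yt-dlp -f mp4 -o movie.mp4 "$url"

    # 2. ffmpeg won't save into a folder that doesn't exist, so make it first.
    mkdir -p frames

    # 3. Save one frame every 15 seconds instead of every single frame.
    ffmpeg -i movie.mp4 -vf fps=1/15 frames/frame_%05d.png

    # 4. Shrink each frame to one pixel, which roughly averages its color.
    mogrify -resize '1x1!' frames/*.png

    # 5. Stack the pixels top to bottom (-append) and stretch them to
    #    poster size. Use +append instead to stack left to right.
    convert frames/*.png -append -resize '600x900!' swatch.png

    # 6. Add a white strip on top and write the title into it.
    convert swatch.png -gravity north -background white -splice 0x100 \
        -pointsize 36 -annotate +0+30 "$title" poster.png
}
```

You'd call it with something like make_poster "<video url>" "My Movie Title" (dropping the < and >, as usual). If convert complains about too many arguments, a bigger gap between frames (say fps=1/30) is the first thing to try.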

Section 2: AI-based transcription, summarization and translation

While OpenAI is best known for creating ChatGPT, they also made a great "speech-to-text" tool called Whisper. This means Whisper does transcription, and as a bonus it supposedly works a lot better with non-English languages (and accented speech) than more common tools like Otter.ai. It's also free, and when using it you don't need to send anything up to the cloud!

But how well does it really work?

Part 1: ENGLISH

  1. Use yt-dlp to download the audio for this video I made about generating ideas for data-driven stories.
  2. If you look at the Available models and languages section of their GitHub page, you'll see that Whisper comes in different versions. Each version has a tradeoff between quality and speed. Try it first with the tiny.en model, then with the base.en model.
  • Is their performance comparable? You can compare both outputs easily in VS Code.
  • How long did each one take? Note that the progress bar is for downloading the model, not the actual transcription process. It only happens the first time.
  3. Summarize the transcript with ChatGPT using something as simple as "Summarize the following transcript," and then pasting in the transcript.
  4. Ask a follow-up question, something like "Can you give me some ideas about how I might do this myself?"

Part 2: NOT ENGLISH

  1. Use yt-dlp to download the audio for a non-English-language video. I'm using this Polish one about making dumplings. It's better if it's short.
  2. Either transcribe into the original language (if you know the language) or use the --task translate option to get an English translation (if you don't know the language)
  • Again, try using two different model sizes.
  3. Look at them! How did it do, and how do they compare to one another?

Hints and tips

  • Instead of downloading the video, yt-dlp can also just download audio. You'll probably want it to be an mp3.
  • If you have a video already on your computer that you'd like to get the audio from, ffmpeg can be used to extract the audio.
  • The first time you run whisper you'll get a LOT of files. Use whisper --help to figure out how to save only a text file.
  • Watch out you don't overwrite your tiny.en result with your base.en result!
  • You can use the diff command on the terminal to compare two files, but it's more fun in VS Code.
  • You can use the time command to time how long it took.
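Here's a hedged sketch of how the hints above could be wired together. The function names, the transcripts/ folder layout and the idea of passing a short name like "talk" are all made up for this example. The whisper options shown (--model, --output_format, --output_dir) are worth double-checking against whisper --help on your own install.

```shell
# Download only the audio track as an mp3.
# $1 = video URL, $2 = a short name for the file (an assumption, e.g. "talk")
download_audio() {
    yt-dlp -x --audio-format mp3 -o "$2.%(ext)s" "$1"
}

# Transcribe with a given model, saving only a .txt file.
# Each model gets its own folder so tiny.en can't overwrite base.en.
# $1 = model name (tiny.en, base.en, ...), $2 = audio file
transcribe_with() {
    mkdir -p "transcripts/$1"
    whisper "$2" --model "$1" --output_format txt --output_dir "transcripts/$1"
}

# Show the lines that differ between the two transcripts.
# $1 = the same short name used when downloading
compare_transcripts() {
    diff "transcripts/tiny.en/$1.txt" "transcripts/base.en/$1.txt"
}
```

A run might look like download_audio "<video url>" talk, then time transcribe_with tiny.en talk.mp3, then time transcribe_with base.en talk.mp3, then compare_transcripts talk. For the non-English video, adding --task translate to the whisper line asks for an English translation instead of a transcript in the original language.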

Section 3: Command line data analysis

You're going to use the command-line tool htmltab to download the table of the largest 2022 watermelons, then use command-line tools to analyze it.

Click around on that page until you find a place with installation instructions! Use the pip install instructions, not the pipx ones, and ignore the virtual environment mention. What do all of these things mean???? You'll slowly come to understand! Right now if you're at an "I know I need to copy this into the command line" point, that's definitely a lot of growth since Monday morning!

The command you'll use to download the table as a CSV is:

htmltab --output output.csv --select .ReportResults "http://www.bigpumpkins.com/WeighoffResultsGPC.aspx?c=W&y=2022"

The last line of the csv file is not actually data! Remove it with ANY command-line tool before you do your analysis. It can be one of the tools you're using for this or a tool you've previously used.
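One way to do that with a tool you almost certainly already have is sed. This is only a sketch: the function name and the watermelons.csv file name in the usage line are made up for this example.

```shell
# Print every line of a file except the last one.
# In sed, '$' means "the last line" and 'd' means "delete it".
drop_last_line() {
    sed '$d' "$1"
}
```

For example, drop_last_line output.csv > watermelons.csv leaves output.csv untouched and writes the cleaned-up copy to a new file.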

There are several command-line tools to do data analysis, such as xsv, csvkit, and Miller. Try to find and install those tools, and use them to perform the analysis listed below.

Note that you can't do everything with each piece of software, and some of the documentation is absolutely awful. Answers like "this tool can't do it," "I don't think this tool can do it," or "I tried to look at url and the documentation is impossible and then I looked at some examples at url and they were all awful and impossible, too" are all totally acceptable.

This is more about reading documentation, comparing tools, and trying to apply examples to your own data. Unless you're really interested in being a power user, don't spend more than a few minutes on any question.

You also might want to do this one tool at a time instead of one question at a time.

  1. Find and install each of the tools.
  2. Take a look at all of the entries from Italy
  3. Make a CSV file of all of the entries from Kentucky
  4. What percent of the events were a "weigh off?"
  5. I guess the "GPC site" is the event where they showed off the watermelon. What are the top 3 events, and how many watermelons on the list are from each?
  6. How many watermelons were over 300 pounds? (if automatically calculating this using the command-line tool doesn't work, maybe try manually counting)
  7. What was the median watermelon weight out of all the entries?
  8. If you put all of the watermelons into big piles for each country, how much would each country's pile weigh?
  9. FOR EXTRA FUN: Finding command-line charting libraries that make quality graphics is way more difficult than it should be. I think you should make a chart using something like https://github.com/mkaz/termgraph, and maybe it can look fun like this
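To give a flavor of what a few of these look like (not a full answer key), here's a sketch using csvkit and Miller. The function names are invented for this example, and the column names are passed in as arguments on purpose: I'm not assuming what the headers in your CSV are called, so check them first with something like csvcut -n on your file.

```shell
# Rows where a column contains a value (csvkit).
# $1 = CSV file, $2 = column name, $3 = text to match
entries_from() {
    csvgrep -c "$2" -m "$3" "$1"
}

# Median of a numeric column (csvkit).
# $1 = CSV file, $2 = column name
median_of() {
    csvstat -c "$2" --median "$1"
}

# Sum of one column, grouped by another (Miller).
# $1 = CSV file, $2 = numeric column to add up, $3 = column to group by
pile_weights() {
    mlr --icsv --opprint stats1 -a sum -f "$2" -g "$3" "$1"
}
```

If your country column turned out to be named Country (an assumption!), then entries_from watermelons.csv Country Italy would show the Italian entries, and adding > italy.csv on the end would save them to a new CSV file.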

Hints and tips

You can find hints and tips at https://gist.github.com/jsoma/8474141c70df4a61b105c76ea4ce0838, but it's mostly just struggle and pain and giving up on documentation/examples if you can't find it. But for the last one, I'll say two things:

  • If you use termgraph, install it with the normal pip install thing and ignore their instructions
  • Look at the data files in the data/ folder on the GitHub repo. They mostly don't have headers! If you're using Miller, --headerless-csv-output will probably be your friend (I'm telling you because IT'S VERY HARD TO FIND, but seeing so many watermelons is very rewarding)
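Putting those two hints together might look something like the sketch below. The function name and the counts.dat file name are made up, the column name is whatever you pass in, and I'm assuming termgraph is happy with plain "label,number" lines like the header-free files in its data/ folder.

```shell
# Count how often each value appears in a column, keep the top 3,
# save them without a header row, then chart them with termgraph.
# $1 = CSV file, $2 = column to count
chart_counts() {
    mlr --icsv --ocsv --headerless-csv-output \
        count-distinct -f "$2" then sort -nr count then head -n 3 \
        "$1" > counts.dat
    termgraph counts.dat
}
```

You'd run it as chart_counts watermelons.csv "<column name>", with the real column name in place of the placeholder. Peeking at counts.dat before charting is a good sanity check that each line really is a label and a number.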

Section 4: More exciting command line data analysis

Jeremy Singer-Vine is the only famous data journalist, and he loves loves loves visidata. Use it to analyze the watermelon data. It's a command-line tool, but it's... slightly different.
