@verachell
Created November 29, 2022 01:27
Designed for getting phrases from a Project Gutenberg book
#! /bin/bash
# Designed for getting phrases from a Project Gutenberg book.
# Phrases are defined here as the text between commas, semicolons, periods,
# double quotes, and question marks. The punctuation is removed in the process,
# except for question marks, which are kept.
# Other unwanted or outdated punctuation and chapter headings are also removed.
# This script was written for one particular book, so your mileage may vary,
# especially for chapter heading removal, since each book may format headings differently.
#
# First, word wrapping is removed, then the text is split into lines at certain punctuation marks.
# Next, question marks are handled separately, since we wish to retain the question mark after splitting.
# Next, '--' is replaced by a single space and '_' is removed.
# Next, lines containing chapter heading indicators ('* *' or '§') are removed.
# Finally, empty lines are removed (this was not essential for my use case, but I have included
# it for the sake of completeness). Credit for empty line removal using grep goes to
# https://stackoverflow.com/questions/16414410/delete-empty-lines-using-sed
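# Usage: pass the path to the book's text file as the first argument.
# "$1" is quoted so that file names containing spaces work.
tr -s '\r\n' ' ' < "$1" |
  tr '",;.' '\n' |
  sed 's/?/?\n/g' |
  sed 's/--/ /g' |
  tr -d '_' |
  grep -Fv '* *' |
  grep -v '§' |
  grep -v '^[[:space:]]*$'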
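To see what the pipeline does to a short passage, here is a minimal sketch that wraps the same steps in a function reading from standard input. The function name `extract_phrases` and the sample sentence are made up for illustration and are not part of the original script. It assumes GNU sed, since `\n` in the replacement text is a GNU extension, and a tr that pads the shorter second set with its last character, as GNU tr does.

```shell
#!/bin/bash
# Hypothetical helper: the same pipeline as the script above, but reading stdin.
# Assumes GNU sed (for '\n' in the replacement) and GNU-style tr set padding.
extract_phrases() {
  tr -s '\r\n' ' ' |            # join wrapped lines into one long line
    tr '",;.' '\n' |            # split at double quotes, commas, semicolons, periods
    sed 's/?/?\n/g' |           # split after question marks, keeping the mark
    sed 's/--/ /g' |            # replace '--' with a space
    tr -d '_' |                 # drop '_' emphasis markers
    grep -Fv '* *' |            # drop fragments containing '* *'
    grep -v '§' |               # drop fragments containing '§'
    grep -v '^[[:space:]]*$'    # drop empty and whitespace-only lines
}

# Made-up sample text, wrapped over two lines like a Gutenberg plain-text file.
printf '%s\n' \
  '"Where are you going?" she asked--and he said,' \
  '_nothing_; the road was long.' |
  extract_phrases
```

Phrases that followed a space in the source keep that leading space, because the pipeline removes the punctuation but does not trim whitespace.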