Awk for parsing through Wikipedia articles separated by lines of five equal signs
# articles look like:
# Title
# Article text...
# ....
# =====
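# BEGIN block: FS and RS have to be set before the first record is read, so they go in BEGIN
# FS="\n", set the field separator to be newlines, so each line of an article is one field
# RS="=====", set the record separator to be five equal signs (a multi-character RS needs gawk or mawk)
# title: every record after the first starts with the newline that followed the separator, so the
# title is $2 there; in the first record it is $1
# gsub("/", "_", title): replace all forward slashes with underscores in the title, because "/" is the
# path separator and cannot appear in a file name
# if statement checks to see if there is already a file with the title we're looking at (getline returns
# -1 when the file cannot be opened) and skips the article if so; records with an empty title are skipped too
# print statement redirects all the output for a particular record to <title>.txt (the target of > can be
# any string expression, so the .txt suffix is a convenience, not a requirement). close() releases the
# file so we don't run out of open file descriptors. We also print the article title for monitoring
cat ../out |
awk 'BEGIN { FS = "\n"; RS = "=====" }
{
    title = ($1 != "") ? $1 : $2
    gsub("/", "_", title)
    file = title ".txt"
    if (title != "" && (getline line < file) < 0) { print > file; close(file); print title } else close(file)
}'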
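
A quick way to check the splitting logic is to run it on a tiny hand-made sample before pointing it at the real dump. The sketch below is only an illustration: the two sample articles, the temporary directory, and the input file name `out` are made up here, and it assumes an awk that accepts a multi-character RS (gawk or mawk).

```shell
#!/bin/sh
# Hypothetical sample data; the real ../out file is not shown in the source.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two articles, each followed by a line of five equal signs.
printf '%s\n' 'AC/DC' 'Rock band.' 'Formed in 1973.' '=====' \
              'Awk' 'Text processing language.' '=====' > out

awk 'BEGIN { FS = "\n"; RS = "=====" }
{
    # The first record has its title in $1; later records start with a
    # newline, which makes $1 empty and puts the title in $2.
    title = ($1 != "") ? $1 : $2
    gsub("/", "_", title)            # "/" cannot appear in a file name
    file = title ".txt"
    # getline returns -1 when the file cannot be opened, i.e. it does not exist yet.
    if (title != "" && (getline line < file) < 0) {
        print > file                 # write the whole record
        close(file)                  # avoid running out of open files
        print title                  # progress output
    } else close(file)
}' out

ls
```

If everything works, the awk step should print the two titles (with the slash in the first one replaced by an underscore) and `ls` should show one `.txt` file per article next to `out`. Running the awk command a second time in the same directory should print nothing, because both files already exist.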