@kousu
Last active August 29, 2015 14:14
partition plain text files by section marker for literate programming the way it should be done
...possibly python wins this one.
#!/usr/bin/awk -f
# sectionate.awk
# usage: sectionate code.mat [code2.mat, ...]
#
# Find and split up marked sections in text files or code
# to separate sub-files.
# <nick@kousu.ca> 2-Clause BSD. No Warranty. Back up your shit before use.
# Docs:
# Used in conjunction with make and LaTeX's listings package you can
# without much effort approximate something like literate programming
# (e.g. Sweave + LyX or {IPython, Mathematica, Maple} Notebooks), while
# maintaining a standard, complete, plain text code file that gets along
# nicely with plain text editors and plain text source control.
#
# Section markers look like
# "#--- Section A: Grobble the Frops"
# and result in filenames like
# "code.mat.section_a:_grobble_the_frops"
#
# If there are sections you *do not* want to include, for example, blocks of
# comments or utility code, simply blank their names (i.e. leave empty/as whitespace).
#
# Tips:
#  * If you still want to label a blanked section, the quickest way to do this
#    is just to move the section name to a comment on the next line,
#    e.g. transform
#      "#--- part iv"
#    to
#      "#---
#       #part iv"
#  * Name sections identically to make all those sections end up in the same
#    file, in their original order. I didn't write this feature; it is a
#    side-effect of awk's awesome 80%-of-your-needs-in-one-place design.
#    But in a situation like that, consider whether it's better to explicitly make
#    separate code files instead of weirdly interleaving subprojects.
#  * To sectionate output from your code too, make your code print
#    section headers, redirect its stdout to a file, and run sectionate on that.
#  * Remove all blank lines between section markers and following code
#    so that your output sections pack tightly into your LaTeX or whatever.
# TODO:
# [x] lowercase section names
# [ ] make what marks a section break configurable (environment variable?)
# [x] support non-printing sections
# [ ] Figure out a way not to have to expensively use split-rejoin; tho compared to all the other ops awk does this one is not super expensive.
############################################################################
## Code
#
# Design:
# awk constrains all our processing to happen in sequential steps done to whole records
# (well, we can write loops and things within those sequential steps, but it's not idiomatic).
# With these constraints in mind, the way this design works is a simple state machine.
# RS is left at its default (newline), so all the rules below run for each line,
# and by changing the values of mode and section they implement that state machine.
# A simple diagram of the state transitions mode has:
# -> HEADER <-> CONTENT
# That is:
# initially, the state is undefined
# when the first #--- is run into, the state switches into "HEADER"
# then the code which handles "HEADER" records kicks in and extracts the section title
# then the state switches immediately to "CONTENT" (and processing of the header record runs off the end)
# then the code which handles "CONTENT" kicks in on the next record, and performs the splitting
# and the state switches back to HEADER on seeing #--- again
#
# There's also a sub-mode: when in CONTENT, section says where output goes to.
# However, if we are in a non-printing section (one with no name), then section
# is blank and we don't print.
# (When not in CONTENT, section is used as working space and/or ignored, and
# I make no promises that it contains anything sensible, which is a bit messy I admit.)
#
BEGIN {
  DEBUG="" #"dbgAwk"
  mode = -1
  section = -1
}
/^\#---/ {
  mode = "HEADER"
}
# when we're within a section,
# output the lines in that section to the file named by section
mode == "CONTENT" {
  if(section) {
    print > section #remember, 'print' means "print $0" means "print the entire current record"
    # subtlety: ">" doesn't mean quite the same thing as it does in bash,
    # because > vs >> translates to open()'s O_TRUNC vs O_APPEND, and awk holds files open unless you explicitly close() them (by string name, not by fd; awk doesn't expose fds.. does it?)
    # (and this is a good, elegant thing, though the semantic overloading is unfortunate)
    # So a single > here truncates each output file only the first time it is opened in a run: rerunning the whole awk program rewrites the output files from scratch (which is what we want!), yet every later print to the same name appends, so the files still contain all the lines we want
  }
}
# The code to handle HEADER happens *after* that for CONTENT,
# because it (has to!) unconditionally sets the mode to CONTENT,
# and if it were written first that would make awk process the HEADER line as if
# it were content and print the marker line itself into the section file.
mode == "HEADER" {
  # This code chunk in python, in case that helps you read it:
  #  title=line.split(maxsplit=1)[-1].strip().lower()
  #
  # awk doesn't have split(maxsplit=1) available,
  # so I need to split and rejoin (join isn't even a built in!!
  # I had to get it from the gnu people!)
  FS="[[:space:]]*" #this effectively does strip() because of the *
  n = split($0, arySection)
  section = tolower(join(arySection, 2, n, "_"))
  if(section) {
    # if section is non-blank, stretch it to include the FILENAME we are sectionating
    section=(DEBUG FILENAME "." section)
  }
  mode = "CONTENT"
}
#######################################
# Utilities
# join.awk --- join an array into a string
# <https://www.gnu.org/software/gawk/manual/html_node/Join-Function.html>
function join(array, start, end, sep,    result, i)
{
  if (sep == "")
    sep = " "
  else if (sep == SUBSEP) # magic value
    sep = ""
  result = array[start]
  for (i = start + 1; i <= end; i++)
    result = result sep array[i]
  return result
}
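# A minimal Python sketch of the header-to-filename rule used above, for readers
# who find the split/join dance hard to follow. It assumes the same "#---" marker
# and the same naming scheme (FILENAME "." title); the helper name
# section_filename and its debug_prefix parameter are hypothetical and not part
# of sectionate.awk. Edge cases such as trailing whitespace may differ from awk.

```python
def section_filename(filename, header_line, debug_prefix=""):
    """Hypothetical helper: map a '#---' header line to an output file name.

    Roughly mirrors the awk rule: split the line on whitespace, drop the
    marker (field 1), join the remaining fields with '_' and lowercase.
    Returns None for a blank (non-printing) section.
    """
    fields = header_line.split()          # no-arg split() breaks on runs of whitespace
    title = "_".join(fields[1:]).lower()  # fields[0] is the "#---" marker itself
    if not title:
        return None                       # blank name: nothing gets written
    return debug_prefix + filename + "." + title


print(section_filename("code.mat", "#--- Section A: Grobble the Frops"))
```

# The printed name matches the example given in the Docs block at the top,
# and a bare "#---" line yields None, i.e. a non-printing section.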
#!/usr/bin/env python3
#sectionate.py
# same as sectionate.awk, but in python
# actually there's a couple of small behaviour differences:
# * an empty section still makes a file (arguably more sensible)
# * does *not* compress spaces in a section name like the awk version does (definitely more sensible)
# * possibly there will be other differences on Windows because of line endings?
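import re
import sys

DEBUG = ""  #"dbgPY"
MARKER = "#---"
SPACES = re.compile(r"\s")  #python's re has no POSIX classes like [[:space:]]; \s is the closest equivalent

files = sys.argv[1:]
if not files:
    files = ["-"]  # like awk, read stdin when no files are given

section = None  # the currently open output file, or None in a non-printing section
for FILENAME in files:
    # open("-") would look for a file literally named "-", so map it to stdin
    with (sys.stdin if FILENAME == "-" else open(FILENAME)) as file:
        for line in file:
            if line.startswith(MARKER):
                if section is not None:
                    section.close()
                section = SPACES.sub("_", line[len(MARKER):].strip().lower())
                if section:
                    section = DEBUG + FILENAME + "." + section
                    section = open(section, "w")
                else:
                    section = None
            else:
                if section:
                    print(line, end="", file=section)  #TODO: is it better to end="" here or to strip() the input line
# close the last section, which no later marker closes for us
if section is not None:
    section.close()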
# TODO: this is probably better written as a generator around generators:
# I had to fake a state machine in awk because awk doesn't have flexible notions of states and substates,
# whereas generators/coroutines are exactly that.
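# One possible shape for that generator version, as a sketch only. The names
# sections, grouped and demo are hypothetical, and the bodies are collected in
# memory rather than written to files, to keep the example self-contained.
# It reuses the naming rule from sectionate.py (strip, lowercase, each
# whitespace character becomes "_").

```python
import re


def sections(lines, marker="#---"):
    """Yield (title, line) pairs for every line inside a named section.

    The generator's local variable `title` plays the role of the awk
    version's mode/section state; blank-named sections yield nothing.
    """
    title = None  # lines before the first marker belong to no section
    for line in lines:
        if line.startswith(marker):
            title = re.sub(r"\s", "_", line[len(marker):].strip().lower()) or None
        elif title is not None:
            yield title, line


def grouped(lines, marker="#---"):
    """Collect section bodies keyed by title; same-named sections are concatenated."""
    out = {}
    for title, line in sections(lines, marker):
        out.setdefault(title, []).append(line)
    return {title: "".join(body) for title, body in out.items()}


demo = [
    "#--- Part I\n",
    "x = 1\n",
    "#---\n",
    "# scratch, not exported\n",
    "#--- Part II\n",
    "y = 2\n",
]
print(grouped(demo))
```

# Note that concatenating same-named sections follows the awk behaviour;
# sectionate.py as written above reopens the file with "w" and so overwrites.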