kousu/README

## README
...possibly python wins this one.

## sectionate.awk
#!/usr/bin/awk -f
# sectionate.awk
# usage: sectionate code.mat [code2.mat, ...]
#
# Find and split up marked sections in text files or code
# to separate sub-files.
# <nick@kousu.ca> 2-Clause BSD. No Warranty. Back up your shit before use.

# Docs:
# Used in conjunction with make and LaTeX's listings package you can
# without much effort approximate something like literate programming
# (e.g. Sweave + LyX or {IPython, Mathematica, Maple} Notebooks), while
# maintaining a standard, complete, plain text code file that gets along
# nicely with plain text editors and plain text source control.
#
# Section markers look like
# "#--- Section A: Grobble the Frops"
# and result in filenames like
# "code.mat.section_a:_grobble_the_frops"
#
# If there are sections you *do not* want to include, for example, blocks of
# comments or utility code, simply blank their names (i.e. leave empty/as whitespace).
#
# Tips:
# * If you still want to label a blanked section, the quickest way to do this
#   is just to move the section name to a comment on the next line.
#   e.g. transform
#       "#--- part iv"
#   to
#       "#---
#        #part iv"
# * name sections identically to make all those sections end up in the same
#   file, in their original order.  I didn't write this feature, it is a
#   side-effect of awk's awesome 80%-of-your-needs-in-one-place design.
#   But consider in a situation like that if it's better to explicitly make
#   separate code files instead of weirdly interleaving subprojects.
# * to sectionate output from your code too, make your code print
#  section headers, redirect its stdout to a file, and run sectionate on that.
# * Remove all blank lines between section markers and following code
#   so that your output sections pack tightly into your LaTeX or whatever.

# TODO:
# [x] lowercase section names
# [ ] make what marks a section break configurable (environment variable?)
# [x] support non-printing sections
# [ ] Figure out a way not to have to expensively use split-rejoin; tho compared to all the other ops awk does this one is not super expensive.


############################################################################
## Code
#
# Design:
# awk constrains all our processing to happen in sequential steps done to whole records
# (well, we can write loops and things within those sequential steps, but it's not idiomatic)
# With these constraints in mind, the way this design works is a simple state machine.
#  RS is left default (newlines), so all the rules below run for each line,
#  and through changing the value of mode and section, implement that state machine.

# A simple diagram of the state transitions mode has:
# -> HEADER <-> CONTENT
# That is:
#  initially, the state is undefined
#  when the first #--- is run into, the state switches into "HEADER"
#  then the code which handles "HEADER" records kicks in and extracts the section title
#  then the state switches immediately to "CONTENT" (and processing of the header record runs off the end)
#  then the code which handles "CONTENT" kicks in on the next record, and performs the splitting
#  and the state switches back to HEADER on seeing #--- again
#
# There's also a sub-mode: when in CONTENT, section says where output goes to.
# However, if we are in a non-printing section (one with no name) then section
# is blank and we don't print.
# (when not in CONTENT section is used as working space and/or ignored and
#  I make no promises that it contains anything sensible--which is a bit messy I admit)
#


BEGIN {
  DEBUG="" #"dbgAwk"

  mode    = -1
  section = -1
}


/^#---/ {
  mode = "HEADER"
}

# when we're within a section
# output the lines in that section to the file named by section
mode == "CONTENT" {
  if(section) {
    print > section  #remember, 'print' means "print $0" means "print the entire current record"
    # subtlety: ">" doesn't mean quite the same thing as it does in bash
    #  because > vs >> translates to open()'s O_TRUNC vs O_APPEND (both create the file if needed), and awk holds files open unless you explicitly close() them (by string name, not by fd; awk doesn't expose fds.. does it?)
    #(and this is a good, elegant, thing, though the semantic overloading is unfortunate)
    # So using a single > here means when rerunning the whole awk program, the output files are written from scratch (which is what we want!) but still contain all the lines we want
  }
}

# The code to handle HEADER happens *after* that for CONTENT,
# because it (has to!) unconditionally sets the mode to CONTENT
# and if it was written first that would make awk process the HEADER line as if
# it were CONTENT, i.e. print the marker line itself into the new section.
mode == "HEADER" {
  # This code chunk in python, in case that helps you read it:
  # title=line.split(maxsplit=1)[-1].strip().lower()
  #
  # awk doesn't have split(maxsplit=1) available,
  # so I need to split and rejoin (join isn't even a built in!!
  # I had to get it from the gnu people!)

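  # split on runs of whitespace; the separator " " means awk's default field splitting,
  # which also strips leading/trailing whitespace (and leaves the global FS alone)
  n = split($0, arySection, " ")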
  section = tolower(join(arySection, 2, n, "_"))

  if(section) {
    # if section is non-blank, stretch it to include the FILENAME we are sectionating
    section=(DEBUG FILENAME "." section)
  }

  mode = "CONTENT"
}


#######################################
# Utilities

# join.awk --- join an array into a string
# <https://www.gnu.org/software/gawk/manual/html_node/Join-Function.html>
function join(array, start, end, sep,    result, i)
{
    if (sep == "")
       sep = " "
    else if (sep == SUBSEP) # magic value
       sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
}
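
The header-to-filename mapping described in the Docs comments above can be sketched in Python as follows. This is only an illustration: `section_filename` is a hypothetical helper that is not part of either script, and it assumes the marker is the first whitespace-delimited field of the line, as in the awk rules above.

```python
def section_filename(source, header):
    """Return the output filename for a marker line, or "" for a blank section."""
    fields = header.split()                # split on runs of whitespace, stripping the ends
    name = "_".join(fields[1:]).lower()    # drop the "#---" marker, rejoin with "_"
    return f"{source}.{name}" if name else ""


print(section_filename("code.mat", "#--- Section A: Grobble the Frops"))
# prints: code.mat.section_a:_grobble_the_frops
```

A header with nothing after the marker yields an empty string, which corresponds to a non-printing section.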

## sectionate.py
#!/usr/bin/env python3
#sectionate.py
# same as sectionate.awk, but in python
# actually there are a few small behaviour differences:
# * an empty section still makes a file (arguably more sensible)
# * does *not* compress spaces in a section name like the awk version does (definitely more sensible)
# * sections with identical names overwrite each other (each header reopens the file with "w") instead of accumulating in one file
# * possibly there will be other differences on Windows because of line endings?

import sys

DEBUG="" #"dbgPY"

MARKER = "#---"
import re
SPACES = re.compile(r"\s") # Python's re has no POSIX classes like [[:space:]]; \s is the equivalent

files = sys.argv[1:]
if not files:
    files = ["-"]

file = None
section = None
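for FILENAME in files:
    # "-" means stdin; open("-") would look for a file literally named "-"
    with (sys.stdin if FILENAME == "-" else open(FILENAME)) as file: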
        for line in file:

            if line.startswith(MARKER):
                if section is not None:
                    section.close()

                section = SPACES.sub("_", line[len(MARKER):].strip().lower())
                if section:
                    section = DEBUG+FILENAME+"."+section
                    section = open(section, "w")
                else:
                    section = None

            else:
                if section:
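                    print(line, end="", file=section) #TODO: is it better to end="" here or to strip() the input line

# close the last section's file, which no later marker will do for us
if section is not None:
    section.close()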

# TODO: this is probably better written as a generator around generators
# because the reason I had to fake a state machine in awk was because awk doesn't have flexible notions of states and substates
# whereas generators/coroutines are exactly that.
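
One possible shape for the generator idea in the TODO above, as a minimal sketch rather than a finished rewrite. The `sections` function is a hypothetical name; it uses a single generator (not the nested generators the TODO imagines) and collects each section's lines in memory instead of writing files, so that the grouping logic can be seen on its own.

```python
import io
import re

MARKER = "#---"
SPACES = re.compile(r"\s")


def sections(lines):
    """Yield (name, body_lines) per marked section; text before the first marker is skipped."""
    name, body = None, []
    for line in lines:
        if line.startswith(MARKER):
            if name is not None:
                yield name, body           # the previous section ends at this marker
            name = SPACES.sub("_", line[len(MARKER):].strip().lower())
            body = []
        elif name is not None:
            body.append(line)
    if name is not None:
        yield name, body                   # flush the final section


demo = io.StringIO("preamble\n#--- Part One\na = 1\n#---\n# scratch\n#--- Part Two\nb = 2\n")
for name, body in sections(demo):
    if name:                               # a blank name marks a non-printing section
        print(name, body)
```

The caller decides what to do with blank-named sections and where to write each body, which keeps the state tracking in one place.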