@dunn
Created July 8, 2012 07:15
BBEdit/TW filter for pandoc header IDs
#!/usr/bin/perl
# written by Alex Dunn in 2012
# this code may be used for any purpose whatever
# this script is shitty and will not always produce correct header
# identifiers; everything that should occur is documented at:
# http://johnmacfarlane.net/pandoc/README.html#header-identifiers-in-html-latex-and-context
use strict;
use warnings;
my $header;
# grab the text
$header = <>;
# remove one set of matching underscores (italics)
$header =~ s/(.*?)_(.*?)_(.*?)/$1$2$3/g;
# strip everything but alphanumeric characters, underscores,
# periods, spaces, and hyphens.
# the list of accented characters is from:
# http://stackoverflow.com/a/6664820/1431858
$header =~ s/[^âãäåæçèéêëìíîïðñòóôõøùúûüýþÿıA-Za-z0-9-\_\.\ ]//g;
# remove en- and em-dashes (typed as -- and ---)
$header =~ s/-{2,3}//g;
# dashes for spaces
$header =~ s/\ /-/g;
# LOWER CASE
$header = lc $header;
print $header;
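
For quick checking outside BBEdit, the steps above can be wrapped in a subroutine. This is only a sketch: the name `header_id` is hypothetical, and the accented-character list is left out so the example stays ASCII-only. The sample header is the one behind the README anchor linked at the top of the script.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the filter's steps (not part of the original script).
sub header_id {
    my ($header) = @_;
    $header =~ s/_(.*?)_/$1/g;          # drop paired underscores (italics)
    $header =~ s/[^A-Za-z0-9_. -]//g;   # ASCII-only here: keep alphanumerics, _ . space and -
    $header =~ s/-{2,3}//g;             # drop en- and em-dashes typed as -- and ---
    $header =~ s/ /-/g;                 # spaces become hyphens
    return lc $header;                  # lower case
}

print header_id("Header identifiers in HTML, LaTeX, and ConTeXt"), "\n";
# prints "header-identifiers-in-html-latex-and-context"
```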
@dunn
Author

dunn commented Aug 23, 2012

Any number of matching underscores (and asterisks) can be removed by using a while loop. But the regex needs to be a bit more sophisticated anyway, because in-word underscores (the_frog is_great) do not trigger italics.
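
A minimal sketch of that idea, assuming a hypothetical helper named `strip_emphasis` that is not part of the filter above. It loops until no delimiter pair is left, and it only treats underscores as emphasis when they are not flanked by alphanumerics on the outside. It approximates pandoc's emphasis rules rather than reimplementing them.

```perl
use strict;
use warnings;

# Hypothetical helper: strip matching emphasis delimiters, one layer per pass.
sub strip_emphasis {
    my ($text) = @_;
    my $changed = 1;
    while ($changed) {
        $changed = 0;
        # Asterisks: the delimiters must hug non-space text on the inside.
        $changed = 1 if $text =~ s/\*(?=\S)(.+?)(?<=\S)\*/$1/g;
        # Underscores: same, but skip in-word underscores such as the_frog.
        $changed = 1
            if $text =~ s/(?<![A-Za-z0-9])_(?=\S)(.+?)(?<=\S)_(?![A-Za-z0-9])/$1/g;
    }
    return $text;
}

print strip_emphasis("_The Frog_ is **great**"), "\n";   # prints "The Frog is great"
print strip_emphasis("the_frog is_great"), "\n";         # prints "the_frog is_great"
```

Each successful substitution removes two characters, so the loop always terminates; nested delimiters such as `**bold**` simply take one extra pass.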
