@bpj
Created November 19, 2015 15:23
Extract HTML-like attributes from an arbitrary string with CSS-like shortcuts

Note: This documents the usage of both the perl and the python version. I have made the two versions work as alike as I could. In order not to favor either side, hashes/dictionaries are called "associative arrays" and depicted as JSON objects in all their ugliness! :-)

The string2attrs function parses a string, e.g. a title string from a Pandoc Link or Image element, with attributes similar to those used with Pandoc Code, CodeBlock and Header elements (minus the braces) into an associative array of key--value pairs. Something like

"#foo .bar lang=la baz='biz buz'"

becomes

{"baz":"biz buz","id":"foo","lang":"la","class":["bar"]}

(class comes back as an array because it is listed in as_list by default; see below.)
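The example above can be reproduced with the following condensed Python 3 sketch. It is not the gist's code verbatim: the name parse_attrs is hypothetical, errors are raised as ValueError instead of exiting, and only the regular expression is carried over from the implementations further down.

```python
import re

# Same pattern as the gist's attr_re, compiled with explicit flags.
ATTR_RE = re.compile(r'''
    (?<!\S)                               # start of string or after whitespace
    (   [^\s\w] (?= [^\s="']+ (?!\S) )    # shortcut prefix such as # or .
      | [^\s="']+ (?= = )                 # or an attribute name before =
    )
    =?
    (   ' [^']* (?: '' [^']* )* '         # single-quoted value
      | " [^"]* (?: "" [^"]* )* "         # double-quoted value
      | [^\s"']+                          # unquoted value
    )
    (?!\S)
  | ( \S+ )                               # anything else is an error
''', re.VERBOSE | re.DOTALL)


def parse_attrs(string, default=None, short=None, as_list="class"):
    """Hypothetical condensed port of string2attrs (illustration only)."""
    short = {"#": "id", ".": "class"} if short is None else short
    attrs = dict(default or {})
    to_list = as_list.split()
    for key, val, bad in ATTR_RE.findall(string):
        if bad:
            raise ValueError("invalid substring: %r" % bad)
        if val[0] in "'\"":
            # Drop the surrounding quotes and collapse doubled ones.
            val = val[1:-1].replace(val[0] * 2, val[0])
        name = short.get(key, key)
        if name in to_list:
            old = attrs.get(name, [])
            attrs[name] = (old if isinstance(old, list) else [old]) + [val]
        else:
            attrs[name] = val
    return attrs


result = parse_attrs("#foo .bar lang=la baz='biz buz'")
```

With the default as_list of 'class', the class value is returned as a one-element list rather than a bare string.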

The attribute 'syntax' is rather more permissive than Pandoc, HTML or XML attributes in terms of which characters are permitted and no validation is performed. This is intentional.

  • The first, unnamed, argument is the string to be parsed.
  • default is an associative array giving default values for attributes.
  • alias (spelled short in both implementations below) is an associative array mapping aliases (which may be punctuation characters or 'words') to full attribute names. The default is {".":"class","#":"id"}. Note that if you want to define your own (e.g. {":":"lang","l":"lang"}) you have to include the id and class ones explicitly too, since your value replaces the defaults rather than being merged with them.
  • as_list is a string with whitespace-separated attribute names. The attributes given here can be repeated, and their values will be returned as an array under the respective associative array key. Not surprisingly the default is 'class'. Again the default will be overwritten by your custom value so you will have to include class if you want it. For attributes not given here multiple values will overwrite each other.
  • moniker is whatever should be used in error messages instead of "string with attributes".
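Because custom alias and as_list values replace the defaults instead of extending them, it can help to build them from the defaults explicitly. The following is a minimal sketch; the names DEFAULT_ALIASES, my_aliases and my_as_list are made up for illustration, and the rel attribute is only an example. The commented call uses the keyword short, which is the name the implementations in this gist give the alias argument.

```python
# Defaults as documented above; copied here because the function does not
# merge them into a custom mapping.
DEFAULT_ALIASES = {"#": "id", ".": "class"}

# Start from the defaults, then add custom aliases.
my_aliases = dict(DEFAULT_ALIASES)
my_aliases.update({":": "lang", "l": "lang"})

# as_list is a whitespace-separated string; keep class if you still want
# repeated classes collected into an array.
my_as_list = " ".join(["class", "rel"])

# Intended call (assumes string2attrs from this gist is in scope):
# attrs = string2attrs(title, short=my_aliases, as_list=my_as_list)
```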

Attributes in the string come in two flavors:

  • NAME=VALUE

    where NAME is an unquoted string not containing any of the characters " ' = or whitespace.

    VALUE can be any of:

    • a string in single quotes,
    • a string in double quotes,
    • an unquoted string not containing any of the characters " ' or whitespace.

    To include a quote of the same type in a quoted value you should double it: 'don''t' or "don""t". The purpose of this escaping style is that you already need to backslash quote characters inside a pandoc title string, so you will type "boast=\"don\"\"t\"" which is easier on the human parser than two levels of backslash-escaping would be (i.e. the unsupported alternative syntax "boast=\"don\\\"t\""). It also reminds you that unescaping of things like \n is not performed by the filter (it is currently not performed by pandoc either!).

    If NAME occurs as a key in the alias argument associative array it will be replaced with the value of that key.

  • *VALUE

    where * is any punctuation character (actually anything matching [^\w\s]). If the punctuation character occurs as a key in the alias argument associative array it will be replaced with the value of that key, e.g. as if you had written class=VALUE instead of .VALUE. Otherwise the punctuation character itself becomes the key, as if you actually had typed e.g. *=foo. That's not valid as either an HTML or a pandoc attribute name, but I don't consider it worth fixing; let pandoc/tidy/whatever, which already checks for this, complain instead! VALUE must be an unquoted string not containing any of the characters " ' or whitespace and not starting with =.

    This is analogous to the CSS-selector style shortcuts supported by pandoc #id and .class except that you can define your own prefixes in the alias associative array, e.g.

    {":":"lang","#":"id","@":"href",".":"class"}
    

    The mechanism is rather shallow. A punctuation character/name key is simply replaced with whatever it is mapped to in the alias argument associative array if anything. You could actually write #=id just as well as #id or id=id. I consider this a bug not worth fixing. On the other hand you could also e.g. map lingua to lang which I consider a feature.
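The two mechanisms described in this list, quote doubling and shallow alias replacement, can be sketched in isolation as below. The helper names unquote and resolve are hypothetical and do not appear in the gist; they only mirror what the text describes.

```python
def unquote(value):
    """Strip surrounding quotes and collapse doubled quote characters."""
    if len(value) >= 2 and value[0] in "'\"" and value[-1] == value[0]:
        quote = value[0]
        return value[1:-1].replace(quote + quote, quote)
    return value  # unquoted values are returned unchanged


def resolve(key, aliases):
    """Replace key with its alias target if there is one, else keep it."""
    return aliases.get(key, key)


aliases = {":": "lang", "#": "id", "@": "href", ".": "class", "lingua": "lang"}

single = unquote("'don''t'")       # doubled single quote inside single quotes
double = unquote('"don""t"')       # doubled double quote inside double quotes
names = [resolve(k, aliases) for k in ["#", ".", ":", "lingua", "*", "title"]]
```

Keys without an alias, such as * or title here, are passed through unchanged, which is why an unmapped punctuation character ends up as the attribute name itself.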

#!/usr/bin/env perl
use strict;
use warnings;
no warnings qw[ uninitialized numeric ];
use Carp qw[ carp croak confess cluck ];
# Function to convert a string with HTML-attribute like key--
# value pairs into an associative array. Originally intended to
# let filters overload the title string of pandoc links, but I
# can easily imagine other uses.
#
# Supports default values and permits (customizable) aliases like
# `#foo` instead of `id=foo` and `.bar` instead of `class=bar`
# (the defaults) or `lingua=la` instead of `lang=la`.
#
# Comes in a perl and a python version.
my $attr_re = qr{
    (?<!\S) # preceded by whitespace or start of string
    ( # capture key
        # a shortcut like #id or .class
        # comment out to disable punctuation aliases without equals sign
        [^\s\w] (?= [^\s="']+ (?! \S ) ) |
        # an attribute name
        [^\s="']+ (?= = )
    )
    =?
    ( # capture value
        ' [^']* (?: '' [^']* )* ' # a single-quoted string
      | " [^"]* (?: "" [^"]* )* " # or a double-quoted string
      | [^\s"']+ # or unquoted, quoteless non-whitespace
    )
    (?!\S)
  | ( \S+ ) # a gremlin substring
}msx;
sub string2attrs {
    my($string, %p) = @_;
    # Caller-supplied parameters override these defaults wholesale.
    %p = (
        default => +{}, short => +{ '#' => 'id', '.' => 'class' },
        as_list => 'class', moniker => 'string with attributes',
        %p,
    );
    my %attrs = %{ $p{default} };
    my %to_list = map {; $_ => 1 } split /\s+/, $p{as_list};
    my @matches = $string =~ /$attr_re/g;
    $string =~ s/\A\s+|\s+\z//g; # strip
    if ( length $string and !@matches ) {
        croak "Invalid $p{moniker}: $string";
    }
    while ( @matches ) {
        # Each match contributes three captures: key, value, gremlin.
        my($key, $val, $bad) = splice @matches, 0, 3;
        if ( length $bad ) {
            croak "Invalid substring in $p{moniker}: $bad ($string)";
        }
        # Strip the surrounding quotes; /s lets a quoted value span lines,
        # as $attr_re allows.
        if ( $val =~ s{\A(["'])(.*)\1\z}{$2}s ) {
            my $quote = $1;
            # Collapse doubled quote characters ('' or "").
            $val =~ s/\Q$quote$quote\E/$quote/g;
        }
        my $name = $p{short}{$key} || $key;
        if ( $to_list{$name} ) {
            my $values = $attrs{$name} ||= [];
            # Wrap a scalar default in an array before appending.
            unless ( 'ARRAY' eq ref $values ) {
                $values = $attrs{$name} = [$values];
            }
            push @$values, $val;
        }
        else {
            $attrs{$name} = $val;
        }
    }
    return \%attrs;
}
#!/usr/bin/env python2
# Function to convert a string with HTML-attribute like key--
# value pairs into an associative array. Originally intended to
# let filters overload the title string of pandoc links, but I
# can easily imagine other uses.
#
# Supports default values and permits (customizable) aliases like
# `#foo` instead of `id=foo` and `.bar` instead of `class=bar`
# (the defaults) or `lingua=la` instead of `lang=la`.
#
# Comes in a perl and a python version.
import sys
import re
attr_re = r'''(?msx)
    (?<!\S) # preceded by whitespace or start of string
    ( # capture key
        # a shortcut like #id or .class
        # comment out to disable punctuation aliases without equals sign
        [^\s\w] (?= [^\s="']+ (?! \S ) ) |
        # an attribute name
        [^\s="']+ (?= = )
    )
    =?
    ( # capture value
        ' [^']* (?: '' [^']* )* ' # a single-quoted string
      | " [^"]* (?: "" [^"]* )* " # or a double-quoted string
      | [^\s"']+ # or unquoted, quoteless non-whitespace
    )
    (?!\S)
  | ( \S+ ) # a gremlin substring
'''
attr_re = re.compile(attr_re)
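def string2attrs(
        string,
        default=None,
        short=None,
        as_list='class',
        moniker='string with attributes'):
    # Build fresh containers per call instead of sharing mutable defaults.
    attrs = dict(default or {})
    if short is None:
        short = {'#': 'id', '.': 'class'}
    to_list = as_list.split()
    matches = attr_re.findall(string)
    # Strip before testing so that a whitespace-only string is not an
    # error, as in the perl version.
    if not len(matches) and len(string.strip()):
        sys.exit( "Invalid %s: %s" % ( moniker, string.strip() ) )
    for attr in matches:
        (key, val, bad) = attr
        if len(bad):
            sys.exit( "Invalid substring in %s: %s (%s)"
                % ( moniker, bad, string.strip() ) )
        # Strip the surrounding quotes; (?s) lets a quoted value span
        # lines, as attr_re allows.
        temp = re.sub(r'''(?s)^(['"])(.*)\1$''', r'\2', val)
        if len(temp) < len(val):
            # Collapse doubled quote characters ('' or "").
            val = temp.replace(val[0]+val[0], val[0])
        name = short.get(key) or key
        if name in to_list:
            values = attrs.get(name) or []
            # Wrap a scalar default in a list instead of splitting it
            # into characters, as the perl version does.
            if not isinstance(values, list):
                values = [values]
            values = list(values)
            values.append(val)
            attrs[name] = values
        else:
            attrs[name] = val
    return attrs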