Tagging Math Expressions in PDF Documents with LuaLaTeX
-- element tree (is a bit buggy for the {decl = false} option)
local etree = require "etree"
-- nodelist visualization (optional)
-- local viz = require "viznodelist"

-- placeholder: convert a math node list into a Content MathML element
local function convertToMathML(head)
  return {tag = "not implemented"}
end

-- create content MathML for every math formula
luatexbase.add_to_callback("mlist_to_hlist",
function(head, display, penalty)
  texio.write_nl('NEW mathlist')
  local result = convertToMathML(head)
  if result ~= nil then
    local et = etree.ElementTree({tag = "math", result}, {decl = false})
    local pdf = node.new("whatsit", "pdf_annot")
    local buffer = etree.StringBuffer()
    et:write(buffer)
    pdf.data = '/Subtype /MathML /Contents (' .. tostring(buffer) .. ')'
    head = node.insert_before(head, head, pdf)
  end
  return node.mlist_to_hlist(head, display, penalty)
end, "content MathML generator")

-- add content MathML as a PDF annotation
local whatsit = node.id('whatsit')
local hlist = node.id('hlist')
local vlist = node.id('vlist')
local math_node = node.id('math')
-- look up the whatsit subtype instead of hard-coding 15,
-- because the number differs between LuaTeX versions
local pdf_annot = node.subtype('pdf_annot')

local function add_size_to_annot(head, hbox)
  while head do
    local typ = head.id
    if typ == vlist then
      add_size_to_annot(head.head, hbox)
    elseif typ == hlist then
      add_size_to_annot(head.head, {width=head.width,height=head.height,depth=head.depth})
    elseif typ == whatsit and head.subtype == pdf_annot and
           string.sub(head.data, 1, 16) == '/Subtype /MathML' then
      -- inline math: measure from the begin-math node to the matching end-math node
      if head.prev ~= nil and head.prev.id == math_node and head.prev.subtype == 0 then
        local tail = head
        for test_node in node.traverse_id(math_node, head.next) do
          if test_node.subtype == 1 then
            tail = test_node
            break
          end
        end
        local w, h, d = node.dimensions(head.prev, tail)
        hbox = {width=w,height=h,depth=d}
      end
      if hbox then
        --texio.write_nl(string.format("add height %gpt, width %gpt, depth %gpt",hbox.height / 2^16, hbox.width / 2^16, hbox.depth / 2^16))
        head.width = hbox.width
        head.height = hbox.height
        head.depth = hbox.depth
      end
    end
    -- texio.write_nl('found node '..node.type(head.id))
    head = head.next
  end
end

local vpack_counter = 1
-- the callback name is an assumption (taken from the "vpack" naming below)
luatexbase.add_to_callback("vpack_filter",
function(head)
  add_size_to_annot(head, nil)
  -- viz.nodelist_visualize(head, "vpack"..vpack_counter..".gv")
  vpack_counter = vpack_counter + 1
  return head
end, "find math bounding box")
\pdfcompresslevel=0 % to make everything visible in the pdf
%% a collection of Hans Hagen's MathML examples, and some additions
$$b \equiv b$$
$$1 + x \over 1 - x$$
$$x \ge 4$$
$$a b$$
$$ x\in\mathbb{N}$$
$$ 1A2C_{16} + 0101_{16} = 1B2D_{16}$$
$$ 2+5i\in\mathbb{C}$$
%% eq, neq, gt, lt, geq, leq
$$ a\le b\le c$$
%% equivalent, approx, implies
$$ a+b \equiv b+a $$
$$ 3.14159 \approx \pi $$
%% minus, plus
$$37 -x$$
%% times
%% divide
$$1-{1 \over 3}+{1\over 5}-{1\over 7}+\ldots = \frac{\pi}{4}$$
$${-b - \sqrt{a} \over (b-b) -\sqrt{a}}$$
%%$${-b - -b - \sqrt{a} \over (b-b)- -b -\sqrt{a}}$$
%% power
$$x^2 + \sin^2 x$$
%% root, degree
$$\sqrt[3]{64} = 4$$
%% sin, cos, tan, cot, csc, sec, ..
$$\sin(x+y)=\sin x \cos y + \cos x \sin y$$
$$\cos\pi = -1$$
%% log, ln, exp
$$\ln(e+2)\approx 1.55$$
$$e^2=7.3890560989307$$ %% is false!
%% quotient, rem
$$ \lfloor a/b \rfloor $$
%% factorial
$$ n! = n\times(n-1)\times(n-2)\times\cdots\times 1$$
%% min, max, gcd, lcm
$$z=\min\left\{(x+y),2x,{1\over y}\right\}$$
%% and, or, xor, not
$$1001_2 \wedge 0101_2=0001$$
%% set, bvar
$$ \left\{1,4,8\right\}\neq$$
$$ \left\{x | 2<x<\right\}$$
%% list
%% union, intersect, ...
$$U\cup V$$
$$U\cap V$$
$$v\in V$$
$$u\notin V$$
%% interval
%% inverse
$$ \sin^{-1}x$$
%% sum, product, limit, lowlimit, uplimit, bvar
$$ \sum_{i=1}^{n} {1 \over x} $$
$$ \prod_{i} {1 \over x}$$
$$ \prod_{x\in\mathbb{R}}f(x)$$
$$ \lim_{x\rightarrow 0}\sin x$$
%% int, diff, partialdiff, bvar, degree
$${d \left(\int_p^q f(x,a)dx \right) \over da}$$
$${d^2f(x) \over dx^2}$$
$${d^4f \over x df^2}$$
$${d^kf(x,y) \over x df(x,y)^m}$$
$${d^{m+n}f(x,y) \over x df(x,y)^m}$$
%% fn
fkuehnel commented Mar 11, 2012

This example demonstrates how Lua(La)TeX can be used to attach math expression annotations to PDF documents. The purpose here is to tag each math expression with its bounding box. It would obviously be valuable to also annotate the (La)TeX math expressions with the proper Content MathML, but that goes far beyond the simple program snippets presented here.
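
The annotation data in the snippet places the serialized MathML inside a PDF literal string, where backslashes and parentheses carry special meaning. The following is a minimal sketch of how that string could be built safely in plain Lua. The helper names `pdf_escape` and `mathml_annot_data` are hypothetical and not part of LuaTeX or etree, and the `/Subtype /MathML` key is simply taken over from the snippet.

```lua
-- Hypothetical helper: escape a string for use inside a PDF literal
-- string "( ... )". Backslashes and parentheses get a leading backslash.
local function pdf_escape(s)
  -- the outer parentheses drop gsub's second return value (the match count)
  return (s:gsub("[\\()]", "\\%0"))
end

-- Hypothetical helper: build the annotation data string the same way
-- the snippet does, but with the MathML text escaped first.
local function mathml_annot_data(mathml)
  return "/Subtype /MathML /Contents (" .. pdf_escape(mathml) .. ")"
end

print(mathml_annot_data("<math><ci>f(x)</ci></math>"))
-- prints: /Subtype /MathML /Contents (<math><ci>f\(x\)</ci></math>)
```

In the callback, the result of such a helper would be assigned to the `data` field of the `pdf_annot` whatsit in place of the plain string concatenation.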

In my experience, it is quite easy to generate the proper Content MathML equivalents for simple LaTeX math formulas. However, an approach based on context-free grammar parsers (e.g. lpeg) does not carry over well to the breadth of LaTeX documents, in which the meaning of a math expression is often context sensitive!
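
To illustrate the simple case, here is a small sketch in plain Lua that turns a sum or difference such as `37 - x` into Content MathML. It works on a plain string rather than on the LuaTeX math node list, uses hand-written matching instead of lpeg, and the function names `tokenize`, `operand`, and `to_content_mathml` are hypothetical. It deliberately rejects juxtaposition such as `a b`, whose meaning (a product, a function application, or something else) cannot be decided without context.

```lua
-- Split an expression into numbers, identifiers, and + or - operators.
local function tokenize(expr)
  local tokens, pos = {}, 1
  while pos <= #expr do
    local tok = expr:match("^%d+%.?%d*", pos)  -- number
             or expr:match("^%a+", pos)        -- identifier
             or expr:match("^[+-]", pos)       -- operator
    if tok then
      tokens[#tokens + 1] = tok
      pos = pos + #tok
    elseif expr:match("^%s", pos) then
      pos = pos + 1                            -- skip whitespace
    else
      error("unexpected character at position " .. pos)
    end
  end
  return tokens
end

-- Wrap a number in <cn> and an identifier in <ci>.
local function operand(tok)
  if not tok or tok == "+" or tok == "-" then
    error("operand expected")
  end
  if tok:match("^%d") then
    return "<cn>" .. tok .. "</cn>"
  end
  return "<ci>" .. tok .. "</ci>"
end

-- Build left-associative <apply> elements for a chain of + and -.
local function to_content_mathml(expr)
  local tokens = tokenize(expr)
  local result = operand(tokens[1])
  for i = 2, #tokens, 2 do
    local op = tokens[i]
    if op ~= "+" and op ~= "-" then
      error("operator expected; juxtaposition is ambiguous without context")
    end
    local tag = (op == "+") and "<plus/>" or "<minus/>"
    result = "<apply>" .. tag .. result .. operand(tokens[i + 1]) .. "</apply>"
  end
  return result
end

print(to_content_mathml("37 - x"))
-- prints: <apply><minus/><cn>37</cn><ci>x</ci></apply>
```

A real converter would have to walk the math node list and carry information about the surrounding document, which is exactly where a purely grammar-driven approach runs out.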