I was recently tasked with scraping data from a few thousand Word documents, each of which contained nested docx documents referenced by w:altChunk elements in the body of the main ./word/document.xml file, like so:
<!-- word/document.xml -->
<w:document>
<w:body>
...
<w:altChunk r:id="AltChunkIda1c8521d-4233-44e2-8efb-1a2b3c4d5e6f"/>
...
</w:body>
</w:document>
The r:id attribute of the altChunk element refers to a relationship in the ./word/_rels/document.xml.rels file, like so:
<!-- word/_rels/document.xml.rels -->
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
...
<Relationship
Id="AltChunkIda1c8521d-4233-44e2-8efb-1a2b3c4d5e6f"
Target="/word/afchunk2.docx"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk"/>
...
</Relationships>
which in turn indicates that the nested docx file is located within the zip file at /word/afchunk2.docx, and that its relationship type is .../relationships/aFChunk.
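The lookup described above can be sketched with nothing but the standard library. This is only an illustration on trimmed-down copies of the two files shown above; the helper name altchunk_targets is my own and is not part of python-docx.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Trimmed-down copies of the two files shown above.
DOCUMENT_XML = f"""<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}">
  <w:body>
    <w:altChunk r:id="AltChunkIda1c8521d-4233-44e2-8efb-1a2b3c4d5e6f"/>
  </w:body>
</w:document>"""

RELS_XML = f"""<Relationships xmlns="{PKG_RELS_NS}">
  <Relationship
    Id="AltChunkIda1c8521d-4233-44e2-8efb-1a2b3c4d5e6f"
    Target="/word/afchunk2.docx"
    Type="{R_NS}/aFChunk"/>
</Relationships>"""


def altchunk_targets(document_xml, rels_xml):
    """Return (target, reltype) for every w:altChunk, in document order.

    Hypothetical helper for illustration only.
    """
    # Index the relationships by their Id attribute.
    rels = {
        rel.get("Id"): (rel.get("Target"), rel.get("Type"))
        for rel in ET.fromstring(rels_xml).iter(f"{{{PKG_RELS_NS}}}Relationship")
    }
    # Resolve the r:id of each altChunk element against that index.
    document = ET.fromstring(document_xml)
    return [
        rels[chunk.get(f"{{{R_NS}}}id")]
        for chunk in document.iter(f"{{{W_NS}}}altChunk")
    ]


print(altchunk_targets(DOCUMENT_XML, RELS_XML))
```

Note that the r:id attribute has to be read with its full namespace, which is the same long attribute name that shows up in the python-docx session further down.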
Finally, the content type of the target file is described in the file [Content_Types].xml, which contains default content types for contained files by their extension, and overrides for particular part files, such as the header part in this example:
<!-- [Content_Types].xml -->
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
...
<Default
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
Extension="docx"/>
...
<Override
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml"
PartName="/word/header1.xml"/>
...
</Types>
In particular, the [Content_Types].xml file contains a default content type of ... document.main+xml for nested part files with the .docx extension, which indicates that the part file is an XML file matching a particular schema. Furthermore, there were no overrides present for the specific altChunk part files (e.g. /word/afchunk2.docx) to indicate otherwise.
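To see why the nested file ends up with an XML content type, here is a rough standard-library sketch of the lookup rule: an Override for the exact part name wins, otherwise the Default for the file extension applies. The function name content_type_for is hypothetical, and the XML is a trimmed copy of the example above.

```python
import posixpath
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
WML = "application/vnd.openxmlformats-officedocument.wordprocessingml"

# Trimmed-down copy of the content types file shown above.
TYPES_XML = f"""<Types xmlns="{CT_NS}">
  <Default ContentType="{WML}.document.main+xml" Extension="docx"/>
  <Override ContentType="{WML}.header+xml" PartName="/word/header1.xml"/>
</Types>"""


def content_type_for(partname, types_xml):
    """Return the content type of a part, or None if nothing matches.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(types_xml)
    # An Override for the exact part name takes precedence.
    for override in root.iter(f"{{{CT_NS}}}Override"):
        if override.get("PartName") == partname:
            return override.get("ContentType")
    # Otherwise fall back to the Default for the file extension.
    extension = posixpath.splitext(partname)[1].lstrip(".").lower()
    for default in root.iter(f"{{{CT_NS}}}Default"):
        if default.get("Extension").lower() == extension:
            return default.get("ContentType")
    return None


print(content_type_for("/word/afchunk2.docx", TYPES_XML))
print(content_type_for("/word/header1.xml", TYPES_XML))
```

With no Override for /word/afchunk2.docx, the nested document falls through to the docx Default and is reported as document.main+xml.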
Because of this, when python-docx reads the outer .docx file, it tries to read each of the individual part files according to its content type. When it encounters the nested /word/afchunk2.docx, it attempts to load this file with an XML parser (lxml.etree), which of course throws an XMLSyntaxError.
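The failure is easy to reproduce without python-docx. A .docx file is a zip archive, so its bytes start with the zip signature rather than with XML. The sketch below uses the standard library's xml.etree parser as a stand-in for lxml.etree (so the exception is ParseError rather than XMLSyntaxError), and the helper names are made up for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def make_minimal_zip():
    """Build a tiny zip archive in memory as a stand-in for a nested .docx."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


def parses_as_xml(blob):
    """Return True if the bytes can be parsed as an XML document."""
    try:
        ET.fromstring(blob)
    except ET.ParseError:
        return False
    return True


nested_blob = make_minimal_zip()
print(nested_blob[:2])             # zip archives begin with the bytes b"PK"
print(parses_as_xml(nested_blob))  # the zip bytes are not well-formed XML
```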
The second problem is that python-docx does not implement the full Office Open XML Spec (which is over 5000 pages long!). In particular, it does not natively handle altChunk parts, which means it's up to us to extend python-docx to handle nested document chunks.
To handle these nested document chunks, we'll need to implement a new AltchunkPart class that inherits from the Part class. It will be responsible for converting the internal part file into a Python object, holding a reference to the returned object, and serializing the object when writing the whole document to file. We'll leverage the fact that .docx files can already be parsed by docx.Document, which returns an appropriate document object, and that the returned object can serialize itself again when saved to a stream. Hence we'll create a new file at docx/parts/altchunk.py with the following:
# encoding: utf-8
"""The |Altchunk| and closely related objects"""
from __future__ import absolute_import, division, print_function, unicode_literals

from io import BytesIO

from docx import Document
from ..opc.part import Part


class AltchunkPart(Part):
    """AltChunkPart for word document

    An AltChunk is a nested word document
    """

    def __init__(self, partname, content_type, element, package):
        super(AltchunkPart, self).__init__(
            partname, content_type, package=package
        )
        # store the parsed document in the _element instance attribute
        self._element = element

    @property
    def blob(self):
        # Let the Document element handle serialization
        stream = BytesIO()
        self._element.save(stream)
        return stream.getvalue()

    @property
    def element(self):
        """
        The nested document object held by this part.
        """
        return self._element

    @classmethod
    def load(cls, partname, content_type, blob, package):
        # Parse nested documents using the Document class
        element = Document(BytesIO(blob))
        return cls(partname, content_type, element, package)

    @property
    def part(self):
        """
        Part of the parent protocol, "children" of the document will not know
        the part that contains them so must ask their parent object. That
        chain of delegation ends here for child objects.
        """
        return self
This class is capable of handling nested docx files, but as of yet, python-docx does not know when to use it to handle a given part file. For this, we'll take advantage of the fact that, as of this writing (python-docx version 0.8.10), there is a hook in docx/__init__.py called part_class_selector, which is responsible for overriding the Part class that is used to handle a given part file depending on its content type and/or relationship type. We'll extend the hook so that it maps relationships with Type "aFChunk" to our new Part class, like so:
def part_class_selector(content_type, reltype):
    if reltype == RT.A_F_CHUNK:
        return AltchunkPart
    if reltype == RT.IMAGE:
        return ImagePart
    return None
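To make the role of the hook a little more concrete, here is a self-contained sketch of the dispatch as I understand it: the selector gets the first say, and if it returns None the loader falls back to a mapping keyed by content type. Everything in this sketch is a stand-in (the empty classes, the PART_TYPE_FOR dict and the choose_part_class function are hypothetical), so treat it as a model of the idea rather than python-docx's actual loader code.

```python
# Stand-in constants; the aFChunk value matches the Type shown in the .rels file above.
A_F_CHUNK = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk"
IMAGE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
DOCX_MAIN = (
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
)


class Part(object):
    """Stand-in for the generic part class."""


class DocumentPart(Part):
    """Stand-in for the part class normally used for document.main+xml."""


class ImagePart(Part):
    """Stand-in for the image part class."""


class AltchunkPart(Part):
    """Stand-in for the class defined in docx/parts/altchunk.py."""


def part_class_selector(content_type, reltype):
    if reltype == A_F_CHUNK:
        return AltchunkPart
    if reltype == IMAGE:
        return ImagePart
    return None


# Hypothetical fallback mapping keyed by content type.
PART_TYPE_FOR = {DOCX_MAIN: DocumentPart}


def choose_part_class(content_type, reltype):
    """Ask the selector first, then fall back to the content-type mapping."""
    selected = part_class_selector(content_type, reltype)
    if selected is not None:
        return selected
    return PART_TYPE_FOR.get(content_type, Part)


# The nested chunk has the document.main+xml content type, but its
# relationship type routes it to AltchunkPart instead of an XML-based part.
print(choose_part_class(DOCX_MAIN, A_F_CHUNK).__name__)
```

The point of checking the relationship type before the content type is that the content type alone is misleading here: the nested .docx inherits the XML default, and only the aFChunk relationship tells us it is really an embedded package.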
At this point, if we load a document containing aFChunks, we can query the body element of the document for altChunk elements and access the nested document (part file) like so:
>>> import docx
>>> document = docx.Document('my-favorite-file.docx')
>>> body = document.part.element.body
>>> children = body.getchildren()
>>> children
[<CT_P '<w:p>' at 0x2d47b6b0728>,
<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}altChunk at 0x2d47b696a08>,
<CT_SectPr '<w:sectPr>' at 0x2d47b6bfc78>,
<CT_P '<w:p>' at 0x2d47b6bfcc8>,
<CT_Tbl '<w:tbl>' at 0x2d47b6bfd18>]
>>> altchunk_element = children[1]
>>> altchunk_id = altchunk_element.get('{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id')
>>> chunk_part = document.part.related_parts[altchunk_id]
>>> nested_document = chunk_part.element
There are a number of things which are sub-optimal about this code, however. First, notice that the altChunk element is represented as a plain lxml.etree element (<Element {http://schemas....}altChunk at 0x2d47b696a08>) and not a python-docx content-type element. In particular, content-type elements like the paragraph (CT_P) have additional functionality, such as handy getters and setters for required attributes (e.g. the r:id attribute of the altChunk element), and can be referenced with ease by parent containers.
In python-docx, content-type classes are created by the almost mystical XML alchemy helpers found in docx/oxml/xmlchemy.py. Although the actual magic isn't for the faint of heart, its patterns are pretty easy to learn by scanning through a few classes that inherit from BaseOxmlElement, like CT_Body and CT_Num, which use the magical class attributes created by one of the dispatchable classes to define optional attributes and/or child elements of the content-type element wrapper.
While the above may be a bit abstract, let's make it real by defining a content type to handle the altChunk elements. Because the altChunk is so closely tied to the document element, I chose to define our new CT_AltChunk type in docx/oxml/document.py, as follows:
class CT_AltChunk(BaseOxmlElement):
    """`w:altChunk` element"""
    rId = RequiredAttribute('r:id', XsdString)
This class is a wrapper for the underlying XML element that has a new convenience property called rId, which returns the value of the r:id attribute. We will of course have to inform python-docx that w:altChunk nodes should be wrapped with our special wrapper by adding these lines to the docx/oxml/__init__.py file:
from .document import CT_AltChunk, CT_Body, CT_Document
register_element_cls('w:altChunk', CT_AltChunk)
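If the idea of an attribute declared as a class attribute feels opaque, the following standalone sketch shows the general pattern using an ordinary Python descriptor over a standard-library element. To be clear, this is not how xmlchemy is implemented (as far as I can tell it generates the properties on lxml custom element classes), and the classes below only borrow the names RequiredAttribute and rId for illustration.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


class RequiredAttribute(object):
    """Toy descriptor exposing a required XML attribute as a Python attribute."""

    def __init__(self, clark_name):
        # clark_name is the "{namespace}local" form used by ElementTree
        self._clark_name = clark_name

    def __get__(self, wrapper, owner=None):
        if wrapper is None:
            return self
        value = wrapper.element.get(self._clark_name)
        if value is None:
            raise ValueError("required attribute %s is missing" % self._clark_name)
        return value

    def __set__(self, wrapper, value):
        wrapper.element.set(self._clark_name, value)


class AltChunkWrapper(object):
    """Toy wrapper around a w:altChunk element (not a python-docx class)."""

    rId = RequiredAttribute("{%s}id" % R_NS)

    def __init__(self, element):
        self.element = element


xml = '<w:altChunk xmlns:w="%s" xmlns:r="%s" r:id="rId7"/>' % (W_NS, R_NS)
chunk = AltChunkWrapper(ET.fromstring(xml))
print(chunk.rId)       # reads the r:id attribute of the wrapped element
chunk.rId = "rId8"     # writes it back to the underlying XML
print(chunk.element.get("{%s}id" % R_NS))
```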
In addition, we may want to let python-docx know that zero or more of these CT_AltChunk elements may appear within a document body (CT_Body) by adding the following class attribute to the CT_Body class:
altChunk = ZeroOrMore('w:altChunk', successors=())
which will add the special .altChunk_lst property to CT_Body elements.
With these helpers added to python-docx, the above code becomes:
>>> import docx
>>> document = docx.Document('my-favorite-file.docx')
>>> body = document.part.element.body
>>> chunks = body.altChunk_lst
>>> chunks
[<CT_AltChunk at 0x2d47b696a08>]
>>> chunk_part = document.part.related_parts[chunks[0].rId]
>>> nested_document = chunk_part.element
In the definition of our custom CT_AltChunk class, we used one of the special dispatchable properties, which resulted in the convenient rId member holding the r:id attribute of the underlying XML element. There are several of these special dispatchable properties, which are used to represent either XML attributes or XML child elements.
The dispatchable types called OptionalAttribute and RequiredAttribute are used for XML attributes, and we already saw RequiredAttribute in use above. In addition, there are 5 dispatchable types which are used to describe the child elements contained in a parent element: OneAndOnlyOne, OneOrMore, ZeroOrMore, ZeroOrOne, and ZeroOrOneChoice.
To demonstrate these child dispatchable helpers, let's implement another feature of the docx spec which is not currently implemented by python-docx: forms. While the entire forms spec is too much to implement here, we'll implement one specific form element called w:ddList, which appears on page 1281 of the Office Open XML Spec. While it would be better to implement these classes based on the spec itself, we'll just use the example code snippet provided by the spec, which is copied here:
<w:ddList>
<w:default w:val="1" />
<w:result w:val="2" />
<w:listEntry w:val="One" />
<w:listEntry w:val="Two" />
<w:listEntry w:val="Three" />
</w:ddList>
From the example we can see that the dropdown list should have one or more w:listEntry children, and possibly a w:default and a w:result indicating the default value and the user selection if one has been made. To implement this element, we can use the ZeroOrOne and OneOrMore helpers as follows:
class CT_DDList(BaseOxmlElement):
    '''The w:ddList element'''
    default = ZeroOrOne("w:default")
    result = ZeroOrOne("w:result")
    listEntry = OneOrMore("w:listEntry")
and we could of course implement each of the child elements like so:
class ElementWithValue(BaseOxmlElement):
    value = RequiredAttribute('w:val', XsdString)


class CT_Default(ElementWithValue):
    pass


class CT_Result(ElementWithValue):
    pass


class CT_ListEntry(ElementWithValue):
    pass
We will of course have to register these classes with python-docx as we did above. Once we do so, python-docx will wrap our w:ddList elements with convenience properties such as el.default, el.result, and el.listEntry_lst, and convenience methods such as el.get_or_add_default() and el.add_listEntry().
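For readers who want to poke at the w:ddList structure without patching python-docx, here is a small standard-library sketch that reads the same information from the spec's example. The DropDownList class and its property names are hypothetical and deliberately different from the generated python-docx accessors: they return the w:val strings directly rather than wrapped child elements.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The spec's example, with the w: namespace declared so it parses on its own.
DDLIST_XML = f"""<w:ddList xmlns:w="{W_NS}">
  <w:default w:val="1" />
  <w:result w:val="2" />
  <w:listEntry w:val="One" />
  <w:listEntry w:val="Two" />
  <w:listEntry w:val="Three" />
</w:ddList>"""


def w(tag):
    """Expand a local name into the '{namespace}tag' form ElementTree uses."""
    return f"{{{W_NS}}}{tag}"


class DropDownList(object):
    """Hypothetical read-only view of a w:ddList element."""

    def __init__(self, element):
        self._element = element

    def _optional_val(self, tag):
        # Zero-or-one child: return its w:val, or None when it is absent.
        child = self._element.find(w(tag))
        return None if child is None else child.get(w("val"))

    @property
    def default_val(self):
        return self._optional_val("default")

    @property
    def result_val(self):
        return self._optional_val("result")

    @property
    def list_entry_vals(self):
        # One-or-more children: collect the w:val of each w:listEntry in order.
        return [entry.get(w("val")) for entry in self._element.findall(w("listEntry"))]


dropdown = DropDownList(ET.fromstring(DDLIST_XML))
print(dropdown.default_val, dropdown.result_val, dropdown.list_entry_vals)
```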
If you prefer to view these updates in their context, this commit contains the modifications which add the ability to handle a new part file (i.e. the nested aFChunk files) and handlers for the form elements.
Is there a way to copy a certain piece of XML using getchildren and create a new page with the same elements from that XML tree? I need to duplicate pages and add them to a document for reporting, and with each iteration within the report I want to use properties from one page in the Word doc.