jdthorpe/Extending python-docx.md

## Extending python-docx.md

      
    Extending python-docx

Background

I was recently tasked with scraping data from a few thousand Word
documents, each of which contained nested docx documents referenced by
w:altChunk elements in the body of the main ./word/document.xml file,
like so:
<!-- word/document.xml -->
<w:document>
  <w:body>
    ...
    <w:altChunk r:id="AltChunkIda1c8521d-4233-44e2-8efb-1a2b3c4d5e6f"/>
    ...
  </w:body>
</w:document>
The r:id attribute of the altChunk element refers to a relationship in the
./word/_rels/document.xml.rels file, like so:
<!-- word/_rels/document.xml.rels -->
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  ...
  <Relationship 
    Id="AltChunkIda1c8521d-4233-44e2-8efb-1a2b3c4d5e6f" 
    Target="/word/afchunk2.docx" 
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk"/>
  ...
</Relationships>
which in turn indicates that the nested docx file is located within the
zipfile at /word/afchunk2.docx, and that it is of type
.../relationships/aFChunk.
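To make the lookup chain concrete, here is a minimal sketch that follows each w:altChunk to its target using only the standard library (zipfile and xml.etree.ElementTree), without python-docx. The helper names altchunk_targets and _demo_package are made up for illustration, and the demo package contains only the two files the lookup needs, so it is not a valid Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def altchunk_targets(docx_bytes):
    """Return {relationship id: target path} for each w:altChunk found."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        document = ET.fromstring(zf.read("word/document.xml"))
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))

    # Index every relationship in the .rels file by its Id attribute.
    target_by_id = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter("{%s}Relationship" % PKG_RELS_NS)
    }

    # Read the r:id of each altChunk and look up where it points.
    targets = {}
    for chunk in document.iter("{%s}altChunk" % W_NS):
        rel_id = chunk.get("{%s}id" % R_NS)
        targets[rel_id] = target_by_id[rel_id]
    return targets


def _demo_package():
    """Build a tiny in-memory package with one altChunk (illustration only)."""
    document_xml = (
        '<w:document xmlns:w="%s" xmlns:r="%s"><w:body>'
        '<w:altChunk r:id="AltChunkId1"/>'
        "</w:body></w:document>" % (W_NS, R_NS)
    )
    rels_xml = (
        '<Relationships xmlns="%s">'
        '<Relationship Id="AltChunkId1" Target="/word/afchunk2.docx" '
        'Type="%s/aFChunk"/>'
        "</Relationships>" % (PKG_RELS_NS, R_NS)
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("word/document.xml", document_xml)
        zf.writestr("word/_rels/document.xml.rels", rels_xml)
    return buffer.getvalue()


DEMO_TARGETS = altchunk_targets(_demo_package())
```

For the demo package, DEMO_TARGETS maps the single relationship id to /word/afchunk2.docx.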
Finally, the content-type of the target file is described in the file
[Content_Types].xml, which contains default content-types for contained
files by their extension, and overrides for particular part files, such as
the header part in this example:
<!-- [Content_Types].xml -->
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  ...
  <Default 
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
    Extension="docx"/>
  ...
  <Override 
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml" 
    PartName="/word/header1.xml"/>
  ...
</Types>
In particular, the [Content_Types].xml file contains a default
content-type of ... document.main+xml for nested part files with the
.docx extension, which indicates that the part file is an XML file
matching a particular schema. Furthermore, there were no overrides
present for the specific altChunk part files (e.g. /word/afchunk2.docx)
to indicate otherwise.
Because of this, when python-docx reads the outer .docx file, it attempts
to read each of the individual part files according to its content-type.
When it encounters the nested /word/afchunk2.docx, it attempts to load
this file with an XML parser (lxml.etree), which of course throws an
XMLSyntaxError.
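The failure can be reproduced in miniature with the standard library alone. The sketch below resolves a part's content-type (an Override wins, otherwise the Default for the extension applies) and then shows that the bytes of a zip archive do not parse as XML. The names content_type_for and parses_as_xml are hypothetical helpers, and xml.etree.ElementTree with its ParseError stands in here for lxml and its XMLSyntaxError.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

CONTENT_TYPES_XML = """\
<Types xmlns="%s">
  <Default Extension="docx"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
  <Override PartName="/word/header1.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml"/>
</Types>""" % CT_NS


def content_type_for(partname, types_xml):
    """Return a part's content-type: an Override wins, else the Default."""
    root = ET.fromstring(types_xml)
    for override in root.iter("{%s}Override" % CT_NS):
        if override.get("PartName") == partname:
            return override.get("ContentType")
    extension = partname.rsplit(".", 1)[-1].lower()
    for default in root.iter("{%s}Default" % CT_NS):
        if default.get("Extension").lower() == extension:
            return default.get("ContentType")
    return None


def parses_as_xml(blob):
    """Report whether a blob can be parsed as XML."""
    try:
        ET.fromstring(blob)
        return True
    except ET.ParseError:
        return False


# A nested .docx is a zip archive, so its bytes are not XML at all.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("word/document.xml", "<doc/>")
nested_docx_blob = buffer.getvalue()

chunk_type = content_type_for("/word/afchunk2.docx", CONTENT_TYPES_XML)
header_type = content_type_for("/word/header1.xml", CONTENT_TYPES_XML)
chunk_is_xml = parses_as_xml(nested_docx_blob)
```

Here chunk_type ends in document.main+xml because only the Default entry matches, even though chunk_is_xml is False for the actual bytes.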
The second problem is that python-docx does not implement the full
Office Open XML Spec
(which is over 5000 pages long!). In particular, it does not natively
handle altChunk parts, which means it's up to us to extend
python-docx to handle nested document chunks.
The solution

To handle these nested document chunks, we'll need to implement a new
AltchunkPart which inherits from the
Part
class.
It will be responsible for converting the internal part file into a Python
object, holding a reference to the returned object, and serializing the
object when writing the whole document to file. We'll leverage the fact
that .docx files can already be parsed by docx.Document, which returns an
appropriate document object, and that the returned object can serialize
itself via its save() method; the .blob property of our new part uses this
to produce a serialized representation of the nested document. Hence we'll
create a new file at docx/parts/altchunk.py with the following:
# encoding: utf-8

"""The |AltchunkPart| and closely related objects"""

from __future__ import absolute_import, division, print_function, unicode_literals

from docx import Document
from ..opc.part import Part
from io import BytesIO

class AltchunkPart(Part):
    """AltChunkPart for word document

    An AltChunk is a nested word document
    """
    def __init__(self, partname, content_type, element, package):
        super(AltchunkPart, self).__init__(
            partname, content_type, package=package
        )
        # store the parsed document in the _element instance attribute
        self._element = element

    @property
    def blob(self):
        # Let the Document element handle serialization
        stream = BytesIO()
        self._element.save(stream)
        return stream.getvalue()

    @property
    def element(self):
        """
        The nested document object held by this part (named `element`
        for parity with XML parts, though it is not an XML element).
        """
        return self._element

    @classmethod
    def load(cls, partname, content_type, blob, package):
        # Parse nested documents using the Document class
        element = Document(BytesIO(blob)) 
        return cls(partname, content_type, element, package)

    @property
    def part(self):
        """
        Part of the parent protocol, "children" of the document will not know
        the part that contains them so must ask their parent object. That
        chain of delegation ends here for child objects.
        """
        return self
This class is capable of handling nested docx files, but as of yet,
python-docx does not know when to use it to handle a given part file.
For this, we'll take advantage of the fact that, as of this writing
(python-docx version 0.8.10), there is a hook in docx/__init__.py called
part_class_selector which is responsible for overriding the Part class
that is used to handle a given part file depending on its content-type
and/or relationship type. We'll extend that hook so that it maps
Relationships with Type "aFChunk" to our new Part class, like so:
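from docx.opc.constants import RELATIONSHIP_TYPE as RT
from docx.parts.altchunk import AltchunkPart
from docx.parts.image import ImagePart

def part_class_selector(content_type, reltype):
    # nested .docx chunks are handled by our new part class
    if reltype == RT.A_F_CHUNK:
        return AltchunkPart
    if reltype == RT.IMAGE:
        return ImagePart
    # None means: fall back to the default lookup by content-type
    return None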
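Why returning None matters is easier to see in a toy model of the dispatch. The sketch below is not python-docx's factory code; it is a simplified stand-in that assumes the selector is consulted first and the content-type lookup is used only when the selector returns None. GenericPart, XmlPartStub, NestedDocPartStub, PART_TYPE_FOR and make_part are all hypothetical names.

```python
RELS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
A_F_CHUNK = RELS + "/aFChunk"
OFFICE_DOCUMENT = RELS + "/officeDocument"
DOCUMENT_MAIN = (
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document.main+xml"
)


class GenericPart(object):
    """Stand-in for the default part class: it just keeps the raw bytes."""

    def __init__(self, partname, content_type, blob):
        self.partname = partname
        self.content_type = content_type
        self.blob = blob


class XmlPartStub(GenericPart):
    """Stand-in for a part whose blob would be parsed as XML."""


class NestedDocPartStub(GenericPart):
    """Stand-in for AltchunkPart."""


# Lookup by content-type, used only when the selector has no opinion.
PART_TYPE_FOR = {DOCUMENT_MAIN: XmlPartStub}


def part_class_selector(content_type, reltype):
    """Return a part class for special relationship types, else None."""
    if reltype == A_F_CHUNK:
        return NestedDocPartStub
    return None


def make_part(partname, content_type, reltype, blob):
    """Ask the selector first, then fall back to the content-type lookup."""
    part_class = part_class_selector(content_type, reltype)
    if part_class is None:
        part_class = PART_TYPE_FOR.get(content_type, GenericPart)
    return part_class(partname, content_type, blob)


# Same content-type, different relationship type, different part class.
chunk = make_part("/word/afchunk2.docx", DOCUMENT_MAIN, A_F_CHUNK, b"PK")
main = make_part("/word/document.xml", DOCUMENT_MAIN, OFFICE_DOCUMENT, b"<doc/>")
```

Both parts carry the document.main+xml content-type, but only the one reached through an aFChunk relationship is routed away from the XML-parsing class.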
At this point, if we load a document containing aFChunks, we can query
the body element of the document for altChunk elements, and access the
nested document (part file) like so:
>>> import docx
>>> document = docx.Document('my-favorite-file.docx')
>>> body = document.part.element.body
>>> children = body.getchildren()
>>> children
[<CT_P '<w:p>' at 0x2d47b6b0728>,
 <Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}altChunk at 0x2d47b696a08>,
 <CT_SectPr '<w:sectPr>' at 0x2d47b6bfc78>,
 <CT_P '<w:p>' at 0x2d47b6bfcc8>,
 <CT_Tbl '<w:tbl>' at 0x2d47b6bfd18>]
>>> altchunk_element = children[1]
>>> altchunk_id = altchunk_element.get('{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id')
>>> chunk_part = document.part.related_parts[altchunk_id]
>>> nested_document = chunk_part.element
There are a number of things which are sub-optimal about this code, however.
First, notice that the altChunk element is represented as a plain lxml.etree
element (<Element {http://schemas....}altChunk at 0x2d47b696a08>) and not
a python-docx content-type element. In particular, content-type elements
like the paragraph (CT_P) have additional functionality, such as handy
getters and setters for required attributes (e.g. the r:id attribute of the
altChunk element), and can be referenced with ease by parent containers.
In python-docx, content-type classes are created using the almost mystical
XML alchemy helpers found in
docx/oxml/xmlchemy.py.
Although the actual magic isn't for the faint of heart, its patterns
are pretty easy to learn by scanning through a few classes that inherit
from BaseOxmlElement, like CT_Body and CT_Num,
which use the magical class attributes created by one of the dispatchable
classes to define optional attributes and/or child elements of the
content-type element wrapper.
While the above may be a bit abstract, let's make it real by defining a
content type to handle the altChunk elements. Because the altChunk is so
closely tied to the document element, I chose to define our new
CT_AltChunk type in docx/oxml/document.py, as follows:
from .simpletypes import XsdString
from .xmlchemy import BaseOxmlElement, RequiredAttribute

class CT_AltChunk(BaseOxmlElement):
    """`w:altChunk` element"""
    rId = RequiredAttribute('r:id', XsdString)
This class is a wrapper for the underlying XML element with a convenience
property called rId that returns the value of the r:id attribute. We
will of course have to inform python-docx that w:altChunk nodes should
be wrapped with our special wrapper by adding these lines to the
docx/oxml/__init__.py file:
from .document import CT_AltChunk, CT_Body, CT_Document
register_element_cls('w:altChunk',     CT_AltChunk)

In addition, we may want to let python-docx know that zero or more of these
CT_AltChunk elements may appear within a document body (CT_Body) by adding the
following class attribute to the CT_Body class:
    altChunk = ZeroOrMore('w:altChunk', successors=())
which will add the special .altChunk_lst property to CT_Body elements.
With these helpers added to python-docx, the above code becomes:
>>> import docx
>>> document = docx.Document('my-favorite-file.docx')
>>> body = document.part.element.body
>>> chunks = body.altChunk_lst
>>> chunks
[<CT_AltChunk at 0x2d47b696a08>]
>>> chunk_part = document.part.related_parts[chunks[0].rId]
>>> nested_document = chunk_part.element
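Since a nested document could itself contain altChunks, a small recursive helper may be handy when scraping. The sketch below assumes the patched python-docx described above (an altChunk_lst property on the body and an rId property on each chunk). iter_nested_documents and _fake_document are hypothetical names, and the stand-in objects only mimic the attributes the helper touches, so the example runs without any .docx files.

```python
from types import SimpleNamespace


def iter_nested_documents(document):
    """Yield every document reachable through w:altChunk, depth-first."""
    part = document.part
    for chunk in part.element.body.altChunk_lst:
        nested = part.related_parts[chunk.rId].element
        yield nested
        # A nested document may contain altChunks of its own.
        for deeper in iter_nested_documents(nested):
            yield deeper


def _fake_document(name, chunks=None):
    """Stand-in exposing only .part.element.body and .part.related_parts."""
    chunks = chunks or {}
    body = SimpleNamespace(
        altChunk_lst=[SimpleNamespace(rId=rel_id) for rel_id in chunks]
    )
    related_parts = {
        rel_id: SimpleNamespace(element=doc) for rel_id, doc in chunks.items()
    }
    part = SimpleNamespace(
        element=SimpleNamespace(body=body), related_parts=related_parts
    )
    return SimpleNamespace(name=name, part=part)


inner = _fake_document("inner")
middle = _fake_document("middle", {"AltChunkId2": inner})
outer = _fake_document("outer", {"AltChunkId1": middle})

NESTED_NAMES = [doc.name for doc in iter_nested_documents(outer)]
```

With a real document, the same helper would be called as iter_nested_documents(docx.Document('my-favorite-file.docx')).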
Dispatchable Properties

In the definition of our custom CT_AltChunk class, we used one of the special
dispatchable properties, which resulted in the convenient rId member
containing the r:id attribute of the underlying XML element. There are
several of these special dispatchable properties, which are used to
represent either XML attributes or XML child elements.
The dispatchable properties called OptionalAttribute and
RequiredAttribute are used for XML attributes, and we already saw
RequiredAttribute in use above. In addition, there are 5 dispatchable
types which are used to describe the child elements contained in a parent
element: OneAndOnlyOne, OneOrMore, ZeroOrMore, ZeroOrOne, and
ZeroOrOneChoice.
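The general idea of declaring attributes and children as class attributes can be imitated with ordinary Python descriptors. The following is a toy analogue built on xml.etree.ElementTree, not the xmlchemy implementation (which works with lxml custom element classes and a metaclass). RequiredAttr, ZeroOrMoreChildren, Wrapper and clark are hypothetical names chosen for this sketch.

```python
import xml.etree.ElementTree as ET

NSMAP = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}


def clark(tag):
    """Expand a prefixed name such as 'r:id' to '{namespace-uri}id'."""
    prefix, name = tag.split(":")
    return "{%s}%s" % (NSMAP[prefix], name)


class RequiredAttr(object):
    """Toy analogue of RequiredAttribute: reads and writes one attribute."""

    def __init__(self, tag):
        self._name = clark(tag)

    def __get__(self, wrapper, owner=None):
        if wrapper is None:
            return self
        value = wrapper.xml.get(self._name)
        if value is None:
            raise ValueError("missing required attribute %s" % self._name)
        return value

    def __set__(self, wrapper, value):
        wrapper.xml.set(self._name, value)


class ZeroOrMoreChildren(object):
    """Toy analogue of ZeroOrMore: lists the matching children, wrapped."""

    def __init__(self, tag, wrapper_cls):
        self._name = clark(tag)
        self._wrapper_cls = wrapper_cls

    def __get__(self, wrapper, owner=None):
        if wrapper is None:
            return self
        return [self._wrapper_cls(el) for el in wrapper.xml.findall(self._name)]


class Wrapper(object):
    """Holds the underlying ElementTree element in `.xml`."""

    def __init__(self, xml):
        self.xml = xml


class AltChunk(Wrapper):
    rId = RequiredAttr("r:id")


class Body(Wrapper):
    altChunk_lst = ZeroOrMoreChildren("w:altChunk", AltChunk)


body = Body(ET.fromstring(
    '<w:body xmlns:w="%(w)s" xmlns:r="%(r)s">'
    '<w:altChunk r:id="AltChunkId1"/><w:p/><w:altChunk r:id="AltChunkId2"/>'
    "</w:body>" % NSMAP
))
chunk_ids = [chunk.rId for chunk in body.altChunk_lst]
```

The wrapper classes stay declarative: each class attribute names the XML it maps to, and the descriptor does the namespace expansion and lookup.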
To demonstrate these child dispatchable helpers, let's implement another
feature of the docx spec which is not currently implemented by
python-docx: forms. While the entire forms spec is too much to implement
here, we'll implement one specific form element called w:ddList,
which appears on page 1281 of the Office Open XML Spec.
While it would be better to implement these classes based on the spec
itself, we'll just use the example code snippet provided by the
spec, which is copied here:
<w:ddList>
   <w:default w:val="1" />
   <w:result w:val="2" />
   <w:listEntry w:val="One" />
   <w:listEntry w:val="Two" />
   <w:listEntry w:val="Three" />
</w:ddList> 
From the example we can see that the dropdown list should have one or
more w:listEntry children, and possibly a w:default and a w:result
indicating the default value and the user's selection, if one has been made.
To implement this element, we can use the ZeroOrOne and OneOrMore
helpers as follows:
class CT_DDList(BaseOxmlElement):
    '''The w:ddList element'''
    default = ZeroOrOne("w:default")
    result = ZeroOrOne("w:result")
    listEntry = OneOrMore("w:listEntry")
and we could of course implement each of the child elements like so:
class ElementWithValue(BaseOxmlElement):
    value = RequiredAttribute('w:val', XsdString)
class CT_Default(ElementWithValue): pass
class CT_Result(ElementWithValue): pass
class CT_ListEntry(ElementWithValue): pass
We will of course have to register these classes with python-docx as we
did above. Once we do so, python-docx will wrap our w:ddList elements with
convenience properties such as el.default, el.result, and
el.listEntry_lst, and convenience methods such as
el.get_or_add_default() and el.add_listEntry().
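For comparison, here is roughly what those convenience properties save us from writing by hand. This sketch reads the same w:ddList snippet with plain xml.etree.ElementTree. read_ddlist is a hypothetical helper, the namespace declaration is added so the snippet parses on its own, and the w:val values are returned as raw strings without interpreting them.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

DDLIST_XML = """\
<w:ddList xmlns:w="%s">
   <w:default w:val="1" />
   <w:result w:val="2" />
   <w:listEntry w:val="One" />
   <w:listEntry w:val="Two" />
   <w:listEntry w:val="Three" />
</w:ddList>""" % W_NS


def read_ddlist(xml_text):
    """Return the default, result and list entries of a w:ddList element."""
    root = ET.fromstring(xml_text)
    val = "{%s}val" % W_NS

    def optional(tag):
        # Mirrors ZeroOrOne: the child may be absent.
        child = root.find("{%s}%s" % (W_NS, tag))
        return None if child is None else child.get(val)

    return {
        "default": optional("default"),
        "result": optional("result"),
        # Mirrors OneOrMore: every w:listEntry child, in document order.
        "entries": [
            entry.get(val) for entry in root.findall("{%s}listEntry" % W_NS)
        ],
    }


DDLIST = read_ddlist(DDLIST_XML)
```

With the CT_DDList wrapper registered, the same information would be reachable as el.default, el.result and el.listEntry_lst instead.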
Code

If you prefer to view these updates in their context,
this commit
contains the modifications which add the ability to handle a
new part file (i.e. the nested aFChunk files) and handlers
for the form elements.