@jdthorpe
Last active November 4, 2023 15:41
Extending python-docx

Background

I was recently tasked with scraping data from a few thousand Word documents, each of which contained nested docx documents referenced by w:altChunk elements in the body of the main ./word/document.xml file, like so:

<!-- word/document.xml -->
<w:document>
  <w:body>
    ...
    <w:altChunk r:id="AltChunkIda1c8521d-4233-44e2-8efb-1a2b3c4d5e6f"/>
    ...
  </w:body>
</w:document>

The r:id attribute of the altChunk element refers to a relationship in the ./word/_rels/document.xml.rels file, like so:

<!-- word/_rels/document.xml.rels -->
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  ...
  <Relationship 
    Id="AltChunkIda1c8521d-4233-44e2-8efb-1a2b3c4d5e6f" 
    Target="/word/afchunk2.docx" 
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk"/>
  ...
</Relationships>

which in turn indicates that the nested docx file is located within the zip archive at /word/afchunk2.docx, and that its relationship type is .../relationships/aFChunk.

Finally, the content type of the target file is described in the file [Content_Types].xml, which contains default content types for contained files by their extension, and overrides for particular part files, such as the header part in this excerpt:

<!-- [Content_Types].xml -->
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  ...
  <Default 
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
    Extension="docx"/>
  ...
  <Override 
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml" 
    PartName="/word/header1.xml"/>
  ...
</Types>

In particular, the [Content_Types].xml file declares a default content type of ...document.main+xml for part files with the .docx extension, which indicates that such a part is an XML file matching a particular schema. Furthermore, there were no overrides for the specific altChunk part files (e.g. /word/afchunk2.docx) to indicate otherwise.
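The rule at work here (an Override for the exact part name wins, otherwise the Default for the file extension applies) can be sketched as follows. This is a simplified illustration using the standard library; content_type_for is a hypothetical helper, and it matches part names exactly, which is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# The same entries as in the excerpt above, without the elided lines.
TYPES_XML = """\
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
    Extension="docx"/>
  <Override
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml"
    PartName="/word/header1.xml"/>
</Types>
"""


def content_type_for(partname, types_xml):
    """Look up a part's content type: Override by name, else Default by extension."""
    root = ET.fromstring(types_xml)
    # An Override for this exact part name takes precedence.
    for override in root.iter("{%s}Override" % CT_NS):
        if override.get("PartName") == partname:
            return override.get("ContentType")
    # Otherwise fall back to the Default registered for the extension.
    extension = partname.rsplit(".", 1)[-1].lower()
    for default in root.iter("{%s}Default" % CT_NS):
        if default.get("Extension").lower() == extension:
            return default.get("ContentType")
    return None


print(content_type_for("/word/afchunk2.docx", TYPES_XML))
```

With these entries, the nested /word/afchunk2.docx falls through to the docx Default and is reported as an XML document part, which is the root of the problem described next.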

Because of this, when python-docx reads the outer .docx file, it loads each part file according to its content type. When it encounters the nested /word/afchunk2.docx, it attempts to load this file with an XML parser (lxml.etree), which of course throws an XMLSyntaxError because the part is actually a zip archive.
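The failure is easy to reproduce in miniature. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml, so the exception is ParseError instead of XMLSyntaxError, but the cause is the same: the bytes of a zip archive are not XML. The helper name is_parseable_xml is made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def is_parseable_xml(blob):
    """Return True if `blob` parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(blob)
    except ET.ParseError:
        return False
    return True


# Build a stand-in for a nested .docx: any zip archive will do here.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("word/document.xml", "<doc/>")
nested_docx_blob = buffer.getvalue()

print(is_parseable_xml(b"<doc/>"))          # an ordinary XML part
print(is_parseable_xml(nested_docx_blob))   # the nested package
print(zipfile.is_zipfile(io.BytesIO(nested_docx_blob)))
```

The last line shows one way to recognize such a part before handing it to an XML parser.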

The second problem is that python-docx does not implement the full Office Open XML spec (which is over 5000 pages long!). In particular, it does not natively handle altChunk parts, which means it's up to us to extend python-docx to handle nested document chunks.

The solution

To handle these nested document chunks, we'll implement a new AltchunkPart class which inherits from the Part class. It is responsible for converting the internal part file into a Python object, holding a reference to that object, and serializing the object when the whole document is written to file. We'll leverage the fact that .docx files can already be parsed by docx.Document, which returns a document object that can serialize itself through its .save() method. Hence we'll create a new file at docx/parts/altchunk.py with the following:

# encoding: utf-8

"""The |AltchunkPart| and closely related objects"""

from __future__ import absolute_import, division, print_function, unicode_literals

from docx import Document
from ..opc.part import Part
from io import BytesIO

class AltchunkPart(Part):
    """AltChunkPart for word document

    An AltChunk is a nested word document
    """
    def __init__(self, partname, content_type, element, package):
        super(AltchunkPart, self).__init__(
            partname, content_type, package=package
        )
        # store the parsed document in the _element instance attribute
        self._element = element

    @property
    def blob(self):
        # Let the Document element handle serialization
        stream = BytesIO()
        self._element.save(stream)
        return stream.getvalue()

    @property
    def element(self):
        """
        The nested document object parsed from this part's blob.
        """
        return self._element

    @classmethod
    def load(cls, partname, content_type, blob, package):
        # Parse nested documents using the Document class
        element = Document(BytesIO(blob)) 
        return cls(partname, content_type, element, package)

    @property
    def part(self):
        """
        Part of the parent protocol, "children" of the document will not know
        the part that contains them so must ask their parent object. That
        chain of delegation ends here for child objects.
        """
        return self

This class is capable of handling nested docx files, but python-docx does not yet know when to use it for a given part file. For this, we'll take advantage of the fact that, as of this writing (python-docx version 0.8.10), there is a hook in docx/__init__.py called part_class_selector which can override the Part class used to handle a given part file depending on its content type and/or relationship type. We'll extend that hook so that relationships with Type ".../aFChunk" map to our new Part class, like so:

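# in docx/__init__.py, where RT and ImagePart are already imported
from docx.parts.altchunk import AltchunkPart

def part_class_selector(content_type, reltype):
    # nested documents are identified by relationship type, not content type
    if reltype == RT.A_F_CHUNK:
        return AltchunkPart
    if reltype == RT.IMAGE:
        return ImagePart
    return None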
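To see why a hook like this sidesteps the misleading content type, here is a toy model of the dispatch: the selector is asked first, and only when it returns None does the content type decide. This is a simplified stand-in, not python-docx's actual factory code; XmlPartStub, AltchunkPartStub, choose_part_class and TABLE are hypothetical names.

```python
AF_CHUNK = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk"
DOCUMENT_MAIN = (
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
)


class XmlPartStub:
    """Stand-in for a part class that would parse its blob as XML."""


class AltchunkPartStub:
    """Stand-in for the AltchunkPart class defined earlier."""


def part_class_selector(content_type, reltype):
    # Only the relationship type is inspected; the content type is ignored.
    if reltype == AF_CHUNK:
        return AltchunkPartStub
    return None


def choose_part_class(content_type, reltype, by_content_type, selector=None):
    """Ask the selector hook first; fall back to the content-type table."""
    part_class = selector(content_type, reltype) if selector else None
    if part_class is None:
        part_class = by_content_type.get(content_type)
    return part_class


TABLE = {DOCUMENT_MAIN: XmlPartStub}

# Same content type, different relationship type, different part class.
print(choose_part_class(DOCUMENT_MAIN, AF_CHUNK, TABLE, part_class_selector).__name__)
print(choose_part_class(DOCUMENT_MAIN, "other", TABLE, part_class_selector).__name__)
```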

At this point, if we load a document containing afChunks, we can query the body element of the document for altChunk elements and access the nested document (part file) like so:

>>> import docx
>>> document = docx.Document('my-favorite-file.docx')
>>> body = document.part.element.body
>>> children = body.getchildren()
>>> children
[<CT_P '<w:p>' at 0x2d47b6b0728>,
 <Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}altChunk at 0x2d47b696a08>,
 <CT_SectPr '<w:sectPr>' at 0x2d47b6bfc78>,
 <CT_P '<w:p>' at 0x2d47b6bfcc8>,
 <CT_Tbl '<w:tbl>' at 0x2d47b6bfd18>]
>>> altchunk_element = children[1]
>>> altchunk_id = altchunk_element.get('{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id')
>>> chunk_part = document.part.related_parts[altchunk_id]
>>> nested_document = chunk_part.element

A number of things are sub-optimal about this code, however. First, notice that the altChunk element is represented as a plain lxml.etree element (<Element {http://schemas....}altChunk at 0x2d47b696a08>) and not as one of python-docx's custom element classes. Custom element classes like the paragraph (CT_P) have additional functionality, such as handy getters and setters for required attributes (e.g. the r:id attribute of the altChunk element), and can be referenced with ease by parent containers.

In python-docx, custom element classes are created almost mystically by the XML alchemy helpers found in docx/oxml/xmlchemy.py. Although the actual magic isn't for the faint of heart, its patterns are pretty easy to learn by scanning through a few classes that inherit from BaseOxmlElement, like CT_Body and CT_Num, which use the magical class attributes created by one of the dispatchable classes to define optional attributes and/or child elements of the element wrapper.

While the above may be a bit abstract, let's make it real by defining a custom element class to handle altChunk elements. Because the altChunk is so closely tied to the document element, I chose to define our new CT_AltChunk class in docx/oxml/document.py, as follows:

from .simpletypes import XsdString
from .xmlchemy import BaseOxmlElement, RequiredAttribute

class CT_AltChunk(BaseOxmlElement):
    """`w:altChunk` element"""
    rId = RequiredAttribute('r:id', XsdString)

This class is a wrapper for the underlying XML element with a convenience property called rId that returns the value of the r:id attribute. We will of course have to inform python-docx that w:altChunk nodes should be wrapped with our special wrapper by adding these lines to the docx/oxml/__init__.py file:

from .document import CT_AltChunk, CT_Body, CT_Document
register_element_cls('w:altChunk',     CT_AltChunk)

In addition, we may want to let python-docx know that zero or more of these CT_AltChunk elements may appear within a document body (CT_Body) by adding the following class attribute to the CT_Body class:

    altChunk = ZeroOrMore('w:altChunk', successors=('w:sectPr',))

which will add the special .altChunk_lst property to CT_Body elements.

With these helpers added to python-docx, the above code becomes:

>>> import docx
>>> document = docx.Document('my-favorite-file.docx')
>>> body = document.part.element.body
>>> chunks = body.altChunk_lst
>>> chunks
[<CT_AltChunk '<w:altChunk>' at 0x2d47b696a08>]
>>> chunk_part = document.part.related_parts[chunks[0].rId]
>>> nested_document = chunk_part.element

Dispatchable Properties

In the definition of our custom CT_AltChunk class, we used one of the special dispatchable properties, which resulted in the convenient rId member holding the r:id attribute of the underlying XML element. There are several of these special dispatchable properties, which are used to represent either XML attributes or XML child elements.

The dispatchable types called OptionalAttribute and RequiredAttribute are used for XML attributes, and we already saw RequiredAttribute in use above. In addition, there are five dispatchable types which describe the child elements contained in a parent element: OneAndOnlyOne, OneOrMore, ZeroOrMore, ZeroOrOne, and ZeroOrOneChoice.
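For intuition about how a class attribute can turn into a working property, here is a toy version of the pattern built on the standard library. It is a deliberately simplified sketch and not the code in docx/oxml/xmlchemy.py: RequiredAttr, MetaWrapper, Wrapper and AltChunk are hypothetical names, and XML namespaces are left out to keep it short.

```python
import xml.etree.ElementTree as ET


class RequiredAttr:
    """Toy declaration: 'this element must carry the attribute `name`'."""

    def __init__(self, name):
        self.name = name

    def populate_class_members(self, cls, prop_name):
        attr_name = self.name

        def getter(obj):
            value = obj.element.get(attr_name)
            if value is None:
                raise ValueError("required attribute %r is missing" % attr_name)
            return value

        def setter(obj, value):
            obj.element.set(attr_name, str(value))

        # Replace the declaration with a real property on the class.
        setattr(cls, prop_name, property(getter, setter))


class MetaWrapper(type):
    """Metaclass that expands declarations into properties at class creation."""

    def __init__(cls, clsname, bases, namespace):
        super().__init__(clsname, bases, namespace)
        for key, value in list(namespace.items()):
            if isinstance(value, RequiredAttr):
                value.populate_class_members(cls, key)


class Wrapper(metaclass=MetaWrapper):
    """Base class holding the wrapped XML element."""

    def __init__(self, element):
        self.element = element


class AltChunk(Wrapper):
    rId = RequiredAttr("id")  # one declarative line becomes a getter and setter


chunk = AltChunk(ET.fromstring('<altChunk id="rId7"/>'))
print(chunk.rId)
```

The child-element helpers follow the same idea, except that each declaration contributes several members (a getter, a list getter, adders, and so on) rather than a single property.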

To demonstrate these child dispatchable helpers, let's implement another feature of the docx spec which is not currently implemented by python-docx: forms. While the entire forms spec is too much to implement here, we'll implement one specific form element called w:ddList, which appears on page 1281 of the Office Open XML spec. While it would be better to implement these classes based on the spec itself, we'll just use the example code snippet provided by the spec, which is copied here:

<w:ddList>
   <w:default w:val="1" />
   <w:result w:val="2" />
   <w:listEntry w:val="One" />
   <w:listEntry w:val="Two" />
   <w:listEntry w:val="Three" />
</w:ddList> 

From the example we can see that the dropdown list should have one or more w:listEntry children, and possibly a w:default and a w:result indicating the default value and the user's selection, if one has been made.
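Before wrapping the element in python-docx classes, that reading of the structure can be checked with the standard library alone. The sketch below parses the same snippet, with the w: namespace declaration added so that it is a complete XML document; read_ddlist and _val are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

DDLIST_XML = """\
<w:ddList xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
   <w:default w:val="1" />
   <w:result w:val="2" />
   <w:listEntry w:val="One" />
   <w:listEntry w:val="Two" />
   <w:listEntry w:val="Three" />
</w:ddList>
"""


def _val(element):
    """Return the w:val attribute of `element`, or None if the element is absent."""
    return None if element is None else element.get("{%s}val" % W_NS)


def read_ddlist(xml_text):
    """Summarize a w:ddList: its entries plus the optional default and result."""
    ddlist = ET.fromstring(xml_text)
    return {
        "default": _val(ddlist.find("{%s}default" % W_NS)),
        "result": _val(ddlist.find("{%s}result" % W_NS)),
        "entries": [_val(e) for e in ddlist.findall("{%s}listEntry" % W_NS)],
    }


print(read_ddlist(DDLIST_XML))
```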

To implement this element, we can use the ZeroOrOne and OneOrMore elements as follows:

class CT_DDList(BaseOxmlElement):
    '''The w:ddList element'''
    default = ZeroOrOne("w:default")
    result = ZeroOrOne("w:result")
    listEntry = OneOrMore("w:listEntry")

and we could of course implement each of the child elements like so:

class ElementWithValue(BaseOxmlElement):
    value = RequiredAttribute('w:val', XsdString)
class CT_Default(ElementWithValue): pass
class CT_Result(ElementWithValue): pass
class CT_ListEntry(ElementWithValue): pass

We will of course have to register these classes with python-docx as we did above, and once we do so, python-docx will wrap our w:ddList elements with convenience properties such as el.default, el.result, and el.listEntry_lst, and convenience methods such as el.get_or_add_default() and el.add_listEntry().

Code

If you prefer to view these updates in their context, this commit contains the modifications which add the ability to handle a new part file (i.e. the nested afChunk files) and handlers for the form elements.

@RobPodi
RobPodi commented Nov 4, 2023

Is there a way to copy a certain xml piece using get children and create a new page with the same elements from that xml tree? Need to duplicate pages and add them into a document for reporting and with each iteration within the report want to use properties from one page in the word doc
