@zackbatist
Last active November 3, 2021 21:40
Potential workflow for exporting a MAXQDA document as textfiles with html-style tagging.

Prepare textfile with codes marked up as html tags (<>)

This is a proposed workflow for exporting a MAXQDA document as html with semantic tags, inspired by Anselm (https://github.com/edsu/anselm). Once it's all formatted as text it will presumably be much easier to work with.

A significant challenge that this is meant to resolve is that MAXQDA exports only the paragraph-level positions of codings, not their character-level positions. It does store the character-level positions in the database, though, so I just have to get them out and assign them to the relevant parts of the text.

I need to consult with someone who is more knowledgeable about CSS to design better tag templates that would allow tags to be formatted according to custom styles. For example, the ID schema could be used to apply styles to sets of codes based on their semantic position in the code tree. Some ideas to check out:

  • Include all parental codes in the code name, so instead of:

    <mark ID = 123: "Zack Batist", "Database manager", "Archaeological data is the best!">

    it ends up being more like:

    <mark ID = 123: "[Activity Domain].[People].[Individuals].Zack Batist", "[Activity Domain].[People].[Roles].[Roles by position].Database manager", "[Figuration Domain].[Positions].Archaeological data is the best!">

  • Figure out how to appropriately assign unique IDs, i.e. by span or by code. Currently it's set to work by span, but could probably be modified.
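
The first idea above (prefixing each code with its parent codes) could be sketched roughly as follows. How MAXQDA stores the code hierarchy is not covered in these notes, so the `PARENTS` dictionary and the `qualified_name` helper are purely hypothetical stand-ins for whatever the database actually provides.

```python
# Hypothetical code tree: each code name maps to its parent (None for a top-level code).
# This structure is an assumption, not MAXQDA's actual storage format.
PARENTS = {
    "Activity Domain": None,
    "People": "Activity Domain",
    "Individuals": "People",
    "Zack Batist": "Individuals",
}


def qualified_name(code, parents):
    """Return the code name prefixed by its bracketed ancestors, root first."""
    ancestors = []
    parent = parents.get(code)
    while parent is not None:
        ancestors.append(parent)
        parent = parents.get(parent)
    prefix = "".join(f"[{name}]." for name in reversed(ancestors))
    return prefix + code


print(qualified_name("Zack Batist", PARENTS))
# [Activity Domain].[People].[Individuals].Zack Batist
```

A string built this way could be dropped into the tag template in place of the bare code name.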

This will have to be updated to account for offsets caused by timestamp characters. Create a condition that detects how many timestamps fall within the span covered by the coding (based on regex pattern matching) and determines how many characters those timestamps contribute. This value may then be added to OpenTagOffset and CloseTagOffset, as needed.
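
A minimal sketch of the timestamp check might look like this. The timestamp format is not specified in these notes, so the `TIMESTAMP` pattern (matching strings like `[0:00:05]`) and the `timestamp_chars` helper are assumptions that would need adjusting to the real transcripts.

```python
import re

# Assumed timestamp format, e.g. "[0:00:05]"; adjust the pattern to the real transcript.
TIMESTAMP = re.compile(r"\[\d+:\d{2}:\d{2}\]")


def timestamp_chars(text, start, end):
    """Count the characters taken up by timestamps within text[start:end]."""
    return sum(len(m.group()) for m in TIMESTAMP.finditer(text[start:end]))


transcript = "[0:00:05] Hello there. [0:00:09] How are you?"
print(timestamp_chars(transcript, 0, len(transcript)))  # 18
```

The returned count is the amount that might be added to the relevant offset for a coding covering that span.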

  1. Import a textfile with the plaintext transcript.
  2. Filter list of codings by the document name.
    • SELECT Codings.* FROM Codings WHERE Codings.TextID = <specified by user, and translated to the TextID by referring to the Texts table>.
  3. Calculate the total number of characters in the document as DocTotalChars.
  4. Create OpenTagOffset with default value 0.
  5. Create CloseTagOffset with default value 0.
  6. Create a template string for the open tag as OpenTagText.
    • Apply a filter to the table to include all records with identical values for Codings.SegPos1X and Codings.SegPos2X.
    • Create an array with strings corresponding with the Codings.Name values pertaining to each row in this filtered table.
    • Parse those strings into a template following this pattern: <mark ID = 123: "item1", "item2", "item3">.
  7. Create a template string for the close tag (</mark ID = 123>) as CloseTagText.
  8. Retrieve the total number of characters in OpenTagText as OpenTagChars.
  9. Retrieve the total number of characters in CloseTagText as CloseTagChars.
  10. Create OpenTagPosition and set it to the sum of the integer values of Codings.SegPos1X and OpenTagOffset.
  11. Create CloseTagPosition and set it to the sum of the integer values of Codings.SegPos2X and CloseTagOffset.
  12. Insert OpenTagText as a string at the position indicated by OpenTagPosition.
  13. Insert CloseTagText as a string at the position indicated by CloseTagPosition.
  14. Set OpenTagOffset as the sum of the integer values of OpenTagChars and CloseTagChars.
  15. Loop back to step 6.
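
The steps above can be sketched in Python roughly as follows. This is not a literal transcription: instead of tracking OpenTagOffset and CloseTagOffset, it swaps in a simpler technique and inserts tags from the end of the text backwards, so earlier positions never shift. The `tag_text` function, the tuple layout of `codings`, and the treatment of the end position as exclusive are all assumptions made for illustration.

```python
from collections import defaultdict


def tag_text(text, codings):
    """Insert <mark> tags around coded spans.

    codings: iterable of (start, end, code_name) tuples holding character
    positions in the untagged text (assumed to come from SegPos1X/SegPos2X,
    with the end position treated as exclusive).
    """
    # Group code names by span so that each span gets a single tag (step 6).
    spans = defaultdict(list)
    for start, end, name in codings:
        spans[(start, end)].append(name)

    # Build (position, is_open, tag) insertions; IDs are assigned per span.
    insertions = []
    for span_id, ((start, end), names) in enumerate(sorted(spans.items()), start=1):
        quoted = ", ".join(f'"{n}"' for n in names)
        insertions.append((start, 1, f"<mark ID = {span_id}: {quoted}>"))
        insertions.append((end, 0, f"</mark ID = {span_id}>"))

    # Insert from the end of the text backwards so that earlier positions
    # are not shifted by tags that have already been inserted.
    for position, _, tag in sorted(insertions, reverse=True):
        text = text[:position] + tag + text[position:]
    return text


sample = "Zack manages the database."
sample_codings = [
    (0, 4, "Zack Batist"),
    (0, 4, "Database manager"),
    (17, 25, "Archaeological data is the best!"),
]
print(tag_text(sample, sample_codings))
```

Working backwards trades the running offsets for a single sort. Overlapping spans are inserted without complaint, but the sketch makes no attempt to keep the resulting tags properly nested.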

Notes from dissecting the MAXQDA 2020 SQLite database.

  • Codings.WordID refers to Codewords.ID, which are unique identifiers for each code.
  • Codings.TextID refers to Texts.ID, which are unique identifiers for each document.
  • Codings.SegPos1StdUnit refers to the coding's starting paragraph.
  • Codings.SegPos2StdUnit refers to the coding's ending paragraph.
  • Codings.SegPos1W refers to the coding's starting paragraph.
  • Codings.SegPos2W refers to the coding's ending paragraph.
  • Codings.SegPos1X refers to the coding's starting character.
  • Codings.SegPos2X refers to the coding's ending character.
  • Codings.SegPos1Y, Codings.SegPos1Z, Codings.SegPos1U, Codings.SegPos1V / Codings.SegPos2Y, Codings.SegPos2Z, Codings.SegPos2U, Codings.SegPos2V are set to default value of '0'.
  • Codings.Area refers to the difference between integer values stored in Codings.SegPos2X and Codings.SegPos1X.
  • Codings.ID and Codings.TAID refer to unique identifiers for each coding. They complement the values in Codewords.TID to form a complete or global list of values.
  • Texts.Codes refers to the list of codings within a document.
  • Texts.TID refers to Codings.TID.
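
Based on the relationships noted above, retrieving the codings for one document might look something like the sketch below. It runs against a toy in-memory database rather than a real project file. The `Texts.Name` and `Codewords.Name` columns, the sample rows, and the `codings_for_document` helper are assumptions; only the `ID`, `TextID`, `WordID`, `SegPos1X` and `SegPos2X` columns come from the notes.

```python
import sqlite3

# Join codings to their code and document; the Name columns are assumed.
QUERY = """
SELECT Codings.SegPos1X, Codings.SegPos2X, Codewords.Name
FROM Codings
JOIN Codewords ON Codewords.ID = Codings.WordID
JOIN Texts ON Texts.ID = Codings.TextID
WHERE Texts.Name = ?
ORDER BY Codings.SegPos1X, Codings.SegPos2X, Codewords.ID
"""


def codings_for_document(conn, document_name):
    """Return (start, end, code_name) rows for the named document."""
    return conn.execute(QUERY, (document_name,)).fetchall()


# Toy in-memory database standing in for a MAXQDA project file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Texts (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Codewords (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Codings (ID INTEGER PRIMARY KEY, TextID INTEGER, WordID INTEGER,
                      SegPos1X INTEGER, SegPos2X INTEGER);
INSERT INTO Texts VALUES (1, 'Interview 1'), (2, 'Interview 2');
INSERT INTO Codewords VALUES (10, 'Zack Batist'), (11, 'Database manager');
INSERT INTO Codings VALUES (100, 1, 10, 0, 4), (101, 1, 11, 0, 4), (102, 2, 10, 7, 9);
""")

print(codings_for_document(conn, "Interview 1"))
# [(0, 4, 'Zack Batist'), (0, 4, 'Database manager')]
```

Passing the document name as a bound parameter (`?`) keeps the user-supplied value out of the SQL string itself.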