Skip to content

Instantly share code, notes, and snippets.

@zackbatist
Last active December 20, 2020 15:32
Show Gist options
  • Save zackbatist/ff553964f86ef4501a2ca614270308d3 to your computer and use it in GitHub Desktop.
Save zackbatist/ff553964f86ef4501a2ca614270308d3 to your computer and use it in GitHub Desktop.
Documenting how I migrate data from MaxQDA to a Neo4j graph database.

These notes are meant to document the processes through which I migrate data from MaxQDA 2020 to a Neo4j graph database.

Some of these processes depend on certain coding conventions when marking up documents in MaxQDA, which will be documented elsewhere soon.

Some processes documented here generated results that needed to be hand-edited, and therefore this is not meant to be used as an aide for reproducible research. Instead, it is a series of processes that I'm sharing in case others might find them useful.

Migrating the code system

Creating nodes from codes

Export a table with all codes.

In Excel, populate a new column called NodeID with ascending integers. If there are already nodes in the database, then offset the NodeID by the highest existing value.

In a new column called CorrectedCode, use the following formula to extract the code from its full path. Replace the variable in curly brackets with the column that MaxQDA automatically names Code.

=TRIM(RIGHT(SUBSTITUTE(A2,"\",REPT(" ",LEN(A2))),LEN(A2)))

Add node labels and property values in additional columns, as deemed necessary.

Some of these may be extracted from the names of codes, depending on coding conventions (i.e. characters used as shorthand to represent certain aspects of the code, see [[CodingConventions.md]]) using the following Excel formulas:

Extract values:

=SUBSTITUTE(A2,"P:","Position")

Remove anything before a specific character (in this case, :):

=TRIM(RIGHT(SUBSTITUTE(A2,":",REPT(" ",LEN(A2))),LEN(A2)))

Remove anything after a specific character (in this case, :):

=TRIM(LEFT(SUBSTITUTE(A2,":",REPT(" ",LEN(A2))),LEN(A2)))

Trim extrenuous whitespace at the beginning and end of a string:

=TRIM(A2)

Pass the values from selected columns into a template Cypher query into a new column called CreateNode using a concatenation formula. Replace the variables or labels with columns where your values are located and/or include them in the formula manually:

="CREATE (n"&B2&":"&"Code:ActivityDomain"&" {NodeID:"&B2&", name:"&CHAR(34)&C2&CHAR(34)&"});"

ProTip: Excel uses double quotes as special characters in concatenation formulae, so to wrap strings in double quotes (as required by the Cypher spec) use CHAR(34) instead.


In the Neo4j Desktop settings, ensure that the "Enable multi statement query editor" box is checked.

Copy and paste the contents of the 'CreateNode' column into the Neo4j terminal, and execute the query.

Recreating the code hierarchy

In another spreadsheet, create a table with columns From and To.

Paste the list of all NodeIDs into the From column.

Referring to the MaxQDA code tree and your table of codes and their associated NodeIDs, manually indicate in the To column the NodeID pertaining to the direct parent code of the code represented in the From column.

Add relationship type and property values in additional columns, as deemed necessary.

Pass the values from selected columns into a template Cypher query into a new column called CreateRelationship using a concatenation formula:

="MATCH (a {NodeID:"&A2&"}), (b {NodeID:"&B2&"}) MERGE (a)-[:CHILD_OF {warrant:"&CHAR(34)&C2&CHAR(34)&"}]->(b);"

In the Neo4j Desktop settings, ensure that the "Enable multi statement query editor" box is checked.

Copy and paste the contents of the 'CreateRelationship' column into the Neo4J terminal, and execute the query.

Set info from code memos as node labels or properties [in progress]

Export code memos from MaxQDA.

Align them with the NodeIDs.

Parse the info into a Cypher query.

Execute the cypher queries.

Migrating coded segments

Export coded segments from MaxQDA using the "Export to Excel" button in the retrieved segments pane.

Execute the following R script to process the data:

library(reshape2)
library(dplyr)
library(stringr)

ExportCodedSegments <- read.csv("path/to/file", header=TRUE, check.names = FALSE)

# Extract the name of the code from the full path.
ExportCodedSegments$Code <- sapply(strsplit(ExportCodedSegments$Code, "\\\\s*"), tail, 1)
ExportCodedSegments$Code <- sapply(strsplit(ExportCodedSegments$Code, ":\\s*"), tail, 1)
ExportCodedSegments$Code <- str_trim(ExportCodedSegments$Code)

# Retrieve the NodeIDs assigned to codes, already imported into the graph database.
Codes <- read.csv("path/to/file", header = TRUE, check.names = FALSE)
ExportCodedSegments$CodeID <- Codes$NodeID[match(ExportCodedSegments$Code, Codes$Code)]

# Create NodeIDs for the coded segments.
ExportCodedSegments$SegmentID <- cumsum(!duplicated(ExportCodedSegments$Segment))
ExportCodedSegments$SegmentID <- ExportCodedSegments$SegmentID + 1955

# Create or modify node labels, relationship type and property values.
ExportCodedSegments$RelationshipType <- "REFERS_TO"
ExportCodedSegments$`Created by`[ExportCodedSegments$`Created by` == "zackbatist"] <- "Zack Batist"
ExportCodedSegments$`Created by`[ExportCodedSegments$`Created by` == "costi"] <- "Costis Dallas"
ExportCodedSegments$Created <- sub(' ', 'T', ExportCodedSegments$Created)
ExportCodedSegments$Created <- paste0(ExportCodedSegments$Created, "Z")
ExportCodedSegments$Segment <- gsub('\\n\\n', ' <ParagraphBreak> ', ExportCodedSegments$Segment, fixed = TRUE)
ExportCodedSegments$Segment <- gsub('"', '\'', ExportCodedSegments$Segment, fixed = TRUE)
ExportCodedSegments$Segment <- gsub(':', '\:', ExportCodedSegments$Segment, fixed = TRUE)

# Pass the values from selected columns into template Cypher queries using concatenation formulae.
ExportCodedSegments$CreateNode <- paste0("CREATE (n", ExportCodedSegments$SegmentID, ":CodedSegment {NodeID:", ExportCodedSegments$SegmentID, ", document:'", ExportCodedSegments$`Document name`, "', startPara:", ExportCodedSegments$Beginning, ", endPara:", ExportCodedSegments$End, ", area:", ExportCodedSegments$Area, ", coverage:", ExportCodedSegments$`Coverage %`, ", text:'", ExportCodedSegments$Segment, "'});", sep = "")
ExportCodedSegments$CreateRelationship <- paste0("MATCH (a {NodeID:", ExportCodedSegments$CodeID, "}), (b {NodeID:", ExportCodedSegments$SegmentID, "}) CREATE (a)-[r:", ExportCodedSegments$RelationshipType, " {weight:", ExportCodedSegments$`Weight score`, ", createdBy:'", ExportCodedSegments$`Created by`, "', createdDate:'", ExportCodedSegments$Created, "' }]->(b);", sep = "")

# Write queries files.
CreateSegmentNodesCypher <- ExportCodedSegments[!duplicated(ExportCodedSegments$SegmentID),]
CreateSegmentNodesCypher <- CreateSegmentNodesCypher$CreateNode
write.csv(CreateSegmentNodesCypher, "path/to/file", row.names = FALSE)

CreateSegmentRelationshipsCypher <- ExportCodedSegments$CreateRelationship
write.csv(CreateSegmentRelationshipsCypher, "path/to/file", row.names = FALSE)

Open the output CreateNode-Segments.csv and CreateRelationship-Segments.csv in a text editor and remove the header line and any trailing quotes. Find and replace is your friend for any touch-ups, but be wary when making any mass changes to the segment text at this stage.

Copy and paste into the Neo4j terminal, and execute the query.

Obtain character-level granularity for the beginning and end of coded segments

Copy and paste each transcript into an individual text file.

In MaxQDA, activate each document as well as the codes of interest, and export the retrieved segments as an Excel spreadsheet. Convert the Excel spreadsheet to CSV format. Do this for each document independently.

Execute the following R script to process the data:

library(readr)
library(stringi)

# Read the data.
filename <- file.choose()
FullTranscript <- read_file(filename)
filename2 <- file.choose()
CodedSegments <- read.csv(filename2, header = TRUE)

# Allocate existing NodeIDs to the recently exported file of coded segments.
AllCodedSegments <- read.csv("path/to/file", header = TRUE, check.names = FALSE)
CodedSegments$NodeID <- AllCodedSegments$SegmentID[match(CodedSegments$Segment, AllCodedSegments$Segment)]

# Identify the character positions for each segment and copy them over to their own columns.
QQ <- stri_locate_all(pattern = CodedSegments$Segment, FullTranscript, fixed = TRUE)
CodedSegments$startChar <- sapply(1:length(QQ), function(i) as.numeric(QQ[[i]][[1]]))
CodedSegments$endChar <- sapply(1:length(QQ), function(i) as.numeric(QQ[[i]][[2]]))

# Pass the values from selected columns into template Cypher queries using concatenation formulae.
CodedSegments$CypherStartChar <- paste0("MATCH (n { NodeID: ",CodedSegments$NodeID," }) SET n.startChar = ",CodedSegments$startChar, " RETURN n.NodeID, n.startChar;")
CodedSegments$CypherEndChar <- paste0("MATCH (n { NodeID: ",CodedSegments$NodeID," }) SET n.endChar = ",CodedSegments$endChar, " RETURN n.NodeID, n.endChar;")

# Write to file.
ImportFilename <- basename(filename)
ExportFilename <- paste0("segment-character-resolution/",ImportFilename,"-QQ.csv")
write_csv(CodedSegments, ExportFilename)

Open the output CSV file, copy the contents of the CypherStartChar and CypherEndChar columns into the Neo4j terminal and execute the queries.

Creating relationships between coded segments

It may be helpful to relate the distribution of coded segments. This is implemented through OVERLAPS, CONTAINS and FOLLOWS relationship types.

The following Cypher queries create relationships based on the positions of starting and ending characters.

This should be done on a document-by-document basis. Change the property value "XXX" to the name of the document.

Note that in the graph label model for the project from which these examples were drawn, the node label Saying is applied to all coded segments.

Overlaps:

MATCH (n:Saying), (m:Saying)
FOREACH(ignoreMe IN CASE
  WHEN n.document = "XXX"
  AND m.document = "XXX"
  AND n.endChar > m.startChar
  AND n.endChar < m.endChar
  THEN [1] ELSE [] END | 
  MERGE (n)-[r:OVERLAPS]->(m)
)

Contains:

MATCH (n:Saying), (m:Saying)
FOREACH(ignoreMe IN CASE
  WHEN n.document = "XXX"
  AND m.document = "XXX"
  AND n.startChar > m.startChar
  AND n.endChar < m.endChar
  THEN [1] ELSE [] END |
  MERGE (m)-[r:CONTAINS]->(n)
)

Follows:

MATCH (n:Saying), (m:Saying)
FOREACH(ignoreMe IN CASE
  WHEN n.document = "XXX"
  AND m.document = "XXX"
  AND m.startChar > n.endChar
  AND m.startChar - n.endChar = 2
  THEN [1] ELSE [] END |
  MERGE (m)-[r:FOLLOWS]->(n)
)

Migrating in-document memos [in progress]

Export in-document memos from MaxQDA.

Apply the R script to generate Cypher queries for creating nodes.

Execute the Cypher queries.

Relating speakers with coded segments

In Neo4j, export a csv containing the NodeID, startPara and endPara property values for coded segments on a document-by-document basis:

MATCH (n:Saying)
WHERE n.document:"XXX"
RETURN n.NodeID, n.startPara, n.endPara

Execute the following R script:

library(reshape2)
library(dplyr)
library(stringr)
library(stringi)
library(readr)
library(tidyr)
library(data.table)

# Read and prepare the transcript.
FullTranscript <- read_file("path/to/file")
AA <- str_split(FullTranscript, "\n\n")
text <- unlist(AA)
BB <- as.data.frame(text)
BB$speaker <- str_match(BB$text, "(.*?):")
BB$speaker <- gsub(":", "", BB$speaker, fixed = TRUE)
BB$para <- seq.int(nrow(BB))

# Read the csv exported from neo4j.
KK <- read.csv("path/to/file", header = TRUE, check.names = FALSE)
colnames(KK)[1] <- "NodeID"
colnames(KK)[2] <- "startPara"
colnames(KK)[3] <- "endPara"

# Identify speakers associated with each paragraph within the range of paragraphs pertaining to each coded segment.
for (i in (1:length(KK$startPara))) {
  X <- KK$startPara[i]
  Y <- KK$endPara[i]
  KK$Z[i] <- paste0(unique(X:Y), collapse = ";")
}

# Tidy up the data and write to file.
dt <- data.table(KK)
RR <- dt[ , list( Z = unlist( strsplit( Z , ";" ) ) ) , by = NodeID ]
RR$W <- as.character(BB$speaker[match(RR$Z, BB$para)])
write.csv(RR, "path/to/file")

Open the csv exported from R in Excel and apply the SUBSTITITE function to replace speaker's names with their NodeIDs.

=SUBSTITUTE(C2,"Zack","5719")

Use column filters to identify cases when multiple speakers, unknown speakers or placeholder speakers were indicated prior to the : in the transcripts. Create nodes for new individuals as needed.

Remove NA values, which correspond with paragraphs containing no :, such as notes from the transcriber or descriptions of people's actions or gestures.

Pass the values from selected columns into a template Cypher query into a new column called CreateRelationship using a concatenation formula:

="MATCH (a {NodeID:"&F2&"}), (b {NodeID:"&B2&"}) MERGE (a)-[:SAYS]->(b);"

In the Neo4j Desktop settings, ensure that the "Enable multi statement query editor" box is checked.

Copy and paste the contents of the 'CreateRelationship' column into the Neo4J terminal, and execute the query.

General tips for matching and updating data in Neo4j

Match all nodes with label Code underneath [Tools] in the code hierarchy:

MATCH ({name: "[Tools]"})<-[IS_A*]-(n:Code)
RETURN n

The maximum number of links (or levels in the code hierarchy) can be specified by an integer following the *:

MATCH ({name: "[Tools]"})<-[IS_A*3]-(n:Code)
RETURN n

Assign labels to matched nodes:

MATCH ({name: "[Tools]"})<-[IS_A*]-(n:Code)
SET n:Tool
RETURN n

Set new properties and fill their values:

MATCH (n:Tool)
SET n.kind = "hardware"
RETURN n.name, n.kind

Rename a node's property:

MATCH (n:Tool)
WHERE n.name IS NULL
SET n.name = n.CodeName
REMOVE n.CodeName
RETURN n.name

Replace relationship:

MATCH (n:Tool)<-[r:CodedWith]-(m:Saying)
CREATE (n)<-[r2:REFERS_TO]-(m)
SET r2.createdBy = r.CreatedBy
SET r2.createdDate = r.CreatedDate
DELETE r
RETURN n, m

Check for nodes with values in phantom properties:

MATCH (n)
WHERE n.Priority IS NOT NULL
RETURN n

Delete nodes and their relationships:

MATCH (n {NodeID:743}) DETACH DELETE n;

Check for duplicate nodes:

Detele duplicate relationships:

match ()-[r]->() 
match (s)-[r]->(e) 
with s,e,type(r) as typ, tail(collect(r)) as coll 
foreach(x in coll | delete x)

Integrating with hypothes.is

Refining MaxQDA HTML output

This will hide timestamps when rendering HTML output in the browser, but keep them as data associated with each coding.

Export each transcript as HTML, and check the box that says "display timestamps".

Create styles.css with the following:

p, li {
  white-space: pre-wrap;
}

.timestamp {
  display:none;
  font-family:'Helvetica Neue';
  font-size:12pt;
  color:#000000;
  background-color:transparent;
}

.segment {
  font-family:'Helvetica Neue';
  font-size:12pt;
  background-color:transparent;
}

.gap1 {
  -qt-paragraph-type:empty;
  margin-top:0px;
  margin-bottom:0px;
  margin-left:0px;
  margin-right:0px;
  -qt-block-indent:0;
  text-indent:0px;
  font-family:'Helvetica Neue';
  font-size:12pt;
}

In the html files, find and replace the following snippets:

Find Replace
`<style type="text/css">
p, li { white-space: pre-wrap; }
</style>` <html><head><meta name="qrichtext" content="1" /><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link rel="stylesheet" href="styles.css"></style></head>
<span style=" font-family:'Helvetica Neue'; font-size:12pt; color:#000000; background-color:transparent;"> <div class="timestamp">
<span style=" font-family:'Helvetica Neue'; font-size:12pt; background-color:transparent;"> <div class="segment">
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; font-family:'Helvetica Neue'; font-size:12pt;"><br /></p> <div class="gap1"><p></p></div>
</span> </div>

Then prettify the html.

Set up a web server to host the documents [in progress]

Automatically create hypothes.is annotations [in progress]

Set the annotation URIs as node properties [in progress]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment