These notes are meant to document the processes through which I migrate data from MaxQDA 2020 to a Neo4j graph database.
Some of these processes depend on certain coding conventions when marking up documents in MaxQDA, which will be documented elsewhere soon.
Some processes documented here generated results that needed to be hand-edited, and therefore this is not meant to be used as an aide for reproducible research. Instead, it is a series of processes that I'm sharing in case others might find them useful.
Export a table with all codes.
In Excel, populate a new column called NodeID
with ascending integers. If there are already nodes in the database, then offset the NodeID by the highest existing value.
In a new column called CorrectedCode
, use the following formula to extract the code from its full path. Replace the variable in curly brackets with the column that MaxQDA automatically names Code
.
=TRIM(RIGHT(SUBSTITUTE(A2,"\",REPT(" ",LEN(A2))),LEN(A2)))
Add node labels and property values in additional columns, as deemed necessary.
Some of these may be extracted from the names of codes, depending on coding conventions (i.e. characters used as shorthand to represent certain aspects of the code, see [[CodingConventions.md]]) using the following Excel formulas:
Extract values:
=SUBSTITUTE(A2,"P:","Position")
Remove anything before a specific character (in this case, :
):
=TRIM(RIGHT(SUBSTITUTE(A2,":",REPT(" ",LEN(A2))),LEN(A2)))
Remove anything after a specific character (in this case, :
):
=TRIM(LEFT(SUBSTITUTE(A2,":",REPT(" ",LEN(A2))),LEN(A2)))
Trim extrenuous whitespace at the beginning and end of a string:
=TRIM(A2)
Pass the values from selected columns into a template Cypher query into a new column called CreateNode
using a concatenation formula. Replace the variables or labels with columns where your values are located and/or include them in the formula manually:
="CREATE (n"&B2&":"&"Code:ActivityDomain"&" {NodeID:"&B2&", name:"&CHAR(34)&C2&CHAR(34)&"});"
ProTip: Excel uses double quotes as special characters in concatenation formulae, so to wrap strings in double quotes (as required by the Cypher spec) use CHAR(34)
instead.
In the Neo4j Desktop settings, ensure that the "Enable multi statement query editor" box is checked.
Copy and paste the contents of the 'CreateNode' column into the Neo4j terminal, and execute the query.
In another spreadsheet, create a table with columns From
and To
.
Paste the list of all NodeIDs into the From column.
Referring to the MaxQDA code tree and your table of codes and their associated NodeIDs, manually indicate in the To column the NodeID pertaining to the direct parent code of the code represented in the From column.
Add relationship type and property values in additional columns, as deemed necessary.
Pass the values from selected columns into a template Cypher query into a new column called CreateRelationship
using a concatenation formula:
="MATCH (a {NodeID:"&A2&"}), (b {NodeID:"&B2&"}) MERGE (a)-[:CHILD_OF {warrant:"&CHAR(34)&C2&CHAR(34)&"}]->(b);"
In the Neo4j Desktop settings, ensure that the "Enable multi statement query editor" box is checked.
Copy and paste the contents of the 'CreateRelationship' column into the Neo4J terminal, and execute the query.
Export code memos from MaxQDA.
Align them with the NodeIDs.
Parse the info into a Cypher query.
Execute the cypher queries.
Export coded segments from MaxQDA using the "Export to Excel" button in the retrieved segments pane.
Execute the following R script to process the data:
library(reshape2)
library(dplyr)
library(stringr)
ExportCodedSegments <- read.csv("path/to/file", header=TRUE, check.names = FALSE)
# Extract the name of the code from the full path.
ExportCodedSegments$Code <- sapply(strsplit(ExportCodedSegments$Code, "\\\\s*"), tail, 1)
ExportCodedSegments$Code <- sapply(strsplit(ExportCodedSegments$Code, ":\\s*"), tail, 1)
ExportCodedSegments$Code <- str_trim(ExportCodedSegments$Code)
# Retrieve the NodeIDs assigned to codes, already imported into the graph database.
Codes <- read.csv("path/to/file", header = TRUE, check.names = FALSE)
ExportCodedSegments$CodeID <- Codes$NodeID[match(ExportCodedSegments$Code, Codes$Code)]
# Create NodeIDs for the coded segments.
ExportCodedSegments$SegmentID <- cumsum(!duplicated(ExportCodedSegments$Segment))
ExportCodedSegments$SegmentID <- ExportCodedSegments$SegmentID + 1955
# Create or modify node labels, relationship type and property values.
ExportCodedSegments$RelationshipType <- "REFERS_TO"
ExportCodedSegments$`Created by`[ExportCodedSegments$`Created by` == "zackbatist"] <- "Zack Batist"
ExportCodedSegments$`Created by`[ExportCodedSegments$`Created by` == "costi"] <- "Costis Dallas"
ExportCodedSegments$Created <- sub(' ', 'T', ExportCodedSegments$Created)
ExportCodedSegments$Created <- paste0(ExportCodedSegments$Created, "Z")
ExportCodedSegments$Segment <- gsub('\\n\\n', ' <ParagraphBreak> ', ExportCodedSegments$Segment, fixed = TRUE)
ExportCodedSegments$Segment <- gsub('"', '\'', ExportCodedSegments$Segment, fixed = TRUE)
ExportCodedSegments$Segment <- gsub(':', '\:', ExportCodedSegments$Segment, fixed = TRUE)
# Pass the values from selected columns into template Cypher queries using concatenation formulae.
ExportCodedSegments$CreateNode <- paste0("CREATE (n", ExportCodedSegments$SegmentID, ":CodedSegment {NodeID:", ExportCodedSegments$SegmentID, ", document:'", ExportCodedSegments$`Document name`, "', startPara:", ExportCodedSegments$Beginning, ", endPara:", ExportCodedSegments$End, ", area:", ExportCodedSegments$Area, ", coverage:", ExportCodedSegments$`Coverage %`, ", text:'", ExportCodedSegments$Segment, "'});", sep = "")
ExportCodedSegments$CreateRelationship <- paste0("MATCH (a {NodeID:", ExportCodedSegments$CodeID, "}), (b {NodeID:", ExportCodedSegments$SegmentID, "}) CREATE (a)-[r:", ExportCodedSegments$RelationshipType, " {weight:", ExportCodedSegments$`Weight score`, ", createdBy:'", ExportCodedSegments$`Created by`, "', createdDate:'", ExportCodedSegments$Created, "' }]->(b);", sep = "")
# Write queries files.
CreateSegmentNodesCypher <- ExportCodedSegments[!duplicated(ExportCodedSegments$SegmentID),]
CreateSegmentNodesCypher <- CreateSegmentNodesCypher$CreateNode
write.csv(CreateSegmentNodesCypher, "path/to/file", row.names = FALSE)
CreateSegmentRelationshipsCypher <- ExportCodedSegments$CreateRelationship
write.csv(CreateSegmentRelationshipsCypher, "path/to/file", row.names = FALSE)
Open the output CreateNode-Segments.csv
and CreateRelationship-Segments.csv
in a text editor and remove the header line and any trailing quotes. Find and replace is your friend for any touch-ups, but be wary when making any mass changes to the segment text at this stage.
Copy and paste into the Neo4j terminal, and execute the query.
Copy and paste each transcript into an individual text file.
In MaxQDA, activate each document as well as the codes of interest, and export the retrieved segments as an Excel spreadsheet. Convert the Excel spreadsheet to CSV format. Do this for each document independently.
Execute the following R script to process the data:
library(readr)
library(stringi)
# Read the data.
filename <- file.choose()
FullTranscript <- read_file(filename)
filename2 <- file.choose()
CodedSegments <- read.csv(filename2, header = TRUE)
# Allocate existing NodeIDs to the recently exported file of coded segments.
AllCodedSegments <- read.csv("path/to/file", header = TRUE, check.names = FALSE)
CodedSegments$NodeID <- AllCodedSegments$SegmentID[match(CodedSegments$Segment, AllCodedSegments$Segment)]
# Identify the character positions for each segment and copy them over to their own columns.
QQ <- stri_locate_all(pattern = CodedSegments$Segment, FullTranscript, fixed = TRUE)
CodedSegments$startChar <- sapply(1:length(QQ), function(i) as.numeric(QQ[[i]][[1]]))
CodedSegments$endChar <- sapply(1:length(QQ), function(i) as.numeric(QQ[[i]][[2]]))
# Pass the values from selected columns into template Cypher queries using concatenation formulae.
CodedSegments$CypherStartChar <- paste0("MATCH (n { NodeID: ",CodedSegments$NodeID," }) SET n.startChar = ",CodedSegments$startChar, " RETURN n.NodeID, n.startChar;")
CodedSegments$CypherEndChar <- paste0("MATCH (n { NodeID: ",CodedSegments$NodeID," }) SET n.endChar = ",CodedSegments$endChar, " RETURN n.NodeID, n.endChar;")
# Write to file.
ImportFilename <- basename(filename)
ExportFilename <- paste0("segment-character-resolution/",ImportFilename,"-QQ.csv")
write_csv(CodedSegments, ExportFilename)
Open the output CSV file, copy the contents of the CypherStartChar
and CypherEndChar
columns into the Neo4j terminal and execute the queries.
It may be helpful to relate the distribution of coded segments. This is implemented through OVERLAPS, CONTAINS and FOLLOWS relationship types.
The following Cypher queries create relationships based on the positions of starting and ending characters.
This should be done on a document-by-document basis. Change the property value "XXX" to the name of the document.
Note that in the graph label model for the project from which these examples were drawn, the node label Saying is applied to all coded segments.
Overlaps:
MATCH (n:Saying), (m:Saying)
FOREACH(ignoreMe IN CASE
WHEN n.document = "XXX"
AND m.document = "XXX"
AND n.endChar > m.startChar
AND n.endChar < m.endChar
THEN [1] ELSE [] END |
MERGE (n)-[r:OVERLAPS]->(m)
)
Contains:
MATCH (n:Saying), (m:Saying)
FOREACH(ignoreMe IN CASE
WHEN n.document = "XXX"
AND m.document = "XXX"
AND n.startChar > m.startChar
AND n.endChar < m.endChar
THEN [1] ELSE [] END |
MERGE (m)-[r:CONTAINS]->(n)
)
Follows:
MATCH (n:Saying), (m:Saying)
FOREACH(ignoreMe IN CASE
WHEN n.document = "XXX"
AND m.document = "XXX"
AND m.startChar > n.endChar
AND m.startChar - n.endChar = 2
THEN [1] ELSE [] END |
MERGE (m)-[r:FOLLOWS]->(n)
)
Export in-document memos from MaxQDA.
Apply the R script to generate Cypher queries for creating nodes.
Execute the Cypher queries.
In Neo4j, export a csv containing the NodeID, startPara and endPara property values for coded segments on a document-by-document basis:
MATCH (n:Saying)
WHERE n.document:"XXX"
RETURN n.NodeID, n.startPara, n.endPara
Execute the following R script:
library(reshape2)
library(dplyr)
library(stringr)
library(stringi)
library(readr)
library(tidyr)
library(data.table)
# Read and prepare the transcript.
FullTranscript <- read_file("path/to/file")
AA <- str_split(FullTranscript, "\n\n")
text <- unlist(AA)
BB <- as.data.frame(text)
BB$speaker <- str_match(BB$text, "(.*?):")
BB$speaker <- gsub(":", "", BB$speaker, fixed = TRUE)
BB$para <- seq.int(nrow(BB))
# Read the csv exported from neo4j.
KK <- read.csv("path/to/file", header = TRUE, check.names = FALSE)
colnames(KK)[1] <- "NodeID"
colnames(KK)[2] <- "startPara"
colnames(KK)[3] <- "endPara"
# Identify speakers associated with each paragraph within the range of paragraphs pertaining to each coded segment.
for (i in (1:length(KK$startPara))) {
X <- KK$startPara[i]
Y <- KK$endPara[i]
KK$Z[i] <- paste0(unique(X:Y), collapse = ";")
}
# Tidy up the data and write to file.
dt <- data.table(KK)
RR <- dt[ , list( Z = unlist( strsplit( Z , ";" ) ) ) , by = NodeID ]
RR$W <- as.character(BB$speaker[match(RR$Z, BB$para)])
write.csv(RR, "path/to/file")
Open the csv exported from R in Excel and apply the SUBSTITITE
function to replace speaker's names with their NodeIDs.
=SUBSTITUTE(C2,"Zack","5719")
Use column filters to identify cases when multiple speakers, unknown speakers or placeholder speakers were indicated prior to the :
in the transcripts. Create nodes for new individuals as needed.
Remove NA
values, which correspond with paragraphs containing no :
, such as notes from the transcriber or descriptions of people's actions or gestures.
Pass the values from selected columns into a template Cypher query into a new column called CreateRelationship
using a concatenation formula:
="MATCH (a {NodeID:"&F2&"}), (b {NodeID:"&B2&"}) MERGE (a)-[:SAYS]->(b);"
In the Neo4j Desktop settings, ensure that the "Enable multi statement query editor" box is checked.
Copy and paste the contents of the 'CreateRelationship' column into the Neo4J terminal, and execute the query.
Match all nodes with label Code underneath [Tools] in the code hierarchy:
MATCH ({name: "[Tools]"})<-[IS_A*]-(n:Code)
RETURN n
The maximum number of links (or levels in the code hierarchy) can be specified by an integer following the *
:
MATCH ({name: "[Tools]"})<-[IS_A*3]-(n:Code)
RETURN n
Assign labels to matched nodes:
MATCH ({name: "[Tools]"})<-[IS_A*]-(n:Code)
SET n:Tool
RETURN n
Set new properties and fill their values:
MATCH (n:Tool)
SET n.kind = "hardware"
RETURN n.name, n.kind
Rename a node's property:
MATCH (n:Tool)
WHERE n.name IS NULL
SET n.name = n.CodeName
REMOVE n.CodeName
RETURN n.name
Replace relationship:
MATCH (n:Tool)<-[r:CodedWith]-(m:Saying)
CREATE (n)<-[r2:REFERS_TO]-(m)
SET r2.createdBy = r.CreatedBy
SET r2.createdDate = r.CreatedDate
DELETE r
RETURN n, m
Check for nodes with values in phantom properties:
MATCH (n)
WHERE n.Priority IS NOT NULL
RETURN n
Delete nodes and their relationships:
MATCH (n {NodeID:743}) DETACH DELETE n;
Check for duplicate nodes:
Detele duplicate relationships:
match ()-[r]->()
match (s)-[r]->(e)
with s,e,type(r) as typ, tail(collect(r)) as coll
foreach(x in coll | delete x)
This will hide timestamps when rendering HTML output in the browser, but keep them as data associated with each coding.
Export each transcript as HTML, and check the box that says "display timestamps".
Create styles.css
with the following:
p, li {
white-space: pre-wrap;
}
.timestamp {
display:none;
font-family:'Helvetica Neue';
font-size:12pt;
color:#000000;
background-color:transparent;
}
.segment {
font-family:'Helvetica Neue';
font-size:12pt;
background-color:transparent;
}
.gap1 {
-qt-paragraph-type:empty;
margin-top:0px;
margin-bottom:0px;
margin-left:0px;
margin-right:0px;
-qt-block-indent:0;
text-indent:0px;
font-family:'Helvetica Neue';
font-size:12pt;
}
In the html files, find and replace the following snippets:
Find | Replace |
---|---|
`<style type="text/css"> | |
p, li { white-space: pre-wrap; } | |
</style>` | <html><head><meta name="qrichtext" content="1" /><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link rel="stylesheet" href="styles.css"></style></head> |
<span style=" font-family:'Helvetica Neue'; font-size:12pt; color:#000000; background-color:transparent;"> |
<div class="timestamp"> |
<span style=" font-family:'Helvetica Neue'; font-size:12pt; background-color:transparent;"> |
<div class="segment"> |
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; font-family:'Helvetica Neue'; font-size:12pt;"><br /></p> |
<div class="gap1"><p></p></div> |
</span> |
</div> |
Then prettify the html.