The R package `XML` for parsing and manipulating XML documents in R is no longer actively maintained, but it is still used by many.
The R package `xml2` is an actively maintained, more recent alternative.
This file documents useful resources and steps for moving from `XML` to `xml2`.

- Collected at r-lib/xml2#246
- jennybc/googlesheets#102
- https://github.com/quanteda/quanteda/pull/1364/files
- https://github.com/quanteda/readtext/pull/128/files
- https://github.com/andrie/sss/commit/8787dd01dc6784bf1fc0f0301ce5d318adad7f70
- https://github.com/cpsievert/rdom/pull/14/files
- https://github.com/USGS-R/geoknife/pull/362/files and https://github.com/USGS-R/geoknife/commit/76b4b47297b7e2f2cd96fb783ec2f35db2ac2f8b
- https://github.com/mikldk/ryacas/commit/e31954259b4ee9e23b0566566b94a9cad7e32ab8 (pretty straightforward example, regex-able)
- https://github.com/cloudyr/aws.alexa/commit/5f5441f1ac8bd4f1f409e2e61ae3ab154962f6ec

The `itdepends` package helps with finding all usages of `XML`, see https://speakerdeck.com/jimhester/it-depends?slide=38

    devtools::install_github("jimhester/itdepends")
    library("itdepends")
    itdepends::dep_locate("XML")

`XML` | `xml2` | Comment |
---|---|---|
`XML::getNodeSet(doc = <document object>, path = "<XPath expression>")` or `XML::xpathApply(...)` | `xml2::xml_find_all(..)` and `xml2::xml_find_first(..)` with `x = <node>, xpath = "<XPath 1.0 expression>"` | Find matching nodes |
`XML::htmlTreeParse(<path>, asText = <treat file as text>)` | `xml2::read_html(<path, URL, connection, or literal xml>)` | Parse HTML document |
`XML::isXMLString("<string>")` | No direct equivalent, can try to parse... | Heuristically determine if string is XML |
`XML::toString.XMLNode(<node>)` | `as.character(<document or node>)` | Object to character |
`XML::xmlAttrs(node = <node object>)` | `xml2::xml_attrs(x = <document, node, or node set>)` | Get the attributes of a node, both return a named character vector. |
`XML::xmlApply(X = <node>)` and `XML::xmlSApply(..)` | functions `xml2::xml_attrs(..)` and `xml2::xml_contents(..)` are vectorized | Apply function to each child of a node |
`XML::xmlChildren(x = <node object>)[["<name of the sub-node>"]]` | `xml2::xml_child(x = <node>, search = <number, or name of the sub-node>)` (only elements) and `xml2::xml_contents(..)` for all nodes | Get sub-nodes of a node |
`XML::xmlElementsByTagName(el = <node object>, name = "<name to match>")` | `xml2::xml_find_all(x = <document, node, node set>, xpath = "<name to match>")` | Retrieve children matching tag name (children/sub-elements) |
`XML::xmlGetAttr(node = <node object>, name = "<attribute name>", default = "<default>")` | `xml2::xml_attr(x = <document, node, or node set>, attr = "<attribute name>")` | Get value of a node's attribute |
`XML::xmlName(node = <node object>)` | `xml2::xml_name(x = <document, node, or node set>)` | Get name of a node |
`XML::xmlParse(..)` | `xml2::read_xml(..)` | Unexposed method in XML ? |
`XML::xmlParseDoc(file = <file name> or "<xml content>", asText = !file.exists(file))` | `xml2::read_xml(x = <string, connection, URL, or raw vector>)` | Parse XML document |
`XML::xmlParseString(content = "<string>")` | `xml2::read_xml(x = <string, connection, URL, or raw vector>)` | Convenience function XML to node/tree |
`XML::xmlRoot(x = <node object>)` | `xml2::xml_root(x = <document, node, or node set>)` | Get top-level node |
`XML::xmlSize(obj = <node or document object>)` | `xml2::xml_length()` | Note that `xml_length(..)` does not need to go to the root first, i.e. `XML::xmlSize(XML::xmlRoot(old)) == xml2::xml_length(new)` |
`XML::xmlToList(node = <xml node or document>)` | `xml2::as_list(x = <document, node, or node set>)` | Convert to R-like list; difference: `as_list` does not drop the root element |
`XML::xmlTreeParse(file = <file name> or "<xml content>", asText = !file.exists(file))` | `xml2::read_xml(x = <string, connection, URL, or raw vector>)` | Parse XML document |
`if (!is.null(<node object>[["<child name>"]])) {` | `if (!inherits(xml2::xml_child(<node object>, "<child name>"), "xml_missing")) {` | Checking for child node existence |
`XML::xmlValue(<node object>)` | `xml2::xml_text(x = <document, node, or node set>)` | Get/Set contents of a leaf node |

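
To make the reading and querying rows of the table more concrete, here is a minimal sketch that uses only `xml2`. The sample document and all variable names are made up for illustration; the calls are the ones listed in the `xml2` column.

```r
library(xml2)

# Hypothetical sample document, for illustration only
doc <- read_xml(
  '<catalog>
     <book id="b1"><title>First</title></book>
     <book id="b2"><title>Second</title></book>
   </catalog>'
)

# Name of the top-level node (replaces XML::xmlName(XML::xmlRoot(..)))
root_name <- xml_name(xml_root(doc))

# Number of child elements; no need to go to the root first
n_children <- xml_length(doc)

# XPath query returning a (potentially empty) node set
books <- xml_find_all(doc, "//book")

# xml_attr() and xml_text() are vectorised over node sets,
# so no xmlSApply()-style loop is needed
ids <- xml_attr(books, "id")
titles <- xml_text(xml_find_first(books, "./title"))
```

Because the accessor functions are vectorised, many `XML::xmlSApply(..)` calls can probably be collapsed into a single call on a node set.
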
Common snippets
`XML` | `xml2` | Comment |
---|---|---|
`if (!is.null(XML::xmlChildren(x = obj)[[<node name>]]))` | `if (!inherits(xml2::xml_find_first(x = obj, xpath = <node name>), "xml_missing"))` | Check if element exists. |
`if (!is.null(XML::xmlAttrs(node = obj)[["href"]]))` | `if (!is.na(xml2::xml_attr(x = obj, attr = "href")))` | Check for a potentially non-existing attribute. |

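
The two existence checks can be tried on a small document as follows. This is only a sketch; the `<link>` document and the variable names are hypothetical.

```r
library(xml2)

# Hypothetical sample document, for illustration only
obj <- read_xml('<link href="https://example.org"><label>Example</label></link>')

# Element existence: xml_find_first() returns an "xml_missing" object
# instead of NULL when nothing matches
has_label <- !inherits(xml_find_first(obj, "./label"), "xml_missing")
has_title <- !inherits(xml_find_first(obj, "./title"), "xml_missing")

# Attribute existence: xml_attr() returns NA for a non-existing attribute
has_href <- !is.na(xml_attr(obj, "href"))
has_rel <- !is.na(xml_attr(obj, "rel"))

# Subsetting the named vector from xml_attrs() with [[ fails for a
# non-existing attribute, so the error is captured here for demonstration
rel_error <- tryCatch(
  xml_attrs(obj)[["rel"]],
  error = function(e) conditionMessage(e)
)
```
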
`XML` | `xml2` | Comment |
---|---|---|
`XML::addAttributes(node = <node object>, ..., .attrs = <character vector with attribute names>, append = <replace or add>)` | `xml2::xml_set_attrs(x = <document, node, node set>, value = <named character vector>)` to set multiple attributes and overwrite existing ones, or `xml2::xml_set_attr(x = <node>, attr = <name>, value = <value>)` to append a single attribute | Add attributes to a node; in `xml2` no re-assigning the object is needed, i.e. no `doc <- XML::addAttributes(node = doc, ...)` |
`XML::addChildren(node = <node object>, kids = list())` | `xml2::xml_add_child(.x = <document or nodeset>, .value = <document, node or nodeset>)` | Add child nodes to a node |
`XML::saveXML(doc = <xml document object>, file = "<file name>")` | `xml2::write_xml(x = <document or node>, file = "<path or connection>")` | Write XML document to string or file |
`XML::xmlNamespaceDefinitions(x = <node>)` | `xml2::xml_ns(x = <document, node, or node set>)` | Get namespace definitions from a node |
`XML::xmlNode(name = "<node name>")` | `xml2::xml_new_document() %>% xml2::xml_add_child("<node name>")` or (preferred in docs) `xml2::xml_new_root("<node name>")` | Create a new node |
`XML::xmlValue()` | `xml2::xml_text(x = <document, node, or node set>)` | Get/Set contents of a leaf node |

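
The creation and manipulation functions can be combined as in the following sketch. Element names, attribute values and the temporary file are made up for illustration; the loop stands in for an `XML::addChildren(..)` call with several kids.

```r
library(xml2)

# Create a new document with a root node (replaces XML::xmlNode(..))
root <- xml_new_root("parent")

# Add several children in a loop; the unnamed third argument
# becomes the text content of the new node
for (label in c("first", "second")) {
  xml_add_child(root, "child", label)
}

children <- xml_find_all(root, "./child")

# Setters modify the document in place, no re-assignment needed
xml_set_attr(children[[1]], "id", "c1")
xml_set_attrs(children[[2]], c(id = "c2", lang = "en"))

# Serialise to a string or write to a file (replaces XML::saveXML(..))
xml_string <- as.character(root)
path <- tempfile(fileext = ".xml")
write_xml(root, path)

# Read the file back to check the result
reread <- read_xml(path)
reread_children <- xml_find_all(reread, "./child")
```
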
`XML` | `xml2` | Comment |
---|---|---|
`XMLAbstractDocument` | `xml_document` | .. |
`XMLAbstractNode`, `XMLCommentNode`, `XMLTextNode`, ... | `xml_node` | .. |
? | `xml_missing` | .. |

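
For code that dispatches on classes, the `xml2` classes from the table can be checked with `inherits(..)`. The tiny document below is a hypothetical example, and `xml_nodeset` is included as the class returned by `xml2::xml_find_all(..)`.

```r
library(xml2)

# Hypothetical sample document, for illustration only
doc <- read_xml("<a><b/></a>")

node <- xml_find_first(doc, "./b")          # existing element
missing_node <- xml_find_first(doc, "./c")  # no such element
nodes <- xml_find_all(doc, "./b")           # node set

is_document <- inherits(doc, "xml_document")
is_node <- inherits(node, "xml_node")
is_missing <- inherits(missing_node, "xml_missing")
is_nodeset <- inherits(nodes, "xml_nodeset")
```
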
The following steps were applied in switching from `XML` to `xml2` for the package `sos4R`.
This is not a "clean" process, but hopefully provides useful input for others doing the switch.
Ideally the lessons learned on what can be "regex-ed" and what needs manual interaction go into the above tables at a later stage.

- Make sure all functions use named parameters and package prefix with the following regular expressions
  - `addAttributes\((?!node)` replaced with `XML::addAttributes(node =`
  - `addChildren\(node` replaced with `XML::addChildren(node`
  - `getNodeSet\((?!doc)` replaced with `XML::getNodeSet(doc =`
  - `isXMLString\((?!str)` replaced with `XML::isXMLString(str =`
  - `saveXML\((?!doc)` replaced with `XML::saveXML(doc =`
  - `xmlAttrs\((?!node)` replaced with `XML::xmlAttrs(node =`
  - `xmlChildren\((?!x)` replaced with `XML::xmlChildren(x =`
  - `xmlElementsByTagName` replaced with `XML::xmlElementsByTagName`
  - `xmlGetAttr\((?!node)` replaced with `XML::xmlGetAttr(node =`
  - `xmlName\((?!node)` replaced with `XML::xmlName(node =`
  - `xmlNode\((?!name)` and `xmlNode\(name =` replaced with `XML::xmlNode(name =`
  - `xmlParse\(` replaced with `XML::xmlParse(file =`
  - `xmlParseDoc\((?!file)` replaced with `XML::xmlParseDoc(file =`
  - `xmlParseString\(` replaced with `XML::xmlParseString(content =`
  - `xmlRoot\((?!x)` replaced with `XML::xmlRoot(x =`
  - `xmlSize\(` replaced with `XML::xmlSize(obj =`
  - `xmlToList\(` replaced with `XML::xmlToList(node =`
  - `xmlTreeParse\(` replaced with `XML::xmlTreeParse(file =`
  - `xmlValue\((?!x)` replaced with `XML::xmlValue(x =`
- `Imports:` XML instead of `Depends:`
- Run tests - skip the ones unrelated to XML handling
- Commit:
- Do the switch (parsing functions first, all searches in files `*.R`, files in `/sandbox/` ignored for manual corrections; order driven by running a basic parsing test and seeing where it fails next)
- `XML::xmlParseDoc`
  - Replace `XML::xmlParseDoc(file =` with `xml2::read_xml(x =` (26 occurrences)
  - Fix parameters
    - drop `, asText = TRUE` by replacing it with an empty string (blank, 11 occurrences)
    - turn `options` into vector with strings
    - replace `c(XML::NOERROR, XML::RECOVER)` with `SosDefaultParsingOptions()`
    - use `xmlParseOptions` everywhere
- `XML::xmlParseString`
  - Replaced manually by simplifying the implementation of `encodeXML` for signature `"character"`
- `XML::xmlParse`
  - Replaced single occurrence manually and refactored method `parseFile`
- `XML::xmlRoot`
  - Replace `XML::xmlRoot` with `xml2::xml_root` (25 occurrences)
- `XML::xmlName`
  - Replace `XML::xmlName(node =` with `xml2::xml_name(x =` (30 occurrences)
  - Manually added `, ns = SosAllNamespaces()` later to have names with prefix
- `XML::xmlAttrs`
  - Replace `XML::xmlAttrs(node =` with `xml2::xml_attrs(x =` (3 occurrences)
  - Fix further occurrences manually by searching for `xmlAttrs` (must have slipped by before)
  - `xml2::xml_attrs(x = obj)[["href"]]` does not work: if the attribute `href` does not exist, there is a "subscript out of bounds" error. Need to use `xml2::xml_attr(..)` instead.
  - Search for `xml2::xml_attrs\(x = (.*)\[\[` and fix manually to `xml2::xml_attr(x = obj, attr = "<attribute name>")` and update subsequent `is.null(..)` checks to use `is.na(..)`
- `XML::xmlGetAttr`
  - Replace `XML::xmlGetAttr\(node = (.*), name =` with `xml2::xml_attr(x = $1, attr =` (55 occurrences)
  - Manually fix the ones spread across multiple lines and the ones with missing `name =`; indentation can be fixed or newlines removed at the same time
  - Manually fix where `xmlGetAttr` was used with `lapply(..)` or `sapply(..)`
- `XML::xmlValue`
  - Replace `XML::xmlValue\(x =` with `xml2::xml_text(x =` (45 occurrences)
- `XML::xmlChildren`
  - Replace `XML::xmlChildren\(x =` with `xml2::xml_children(x =` (22 occurrences)
  - The common pattern `XML::xmlChildren(x = obj)[[gmlTimeInstantName]]` does not work because `xml2::xml_children(..)` does not return a named list. Need to run `xml2::xml_find_all(x = obj, xpath = gmlTimeInstant)` or `xml2::xml_find_first(..)` then. Search for `xml2::xml_children\(x = (.*)\[\[` to fix those manually (10 results).
    - `..find_first` returns missing node: `is.na(xml2::xml_find_first(x, "f"))` or `inherits(xml2::xml_find_first(x, "f"), "xml_missing")`
    - `..find_all` returns (potentially empty) nodeset: `length(xml2::xml_find_all(x, "f"))`
- Replaced occurrences of class `XMLAbstractNode` and `XMLInternalDocument` for slots in S4 classes with `ANY` and the default prototype to `xml2::xml_missing()`, will have to handle stuff manually around these classes
  - Opened issue about this in `xml2` repo: r-lib/xml2#248
- Add `SosAllNamespaces()` and add namespaces to all the `xxxName` constants in `R/Constants.R`
- `test_exceptionreports.R` complete
- `test_sams.R` added and parsing fixed
- `XML::getNodeSet` manually switched to `xml2::xml_find_all(..)` and `xml2::xml_find_first(..)`, because XPath-based getting of sub-nodes with `xml2` also requires proper namespaces and some handling can be simplified because of vectorised `xml2::xml_text(..)`.
- `XML::xmlSize`
  - Updated single occurrence manually
- `XML::saveXML`
  - Replaced `XML::saveXML(doc =` with `xml2::write_xml(x =` (6 occurrences), no parameters in `saveXML` besides `doc` and `file` were used
- Update `NAMESPACE` to import `xml2` and not `XML`
- Parsing tests of `test_sensors.R` work
- `XML::isXMLString`
  - Replace with own function using simple regex test: `grepl("^<(.*)>$", "...")`
- get rid of `.filterXmlChildren` and `.filterXmlOnlyNoneTexts` manually using `xml2::xml_child(..)`, `xml2::xml_find_first(..)` or `xml2::xml_find_all(..)`
  - also remove all `".noneText"` objects (and by that fix all occurrences of `xmlTagName`)
  - `is.na(xml2::` > fix using `is.na(..)` (regex, 16 occurrences)
- must fix all `obj[[` because subsetting with `[[` does not work the same way with `xml2` objects (107 occurrences at this point!)
  - trying to automate by replacing `obj\[\[(.*?)\]\]` with `xml2::xml_child(x = obj, search = $1, ns = SosAllNamespaces())`
  - revert the changes in summary functions where `obj[[..]]` was used (file `PrintShowStructureSummary-methods.R`)
  - does not work for multiple subsets, e.g. `obj[["elementCount"]][["Count"]][["value"]]` > search for `SosAllNamespaces())[[` and fix manually to use XPath (4 occurrences)
  - re-check occurrences of `.children[[`
  - `is.null\(\.` with some XML object, should be `is.na(..)` which picks up on `"xml_missing"` objects
  - New tests added for...
    - `parseOwsRange`
    - `parseSosFilter_Capabilities`
    - `parseOwsServiceIdentification`
    - `parseTime`
    - `parseSosObservationOffering` (also for 2.0.0)
  - fix tests in `test_sensors.R`
- [Continue with encoding functions]
- `XML::addAttributes`
  - switched manually because sometimes `.attrs` is used, which is replaced with `xml2::xml_set_attrs()`, and sometimes not (single `...`), which is replaced with `xml2::xml_set_attr()`; the `_set_attr` variants operate directly on the object (no need to re-assign), and often statements are multi-line (18 occurrences)
  - get rid of `.sos100_NamespaceDefinitionsForAll`
- `XML::xmlNode` and `XML::addChildren`
  - manually switched to `xml2::xml_new_root("<node name>")` and `xml2::xml_add_child("<node name>")`
  - `attrs` parameter replaced with `xml2::xml_set_attrs()`
  - r-lib/xml2#239 is a problem
  - `XML::addChildren` with `"append = TRUE"` replace with a for loop and `xml2::xml_add_child(..)`

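
The regex-driven steps in the list above can also be tried out from R itself. The following is a minimal sketch that applies the two `xmlGetAttr` patterns to a few made-up source lines with base R's `gsub(..)`. It assumes the `$1` in the steps is the search-and-replace syntax of an editor; in R the backreference is written `\\1`, and the negative lookahead `(?!node)` needs `perl = TRUE`.

```r
# Hypothetical lines of package source code, for illustration only
src <- c(
  'id <- xmlGetAttr(obj, name = "id")',
  'id <- XML::xmlGetAttr(node = obj, name = "id")'
)

# Step 1: add package prefix and parameter name where the call does
# not already start with a named `node` argument
named <- gsub(
  "xmlGetAttr\\((?!node)",
  "XML::xmlGetAttr(node = ",
  src,
  perl = TRUE
)

# Step 2: switch to xml2, carrying the node expression over
# with a capture group
migrated <- gsub(
  "XML::xmlGetAttr\\(node = (.*), name = ",
  "xml2::xml_attr(x = \\1, attr = ",
  named
)
```

Statements spread across multiple lines would not be matched by this line-based approach and still need manual fixes.
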
Limitations of regexes for the actual switch are due to multi-line statements and the result of functions not being the same.
Especially the subsetting with `[[` used extensively does not work the same way anymore.
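
As an illustration of replacing chained `[[` subsetting, here is a sketch with a single namespaced XPath expression next to the equivalent step-by-step `xml2::xml_child(..)` calls. The sample document is made up, and `xml_ns(doc)` stands in for a package-level namespace helper such as `SosAllNamespaces()`.

```r
library(xml2)

# Hypothetical sample document, for illustration only
doc <- read_xml(
  '<swe:DataArray xmlns:swe="http://www.opengis.net/swe/1.0.1">
     <swe:elementCount>
       <swe:Count><swe:value>3</swe:value></swe:Count>
     </swe:elementCount>
   </swe:DataArray>'
)

# Namespace prefixes used in the XPath expressions below
ns <- xml_ns(doc)

# One XPath expression instead of
# obj[["elementCount"]][["Count"]][["value"]]
value_node <- xml_find_first(
  doc,
  "./swe:elementCount/swe:Count/swe:value",
  ns = ns
)
count <- as.integer(xml_text(value_node))

# The same lookup step by step with xml_child()
element_count <- xml_child(doc, "swe:elementCount", ns = ns)
count_node <- xml_child(element_count, "swe:Count", ns = ns)
value_child <- xml_child(count_node, "swe:value", ns = ns)

# Passing ns to xml_name() qualifies the name with the prefix from ns
qualified_name <- xml_name(value_node, ns = ns)
```
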
Hello,
Your blog is very useful, thank you for that.
However, I wanted to let you know that XML is back again after all that period with a new version, see https://cran.r-project.org/web/packages/XML/index.html
Best regards,