@joewiz
Last active December 20, 2015 16:18
Fix problems with mis-capitalized names, with XQuery
xquery version "3.0";
declare namespace fn="http://www.w3.org/2005/xpath-functions";
(: Fix problems with mis-capitalized names. For example:
Before: MACARTHUR, Douglas II
After: MacArthur, Douglas II
:)
declare function local:fix-name-capitalization($name as xs:string) as xs:string {
    (:
        We'll use analyze-string() to split the name string into "words".
        We define "words" as runs of one or more upper- or lower-case letters and hyphens.
        E.g.:
            In "MAO Tse-tung", "MAO" and "Tse-tung" are the two words.
            In "MCCLURKIN, Robert J. G.", the words are "MCCLURKIN", "Robert", "J", and "G".
        The analyze-string() function will return results like:
            <fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
                <fn:match>MCCLURKIN</fn:match>
                <fn:non-match>, </fn:non-match>
                <fn:match>Robert</fn:match>
                <fn:non-match> </fn:non-match>
                <fn:match>J</fn:match>
                <fn:non-match>. </fn:non-match>
                <fn:match>G</fn:match>
                <fn:non-match>.</fn:non-match>
            </fn:analyze-string-result>
        We'll keep the "non-matches" unchanged and look carefully only at each "match" to make
        sure we apply the right capitalization rules.
    :)
    let $word-vs-non-word-pattern := '[A-Za-z-]+'
    let $analyze-string-result := analyze-string($name, $word-vs-non-word-pattern)
    return
        string-join(
            for $node in $analyze-string-result/*
            return
                (: let punctuation, spaces, etc. through unchanged :)
                if ($node/self::fn:non-match) then
                    $node/string()
                (: otherwise the node is a word: apply the first capitalization rule that fits :)
                else
                    (: MACARTHUR -> MacArthur; note that this rule also fires on names like MACK, giving "MacK" :)
                    if (starts-with(lower-case($node), 'mac')) then
                        concat('Mac', upper-case(substring($node, 4, 1)), lower-case(substring($node, 5)))
                    (: MCCARTHY -> McCarthy :)
                    else if (starts-with(lower-case($node), 'mc')) then
                        concat('Mc', upper-case(substring($node, 3, 1)), lower-case(substring($node, 4)))
                    (: II -> II: leave Roman numerals (generational suffixes) unchanged :)
                    else if (matches($node, '^[IVX]+$')) then
                        $node/string()
                    (: otherwise, just capitalize the word :)
                    else
                        concat(upper-case(substring($node, 1, 1)), lower-case(substring($node, 2)))
        )
};
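
(: A minimal sketch of the word-splitting step on its own, assuming the same
   '[A-Za-z-]+' pattern used above. The helper name local:words is hypothetical
   and is not called by the rest of this query; it only shows which substrings
   analyze-string() reports as matches.
   E.g., local:words('MCCLURKIN, Robert J. G.') returns
   ('MCCLURKIN', 'Robert', 'J', 'G').
:)
declare function local:words($name as xs:string) as xs:string* {
    (: keep only the fn:match children; the fn:non-match nodes hold punctuation and spaces :)
    analyze-string($name, '[A-Za-z-]+')/fn:match/string()
};
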
let $names :=
    (
    'MCCARTHY, Senator Joseph R.' (: potential problem because of "Mc" prefix :),
    'MACARTHUR, Douglas II' (: potential problem because of "Mac" prefix and generational name "II" :),
    'O’CONNOR, Roderic L' (: potential problem because of apostrophe in surname :),
    'VAN Hollen, Christopher' (: potential problem because the last name is in two parts :),
    'CHERWELL, Lord (Frederick Alexander Lindemann)' (: potential problem because of the parentheses :),
    'LINDEMANN, Frederick Alexander.' (: potential problem because of the period :),
    'MAO TSE-TUNG' (: potential problem because names are not comma-delimited :)
    )
return
    element results {
        for $name in $names
        return
            element result {
                element source { $name },
                element repair { local:fix-name-capitalization($name) }
            }
    }
<results>
    <result>
        <source>MCCARTHY, Senator Joseph R.</source>
        <repair>McCarthy, Senator Joseph R.</repair>
    </result>
    <result>
        <source>MACARTHUR, Douglas II</source>
        <repair>MacArthur, Douglas II</repair>
    </result>
    <result>
        <source>O’CONNOR, Roderic L</source>
        <repair>O’Connor, Roderic L</repair>
    </result>
    <result>
        <source>VAN Hollen, Christopher</source>
        <repair>Van Hollen, Christopher</repair>
    </result>
    <result>
        <source>CHERWELL, Lord (Frederick Alexander Lindemann)</source>
        <repair>Cherwell, Lord (Frederick Alexander Lindemann)</repair>
    </result>
    <result>
        <source>LINDEMANN, Frederick Alexander.</source>
        <repair>Lindemann, Frederick Alexander.</repair>
    </result>
    <result>
        <source>MAO TSE-TUNG</source>
        <repair>Mao Tse-tung</repair>
    </result>
</results>