@joewiz
Last active March 3, 2016 04:57
Trim phrases of arbitrary length to a maximum length, without cutting off words or ending on unwanted words, with XQuery
xquery version "3.0";

declare function local:trim-phrase-to-length($phrase, $length) {
    (: if the phrase is already short enough, we're done :)
    if (string-length($phrase) le $length) then
        $phrase
    (: the phrase is too long, so... :)
    else
        (: we will split the phrase into words and look for the longest possible arrangement within our length limit,
           that doesn't end with boring words :)
        let $words := tokenize(normalize-space($phrase), '\s+')
        let $do-not-end-with-these-words := ('a', 'an', 'the', 'to', 'of', 'as', 'in', 'with', 'for')
        let $best-fit :=
            (: the heart of things: start with one word, then two, then three, measuring the length of the
               new phrase until we exceed the desired length. if the 18th word exceeds the length, then this
               routine will return '17'. but if word 17 is one of the unwanted last words, we'll get '16'. :)
            max(
                for $n in (1 to count($words))
                let $subset := subsequence($words, 1, $n)
                let $subset-length := string-length(string-join($subset, ' '))
                return
                    if ($subset-length le $length and not(lower-case($words[$n]) = $do-not-end-with-these-words)) then $n else ()
            )
        (: now we know how many words we need, and we join them into a new phrase :)
        let $snipped-phrase := string-join(subsequence($words, 1, $best-fit), ' ')
        (: now we're ready to prepare the new phrase with ellipsis to indicate truncation :)
        return
            (: but no ellipsis is needed if the phrase already ends with a period - it's already a complete sentence :)
            if (matches($snipped-phrase, '\.$')) then
                $snipped-phrase
            (: if the phrase ends with a comma or similar punctuation, we'll replace it with an ellipsis :)
            else if (matches($snipped-phrase, '[,:;]$')) then
                concat(substring($snipped-phrase, 1, string-length($snipped-phrase) - 1), '...')
            (: otherwise, we'll just add an ellipsis :)
            else
                concat($snipped-phrase, '...')
};

let $phrases :=
    <phrases>
        <phrase>Egypt’s military on Wednesday ousted Mohamed Morsi, the nation’s first freely elected president, suspending the Constitution, installing an interim government and insisting it was responding to the millions of Egyptians who had opposed the Islamist agenda of Mr. Morsi and his allies in the Muslim Brotherhood.</phrase>
        <phrase>Leslie James Pickering noticed something odd in his mail last September: A handwritten card, apparently delivered by mistake, with instructions for postal workers to pay special attention to the letters and packages sent to his home.</phrase>
    </phrases>
return
    <results>{
        for $phrase in $phrases/phrase
        let $trimmed := local:trim-phrase-to-length($phrase, 140)
        return
            <result>
                <phrase chars="{string-length($phrase)}">{$phrase/string()}</phrase>
                <trimmed chars="{string-length($trimmed)}">{$trimmed}</trimmed>
            </result>
    }</results>
<results>
    <result>
        <phrase chars="310">Egypt’s military on Wednesday ousted Mohamed Morsi, the nation’s first
            freely elected president, suspending the Constitution, installing an interim government
            and insisting it was responding to the millions of Egyptians who had opposed the
            Islamist agenda of Mr. Morsi and his allies in the Muslim Brotherhood.</phrase>
        <trimmed chars="139">Egypt’s military on Wednesday ousted Mohamed Morsi, the nation’s first
            freely elected president, suspending the Constitution, installing...</trimmed>
    </result>
    <result>
        <phrase chars="233">Leslie James Pickering noticed something odd in his mail last September:
            A handwritten card, apparently delivered by mistake, with instructions for postal
            workers to pay special attention to the letters and packages sent to his home.</phrase>
        <trimmed chars="127">Leslie James Pickering noticed something odd in his mail last
            September: A handwritten card, apparently delivered by mistake...</trimmed>
    </result>
</results>
joewiz commented Jul 4, 2013

I've added some comments to explain each step of the function. Thanks to @travisbrown for his Scala adaptation. He also raises a good point: adding the ellipsis can push the resulting phrase over the maximum desired length.
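
One way to address the length overrun might be to measure each candidate with its ellipsis already applied, so the limit covers the final string rather than only the words. The following is a rough sketch, not the gist's code: `local:add-ellipsis` and `local:trim-phrase-within-length` are hypothetical names introduced here, and the hard-cut fallback for phrases where no word fits is an assumption of this sketch.

```xquery
xquery version "3.0";

(: Hypothetical helper, not in the original gist: applies the same ellipsis
   rules as local:trim-phrase-to-length to an already-snipped phrase. :)
declare function local:add-ellipsis($snipped as xs:string) as xs:string {
    if (ends-with($snipped, '.')) then
        $snipped
    else if (matches($snipped, '[,:;]$')) then
        concat(substring($snipped, 1, string-length($snipped) - 1), '...')
    else
        concat($snipped, '...')
};

(: Hypothetical variant: the length check is made on the phrase as it will
   finally be returned, ellipsis included. :)
declare function local:trim-phrase-within-length($phrase as xs:string, $length as xs:integer) as xs:string {
    let $normalized := normalize-space($phrase)
    return
        if (string-length($normalized) le $length) then
            (: unlike the original, this returns the whitespace-normalized phrase :)
            $normalized
        else
            let $words := tokenize($normalized, ' ')
            let $do-not-end-with-these-words := ('a', 'an', 'the', 'to', 'of', 'as', 'in', 'with', 'for')
            let $best-fit :=
                max(
                    for $n in (1 to count($words))
                    let $candidate := local:add-ellipsis(string-join(subsequence($words, 1, $n), ' '))
                    return
                        if (string-length($candidate) le $length
                            and not(lower-case($words[$n]) = $do-not-end-with-these-words)) then $n else ()
                )
            return
                if (empty($best-fit)) then
                    (: assumption: when no word arrangement fits, fall back to a hard character cut :)
                    substring(concat(substring($normalized, 1, max((0, $length - 3))), '...'), 1, $length)
                else
                    local:add-ellipsis(string-join(subsequence($words, 1, $best-fit), ' '))
};

local:trim-phrase-within-length('The quick brown fox jumps over the lazy dog', 20)
```

With a limit of 20 this returns `The quick brown...` (18 characters). The original function would keep "fox" as well, because "The quick brown fox" is 19 characters before the ellipsis is added, and would return 22 characters. Checking for an empty `$best-fit` also avoids handing an empty sequence to `subsequence()` when the first word alone is too long.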

joewiz commented Jul 6, 2013
