@JamoCA
Last active September 21, 2023 13:34
JUnidecode ColdFusion Demo - Convert Unicode strings to somewhat reasonable ASCII7-only strings, stripping diacritics and transliterating other characters.
<cfprocessingdirective pageEncoding="utf-8">
<cfsetting enablecfoutputonly="Yes">
<!---
BLOG: https://dev.to/gamesover/convert-unicode-strings-to-ascii-with-coldfusion-junidecode-lhf
--->
<cfscript>
function JUnidecode(inputString){
// Transliterate a Unicode string to 7-bit ASCII using the JUnidecode Java library.
var JUnidecodeLib = "";
var response = "";
var temp = {};
// Only normalize input that the UTF-8 encoder accepts.
temp.encoder = createObject("java", "java.nio.charset.Charset").forName("utf-8").newEncoder();
temp.isUTF = temp.encoder.canEncode(arguments.inputString);
if (temp.isUTF){
/* NFKC: Unicode Compatibility Decomposition, followed by Canonical Composition */
temp.normalizer = createObject( "java", "java.text.Normalizer" );
temp.normalizerForm = createObject( "java", "java.text.Normalizer$Form" );
arguments.inputString = temp.normalizer.normalize( javaCast( "string", arguments.inputString ), temp.normalizerForm.NFKC );
}
try {
JUnidecodeLib = createObject("java", "net.gcardone.junidecode.Junidecode");
response = JUnidecodeLib.unidecode( javaCast("string", arguments.inputString) );
} catch (any e) {
// Any failure here (most likely a missing JAR) is reported as "not installed".
response = "ERROR: JUnidecode is not installed";
}
// Remove any "[?]" placeholders from the result before returning it.
return trim(response.replaceAll("\[\?\]", ""));
}
function isDiff(compareArr, val, pos){
return (pos GT arrayLen(comparearr) OR comparearr[pos] neq val);
}
</cfscript>
<cfset TestStrings = [
"ℰ𝒳𝒜ℳ𝓟ℒℰ",
"ABC #chr(160)# Café “test”",
"北亰",
"Mr. まさゆき たけだ",
"Łukasiński",
"⠏⠗⠑⠍⠊⠑⠗",
"What about Ø, Ł or æøåá",
"ราชอาณาจักรไทย",
"Ελληνικά",
"Москвa",
"Հայաստան",
"čeština",
"®™™™©©©Ⓒ½⅓⅔¼¾⅕⅖⅗⅘⅙⅚⅐⅛⅜⅝⅞⅑⅒●⚫⬤",
"ÀÁÂÃÄÅÆÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæèéêëìíîïñòóôõöøùúûüý’“”–…",
"Häuser Bäume Höfe Gärten daß Ü ü ö ä Ä Ö ß "
]>
<cfset CFString = "cfscript">
<cfparam name="URL.testString" default="">
<cfif len(trim(URL.testString))>
<cfset TestStrings = listToArray(trim(URL.testString))>
</cfif>
<cfsetting enablecfoutputonly="no">
<!doctype html>
<html lang="en">
<head>
<title>JUnidecode ColdFusion Demo</title>
<style>
.diff {background-color:#ff0;}
fieldset:nth-child(even) {background-color:#ededed;}
</style>
</head>
<body>
<h1>JUnidecode ColdFusion Demo</h1>
<p>by <a href="https://about.me/jamesmoberg">James Moberg</a> / <a href="https://www.sunstarmedia.com/">SunStar Media</a> (February 6, 2019)</p>
<p>This demo shows how to use <a href="https://github.com/gcardone/junidecode">JUnidecode</a> with <a href="https://www.adobe.com/products/coldfusion-family.html">ColdFusion</a> to convert Unicode strings to somewhat reasonable ASCII7-only strings, stripping diacritics and transliterating other characters.</p>
<p>I've compared this Java library against <a href="https://www.bennadel.com/blog/1155-cleaning-high-ascii-values-for-web-safeness-in-coldfusion.htm">regex</a>, <a href="https://cflib.org/udf/deAccent">java.text.Normalizer</a>, <a href="https://gist.github.com/JamoCA/ec4617b066fc4bb601f620bc93bacb57">ICU4J Transliterate</a> (390 KB vs. 12 MB+) and <a href="https://www.codota.com/code/java/methods/org.apache.commons.lang3.StringUtils/stripAccents">Apache Commons Lang3 StringUtils.stripAccents()</a> (500 KB) and found that it generates more consistent results while safely converting more characters than the other solutions. I've also updated our <a href="https://gist.github.com/JamoCA/fee34a03bbe61a2f8e40">SanitizeFilename UDF</a> to use it.</p>
<p><b>Installation:</b> Download the latest <a href="https://github.com/gcardone/junidecode/releases">JUnidecode JAR</a>, place it in your Java class path and restart your ColdFusion server (or load it with JavaLoader).</p>
<p><b>Sample User-Defined Function (UDF):</b></p>
<cfoutput>
<textarea rows="7" cols="100" style="margin-left:25px;"><#CFString#>
function JUnidecode(inputString){
var JUnidecodeLib = createObject("java", "net.gcardone.junidecode.Junidecode");
var response = JUnidecodeLib.unidecode( javacast("string", arguments.inputString) );
return trim(replacenocase(Response, "[?]", "", "all"));
}
</#CFString#></textarea>
<p><b>Usage:</b></p>
<p style="margin-left:25px;">JUnidecode(<i>string</i>)</p>
<hr>
<h2>Form Test</h2>
<form action="" method="get">
<input type="text" name="teststring" value="" required placeholder="Enter test string"> <button type="submit">Test</button><cfif len(trim(URL.TestString))> <a href="?">Reset</a></CFIF>
</form>
<h2>Test Results</h2>
<cfloop from="1" to="#ArrayLen(TestStrings)#" index="r">
<cfset TestString = TestStrings[r]>
<cfset TestResult = JUnidecode(TestString)>
<cfset letters = []>
<fieldset>
<legend>#r#. #TestString#</legend>
<b>Result:</b> #TestResult#
<table border="1" cellspacing="0" cellpadding="0">
<tr valign="top">
<th>Original</th><cfloop from="1" to="#len(TestString)#" index="i">
<cfset Letter = mid(TestString, i, 1)>
<cfset arrayAppend(letters, Letter)><td><tt>#Letter#</tt><br><tt>#asc(Letter)#</tt></td></cfloop>
</tr>
<tr valign="top">
<th>JUnidecode</th><cfloop from="1" to="#len(TestResult)#" index="i">
<cfset Letter = mid(TestResult, i, 1)>
<td<CFIF isDiff(Letters, Letter, i)> class="diff"</cfif>><tt>#Letter#</tt><br><tt>#asc(Letter)#</tt></td></cfloop>
</tr>
</table>
</fieldset>
</cfloop>
</cfoutput>
</body>
</html>
@knubew

knubew commented Mar 1, 2019

Hi,

I decoded a Euro symbol (€) and the result was "EU"?!
I expected "EUR", because that is the official currency code for €.
Is this a bug or the intended behaviour?

@JamoCA
Author

JamoCA commented Apr 1, 2019

@knubew I don't have any control over the conversion choices. I was recently testing some characters, identified some missing conversions, reported them on the project's GitHub page and the author made some updates to the library.
gcardone/junidecode@4c448dc

If you think it's a bug, report it as a new issue here and provide a helpful link to the official usage rules regarding the currency symbol:
https://github.com/gcardone/junidecode/issues

@briannaess

I know this is an older code demo, but I'd really like to take advantage of this functionality. My problem is that I can't figure out how to get Lucee to acknowledge junidecode. I'm on Lucee 5.4.3.2 deployed via a .war on an OpenShift Tomcat 8 app. I've tried putting the extracted junidecode-0.4.1 folder (and all its contents) into WEB-INF/lucee/lib and WEB-INF/lucee-server/lib, but neither did the trick. I still received the catchall "ERROR: JUnidecode is not installed". Thanks!

@JamoCA
Author

JamoCA commented Sep 20, 2023

@briannaess You may be able to add it to an accessible directory and then define it using this.javaSettings in application.cfc. (I personally prefer to copy the file to the global Java path so that all apps can use it without it having to be explicitly defined in application.cfc.)
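
A minimal sketch of the this.javaSettings approach, assuming the JUnidecode JAR file itself (not the extracted release folder) has been copied into a "lib" folder next to Application.cfc. The application name and folder location are hypothetical; adjust them to your setup.

```cfml
// Application.cfc
component {
    this.name = "junidecodeDemo"; // hypothetical application name

    // Assumption: the JUnidecode JAR sits in a "lib" folder beside this file.
    this.javaSettings = {
        loadPaths: [ getDirectoryFromPath(getCurrentTemplatePath()) & "lib" ],
        loadColdFusionClassPath: true,
        reloadOnChange: false
    };
}
```

With this in place, createObject("java", "net.gcardone.junidecode.Junidecode") should resolve for templates running under that application without touching the server-wide class path.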

This article describes some alternative ways to load the JAR (it is by the author of spreadsheet-cfml, my preferred way to work with Excel/CSV data):
https://blog.simplicityweb.co.uk/121/loading-java-libraries-dynamically-in-lucee-without-javaloader

It appears Lucee added path/context support to createObject.
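
A rough sketch of what that could look like on Lucee, passing the JAR location as the third argument to createObject. The path and JAR file name are assumptions; point it at wherever the JAR actually lives. This form is Lucee-specific and is not expected to work on Adobe ColdFusion.

```cfml
<cfscript>
// Assumption: the JAR file itself (not the extracted release folder) is at this hypothetical path.
jarPath = expandPath("./lib/junidecode-0.4.1.jar");

// Lucee only: the third argument tells createObject where to find the class.
JUnidecodeLib = createObject("java", "net.gcardone.junidecode.Junidecode", jarPath);

// chr(233) is "é"; building the string this way avoids template-encoding issues.
writeOutput(JUnidecodeLib.unidecode(javaCast("string", "Caf" & chr(233))));
</cfscript>
```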

If you want cross-platform support and don't use application.cfc, using JavaLoader is the way to go.
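
For reference, a minimal JavaLoader sketch, assuming JavaLoader is installed so that the component path "javaloader.JavaLoader" resolves and that the JAR sits at the hypothetical path shown:

```cfml
<cfscript>
// Assumptions: JavaLoader is available as "javaloader.JavaLoader",
// and the JAR lives at this hypothetical path.
jarPaths = [ expandPath("./lib/junidecode-0.4.1.jar") ];

// In real code, cache the loader (e.g. in the server scope) instead of recreating it per request.
loader = createObject("component", "javaloader.JavaLoader").init(jarPaths);

// unidecode() is static, so it can be called on the class reference without init().
JUnidecodeLib = loader.create("net.gcardone.junidecode.Junidecode");
writeOutput(JUnidecodeLib.unidecode(javaCast("string", "Caf" & chr(233))));
</cfscript>
```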

@briannaess

@JamoCA : Thank you so much for your reply and for providing those links. I've got it working now!