@robbypelssers
Last active March 23, 2017 11:39
Unicode Normalization
import java.text.Normalizer
/**
* Problem: characters with accents or other adornments can be encoded in several different ways in Unicode.
* From a user's point of view, notations that logically mean the same thing should be treated as equal
* by text search. It is therefore important to store text in a normalized Unicode form. The code below shows
* how to check whether text is normalized and how to normalize it.
**/
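object NormalizationTest {
  def main(args: Array[String]): Unit = {
    // The Ω is written as the escape for U+2126 (OHM SIGN) so the input is unambiguous;
    // it looks the same as U+03A9 (GREEK CAPITAL LETTER OMEGA), which is its NFC form.
    val text = "16-bit transceiver with direction pin, 30 \u2126 series termination resistors;"
    println(text)
    // false: OHM SIGN is not allowed in NFC
    println(Normalizer.isNormalized(text, Normalizer.Form.NFC))
    val normalizedText = Normalizer.normalize(text, Normalizer.Form.NFC)
    println(normalizedText)
    // true: the normalized text contains U+03A9 instead
    println(Normalizer.isNormalized(normalizedText, Normalizer.Form.NFC))
  }
}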
/**
* Output printed to console:
* -------------------------------
*
* 16-bit transceiver with direction pin, 30 Ω series termination resistors;
* false
* 16-bit transceiver with direction pin, 30 Ω series termination resistors;
* true
*/
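/**
* A minimal sketch of why this matters for text search, assuming the search compares strings for equality.
* The object NormalizedSearchSketch and the helpers codePoints and sameText are hypothetical names that are
* not part of the original example; only java.text.Normalizer comes from the JDK. It also prints code points,
* because the two Ω characters in the output above look identical on screen.
**/
object NormalizedSearchSketch {
  import java.text.Normalizer

  // Render each code point as U+XXXX so visually identical strings can be told apart.
  def codePoints(s: String): String =
    s.codePoints().toArray.map(cp => f"U+$cp%04X").mkString(" ")

  // Hypothetical helper: compare two strings after bringing both to NFC.
  def sameText(a: String, b: String): Boolean =
    Normalizer.normalize(a, Normalizer.Form.NFC) == Normalizer.normalize(b, Normalizer.Form.NFC)

  def main(args: Array[String]): Unit = {
    val composed = "caf\u00E9"    // accented e as one precomposed code point
    val decomposed = "cafe\u0301" // plain e followed by COMBINING ACUTE ACCENT
    println(composed == decomposed)         // false: different code point sequences
    println(sameText(composed, decomposed)) // true: equal once both are in NFC
    println(codePoints("\u2126"))                                            // U+2126
    println(codePoints(Normalizer.normalize("\u2126", Normalizer.Form.NFC))) // U+03A9
  }
}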