@mklement0
Last active December 9, 2023 23:02
PowerShell function that retrieves information about Unicode characters and categories.
<#
Prerequisites: PowerShell v5+ (the code uses the static ::new() method and the PackageManagement cmdlets)
License: MIT
Author: Michael Klement <mklement0@gmail.com>
DOWNLOAD and DEFINITION OF THE FUNCTION:
irm https://gist.github.com/mklement0/25694cbb8e10a7044b36a310e1243959/raw/Get-CharInfo.ps1 | iex
The above directly defines the function below in your session and offers guidance for making it available in future
sessions too.
DOWNLOAD ONLY:
irm https://gist.github.com/mklement0/25694cbb8e10a7044b36a310e1243959/raw > Get-CharInfo.ps1
The above downloads to the specified file, which you then need to dot-source to make the function available
in the current session:
. ./Get-CharInfo.ps1
To learn what the function does:
* see the next comment block
* or, once downloaded and defined, invoke the function with -? or pass its name to Get-Help.
To define an ALIAS for the function, (also) add something like the following to your $PROFILE:
Set-Alias ci Get-CharInfo
#>
function Get-CharInfo {
<#
.SYNOPSIS
Gets Unicode character and category information.
.DESCRIPTION
Outputs detailed information about Unicode characters in the BMP
(Basic Multilingual Plane) and, optionally, about Unicode categories.
-EnumerateCategory <category> (-ca) lists all characters in the specified
Unicode category.
-CategoryInfo <category> (-ci) outputs information about a given category.
-CategoryInfo * lists all categories.
-AsMarkdown provides the requested information (in part) as a Markdown snippet
instead.
-Online opens information page(s) for the requested information in the default
web browser.
-CopyToClipboard (-cp) copies the requested character(s) only to the
clipboard (useful if you request the character by code point).
Combining it with -AsMarkdown copies the Markdown snippet, and with -Online
copies the URL (without opening it in the browser).
.NOTES
Due to the .NET [char] (System.Char) being limited to *16-bit Unicode code
units*, only BMP characters are supported; the constituent code points
of surrogate *pairs* that together form a non-BMP character are only supported
*individually*.
Unless you pass -NoName, this command downloads and installs the following NuGet
package on demand in order to be able to retrieve the names of Unicode characters:
https://www.nuget.org/packages/UnicodeInformation/
On first use in the session, even with the NuGet package already installed,
execution takes noticeably longer, because the local package must be located
and its assembly loaded.
The website used with the -Online switch is:
http://www.fileformat.info/info/unicode
.EXAMPLE
Get-CharInfo hü
Outputs custom objects with detailed information about characters 'h' and 'ü'.
Note how the input string was automatically broken down into characters.
.EXAMPLE
0x2d, 0x2013 | Get-CharInfo
Outputs custom objects with detailed information about characters '-' (hyphen)
and '–' (en-dash), specified by their code points.
.EXAMPLE
Get-CharInfo € -Online
Opens a web page with information about Unicode character '€' in the default
browser.
.EXAMPLE
Get-CharInfo 0x40 -AsMarkdown
Outputs a Markdown-formatted string with information about the "@" character.
Add -cp (short for -CopyToClipboard) to copy the Markdown string to the
clipboard instead of outputting it.
.EXAMPLE
Get-CharInfo 0x212a -cp
Copies the Kelvin sign (U+212A) to the clipboard, via the -cp alias of the
-CopyToClipboard parameter.
.EXAMPLE
Get-CharInfo -EnumerateCategory Nd
Outputs information about all characters in the Nd (DecimalDigitNumber)
category.
.EXAMPLE
Get-CharInfo -CategoryInfo *
Outputs information about all Unicode categories.
.EXAMPLE
Get-CharInfo -CategoryInfo Pc -Online
Opens an information page about the Pc (ConnectorPunctuation) Unicode
category.
#>
[CmdletBinding(PositionalBinding = $false, DefaultParameterSetName = 'CharInfo')]
[OutputType([pscustomobject], ParameterSetName = 'CharInfo')]
[OutputType([string], ParameterSetName = 'Markdown')]
param(
[Parameter(ParameterSetName = 'CharInfo', Mandatory, ValueFromPipeline, Position = 0)]
[Parameter(ParameterSetName = 'Markdown', ValueFromPipeline, Position = 0)]
[Parameter(ParameterSetName = 'Online', ValueFromPipeline, Position = 0)]
[char[]] $Character
,
[Parameter(ParameterSetName = 'Markdown', Mandatory)]
[switch] $AsMarkdown
,
[Parameter(ParameterSetName = 'Online', Mandatory)]
[switch] $Online
,
[Parameter(ParameterSetName = 'CharInfo')]
[Parameter(ParameterSetName = 'Markdown')]
[Parameter(ParameterSetName = 'EnumerateCategory')]
[switch] $NoName
,
[Parameter(ParameterSetName = 'EnumerateCategory', Mandatory)]
[Parameter(ParameterSetName = 'Markdown', ValueFromPipeline)]
[Parameter(ParameterSetName = 'Online', ValueFromPipeline)]
[ValidateSet('UppercaseLetter', 'LowercaseLetter', 'TitlecaseLetter', 'ModifierLetter', 'OtherLetter', 'NonSpacingMark', 'SpacingCombiningMark', 'EnclosingMark', 'DecimalDigitNumber', 'LetterNumber', 'OtherNumber', 'SpaceSeparator', 'LineSeparator', 'ParagraphSeparator', 'Control', 'Format', 'Surrogate', 'PrivateUse', 'ConnectorPunctuation', 'DashPunctuation', 'OpenPunctuation', 'ClosePunctuation', 'InitialQuotePunctuation', 'FinalQuotePunctuation', 'OtherPunctuation', 'MathSymbol', 'CurrencySymbol', 'ModifierSymbol', 'OtherSymbol', 'OtherNotAssigned', 'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Mn', 'Mc', 'Me', 'Nd', 'Nl', 'No', 'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Cs', 'Co', 'Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po', 'Sm', 'Sc', 'Sk', 'So', 'Cn')]
[Alias('ca')]
[string] $EnumerateCategory
,
[Parameter(ParameterSetName = 'CategoryInfo', Mandatory)]
[Parameter(ParameterSetName = 'Markdown', ValueFromPipeline)]
[Parameter(ParameterSetName = 'Online', ValueFromPipeline)]
[ValidateSet('*', 'All', 'UppercaseLetter', 'LowercaseLetter', 'TitlecaseLetter', 'ModifierLetter', 'OtherLetter', 'NonSpacingMark', 'SpacingCombiningMark', 'EnclosingMark', 'DecimalDigitNumber', 'LetterNumber', 'OtherNumber', 'SpaceSeparator', 'LineSeparator', 'ParagraphSeparator', 'Control', 'Format', 'Surrogate', 'PrivateUse', 'ConnectorPunctuation', 'DashPunctuation', 'OpenPunctuation', 'ClosePunctuation', 'InitialQuotePunctuation', 'FinalQuotePunctuation', 'OtherPunctuation', 'MathSymbol', 'CurrencySymbol', 'ModifierSymbol', 'OtherSymbol', 'OtherNotAssigned', 'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Mn', 'Mc', 'Me', 'Nd', 'Nl', 'No', 'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Cs', 'Co', 'Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po', 'Sm', 'Sc', 'Sk', 'So', 'Cn')]
[Alias('ci')]
[string] $CategoryInfo
,
[Alias('cp')]
[switch] $CopyToClipboard
)
begin {
$ErrorActionPreference = 'Stop'; Set-StrictMode -Version 1
$urls = [Collections.Generic.List[string]]::new()
$stringsToCopy = [System.Collections.Generic.List[string]]::new()
# The list of Unicode categories, keyed by their shorthand aliases.
$ohtCategories = [ordered] @{
'Lu' = [System.Globalization.UnicodeCategory]::UppercaseLetter
'Ll' = [System.Globalization.UnicodeCategory]::LowercaseLetter
'Lt' = [System.Globalization.UnicodeCategory]::TitlecaseLetter
'Lm' = [System.Globalization.UnicodeCategory]::ModifierLetter
'Lo' = [System.Globalization.UnicodeCategory]::OtherLetter
'Mn' = [System.Globalization.UnicodeCategory]::NonSpacingMark
'Mc' = [System.Globalization.UnicodeCategory]::SpacingCombiningMark
'Me' = [System.Globalization.UnicodeCategory]::EnclosingMark
'Nd' = [System.Globalization.UnicodeCategory]::DecimalDigitNumber
'Nl' = [System.Globalization.UnicodeCategory]::LetterNumber
'No' = [System.Globalization.UnicodeCategory]::OtherNumber
'Zs' = [System.Globalization.UnicodeCategory]::SpaceSeparator
'Zl' = [System.Globalization.UnicodeCategory]::LineSeparator
'Zp' = [System.Globalization.UnicodeCategory]::ParagraphSeparator
'Cc' = [System.Globalization.UnicodeCategory]::Control
'Cf' = [System.Globalization.UnicodeCategory]::Format
'Cs' = [System.Globalization.UnicodeCategory]::Surrogate
'Co' = [System.Globalization.UnicodeCategory]::PrivateUse
'Pc' = [System.Globalization.UnicodeCategory]::ConnectorPunctuation
'Pd' = [System.Globalization.UnicodeCategory]::DashPunctuation
'Ps' = [System.Globalization.UnicodeCategory]::OpenPunctuation
'Pe' = [System.Globalization.UnicodeCategory]::ClosePunctuation
'Pi' = [System.Globalization.UnicodeCategory]::InitialQuotePunctuation
'Pf' = [System.Globalization.UnicodeCategory]::FinalQuotePunctuation
'Po' = [System.Globalization.UnicodeCategory]::OtherPunctuation
'Sm' = [System.Globalization.UnicodeCategory]::MathSymbol
'Sc' = [System.Globalization.UnicodeCategory]::CurrencySymbol
'Sk' = [System.Globalization.UnicodeCategory]::ModifierSymbol
'So' = [System.Globalization.UnicodeCategory]::OtherSymbol
'Cn' = [System.Globalization.UnicodeCategory]::OtherNotAssigned
}
# Function for getting the online URL for a given character.
function get-CharUrl ([char] $char) {
'http://www.fileformat.info/info/unicode/char/{0}' -f ([uint16] $char).ToString('x')
}
# Function for getting the online URL for a given Unicode category.
function get-CategoryUrl ([System.Globalization.UnicodeCategory] $category) {
'http://www.fileformat.info/info/unicode/category/{0}' -f $ohtCategories.GetEnumerator().Where( { $_.Value -eq $category }).Key
}
# Function for getting a Markdown representation for a given character.
function get-CharMarkdown ([char] $char) {
$codePoint = [uint16] $char
if ($NoName) {
'`{0}` ([`U+{1}`]({2}))' -f $char, $codePoint.ToString('X4'), (get-CharUrl $char)
}
else {
'`{0}` ({1}, [`U+{2}`]({3}))' -f $char, (get-CharName $char), $codePoint.ToString('X4'), (get-CharUrl $char)
}
}
# Function for getting a Markdown representation for a given Unicode category.
function get-CategoryMarkdown ([System.Globalization.UnicodeCategory] $category) {
'[`{0}` (`{1}`)]({2})' -f $category, $ohtCategories.GetEnumerator().Where( { $_.Value -eq $category }).Key, (get-CategoryUrl $category)
}
# Function for retrieving a Unicode character's official name.
# Uses the following NuGet package, which is installed on demand
# (in the scope of the current user):
# https://www.nuget.org/packages/UnicodeInformation/
function get-CharName ([char] $char) {
if (-not ('System.Unicode.UnicodeInfo' -as [type])) {
# Install the package on demand and/or load the assembly.
$nugetPkgName = 'UnicodeInformation'
# !! Even with an already installed package, a Get-Package call takes a long time to complete if
# !! the underlying modules haven't been imported yet; therefore, we look for the package
# !! in the standard current-user package-installation location and fall back onto Get-Package, if needed.
# !! Note that we look for the highest version number, should multiple versions be installed.
$pkgLocalPath = try {
(
Get-Item -EA Ignore -Path ("$HOME/.local/share/PackageManagement/NuGet/Packages/$nugetPkgName*", "$env:LOCALAPPDATA\PackageManagement\NuGet\Packages\$nugetPkgName*")[$env:OS -eq 'Windows_NT'] |
Sort-Object { [version] ($_.Name -replace ('^{0}\.' -f [regex]::Escape($nugetPkgName))) }
)[-1]
}
catch { }
if (-not $pkgLocalPath) {
Write-Verbose -vb "Looking for local '$nugetPkgName' package..."
$pkg = Get-Package -EA Ignore $nugetPkgName
if (-not $pkg) {
# Download and install the package.
Write-Verbose -vb "Installing NuGet package '$nugetPkgName' for character-name support..."
$null = Install-Package -Scope CurrentUser -Source nuget.org $nugetPkgName
$pkg = Get-Package $nugetPkgName # Get the local package to determine its location.
Write-Verbose -vb "Package '$nugetPkgName' installed to: $($pkg.Source)"
}
$pkgLocalPath = Split-Path -Parent $pkg.Source
}
# Add the relevant assembly (*.dll) to the session with Add-Type -Path
# (-Path rather than -LiteralPath, because the wildcard must be expanded).
# For predictability, we target the v2.0 .NET *Standard* DLL.
# Note: The command would load multiple *.dll files, but not recursively.
Add-Type -Path $pkgLocalPath/lib/netstandard2.0/*.dll
}
# The type of interest is available now, use it.
$charInfo = [System.Unicode.UnicodeInfo]::GetCharInfo([int] $char)
# !! Characters 0..31 have an empty .Name property, but the
# !! .NameAliases property contains values; we simply use the first one.
if ($charInfo.Name) { $charInfo.Name } else { $charInfo.NameAliases[0].Name }
}
# Category info requested.
if ($CategoryInfo) {
# Determine what categories to target.
$ohtTargetCategories = if ($CategoryInfo -in '*', 'All') {
$ohtCategories
}
else {
if ($CategoryInfo.Length -eq 2) {
$catShorthand = $CategoryInfo
$catEnum = $ohtCategories[$CategoryInfo]
}
else {
$catEnum = [System.Globalization.UnicodeCategory] $CategoryInfo
$catShorthand = $ohtCategories.GetEnumerator().Where( { $_.Value -eq $catEnum }).Key
}
[ordered] @{
$catShorthand = $catEnum
}
}
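# Process categories.
if ($Online) {
# Collect the URLs; the `end` block opens them or, with -CopyToClipboard, copies them.
foreach ($category in $ohtTargetCategories.Values) {
$url = get-CategoryUrl $category
if ($CopyToClipboard) { $stringsToCopy.Add($url) } else { $urls.Add($url) }
}
}
else {
$outObj = if ($AsMarkdown) {
$ohtTargetCategories.Values.ForEach( { get-CategoryMarkdown $_ })
}
else {
$ohtTargetCategories.GetEnumerator().ForEach( {
[pscustomobject] @{
Shorthand = $_.Key
Name = [string] $_.Value
InfoUrl = get-CategoryUrl $_.Value
}
})
}
if ($CopyToClipboard) {
$stringsToCopy.Add(($outObj | Out-String)) # will be processed in `end` block
}
else {
$outObj
}
}
# We're done. Use `return`, not `exit`, which would terminate the caller's session;
# without -Character input, the `process` block has nothing to do.
return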
}
if ($EnumerateCategory) {
if ($EnumerateCategory.Length -eq 2) {
# A 2-character shorthand such as 'Lu' was passed to -EnumerateCategory; translate
# it to its .NET enum value, such as 'UppercaseLetter'.
$EnumerateCategory = $ohtCategories[$EnumerateCategory]
}
if ($Online) {
$Character = [char] 0x0 # dummy character so that the foreach ($char in $Character) loop is entered once.
}
else {
# Get all characters in the target category (brute-force, slow).
$Character = ([char[]] (0..0xffff)).Where( { [char]::GetUnicodeCategory($_) -eq $EnumerateCategory })
}
}
}
process {
foreach ($char in $Character) {
$codePoint = [uint16] $char
if ($Online) {
# Note: With -EnumerateCategory <category> -Online we don't treat the characters in the category individually,
# as that would open way too many pages; instead, we go to a page that describes
# the category itself, where links to the individual characters are offered.
$url = if ($EnumerateCategory) { get-CategoryUrl $EnumerateCategory } else { get-CharUrl $char }
if ($CopyToClipboard) {
$stringsToCopy.Add($url) # will be processed in `end` block
}
else {
$urls.Add($url) # will be processed in `end` block
}
}
elseif ($AsMarkdown) {
$markdown = get-CharMarkdown $char
if ($CopyToClipboard) {
$stringsToCopy.Add($markdown) # will be processed in `end` block
}
else {
$markdown
}
}
else {
# output an info object or copy just the character to the clipboard.
if ($CopyToClipboard) {
# Note: Copy just the character itself; useful for getting the char.
# from its code point (e.g., Get-CharInfo 0x41 -cp)
$stringsToCopy.Add($char) # will be processed in `end` block
}
else {
# Construct and output a [pscustomobject] containing the character and metainformation.
$utf8Bytes = [Text.Encoding]::Utf8.GetBytes($char)
$charName = if (-not $NoName) { get-CharName $char }
[pscustomobject]@{
Char = $char
HexCodePoint = ('0x{0}' -f $codePoint.ToString("x1"))
Utf8ByteString = $utf8Bytes.ForEach( { '0x{0}' -f $_.ToString("x2") }) -join ' '
Name = $charName
Category = [char]::GetUnicodeCategory($char)
CategoryShorthand = $ohtCategories.GetEnumerator().Where( { $_.Value -eq [char]::GetUnicodeCategory($char) }).Key
CodePoint = $codePoint
Utf8Bytes = $utf8Bytes
InfoUrl = get-CharUrl $char
}
}
}
}
}
end {
if ($CopyToClipboard) {
$stringsToCopy | Set-Clipboard
}
elseif ($Online) {
foreach ($url in $urls) { Start-Process $url }
}
}
}
# --------------------------------
# GENERIC INSTALLATION HELPER CODE
# --------------------------------
# Provides guidance for making the function persistently available when
# this script is either directly invoked from the originating Gist or
# dot-sourced after download.
# IMPORTANT:
# * DO NOT USE `exit` in the code below, because it would exit
# the calling shell when Invoke-Expression is used to directly
# execute this script's content from GitHub.
# * Because the typical invocation is DOT-SOURCED (via Invoke-Expression),
# do not define variables or alter the session state via Set-StrictMode, ...
# *except in child scopes*, via & { ... }
if ($MyInvocation.Line -eq '') {
# Most likely, this code is being executed via Invoke-Expression directly
# from gist.github.com
# To simulate for testing with a local script, use the following:
# Note: Be sure to use a path and to use "/" as the separator.
# iex (Get-Content -Raw ./script.ps1)
# Derive the function name from the invocation command, via the enclosing
# script name presumed to be contained in the URL.
# NOTE: Unfortunately, when invoked via Invoke-Expression, $MyInvocation.MyCommand.ScriptBlock
# with the actual script content is NOT available, so we cannot extract
# the function name this way.
& {
param($invocationCmdLine)
# Try to extract the function name from the URL.
$funcName = $invocationCmdLine -replace '^.+/(.+?)(?:\.ps1).*$', '$1'
if ($funcName -eq $invocationCmdLine) {
# Function name could not be extracted, just provide a generic message.
# Note: Hypothetically, we could try to extract the Gist ID from the URL
# and use the REST API to determine the first filename.
Write-Verbose -Verbose "Function is now defined in this session."
}
else {
# Indicate that the function is now defined and also show how to
# add it to the $PROFILE or convert it to a script file.
Write-Verbose -Verbose @"
Function `"$funcName`" is now defined in this session.
* If you want to add this function to your `$PROFILE, run the following:
"``nfunction $funcName {``n`${function:$funcName}``n}" | Add-Content `$PROFILE
* If you want to convert this function into a script file that you can invoke
directly, run:
"`${function:$funcName}" | Set-Content $funcName.ps1 -Encoding $('utf8' + ('', 'bom')[[bool] (Get-Variable -ErrorAction Ignore IsCoreCLR -ValueOnly)])
"@
}
} $MyInvocation.MyCommand.Definition # Pass the original invocation command line to the script block.
}
else {
# Invocation presumably as a local file after manual download,
# either dot-sourced (as it should be) or mistakenly directly.
& {
param($originalInvocation)
# Parse this file to reliably extract the name of the embedded function,
# irrespective of the name of the script file.
$ast = $originalInvocation.MyCommand.ScriptBlock.Ast
$funcName = $ast.Find( { $args[0] -is [System.Management.Automation.Language.FunctionDefinitionAst] }, $false).Name
if ($originalInvocation.InvocationName -eq '.') {
# Being dot-sourced as a file.
# Provide a hint that the function is now loaded and provide
# guidance for how to add it to the $PROFILE.
Write-Verbose -Verbose @"
Function `"$funcName`" is now defined in this session.
If you want to add this function to your `$PROFILE, run the following:
"``nfunction $funcName {``n`${function:$funcName}``n}" | Add-Content `$PROFILE
"@
}
else {
# Mistakenly directly invoked.
# Issue a warning that the function definition didn't take effect and
# provide guidance for reinvocation and adding to the $PROFILE.
Write-Warning @"
This script contains a definition for function "$funcName", but this definition
only takes effect if you dot-source this script.
To define this function for the current session, run:
. "$($originalInvocation.MyCommand.Path)"
"@
}
} $MyInvocation # Pass the original invocation info to the helper script block.
}