Last active
December 9, 2023 23:02
PowerShell function that retrieves information about Unicode characters and categories.
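As a rough illustration of what a Unicode category lookup involves, the following minimal sketch uses only built-in .NET APIs. The helper name `Get-SimpleCharCategory` is hypothetical and is not part of this gist; the actual function below reports considerably more (names, UTF-8 bytes, URLs).

```powershell
# Hypothetical helper (not part of the gist): reports the Unicode general
# category of each UTF-16 code unit in a string via [char]::GetUnicodeCategory().
function Get-SimpleCharCategory {
  param([Parameter(Mandatory)] [string] $Text)
  foreach ($char in $Text.ToCharArray()) {
    [pscustomobject] @{
      Char      = $char
      CodePoint = 'U+{0:X4}' -f [int] $char            # e.g. U+0068
      Category  = [char]::GetUnicodeCategory($char)    # a [System.Globalization.UnicodeCategory] value
    }
  }
}

# 'h', U+00FC (u with diaeresis) and U+20AC (euro sign), built from code points
# so that the sample does not depend on the file's encoding.
Get-SimpleCharCategory "h$([char] 0xfc)$([char] 0x20ac)"
```

The first two characters should be reported as `LowercaseLetter` and the euro sign as `CurrencySymbol`.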
<#
Prerequisites: PowerShell v3+
License: MIT
Author: Michael Klement <mklement0@gmail.com>

DOWNLOAD and DEFINITION OF THE FUNCTION:

  irm https://gist.github.com/mklement0/25694cbb8e10a7044b36a310e1243959/raw/Get-CharInfo.ps1 | iex

The above directly defines the function below in your session and offers guidance for making it available in future
sessions too.

DOWNLOAD ONLY:

  irm https://gist.github.com/mklement0/25694cbb8e10a7044b36a310e1243959/raw > Get-CharInfo.ps1

The above downloads to the specified file, which you then need to dot-source to make the function available
in the current session:

  . ./Get-CharInfo.ps1

To learn what the function does:
  * see the next comment block
  * or, once downloaded and defined, invoke the function with -? or pass its name to Get-Help.

To define an ALIAS for the function, (also) add something like the following to your $PROFILE:

  Set-Alias ci Get-CharInfo
#>
function Get-CharInfo {
  <#
.SYNOPSIS
Gets Unicode character and category information.

.DESCRIPTION
Outputs detailed information about Unicode characters in the BMP
(Basic Multilingual Plane) and optionally Unicode categories.

-EnumerateCategory <category> (-ca) lists all characters in the specified
Unicode category.

-CategoryInfo <category> (-ci) outputs information about a given category.
-CategoryInfo * lists all categories.

-AsMarkdown provides the requested information (in part) as a Markdown snippet
instead.

-Online opens information page(s) for the requested information in the default
web browser.

-CopyToClipboard (-cp) copies the requested character(s) only to the
clipboard (useful if you request the character by code point).
Combining it with -AsMarkdown copies the Markdown snippet, and with -Online
copies the URL (without opening it in the browser).

.NOTES
Due to the .NET [char] (System.Char) being limited to *16-bit Unicode code
units*, only BMP characters are supported; the constituent code points
of surrogate *pairs* that together form a non-BMP character are only supported
*individually*.

Unless you pass -NoName, this command downloads and installs the following NuGet
package on demand in order to be able to retrieve the names of Unicode characters:
https://www.nuget.org/packages/UnicodeInformation/
On first use in the session, even with the NuGet package already installed,
execution takes noticeably longer, because the local package must be located
and its assembly loaded.

The website used with the -Online switch is:
http://www.fileformat.info/info/unicode

.EXAMPLE
Get-CharInfo hü

Outputs custom objects with detailed information about the characters 'h' and 'ü'.
Note how the input string was automatically broken down into characters.

.EXAMPLE
0x2d, 0x2013 | Get-CharInfo

Outputs custom objects with detailed information about the characters '-' (hyphen)
and '–' (en-dash), specified by their code points.

.EXAMPLE
Get-CharInfo € -Online

Opens a web page with information about Unicode character '€' in the default
browser.

.EXAMPLE
Get-CharInfo 0x40 -AsMarkdown

Outputs a Markdown-formatted string with information about the "@" character.
Add -cp (short for -CopyToClipboard) to copy the Markdown string to the
clipboard instead of outputting it.

.EXAMPLE
Get-CharInfo 0x212a -cp

Copies the Kelvin sign (U+212A) to the clipboard, via the -cp alias of the
-CopyToClipboard parameter.

.EXAMPLE
Get-CharInfo -EnumerateCategory Nd

Outputs information about all characters in the Nd (DecimalDigitNumber)
category.

.EXAMPLE
Get-CharInfo -CategoryInfo *

Outputs information about all Unicode categories.

.EXAMPLE
Get-CharInfo -CategoryInfo Pc -Online

Opens an information page about the Pc (ConnectorPunctuation) Unicode
category.
#>
  [CmdletBinding(PositionalBinding = $false, DefaultParameterSetName = 'CharInfo')]
  [OutputType([pscustomobject], ParameterSetName = 'CharInfo')]
  [OutputType([string], ParameterSetName = 'Markdown')]
  param(
    [Parameter(ParameterSetName = 'CharInfo', Mandatory, ValueFromPipeline, Position = 0)]
    [Parameter(ParameterSetName = 'Markdown', ValueFromPipeline, Position = 0)]
    [Parameter(ParameterSetName = 'Online', ValueFromPipeline, Position = 0)]
    [char[]] $Character
    ,
    [Parameter(ParameterSetName = 'Markdown', Mandatory)]
    [switch] $AsMarkdown
    ,
    [Parameter(ParameterSetName = 'Online', Mandatory)]
    [switch] $Online
    ,
    [Parameter(ParameterSetName = 'CharInfo')]
    [Parameter(ParameterSetName = 'Markdown')]
    [Parameter(ParameterSetName = 'EnumerateCategory')]
    [switch] $NoName
    ,
    [Parameter(ParameterSetName = 'EnumerateCategory', Mandatory)]
    [Parameter(ParameterSetName = 'Markdown', ValueFromPipeline)]
    [Parameter(ParameterSetName = 'Online', ValueFromPipeline)]
    [ValidateSet('UppercaseLetter', 'LowercaseLetter', 'TitlecaseLetter', 'ModifierLetter', 'OtherLetter', 'NonSpacingMark', 'SpacingCombiningMark', 'EnclosingMark', 'DecimalDigitNumber', 'LetterNumber', 'OtherNumber', 'SpaceSeparator', 'LineSeparator', 'ParagraphSeparator', 'Control', 'Format', 'Surrogate', 'PrivateUse', 'ConnectorPunctuation', 'DashPunctuation', 'OpenPunctuation', 'ClosePunctuation', 'InitialQuotePunctuation', 'FinalQuotePunctuation', 'OtherPunctuation', 'MathSymbol', 'CurrencySymbol', 'ModifierSymbol', 'OtherSymbol', 'OtherNotAssigned', 'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Mn', 'Mc', 'Me', 'Nd', 'Nl', 'No', 'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Cs', 'Co', 'Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po', 'Sm', 'Sc', 'Sk', 'So', 'Cn')]
    [Alias('ca')]
    [string] $EnumerateCategory
    ,
    [Parameter(ParameterSetName = 'CategoryInfo', Mandatory)]
    [Parameter(ParameterSetName = 'Markdown', ValueFromPipeline)]
    [Parameter(ParameterSetName = 'Online', ValueFromPipeline)]
    [ValidateSet('*', 'All', 'UppercaseLetter', 'LowercaseLetter', 'TitlecaseLetter', 'ModifierLetter', 'OtherLetter', 'NonSpacingMark', 'SpacingCombiningMark', 'EnclosingMark', 'DecimalDigitNumber', 'LetterNumber', 'OtherNumber', 'SpaceSeparator', 'LineSeparator', 'ParagraphSeparator', 'Control', 'Format', 'Surrogate', 'PrivateUse', 'ConnectorPunctuation', 'DashPunctuation', 'OpenPunctuation', 'ClosePunctuation', 'InitialQuotePunctuation', 'FinalQuotePunctuation', 'OtherPunctuation', 'MathSymbol', 'CurrencySymbol', 'ModifierSymbol', 'OtherSymbol', 'OtherNotAssigned', 'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Mn', 'Mc', 'Me', 'Nd', 'Nl', 'No', 'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Cs', 'Co', 'Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po', 'Sm', 'Sc', 'Sk', 'So', 'Cn')]
    [Alias('ci')]
    [string] $CategoryInfo
    ,
    [Alias('cp')]
    [switch] $CopyToClipboard
  )
  begin {
    $ErrorActionPreference = 'Stop'; Set-StrictMode -Version 1

    $urls = [Collections.Generic.List[string]]::new()
    $stringsToCopy = [System.Collections.Generic.List[string]]::new()

    # The list of Unicode categories, keyed by their shorthand aliases.
    $ohtCategories = [ordered] @{
      'Lu' = [System.Globalization.UnicodeCategory]::UppercaseLetter
      'Ll' = [System.Globalization.UnicodeCategory]::LowercaseLetter
      'Lt' = [System.Globalization.UnicodeCategory]::TitlecaseLetter
      'Lm' = [System.Globalization.UnicodeCategory]::ModifierLetter
      'Lo' = [System.Globalization.UnicodeCategory]::OtherLetter
      'Mn' = [System.Globalization.UnicodeCategory]::NonSpacingMark
      'Mc' = [System.Globalization.UnicodeCategory]::SpacingCombiningMark
      'Me' = [System.Globalization.UnicodeCategory]::EnclosingMark
      'Nd' = [System.Globalization.UnicodeCategory]::DecimalDigitNumber
      'Nl' = [System.Globalization.UnicodeCategory]::LetterNumber
      'No' = [System.Globalization.UnicodeCategory]::OtherNumber
      'Zs' = [System.Globalization.UnicodeCategory]::SpaceSeparator
      'Zl' = [System.Globalization.UnicodeCategory]::LineSeparator
      'Zp' = [System.Globalization.UnicodeCategory]::ParagraphSeparator
      'Cc' = [System.Globalization.UnicodeCategory]::Control
      'Cf' = [System.Globalization.UnicodeCategory]::Format
      'Cs' = [System.Globalization.UnicodeCategory]::Surrogate
      'Co' = [System.Globalization.UnicodeCategory]::PrivateUse
      'Pc' = [System.Globalization.UnicodeCategory]::ConnectorPunctuation
      'Pd' = [System.Globalization.UnicodeCategory]::DashPunctuation
      'Ps' = [System.Globalization.UnicodeCategory]::OpenPunctuation
      'Pe' = [System.Globalization.UnicodeCategory]::ClosePunctuation
      'Pi' = [System.Globalization.UnicodeCategory]::InitialQuotePunctuation
      'Pf' = [System.Globalization.UnicodeCategory]::FinalQuotePunctuation
      'Po' = [System.Globalization.UnicodeCategory]::OtherPunctuation
      'Sm' = [System.Globalization.UnicodeCategory]::MathSymbol
      'Sc' = [System.Globalization.UnicodeCategory]::CurrencySymbol
      'Sk' = [System.Globalization.UnicodeCategory]::ModifierSymbol
      'So' = [System.Globalization.UnicodeCategory]::OtherSymbol
      'Cn' = [System.Globalization.UnicodeCategory]::OtherNotAssigned
    }
    # Function for getting the online URL for a given Unicode character.
    function get-CharUrl ([char] $char) {
      'http://www.fileformat.info/info/unicode/char/{0}' -f ([uint16] $char).ToString('x')
    }

    # Function for getting the online URL for a given Unicode category.
    function get-CategoryUrl ([System.Globalization.UnicodeCategory] $category) {
      'http://www.fileformat.info/info/unicode/category/{0}' -f $ohtCategories.GetEnumerator().Where( { $_.Value -eq $category }).Key
    }

    # Function for getting a Markdown representation for a given character.
    function get-CharMarkdown ([char] $char) {
      $codePoint = [uint16] $char
      if ($NoName) {
        '`{0}` ([`U+{1}`]({2}))' -f $char, $codePoint.ToString('X4'), (get-CharUrl $char)
      }
      else {
        '`{0}` ({1}, [`U+{2}`]({3}))' -f $char, (get-CharName $char), $codePoint.ToString('X4'), (get-CharUrl $char)
      }
    }

    # Function for getting a Markdown representation for a given Unicode category.
    function get-CategoryMarkdown ([System.Globalization.UnicodeCategory] $category) {
      '[`{0}` (`{1}`)]({2})' -f $category, $ohtCategories.GetEnumerator().Where( { $_.Value -eq $category }).Key, (get-CategoryUrl $category)
    }
    # Function for retrieving a Unicode character's official name.
    # Uses the following NuGet package, which is installed on demand
    # (in the scope of the current user):
    # https://www.nuget.org/packages/UnicodeInformation/
    function get-CharName ([char] $char) {

      if (-not ('System.Unicode.UnicodeInfo' -as [type])) {
        # Install the package on demand and/or load the assembly.
        $nugetPkgName = 'UnicodeInformation'
        # !! Even with an already installed package, a Get-Package call takes a long time to complete if
        # !! the underlying modules haven't been imported yet; therefore, we look for the package
        # !! in the standard current-user package-installation location and fall back to Get-Package, if needed.
        # !! Note that we look for the highest version number, should multiple versions be installed.
        $pkgLocalPath = try {
          (
            Get-Item -EA Ignore -Path ("$HOME/.local/share/PackageManagement/NuGet/Packages/$nugetPkgName*", "$env:LOCALAPPDATA\PackageManagement\NuGet\Packages\$nugetPkgName*")[$env:OS -eq 'Windows_NT'] |
              Sort-Object { [version] ($_.Name -replace ('^{0}\.' -f [regex]::Escape($nugetPkgName))) }
          )[-1]
        }
        catch { }
        if (-not $pkgLocalPath) {
          Write-Verbose -vb "Looking for local '$nugetPkgName' package..."
          $pkg = Get-Package -EA Ignore $nugetPkgName
          if (-not $pkg) {
            # Download and install the package.
            Write-Verbose -vb "Installing NuGet package '$nugetPkgName' for character-name support..."
            $null = Install-Package -Scope CurrentUser -Source nuget.org $nugetPkgName
            $pkg = Get-Package $nugetPkgName # Get the local package to determine its location.
            Write-Verbose -vb "Package '$nugetPkgName' installed to: $($pkg.Source)"
          }
          $pkgLocalPath = Split-Path -Parent $pkg.Source
        }
        # Add the relevant assembly (*.dll) to the session with Add-Type -Path
        # (the wildcard pattern below requires -Path rather than -LiteralPath).
        # For predictability, we target the v2.0 .NET *Standard* DLL.
        # Note: The command would load multiple *.dll files, but not recursively.
        Add-Type -Path $pkgLocalPath/lib/netstandard2.0/*.dll
      }

      # The type of interest is available now, use it.
      $charInfo = [System.Unicode.UnicodeInfo]::GetCharInfo([int] $char)
      # !! Characters 0..31 have an empty .Name property, but the
      # !! .NameAliases property contains values; we simply use the first one.
      if ($charInfo.Name) { $charInfo.Name } else { $charInfo.NameAliases[0].Name }
    }
    # Category info requested.
    if ($CategoryInfo) {

      # Determine what categories to target.
      $ohtTargetCategories = if ($CategoryInfo -in '*', 'All') {
        $ohtCategories
      }
      else {
        if ($CategoryInfo.Length -eq 2) {
          $catShorthand = $CategoryInfo
          $catEnum = $ohtCategories[$CategoryInfo]
        }
        else {
          $catEnum = [System.Globalization.UnicodeCategory] $CategoryInfo
          $catShorthand = $ohtCategories.GetEnumerator().Where( { $_.Value -eq $catEnum }).Key
        }
        [ordered] @{
          $catShorthand = $catEnum
        }
      }

      # Process categories.
      if ($AsMarkdown) {
        $outObj = $ohtTargetCategories.Values.ForEach( { get-CategoryMarkdown $_ })
      }
      elseif ($Online) {
        foreach ($category in $ohtTargetCategories.Values) {
          Start-Process (get-CategoryUrl $category)
        }
      }
      else {
        $outObj = $ohtTargetCategories.GetEnumerator().ForEach( {
            [pscustomobject] @{
              Shorthand = $_.Key
              Name      = [string] $_.Value
              InfoUrl   = get-CategoryUrl $_.Value
            }
          })
      }

      if ($CopyToClipboard) {
        # Defer the actual copying to the `end` block, which would otherwise
        # overwrite the clipboard with the (empty) contents of $stringsToCopy.
        $stringsToCopy.Add(($outObj | Out-String))
      }
      else {
        $outObj
      }

      # We're done.
      # Note: Do not use `exit` here: inside a function it would terminate the
      #       caller's session. `return` leaves only the `begin` block; the
      #       `process` block has no characters to process in this case.
      return
    }
    if ($EnumerateCategory) {
      if ($EnumerateCategory.Length -eq 2) {
        # A 2-character shorthand such as 'Lu' was passed to -EnumerateCategory; translate
        # it to the name of its .NET enum value, such as 'UppercaseLetter'.
        $EnumerateCategory = $ohtCategories[$EnumerateCategory]
      }
      if ($Online) {
        $Character = [char] 0x0 # Dummy character so that the foreach ($char in $Character) loop is entered once.
      }
      else {
        # Get all characters in the target category (brute-force, slow).
        $Character = ([char[]] (0..0xffff)).Where( { [char]::GetUnicodeCategory($_) -eq $EnumerateCategory })
      }
    }

  }
  process {

    foreach ($char in $Character) {

      $codePoint = [uint16] $char

      if ($Online) {
        # Note: With -EnumerateCategory <category> -Online we don't treat the characters in the category individually,
        #       as that would open way too many pages; instead, we go to a page that describes
        #       the category itself, where links to the individual chars. are offered.
        $url = if ($EnumerateCategory) { get-CategoryUrl $EnumerateCategory } else { get-CharUrl $char }
        if ($CopyToClipboard) {
          $stringsToCopy.Add($url) # will be processed in `end` block
        }
        else {
          $urls.Add($url) # will be processed in `end` block
        }
      }
      elseif ($AsMarkdown) {
        $markdown = get-CharMarkdown $char
        if ($CopyToClipboard) {
          $stringsToCopy.Add($markdown) # will be processed in `end` block
        }
        else {
          $markdown
        }
      }
      else {
        # Output an info object or copy just the character to the clipboard.
        if ($CopyToClipboard) {
          # Note: Copy just the character itself; useful for getting the char.
          #       from its code point (e.g., Get-CharInfo 0x41 -cp)
          $stringsToCopy.Add($char) # will be processed in `end` block
        }
        else {
          # Construct and output a [pscustomobject] containing the character and metainformation.
          $utf8Bytes = [Text.Encoding]::Utf8.GetBytes($char)
          $charName = if (-not $NoName) { get-CharName $char }
          [pscustomobject]@{
            Char              = $char
            HexCodePoint      = ('0x{0}' -f $codePoint.ToString("x1"))
            Utf8ByteString    = $utf8Bytes.ForEach( { '0x{0}' -f $_.ToString("x2") }) -join ' '
            Name              = $charName
            Category          = [char]::GetUnicodeCategory($char)
            CategoryShorthand = $ohtCategories.GetEnumerator().Where( { $_.Value -eq [char]::GetUnicodeCategory($char) }).Key
            CodePoint         = $codePoint
            Utf8Bytes         = $utf8Bytes
            InfoUrl           = get-CharUrl $char
          }
        }
      }
    }

  }
  end {
    if ($CopyToClipboard) {
      $stringsToCopy | Set-Clipboard
    }
    elseif ($Online) {
      foreach ($url in $urls) { Start-Process $url }
    }
  }

}
# --------------------------------
# GENERIC INSTALLATION HELPER CODE
# --------------------------------
# Provides guidance for making the function persistently available when
# this script is either directly invoked from the originating Gist or
# dot-sourced after download.
# IMPORTANT:
#   * DO NOT USE `exit` in the code below, because it would exit
#     the calling shell when Invoke-Expression is used to directly
#     execute this script's content from GitHub.
#   * Because the typical invocation is DOT-SOURCED (via Invoke-Expression),
#     do not define variables or alter the session state via Set-StrictMode, ...
#     *except in child scopes*, via & { ... }
if ($MyInvocation.Line -eq '') {
  # Most likely, this code is being executed via Invoke-Expression directly
  # from gist.github.com

  # To simulate for testing with a local script, use the following:
  # Note: Be sure to use a path and to use "/" as the separator.
  #   iex (Get-Content -Raw ./script.ps1)

  # Derive the function name from the invocation command, via the enclosing
  # script name presumed to be contained in the URL.
  # NOTE: Unfortunately, when invoked via Invoke-Expression, $MyInvocation.MyCommand.ScriptBlock
  #       with the actual script content is NOT available, so we cannot extract
  #       the function name this way.
  & {

    param($invocationCmdLine)

    # Try to extract the function name from the URL.
    $funcName = $invocationCmdLine -replace '^.+/(.+?)(?:\.ps1).*$', '$1'
    if ($funcName -eq $invocationCmdLine) {
      # Function name could not be extracted, just provide a generic message.
      # Note: Hypothetically, we could try to extract the Gist ID from the URL
      #       and use the REST API to determine the first filename.
      Write-Verbose -Verbose "Function is now defined in this session."
    }
    else {
      # Indicate that the function is now defined and also show how to
      # add it to the $PROFILE or convert it to a script file.
      Write-Verbose -Verbose @"
Function `"$funcName`" is now defined in this session.
* If you want to add this function to your `$PROFILE, run the following:
    "``nfunction $funcName {``n`${function:$funcName}``n}" | Add-Content `$PROFILE
* If you want to convert this function into a script file that you can invoke
  directly, run:
    "`${function:$funcName}" | Set-Content $funcName.ps1 -Encoding $('utf8' + ('', 'bom')[[bool] (Get-Variable -ErrorAction Ignore IsCoreCLR -ValueOnly)])
"@
    }

  } $MyInvocation.MyCommand.Definition # Pass the original invocation command line to the script block.

}
else {
  # Invocation presumably as a local file after manual download,
  # either dot-sourced (as it should be) or mistakenly directly.

  & {

    param($originalInvocation)

    # Parse this file to reliably extract the name of the embedded function,
    # irrespective of the name of the script file.
    $ast = $originalInvocation.MyCommand.ScriptBlock.Ast
    $funcName = $ast.Find( { $args[0] -is [System.Management.Automation.Language.FunctionDefinitionAst] }, $false).Name

    if ($originalInvocation.InvocationName -eq '.') {
      # Being dot-sourced as a file.
      # Provide a hint that the function is now loaded and provide
      # guidance for how to add it to the $PROFILE.
      Write-Verbose -Verbose @"
Function `"$funcName`" is now defined in this session.
If you want to add this function to your `$PROFILE, run the following:
    "``nfunction $funcName {``n`${function:$funcName}``n}" | Add-Content `$PROFILE
"@
    }
    else {
      # Mistakenly directly invoked.
      # Issue a warning that the function definition didn't take effect and
      # provide guidance for reinvocation and adding to the $PROFILE.
      Write-Warning @"
This script contains a definition for function "$funcName", but this definition
only takes effect if you dot-source this script.
To define this function for the current session, run:
    . "$($originalInvocation.MyCommand.Path)"
"@
    }

  } $MyInvocation # Pass the original invocation info to the helper script block.

}
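
The .NOTES section of the function's help states that only BMP characters are supported, because a .NET [char] is a single 16-bit code unit. The following sketch, which is independent of the function above and uses only built-in System.Char methods, shows what that limitation looks like in practice for U+1F600, a code point chosen here purely as an example of a non-BMP character.

```powershell
# U+1F600 lies outside the BMP, so .NET stores it as a surrogate pair,
# i.e. as two [char] values (UTF-16 code units).
$text = [char]::ConvertFromUtf32(0x1F600)

$text.Length                           # 2
[char]::IsHighSurrogate($text[0])      # True
[char]::IsLowSurrogate($text[1])       # True

# Each half on its own is merely reported as a surrogate, which is why
# per-[char] processing cannot describe the character as a whole.
[char]::GetUnicodeCategory($text[0])   # Surrogate

# Recombine the pair into the single code point that it encodes.
'U+{0:X}' -f [char]::ConvertToUtf32($text[0], $text[1])   # U+1F600
```

Passing such a string to a `[char[]]` parameter therefore yields two separate surrogate code units rather than one character.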