@Jach
Last active January 15, 2019 16:17
Extract Unicode letter code points for JS regexes
#|
This file is in the public domain.
I grabbed the data from ucd.all.flat.zip at ftp://ftp.unicode.org/Public/11.0.0/ucdxml/
(unzipped to ~/ucd.all.flat.xml) and use this script to create a UnicodeLetters.txt file
containing a series of Unicode character ranges suitable for use in JavaScript regular expressions
where you want \p{L}.
Note that IE does not support the 'u' flag, so there you'll need to restrict yourself to the subset of
letters in the BMP (below 0x10000) and change this script to print a simple "\uABCD" instead of "\\u{ABCDE}".
Also note that reading in the file seems to stress SBCL's default settings, so run with
--dynamic-space-size 10000
|#
(defpackage unicode-extract
  (:use :cl))
(in-package unicode-extract)
#-quicklisp
(load #p"~/.sbclrc")
(quicklisp:quickload 'plump)
(defparameter data-raw (plump:parse #P"~/ucd.all.flat.xml"))
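(defun is-letter (category)
  ;; The gc attribute holds a concrete two-letter category, so the grouping
  ;; alias "L&" never appears there; compare case-sensitively.
  (member category '("Ll" "Lu" "Lt" "Lm" "Lo") :test #'string=))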
(defparameter letter-nums
  (sort
   (mapcar #'(lambda (row) (parse-integer (plump:get-attribute row "cp") :radix 16))
           (remove-if-not #'(lambda (el) (is-letter (plump:get-attribute el "gc")))
                          (plump:get-elements-by-tag-name data-raw "char")))
   #'<))
(defun int-list-to-range-list (int-list)
  "Groups consecutive integers into range pairs. INT-LIST must be sorted in ascending order. e.g.
(int-list-to-range-list '(1 2 3 5 6 8 10 11))
->
'((1 3) (5 6) (8 8) (10 11))"
  (let ((prev-el -10)
        (result (list)))
    (dolist (el int-list)
      (cond
        ;; diff of current el and prev is > 1, push a new group
        ((> (- el prev-el) 1) (push (list el el) result))
        ;; diff is exactly 1, inc the end of the most recently pushed group
        ((= (- el prev-el) 1) (incf (second (first result)))))
      (setf prev-el el))
    (nreverse result)))
(defparameter letter-ranges (int-list-to-range-list letter-nums))
(defun to-hex-str (num)
  "Gives string for num in hex format, with at least 4 characters using 0s for leading padding."
  (format nil "~4,'0X" num))
(defun print-hexes ()
  (dolist (grp letter-ranges)
    (princ "\\\\u{")
    (princ (to-hex-str (first grp)))
    (princ "}")
    (unless (= (first grp) (second grp))
      (princ "-\\\\u{")
      (princ (to-hex-str (second grp)))
      (princ "}"))))
(with-open-file (s #P"~/UnicodeLetters.txt" :direction :output :if-exists :supersede)
  (let ((*standard-output* s))
    (print-hexes)))
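
;;; A minimal sketch of the BMP-only output mentioned in the header comment,
;;; for engines without the 'u' flag. The names CLIP-RANGES-TO-BMP and
;;; PRINT-HEXES-BMP are hypothetical additions, not part of the original
;;; script. The doubled backslash mirrors what PRINT-HEXES emits; whether you
;;; want one backslash or two depends on where the output gets pasted.
(defun clip-ranges-to-bmp (ranges)
  "Drops ranges that start above #xFFFF and truncates ranges that cross it."
  (loop for (start end) in ranges
        when (<= start #xFFFF)
          collect (list start (min end #xFFFF))))

(defun print-hexes-bmp (ranges)
  "Prints the BMP part of RANGES in the 4-digit escape form, without braces."
  (dolist (grp (clip-ranges-to-bmp ranges))
    ;; ~4,'0X pads to four hex digits, which is all a BMP code point needs.
    (format t "\\\\u~4,'0X" (first grp))
    (unless (= (first grp) (second grp))
      (format t "-\\\\u~4,'0X" (second grp)))))

;;; Possible usage, assuming LETTER-RANGES has been computed as above:
;;;   (print-hexes-bmp letter-ranges)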
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment