Samantha McVey (samcv)
...
Implement Unicode Collation Algorithm, improve speed and spec conformance of the text normalizer. Improve test coverage for Unicode specs and document our compliance or lack of compliance with the Unicode spec.
As Perl 6 starts to take off, it is increasingly important to provide robust Unicode support. Perl 6 already provides some of the best Unicode support on many levels compared to other programming languages. The goal of this project is to make Perl 6's Unicode support production ready.
- Document any deficits of our Unicode coverage in the course of work on this project. This is very important due to the vastness of the Unicode standard. Deficits should have tests written, unless such a thing would not be possible to test or input is needed from the rest of the Perl 6 team. In any of these cases, they will be documented in my reports for future and current developers of Perl 6 to reference.
- Tests will be written to cover all of the relevant Unicode 9.0 test files, and current tests will be made more robust when checking grapheme breaking.
- Hangul Syllables and other Unicode names need to be programmatically determined when generating the Unicode database.
- Add support for Unicode 1 names. (This is lowest on the priority list, as I have already implemented Name Aliases.)
- Fully implement the Unicode Collation Algorithm, at least for sorting that is not language-specific.
- Assess needs in supporting language- and country-specific collation.
- Improve the performance of the text normalizer and also allow the normalizer to save state across multiple characters to properly support Grapheme Breaking for all of Unicode 9.0 and beyond.
- The script used to generate the Unicode database shall be made deterministic, producing the same output file on every run. At the current time, about half of the file changes even if no changes are made to the script. This issue will be solved.
- Rewrite the Perl 5 script used to generate the Unicode database in Perl 6. This is also part of the previous item: since a rewrite is needed, it should be done in Perl 6 to help make it more maintainable.
- Work toward the rewrite has been occurring here.
- Implement all relevant remaining Unicode properties from Unicode 9.0. This includes the properties needed to support the deliverables listed above.
- Try to reduce the memory footprint of the Unicode database. Currently the unicode.o binary file created is 4.1MB. I hope to cut that in half.
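To illustrate what "programmatically determined" means for the Hangul Syllables deliverable above, here is a minimal sketch of the name-derivation arithmetic described in the Unicode Standard (section 3.12, Conjoining Jamo Behavior). It is written in Python purely for illustration and is not the code used in the MoarVM database generator; the function name hangul_syllable_name is hypothetical. The constants and Jamo short names are taken from the Unicode Standard and Jamo.txt.

```python
# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE = 0xAC00                 # first precomposed Hangul syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT     # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT     # 11172 syllables in total

# Jamo short names (from Jamo.txt), indexed by leading consonant,
# vowel, and trailing consonant. Empty strings are intentional.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA",
          "WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG",
          "LM", "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S",
          "SS", "NG", "J", "C", "K", "T", "P", "H"]


def hangul_syllable_name(codepoint):
    """Derive the character name of a precomposed Hangul syllable.

    Hypothetical helper for illustration: the name is computed from the
    code point alone, so no per-syllable name needs to be stored.
    """
    s_index = codepoint - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable: U+%04X" % codepoint)
    l_index = s_index // N_COUNT               # leading consonant
    v_index = (s_index % N_COUNT) // T_COUNT   # vowel
    t_index = s_index % T_COUNT                # trailing consonant (0 = none)
    return "HANGUL SYLLABLE " + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index]


print(hangul_syllable_name(0xD4DB))  # HANGUL SYLLABLE PWILH
```

Because all 11,172 syllable names follow from three short tables, a generator that derives them this way does not need to store a name string per syllable, which is also relevant to the memory footprint item above.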
1 1/2 months
Can begin work as soon as possible.
Reports will be made on my blog at https://cry.nu, which will be syndicated at pl6anet.org
Reports will be submitted every week.
Code will be stored in the MoarVM, NQP and Rakudo repositories, although work in progress may happen on my own public forks before being merged into these repositories. Change logs will be viewable on github.com.
Same as MoarVM/Rakudo/NQP (Artistic 2.0)
The Perl Foundation will have copyright of the deliverables.
Although I am a fairly recent addition to the Perl 6 core developers, I have been very busy in a few short months. I have two Perl 6 modules, IRC::TextColor and URL::Find, and I am the lead developer of the Perl 6 syntax highlighter for Atom/Github as well as for docs.perl6.org. I converted the site from using the old Pygments highlighter to the new highlighter.
My contributions to Perl 6 have been focused on Unicode support in Perl 6, making changes throughout Rakudo, NQP and MoarVM to achieve this. All of the work I have already done on improving Unicode support in Perl 6 shows I am capable of completing this project and am the best person for this grant. In addition, I have already started work on rewriting the Unicode Database generation and shrinking the size of the data needed to be loaded on startup.
- Fixed several errata in roast related to our Unicode support, which had often been present for a long time.
- Added a test based on GraphemeBreakTest.txt from Unicode 9.0, along with many other tests for Unicode 9.0.
- Updated other tests for Unicode 9.0 and reworked others for compliance.
- Implemented part of the Unicode Collation Algorithm.
- Added support for named codepoint sequences, which include the Named Sequences, Emoji Sequences and Emoji ZWJ Sequences (for example, "\c[woman gesturing OK]").
- Implemented Unicode Name Aliases in getting codepoints by name
- Implemented the 'Extend' Grapheme_Cluster_Break property which was new in Unicode 9.0. We previously had no support for this property.
- Implemented many other Grapheme_Cluster_Break fixes and added support for most Emoji sequences.
- Improved the speed of radix by 50% for non-ASCII decimal digits
- Improved the speed of text normalization, making slurping a Unicode heavy text file 14% faster
- Added a multitude of properties to our Unicode database.
- Added support for a large number of Unicode properties, handling Bool/Str/Int return types for uniprop
- Implemented the uniprops method in Rakudo
USA
USA
$50 USD / Hour * 100 hours a month = $5,000. Total is $7,500 for 1.5 months.
Yes
moritz