
@samcv

samcv/grant_.md Secret

Last active February 13, 2017 20:06

Improving the Robustness of Unicode Support in Rakudo on MoarVM

Name

Samantha McVey (samcv)

Where can we contact you?

...

Synopsis

Implement the Unicode Collation Algorithm and improve the speed and spec conformance of the text normalizer. Improve test coverage for the Unicode specs and document where we do and do not comply with the Unicode standard.

Benefits to Perl 6 Development

As Perl 6 starts to take off, it is increasingly important to provide robust Unicode support. Perl 6 already provides some of the best Unicode support on many levels compared to other programming languages. The goal of this project is to make Perl 6's Unicode support production ready.

Deliverables/Project Details

General

  • Document any deficits in our Unicode coverage found in the course of work on this project. This is very important due to the vastness of the Unicode standard. Each deficit will get a test, unless it cannot be tested or input is needed from the rest of the Perl 6 team. In those cases, it will be documented in my reports for current and future developers of Perl 6 to reference.

  • Tests will be written to cover all of the relevant Unicode 9.0 test files, as well as making current ones more robust when checking the breaking of graphemes.

Unicode Names

  • Hangul Syllables and other Unicode names need to be programmatically determined when generating the Unicode database.

  • Add support for Unicode 1 names. (This is lowest on the priority list, as I have already implemented Name Aliases.)
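As an illustration of what "programmatically determined" means for Hangul syllables, the following is a minimal Python sketch of the arithmetic name derivation described in the Unicode Standard (section 3.12). It is not the MoarVM implementation; the function name `hangul_syllable_name` is hypothetical and chosen only for this example.

```python
# Constants from the Unicode Standard's Hangul syllable composition scheme.
S_BASE = 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT      # 588
S_COUNT = L_COUNT * N_COUNT      # 11172

# Jamo short names for leading consonants, vowels and trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA",
          "WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG",
          "LM", "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S",
          "SS", "NG", "J", "C", "K", "T", "P", "H"]


def hangul_syllable_name(codepoint):
    """Return the character name of a precomposed Hangul syllable.

    Hypothetical helper for illustration; raises ValueError for code
    points outside the Hangul Syllables block.
    """
    s_index = codepoint - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l_index = s_index // N_COUNT
    v_index = (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT
    return ("HANGUL SYLLABLE "
            + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index])


print(hangul_syllable_name(0xD4DB))  # HANGUL SYLLABLE PWILH
```

Because all 11,172 names follow from three short tables, none of them need to be stored in the generated database.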

Unicode Collation Algorithm

  • Fully implement the Unicode Collation Algorithm, at least for language-independent sorting.

  • Assess what is needed to support language- and country-specific collation.
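To make the collation deliverable more concrete, here is a rough Python sketch of the multi-level sort key idea at the heart of the Unicode Collation Algorithm: primary weights are compared first, then secondary (accents), then tertiary (case). The weight table `TOY_WEIGHTS` is entirely hypothetical and far smaller than the real default table, and the sketch ignores contractions, expansions and variable weighting.

```python
# Hypothetical toy weights (primary, secondary, tertiary).
# These are NOT values from the real default collation table.
TOY_WEIGHTS = {
    "a": (0x10, 0x20, 0x02),
    "A": (0x10, 0x20, 0x08),
    "\u00e1": (0x10, 0x21, 0x02),  # a with acute: differs at the secondary level
    "b": (0x11, 0x20, 0x02),
    "B": (0x11, 0x20, 0x08),
}


def sort_key(text):
    """Build a three-level sort key: all primaries, then secondaries,
    then tertiaries, with a zero separating the levels."""
    elements = [TOY_WEIGHTS[ch] for ch in text]
    key = []
    for level in range(3):
        if level:
            key.append(0)  # level separator, lower than any real weight
        key.extend(w[level] for w in elements if w[level] != 0)
    return tuple(key)


words = ["b", "A", "\u00e1", "a", "B"]
print(sorted(words, key=sort_key))
```

With these toy weights, case and accent differences only matter when the primary weights tie, which is why plain code point order is not enough for sorting text.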

Text Normalization

  • Improve the performance of the text normalizer and also allow the normalizer to save state across multiple characters to properly support Grapheme Breaking for all of Unicode 9.0 and beyond.
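The need to keep state across characters can be illustrated with a deliberately simplified Python sketch of a streaming normalizer. It buffers code points until the next one with canonical combining class 0 arrives, then normalizes the buffered run. The class name `StreamingNFC` is hypothetical, and treating every class-0 code point as a safe boundary is a simplifying assumption that does not hold in general (Hangul jamo, for example, compose across such boundaries), so this is not how MoarVM does it.

```python
import unicodedata


class StreamingNFC:
    """Simplified illustration of a normalizer that carries state
    between code points. Assumes every code point with combining
    class 0 starts a new run, which real normalizers cannot assume."""

    def __init__(self):
        self._pending = []  # code points seen but not yet emitted

    def feed(self, ch):
        """Accept one code point; return any text that is now final."""
        out = ""
        if self._pending and unicodedata.combining(ch) == 0:
            out = unicodedata.normalize("NFC", "".join(self._pending))
            self._pending = []
        self._pending.append(ch)
        return out

    def finish(self):
        """Flush whatever is still buffered at end of input."""
        out = unicodedata.normalize("NFC", "".join(self._pending))
        self._pending = []
        return out


normalizer = StreamingNFC()
pieces = [normalizer.feed(ch) for ch in "e\u0301x"]
pieces.append(normalizer.finish())
print("".join(pieces) == "\u00e9x")
```

The point of the sketch is only that "e" cannot be emitted until the normalizer has seen what follows it; grapheme breaking has the same look-ahead requirement.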

Unicode Database Generation

  • The script used to generate the Unicode database will be made deterministic, so that it produces the same output file on every run. Currently, roughly half of the file changes between runs even when the script itself is unchanged.

  • Rewrite the Perl 5 script used to generate the Unicode database in Perl 6. This is tied to the previous item: since a rewrite is needed anyway, doing it in Perl 6 will make the script more maintainable.

    • Work toward the rewrite has been occurring here.
  • Implement all relevant remaining Unicode properties from Unicode 9.0. This includes the properties needed to support the deliverables listed above.

  • Try to reduce the memory footprint of the Unicode database. Currently the unicode.o binary file created is 4.1MB. I hope to cut that in half.
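For the determinism deliverable in the list above, one common cause of run-to-run churn in generated files is iterating over hash-based containers in whatever order they happen to yield. Whether that is the cause in the existing script is an assumption here. The Python sketch below shows the usual remedy, emitting entries in sorted order; the function `emit_table` and the C array name `property_values` are hypothetical.

```python
def emit_table(properties):
    """Render a mapping of property name -> integer value as C source.

    Entries are emitted in sorted key order, so the output does not
    depend on the order in which the mapping was filled.
    """
    lines = ["static const int property_values[] = {"]
    for name in sorted(properties):
        lines.append(f"    {properties[name]},  /* {name} */")
    lines.append("};")
    return "\n".join(lines)


first_run = emit_table({"Lu": 1, "Ll": 2, "Nd": 3})
second_run = emit_table({"Nd": 3, "Lu": 1, "Ll": 2})
print(first_run == second_run)  # True
```

Stable output also makes diffs of the generated file meaningful, since only real data changes show up.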

Project Schedule

1 1/2 months

When can you begin work?

I can begin work as soon as possible.

Report Schedule

Reports will be posted on my blog at https://cry.nu, which will be syndicated at pl6anet.org.

How frequently will these updates be made?

Reports will be submitted every week.

Public Repository

Code will be stored in the MoarVM, NQP and Rakudo repositories, although work in progress may happen on my own public forks before being merged into these repositories. Change logs will be viewable on github.com.

Grant Deliverables ownership/copyright and License Information

Same as MoarVM/Rakudo/NQP (Artistic 2.0)

Who and/or which organization will have ownership (copyright) of the grant deliverables?

The Perl Foundation will have copyright of the deliverables.

Bio

Although I am a fairly recent addition to the Perl 6 core developers, I have been very busy in a few short months. I have two Perl 6 modules, IRC::TextColor and URL::Find, and I am the lead developer of the Perl 6 syntax highlighter for Atom/GitHub as well as for docs.perl6.org. I converted the site from the old Pygments highlighter to the new highlighter.

My contributions to Perl 6 have been focused on Unicode support in Perl 6, making changes throughout Rakudo, NQP and MoarVM to achieve this. All of the work I have already done on improving Unicode support in Perl 6 shows I am capable of completing this project and am the best person for this grant. In addition, I have already started work on rewriting the Unicode Database generation and shrinking the size of the data needed to be loaded on startup.

Unicode work already done within the last few months

Tests:

  • Fixed several errata in roast related to our Unicode support, which had often been present for a long time.
  • Added a test based on Unicode's GraphemeBreakTest.txt, and brought many other tests up to Unicode 9.0.
  • Updated other tests for Unicode 9.0 and reworked others for compliance.

MoarVM:

  • Implemented part of the Unicode Collation Algorithm.
  • Added support for named codepoint sequences, which includes the Named Sequences, Emoji Sequences and Emoji ZWJ Sequences.
    • Example: "\c[woman gesturing OK]"
  • Implemented Unicode Name Aliases when getting codepoints by name.
  • Implemented the 'Extend' Grapheme_Cluster_Break property which was new in Unicode 9.0. We previously had no support for this property.
  • Implemented many other Grapheme_Cluster_Break fixes and added support for most Emoji sequences.
  • Improved the speed of radix by 50% for non-ASCII decimal digits.
  • Improved the speed of text normalization, making slurping a Unicode-heavy text file 14% faster.
  • Added a multitude of properties to our Unicode database.

Rakudo:

  • Added support for a large number of Unicode properties, handling Bool/Str/Int return types for uniprop.
  • Implemented the uniprops method in Rakudo.

Country of Residence

USA

Nationality

USA

Amount Requested

$50 USD per hour * 100 hours a month = $5,000 per month. The total is $7,500 for 1.5 months.

Okay to publish proposal? :

Yes

Suggestions for Grant Manager

moritz
