Skip to content

Instantly share code, notes, and snippets.

@97littleleaf11
Last active October 8, 2021 20:59
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save 97littleleaf11/889e76c4f8258bd07679f1606396efb7 to your computer and use it in GitHub Desktop.
Save 97littleleaf11/889e76c4f8258bd07679f1606396efb7 to your computer and use it in GitHub Desktop.
GSOC-2021

Speeding up string operations of Mypyc

Jingchen YE

This is the final report of my GSOC 2021 project at mypy team, mentored by Jukka Lehtosalo.

Introduction

Mypyc is a compiler from type-annotated Python to CPython C extension modules. It uses type information and restrictions to certain dynamic language features to speed up code, often significantly.

Python built-in string and its operations are widely used in many python programs. However, in mypyc only parts of string operations have been sped up using specific Python C APIs or customed C helper functions, instead of generic function calls. In addition, there are no specialized operations for bytes and bytearray.

There exist three string formatting methods in Python3, %-interpolation, str.format() method call, and f-strings. Before this project, there was no optimization for any of them and the performance could be even worse than CPython.

The major tasks of this project are as following(project description):

  • Analyse popular string formatting operations and speed them up.
  • Introduce bytes primitive type and optimize common used bytes operations.
  • Speed up encode and decode methods for conversion between string and bytes.

Implementation details

All my commits can be found here.

String formatting

Originally, all three string formatting methods in mypyc are compiled in generic ways. printf-style string formatting, also called %-interpolation in Python, is a binary op of a string object and an expression linked by a % operator. They were compiled into PyNumber_Remainder function calls. F-strings is translated into a str.join() call with multiple str.format() calls in mypy AST. This kind of overhead makes compiled F-strings run even worse than CPython.

Analyse distribution

The key idea of optimizing string formatting is that we only specialize those most common cases and keep other rare cases using old generic calls. By doing so we could gain promising boost-ups in real-world projects while maintaining a simple codebase without writing weird corner cases.

To start with, we roughly filter out the %-interpolation cases in dozens of popular open-source codebases. The results showing that %s takes up 70% cases and %d covers another 20% cases. The most popular conversion types in the rest are %f, %x, %i, and others. Though the distribution may vary slightly depending on the purpose of use, we can still safely draw the conclution that %s, %d, as well as {} in str.format() and f-strings, cover the majority of daily use cases.

Optimization

During IR building, we would have three different parsers for each methods, which would later share a same synthesis phase.

  • A tokenizer for c-style formatting strings which are used with % operator.
  • A tokenizer for braces formatting strings which are used with str.format() method.
  • A specializer for f-string by applying to str.join(), since f-strings are parsed into str.join() and str.format() in mypy AST. This specializer should also deal with a nested braces formatting strings.

Tokenizers reuse the regular expressions in mypy.checkstrformat to reduce code duplication and makes it more likely that we parse format strings consistently now and in the future.

These 'frontends' would first parse original strings into literals and formatting operations represented as FormatOp, an abstract presentation of formatting conversion operations. For example, we would have these FormatOps initially:

  • Convert a variable into string
  • Convert an int variable.
  • ...

Compare to ConversionSpecifier, FormatOp has fewer attributes and indicates compile time optimizations. For example, to mark a conversion from any object to string, ConversionSpecifier may have several representations, like '%s', '{}' or '{:{}}'. However, there would only exist one corresponding FormatOp. Currently FormatOp is just an Enum for convenience. We might add several attributes later and upgrade it to a class if we need to support more conversions. In the long term, a FormatOp can also has optional parameters for alignment, precision, mapping key, etc.

The synthesis phase, shared by all three methods, utilizes the unified FormatOp to apply conversion, and combines the results with literals. It also helps simplify the implementation. This phase would take care of optimizations, for example:

  • Skip literals
  • Use fast calls for specific primitives
  • ...

If any FormatOp or some attributes haven't been supported yet, we would fall back to original CPython method calls.

I also implemented a primitive for string initialization instead of calling str.join(). We could get rid of unnecessary checks by directly allocating the whole space before filling in the objects.

The core commits are as follows:

Bytes and bytes operations

In this project we added primitives and several operations for bytes and bytearray in mypyc. bytearray is treated as a subtype of bytes by mypy, even though they behave differently in some cases. We keep this design and check bytearray before calling the corresponding CPython API.

I added C helper functions for encode and decode methods, which are widely used for conversion between bytes and string. Now we also support faster bytes interpolation using the same logic with str.

Performance

Mypyc provides several microbenchmarks which automatically been tested everyday. The performance improvements from this projects are promising and here is the related microbenchmark result:

  • str_format: 1.36x -> 4.14x, +209.6%
  • bytes_indexing: 2.60x -> 11.40x, +338.3%
  • bytes_format: 1.64x -> 4.67x, +154.9%
  • bytes_call: 1.77x -> 3.66x, +107.0%
  • bytes_slicing: 1.72x -> 3.12x, +81.8%
  • bytes_concat: 1.62x -> 2.01x, +24.3%
  • encode_decode: 1.30x -> 1.92x, +37.3%

Regarding three string formatting methods, I also implemented separated microbenchmarks which follow the distribution of the real-world projects.

  • %-interpolation: 1.322x -> 2.810x
  • str.format: 1.173x -> 3.685x
  • f-strings: 0.380x -> 1.960x

Future work

I almost achieved all the main targets set forth in the proposal. However, there is still some remaining work that requires further contribution.

  • We left some primitives of common bytes operations as good first issues for new contributors. These primitives are relatively rare thus we could safely use generic method calls.
  • We could add more FormatOp for specific formatting types or operations, such as floats, hex.

My future work in the short term after this summer would be focusing on fixing bugs for mypyc and mypy, since we are gaining more and more users as well as issues.

Acknowledgements

I would like to express my gratitude to my mentor Jukka. He is an extremely nice person that always gives me timely, accurate , and helpful responses on my commits. He always teaches me, without reservation, how to design code architecture and how to balance between different goals, such as speed and code complexity.

I also want to say thanks to Xuanda, who introduced this project to me and shared his experience. I am so lucky to have such a good friend on my open source journey.

Finally, I want to thank Google for providing this opportunity to work with such an amazing community. This is my first time contributing to an open source project and I really enjoyed working with mypy team.

@TH3CHARLie
Copy link

Congrats!

@juniorxxue
Copy link

Congrats!

@rashi24-p
Copy link

congrats

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment