
Achieving warp speed with Rust

使用Rust获得曲率速度

本文是Jack Fransham的Gist文章的中文翻译。英文原文逐段附于译文之前,方便对照。



If you're looking to write fast code in Rust, good news! Rust makes it really easy to write really fast code. The focus on zero-cost abstractions, the lack of implicit boxing and the static memory management means that even naïve code is often faster than the equivalent in other languages, and certainly faster than naïve code in any equally-safe language. Maybe, though, like most programmers you've spent your whole programming career safely insulated from having to think about any of the details of the machine, and now you want to dig a little deeper and find out the real reason that Python script you rewrote in Rust runs 100x faster and uses a 10th of the memory. After all, they both do the same thing and run on the same CPU, right?

如果你正在寻找如何用Rust编写高速代码的方法,好消息!Rust让这件事变得非常容易。这门语言专注于零成本抽象、没有隐式装箱、采用静态内存管理,这意味着即使是朴素直白的代码也往往比其他语言的等价实现跑得快,而且肯定比任何同等安全语言中的朴素代码更快。也许你和大部分程序猿一样,整个职业生涯都被安全地隔离在机器细节之外,而现在你想深入一点,弄清楚为什么用Rust重写的那个Python脚本快了100倍、只用了十分之一的内存。毕竟,它们做的是同一件事,跑在同一块CPU上,不是吗?

So, here's an optimization guide, aimed at those who know how to program but maybe don't know how it maps to real ones and zeroes on the bare metal of your CPU. I'll try to weave practical tips about optimizing Rust code with explanations of the reason why it's faster than the alternative, and we'll end with a case study from the Rust standard library.

所以这篇Gist是一份优化指南,目标读者是那些知道编写程序但不清楚底层裸金属CPU二进制细节的程序猿。作者会尽可能在Rust中以实践的方式来说明优化的过程以及为何这样会比其他方式要快,最后会用一个Rust标准库的案例来进行总结说明。

This post assumes decent familiarity with programming, a beginner's familiarity with Rust and almost no familiarity with CPU architecture.

这篇文章假定读者对程序设计有相当的熟悉度,对Rust有初级的认识,但是基本上对CPU架构没有了解。

Number one optimization tip: don't

优化第一条:不要做的事

Ok, I'll start with a few disclaimers before I get into the meat. Firstly, unless you're running into performance problems in real-life usage, optimize your code for readability before you optimize it for runtime performance. Compilers and humans alike are better at understanding boring, straightforward code, so if you write that first you'll be likely to have "good enough" performance with the added benefit of being maintainable. If you write nice, refactorable code it'll be easy to change if you later realize that it's being wasteful.

好的,进入正题之前先来几条免责声明。第一,除非你在实际使用中碰到了性能问题,否则应该优先为可读性优化代码,而不是为运行时性能优化。编译器和人类一样,更容易理解乏味、直白的代码,所以如果你先这样写,你很可能已经获得了“足够好”的性能,还附带了可维护性的好处。如果你写出的是整洁、易于重构的代码,那么当你之后发现它在浪费资源时,修改起来也会很容易。

That's not to say that performance doesn't matter. You shouldn't only optimize slow programs. A long-running program using high amounts of CPU but not causing visible slowness is just as bad as a program that takes 30s instead of 3s to process some data, it's just that the former is wasting battery and the latter is wasting time. Weigh up the time it would take to optimize the code against the benefit you would get from the optimization.

这并不是说性能不重要。你不应该只优化慢的程序。一个长时间运行、占用大量CPU但没有造成可见卡顿的程序,与一个本该3秒处理完数据却花了30秒的程序同样糟糕,区别只是前者浪费电而后者浪费时间。请在优化代码所需的时间和优化能带来的收益之间做好权衡。

The reason maintainability is so important is that lot of optimizations are "try it and see" - if you're unable to make sweeping changes to your code without breaking it you're going to have a really bad time. Now, speaking of "try it and see"...

可维护性如此重要的原因在于,很多优化都是“试一下看看”:如果你无法在不破坏程序的前提下对代码做大范围的修改,那你的日子会非常难过。说到“试一下看看”……

Never optimize blindly

永远不要盲目的优化

There are lots of performance-tracing tools out there. A famous tool/set of tools for C and other systems languages is valgrind, which are extremely powerful but can be scary to get started with, so if you just want to have a quick overview of what your program is doing from a performance standpoint check out this article on analyzing Rust with perf, a fantastic and easy-to-use performance tool for Linux. Unless there's a glaring flaw, like pervasive cloneing or a blatantly sub-optimal algorithm, perf will likely give better results than simply optimizing stuff that "looks slow".

业界有很多性能追踪工具。对于C和其他系统语言来说,有一个著名的工具集叫valgrind,它极其强大,但上手可能让人望而生畏。如果你只想从性能角度快速了解程序在做什么,可以看看使用perf对Rust进行分析这篇文章,perf是Linux下一个很棒又容易上手的性能工具。除非存在明显的缺陷,比如到处都是clone或者明显欠佳的算法,否则用perf得到的结果很可能比单纯去优化那些“看起来很慢”的代码要好。

Another tool that's good to help you avoid gotchas of all kinds (not just performance) is clippy, but you already knew that because you're using it on all your code to begin with, right?

对于Rust来说还有另一个优秀的工具,能够帮助你避开各种坑(不仅限于性能),它就是clippy,但作者相信你应该已经知道它了,因为你现在使用的所有的代码都经过它的检查了,不是吗?

perf also shows the cumulative cost of each function over the course of the program's runtime, which leads me to my next point:

perf还能将程序运行过程中每个函数的累计耗时计算出来,然后可以进入下一节的内容:

Don't bother optimizing one-time costs

不要优化一次性的消耗

Your config-parsing code can be as slow as you like and it's unlikely to matter. Don't just optimize the slowest function in your program, optimize the one that takes up the most of your runtime. Those may be the same function, but if you get a 2 millisecond improvement on a function that's called 1000 times, that's better than a 1 second improvement on a function that's called once.

你解析配置的代码想多慢就多慢,这基本无关紧要。不要只优化程序中最慢的那个函数,而要优化占总运行时间最多的那个。它们可能是同一个函数,但如果你能让一个被调用1000次的函数每次快2毫秒,那比让一个只调用一次的函数快1秒更有价值。

Improve your algorithms

改进你的算法

Now, as with every article on performance, this is where I add in the requisite disclaimer of use better algorithms first. Don't invent new algorithms unless you do that for a living, but in all likelihood if you're running into performance problems it's more likely to be due to poor algorithms than to poor implementations. Most programmers test their code on small datasets, but if you have O(n²) complexity that won't appear until you've tried it on a larger dataset. If you don't know what your algorithm is, which is likely since most code is written without a specific algorithm in mind, just try to have as few loops as possible and remember that every use of collect has to iterate over the entire collection at least once, and so the more work you can do using less loops the better. This is the same as optimization in any language though, so this is all I'll say on algorithmic complexity for now. If you want to find out more, there are some excellent resources out there.

接下来,和每篇谈性能的文章一样,这里要加上那条必备的免责声明:先使用更好的算法。除非你以此为生,否则不要发明新算法,但如果你碰到了性能问题,多半是因为算法差而不是实现差。大部分程序猿都在小数据集上测试代码,而O(n²)的复杂度只有在较大的数据集上才会暴露出来。如果你不知道自己用的是什么算法,这很正常,因为大部分代码写的时候并没有想着某个特定算法,那就尽量减少循环的数量,并且记住每次调用collect都至少要把整个集合迭代一遍,所以用更少的循环做更多的事总是更好。这一点在任何语言里都一样,关于算法复杂度本文就说到这里。如果你想深入了解,网上有很多优秀的资源。

CPU architecture primer

CPU架构入门

So maybe the complexity of your algorithm can't be improved, and to make any improvements you need to get into the down-to-the-metal stuff. This is where the difference comes in between languages like Rust and C and languages like Python and Ruby. It's entirely possible that you know this already, but it's worth going over to make sure we're all on the same page.

到了这里,你可能发现改进算法复杂度也无法改进程序的性能了,这个阶段再需要提升的话,就需要进入到底层的裸金属领域了。这里开始就能体现像Rust、C这样的语言和Python、Ruby之间的差别。很可能你已经了解了以下一些知识,不过这里很有必要再过一遍,以期大家都有共识。

There are two parts to any computation, the stuff it does and the stuff it does it on. The instructions, and the data.

对于计算来说,我们关注的有两部分,计算规则本身和计算应用在哪些内容之上。也就是指令和数据。

Instructions are stored in the instruction cache - a chunk of really, really fast memory that's directly readable by the CPU. Each instruction can put and/or take data from the CPU's registers, which is a small number of small pieces of memory, either 32 or 64 bits depending on your computer's word size. Only a small amount of data can be in registers at any one time, however, and you can't take a pointer to a register, so sometimes the CPU must access the computer's RAM. Since RAM is slow, the CPU tries to read in bulk and then store the result in increasingly small, increasingly fast caches. If it tries to access data that isn't in the smallest cache, it has to read the slightly larger cache, continuing up until it reaches RAM. The upshot is: you want to keep your data as small as possible, and for data that is accessed together to be close to each other so the CPU loads as much of it at once as possible. This should be enough information to get you through the rest of this article, but if you want to dive deeper into it you can check out the structure and implementation section in the Wikipedia page for the CPU.

指令存储在指令缓存上,这是一个非常非常快的内存区域,能够直接被CPU读取。每条指令都能读/写CPU寄存器上的数据,这些寄存器是CPU上很少和很小的一些存储单元,取决于你的机器字长可能是32位或64位。任何时间只能有很少的数据能存储在寄存器当中,而且你也无法获取寄存器的指针,因此有时CPU必须直接访问计算机的内存。因为内存比较慢,所以CPU通常试图将一整块的内存预先读取到自带的很小但是非常快的缓存里。如果在较小的缓存中找不到目标数据,CPU就会试图读取更大的缓存,直至最后退化成读取内存。这样造成的结果是:你应该将数据尽量保持足够小,对于需要一起访问的数据来说,它们应该处于内存上连续的区域中,以便CPU能够尽可能在一次装载过程中全部读取到。理解了上述原理足够理解本文余下的内容了,如果你想更深入,可以查阅维基上关于CPU结构和实现的章节

Keep as much as possible in cache

尽可能的利用缓存

The further away your data is from your CPU the slower your program will be. The very worst place for data to be is on a different computer. A less awful - but still very awful - place for data to be is on your hard drive. Better still is in your RAM but as mentioned before, RAM is slow. Almost the best possible place for your data is in CPU cache. You may have heard some folklore that allocating is bad, and this is the main reason why. Accessing two different locations one after another on the stack is fine, since they're likely to be on the same cache line. Accessing two different locations one after another on the heap is significantly slower, since it's much less likely they they're directly next to each other. We'll go into exactly why this is in a moment.

数据离CPU越远,程序就越慢。数据最糟糕的存放位置是在另一台计算机上。没那么糟但仍然很糟的位置是在硬盘上。再好一些是在内存里,但如前所述,内存也很慢。对数据来说几乎最好的位置就是CPU缓存。你可能听过“分配内存不好”这种说法,这就是主要原因。先后访问栈上两个不同的位置没什么问题,因为它们很可能在同一条cache line上;先后访问堆上两个不同的位置就会明显变慢,因为它们紧挨着的可能性小得多。我们稍后会详细解释为什么。

If you have to allocate, because you need variable-size containers, shared ownership or owned trait objects (see below for why you probably don't need trait objects), try to put data that will be accessed in sequence in order in RAM, so that when the CPU reads one element it necessarily has to read the next few elements too, meaning it doesn't need to stall waiting for RAM in order to operate on them.

如果你确实需要内存分配,原因是程序需要不定长的容器,共享所有权或有所有权的特质对象(后面我们还会谈到为什么你可能并不需要特质对象),那么尽可能将数据放置在内存的一段连续区域中,这样当CPU读取一个数据时能顺便将下一个数据读取到缓存中,所以CPU不需要等待再次加载内存数据才能进行下一步操作。

As a rule of thumb for whether something has to allocate: if you can tell me the amount of space the value will use up without running the program, it's stored on the stack. If you don't know something's size until runtime, it's allocated on the heap.

这里说明一下使用堆内存分配的一个标准:如果可以在不运行程序的情况下就确定某个值需要用到的内存大小,那么它应该被分配在栈上。如果直到运行时你才能确定某个值的大小,那么它就只能在堆上分配。

This means that String, Vec, HashMap and Box<Trait>/Box<[T]> all allocate, but any user-defined struct does not (it may contain something that does allocate, but it doesn't require any extra allocation to construct if you already have an instance of the allocating type). Box<T> where T has a statically-known size also allocates, so be careful of recursive enums. If you're creating a tree structure that then becomes immutable, like with an AST, you might want to consider using a TypedArena to get tighter control over memory use. TypedArena is still unstable though, and it increases complexity, so it's not suitable for all use-cases.

这意味着String、Vec、HashMap和Box<Trait>/Box<[T]>都需要堆分配,而任何用户自定义的结构体都不需要(它可能包含需要分配的字段,但如果你已经拿到了那些字段的实例,构造结构体本身不需要任何额外分配)。Box<T>在T的大小静态已知时同样需要堆分配,所以要当心递归的枚举类型。如果你要构建一个之后不再改变的树形结构,比如一棵抽象语法树AST,可以考虑用TypedArena来更紧凑地控制内存使用。不过TypedArena目前还不稳定,而且会增加复杂度,所以并不适合所有场景。
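
As a small illustration of that rule of thumb (the types below are made up for this sketch, not taken from the original article): a plain struct has a statically known size and needs no allocation of its own, while a recursive enum only works once the recursion goes through a Box, and every Box is a heap allocation.

use std::mem::size_of;

// A plain user-defined struct: its size is known at compile time, so
// constructing one performs no allocation of its own.
struct Point {
    x: f32,
    y: f32,
}

// A recursive enum cannot contain itself directly (its size would be
// infinite), so the recursive arms go through a Box, and every Box is a
// heap allocation.
enum Expr {
    Literal(f64),
    Add(Box<Expr>, Box<Expr>),
}

fn main() {
    // Both sizes are known statically, even though building an Expr::Add
    // will allocate for its children at runtime.
    println!("Point: {} bytes", size_of::<Point>());
    println!("Expr:  {} bytes", size_of::<Expr>());

    let _on_stack = Point { x: 1.0, y: 2.0 }; // no allocation
    let _on_heap = Expr::Add(                 // two heap allocations
        Box::new(Expr::Literal(1.0)),
        Box::new(Expr::Literal(2.0)),
    );
}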

This is why you may have heard some complaints about Haskell's use of a linked list of characters to represent a string. I'm not going to beat gankro's wonderful rant on the subject, but suffice to say that this is more-or-less the worst possible data structure to choose when each individual element is small and the number of elements is large, because it needs to store a pointer for every element, instead of a single integer for the entire array.

你可能听到过对Haskell用字符链表表示字符串的抱怨,原因就在这里。作者不打算在这个话题上盖过gankro那篇精彩的吐槽,但可以这么说:当单个元素很小而元素数量很大时,这差不多是能选到的最糟糕的数据结构,因为它要为每个元素存一个指针,而不是为整个数组只存一个长度整数。

Not only that, but without some truly crazy compiler optimizations this means that each element may not be directly after the element before it in memory. We'll get to how to calculate this in a moment, but essentially that means that Haskell's String type can cause a cache miss up to twice per element, whereas if you had a vector of Chars (assuming 32-bit chars) it could only cause a maximum of 1 cache miss for every 16 elements. This is why the performance-savvy in languages such as Haskell and Lisp know to use vector-like constructs when possible.

不仅如此,在没有某些真正疯狂的编译器优化的情况下,这意味着每个元素在内存中未必紧跟在前一个元素后面。我们稍后会讲如何计算缓存未命中,但本质上这意味着Haskell的String类型每访问一个元素最多可能产生两次缓存未命中;而如果你有一个Char(假设是32位字符)的vector,每16个元素最多只会产生一次缓存未命中。这也是为什么Haskell、Lisp等语言里深谙性能之道的人都知道尽可能使用类似vector的结构。

Back in the world of Rust, this means that you should avoid indirection-heavy representations Vec<Vec<_>> to represent a matrix, since this means that each sub-vec will likely be in a different location. Use a data structure that uses a flat Vec as a backing array and builds on top of it, unless you really do need to have the inner Vecs be both jagged (each sub-vec is of different size) and growable (you can change the size of each sub-vec independently at runtime). You probably don't need either, let alone both. If you need them to be a uniform size, store a Vec and a dimension tuple, if you need them to be jagged but don't need them to be growable, store a list of indices and return slices of the flat Vec using those. For an example of why this is good, let's dive into some basic math. Let's assume a matrix backed by flat vector and a number of columns (the number of rows can be inferred from columns + data length):

回到Rust的世界,这意味着你应该避免用Vec<Vec<_>>这种间接层很多的表示方式来表达矩阵,因为每个子vec很可能位于内存中不同的位置。应该使用一个以平铺的Vec作为底层存储、在其上构建出来的数据结构,除非你确实需要内部的Vec既是参差不齐的(每个子vec大小不同)又是可增长的(运行时可以独立改变每个子vec的大小)。这两者你很可能一个都不需要,更别说同时需要了。如果你需要统一的大小,就存一个Vec加一个维度元组;如果需要参差不齐但不需要增长,就存一个索引列表,并据此返回平铺Vec的切片。为了说明这样做的好处,我们来做一点基本的数学。假设矩阵由一个平铺的vector加上列数表示(行数可以由列数和数据长度推出):

// This isn't how Vec is defined in the standard library but it's a simplified
// version with the same memory layout.
// 这不是标准库中Vec的实现,但它是一个简化版本,使用了相同的内存布局
struct Vec<T> {
    pointer: *mut T,
    capacity: usize,
    len: usize,
}

struct Matrix<T> {
    data: Vec<T>,
    num_columns: usize,
}
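
To make the flat representation concrete, here is a minimal, hypothetical sketch of row-major indexing built on the idea above (the FlatMatrix name and its new/get/set methods are assumptions for illustration, not part of any library):

// A minimal row-major matrix on top of a flat Vec. The whole payload is
// one contiguous allocation, so walking it in order is cache-friendly.
struct FlatMatrix<T> {
    data: Vec<T>,
    num_columns: usize,
}

impl<T: Copy> FlatMatrix<T> {
    fn new(num_rows: usize, num_columns: usize, fill: T) -> Self {
        FlatMatrix { data: vec![fill; num_rows * num_columns], num_columns }
    }

    fn get(&self, row: usize, col: usize) -> T {
        // Row-major layout: element (row, col) lives at row * num_columns + col.
        self.data[row * self.num_columns + col]
    }

    fn set(&mut self, row: usize, col: usize, value: T) {
        self.data[row * self.num_columns + col] = value;
    }
}

fn main() {
    let mut m = FlatMatrix::new(4, 4, 0.0f32);
    m.set(2, 3, 1.5);
    assert_eq!(m.get(2, 3), 1.5);
}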

So a matrix with N rows and M columns needs N * M * size_of::<T>() space for the elements, plus size_of::<*mut T>() + 3 * size_of::<usize>() for the "metadata" (the vector's pointer and the capacity, length and num_columns fields). If we're on a 32-bit CPU with 64 byte cache lines, we end up with both *mut T and usize being 4 bytes each. If we had a 4x4 matrix of f32 (also 4 bytes in size) this would mean:

因此一个N行M列的矩阵需要N * M * size_of::<T>()的空间存放元素,外加size_of::<*mut T>() + 3 * size_of::<usize>()字节存放“元数据”(vector的指针以及capacity、length和num_columns字段)。如果我们在一个32位、cache line为64字节的CPU上运行程序,*mut T和usize都是4字节长。如果我们有一个4x4的f32(同样是4字节大小)矩阵,那么:

元数据大小 Metadata size = 4 + 3 * 4 = 16
Maximum 2 cache misses
最多2次缓存未命中

数据大小 Data size = 4 * 4 * 4 = 64
Maximum 2 cache misses
最多2次缓存未命中

Since the metadata and the data could be in separate parts of memory, we have to calculate maximum cache misses separately. Both the metadata and the data could cross a cache line and require two cache misses to load. This means that the whole matrix would miss the cache 4 times in the worst case. If we had a Vec<Vec<f32>> representation that would mean the size of:

因为元数据和数据本身可能位于内存的不同位置,我们要分别计算各自的最大缓存未命中次数。元数据和数据都可能横跨一条cache line,因此各自最多需要2次缓存未命中才能读完。这意味着最坏情况下读取整个矩阵会有4次缓存未命中。如果我们用的是Vec<Vec<f32>>的表示方式,那么大小将变成:

矩阵元数据大小 Matrix metadata size = 4 + 2 * 4 = 12
Maximum 2 cache misses
最多2次缓存未命中

内部vector元数据大小 Inner vector metadata size = 4 * (4 + 2 * 4) = 48
Maximum 2 cache misses
最多2次缓存未命中

数据大小 Data size = 4 * 4 = 16
Maximum 2 cache misses per row (8 cache misses total)
每行最多2次缓存未命中,总共最多8次

This means that the Vec<Vec<f32>> representation could miss the cache up to 12 times to read the whole array, much worse than the flat representation.

这意味着Vec<Vec<f32>>的表示方式最多可能需要12次缓存未命中才能读完整个矩阵,比平铺的表示方式差得多。

Even better, if you statically know the matrix's size you can use statically-sized arrays like [[T; N]; N]. These are even cheaper than flat vectors, although you obviously can't use them for data with a variable size at runtime. The 4x4 array in the previous example would be [[f32; 4]; 4] and take up 64 bytes, meaning that it would only take 2 cache misses to load in the worst case.

更理想的情况是,如果你静态地知道矩阵的大小,就可以使用[[T; N]; N]这样的定长数组。它们比平铺的vector开销更小,当然你显然不能把它们用在运行时大小可变的数据上。上例中的4x4数组写作[[f32; 4]; 4],占用64字节,最坏情况下也只需要2次缓存未命中就能装载。
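
That size claim is easy to verify directly; a quick sanity-check sketch:

fn main() {
    // A statically-sized 4x4 matrix of f32: no pointer, no length fields,
    // just 64 contiguous bytes wherever the value itself lives.
    let matrix: [[f32; 4]; 4] = [[0.0; 4]; 4];

    assert_eq!(std::mem::size_of_val(&matrix), 64);
    assert_eq!(matrix[3][3], 0.0);
}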

Keep as much as possible in registers

尽可能的利用寄存器

Now, the absolute best place for your data - registers. The more work you can do without non-local writes the more that rustc and LLVM can assume about your data's access patterns. This is good because it means that data can be mapped to the CPU's physical registers, which are the fastest memory on your entire computer, but even better, if you make your data suitable for registers then anything that happens to that data can be aggressively optimized. Writes and reads through pointers have certain ordering restrictions on them that prevent optimization, but there are no such restrictions on register-allocated data.

接下来是数据绝对最佳的存放位置:寄存器。你能在不进行非本地写入的情况下完成的工作越多,rustc(译者注:Rust编译器)和LLVM(译者注:通用编译器后端,也是Rust、Swift等很多现代语言的编译后端)就越能对你的数据访问模式做出假设。这很好,因为这意味着数据可以被映射到CPU的物理寄存器上,这是整台计算机上最快的存储;更棒的是,如果你的数据适合放进寄存器,那么发生在这些数据上的一切都可以被激进地优化。经由指针的读写有一定的顺序限制,会妨碍优化,而寄存器分配的数据没有这些限制。

It's worth noting that since Rust restricts pointers more than C does, the ordering restrictions on pointers could be relaxed. This hasn't been implemented in LLVM yet since most of the optimization work is based on leveraging the rules of C and C-family languages. Even if they did implement relaxations on the reordering rules, however, storing data in registers will still be easier to optimize.

值得注意的是,由于Rust对指针的限制比C更严格,针对指针的顺序限制本可以放宽一些。这一点目前还没有在LLVM中实现,因为大部分优化工作都是围绕C及C系语言的规则展开的。但即使LLVM真的放宽了重排序规则,把数据放在寄存器里依然更容易优化。

So how do you get rustc to allocate things to registers? Essentially, the less pointers you have to write at runtime the better. Writing to local variables is better than writing through a mutable pointer. As much as possible, you should try to constrain mutable writes to the data that you have ownership over. So a mutable loop counter is fine, but passing a mutable reference to a loop counter through multiple layers of functions is not (unless they end up getting inlined, of course). This is really just an extension of one of my first points: clean, boring code is easier to optimize than spaghetti.

那么怎样才能让rustc把东西分配到寄存器上?本质上,运行时需要写入的指针越少越好。写本地变量比通过可变指针写要好。你应该尽可能把可变写入限制在你拥有所有权的数据上。所以一个可变的循环计数器没问题,但把指向循环计数器的可变引用穿过多层函数传递就不行了(当然,如果这些函数最终被内联就另当别论)。这其实只是作者开头观点的延伸:干净、乏味的代码比一团乱麻更容易优化。
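
As a hedged illustration of "writes to locals beat writes through pointers", the two functions below compute the same sum; the names are made up for this sketch. The first gives the optimizer an obvious register-sized accumulator, while the second funnels every update through a borrowed location unless it happens to be inlined.

// Accumulate into a local: the running total can live in a register for
// the whole loop, and only the final value has to be materialized.
fn sum_local(values: &[u64]) -> u64 {
    let mut total = 0;
    for &v in values {
        total += v;
    }
    total
}

// Accumulate through a mutable reference handed in from outside: unless
// this ends up inlined, the compiler has to be more conservative about
// where *total lives between updates.
fn sum_through_pointer(values: &[u64], total: &mut u64) {
    for &v in values {
        *total += v;
    }
}

fn main() {
    let data = [1u64, 2, 3, 4];
    let mut out = 0;
    sum_through_pointer(&data, &mut out);
    assert_eq!(sum_local(&data), out);
}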

Avoid Box<Trait>

避免使用Box<Trait>

The canonical way to create trait objects is Box<Trait>, but the majority of code can get away with &mut Trait, which also has dynamic dispatch but saves an allocation. If you absolutely need ownership then use a Box, but most use-cases can use an &Trait or &mut Trait. Even better is to avoid using a trait object all together. impl Trait is the obvious way to avoid them, but that doesn't allow you to store a heterogenous collection of elements that implement a single trait since it's basically type inference in a fancy hat. A good trick for when you want to allow a variable but finite number of implementors of a type because you want to choose between them or iterate over them, use either a tuple or a recursive generic struct like this:

创建特质对象的典型方式是Box<Trait>,但大多数代码用&mut Trait就够了,它同样是动态分派,但省掉了一次分配。如果你确实需要所有权,那就用Box,但多数场景用&Trait或&mut Trait即可。更好的做法是完全避免特质对象。impl Trait是避免它们的显而易见的方式,但它不允许你存储一个元素都实现同一特质的异质集合,因为impl Trait本质上只是穿了件花哨外衣的类型推断。当你想支持数量可变但有限的一组实现者,以便在它们之间选择或遍历它们时,有一个不错的技巧:使用元组,或者像下面这样的递归泛型结构体:

struct Cons<Head, Tail>(Head, Tail);

Since data structures in Rust don't add any indirection or space overhead, you can implement a trait for this structure recursively and have a function that can take any number of parameters that runs as fast as an equivalent function that takes a fixed number of parameters. Here's an example of how this could look for a function that takes a list of functions and calls them:

因为Rust当中的数据结构不会增加任何额外的间接指向或空间占用, 你可以为这样的结构体递归的实现一个特性,然后定义一个函数接收任意数量的参数,这与接收一个固定数量参数的函数版本运行速度是一致的。下面是接收一个函数列表集合并循环调用它们的一个例子:

Allocating version:

需要内存分配的版本:

fn call_all_fns(fns: Vec<Box<FnBox() -> ()>>) {
    for f in fns {
        f();
    }
}

Allocation-free version:

无需内存分配的版本:

struct Cons<First, Second>(First, Second);

trait HCons: Sized {
    fn cons<T>(self, other: T) -> Cons<Self, T> {
        Cons(self, other)
    }
}

impl<T: Sized> HCons for T {}

// This is a hack to get around the fact that manually implementing the `Fn`
// traits is currently unstable.
// 这是一个小技巧,用来绕开目前手动实现`Fn`系列特质仍不稳定这一问题
trait Callable {
    fn call(self);
}

impl<F: Fn() -> ()> Callable for F {
    fn call(self) { self() }
}

impl<First: Callable, Second: Callable> Callable for Cons<First, Second> {
    fn call(self) {
        self.0.call();
        self.1.call();
    }
}

fn call_all_fns_no_alloc<T: Callable>(fns: T) {
    fns.call();
}

Here's what they both look like in use:

下面看看如何使用这两个版本:

fn main() {
    let first_fn = || { println!("Hello!"); };
    let second_fn = || { println!("World!"); };
    
    call_all_fns(vec![Box::new(first_fn), Box::new(second_fn)]);
    
    let first_fn = || { println!("Hello!"); };
    let second_fn = || { println!("World!"); };
    
    call_all_fns_no_alloc(first_fn.cons(second_fn));
}

The functions passed to call_all_fns_no_alloc are eligible for inlining, they require no space overhead, and their instructions and data are directly next to each other in memory and are therefore much faster to access than if each of them were boxed. For example, in combine there's a choice function that takes an array that could contain trait objects, but it also supplies a .or() combinator (and a choice! macro that expands to recursive .or calls) that returns an Or<A, B> that in turn implements Parser. This means that dispatch is static and the objects are all stored in order in memory (because it's just a set of recursive structs). You will still need dynamic dispatch for some cases, but using this method means that the number of cases where this is necessary is very small.

传给call_all_fns_no_alloc的函数适合被内联,它们不需要额外的空间开销,它们的指令和数据在内存中紧挨着,因此比每个都装箱的版本访问起来快得多。例如,combine中有一个choice函数可以接收一个可能包含特质对象的数组,但它同时提供了.or()组合器(以及会展开成递归.or调用的choice!宏),返回一个Or<A, B>,而Or<A, B>本身也实现了Parser。这意味着分派是静态的,并且这些对象在内存中按顺序存放(因为它归根结底就是一组递归的结构体)。某些情况下你仍然需要动态分派,但用这种方法可以把必须动态分派的情形压到很少。

Use stack-based variable-length datatypes

在栈上分配可变长的数据类型

Fixed-length datatypes are trivially storable on the stack, but for dynamically-sized data it's not so simple. However, smallvec, smallstring and tendril are all variable-length datatypes that allow you to store small numbers of elements on the stack (shameless plug: smallstring was written by me). Due to the law of small numbers, you are very likely to have more of these small strings than larger ones. This is good because it reduces allocation, but it's great if you're storing these in a Vec or HashMap, since you will have less indirection and therefore better cache use. A good rule of thumb is to never have more than one layer of pointers to dereference before you reach your value (NASA enforces this rule in their C code, albeit for reliability and not performance).

定长的数据类型天然可以放在栈上,但对于动态大小的数据就没那么简单了。不过,smallvec、smallstring和tendril都是可变长度的数据类型,允许你把少量元素存放在栈上(无耻地插播一下:smallstring是原文作者写的)。根据小数定律,你的程序里小字符串很可能比大字符串多得多。这很好,因为它减少了分配;而如果你把它们存进Vec或HashMap,那就更棒了,因为间接层更少,缓存利用也就更好。一条不错的经验法则是:在取到目标值之前,需要解引用的指针永远不要超过一层(NASA在他们的C代码中强制执行这条规则,虽然是出于可靠性而不是性能)。
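
A short usage sketch, assuming the smallvec crate's SmallVec<[T; N]> type, smallvec! macro and spilled() method (check the crate's documentation for the current API):

use smallvec::{smallvec, SmallVec};

fn main() {
    // Up to 4 elements live inline (on the stack or inside a parent
    // struct); pushing a 5th element spills to a heap allocation.
    let mut v: SmallVec<[u32; 4]> = smallvec![1, 2, 3, 4];
    assert!(!v.spilled());

    v.push(5);
    assert!(v.spilled());
}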

Libraries like smallvec are great for cache locality, since an array of SmallVec<[T; 4]> will have exactly the same cache-locality as an array of just T - as long as the length of each SmallVec is below 8 it just gets stored in order. Going back to the cache-miss calculations from earlier:

像smallvec这样的库对缓存局部性非常友好,因为SmallVec<[T; 4]>的数组和单纯的T的数组有着同样的缓存局部性(译者注:前面说过数据结构在Rust中不会带来额外的性能和空间损耗,即所谓零成本抽象)——只要每个SmallVec的长度小于8,它们就都按顺序存放。回到前面缓存未命中的计算:

// This is a gross oversimplification of how this type is implemented in the
// crate, but it's enough to explain how it works.
// 这是`smallvec`这个crate的极端简化实现,但是足够解释发生的情况。
enum SmallVec<T> {
    Small([T; 4]),
    Big(Vec<T>),
}

type Matrix<T> = SmallVec<SmallVec<T>>;

As long as there are less than or equal to 4 elements in the SmallVec, the size of each instance is the size of the data plus the size of the tag, which is:

只要SmallVec中的元素不超过4个,每个实例的大小就是数据大小加上标签大小,也就是:

let size_of_data = size_of::<T>() * 4;
let size_of_tag  = max(size_of::<u8>(), align_of::<T>());
size_of_data + size_of_tag

The obvious question is why the size of the tag isn't just size_of::<u8>(). This is because if T was more than 1 byte in size, this would mean that all of the elements would all be unaligned by 1 byte, which is bad. CPUs work much slower on unaligned data, but unless you write a compiler you will never have to think about that. The size of the data and its alignment don't have to be the same. For structs, for example, the alignment is typically the largest alignment of any of its members. For primitive types like pointers, integers and floats the alignment is the same as its size. The alignment and size of an f32 are both 4. The alignment of a SmallVec<f32> is the largest alignment of its members, which is same as the alignment of [f32; 4], which is the same as the alignment of f32: 4.

这里有个显而易见的问题:为什么标签的大小不直接是size_of::<u8>()?这是因为如果T的大小超过1字节,这会让所有元素都错位1个字节,这很糟糕。CPU处理未对齐的数据要慢得多,不过除非你在写编译器,否则你永远不用操心这一点。数据的大小和它的对齐值不必相同。比如对结构体来说,对齐值通常是其所有成员中最大的那个对齐值;对指针、整数、浮点数这类原始类型来说,对齐值等于其大小。f32的对齐值和大小都是4。SmallVec<f32>的对齐值是其成员中最大的对齐值,也就是[f32; 4]的对齐值,也就是f32的对齐值:4。
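
Those size and alignment facts can be checked directly with std::mem; a quick sketch:

use std::mem::{align_of, size_of};

fn main() {
    // f32: size and alignment are both 4...
    assert_eq!((size_of::<f32>(), align_of::<f32>()), (4, 4));

    // ...and [f32; 4] keeps the alignment of f32 while packing 16 bytes.
    assert_eq!(align_of::<[f32; 4]>(), align_of::<f32>());
    assert_eq!(size_of::<[f32; 4]>(), 16);
}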

Consider we had a 4x4 matrix of f32, this would mean that the size of the matrix would be:

如果我们有一个4x4的f32矩阵,这意味着整个矩阵的大小会是:

内部SmallVec的大小 Inner SmallVec size = 4 * 4 + 4
矩阵尺寸 Matrix size = 4 * (4 * 4 + 4) + 4 = 84
Maximum 3 cache misses
最多3次缓存未命中

We don't need to calculate the inner and outer cache misses seperately because they are guaranteed to be next to each other in memory.

这里我们不需要分别计算内部数据和外部元数据的缓存未命中次数,因为它们保证在内存中是紧挨着的。

From a cache standpoint this is as good as the flat vector representation, but there's nothing stopping you from accidentally making the inner vectors different lengths and breaking the invariant that an array's rows should be the same length.

站在缓存的角度来看,这种做法和平铺vector的表示方式一样好,但是上面的做法无法阻止你不小心在内部vector中使用不同的长度,从而打破了矩阵每一行应该具有相同长度这个不变性要求。

I want to make something clear: you will never do these calculations in the process of optimizing your program. This is merely some mathematical justification for the voodoo folklore that "allocation is bad", since that is often countered by "malloc is fast". Both statements are true - the actual process of allocating and deallocating memory is fast, but data structures that allocate are worse for use-cases that require maximum speed.

这里还要明确一点:你永远不需要在优化程序的过程使用上面这些计算。这里的说明只是为了证明前面提到的那个近乎巫术的传说“内存分配很不好”,而这句话经常会碰到“malloc很快”的挑战。两句话都是对的,真正分配和释放内存过程确实很快,但是对于数据结构来说,内存分配在某些要求极端性能的场合是不理想的。

Loop unrolling is still cool

循环展开仍然很酷

Duff's device is fun, but array-length-generic unrolled loops are unlikely to be faster than the equivalent optimized naïve code nowadays, since any optimizing compiler worth its bits will do this kind of optimization without having to mangle your code and ruining future-you's day.

达夫设备很有意思,但是时至今日,这种基于数组长度通用的循环展开方式基本都不会比同等的简单朴素代码的优化效果好,因为现代编译器都能够自动帮你产生这种优化,而不需要你手动搅乱代码来摧毁可维护性。

Having said that, if you know that an array is likely to be a multiple of N size, try making it a &[[T; N]] and operating on a [T; N] in each iteration. This reduces the number of iterations (and therefore, the number of times you need to recalculate the loop variables) and allows the compiler to operate more aggressively on the loop body.

话说回来,如果你明确知道需要处理一个数组,尺寸是N的整数倍,你还是可以将它变成&[[T; N]],然后再每次循环迭代中操作一个[T; N]。这能够有效减少循环次数(同时减少你需要重新计算循环变量的次数),还能帮助编译器在循环体内进行更加有侵略性的优化。
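
A hedged sketch of that idea: if you know the length is a multiple of 4, you can walk the slice one [f32; 4] block per iteration (here via chunks_exact plus a conversion; how you obtain the fixed-size rows is up to you):

use std::convert::TryInto;

// Sum a slice whose length is a multiple of 4, handling one [f32; 4]
// block per iteration so the compiler sees a fixed-size inner loop.
fn sum_by_fours(values: &[f32]) -> f32 {
    assert!(values.len() % 4 == 0);

    let mut total = 0.0;
    for chunk in values.chunks_exact(4) {
        // chunk is a &[f32] of length 4; convert it into a fixed-size array.
        let block: [f32; 4] = chunk.try_into().unwrap();
        total += block[0] + block[1] + block[2] + block[3];
    }
    total
}

fn main() {
    let data = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    assert_eq!(sum_by_fours(&data), 36.0);
}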

You can also use more classical loop unrolling if it allows you to reduce the "strength" of your operations. This means that if you have to calculate some value for each iteration of the loop and calculating this value takes longer than the body itself, manually unroll the body so you can calculate it less. Example: you can implement an integer logarithm function like so:

还有一种更加经典的循环展开技巧,前提是使用这种技巧能够降低循环体内昂贵操作的成本。如果在循环体中每次迭代都需要计算一个值,而计算这个值的耗时可能超过了循环体迭代本身的话,手动展开循环能够降低计算成本。比方说,你可以如下实现一个整数对数函数:

fn log_base(mut n: usize, base: usize) -> usize {
    let mut out = 1;

    loop {
        if n < base { return out; }

        out += 1;
        n /= base;
    }
}

However, n /= base; out += 1; is slower to calculate than n < base. To take advantage of this fact, you can unroll the loop like so:

观察到上面的n /= base; out += 1;运算比n < base要慢。利用这一点,你可以如下展开循环:

fn log_base_unrolled(mut n: usize, base: usize) -> usize {
    const UNROLL_COUNT: usize = 4;

    // We use a fixed-size array to ensure that we don't get the array count and
    // the `out` skip value out of sync.
    // 这里我们使用固定长度的数组以免造成数组长度和`out`步长之间的不同步
    let premultiplied_base: [_; UNROLL_COUNT] = [
        base,
        base * base,
        base * base * base,
        base * base * base * base,
    ];

    let mut out = 1;
    
    loop {
        if n < premultiplied_base[0] { return out; }
        if n < premultiplied_base[1] { return out + 1; }
        if n < premultiplied_base[2] { return out + 2; }
        if n < premultiplied_base[3] { return out + 3; }
        
        n /= premultiplied_base[UNROLL_COUNT - 1];
        out += UNROLL_COUNT;
    }
}

Here are the benchmarks I used:

下面是作者用到的基准测试:

#[bench]
fn bench_log_base(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box(5000000120510250);

        assert_eq!(log_base(input, 10), 16);
    });
}

#[bench]
fn bench_log_base_unrolled(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box(5000000120510250);

        assert_eq!(log_base_unrolled(input, 10), 16);
    });
}

test::black_box is a magic function that prevents rustc and LLVM calculating those function calls at compile-time and converting them into a constant, which usually they would (actually, it's not magic, it's just some inline assembly that doesn't do anything, since neither rustc nor LLVM will try to optimize anything that's been accessed by inline assembly).

test::black_box是一个魔术函数,它阻止rustc和LLVM将函数调用在编译期计算出来然后放置到常量中,因为在优化情况下这两个编译器经常会这样做(事实上,也不是什么魔术,实际上这个函数只是一些内联的汇编代码然后不做任何事情,因为无论是rustc还是LLVM都不会尝试优化任何内联汇编访问的内容)。

This gives the following results:

基准测试结果如下:

test bench_log_base          ... bench:  18 ns/iter (+/- 0)
test bench_log_base_unrolled ... bench:   5 ns/iter (+/- 0)

Wait a minute, though, what happens when we give a non-constant value for base?

等一下,如果在base中也提供一个非常量值会如何?

#[bench]
fn bench_log_base_nonconstbase(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box(5000000120510250);
        let base = black_box(10);

        assert_eq!(log_base(input, base), 16);
    });
}

#[bench]
fn bench_log_base_unrolled_nonconstbase(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box(5000000120510250);
        let base = black_box(10);

        assert_eq!(log_base_unrolled(input, base), 16);
    });
}
test bench_log_base_unrolled_nonconstbase ... bench:  37 ns/iter (+/- 1)
test bench_log_base_nonconstbase          ... bench: 199 ns/iter (+/- 5)

They're both much slower! Can we do better? Turns out yes, we can:

它们都变慢了很多!我们可以做的更好吗?答案是肯定的:

fn log_base_increasing(n: usize, base: usize) -> usize {
    const UNROLL_COUNT: usize = 4;

    let premultiplied_base: [_; UNROLL_COUNT] = [
        base,
        base * base,
        base * base * base,
        base * base * base * base,
    ];

    if n < premultiplied_base[0] { return 1; }
    if n < premultiplied_base[1] { return 2; }
    if n < premultiplied_base[2] { return 3; }
    if n < premultiplied_base[3] { return 4; }

    let mut out = UNROLL_COUNT + 1;
    let mut mul = premultiplied_base[UNROLL_COUNT - 1];

    loop {
        if n < premultiplied_base[0] * mul { return out; }
        if n < premultiplied_base[1] * mul { return out + 1; }
        if n < premultiplied_base[2] * mul { return out + 2; }
        if n < premultiplied_base[3] * mul { return out + 3; }

        mul *= premultiplied_base[UNROLL_COUNT - 1];
        out += UNROLL_COUNT;
    }
}

#[bench]
fn bench_log_base_increasing(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box(5000000120510250);

        assert_eq!(log_base_increasing(input, 10), 16);
    });
}

#[bench]
fn bench_log_base_increasing_nonconstbase(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box(5000000120510250);
        let base = black_box(10);

        assert_eq!(log_base_increasing(input, base), 16);
    });
}

Let's check out the results now:

现在看一下全部的基准测试情况:

test bench_log_base                         ... bench:  18 ns/iter (+/- 0)
test bench_log_base_nonconstbase            ... bench: 199 ns/iter (+/- 5)

test bench_log_base_unrolled                ... bench:   5 ns/iter (+/- 0)
test bench_log_base_unrolled_nonconstbase   ... bench:  37 ns/iter (+/- 1)

test bench_log_base_increasing              ... bench:   6 ns/iter (+/- 0)
test bench_log_base_increasing_nonconstbase ... bench:   8 ns/iter (+/- 1)

Turns out the compiler was doing something sneaky: it can optimize integer division by a constant into a multiplication combined with a shift. When it could no longer fold the constant into the function it slowed down considerably. It's ok to rely on const-folding if it allows you to gain considerable speedups and you know that the function will usually be called with constant arguments, but be careful. The things to look out for are if statements and integer division, both of which can be much slower with non-constant values compared to constants.

原来编译器在背后做了点手脚:它能把除以常量的整数除法优化成一次乘法加一次移位。当它没法再把常量折叠进函数时,速度就大幅下降了。如果依赖常量折叠能带来可观的加速,而且你知道这个函数通常都会以常量参数调用,那么依赖它也没问题,但要小心。需要留意的是if语句和整数除法,这两者在参数不是常量时都可能比常量情形慢得多。

The fastest method by far converts to an f64, calls .log(base) on that, and then converts back. It doesn't work for large numbers, however, because of loss of precision. This is probably a good time to note that although adding and multiplying integers is faster than doing the same for floats, for code that does division by a non-constant value or something more complex like trigonometry, you should definitely use floats. The compiler can't do the conversion for you - it won't apply optimizations that make your code less precise - but you can check for areas where this would be an improvement and make the change manually.

目前最快的方法是把参数转成f64,对其调用.log(base),然后再转回来。不过它对大数不适用,因为会丢失精度。这里正好提一句:虽然整数的加法和乘法比浮点数快,但对于要除以非常量、或者要做三角函数之类更复杂运算的代码,你绝对应该用浮点数。编译器不会替你做这种转换——它不会应用让你的代码变得更不精确的优化——但你可以自己找出哪些地方这样改会更快,然后手动修改。
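
For completeness, a sketch of the float-based approach described above; as the paragraph warns, it loses precision for large inputs, so treat the boundary behaviour as something to verify rather than rely on:

// Count the digits of n in the given base by going through f64. Fast, but
// f64 only has 53 bits of mantissa, so for very large values near a power
// of the base this can be off by one.
fn log_base_float(n: usize, base: usize) -> usize {
    (n as f64).log(base as f64) as usize + 1
}

fn main() {
    assert_eq!(log_base_float(5000000120510250, 10), 16);
    assert_eq!(log_base_float(0xFFFF, 16), 4);
}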

assert! conditions beforehand

提前使用assert!断言条件

If you want to reduce the number of implicit asserts that get compiled into the code, then instead of this:

如果你希望减少编译器向代码中插入隐式断言的话,你可以改写下面的代码:

fn do_something_with_array(array: &[u8]) -> u8 {
    array[0] + array[1] + array[2] + array[3] + array[4] + array[5]
}

Do this:

提前增加断言:

fn do_something_with_array(array: &[u8]) -> u8 {
    assert!(array.len() >= 6);
    array[0] + array[1] + array[2] + array[3] + array[4] + array[5]
}

This allows LLVM to realize that the later asserts are unreachable and elides them. This is useful for any code that may assert multiple different qualities about the same data, but is especially useful for indexing since we know that if array[n] succeeds then array[n - 1] will succeed too. This is similar to the point about fixed-length arrays in the previous section.

这能让LLVM意识到后面的断言不可能失败,从而把它们省略掉。这对任何可能就同一份数据断言多个性质的代码都有用,但对索引访问尤其有用,因为我们知道如果array[n]能成功,那么array[n - 1]也能成功。这与上一节关于定长数组的观点类似。

Essentially, try to consolidate checks into a single assert!. This means that the later checks become statically unreachable. If LLVM/Rust still don't optimize it away you can switch to using the unsafe indexing methods while ensuring that they're still safe. This tip is shamelessly stolen from a comment on /r/rust.

本质上,就是把检查合并进一个assert!里,这样后面的检查就成了静态不可达的。如果LLVM/Rust仍然没有把它们优化掉,你可以在确保安全的前提下改用unsafe的索引方法。这个技巧厚颜无耻地偷自reddit上/r/rust的一条评论。
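
If the bounds checks still show up after the assert!, the unsafe-indexing fallback mentioned above looks roughly like this (a sketch: the assert! is what keeps the unchecked accesses sound):

fn sum_first_six(array: &[u8]) -> u8 {
    // One up-front check makes every later access provably in bounds.
    assert!(array.len() >= 6);

    // Safety: indices 0 through 5 are all below array.len() thanks to the
    // assert above.
    unsafe {
        *array.get_unchecked(0)
            + *array.get_unchecked(1)
            + *array.get_unchecked(2)
            + *array.get_unchecked(3)
            + *array.get_unchecked(4)
            + *array.get_unchecked(5)
    }
}

fn main() {
    assert_eq!(sum_first_six(&[1, 2, 3, 4, 5, 6, 7]), 21);
}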

Use link-time optimization

使用链接期优化

Normally, Rust can only inline functions that are either defined in-crate or, in the case of functions in other libraries, have #[inline] specified. LTO allows the compiler to inline cross-crate, at the cost of a compile-time speed penalty. I am of the opinion that compile times only matter for debug builds, so that's a tradeoff I'm willing to make. As with everything else here, profile and check that the tradeoff is worthwhile.

通常,Rust只能内联本crate中定义的函数,或者其他库中标注了#[inline]的函数。链接期优化LTO允许编译器跨crate内联,代价是更长的编译时间。作者的观点是编译时间只对debug构建重要,所以这是他愿意做的取舍。和本文的其他建议一样,请用性能分析来确认这个取舍是否值得。
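
In Cargo terms this is a one-line addition to the release profile, roughly:

# Cargo.toml
[profile.release]
lto = true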

Don't use #[inline(always)]

不要使用#[inline(always)]

#[inline(always)] feels good as a performance hint, but the truth is that optimizing compilers are really good at working out when a function would benefit from being inlined, and Rust isn't constrained to the slower standardized C calling convention and can use fastcc, making function calls extremely cheap. You're more likely to cause the size of your executable to bloat. This takes up more space on your hard drive, of course, but that's not too much of a problem. If you have even a single bundled asset like images or audio they will likely dwarf the size of your executable.

#[inline(always)]看起来是个不错的性能提示,但事实上优化编译器非常擅长判断一个函数内联后是否会受益,而且Rust并不受限于较慢的标准C调用约定,它可以使用fastcc,这让函数调用变得极其廉价。你更可能得到的结果是可执行文件体积膨胀。这当然会占用更多硬盘空间,不过这不算什么大问题:只要你的程序捆绑了哪怕一个图片或音频之类的资源,它的体积多半就会远远超过可执行文件本身。

The real issue here is that it can make your program no longer fit in the CPU's instruction cache. The CPU will only have to go to RAM for its instructions when functions are called with instructions outside of the current cache line. The larger your binary is, the more likely this is, and the more functions are inlined, the larger your binary is. It's not the end of the world to have a large binary, but unless you're really certain that something will be improved by manually marking it as inline, and you have benchmarks to back that up, it's just as likely that you'll slow a program down with careless inlining than to speed it up.

真正的问题在于这会导致你的程序体积不再适合放置在CPU的指令缓存中。CPU将不得不去内存当中读取当前调用函数的代码段到缓存line当中。你的二进制文件体积越大,这种情况就越频繁。不过一个巨大的二进制文件也不是世界末日,但除非你确定如果某些函数手动内联会提升效率,并且进行了基准测试证明这一点,否则并不是在提升程序性能,反而可能因为滥用内联而降低了性能。

So now I've scared you off inlining, let's talk about when you should explicitly add inlining annotations. Small functions that are called often are a good target for inlining. Iterator::next, for example, or Deref::deref. The overhead from calling these functions may be larger than the time it takes to run the function itself. These are likely to be automatically inlined when called internally, but marking these as #[inline] will allow users of your library to inline them too, even if they don't use LTO. Only functions marked #[inline] will be considered for cross-crate inlining, but that means the definition has to be stored in the compiled library, causing bloat and increasing compile times. #[inline(always)] is even more niche, but it's sometimes nice to ensure that a tiny function will be inlined, or as a kind of documentation that the function call is free for if someone comes along and tries to manually inline it to improve performance. It really is very rare that you would want to do this, though, and it's best to just trust the compiler.

好了,既然作者已经把你吓得不敢内联了,那就来说说什么时候应该显式加上内联标注。经常被调用的小函数是很好的内联对象,比如Iterator::next或Deref::deref,调用这些函数的开销可能比函数本身的执行时间还大。它们在crate内部被调用时很可能会被自动内联,但标上#[inline]可以让你的库的使用者在不开LTO的情况下也能内联它们。只有标了#[inline]的函数才会被考虑跨crate内联,但这意味着函数定义必须存进编译出的库里,导致体积膨胀和编译时间增加。#[inline(always)]的适用场景更小众,但有时用它来确保一个极小的函数被内联也不错,或者把它当作一种文档,告诉后来想通过手动内联来提速的人:这个函数调用是无开销的。不过你真正需要这么做的情况非常少,最好还是相信编译器。

The other class of functions that are good targets for annotating inlining are ones that you know to often be called with constant parameters. We go into this later on, but {integer}::from_str_radix is an exellent example of this. Most uses of this function will have a constant as the second parameter, and so by judicious use of #[inline] we can prevent branching and expensive operations like division for the consumers of our library. It's not worth losing sleep over though, since they could just use link-time optimization if they need to squeeze out every last drop of performance.

另一类适合标注内联的函数,是那些你知道经常会以常量参数调用的函数。我们后面会再谈这点,{integer}::from_str_radix就是一个极好的例子。这个函数的大多数用法里第二个参数都是常量,因此审慎地使用#[inline]可以为库的使用者省去分支和除法之类的昂贵操作。不过也不必为此夜不能寐,因为如果他们真要压榨出最后一点性能,还可以直接用链接期优化。

Also, the compiler does really get it wrong sometimes, and can miss out on inlining opportunities that would improve code speed. However, only add the #[inline(always)] annotation if you can prove with benchmarks that it improves the speed; adding these annotations is a bit of a dark art. Your effort is probably better spent elsewhere.

还有些情况下,编译器确实会搞错,没有采用函数内联以提升性能。但是,永远只在你能使用基准测试证明性能得到提升的情况下使用#[inline(always)],增加这个标记有时就像黑魔法。你的精力可能更值得花费到其他的地方。

If you want to reduce the size of your code, you can try using panic = "abort". This removes the "landing pads" that allow Rust to show a nice stack trace after a panic, and causes any panic to end the program instantly. I have legitimately seen non-trivial speedups on benchmarks for the ion shell after adding this option to the release build, and I can only attribute it to making more code fit in the instruction cache. I have not tried it with many other programs, but it would probably only affect medium to large projects. Try it out on your codebase, it's as easy as adding one line to the Cargo.toml and it may improve your code's speed.

如果你想减小代码体积,可以试试panic = "abort"。它会移除那些让Rust能在panic后展示漂亮调用栈的“landing pads”(栈展开着陆点),并让任何panic都立即终止程序。作者确实观察到,在ion shell的发布构建中加上这个选项后,基准测试有不小的提升,他只能把原因归结为有更多代码装进了指令缓存。作者没有在很多其他程序上试过,它可能只对中大型项目有影响。在你的代码库里试试吧,只需要在Cargo.toml里加一行,就可能让你的代码更快。
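
The one-line Cargo.toml change the paragraph refers to looks like this:

# Cargo.toml
[profile.release]
panic = "abort"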

Parallelize, but not how you think

并行化,但不是你认为的那样

There's an absolutely amazing library for Haskell called Haxl, that automatically tracks the data dependencies of your network requests and batches them and runs them asynchronously as long as they don't overlap. It's something that shows the power of computational abstractions like monads and it's not something that has a, ahem, parallel in any other language, as far as I know. At least, not for IO. We've had this exact ability in the CPU for a long, long time. The CPU tracks the data dependencies of computations and will parallelize them wherever possible.

Haskell里有一个非常神奇的库叫Haxl,它能自动追踪网络请求之间的数据依赖,把它们合并成批,并在互不重叠时异步地执行。它展示了monad这类计算抽象的威力,而且据作者所知,其他语言里还没有能与之,咳咳,“并行”的东西——至少在IO方面没有。但在CPU里,我们拥有这种能力已经非常非常久了。CPU会追踪计算之间的数据依赖,并在可能的地方并行执行它们。

The reason data dependencies matter is that the CPU doesn't just execute one instruction at a time. As long as two instructions don't share a register they can safely be run simultaneously, so the CPU does so. This is essentially free parallelism without the need for locks, work queues or anything that affects your architecture at all, so you would be crazy not to take advantage of it.

数据依赖性的重要性是在于CPU并不是同一时间只执行一条指令的。如果两条指令没有共享寄存器内容的话,它们就可以安全的被同时运行。这是免费的并行化,不需要使用锁、任务队列或任何影响你代码架构的东西,因此如果你不使用它是难以置信的。

Parallelizable computation also lends itself well to autovectorization, which is the process where the compiler realizes that you're doing the same thing to multiple different values and converts it to a special instruction that, well, does the same thing to multiple different values.

并行计算同样还会进行自动向量化操作,这指的是编译器能够发现你的代码尝试对多个不同的值进行相同的操作时,会将这些代码转换成一条特殊的指令,同时对多个不同的值一次性完成该计算。

For example, the compiler could translate the following numerical code:

例如,编译器能够将下面的代码:

(a1 + a2) + (b1 + b2) + (c1 + c2) + (d1 + d2)

into just one instruction that executes all four subexpressions as fast as a single addition.

翻译成只需要一个加法就完成所有四个子加法运算的操作。

%intermediate-value = add-vectors [%a1 %b1 %c1 %d1] [%a2 %b2 %c2 %d2]
sum-parts %intermediate-value
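
A sketch of the kind of Rust that lends itself to this: fixed-size, element-wise work with no dependencies between output elements, which the optimizer is free to turn into vector instructions (whether it actually does depends on the target and optimization level):

// Element-wise addition over fixed-size arrays: each output element is
// independent of the others, so the compiler is free to compute several
// of them at once with vector instructions.
fn add_arrays(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}

fn main() {
    let a = [1.0f32; 8];
    let b = [2.0f32; 8];
    assert_eq!(add_arrays(&a, &b), [3.0f32; 8]);
}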

A case study

实际案例

Let's write a version of usize::from_str_radix that's about 30% faster than the one in the standard library.

下面让我们写一个usize::from_str_radix的函数版本,能够比标准库的实现提升大约30%的性能。

// We're redefining these here since they're private in the stdlib
// 我们需要重新定义以下数据结构,因为在标准库中它们是私有的
#[derive(Debug, Clone, PartialEq, Eq)]
struct ParseIntError {
    kind: IntErrorKind,
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum IntErrorKind {
    Empty,
    InvalidDigit,
    Overflow,
    Underflow,
}

#[inline]
fn from_str_radix(input: &str, radix: usize) -> Result<usize, ParseIntError> {
    fn to_digit_ascii(ascii: u8, radix: usize) -> Result<usize, ParseIntError> {
        let decimal_digit = ascii.wrapping_sub(b'0');

        if radix > 10 && decimal_digit > 9 {
            let out = (ascii | 32).wrapping_sub(b'a') as usize;

            if out >= radix - 10 {
                Err(ParseIntError { kind: IntErrorKind::InvalidDigit })
            } else {
                Ok(out + 10)
            }
        } else {
            let decimal_digit = decimal_digit as usize;
            if decimal_digit >= radix {
                Err(ParseIntError { kind: IntErrorKind::InvalidDigit })
            } else {
                Ok(decimal_digit)
            }
        }
    }

    if radix > 36 {
        panic!("from_str_radix: radix is too high (maximum 36)");
    }

    let bytes = input.as_bytes();

    if bytes.len() == 0 {
        return Err(
            ParseIntError { kind: IntErrorKind::Empty }
        );
    }

    let bytes = match bytes[0] {
        b'+' => { &bytes[1..] },
        b'-' => { return Err(ParseIntError { kind: IntErrorKind::Underflow }) },
        _ => bytes,
    };

    let mut mul = radix;
    let mut index = bytes.len() - 1;

    let mut output = to_digit_ascii(bytes[index], radix)?;

    for &byte in bytes[..index].iter().rev() {
        let digit = to_digit_ascii(byte, radix)?;

        let next_output = output.wrapping_add(digit * mul);

        if output > next_output {
            return Err(
                ParseIntError { kind: IntErrorKind::Overflow }
            );
        }

        mul *= radix;
        output = next_output;
    }

    Ok(output)
}

I explicitly use wrapping_* functions not for optimization purposes (because overflow checks are removed at runtime), but because overflow is required for correct behaviour. You'll notice some optimizations here:

  • We start at the end and work backwards, keeping a "mul" counter. Originally I wrote a version of this that works forwards and multiplies output by radix each loop but the backwards method is 10% faster. This seems to be due to better instruction-level parallelism. The multiplications can be parallelized and only the addition relies on the previous iteration's value for output, and addition is much faster. Any folding operations can be improved in this way by exploiting the algebraic laws (in this case, the distributive law) to improve the number of operations that can be done in parallel.
  • We rely on overflow to test A < x < B comparisons. This is only useful here because we already need the x - A value, so we're saving an extra comparison and a bitwise and. In most code A < x && x < B is as cheap or cheaper than x - A < B with overflow.
  • We use | 32 to unify the codepaths for upper- and lowercase letters, reducing the number of comparisons we need to do.
  • We don't do output + digit * mul when output == 0 and mul == 1. This seems to be consistently 1ns faster, but it's possible that this doesn't make a difference and the 1ns speedup I'm seeing is pure luck. I reran the benchmarks both with and without the change multiple times and saw a consistent difference but this doesn't rule out luck. This is the problem with microbenchmarks, when the difference becomes small enough you can't tell whether you're really making it faster.
  • We use Rust's safe iterator interface, which is as fast as the C idiom of storing a "start" and "end" pointer so you can check whether the loop is finished with a simple ==. If you ever hear someone say "Rust's safety guarantees are useless because you need to drop down to unsafe to get any real speed" (I've seen this almost verbatim on Hacker News before) you can show them this.
  • We don't rely on const-folding in order to make our code fast, but it does run faster with a constant value for radix. Therefore, we add #[inline] to allow downstream crates to apply const-folding too.

作者这里使用了wrapping_*函数,不是为了优化目的(因为在运行时溢出检查会被移除),而是为了程序的正确性。你可以在代码里面看到这些优化:

  • 我们从字符串末尾开始往回处理,期间维护一个mul乘数。作者最初写过一个从头开始、每次循环把output乘以radix的版本,但从末尾开始的版本快了10%。这似乎是因为指令级并行更好:乘法可以并行执行,只有加法依赖上一次迭代的output值,而加法要快得多。任何折叠类的操作都可以这样利用代数定律(这里是分配律)来增加可以并行完成的操作数量。
  • 我们依赖溢出来做A < x < B这类比较。这只在这里有用,因为我们本来就需要x - A的值,所以省下了一次额外的比较和一次按位与。在大多数代码里,A < x && x < B和利用溢出的x - A < B一样便宜甚至更便宜。
  • 我们用| 32把大写和小写字母统一到同一条代码路径上,减少需要做的比较次数。
  • 在output == 0且mul == 1时,我们不做output + digit * mul。这在测试里稳定地快了大约1ns,但也有可能其实没有区别,看到的1ns提升纯属运气。作者在修改前后都反复跑了多次基准测试,结果始终有差异,但仍不能排除运气成分。这就是微基准测试的问题:当差异足够小时,你没法判断自己是否真的让代码变快了。
  • 我们用的是Rust的安全迭代器接口,它和C里保存“起始”与“结束”指针、用一个简单的==判断循环是否结束的惯用写法一样快。如果你听到有人说“Rust的安全保证没有用,因为要拿到真正的速度就得用unsafe”(作者在Hacker News上见过几乎一字不差的原话),你可以把这个拿给他们看。
  • 我们并没有依赖常量折叠来让代码变快,但radix是常量时它确实跑得更快。因此我们加上#[inline],让下游的crate也能应用常量折叠。

The method I used for this is the basic method you should use for any optimization work: write a representative benchmark and then progressively tweak and rerun benchmarks until you can't shave off any more cycles. Doing this for pure functions is much easier, so one of the first things you should do to optimize any function that's called in a tight loop is to make it pure. This avoids indirect writes and reads (reads and writes to places in memory that are likely to be outside cache lines) and makes benchmarking much, much easier. If you use test-driven development for reliability, this is the equivalent for performance.

上面例子中作者采用的方法也是你在任何优化任务中应该使用的方法:写一个典型的基准测试然后逐步调优和重复进行基准测试直到你已经无法在继续优化了。完成这项任务对于一个纯函数会更加容易,所以你在优化任何频繁调用的函数时应该做的第一件事就是让这个函数变为纯函数。这避免了间接的读写操作(读和写内存很容易发生在缓存line之外),这同样会使得基准测试变得非常容易。如果你在可靠性上采取了测试驱动开发TDD实践的话,这对于性能也是一样的。

Extending this to work on signed integer types is an exercise for the reader. Tip: unlike C, you can rely on signed integers overflowing with 2's complement arithmetic.

把上面的代码扩展到带符号整数类型,就留作读者的练习。提示:与C不同,你可以依赖带符号整数按二进制补码规则回绕溢出这一行为。

Here are the functions I used to benchmark the code:

下面是基准测试的代码:

#[bench]
fn bench_from_str(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box("1235112512");
        assert_eq!(from_str_radix(input, 10), Ok(1235112512));
        let input = black_box("FFaf125A");
        assert_eq!(from_str_radix(input, 16), Ok(0xFFaf125A));
    });
}

#[bench]
fn bench_from_str_native(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box("1235112512");
        assert_eq!(usize::from_str_radix(input, 10), Ok(1235112512));
        let input = black_box("FFaf125A");
        assert_eq!(usize::from_str_radix(input, 16), Ok(0xFFaf125A));
    });
}

#[bench]
fn bench_from_str_nonconstradix(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box("1235112512");
        let radix = black_box(10);
        assert_eq!(from_str_radix(input, radix), Ok(1235112512));
        let input = black_box("FFaf125A");
        let radix = black_box(16);
        assert_eq!(from_str_radix(input, radix), Ok(0xFFaf125A));
    });
}

#[bench]
fn bench_from_str_native_nonconstradix(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box("1235112512");
        let radix = black_box(10);
        assert_eq!(usize::from_str_radix(input, radix), Ok(1235112512));
        let input = black_box("FFaf125A");
        let radix = black_box(16);
        assert_eq!(usize::from_str_radix(input, radix), Ok(0xFFaf125A));
    });
}

#[bench]
fn bench_from_str_1char(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box("1");
        assert_eq!(from_str_radix(input, 10), Ok(1));
        let input = black_box("F");
        assert_eq!(from_str_radix(input, 16), Ok(0xF));
    });
}

#[bench]
fn bench_from_str_native_1char(b: &mut Bencher) {
    b.iter(|| {
        let input = black_box("1");
        assert_eq!(usize::from_str_radix(input, 10), Ok(1));
        let input = black_box("F");
        assert_eq!(usize::from_str_radix(input, 16), Ok(0xF));
    });
}

Results:

结果如下:

test bench_from_str                      ... bench:          22 ns/iter (+/- 7)
test bench_from_str_native               ... bench:          36 ns/iter (+/- 0)
test bench_from_str_nonconstradix        ... bench:          26 ns/iter (+/- 0)
test bench_from_str_native_nonconstradix ... bench:          39 ns/iter (+/- 0)
test bench_from_str_1char                ... bench:           5 ns/iter (+/- 0)
test bench_from_str_native_1char         ... bench:          13 ns/iter (+/- 0)

Something I've noticed with benchmarks below 1ms is that it can take some time to "spin up" the CPU. Occasionally the first benchmark in a set will take 20-30ns longer than the ones after it. If you duplicate the benchmark verbatim and take the number with the lowest variance this avoids the issue. I think this is due to the CPU needing to gather information in order to do proper branch prediction. Ideally you'd just not do micro-benchmarks, but some functions do legitimately call for it. Don't trust a benchmark, especially a microbenchmark, until you've rerun it multiple times.

作者注意到,当基准测试耗时低于1ms时,CPU可能需要一点时间来“热身”。偶尔,一组基准测试中的第一个会比后面的多花20-30ns。把同一个基准测试原样复制几份,取方差最小的那个数值,可以避开这个问题。作者认为这是因为CPU需要先收集信息才能做出正确的分支预测。理想情况下你根本不该做微基准测试,但有些函数确实值得这么做。在多次重跑之前,不要轻信任何基准测试,尤其是微基准测试。

When I reran this particular benchmark (at least 10 times in total, not including the benchmarks I ran while editing the code) to ensure that the numbers were stable, and although the averages are extremely stable (the native one sometimes was slightly slower, the 36ns value above is what I see most of the time), the variances are mostly 0-3ns with spikes of 13-26ns. I don't have a good explanation for this, expect a follow-up post with tips on writing better benchmarks.

作者把这组基准测试反复跑了很多次(总共至少10次,不包括改代码期间跑的那些)以确认数字稳定。虽然平均值非常稳定(原生实现偶尔会稍慢一些,上面记录的36ns是最常见的值),但方差大多在0-3ns之间,偶尔会冲到13-26ns。作者对此没有很好的解释,请期待后续一篇关于如何写出更好的基准测试的文章。

This is a perfect example of why low-level optimization is important, since this is exactly the kind of function that could be used hundreds of thousands of times in parsers of textual data and a 10ns speedup here could lead to meaningful improvements over the life of the program. It's also an example of why you should avoid low-level optimization when possible. The original stdlib implementation of this function isn't the most idiomatic code, but it's significantly more readable than this. Having said that, though, it's a testament to Rust's commitment to zero-cost abstractions that you can write mostly-idiomatic, safe code and have it perform as well as equivalent C/C++ code that would require use of unsafe pointer arithmetic.

这个例子完美说明了为什么底层优化很重要:这正是那种在解析文本数据时可能被调用几十万次的函数,这里10ns的提升在程序的整个生命周期里会带来实实在在的改善。它同时也说明了为什么应该尽量避免底层优化:这个函数原本的标准库实现虽然算不上最符合惯用风格的代码,但可读性比这段代码强得多。话说回来,你可以写出基本符合惯用风格的安全代码,而性能却能与需要unsafe指针运算的等价C/C++代码持平,这正是Rust对零成本抽象承诺的一个明证。

Wrapping up

总结

If I had to sum the trick to optimization up in a pithy QotW-ready snippet it would be this:

The fastest code is code that doesn't run at all, the second-fastest code is code that never stops running.

如果非要作者用一句精炼的、够资格当“每周语录”的话来总结优化的诀窍,那就是:

最快的代码是根本不运行的代码,第二快的代码是永不停下来的代码。

Ideally, you want to do less work, and if you're doing the minimum amount of work you want to reduce the amount of time the CPU spends waiting around.

理想情况下,你想做的是更少的工作;而当工作量已经降到最低时,你要做的就是减少CPU干等的时间。

If you want more Rustic performance tips with more numbers I would consider BurntSushi's excellent post-mortem of ripgrep required reading for anyone wanting to write fast software (it was the thing that originally sent me down this deep, deep rabbit hole). For more general systems language-y points, check out Andrei Alexandrescu's talk "Fastware", from which the from_str_radix and log_base code was adapted. A lot of the points in this article are expansions upon points behind one of those two links.

如果你想要更多带数字的Rust性能技巧,作者认为BurntSushi那篇出色的ripgrep复盘是所有想写出高性能软件的人的必读文章(正是它当初把作者推进了这个深不见底的兔子洞)。至于更通用的系统语言层面的要点,可以看看Andrei Alexandrescu的演讲"Fastware",前面的from_str_radix和log_base代码就改编自它。本文的很多观点都是对这两个链接背后内容的展开。

I hope that whether you're a soft-shell Rustacean or a grizzled veteran, this has given you a better sense of when some code may be poorly performing, and what to do about it. Go make things go vroom.

作者希望,无论你是刚脱壳的Rust新蟹还是久经沙场的老兵,这篇文章都能让你更敏锐地察觉哪些代码可能性能不佳,以及该拿它们怎么办。去让你的程序嗖嗖地飞起来吧。
