
@oconnor663
Created May 13, 2022 04:00
early Bao video subtitles
[Script Info]
; Script generated by Aegisub 3.2.2
; http://www.aegisub.org/
Title: Default Aegisub file
ScriptType: v4.00+
WrapStyle: 0
ScaledBorderAndShadow: yes
YCbCr Matrix: TV.601
PlayResX: 1280
PlayResY: 720
[Aegisub Project Garbage]
Audio File: /home/jacko/Downloads/yt5s.com-BLAKE2b and the bao tree mode SIMD and multithreaded hashing in Rust.mp4
Video File: /home/jacko/Downloads/yt5s.com-BLAKE2b and the bao tree mode SIMD and multithreaded hashing in Rust.mp4
Video AR Mode: 4
Video AR Value: 1.777778
Video Zoom Percent: 0.250000
Scroll Position: 18
Active Line: 24
Video Position: 4341
[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Default,Arial,50,&H00FFFFFF,&H000000FF,&H00000000,&H00000000,0,0,0,0,100,100,0,0,1,2,2,2,10,10,10,1
[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:00.86,0:00:05.26,Default,,0,0,0,,Alright, great, so...
Dialogue: 0,0:00:05.26,0:00:06.08,Default,,0,0,0,,Who hashes stuff?
Dialogue: 0,0:00:06.08,0:00:09.92,Default,,0,0,0,,Anybody ever use like `md5sum`, `sha1sum`, those things on the command line?
Dialogue: 0,0:00:09.92,0:00:14.94,Default,,0,0,0,,Sometimes we use `md5sum`, but we probably shouldn't. Sometimes we use `sha1sum`, but we probably shouldn't.
Dialogue: 0,0:00:14.94,0:00:21.81,Default,,0,0,0,,You may have heard that SHA-1 was broken, about two years ago now, because time moves really fast, by some very expensive machines at Google.
Dialogue: 0,0:00:21.81,0:00:26.32,Default,,0,0,0,,Those are designed to be very fast. And it turns out that using Rust, we can make things that are even faster.
Dialogue: 0,0:00:26.32,0:00:34.40,Default,,0,0,0,,So let's look real quick. `md5sum` of a file "f"...
Dialogue: 0,0:00:34.40,0:00:38.44,Default,,0,0,0,,Always do this twice, because you need to make sure it's in memory.
Dialogue: 0,0:00:38.44,0:00:46.03,Default,,0,0,0,,Right, so that's `md5sum`, that's a benchmark there. "f" is, let's see here, almost exactly 1 GB in size. It's just full of random data.
Dialogue: 0,0:00:46.03,0:00:54.67,Default,,0,0,0,,So that's `md5sum`. We also have `sha512sum`. That's probably the one we should be using if we're doing anything important.
Dialogue: 0,0:00:54.67,0:00:57.55,Default,,0,0,0,,Which is a little slow, which is kind of depressing.
Dialogue: 0,0:00:57.55,0:01:06.14,Default,,0,0,0,,It turns out the fastest guy on the block right now is still `sha1sum`, unfortunately, which clocks in at 1.3-ish.
Dialogue: 0,0:01:06.14,0:01:12.52,Default,,0,0,0,,But I always -- sorry, this isn't necessarily clear -- we're looking at this number over here. That's the actual time that passed.
Dialogue: 0,0:01:12.52,0:01:20.56,Default,,0,0,0,,So 1.3, that's pretty cool. But we shouldn't be using it anymore, because it's a broken hash function. So what can we use that's modern, that might actually conceivably be faster.
Dialogue: 0,0:01:20.56,0:01:27.50,Default,,0,0,0,,Some of you guys might've heard of a hash function called BLAKE2. Anybody raise their hand who's actually heard that name before?
Dialogue: 0,0:01:27.50,0:01:41.52,Default,,0,0,0,,There is a BLAKE2 implementation on most people's machines, if you have like GNU Coreutils. So let me try that. `/usr/bin/b2sum`...live coding...
Dialogue: 0,0:01:41.52,0:01:43.15,Default,,0,0,0,,And unfortunately that was a little slow.
Dialogue: 0,0:01:43.15,0:01:45.24,Default,,0,0,0,,Despite being a modern hash function it's a little slow.
Dialogue: 0,0:01:45.24,0:01:52.52,Default,,0,0,0,,The reason it's slow is that SHA-1, under the covers, is linking against OpenSSL, with a very fancy implementation of SHA-1.
Dialogue: 0,0:01:52.52,0:01:56.41,Default,,0,0,0,,GNU Coreutils is just using portable C code, which is very clean, but it's not quite as fast.
Dialogue: 0,0:01:56.41,0:02:03.52,Default,,0,0,0,,It turns out that using Rust's just recently stabilized "SIMD intrinsics", we can get quite a lot faster.
Dialogue: 0,0:02:03.52,0:02:14.00,Default,,0,0,0,,So if I run `b2sum` directly -- that's actually a binary that I've compiled on my machine...
Dialogue: 0,0:02:14.00,0:02:22.08,Default,,0,0,0,,Boom! We beat SHA-1. We beat SHA-1 using a set of intrinsics called AVX2. Has anybody heard the name "AVX2" before?
Dialogue: 0,0:02:22.08,0:02:24.70,Default,,0,0,0,,This should be even fewer than "BLAKE2"... Maybe some? Cool.
Dialogue: 0,0:02:24.70,0:02:33.77,Default,,0,0,0,,So, real quick -- I have so little time -- the Intel people have been spending a lot of time making processors faster, and they've been spending a lot of time making cores to run multiple threads.
Dialogue: 0,0:02:33.77,0:02:38.00,Default,,0,0,0,,But also they've been adding new instructions, so we can do things that the old processors couldn't do.
Dialogue: 0,0:02:38.00,0:02:41.15,Default,,0,0,0,,Now if you use those instructions, your program's not compatible with old machines.
Dialogue: 0,0:02:41.15,0:02:43.32,Default,,0,0,0,,So generally compilers don't emit them by default.
Dialogue: 0,0:02:43.32,0:02:48.78,Default,,0,0,0,,But if you tell your compiler to emit them, it will. And then you have to be very careful, cause you're doing unsafe things.
Dialogue: 0,0:02:48.78,0:02:55.48,Default,,0,0,0,,So here is a bunch of crap. And what it means doesn't really matter. It's shuffling bytes around. It's a hash function, it's shuffling bytes around, we all know.
Dialogue: 0,0:02:55.48,0:02:57.63,Default,,0,0,0,,But this is in Rust, of course.
Dialogue: 0,0:02:57.63,0:03:06.86,Default,,0,0,0,,And we see this cute little thing here, which is telling the compiler that, regardless of what platform you thought you were compiling for, in this function, use those AVX2 instructions.
Dialogue: 0,0:03:06.86,0:03:11.24,Default,,0,0,0,,You know, multiple integers at a time, loading and storing, that kind of stuff.
Dialogue: 0,0:03:11.24,0:03:16.70,Default,,0,0,0,,I did not invent this implementation. I ported it from C. So they, the actual authors, get credit for the performance.
Dialogue: 0,0:03:16.70,0:03:27.32,Default,,0,0,0,,So doing this says the compiler will emit these instructions for all targets, and then we're responsible for never ever calling this function when it's unsafe. So...
Dialogue: 0,0:03:27.32,0:03:31.53,Default,,0,0,0,,Down here, we see what we have to do to make this sort of stuff useful.
Dialogue: 0,0:03:31.53,0:03:35.21,Default,,0,0,0,,So, this is if you're sort of statically targeting modern processors, which is rare.
Dialogue: 0,0:03:35.21,0:03:39.53,Default,,0,0,0,,But here, we see that we're at runtime asking the processor,
Dialogue: 0,0:03:39.53,0:03:43.58,Default,,0,0,0,,"What instructions do you support?" If this processor supports AVX2, we'll call this function.
Dialogue: 0,0:03:43.58,0:03:47.76,Default,,0,0,0,,If I replace this with `if true`, this would hopefully crash.
Dialogue: 0,0:03:47.76,0:03:52.27,Default,,0,0,0,,Probably do something worse, on a machine that's older than three years. So this part's pretty important.
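The pattern described here — an `unsafe` AVX2 function behind a runtime CPU check — can be sketched like this (a toy 8-lane add, not the talk's actual BLAKE2 code; the function names are made up):

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add8_avx2(a: &[i32; 8], b: &[i32; 8]) -> [i32; 8] {
    use std::arch::x86_64::*;
    // Unaligned loads/stores, so the arrays don't need 32-byte alignment.
    let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
    let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
    let mut out = [0i32; 8];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, _mm256_add_epi32(va, vb));
    out
}

fn add8(a: &[i32; 8], b: &[i32; 8]) -> [i32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        // The runtime check is what makes the `unsafe` call sound: we only
        // take this branch when the CPU actually supports AVX2.
        if is_x86_feature_detected!("avx2") {
            return unsafe { add8_avx2(a, b) };
        }
    }
    // Portable fallback for older machines — this is the path that replacing
    // the check with `if true` would wrongly skip, causing undefined behavior.
    let mut out = [0i32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}
```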
Dialogue: 0,0:03:52.27,0:03:56.67,Default,,0,0,0,,So that's how you beat `sha1sum`, using pure, stable Rust.
Dialogue: 0,0:03:56.67,0:03:58.67,Default,,0,0,0,,And that works now.
Dialogue: 0,0:03:58.67,0:04:00.67,Default,,0,0,0,,If we want to get even faster than this,
Dialogue: 0,0:04:00.67,0:04:02.14,Default,,0,0,0,,we have to start using multiple threads.
Dialogue: 0,0:04:02.14,0:04:05.32,Default,,0,0,0,,So, a standard hash function is simply not designed to run on multiple different threads.
Dialogue: 0,0:04:05.32,0:04:08.44,Default,,0,0,0,,Each step depends on the last step.
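A minimal sketch of that serial dependency (a toy compression step, nothing like a real hash's mixing): each block's processing consumes the previous state, so the blocks cannot be handed to different threads.

```rust
// Toy chained hash: state_{n+1} = compress(state_n, block_n).
// The data dependency on `state` is what forbids parallelism.
fn chain_hash(blocks: &[[u8; 64]]) -> u64 {
    let mut state: u64 = 0x6a09e667f3bcc908; // arbitrary IV for the sketch
    for block in blocks {
        // Stand-in compression; real ones (BLAKE2, SHA-2) mix far harder.
        for &byte in block {
            state = state.rotate_left(7) ^ (byte as u64).wrapping_mul(0x100000001b3);
        }
    }
    state
}
```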
Dialogue: 0,0:04:08.44,0:04:16.01,Default,,0,0,0,,So as a...so we can see that this implementation is of course producing the same result.
Dialogue: 0,0:04:16.01,0:04:18.99,Default,,0,0,0,,4e2...something something, as the portable C.
Dialogue: 0,0:04:18.99,0:04:21.80,Default,,0,0,0,,That's good, because it has the same name, so it better produce the same result.
Dialogue: 0,0:04:21.80,0:04:27.93,Default,,0,0,0,,But if we wanted, if we were willing to use a different construction designed for multiple threads, we could do things faster. So let's try that.
Dialogue: 0,0:04:27.93,0:04:29.66,Default,,0,0,0,,This utility implements it.
Dialogue: 0,0:04:29.66,0:04:32.75,Default,,0,0,0,,The BLAKE2 standard includes such a construction.
Dialogue: 0,0:04:32.75,0:04:38.72,Default,,0,0,0,,[inaudible] called BLAKE2bp,
Dialogue: 0,0:04:38.72,0:04:39.82,Default,,0,0,0,,which is quite fast.
Dialogue: 0,0:04:39.82,0:04:42.14,Default,,0,0,0,,So that's going to fire up four threads.
Dialogue: 0,0:04:42.14,0:04:46.30,Default,,0,0,0,,It's going to chunk the input into four groups.
Dialogue: 0,0:04:46.30,0:04:52.17,Default,,0,0,0,,So the idea is like a block size of, BLAKE2[b] uses 128 bytes if I remember correctly,
Dialogue: 0,0:04:52.17,0:04:55.64,Default,,0,0,0,,So the first 128 bytes is block 0, then block 1, 2, 3, and 4.
Dialogue: 0,0:04:55.64,0:04:58.68,Default,,0,0,0,,So then you get 0, goes with 4, goes with 8, goes with 12, that's on one thread.
Dialogue: 0,0:04:58.68,0:05:00.40,Default,,0,0,0,,And 1 goes with 5, and so on.
Dialogue: 0,0:05:00.40,0:05:02.70,Default,,0,0,0,,So you get four threads kind of running in parallel on the same input.
Dialogue: 0,0:05:02.70,0:05:06.46,Default,,0,0,0,,You get four hashes, hash those together, the result is the BLAKE2bp sum.
Dialogue: 0,0:05:06.46,0:05:09.72,Default,,0,0,0,,That plus a few other parameter tweaks, that's the definition of BLAKE2bp.
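The interleaving described above can be sketched like this (hypothetical helper names; real BLAKE2bp feeds the four lanes incrementally with per-lane parameter tweaks rather than literally copying bytes):

```rust
const BLOCK: usize = 128; // BLAKE2b block size
const LANES: usize = 4;

// Deal 128-byte blocks round-robin across 4 lanes, the way BLAKE2bp
// assigns input to its 4 parallel BLAKE2b instances: lane 0 gets blocks
// 0, 4, 8, ...; lane 1 gets blocks 1, 5, 9, ...; and so on.
fn split_lanes(input: &[u8]) -> [Vec<u8>; LANES] {
    let mut lanes: [Vec<u8>; LANES] = Default::default();
    for (i, block) in input.chunks(BLOCK).enumerate() {
        lanes[i % LANES].extend_from_slice(block);
    }
    lanes
}
```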
Dialogue: 0,0:05:09.72,0:05:13.00,Default,,0,0,0,,And then, it's not a like 4x speedup, unfortunately.
Dialogue: 0,0:05:13.00,0:05:16.80,Default,,0,0,0,,There are four physical cores on this machine. So we hope we get 4x, but it's [inaudible].
Dialogue: 0,0:05:16.80,0:05:19.23,Default,,0,0,0,,That's BLAKE2bp, and again that's slightly standard.
Dialogue: 0,0:05:19.23,0:05:23.28,Default,,0,0,0,,Not widely implemented in a lot of places, but in theory other software implements this.
Dialogue: 0,0:05:23.28,0:05:26.78,Default,,0,0,0,,If we want to get even faster than that, we have to start doing very interesting stuff.
Dialogue: 0,0:05:26.78,0:05:28.84,Default,,0,0,0,,Real quick, since we're on BLAKE2bp though,
Dialogue: 0,0:05:28.84,0:05:33.12,Default,,0,0,0,,let's look at the implementation of that.
Dialogue: 0,0:05:33.12,0:05:35.24,Default,,0,0,0,,I'm going to ignore most of the details.
Dialogue: 0,0:05:35.24,0:05:39.47,Default,,0,0,0,,
Dialogue: 0,0:05:39.47,0:05:43.23,Default,,0,0,0,,So, four of these go first in parallel.
Dialogue: 0,0:05:43.23,0:05:46.11,Default,,0,0,0,,One of these goes at the end, slightly different parameters, it's just a hash.
Dialogue: 0,0:05:46.11,0:05:49.31,Default,,0,0,0,,This is the interesting part, which I'll zoom in on.
Dialogue: 0,0:05:49.31,0:05:52.19,Default,,0,0,0,,This is Rayon. Who's used Rayon before?
Dialogue: 0,0:05:52.19,0:05:54.88,Default,,0,0,0,,It's surprisingly easy to use.
Dialogue: 0,0:05:54.88,0:06:03.72,Default,,0,0,0,,Niko...oh my god I should not even attempt to pronounce his last name...Matsakis, invented this early on, probably before 1.0.
Dialogue: 0,0:06:03.72,0:06:06.01,Default,,0,0,0,,It makes threading surprisingly easy.
Dialogue: 0,0:06:06.01,0:06:10.44,Default,,0,0,0,,`join` means, something on the left something on the right, run them on multiple threads...maybe.
Dialogue: 0,0:06:10.44,0:06:12.70,Default,,0,0,0,,And you'll notice it does some very very clever things.
Dialogue: 0,0:06:12.70,0:06:15.31,Default,,0,0,0,,If I scroll up...
Dialogue: 0,0:06:15.31,0:06:17.88,Default,,0,0,0,,`input` is a slice argument to this function, right.
Dialogue: 0,0:06:17.88,0:06:24.40,Default,,0,0,0,,Then this closure, the worker closure, is like chunking up input [inaudible] and hashing it.
Dialogue: 0,0:06:24.40,0:06:26.27,Default,,0,0,0,,And this is running that on multiple threads.
Dialogue: 0,0:06:26.27,0:06:32.86,Default,,0,0,0,,The type checker is already doing all the work to make sure that these threads are not going to take this reference that lives on this thread's stack
Dialogue: 0,0:06:32.86,0:06:37.58,Default,,0,0,0,,and then like outlast it. It's all safe, and all this stuff magically works. That's really cool.
Dialogue: 0,0:06:37.58,0:06:44.84,Default,,0,0,0,,And if the program happens to be using Rayon threads in other capacities, it's all the same global threadpool, so it kind of does a lot for you.
Dialogue: 0,0:06:44.84,0:06:50.01,Default,,0,0,0,,So this is just, do some things in parallel, surprisingly easy, and you get your answer.
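A stdlib sketch of what `rayon::join` provides — run two closures that borrow from the caller, possibly in parallel, and get both results back — here using scoped threads instead of Rayon's work-stealing pool:

```rust
use std::thread;

// Sum the two halves of a borrowed slice on two threads. As with
// `rayon::join`, the type system guarantees both threads finish before
// the borrowed `input` can go away, so no unsafe code is needed.
fn parallel_sums(input: &[u64]) -> (u64, u64) {
    let (left, right) = input.split_at(input.len() / 2);
    thread::scope(|s| {
        let handle = s.spawn(|| right.iter().sum::<u64>());
        let l: u64 = left.iter().sum();
        (l, handle.join().unwrap())
    })
}
```

Rayon's version is cheaper in practice: `join` queues the second closure on a shared global thread pool and lets an idle worker steal it, instead of spawning a fresh OS thread per call.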
Dialogue: 0,0:06:50.01,0:06:55.26,Default,,0,0,0,,Again, if we want to do something even faster than .4 seconds,
Dialogue: 0,0:06:55.26,0:06:58.20,Default,,0,0,0,,we need to start defining custom hashing modes.
Dialogue: 0,0:06:58.20,0:07:04.28,Default,,0,0,0,,So the project that actually...led up...all this talk to...is called "Bao", currently. B-A-O.
Dialogue: 0,0:07:04.28,0:07:08.86,Default,,0,0,0,,It's a custom tree hashing mode. The details don't matter; it's a binary tree.
Dialogue: 0,0:07:08.86,0:07:12.94,Default,,0,0,0,,And if we run `bao hash` on "f"...
Dialogue: 0,0:07:12.94,0:07:15.02,Default,,0,0,0,,Maybe we want to know how fast it is...
Dialogue: 0,0:07:15.02,0:07:17.02,Default,,0,0,0,,Quite fast.
Dialogue: 0,0:07:17.02,0:07:19.52,Default,,0,0,0,,That is almost 4 gigabytes per second.
Dialogue: 0,0:07:19.52,0:07:22.16,Default,,0,0,0,,That is utilizing all the cores on the machine.
Dialogue: 0,0:07:22.16,0:07:26.03,Default,,0,0,0,,That is utilizing the AVX2 instructions in that BLAKE2 implementation we just saw.
Dialogue: 0,0:07:26.03,0:07:29.07,Default,,0,0,0,,If you look at the implementation of this...
Dialogue: 0,0:07:29.07,0:07:31.68,Default,,0,0,0,,This is the top-level hash function.
Dialogue: 0,0:07:31.68,0:07:36.30,Default,,0,0,0,,It doesn't make sense to use Rayon if you have like, a thousand bytes, cause there's a lot of overhead.
Dialogue: 0,0:07:36.30,0:07:40.88,Default,,0,0,0,, But if you have enough bytes, it calls into this recursive thing,
Dialogue: 0,0:07:40.88,0:07:42.30,Default,,0,0,0,,which again uses `rayon::join`.
Dialogue: 0,0:07:42.30,0:07:47.55,Default,,0,0,0,,This is a shockingly simple function for implementing, like, a four gigabyte per second hash.
Dialogue: 0,0:07:47.55,0:07:53.02,Default,,0,0,0,,It's...the main difference between this and BLAKE2bp is that this one is actually directly recursing on itself.
Dialogue: 0,0:07:53.02,0:07:56.35,Default,,0,0,0,,So we actually call `hash_recurse_rayon` inside of `hash_recurse_rayon`.
Dialogue: 0,0:07:56.35,0:08:01.64,Default,,0,0,0,,Before, with BLAKE2bp, it was just sort of a hardcoded set of four calls into a worker.
Dialogue: 0,0:08:01.64,0:08:05.80,Default,,0,0,0,,And this will parallelize across an arbitrary number of threads on the way down.
Dialogue: 0,0:08:05.80,0:08:10.08,Default,,0,0,0,,`hash_node` is literally just the BLAKE2 hash. When you actually get to the bottom, you hash a leaf.
Dialogue: 0,0:08:10.08,0:08:12.40,Default,,0,0,0,,Then you hash everything together at the end.
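The shape of that recursion can be sketched with standard-library pieces (DefaultHasher standing in for BLAKE2, scoped threads standing in for `rayon::join`; real Bao splits at a chunk-aligned power-of-two boundary rather than the midpoint, and the names here are made up):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::thread;

const LEAF: usize = 4096; // hypothetical leaf size, not Bao's real parameter

// Stand-in for `hash_node`: any hash of a byte string. Bao uses BLAKE2b;
// DefaultHasher just keeps the sketch dependency-free.
fn hash_node(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(bytes);
    h.finish()
}

// Shape of the recursive tree hash: recurse on both halves in parallel,
// then hash the two child hashes together as the parent node.
fn hash_recurse(input: &[u8]) -> u64 {
    if input.len() <= LEAF {
        return hash_node(input); // at the bottom, hash a leaf
    }
    let (left, right) = input.split_at(input.len() / 2);
    let (l, r) = thread::scope(|s| {
        let handle = s.spawn(|| hash_recurse(right));
        (hash_recurse(left), handle.join().unwrap())
    });
    // Parent node = hash of the two child hashes.
    let mut parent = [0u8; 16];
    parent[..8].copy_from_slice(&l.to_le_bytes());
    parent[8..].copy_from_slice(&r.to_le_bytes());
    hash_node(&parent)
}
```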
Dialogue: 0,0:08:12.40,0:08:14.49,Default,,0,0,0,,So, that's Bao. It's pretty fast.
Dialogue: 0,0:08:14.49,0:08:19.66,Default,,0,0,0,,If anybody has any questions about Rayon or SIMD instructions, let me know.
Dialogue: 0,0:08:19.66,0:08:28.14,Default,,0,0,0,,
Dialogue: 0,0:08:28.14,0:08:32.81,Default,,0,0,0,,So what's the name of the...how do you spell that...the hash?
Dialogue: 0,0:08:32.81,0:08:34.03,Default,,0,0,0,,The one at the end?
Dialogue: 0,0:08:34.03,0:08:35.32,Default,,0,0,0,,I think you said "Bao"?
Dialogue: 0,0:08:35.32,0:08:41.02,Default,,0,0,0,,B-A-O. I just put up the GitHub link, but it's too small for anyone to read.
Dialogue: 0,0:08:41.02,0:08:53.60,Default,,0,0,0,,
Dialogue: 0,0:08:53.60,0:08:56.30,Default,,0,0,0,,The other one was `blake2b_simd`.
Dialogue: 0,0:08:56.30,0:09:03.50,Default,,0,0,0,,This actually...this implements...the first one implements the standard BLAKE[2] hash function, which you might actually be using in other applications. It's just faster
Dialogue: 0,0:09:03.50,0:09:04.92,Default,,0,0,0,,
Dialogue: 0,0:09:04.92,0:09:08.83,Default,,0,0,0,,The compress[ion] function you showed using SIMD instructions is marked unsafe.
Dialogue: 0,0:09:08.83,0:09:14.48,Default,,0,0,0,,What's [inaudible] SIMD intrinsics, those happen to be unsafe?
Dialogue: 0,0:09:14.48,0:09:15.28,Default,,0,0,0,,Those happen to be unsafe.
Dialogue: 0,0:09:15.28,0:09:17.80,Default,,0,0,0,,There's sort of two unsafe things about them.
Dialogue: 0,0:09:17.80,0:09:23.56,Default,,0,0,0,,One is that, if you just simply call them at all on a platform that doesn't support them, it's just by definition undefined behavior.
Dialogue: 0,0:09:23.56,0:09:25.26,Default,,0,0,0,,I have no idea what actually happens.
Dialogue: 0,0:09:25.26,0:09:31.80,Default,,0,0,0,,Another is that those...one of the intrinsics defined in those sets is usually some sort of wide load.
Dialogue: 0,0:09:31.80,0:09:37.29,Default,,0,0,0,,Like load a 256-bit integer, or something like that, and those usually require aligned pointers.
Dialogue: 0,0:09:37.29,0:09:39.74,Default,,0,0,0,,So if you have some pointer...
Dialogue: 0,0:09:39.74,0:09:42.33,Default,,0,0,0,,You run into this if you're doing pointer casts.
Dialogue: 0,0:09:42.33,0:09:46.49,Default,,0,0,0,,If you take a pointer to some bytes, and you cast it into a pointer to a 64-bit integer,
Dialogue: 0,0:09:46.49,0:09:52.48,Default,,0,0,0,,you might think all you need to do is length check, and on x86 that might be true...sometimes.
Dialogue: 0,0:09:52.48,0:09:56.70,Default,,0,0,0,,But in fact what you've done, is you've got a problem with your unaligned pointer, and you've caused undefined behavior.
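A sketch of the safe pattern for that case: `ptr::read_unaligned` works for any alignment, whereas a plain dereference of the cast pointer would be undefined behavior whenever the bytes don't happen to be 8-byte aligned.

```rust
// Read a u64 (native byte order) from a byte slice with no alignment
// guarantee. Only the length check is needed; the alignment hazard is
// handled by `read_unaligned`.
fn read_u64(bytes: &[u8]) -> u64 {
    assert!(bytes.len() >= 8);
    let p = bytes.as_ptr() as *const u64;
    // UB if misaligned: unsafe { *p }
    // Correct for any alignment:
    unsafe { p.read_unaligned() }
}
```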
Dialogue: 0,0:09:56.70,0:10:01.68,Default,,0,0,0,,So that is a thing that tends to happen if you [inaudible].
Dialogue: 0,0:10:01.68,0:10:07.90,Default,,0,0,0,,Does Rayon act like actual threads, or does it do something with like scheduled threads?
Dialogue: 0,0:10:07.90,0:10:09.90,Default,,0,0,0,,Tell me more about the difference between those two things.
Dialogue: 0,0:10:09.90,0:10:13.95,Default,,0,0,0,,Like do you [inaudible] full operating system multiple threads where you're running it on...
Dialogue: 0,0:10:13.95,0:10:19.21,Default,,0,0,0,,Yes, under the covers you have an actual thread pool with actual threads.
Dialogue: 0,0:10:19.21,0:10:26.49,Default,,0,0,0,,It does a fair bit of logic to deal with, like, idle cores and stuff like that. I think it tries to be fairly intelligent.
Dialogue: 0,0:10:26.49,0:10:28.49,Default,,0,0,0,,It also tries to [inaudible] around whether...
Dialogue: 0,0:10:28.49,0:10:32.59,Default,,0,0,0,,Of course this tree, you could imagine you have like 2^30 branches.
Dialogue: 0,0:10:32.59,0:10:35.02,Default,,0,0,0,,You might not want to run them all on individual worker threads.
Dialogue: 0,0:10:35.02,0:10:37.76,Default,,0,0,0,,So it tries to [inaudible], but yes, real threads.
Dialogue: 0,0:10:37.76,0:10:39.21,Default,,0,0,0,,
Dialogue: 0,0:10:39.21,0:10:41.21,Default,,0,0,0,,Alright. Thanks guys.