I'm thinking it's quite unlikely that both slices will be well aligned to 8- or 16-byte boundaries, so you'd simply go for unaligned loads, which are fast on x86-64 anyway. This code is probably similar to what you were already doing, and similar to the memchr crate. The optimizer switches between two separate 64-bit loads and one 128-bit load depending on how the d0, d1 variables are used.
If more throughput is needed (only achievable for inputs that fit in one of the CPU caches, I'd guess), you could unroll the loop further.
You can also use trailing_zeros() / 8 to get the byte offset straight from the xor result (that's the reason for the .to_le() call in the code, though it's actually a redundant leftover in this version, since the little-endian load already puts the bytes in the right order).
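A minimal sketch of the idea, since the original snippet isn't shown here: compare two slices eight bytes at a time with unaligned little-endian loads, and when a 64-bit word differs, recover the exact byte offset from `trailing_zeros() / 8` on the xor. The function name `first_mismatch` and the scalar tail loop are my own choices, not from the original code.

```rust
/// Return the index of the first differing byte between `a` and `b`,
/// or None if the common prefix (up to the shorter length) matches.
fn first_mismatch(a: &[u8], b: &[u8]) -> Option<usize> {
    let len = a.len().min(b.len());
    let mut i = 0;

    // Main loop: unaligned 8-byte loads, fast on x86-64 even when the
    // slices aren't aligned. from_le_bytes makes byte 0 the lowest
    // bits of the word, so no extra .to_le() is needed.
    while i + 8 <= len {
        let d0 = u64::from_le_bytes(a[i..i + 8].try_into().unwrap());
        let d1 = u64::from_le_bytes(b[i..i + 8].try_into().unwrap());
        let x = d0 ^ d1;
        if x != 0 {
            // The lowest set bit of the xor sits in the first
            // differing byte; trailing_zeros() / 8 converts the bit
            // position to a byte offset within the word.
            return Some(i + (x.trailing_zeros() / 8) as usize);
        }
        i += 8;
    }

    // Scalar tail for the remaining < 8 bytes.
    while i < len {
        if a[i] != b[i] {
            return Some(i);
        }
        i += 1;
    }
    None
}
```

Unrolling the main loop (processing 16 or 32 bytes per iteration) is where the extra throughput mentioned above would come from; the compiler may already fuse two consecutive 64-bit loads into a single 128-bit load depending on how d0 and d1 are used.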