Skip to content

Instantly share code, notes, and snippets.

@eloff
Last active August 3, 2021 01:24
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save eloff/6e0c236ae6388744a7d5b76a5e0e9b6c to your computer and use it in GitHub Desktop.
Save eloff/6e0c236ae6388744a7d5b76a5e0e9b6c to your computer and use it in GitHub Desktop.
merge two adjacent Bytes objects into one
// Public Domain
use bytes;
struct BytesAlike {
data: *const u8,
len: usize,
_1: usize,
_2: usize,
}
/// unsplit_bytes checks if b2 follows directly after b1 in memory, and if so merges it
/// into b1 and returns b1. Otherwise returns b1 and b2 unmodified.
/// Safety: because of pointer provenance https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#pointer-provenance
/// this may invoke undefined behavior if used to merge two Bytes not originally part of
/// the same allocation (e.g. not split from the same BytesMut or similar.)
pub unsafe fn unsplit_bytes(mut b1: Bytes, b2: Bytes) -> (Option<Bytes>, Option<Bytes>) {
if are_contiguous(&b1, &b2) {
assert_eq!(std::mem::size_of::<Bytes>(), std::mem::size_of::<BytesAlike>());
// Safety: this is pretty unsafe, we assume the length is a usize stored after a pointer.
// So we check that both the pointer and length fields appear to be where expect them
// and if they're not, we simply treat them as if they cannot be merged.
let p = b1.as_ptr();
let len = b1.len();
let bytes_ref = unsafe { &mut *(&mut b1 as *mut Bytes as *mut BytesAlike) };
if bytes_ref.data == p && bytes_ref.len == len {
bytes_ref.len += b2.len();
return (Some(b1), None);
}
}
(Some(b1), Some(b2))
}
/// Returns true if b2 immediately follows b1 in memory
fn are_contiguous(b1: &Bytes, b2: &Bytes) -> bool {
// Safety: we don't dereference end
let end = unsafe { b1.as_ptr().add(b1.len()) };
end == b2.as_ptr()
}
#[cfg(test)]
mod tests {
use super::*;
use bytes::Buf;
#[test]
fn test_unsplit_bytes_success() {
let s = "foobar".as_bytes();
let mut b1 = Bytes::from(s);
let b2 = b1.split_off(3);
assert!(are_contiguous(&b1, &b2));
let (r1, r2) = unsafe { unsplit_bytes(b1, b2) };
assert!(r1.is_some());
assert!(r2.is_none());
assert_eq!(r1.unwrap().chunk(), s);
}
#[test]
fn test_unsplit_bytes_fail() {
let foo = "foopad".as_bytes();
let bar = "bar".as_bytes();
let b1 = Bytes::from(&foo[..3]);
let b2 = Bytes::from(bar);
assert!(!are_contiguous(&b1, &b2));
let (r1, r2) = unsafe { unsplit_bytes(b1, b2) };
assert_eq!(r1.is_some(), r2.is_some());
assert_eq!(r1.unwrap().chunk(), &foo[..3]);
assert_eq!(r2.unwrap().chunk(), bar);
}
}
@eloff
Copy link
Author

eloff commented Aug 2, 2021

Note this doesn't handle the case where b1 and b2 are adjacent, but b2 comes before b1 in memory.

It's easy enough to modify this to handle that. One can also just call unsplit_bytes a second time with the arguments reversed. I don't have a need to handle that case, so it's not implemented here.

@eloff
Copy link
Author

eloff commented Aug 2, 2021

Consider this code as being in the public domain, use as you see fit.

@memoryruins
Copy link

memoryruins commented Aug 2, 2021

miri errors with the following:

error: Undefined Behavior: no item granting read access to tag <3108> at alloc1521+0x8 found in borrow stack.
  --> src/main.rs:20:9
   |
20 |         bytes_ref.len += b2.len();
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^ no item granting read access to tag <3108> at alloc1521+0x8 found in borrow stack.

click tools > miri to run it https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=2cc2a5b8e1bc0e4783c4ef6a88ee5986

this appears to be trying to do a similar thing that slice's docs warn against
https://doc.rust-lang.org/nightly/std/slice/fn.from_raw_parts.html#incorrect-usage

(the code also assumes the layout of bytes::Bytes in the cast. Bytes not only has private fields, which means you shouldn't assume anything as an user of the library, it is also repr(rust) which means rustc is free to re-order the fields)

@eloff
Copy link
Author

eloff commented Aug 2, 2021

I don't see what Rust safety rule specifically is being violated here. Could this not just be an issue with miri?

I'm mutating memory marked as mutable through a non-aliased mutable reference.

The unsafe bit is, as you point out, is assuming the order of the fields in bytes::Bytes. It's made a little safer by checking at runtime that the length field matches the actual length. Given the other three fields are pointers, this is unlikely to occur by chance (the Bytes buffer would have to be very large for that to be possible, and because the numbers are then very large, they're unlikely to match by chance.)

It's not great. But I think it's the best that can be done barring this being included in the bytes library itself: tokio-rs/bytes#503

I think for the purposes of this hack, changing the assert so instead of panicking it just does not merge the Bytes would be an improvement. It would also make it a little safer (probabilistically) to check the data pointer is where it's expected to be as well.

@memoryruins
Copy link

memoryruins commented Aug 2, 2021

I think the current miri error is due to the borrow in the assert before that line. A mutable borrow is created to b1, then it's invalidated by the shared borrow when calling b1.len(), then we try to use the mutable borrow again in the last line. to work around that, can take the len before creating the mutable borrow.

let len = b1.len();
let bytes_ref = unsafe { &mut *(&mut b1 as *mut Bytes as *mut BytesAlike) };
assert_eq!(bytes_ref.len, len);
bytes_ref.len += b2.len();

but I do not think that makes us in an okay place yet. if we do something like the following,

fn main() {
    let s = "foobar".as_bytes();
    let b1 = Bytes::from(s);
    let b2 = Bytes::from("aaaaaaaa");
    assert!(are_contiguous(&b1, &b2));
    let (r1, r2) = unsplit_bytes(b1, b2);
    assert!(r1.is_some());
    assert!(r2.is_none());
    println!("{:?}", r1);
}

and compile in release and run it (because debug/miri do not end up making it contiguous), it will print

Some(b"foobaraaaaaaaa")

and I wonder if a case could be created that causes the issue mentioned in slice's docs / these links,

@eloff
Copy link
Author

eloff commented Aug 2, 2021

I think the current miri error is due to the borrow in the assert before that line. A mutable borrow is created to b1, then it's invalidated by the shared borrow when calling b1.len(), then we try to use the mutable borrow again in the last line.

Yes, you're right. Good catch! I'll update the code.

About the panic, that seems expected. The two allocations end up being contiguous (yes that surprised me too!) so they're joined and the strings are not equal because now it's "foobaraaaaaaaa" as expected.

However, you have a point about pointer provenance. This may be undefined behavior to combine two adjacent memory regions that weren't originally part of the same buffer. In practice I have difficulty imagining how the compiler could miscompile a program in this specific case. It's possible though. One can avoid that by just documenting that it's undefined behavior to use unsplit in cases like that (and add unsafe to the function.)

@memoryruins
Copy link

One can avoid that by just documenting that it's undefined behavior to use unsplit in cases like that (and add unsafe to the function.)

sounds reasonable ^^

@eloff
Copy link
Author

eloff commented Aug 3, 2021

Hey, thank you for your feedback. Your review improved the code. Are you affiliated with miri at all?

@memoryruins
Copy link

Glad it could help! I'm not affiliated, but have tried to help on a few issues on miri's repo in the past. Noticed your issue earlier while checking byte's repo (not affiliated with it either) and was curious what it was about :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment