// Gist by @sbeam, created September 16, 2022 21:57
use std::str;
fn main() {
// A string literal is a &'static str: a borrowed slice of UTF-8 bytes that ships inside the
// binary as read-only data. It is not on the heap, and it is not a reference to a String.
// You can read, slice, and pass a &str around as-is. To own the text, grow it, or mutate it,
// it has to be copied into a heap-allocated String, which is the job of .to_string(). It might
// seem strange to take a string and immediately convert it to a String, but that is how the
// bytes get shipped from "const" land to the heap, where they can be resized if needed.
//
// We will use a phrase in German to make it look slightly more interesting.
let s = "große Feuer-Bälle 🔥!".to_string();
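// A hedged aside, not part of the original walkthrough: a literal can be read without any
// conversion, and to_string() is only needed once we want an owned, growable copy.
// `sketch_literal_vs_owned` is a hypothetical helper name used only for this illustration.
#[allow(dead_code)]
fn sketch_literal_vs_owned() -> (usize, String) {
    let literal: &'static str = "große"; // lives in the binary's read-only data
    let char_count = literal.chars().count(); // reading a &str needs no allocation
    let mut owned = literal.to_string(); // copies the bytes into a heap buffer
    owned.push_str(" Feuer"); // growing the text requires an owned String
    (char_count, owned)
}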
// as_bytes() borrows the String's contents (a Vec<u8> that is guaranteed to be valid UTF-8)
// as a byte slice, &[u8], without copying anything
let byte_array = s.as_bytes();
// let's have a look at them
for n in byte_array {
print!("{}|", n);
}
println!();
// =>
// 103|114|111|195|159|101|32|70|101|117|101|114|45|66|195|164|108|108|101|32|240|159|148|165|33|
//
// Nice, but where do all these bytes live?
// {:p} makes it easy to print the (virtual) memory address behind any pointer.
println!("heap string: {:p}", s.as_ptr());
// =>
// heap string: 0x600000e31120
//
// Ok, there it be. That's a big heaping number. (The exact value changes from run to run.)
// what if we make a ref to the string?
let a_ref = &s;
println!("ref: {:p}", a_ref);
// =>
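// ref: 0x309fb8808
//
// interesting, a much smaller address! {:p} on a reference prints the address it points to,
// which here is `s` itself: the String's small (pointer, length, capacity) header, and that
// lives on the stack.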
//
// Wait. What is the stack again?
// It's the region of memory that holds each function call's local variables, which must have
// a size known at compile time. Values are pushed when they come into scope and popped from
// the same end when they go out of scope, so allocating on the stack is very cheap.
//
// It's distinct from the heap, which is dynamically allocated at runtime by the memory
// allocator (it has nothing to do with the "heap" data structure). Heap allocations are
// moderately expensive, so Rust avoids them unless you ask for one. String, Vec, HashMap,
// and Box keep their contents on the heap, with a small fixed-size handle that can sit on the
// stack. References are plain pointers: they often live on the stack and point into the heap,
// but they can be stored anywhere and point anywhere. When an owner goes out of scope its
// memory is handed back to the allocator, which can re-use it or return it to the OS. Since
// the compiler tracks the "owner" of every value at compile time, it knows exactly where to
// insert those frees, so Rust needs no garbage collector, and safe Rust cannot be used to
// write non-memory-safe code (miraculous!). Only `unsafe` blocks can opt out.
//
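// A minimal sketch (an illustration, not part of the original walkthrough) of scope-based
// cleanup: a value's destructor runs when its owner goes out of scope, and locals are dropped
// in reverse declaration order. `Noisy` and `sketch_drop_order` are hypothetical names.
#[allow(dead_code)]
fn sketch_drop_order() -> Vec<&'static str> {
    use std::cell::RefCell;

    // Records its own name in a shared log when it is dropped.
    struct Noisy<'a> {
        name: &'static str,
        log: &'a RefCell<Vec<&'static str>>,
    }
    impl Drop for Noisy<'_> {
        fn drop(&mut self) {
            self.log.borrow_mut().push(self.name);
        }
    }

    let log = RefCell::new(Vec::new());
    {
        let _first = Noisy { name: "first", log: &log };
        let _second = Noisy { name: "second", log: &log };
    } // both values are dropped here: `_second` first, then `_first`
    log.into_inner()
}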
// a_ref is a &String, and calling as_ptr() on it (through auto-deref) reveals, again, the same
// location of the String's bytes, somewhere out in the heap.
println!("{:p}", a_ref.as_ptr());
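// A hedged sketch of the layout described above. It assumes what current Rust implementations
// do (the exact size of String is not a documented guarantee): a String is three machine
// words (pointer, length, capacity), a &str is two (pointer, length), and a &str borrowed
// from a String points at the very same heap bytes. `sketch_string_layout` is a hypothetical
// helper name.
#[allow(dead_code)]
fn sketch_string_layout() -> (bool, bool, bool) {
    use std::mem::size_of;

    let owned = String::from("Feuer");
    let view: &str = &owned; // deref coercion: &String -> &str
    (
        size_of::<String>() == 3 * size_of::<usize>(), // pointer + length + capacity
        size_of::<&str>() == 2 * size_of::<usize>(),   // pointer + length
        view.as_ptr() == owned.as_ptr(),               // same heap buffer, nothing copied
    )
}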
// what is a &str again?
// A &str is a borrowed, read-only view of UTF-8 bytes that someone else owns: a slice of a
// String's heap buffer, or a string literal in the binary's read-only memory. The &str itself
// is just a pointer plus a length, small enough to sit on the stack.
//
// So in general, if a function needs to be called with a string that doesn't need to change,
// it should receive a &str. Passing a &String works too, because it coerces to &str.
hark(a_ref);
// => Hark! große Feuer-Bälle 🔥!
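// a &str doesn't have to reference the entire underlying string. The String itself is still
// owned by someone else, and can't be borrowed mutably while the slice is alive. Slice ranges
// are byte offsets and must land on character boundaries, or the program panics. Byte 19 is
// the space before the emoji. This just creates another pointer on the stack we can pass
// around like any other.
let ending = &s[19..];
hark(ending);
// => Hark!  🔥!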
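// A small sketch, not from the original walkthrough: when a byte offset is not known to be
// safe, str::get() returns None instead of panicking if the offset is out of range or falls
// inside a multi-byte character. `sketch_safe_slice` is a hypothetical helper name.
#[allow(dead_code)]
fn sketch_safe_slice(text: &str, start: usize) -> Option<&str> {
    text.get(start..)
}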
// and what if the underlying string needs to be changed?
// In that case, we must have a string declared as mutable, then make sure we pass it mutably
// borrowed (&mut). Here, `clone()` allocates a new buffer on the heap and copies the
// original string to it byte-for-byte. Note that to_uppercase() cannot actually work in place,
// because one character may map to several, so upper() returns a new String and leaves s2 alone.
let mut s2 = s.clone();
hark(&upper(&mut s2));
// => Hark! GROSSE FEUER-BÄLLE 🔥!
//
// Note the ß is expanded to "SS" by .to_uppercase() which apparently is correct.
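// A hedged sketch contrasting the two standard-library options: to_uppercase() applies the
// Unicode mappings and allocates a new String, while make_ascii_uppercase() really does work
// in place but only touches ASCII letters. `sketch_uppercase` is a hypothetical helper name.
#[allow(dead_code)]
fn sketch_uppercase(text: &str) -> (String, String) {
    let unicode_upper = text.to_uppercase(); // new allocation; "ß" becomes "SS"
    let mut ascii_upper = text.to_string();
    ascii_upper.make_ascii_uppercase(); // in place; non-ASCII characters are left untouched
    (unicode_upper, ascii_upper)
}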
// what if we want to print the individual bytes again? Here we convert each byte in byte_array
// to its decimal String, collect them into a Vec<String>, and print them with join(), which
// gives the same output as above minus the trailing separator.
let bytes_as_strings: Vec<String> = byte_array.iter().map(|i| i.to_string()).collect();
println!("{}", bytes_as_strings.join("|"));
// => 103|114|111|195|159|101|32|70|101|117|101|114|45|66|195|164|108|108|101|32|240|159|148|165|33
// boring decimal numbers again. But what if we were interested in Unicode? (who isn't???? right?)
// chars() gives us an iterator over the string's chars, and a char is a ‘Unicode scalar value’.
for c in s.chars() {
print!("{} U+({:04X}) ", c, c as u32);
}
println!();
// =>
// g U+(0067) r U+(0072) o U+(006F) ß U+(00DF) e U+(0065) U+(0020) F U+(0046) e U+(0065) u
// U+(0075) e U+(0065) r U+(0072) - U+(002D) B U+(0042) ä U+(00E4) l U+(006C) l U+(006C) e
// U+(0065) U+(0020) 🔥 U+(1F525) ! U+(0021)
//
// yep, that emoji has a mighty big codepoint and takes up 4 bytes. I guess that's why there
// are so many damn emoji 🤓
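// A short sketch (an illustration, not part of the original walkthrough) that lists, for each
// char, its byte offset in the string and how many bytes it takes in UTF-8.
// `sketch_utf8_widths` is a hypothetical helper name.
#[allow(dead_code)]
fn sketch_utf8_widths(text: &str) -> Vec<(usize, char, usize)> {
    // char_indices() yields (byte offset, char); len_utf8() is 1 to 4 bytes.
    text.char_indices().map(|(i, c)| (i, c, c.len_utf8())).collect()
}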
// Now, what if we wanted to construct a byte array of our own, and convert it to a String?
// let's copy the byte_array, make it mutable, and replace the emoji's 4 bytes with a U+2661
// which should be a heart shape! yay!
//
// first we have to copy the original string's bytes into a mutable Vec
let mut new_bytes: Vec<u8> = s.bytes().collect();
// splice() makes it too easy to insert our desired values and remove the extra byte: the
// emoji needs 4 bytes in UTF-8, while the heart, like everything in the Basic Multilingual
// Plane above U+07FF, needs only 3. I cheated to figure out what the 3 decimal values should be.
new_bytes.splice(20..24, [226, 153, 161]);
for n in new_bytes.iter() {
print!("{}|", n);
}
println!();
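// A sketch of how to get those byte values without cheating, using char::encode_utf8 from the
// standard library. `sketch_encode_char` is a hypothetical helper name used only here.
#[allow(dead_code)]
fn sketch_encode_char(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // 4 bytes is the longest UTF-8 encoding of a single char
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}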
// Here we use std::str::from_utf8 instead of String::from_utf8 because the latter does not
// take a reference: it takes the Vec<u8> by value, so new_bytes would be moved into it. This
// upsets the compiler when we want to use new_bytes a few lines further down. (It cannot
// simply "give back" ownership, because on success the Vec's buffer becomes the String's
// buffer without copying; on failure the bytes can be recovered from the error.) In any case
// this version only borrows the bytes and returns a &str that points into new_bytes.
let edited = str::from_utf8(&new_bytes).unwrap();
println!("{}", edited);
// =>
// große Feuer-Bälle ♡!
//
// Sweet, we are well on our way to cloning emacs!
//
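// A hedged sketch of the owning variant, for comparison: String::from_utf8 consumes the Vec
// and reuses its buffer on success, and on failure FromUtf8Error::into_bytes() hands the
// original bytes back. `sketch_owned_from_utf8` is a hypothetical helper name.
#[allow(dead_code)]
fn sketch_owned_from_utf8(bytes: Vec<u8>) -> Result<String, Vec<u8>> {
    String::from_utf8(bytes).map_err(|err| err.into_bytes())
}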
// and what does Rust do if you try to String-ify a sequence of bytes that isn't valid UTF-8?
new_bytes[7] = 199;
if let Err(ohno) = str::from_utf8(&new_bytes) {
eprintln!("{}", ohno);
}
// => invalid utf-8 sequence of 1 bytes from index 7
//
// Beautiful. Rust is annoyingly pedantic and uncompromising. But it's nice to know the
// standard library has absolutely no tolerance for nonsense bugs that are so common in other
// languages: a &str or String is always valid UTF-8, checked here at runtime.
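// A minimal sketch of two ways to cope with bad input, assuming we would rather inspect or
// repair it than unwrap(): Utf8Error::valid_up_to() says how many leading bytes were fine, and
// String::from_utf8_lossy() swaps each invalid sequence for U+FFFD.
// `sketch_inspect_invalid` is a hypothetical helper name.
#[allow(dead_code)]
fn sketch_inspect_invalid(bytes: &[u8]) -> (usize, String) {
    let valid_up_to = match std::str::from_utf8(bytes) {
        Ok(text) => text.len(),
        Err(err) => err.valid_up_to(),
    };
    (valid_up_to, String::from_utf8_lossy(bytes).into_owned())
}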
}
fn hark(text: &str) {
println!("Hark! {}", text);
}
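// Takes &mut str only to demonstrate a mutable borrow. to_uppercase() never modifies its
// input; it returns a freshly allocated String.
fn upper(text: &mut str) -> String {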
text.to_uppercase()
}
/*
* Credit to:
* https://blog.thoughtram.io/string-vs-str-in-rust/
* https://fasterthanli.me/articles/working-with-strings-in-rust
* https://www.reddit.com/r/rust/comments/fcuq8x/understanding_string_and_str_in_rust/
*/