Skip to content

Instantly share code, notes, and snippets.

@polytypic
Last active February 15, 2024 09:47
Show Gist options
  • Save polytypic/4b6793782a7ed68debb3da8215b1d2b9 to your computer and use it in GitHub Desktop.
Save polytypic/4b6793782a7ed68debb3da8215b1d2b9 to your computer and use it in GitHub Desktop.
Fear the `Obj.magic`

Fear the Obj.magic

Reaction of a junior OCaml programmer being suggested to use Obj.magic

Let me first quote a paragraph from the Memory Representation of Values (with my emphasis):

Obj is an undocumented module that exposes the internals of the OCaml compiler and runtime. It is very useful for examining and understanding how your code will behave at runtime but should never be used for production code unless you understand the implications. The module bypasses the OCaml type system, making memory corruption and segmentation faults possible.

If you ask about Obj.magic on public OCaml forums, you'll likely get a warning of some sort. Here is mine: If you found this note, because you are just learning OCaml and have no idea of what you are doing, you should move on. Avoid Obj.magic at all cost.

If, on the other hand, you are actually implementing a supposedly high-performance data structure and care about the cost, and you want to understand the implications, then this note might be for you. Knowledge and understanding are far more useful than fear of the unknown. Understanding the memory representation of values can significantly help one when trying to improve the performance of data structures in OCaml.

In this note I'm going to first talk a bit about the memory representation of values and about unboxed float arrays. This is not meant to be a comprehensive treatment on the subject, however. I only want to point out a number of specific features. Then I'll go through a number of inefficiencies one tends to encounter when trying to implement high-performance data structures and I'll briefly show how one might use Obj.magic to avoid them.

On the memory representation of values

As explained in numerous other documents, the OCaml runtime uses a uniform memory representation of values. What this means is that an OCaml value is stored as a single, 32-bit or 64-bit, word and is either an immediate integer or a pointer. A bit in each value indicates whether the value is an immediate integer or a pointer. If the value is a pointer, it will point to a block of memory that begins with a control word containing information on the block. The information includes, among other things, the size of the block and a "tag" that tells what kind of block it is.

In the rest of this note we are mostly interested in the representation of these blocks and, in particular, we need to distinguish between a few different kinds of blocks:

  • constructor blocks, having one of the constructor_tags,
  • double blocks, having the double_tag, and
  • double array blocks, having the double_array_tag.

A constructor block contains entries stored as 32-bit or 64-bit words, just like any other OCaml values, a double block contains a single double precision floating point number stored as a 64-bit entry, taking either two 32-bit words or one 64-bit word, and a double array block contains double precision floating point numbers stored as 64-bit entries.

Entries in those different types of blocks need to be accessed in different ways. Using Obj.magic we can trick OCaml into performing accesses on values that it would not otherwise allow. If we do that carelessly, it is possible to trick OCaml into performing invalid accesses, that, if we are lucky, fail at runtime and crash the program.

Examining the representation of values using the Obj module

Let's create a little helper named tag_name_of that tells us the tag of an OCaml value:

let tag_names = [|
  Obj.forcing_tag, "forcing_tag";
  Obj.cont_tag, "cont_tag";
  Obj.lazy_tag, "lazy_tag";
  Obj.closure_tag, "closure_tag";
  Obj.object_tag, "object_tag";
  Obj.infix_tag, "infix_tag";
  Obj.forward_tag, "forward_tag";
  Obj.no_scan_tag, "no_scan_tag";
  Obj.abstract_tag, "abstract_tag";
  Obj.string_tag, "string_tag";
  Obj.double_tag, "double_tag";
  Obj.double_array_tag, "double_array_tag";
  Obj.custom_tag, "custom_tag";
  Obj.int_tag, "int_tag";
  Obj.out_of_heap_tag, "out_of_heap_tag";
  Obj.unaligned_tag, "unaligned_tag";
|]

let tag_name_of x =
  let tag = Obj.tag (Obj.repr x) in
  if Obj.first_non_constant_constructor_tag <= tag &&
     tag <= Obj.last_non_constant_constructor_tag then
    Printf.sprintf "constructor_tag %d" tag
  else
    match Array.find_opt (fun (tag', _) -> tag == tag') tag_names with
    | Some (_, name) -> name
    | None -> Printf.sprintf "unknown tag %d" tag

Let's then examine some OCaml values. An int value is stored as an immediate integer value as are nullary or constant constructors:

# tag_name_of 101
- : string = "int_tag"

# tag_name_of None
- : string = "int_tag"

# tag_name_of `Nullary
- : string = "int_tag"

As one might expect a float is stored as block with the double_tag:

# tag_name_of 1.01
- : string = "double_tag"

Pretty much everything else, i.e. non-constant constructors, most records and tuples, ref-cells (which are actually records), and most arrays, are stored as constructor blocks:

# tag_name_of (Ok "value")
- : string = "constructor_tag 0"

# tag_name_of (Error "value")
- : string = "constructor_tag 1"

# tag_name_of (4.2, 1.01)
- : string = "constructor_tag 0"

# tag_name_of (ref 1.01)
- : string = "constructor_tag 0"

# tag_name_of (Atomic.make 1.01)
- : string = "constructor_tag 0"

# tag_name_of [|101; 42|]
- : string = "constructor_tag 0"

The notable exceptions are records that only contain float fields and float arrays:

# let open struct type vec2 = { x: float; y: float } end in
  tag_name_of { x = 4.2; y = 1.01 }
- : string = "double_array_tag"

# tag_name_of (Array.make 101 4.1)
- : string = "double_array_tag"

As explained in the post About unboxed float arrays, the special treatment of float arrays has both pros and cons. In this note we are mostly interested in the cons, how to deal with the possibility of float arrays, and how to avoid the costs when they are unnecessary.

Array accesses

Let's examine the code that OCaml generates for array accesses using the Compiler Explorer. OCaml generates different code for an array access depending on what it knows about the type of the array elements.

Array of immediate type

For an array of immediate elements:

type immediate [@@immediate]

let immediate_get (xs: immediate array) i =
  Array.unsafe_get xs i

let immediate_set (xs: immediate array) i v =
  Array.unsafe_set xs i v

OCaml generates what one might expect:

camlExample__immediate_get_269:
        movq    -4(%rax,%rbx,4), %rax
        ret
camlExample__immediate_set_317:
        movq    %rdi, -4(%rax,%rbx,4)
        movl    $1, %eax
        ret

This is the most efficient case — essentially a single instruction for an array access.

Array of non-float type

Things get more interesting for an array of non-float elements:

type non_float = [ `Non_float of non_float ]

let non_float_get (xs: non_float array) i =
  Array.unsafe_get xs i

let non_float_set (xs: non_float array) i v =
  Array.unsafe_set xs i v

OCaml has no way to say to the compiler that an abstract type is not a float. So, above we define a concrete type non_float that the compiler can analyze and see that it is not an immediate type and also that it isn't a float. The compiler generates array accesses that assume that the values may be pointers:

camlExample__non_float_get_323:
        movq    -4(%rax,%rbx,4), %rax
        ret
camlExample__non_float_set_327:
        subq    $8, %rsp
        movq    %rdi, %rsi
        leaq    -4(%rax,%rbx,4), %rdi
        movq    %rsp, %rbp
        movq    56(%r14), %rsp
        call    caml_modify@PLT
        movq    %rbp, %rsp
        movl    $1, %eax
        addq    $8, %rsp
        ret

A get is still optimal, but now a set calls caml_modify. The caml_modify function, which is part of the OCaml runtime, will perform the necessary GC write barrier and fences mandated by the OCaml memory model.

Array of float

Things get perhaps even more interesting for an array of float elements:

let float_get (xs: float array) i =
  Array.unsafe_get xs i

let float_set (xs: float array) i v =
  Array.unsafe_set xs i v

While a set is still fairly efficient, a get actually allocates memory:

camlExample__float_get_332:
        subq    $8, %rsp
        subq    $16, %r15               ; Allocate 16 bytes for double block
        cmpq    (%r14), %r15            ; Check whether a GC call is needed
        jb      .L105
.L107:
        leaq    8(%r15), %rdi
        movq    $1277, -8(%rdi)         ; Write double block header word
        movsd   -4(%rax,%rbx,4), %xmm0  ; Read from array
        movsd   %xmm0, (%rdi)           ; Write double block value
        movq    %rdi, %rax
        addq    $8, %rsp
        ret
.L105:
        call    caml_call_gc@PLT
.L106:
        jmp     .L107
camlExample__float_set_336:
        movsd   (%rdi), %xmm0           ; Read double block value (*)
        movsd   %xmm0, -4(%rax,%rbx,4)  ; Write double array element
        movl    $1, %eax
        ret

⚠️ There is one particularly important thing to note in the float_set code. A float value is generally represented as a double block. So, before writing the value to the double array block, the value is read from the double block. Now, consider the following function:

let will_this_crash_or_not xs =
  float_set xs 0 (Obj.magic ())

If you call will_this_crash_or_not [| 1.01 |] the program will most likely crash. Notice the instruction marked with a (*). It attempts to read the value from a double block using the value in register rdi as a pointer. Obj.magic () is an immediate value. When that immediate value is interpreted as a pointer, it will most likely point to an address in memory that the program is not allowed to access and so the call will most likely crash.

Array of unknown type

Based on what we saw previously, one might already be able to guess what happens when one accesses arrays of unknown (abstract or polymorphic) type:

let polymorphic_get xs i =
  Array.unsafe_get xs i

let polymorphic_set xs i v =
  Array.unsafe_set xs i v

The compiler has no choice but to generate code that can handle any one of the previous cases:

camlExample__polymorphic_get_341:
        subq    $8, %rsp
        movzbq  -8(%rax), %rdi          ; Fetch tag from array header word
        cmpq    $254, %rdi              ; Is it a double array?
        je      .L109
        movq    -4(%rax,%rbx,4), %rax   ; (*)
        addq    $8, %rsp
        ret
.L109:
        subq    $16, %r15               ; Allocate 16 bytes for double block
        cmpq    (%r14), %r15            ; Check whether a GC call is needed
        jb      .L111
.L113:
        leaq    8(%r15), %rdi
        movq    $1277, -8(%rdi)         ; Write double block header word
        movsd   -4(%rax,%rbx,4), %xmm0  ; Read from array
        movsd   %xmm0, (%rdi)           ; Write double block value
        movq    %rdi, %rax
        addq    $8, %rsp
        ret
.L111:
        call    caml_call_gc@PLT
.L112:
        jmp     .L113
camlExample__polymorphic_set_345:
        subq    $8, %rsp
        movq    %rdi, %rsi
        movzbq  -8(%rax), %rdi          ; Fetch tag from array header word
        cmpq    $254, %rdi              ; Is it a double array?
        je      .L115
        leaq    -4(%rax,%rbx,4), %rdi
        movq    %rsp, %rbp
        movq    56(%r14), %rsp
        call    caml_modify@PLT         ; (*)
        movq    %rbp, %rsp
        jmp     .L114
.L115:
        movsd   (%rsi), %xmm0           ; Read double block value
        movsd   %xmm0, -4(%rax,%rbx,4)  ; Write double array element
.L114:
        movl    $1, %eax
        addq    $8, %rsp
        ret

That is quite an explosion from the best case of a single instruction for both a get and a set. The possibility of dealing with a float array significantly increases the amount of code generated for and slows down array accesses. I refer to this effect as float array pessimization — the compiler pessimistically generates expensive code.

⚠️ There is one thing that I'd like to point out at this point. Notice the two lines marked with (*). When the compiler does not know the representation of the elements, a single instruction is sufficient for the get and a call to caml_modify is sufficient for the set of the non-float array case. The caml_modify function can handle both immediate and pointer like values, including float values, i.e. double blocks, but it does not handle float arrays as targets.

Summary

The OCaml runtime uses a uniform representation values, where a value is either an immediate integer value or a pointer to a block, which can, for our purposes, be a constructor block, a double block, or a double array block. When using Obj.magic it is important to distinguish between these representations. Avoiding float array pessimization can both significantly reduce code size and improve performance.

Using Obj.magic

In the following subsections I will go through a few examples or cases where Obj.magic can be used to avoid some inefficiency.

A unique zero value

Consider the following data structure:

module type Mvar = sig
  type 'a t
  val create : 'a option -> 'a t
  val take_opt : 'a t -> 'a option
  val try_give : 'a t -> 'a -> bool
end

module Mvar_opt : Mvar = struct
  type 'a t = 'a option ref

  let create = ref

  let[@poll error][@inline never] take_opt t =
    match !t with
    | None -> None
    | Some _ as some ->
      t := None;
      some

  let[@poll error][@inline never] try_give t opt =
    !t == None && begin
      t := opt;
      true
    end

  let try_give t v =
    (* A [@poll error] function must not allocate, which is why we do the
       allocation outside of the critical section. *)
    try_give t (Some v)
end

The above could be the beginnings of a wait-free systhread-safe implementation of a so called M-var.

A performance issue with the above implementation is that a try_give operation requires an allocation and that there is an extra indirection due to the use of an option.

If there was a way to have a zero value that is distinct from all other values then we wouldn't need the option.

Well, there is a way. Recall that the OCaml runtime uses a uniform representation of values. Each value is either an immediate integer value or a pointer to a block. A pointer to a unique block is distinct from all other values. Allocating a fresh mutable block, such as a ref cell, gives us a unique value. We can then use Obj.magic to cast that unique value to any type. We then need to make sure that we don't let the zero value escape outside of the M-var implementation:

module Mvar_zero : Mvar = struct
  type 'a t = 'a ref

  let zero =
    (* [zero] is a unique value distinct from all other OCaml values. *)
    Obj.magic (ref 101)

  let create opt =
    (* The [match] below is in an inlinable function and there is hope that an
       optimizing compiler for OCaml could eliminate the implied allocation of
       [Some value] in some cases. *)
    ref (match opt with None -> zero | Some value -> value)

  let[@poll error][@inline never] take_zero t =
    let value_or_zero = !t in
    (* We compare using [!=] or [==]. *)
    if value_or_zero != zero then
      t := zero;
    value_or_zero

  let take_opt t =
    let value_or_zero = take_zero t in
    if value_or_zero == zero then
      None
    else
      (* The allocation below is in an inlinable function and there is hope that
         an optimizing compiler for OCaml could eliminate this allocation. *)
      Some value_or_zero

  let[@poll error][@inline never] try_give t value =
    !t == zero && begin
      t := value;
      true
    end
end

Let's give it a spin:

# let mv = Mvar_zero.create (Some 4.2)
val mv : float Mvar_zero.t = <abstr>

# Mvar_zero.take_opt mv
- : float option = Some 4.2

# Mvar_zero.take_opt mv
- : float option = None

# Mvar_zero.try_give mv 1.01
- : bool = true

# Mvar_zero.take_opt mv
- : float option = Some 1.01

Is this safe?

A ref cell is allocated as a constructor block, because the representation of the value it holds is unknown. In other words, a ref cell is not subject to the float array pessimization. Because the representation of the element stored by M-var is unknown to the compiler, the compiler will generate calls to caml_modify to update the value and that is safe for both immediate values and pointer values. When used with a float value, the float value will be treated as a uniformly represented value like any other pointer value.

Here is a Compiler Explorer link for this example.

Avoiding space leaks

Consider the following node based and wait-free systhread-safe queue using a GADT to avoid some unnecessary match cases:

module type Queue = sig
  type 'a t
  val create : unit -> 'a t
  val enqueue : 'a t -> 'a -> unit
  val dequeue_opt : 'a t -> 'a option
end

module Queue_opt : Queue = struct
  type ('a, _) node =
    | Nil : ('a, [> `Nil]) node
    | Next : {
        mutable next: ('a, [`Nil | `Next]) node;
        mutable value: 'a option; (* 1 *)
      } -> ('a, [> `Next]) node

  type 'a t = {
      mutable head: ('a, [`Next]) node;
      mutable tail: ('a, [`Next]) node;
    }

  let create () =
    let dummy = Next {
        next = Nil;
        value = None; (* 2 *)
      }
    in
    { head = dummy; tail = dummy }

  let[@poll error][@inline never]
    enqueue t (Next new_tail_r as new_tail : (_, [< `Next]) node) =
    let Next old_tail_r = t.tail in
    t.tail <- new_tail;
    old_tail_r.next <- new_tail

  let enqueue t v =
    let (Next _ as new_tail : (_, [< `Next]) node) = Next {
        next = Nil;
        value = Some v (* 3 *)
      }
    in
    enqueue t new_tail

  let[@poll error][@inline never] dequeue_opt t =
    let Next old_head_r = t.head in
    match old_head_r.next with
    | Nil -> None
    | Next new_head_r as new_head ->
      let result = new_head_r.value in
      new_head_r.value <- None; (* 4 *)
      t.head <- new_head;
      result
end

Notice the lines marked with (* ? *). The queue makes it so that both head and tail always point to a non-Nil node. In other words, the queue uses a so called sentinel node. The value of the node pointed to by the head is not logically part of the queue and the next field of the node pointed to by the tail is Nil. When the queue is empty, head and tail point to the same node.

When we remove a value from the queue, we need to make sure that the value is not retained by the queue. This is the reason why we wrap the value as an option (1), which requires allocating extra space (3), and populate (2) or overwrite (3) the value field with None when the node is empty.

We could use the same trick as we used in the previous section to create a unique zero value to avoid allocating extra space for the option and avoid the associated indirection, but in this case we can do even better. We can simply use Obj.magic () as the "no leak" value, because we don't need to be able to tell it apart from valid values:

module Queue_unit : Queue = struct
  type ('a, _) node =
    | Nil : ('a, [> `Nil]) node
    | Next : {
        mutable next: ('a, [`Nil | `Next]) node;
        mutable value: 'a; (* 1 *)
      } -> ('a, [> `Next]) node

  type 'a t = {
      mutable head: ('a, [`Next]) node;
      mutable tail: ('a, [`Next]) node;
    }

  let create () =
    let dummy = Next {
        next = Nil;
        value = Obj.magic (); (* 2 *)
      }
    in
    { head = dummy; tail = dummy }

  let[@poll error][@inline never]
    enqueue t (Next new_tail_r as new_tail : (_, [< `Next]) node) =
    let Next old_tail_r = t.tail in
    t.tail <- new_tail;
    old_tail_r.next <- new_tail

  let enqueue t v =
    let (Next _ as new_tail : (_, [< `Next]) node) = Next {
        next = Nil;
        value = v (* 3 *)
      }
    in
    enqueue t new_tail

  let[@poll error][@inline never] dequeue_node t =
    let Next old_head_r = t.head in
    match old_head_r.next with
    | Nil -> Nil
    | Next _ as new_head ->
      t.head <- new_head;
      new_head

  let dequeue_opt t =
    match dequeue_node t with
    | Nil -> None
    | Next new_head_r ->
      let value = new_head_r.value in
      new_head_r.value <- Obj.magic (); (* 4 *)
      (* The allocation below is in an inlinable function and there is hope that
         an optimizing compiler for OCaml could eliminate the allocation. *)
      Some value
end

Let's give it a spin:

# let q : float Queue_unit.t = Queue_unit.create ()
val q : float Queue_unit.t = <abstr>

# Queue_unit.dequeue_opt q
- : float option = None

# Queue_unit.enqueue q 1.01
- : unit = ()

# Queue_unit.enqueue q 4.2
- : unit = ()

# Queue_unit.dequeue_opt q
- : float option = Some 1.01

# Queue_unit.dequeue_opt q
- : float option = Some 4.2

# Queue_unit.dequeue_opt q
- : float option = None

Is this safe?

A Next _ node will be allocated as a constructor block, because the next field is not a float. The representation of 'a is also unknown to the compiler. The compiler will treat the value field like elements of a non-float array and Obj.magic (), which is an immediate value, can safely be stored in such a field.

Here is a Compiler Explorer link for this example.

Avoiding float array pessimization

Arrays are a very common building block of high-performance data structures.

Consider the following array based wait-free systhread-safe bounded queue implementation:

module type Bounded_queue = sig
  type 'a t
  val create : min_capacity:int -> 'a t
  val try_enqueue : 'a t -> 'a -> bool
  val dequeue_opt : 'a t -> 'a option
end

let ceil_pow2 x =
  if x <= 0 then 1
  else
    let x = x - 1 in
    let x = x lor (x lsr 1) in
    let x = x lor (x lsr 2) in
    let x = x lor (x lsr 4) in
    let x = x lor (x lsr 8) in
    let x = x lor (x lsr 16) in
    let x = if Sys.int_size <= 32 then x else x lor (x lsr 32) in
    x + 1

module Bounded_queue_opt : Bounded_queue = struct
  type 'a t = {
      mutable head : int;
      mutable tail : int;
      array : 'a option array; (* 1 *)
    }

  let create ~min_capacity =
    let capacity = ceil_pow2 min_capacity in {
      head = 0;
      tail = 0;
      array = Array.make capacity None; (* 2 *)
    }

  let[@poll error][@inline never] try_enqueue t opt =
    let capacity = Array.length t.array in
    t.tail - t.head < capacity && begin
      Array.unsafe_set t.array (t.tail land (capacity - 1)) opt;
      t.tail <- t.tail + 1;
      true
    end

  let try_enqueue t v = try_enqueue t (Some v) (* 3 *)

  let[@poll error][@inline never] dequeue_opt t =
    let size = t.tail - t.head in
    if size <= 0 then
      None
    else
      let capacity = Array.length t.array in
      let index = t.head land (capacity - 1) in
      let result = Array.unsafe_get t.array index in
      Array.unsafe_set t.array index None; (* 4 *)
      t.head <- t.head + 1;
      result
end

Notice the lines with comments. Once again, to avoid space leaks, we wrap values as options (1, 3) and populate (2) or overwrite (4) with None to avoid leaks.

To avoid the extra allocations and indirections one could use an array of plain values 'a array and use Obj.magic () as the elements. Unfortunately, doing so would trigger the float array pessimization. Aside from the performance cost, it would also make it somewhat more tricky to implement a wait-free systhread-safe version of the queue, because getting an element from the array would potentially require an allocation.

We can avoid the float array pessimization by making OCaml think it is accessing an array of non_float elements and we can also use the trick of creating a zero value to avoid allocations:

module Bounded_queue_nfa : Bounded_queue = struct
  type non_float = [ `Non_float of non_float ]

  type 'a t = {
      mutable head : int;
      mutable tail : int;
      array : non_float array;
    }

  let create ~min_capacity =
    let capacity = ceil_pow2 min_capacity in {
      head = 0;
      tail = 0;
      array = Array.make capacity (Obj.magic ()) (* 1 *)
    }

  let[@poll error][@inline never] try_enqueue t v =
    let capacity = Array.length t.array in
    t.tail - t.head < capacity && begin
      Array.unsafe_set t.array (t.tail land (capacity - 1)) (Obj.magic v); (* 2 *)
      t.tail <- t.tail + 1;
      true
    end

  let zero = Obj.magic (ref 42) (* 3 *)

  let[@poll error][@inline never] dequeue_zero t =
    let size = t.tail - t.head in
    if size <= 0 then
      zero (* 4 *)
    else
      let capacity = Array.length t.array in
      let index = t.head land (capacity - 1) in
      let result = Array.unsafe_get t.array index in (* 5 *)
      Array.unsafe_set t.array index (Obj.magic ()); (* 6 *)
      t.head <- t.head + 1;
      result

  let dequeue_opt t =
    let value_or_zero = dequeue_zero t in
    if value_or_zero == zero then (* 7 *)
      None
    else
      Some (Obj.magic value_or_zero) (* 8 *)
end

Let's give it a spin:

# let bq : float Bounded_queue_nfa.t = Bounded_queue_nfa.create ~min_capacity:2
val bq : float Bounded_queue_nfa.t = <abstr>

# Bounded_queue_nfa.dequeue_opt bq
- : float option = None

# Bounded_queue_nfa.try_enqueue bq 1.01
- : bool = true

# Bounded_queue_nfa.try_enqueue bq 4.2
- : bool = true

# Bounded_queue_nfa.try_enqueue bq 7.6
- : bool = false

# Bounded_queue_nfa.dequeue_opt bq
- : float option = Some 1.01

# Bounded_queue_nfa.try_enqueue bq 7.6
- : bool = true

# Bounded_queue_nfa.dequeue_opt bq
- : float option = Some 4.2

# Bounded_queue_nfa.dequeue_opt bq
- : float option = Some 7.6

# Bounded_queue_nfa.dequeue_opt bq
- : float option = None

Is this safe?

To access an element of a non_float array, OCaml will generate code that works with both immediate and pointer values, including float values, i.e. double blocks, but will not with a float array target. OCaml decides the array representation when allocating the array (1) based on the initial value. We populate the array with Obj.magic () values. This makes OCaml allocate the array as a constructor block and not as a double array block. The reads (5) and writes (2, 6) are therefore safe as is the cast (8) to the result type after a successful dequeue. The use of a unique value (3) used to indicate an empty queue (4, 7) is also safe.

⚠️ But, wait, there is a gotcha! There is now a discrepancy between the type of the array and the type of elements we actually store into the array. The basic OCaml compilers do not notice the problem. However, in some cases the Flambda compiler does notice such discrepancies. In this particular case, we used the [@poll error] and [@inline never] annotations on some of the functions, because we want those functions to not contain any poll points. If we remove those annotations Flambda will complain:

>> Fatal error: Assignment of a float to a specialised non-float array:
[...]

In this case, disallowing inlining prevents Flambda from exposing the sleight of hand.

Here is a Compiler Explorer link for this example.

What about abstract types?

In the previous example, we used the non_float type to trick OCaml to generate the array accesses we wanted. This trick was used, because we essentially wanted to use an array of any given type. If the array elements need not be of arbitrary type, but are rather known to be of some type whose representation is known to the compiler, then the non_float trick may be unnecessary.

Unfortunately, there is a catch. The OCaml compiler does not propagate representation information through abstract types.

As an example, at the time of writing this, Atomic.t is an abstract type constructor. This means that, to access an array of atomics, _ Atomic.t array, OCaml will invoke the float array pessimization. Funnily enough, it does that even though it is smart enough to specialize Atomic.get to a single instruction on the x86.

Consider the following example:

let atomic_array_get xs i =
  let x = Array.unsafe_get xs i in (* 1 *)
  Atomic.get x (* 2 *)

As one can confirm with Compiler Explorer, the compiler is smart enough to generate a single instruction corresponding to the Atomic.get:

camlExample__atomic_array_get_268:
        subq    $8, %rsp
        movzbq  -8(%rax), %rdi
        cmpq    $254, %rdi               ; Is it a double array?
        je      .L101
        movq    -4(%rax,%rbx,4), %rdi
        jmp     .L100
.L101:
        subq    $16, %r15
        cmpq    (%r14), %r15
        jb      .L103
.L105:
        leaq    8(%r15), %rdi
        movq    $1277, -8(%rdi)
        movsd   -4(%rax,%rbx,4), %xmm0
        movsd   %xmm0, (%rdi)
.L100:
        movq    (%rdi), %rax             ; Atomic.get
        addq    $8, %rsp
        ret
.L103:
        call    caml_call_gc@PLT
.L104:
        jmp     .L105

However, at the same time, the compiler generates code for a potential float array.

Holy float array pessimization!

Indeed.

Avoiding Atomic.t allocation and indirection

OCaml recently acquired multicore support and part of that support is the Atomic module and the ref-like Atomic.t type constructor. Consider the following linked list based lock-free set implementation using physical equality:

module type Physical_set = sig
  type 'a t
  val create : unit -> 'a t
  val try_add : 'a t -> 'a -> bool
  val try_remove : 'a t -> 'a -> bool
end

module Physical_set_external_atomic : Physical_set = struct
  type ('a, _) node =
    | Null : ('a, [> `Null]) node
    | Node : {
        next: 'a link Atomic.t; (* 1 *)
        value: 'a
      } -> ('a, [> `Node]) node
    | Mark : ('a, [< `Null | `Node]) node -> ('a, [> `Mark]) node

  and 'a link =
    | Link : ('a, [< `Null | `Node | `Mark]) node -> 'a link [@@unboxed]

  type 'a t = 'a link Atomic.t

  let create () = Atomic.make (Link Null)

  let rec find_node t prev value : _ -> (_, [< `Null | `Node ]) node = function
    | Link (Mark _) -> find_node t t value (Atomic.get t)
    | Link Null -> Null
    | Link (Node r as node) as before ->
        match Atomic.get r.next with
        | Link (Mark node) ->
            if Atomic.compare_and_set prev before (Link node) then
              find_node t prev value (Link node)
            else
              find_node t prev value (Atomic.get prev)
        | Link (Null | Node _) as next ->
            if r.value == value then
              node
            else
              find_node t r.next value next

  let rec try_add t value =
    let before = Atomic.get t in
    match find_node t t value before with
    | Node _ -> false
    | Null ->
        let after =
          Node {
            next = Atomic.make before; (* 2 *)
            value
          }
        in
        Atomic.compare_and_set t before (Link after)
        || try_add t value

  let rec try_remove t value =
    match find_node t t value (Atomic.get t) with
    | Null -> false
    | Node r ->
        match Atomic.get r.next with
        | Link (Mark _) ->
            false
        | Link ((Null | Node _) as next) as before ->
            let after = Mark next in
            if Atomic.compare_and_set r.next before (Link after)
            then begin
              find_node t t value (Atomic.get t) |> ignore;
              true
            end
            else try_remove t value
end

Notice the lines marked with (* ? *). The links within the list go through Atomic.t blocks. Internally a _ Atomic.t has essentially the same representation as a _ ref. Unfortunately, there is nothing like the mutable specifier for record fields to allow one to "inline" the atomic to the record and avoid the indirection. The use of Atomic.t effectively doubles the number of indirections in the list. Doubling the indirections is effectively the same as making the list twice as long and can actually double the amount of time operations on the data structure take.

We can, however, use Obj.magic to trick the OCaml compiler to treat the first field of a record as an Atomic.t:

module Physical_set_inlined : Physical_set = struct
  type ('a, _) node =
    | Null : ('a, [> `Null]) node
    | Node : {
        mutable next: 'a link; (* 1 *)
        value: 'a
      } -> ('a, [> `Node]) node
    | Mark : ('a, [< `Null | `Node]) node -> ('a, [> `Mark]) node

  and 'a link =
    | Link : ('a, [< `Null | `Node | `Mark]) node -> 'a link [@@unboxed]

  type 'a t = 'a link Atomic.t

  let create () = Atomic.make (Link Null)

  let next_as_atomic (Node _ as node : ('a, [< `Node]) node) =
    (Obj.magic node : 'a link Atomic.t) (* 2 *)

  let rec find_node t prev value : _ -> (_, [< `Null | `Node ]) node = function
    | Link (Mark _) -> find_node t t value (Atomic.get t)
    | Link Null -> Null
    | Link (Node r as node) as before ->
        match Atomic.get (next_as_atomic node) with
        | Link (Mark node) ->
            if Atomic.compare_and_set prev before (Link node) then
              find_node t prev value (Link node)
            else
              find_node t prev value (Atomic.get prev)
        | Link (Null | Node _) as next ->
            if r.value == value then
              node
            else
              find_node t (next_as_atomic node) value next

  let rec try_add t value =
    let before = Atomic.get t in
    match find_node t t value before with
    | Node _ -> false
    | Null ->
        let after =
          Node {
            next = before; (* 3 *)
            value
          }
        in
        Atomic.compare_and_set t before (Link after)
        || try_add t value

  let rec try_remove t value =
    match find_node t t value (Atomic.get t) with
    | Null -> false
    | Node _ as node ->
        match Atomic.get (next_as_atomic node) with
        | Link (Mark _) ->
            false
        | Link ((Null | Node _) as next) as before ->
            let after = Mark next in
            if Atomic.compare_and_set (next_as_atomic node) before (Link after)
            then begin
              find_node t t value (Atomic.get t) |> ignore;
              true
            end
            else try_remove t value
end

Let's give it a spin:

# let s : int Physical_set_inlined.t = Physical_set_inlined.create ()
val s : int Physical_set_inlined.t = <abstr>

# Physical_set_inlined.try_add s 101
- : bool = true

# Physical_set_inlined.try_add s 42
- : bool = true

# Physical_set_inlined.try_remove s 101
- : bool = true

# Physical_set_inlined.try_add s 42
- : bool = false

# Physical_set_inlined.try_remove s 42
- : bool = true

# Physical_set_inlined.try_remove s 101
- : bool = false

Is this safe?

An 'a Atomic.t has the same representation as a 'a ref, which is a record with a single mutable field. No special optimizations are made for floats nor for any other type of values and a Node _ record is also allocated as a constructor block rather than as a float array. OCaml compilers do not perform record layout optimizations, which means that the layouts of 'a Atomic.t and of a record whose first element is of type 'a are compatible and the accesses are safe.

Here is a Compiler Explorer link for this example.

Well...

There is technically no guarantee of any kind when using Obj.magic. The reason for that is that, in the future, the OCaml runtime might be changed in some way, or a new compiler for OCaml might emerge, that makes the reasoning based on the current implementation(s) and uniform representation of values invalid.

Does that make me worried?

No, it does not. Primarily for two reasons:

  • Obj.magic is fairly commonly used. Try the Sherlocode tool, for example, and don't forget to try clicking for more. This is a roundabout way to say that changes to the representation of values are likely to break a lot of existing code. While that does not entirely prevent changes from being made, it is unlikely that such changes would be just suddenly introduced. If significant changes are made, it will take a long time.

    Rare footage of a core OCaml compiler developer getting excited about avoiding float array pessimization sped up by a factor of 10000

    Perhaps, or rather, in our hopes and dreams, one of the most likely future changes is dropping the float array pessimization. That change will not break any of the examples in this note, but would make the uses of Obj.magic to avoid the pessimization redundant. In all likelyhood, it will take many years before that happens.

  • None of the uses of Obj.magic are apparent from the signatures of the data structures. Part of the reason why the motivating examples make no use of Obj.magic is to show that it can be done. What this means is that, should some use of Obj.magic become broken or unsafe, one could always adjust the data structure implementation and release a new version of it.

In other words, it should be easy to fix unlikely future issues and there should be plenty of time to fix them.

Do I recommend using Obj.magic?

No, I don't really recommend it. All other things being equal, I'd rather not have the Obj module at all and would rather have an optimizing compiler that would not suffer from the float array pessimization and would be able to perform data representation optimizations. Unfortunately that is not where we are now and there are cases where Obj.magic can significantly reduce code size and improve performance.

(mdx
(files article.md))
(lang dune 3.8)
(using mdx 0.4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment