Skip to content

Instantly share code, notes, and snippets.

@djspiewak
Last active April 16, 2019 21:59
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save djspiewak/b72c27179733d6479e380cbcf6fbafc7 to your computer and use it in GitHub Desktop.
Save djspiewak/b72c27179733d6479e380cbcf6fbafc7 to your computer and use it in GitHub Desktop.
// we really want to vectorize something like this:
val bytes: Array[Byte] = ???
val results: Array[Byte] = new Array[Byte](bytes.length)
var i = 0
while (i < bytes.length) { // assume bytes.length % 4 == 0
results(i) |= bytes(i) & '"'
}
// the idea is to look for quote characters
//
// anyway, this *almost* works, but bitwise operations
// are not defined on bytes (ironically), only on int/long
// so this triggers coercions, which messes up the auto
// vectorization
// what we need to do is get it into Array[Int] form...
// ...without copying it
// the problem here is that Array[Byte] != Array[Int], which
// the JVM will happily inform us of if we attempt to cast
bytes.asInstanceOf[Array[Int]] // => ClassCastException
// so we have to trick it! 😈
import sun.misc.Unsafe // oooooh boy here we go!
val f = classOf[Unsafe].getDeclaredField("theUnsafe")
f.setAccessible(true)
val unsafe = f.get(null).asInstanceOf[Unsafe]
// we can use Unsafe to dig into the Array representation.
// spoiler alert: the class tag is stored in the header,
// meaning that we can modify that tag without touching
// anything else, and trick the JVM into thinking that
// bytes is *actually* an Array[Int]
// Array[Byte](1, 2, 3, 4) looks like this in memory:
//
// 0| 09 00 00 00
// 4| 00 00 00 00
// 8| f5 00 00 f8 <-- this is the class tag!
// 12| 04 00 00 00 <-- length
// 16| 01 02 03 04 <-- data
// the length of the leading header is Unsafe.ADDRESS_SIZE
val TagOffset = Unsafe.ADDRESS_SIZE.toLong
val BytesClassTag = unsafe.getInt(bytes, TagOffset)
val IntsClassTag = unsafe.getInt(Array[Int](0), TagOffset)
// evil mode: engage!
unsafe.putInt(bytes, TagOffset, IntsClassTag)
val ints = bytes.asInstanceOf[Array[Int]] // holy moly it works!
// so that's a static_cast, for those of you keeping score at home
// no data has been copied
// note though that byte arrays are *packed*, so the meaning of
// ints.length is, uh, different and wrong. But fortunately, wrong
// in a good way. Imagine we had somehow saved off bytes before
// performing our witchcraft:
(ints(0) >> 0) & 0xFF == originalBytes(0)
(ints(0) >> 8) & 0xFF == originalBytes(1)
(ints(0) >> 16) & 0xFF == originalBytes(2)
(ints(0) >> 24) & 0xFF == originalBytes(3)
// in other words, each set of four bytes is packed into one int
val results: Array[Int] = new Array[Int](bytes.length / 4)
val sigil = '"' | ('"' << 8) | ('"' << 16) | ('"' << 24)
var i = 0
while (i < ints.length / 4) {
results(i) |= ints(i) & sigil
}
// this vectorizes. get rekt, managed runtime
unsafe.putInt(bytes, TagOffset, BytesClassTag)
// ...whistles innocently
@phderome
Copy link

phderome commented Apr 14, 2019

There is a fair bit I don't understand, though that does not matter too much (I have no reason to doubt the validity of your solution, especially since you're not anonymous).

One thing though...Is this pseudo-code in part or working code? I don't get the while loop that don't increment i at lines 7 and 76.

I see now that unsafe.putInt is updating the classtag from array[byte] to array[int].

More fundamentally, if you try to identify presence of '"' at particular byte location I don't see how bitwise & helps (perhaps it relates to discussion at lines 60-70).

@plokhotnyuk
Copy link

sun.misc.Unsafe is really unsafe:

https://bugs.openjdk.java.net/browse/JDK-8202414

@djspiewak
Copy link
Author

I’m not writing anything other than the tag though. Basically I’m tricking the JVM into interpreting a bunch of bytes as a bunch of ints. The data is simple enough that this works with the appropriate masking.

@djspiewak
Copy link
Author

But yes, it is very very easy to get segfaults while doing stuff like this. Just reading ints(1) is enough most of the time.

@phderome
Copy link

the revision introducing 'sigil' does make the presentation clearer.

@Ichoran
Copy link

Ichoran commented Apr 16, 2019

Why go to all that trouble instead of just getting a pointer to the data inside the byte array and then use getInt and putInt?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment