Skip to content

Instantly share code, notes, and snippets.

@Karang
Last active June 25, 2025 11:14
Show Gist options
  • Select an option

  • Save Karang/a00e6641eca5958ce7f5459787812e02 to your computer and use it in GitHub Desktop.

Select an option

Save Karang/a00e6641eca5958ce7f5459787812e02 to your computer and use it in GitHub Desktop.
CG Spring Challenge 2025

General algorithm

I used a DFS with a cache. DFS is implemented with a stack and a while loop, no function call. I used 2 stacks, one for the generated boards containing only the board as an int32 and one for the parent board with the hash being accumulated, the number of children and the permutation (used for symetries).

Board representation & SWAR

I used 3 bit per cell in a board fitted into a 32bit integer. I never decode that representation into an array, instead I use SWAR (SIMD within a register) to do computations, shuffling and tests. During move generation, two operations are quite useful: add_saturate and zmask. The first one adds each of the 9 cells of two input board, if the result overflows it is clamped to 7 (maximum representable value in 3 bits). The second set the top bit of each cell to one if the cell is zero, otherwise it clears all bits of the cell.

// Cells positions and order from MSB to LSB:
// 0 1 2
// 3 4 5
// 6 7 8

int add_saturate(int x, int y) {
    auto t0 = (y ^ x) & set(4,4,4, 4,4,4, 4,4,4);
    auto t1 = (y & x) & set(4,4,4, 4,4,4, 4,4,4);
    x = x & set(3,3,3, 3,3,3, 3,3,3);
    y = y & set(3,3,3, 3,3,3, 3,3,3);
    x = x + y;
    t1 = t1 | (t0 & x);
    t1 = (t1 << 1) - (t1 >> 2);
    return (x ^ t0) | t1;
}

int zmask(int x) {
    int32_t y = (x & set(3,3,3, 3,3,3, 3,3,3)) + set(3,3,3, 3,3,3, 3,3,3);
    y = ~(y | x | set(3,3,3, 3,3,3, 3,3,3)); 
    return y;
}

SIMD move generation

Some symetries in the board can be exploited to generate moves. For instance, the four corners can generate either a capture, place a 1, or nothing. The regular nature of move generations makes it suitable for vectorization. I used SSE to generate 4 moves at a time. As an example, here is a snipet of what corner move generation looked like. Similarly, the sides and center can also be vectorized.

// Compute move_mask (1 in top bit of empty cells, 0 if the cell is non-zero)
//
auto move_mask = z3bmask(v) & set(7,7,7, 7,7,7, 7,7,7);
__m128i vv = _mm_set1_epi32(v);
__m128i vmove_mask = _mm_set1_epi32(move_mask);

// Corners move generation:

// Start by computing the sum of each corner neighbor and store result in the corners:
// 1 5 1   3 3 5   13 35 15
// 1 . 1 + 7 . 7 = 17  . 17
// 7 5 7   3 3 5   37 35 57
auto v3x5xxx3x5 = v & set(0,0,0, 7,0,7, 0,0,0);
v3x5xxx3x5 = (v3x5xxx3x5 >> 9) | (v3x5xxx3x5 << 9);
auto v1x1xxx7x7 = v & set(0,7,0, 0,0,0, 0,7,0);
v1x1xxx7x7 = (v1x1xxx7x7 << 3) | (v1x1xxx7x7 >> 3);
auto v1511x1757 = (v1x1xxx7x7 | (v3x5xxx3x5 << 3)) & set(7,7,7,0,0,0,7,7,7) | (v1x1xxx7x7 >> 9);
auto v3357x7335 = (v3x5xxx3x5 | (v3x5xxx3x5 >> 3)) & set(7,7,7,0,0,0,7,7,7) | (v1x1xxx7x7 << 9);
auto pairs_sums = add_saturate(v1511x1757, v3357x7335);

// Check the state to see if the corners are zero
__m128i vcorner_mask = _mm_set_epi32(set1(0,7),set1(2,7),set1(6,7),set1(8,7));
__m128i has_corner = _mm_cmpeq_epi32(_mm_and_si128(vv, vcorner_mask), _mm_set1_epi32(0));
// Check if none of the neighbors are free (we need 2 to make a capture):
__m128i vdiag_mask = _mm_set_epi32(
    set(0,7,0, 7,0,0, 0,0,0),
    set(0,7,0, 0,0,7, 0,0,0),
    set(0,0,0, 7,0,0, 0,7,0),
    set(0,0,0, 0,0,7, 0,7,0));
__m128i has_diag = _mm_cmpeq_epi32(_mm_and_si128(vmove_mask, vdiag_mask), _mm_set1_epi32(0));
// Get the sum of neighbors for each corners:
__m128i sum = _mm_and_si128(_mm_set1_epi32(pairs_sums), vcorner_mask);
// Check if there is a capture (2 neighbors and sum < 7)
__m128i dopush = _mm_and_si128(_mm_cmplt_epi32(sum, vcorner_mask), has_diag);
// The value to push, state + sum, clear the 2 neighbors:
__m128i pushv = _mm_andnot_si128(vdiag_mask, _mm_or_si128(vv, sum));
// The value to push if there was no capture (set the corner to 1):
__m128i vnocapture = _mm_or_si128(vv, _mm_set_epi32(set1(0,1),set1(2,1),set1(6,1),set1(8,1)));
// Blend the capture and no capture case:
pushv = _mm_castps_si128(_mm_blendv_ps(_mm_castsi128_ps(vnocapture), _mm_castsi128_ps(pushv), _mm_castsi128_ps(dopush)));
// Push 0,1,2,3 or 4 values to the stack: 
stack.push(has_corner, pushv); // (use a permute table, movemask and popcount to emulate a compressstore operation)

Cache

Caching is done using a custom open-addressing hashmap. I tuned the array to be as small as possible, and I allow evicting hashes in case of collision. I use SIMD (avx2) to check the key at the index given by the hash and the following 7 keys. If it is found I return the index of the key, otherwise I replace the first element.

int find_(int d, int k, bool insert=false) {
    uint32_t i = ((k * 2654435761) >> (shift)) & mod_mask;
    // Load 8 keys
    __m256i keys = _mm256_loadu_si256((__m256i*)(m_keys+i));
    __m256i vk = _mm256_set1_epi32(k);
    // if we found the key return the index
    __m256i mask = _mm256_cmpeq_epi32(keys, vk);
    if (!_mm256_testz_si256(mask, mask)) return __builtin_ctz(_mm256_movemask_ps(_mm256_castsi256_ps(mask))) + i;
    if (!insert) return i;
    // else if one slot is empty return it
    __m256i zero = _mm256_set1_epi32(0);
    mask = _mm256_cmpeq_epi32(keys, zero);
    if (!_mm256_testz_si256(mask, mask)) return __builtin_ctz(_mm256_movemask_ps(_mm256_castsi256_ps(mask))) + i;
    // else if one slot is smaller than key return it
    mask = _mm256_cmpgt_epi32(vk, keys);
    if (!_mm256_testz_si256(mask, mask)) return __builtin_ctz(_mm256_movemask_ps(_mm256_castsi256_ps(mask))) + i;
    // else return first slot
    return i;
}

Keys are stored on 32 bits. The lower 27 bits are the board state, the upper 5 is the depth (the msb of the depth is cropped, but the lower bits combined with the board state are enough to differentiate boards). I hardcoded the 40 possible 000 000 000 input states, so that my keys are never 0 and I can use 0 keys as a empty flag.

Hash symetries

Handling symetries was a big speedup, and an interesting part of the challenge. The first step is to "canonicalize" a board, ie reduce it to a deterministic board for each of the possible symetric permutations. To do so, I compute the 8 symetries and find the minimum value. Applying symetries on the default board representation was not really playing nice with the 6 rotation symetries, so I first transform the board to a spiral representation:

0 1 2    0 1 2
3 4 5 -> 7 8 3
6 7 8    6 5 4

This representation allowed me to use only 2 shifts, 1 and and 1 or to apply a rotation to a board.

I used avx2 to compute the 6 rotations in one go, and to perform a min reduction with keys in the LSbs.

std::tuple<int32_t, int> canonicalize(int32_t x) {
    // 0 1 2    0 1 2 3 4 5 6 7 8
    // 3 4 5 -> X 0 1 2 5 8 7 6 3
    // 6 7 8    
    int32_t iden
          = (x & set(7,7,7, 0,0,0, 7,0,0));
    iden |= (x & set(0,0,0, 0,0,7, 0,7,0)) << 6;
    iden |= (x & set(0,0,0, 0,0,0, 0,0,7)) << 12;
    iden |= (x & set(0,0,0, 7,0,0, 0,0,0)) >> 12;

    // 0 1 2    2 1 0    0 1 2 3 4 5 6 7 8
    // 3 4 5 -> 5 4 3 -> X 2 1 0 3 6 7 8 5
    // 6 7 8    8 7 6
    int32_t flip
          = (x & set(0,7,0, 7,0,0, 0,0,0));
    flip |= (x & set(0,0,7, 0,0,0, 7,7,7)) << 6;
    flip |= (x & set(7,0,0, 0,0,7, 0,0,0)) >> 6;

    // Perform the rotations:
    __m256i keys = _mm256_set_epi32(iden, iden, iden, iden, flip, flip, flip, flip);
    __m256i a = _mm256_srlv_epi32(keys, _mm256_set_epi32(0,6,12,18,0,6,12,18));
    __m256i b = _mm256_sllv_epi32(keys, _mm256_set_epi32(0,18,12,6,0,18,12,6));
    keys = _mm256_and_si256(_mm256_or_si256(a, b), _mm256_set1_epi32(set(7,7,7,7, 7,7,7,7, 0)));
    // Use the free spot in LSbs to store the keys
    keys = _mm256_or_si256(keys, _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7));

    // Reduce to find the minimum key and type
    __m128i i = _mm256_castsi256_si128( keys );
    i = _mm_min_epi32( i,  _mm256_extractf128_si256( keys, 1 ));
    i = _mm_min_epi32( i, _mm_shuffle_epi32( i, _MM_SHUFFLE(1,0,3,2) ) );
    i = _mm_min_epi32( i, _mm_shuffle_epi32( i, _MM_SHUFFLE(3,2,0,1) ) );
    int key = _mm_cvtsi128_si32( i );
    int type = key & 0x7;

    if (type == 0) return {x, 0};

    // convert back
    // 0 1 2 
    // 3 4 5 -> 0 1 2 5 8 7 6 3 X
    // 6 7 8    0 1 2 3 4 5 6 7 8
    int32_t out = x & set1(4,7);
    out |= (key & set(7,7,7, 0,0,0, 7,0,0)) ;
    out |= (key & set(0,0,0, 7,0,7, 0,0,0)) >> 6;
    out |= (key & set(0,0,0, 0,7,0, 0,0,0)) >> 12;
    out |= (key & set(0,0,0, 0,0,0, 0,7,0)) << 12;

    return {out, type};
}

Once the board have been canonicalized, the second challenge of handling symetries was to store the hash in a way it can still be permuted. For me it was a 9 integer array, stored in the order 0 1 2 3 5 6 7 8 4. Putting the middle digit in last position allows to use avx2 on the first 8 elements to perform additions and permutations (the middle digit never move with symetries). The hash digits are only summed at the end and the modulo 2^30 is applied once.

Hardcoding

Caching is a powerful way to cut work. Offline caching is even better as it allow to lookup the result without doing any computation at run time. However the space is limited. In codingame, the source size is limited to 100k utf-16 codepoints. In this section I'll describe how I encoded the data so that it takes as little space as possible, and how I choosed which hashes to cache, in order to maximize their usefulness.

UTF-16 codepoints are stored in 16bits, but there are non characters values in the range 0xD800-0xDFFF which cannot be used for our purpose. To simplify things, and without loosing too much space, we can restrict ourselves to the lower 15 bits which are always valid characters. A hash value is modulo 2^30, so by splitting it into two 15 bits part, it fits exactly into 2 codepoints.

To be able to fill our cache, we need a key and a Hash 9 element array representing the hashes of each digit. Storing these 10 values as is would take 20 codepoint per cache entry, which would be wasteful, we can do better ! First, we don't need to store the key at all. If we generate the states in a deterministic order, we can replay the generation on the decoding side, saving one value. We also don't need to encode the 9 values per hash, we can once again use symetries. If a board has all symetries we only need to store 3 / 9 hashes, the corner, side and center hashes can then be replicated to fill the blank. Similarly, combinations of different symetries offer replication possibilities. I identified the following cases:

rot1, rot3: (all symetries)
a b a
b c b => 3 / 9
a b a

rot2 & (flip+rot1) & (flip+rot3)
a b c
b d b => 4 / 9
c b a

flip & (flip+rot3) & rot2:
a b a
c d c => 4 / 9
a b a

rot2:
a b c
d e d => 5 / 9
c b a

flip:
a b a
c f c => 6 / 9
d e d

flip+rot1:
a b c
d f b => 6 / 9
e d a 

flip+rot3:
a b c
b f d => 6 / 9
c d e

flip+rot2:
a b c
d f e => 6 / 9
a b c

no sym:
a b c
d i e => 9 / 9
f g h 

To choose which hashes to transmit, I made a 2D plot of how much time a board was taking. On the x axis, I've put the max_depth from 1 to 40 and on the y axis I've put the sum of all dices from 0 to 54, which I'm calling the rank of the board. The rank have the nice property of never decreasing. Which means that if I can cache all boards of a given rank, it creates a barrier that all lower ranks must go through. So caching one rank, also speed up the lower ranks. The most compute intensive boards are in the lower right corner of my plot (low rank, high max_depth). Using the encoding described above, I was able to cache all hashes of rank 5 for max_depth >= 20. The plot of times (ms) with this optimisation look like this:

rank 18( 899857):   6   5   6   5   7   7   7   9  10   9   8   9   9  13  10  14  10  13  12  11  13  11  11  16  10  10  10  14  12  13  13  12  14  15  14
rank 17( 693693):   4   6   6   7   5   7   6   7   8   9  12   8  12  14  11  14  13  19  24  14  22  16  18  21  19  28  20  15  21  13  17  15  24  12  18
rank 16( 518301):   4   5   6   6   7   7   9  10  10   9  14  14  12  18  23  13  13  15  18  20  20  33  17  20  17  17  33  21  21  19  25  40  19  35  61
rank 15( 374808):   4   5   5   5   5   6   7   9   9  14  15  21  15  17  23  20  22  56  20  26  41  54  75  43 110 139  40  41  28 147  64  51  22  20  77
rank 14( 261891):   4   4   3   5   6   7   8   7   9  12  12  13  34  25  20  30  25  34  52  68  28  83  63 105  61  85  43  74  31  36  42  49  62  56  46
rank 13( 176463):   3   4   5   5   7   4  11  11  14  22  18  22  30  28  32  55  61  46  27  62 112 198  37 183 382  67  86 195  76 135 153 179  49 355 662
rank 12( 114387):   3   4   5   5   6   7  13  13  13  20  21  22  23  44  44  58  62  79  81  66  89 170 114 104 137  96 123  97 115 126 167 626 183 105 104
rank 11(  71127):   3   4   5   6   5   7  11  16  14  20  41  22  30  35  38  50  56  74  95 114 169  92 212 120 120 184 162 318 220 275 114 107 195 270 490
rank 10(  42273):   3   3   4   5   6   8   9  12  20  22  24  35  37  77 121  84 120 120 116 163 110 189 171 165 199 202 210 200 226 264 227 277 321 203 316
rank  9(  23905):   3   3   4   5   7   6  10   9  17  19  21  36  59 114 114  94  89 137 118 172 165 276 292 261 314 405 251 361 223 330 310 300 324 361 421
rank  8(  12789):   3   4   5   5   7   8  10  14  17  17  27  32  41  85 120  97 118 148 162 180 194 222 300 303 338 391 421 418 413 422 375 498 470 504 404
rank  7(   6426):   3   3   5   5   8   7   8  12  14  19  22  32  37  66 469  98 120 136 169 177 220 324 329 355 373 364 391 467 450 480 505 465 541 466 359
rank  6(   3003):   3   3   4   5   5   8   9  13  14  16  22  26  32  55  95  91 120 138 157 192 268 262 321 329 357 424 388 470 460 511 395 473 554 632 514
rank  5(   1287):   3   3   4   5   5   6   9  10  13  17  18  24  28  41   2   2   4   4   4   5   6   6   6   7   7   7   7   8   8   9   9  10  10  10  10
rank  4(    495):   3   3   4   4   5   6   5   8  11  15  17  21  25  32  44  44  20  20   5   5   5   6   6   7   7   7   7   8   9   9  10   9   9  10  10
rank  3(    165):   3   2   3   5   5   6   6   8  10  12  15  16  20  24  36  65  73  73  51   5   5   5   7   7   7   8   8   8   9   9  10  10  10  11  10
rank  2(     45):   3   3   3   4   4   4   6   7   8  10  12  15  17  21  28  55  68  78  73  44   6   6   6   6   8   7   8   8   8   9   9  11   9  10  10
rank  1(      9):   2   2   3   3   4   4   4   5   6   8  10  12  16  18  21  49  56  67  85  87  64   7   7   7   8   8   9   8   8   9   9   9  10  10  10
rank  0(      1):   2   1   2   1   1   1   2   1   1   2   1   1   1   2   1   1   2   1   2   2   1   1   2   1   1   1   2   1   1   2   1   2   2   1   2
max_depth:          6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40

Ideally, caching rank 13 would have cut almost all the 3 digit time, but it would require many more codepoints to fit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment