Skip to content

Instantly share code, notes, and snippets.

@KirillLykov
Last active April 29, 2021 08:50
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save KirillLykov/c641e63adfd68591bafbdb342f75d141 to your computer and use it in GitHub Desktop.
Save KirillLykov/c641e63adfd68591bafbdb342f75d141 to your computer and use it in GitHub Desktop.
This is LOG for my MSD Radix Sort experiments

This gist will be to place steps and numbers for optimization of msd radix sort.

Benchmakring for Travis' radix_sort7 LSD radix sort implementation:

CPU Caches:
  L1 Data 32K (x28)
  L1 Instruction 32K (x28)
  L2 Unified 1024K (x28)
  L3 Unified 19712K (x2)
Load Average: 0.47, 0.44, 0.63
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
SortingBmk_random_wholeRange/StdStableSort/10000         537 us          535 us         5230
SortingBmk_random_wholeRange/StdStableSort/60000        3921 us         3912 us          716
SortingBmk_random_wholeRange/StdStableSort/110000       7686 us         7668 us          365
SortingBmk_random_wholeRange/StdStableSort/160000      11455 us        11428 us          245
SortingBmk_random_wholeRange/StdStableSort/210000      15475 us        15438 us          181
SortingBmk_random_wholeRange/StdStableSort/260000      19690 us        19643 us          143
SortingBmk_random_wholeRange/StdStableSort/310000      23834 us        23778 us          118
SortingBmk_random_wholeRange/StdStableSort/360000      28007 us        27941 us          100
SortingBmk_random_wholeRange/StdStableSort/410000      32287 us        32211 us           87
SortingBmk_random_wholeRange/StdStableSort/460000      36361 us        36273 us           77
SortingBmk_random_wholeRange/StdStableSort/510000      40226 us        40130 us           70
SortingBmk_random_wholeRange/StdStableSort/560000      44344 us        44239 us           63

CPU Caches:
  L1 Data 32K (x28)
  L1 Instruction 32K (x28)
  L2 Unified 1024K (x28)
  L3 Unified 19712K (x2)

Benchmark                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------
SortingBmk_random_wholeRange/MSDRadixSort/10000         195 us          195 us        14394
SortingBmk_random_wholeRange/MSDRadixSort/60000        1212 us         1209 us         2316
SortingBmk_random_wholeRange/MSDRadixSort/110000       2439 us         2433 us         1155
SortingBmk_random_wholeRange/MSDRadixSort/160000       3531 us         3522 us          795
SortingBmk_random_wholeRange/MSDRadixSort/210000       4624 us         4613 us          608
SortingBmk_random_wholeRange/MSDRadixSort/260000       5732 us         5718 us          491
SortingBmk_random_wholeRange/MSDRadixSort/310000       6789 us         6773 us          413
SortingBmk_random_wholeRange/MSDRadixSort/360000       7929 us         7910 us          353
SortingBmk_random_wholeRange/MSDRadixSort/410000       9034 us         9012 us          311
SortingBmk_random_wholeRange/MSDRadixSort/460000      10086 us        10062 us          278
SortingBmk_random_wholeRange/MSDRadixSort/510000      11157 us        11131 us          252
SortingBmk_random_wholeRange/MSDRadixSort/560000      12223 us        12194 us          230

CPU Caches:
  L1 Data 32K (x28)
  L1 Instruction 32K (x28)
  L2 Unified 1024K (x28)
  L3 Unified 19712K (x2)
Load Average: 0.99, 0.63, 0.68
-------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------
SortingBmk_random_wholeRange/LSDRadixSort/10000        82.2 us         82.1 us        34223
SortingBmk_random_wholeRange/LSDRadixSort/60000         552 us          550 us         5092
SortingBmk_random_wholeRange/LSDRadixSort/110000       1442 us         1438 us         1941
SortingBmk_random_wholeRange/LSDRadixSort/160000       2103 us         2099 us         1337
SortingBmk_random_wholeRange/LSDRadixSort/210000       2781 us         2775 us         1009
SortingBmk_random_wholeRange/LSDRadixSort/260000       3461 us         3453 us          811
SortingBmk_random_wholeRange/LSDRadixSort/310000       4143 us         4133 us          678
SortingBmk_random_wholeRange/LSDRadixSort/360000       4835 us         4823 us          580
SortingBmk_random_wholeRange/LSDRadixSort/410000       5543 us         5530 us          505
SortingBmk_random_wholeRange/LSDRadixSort/460000       6233 us         6219 us          452
SortingBmk_random_wholeRange/LSDRadixSort/510000       6912 us         6895 us          406
SortingBmk_random_wholeRange/LSDRadixSort/560000       7599 us         7581 us          369

========================================================= The first version of MSD is below.

Code

#include <bits/stdc++.h>

using namespace std;

using T = uint64_t;
const size_t    RADIX_BITS   = 1;
const size_t    RADIX_SIZE   = (size_t)1 << RADIX_BITS;
const size_t    RADIX_LEVELS = (63 / RADIX_BITS) + 1;
const size_t    DO_FINAL_SWAP = RADIX_LEVELS & 1;
const T  RADIX_MASK   = RADIX_SIZE - 1;

constexpr auto partFunc = [](T v, size_t i) -> int { return (v >> i) & RADIX_MASK; }; // it looks to be inlined

// pass <- [7, 0]
void radix_msd_rec(vector<T>& from, vector<T>& to, size_t lo, size_t hi, size_t pass) {
    //cout << "ITER " << lo << " " << hi << ", " << pass << endl;
    size_t shift = pass * RADIX_BITS;

    size_t freq[RADIX_SIZE] = {};
    for (size_t i = lo; i < hi; ++i) {
        T value = from[i];
        ++freq[partFunc(value, shift)];
    }
    
    T* queue_ptrs[RADIX_SIZE];
    queue_ptrs[0] = &to[lo];
    for (size_t i = 1; i < RADIX_SIZE; ++i)
        queue_ptrs[i] = queue_ptrs[i-1] + freq[i-1];
    
    for (size_t i = lo; i < hi; ++i) {
        T value = from[i];
        size_t index = partFunc(value, shift);
        *queue_ptrs[index] = value;
        ++queue_ptrs[index];
    }
    
    auto newLo = lo;
    auto newHi = lo + freq[0];
    for (size_t i = 1; i < RADIX_SIZE; ++i) {
        if (newHi - newLo > 1 && pass != 0) // at least one element to sort
            radix_msd_rec(to, from, newLo, newHi, pass - 1);
        
        newLo = newHi;
        newHi += freq[i];
    }
    if (newHi - newLo > 1 && pass != 0)
        radix_msd_rec(to, from, newLo, newHi, pass - 1);
}

void radix_sort_msd(vector<T>& data)
{
    vector<T> buf(data.size());
    radix_msd_rec(data, buf,  0, data.size(), 2);
    if (DO_FINAL_SWAP)
        swap(data, buf);
}
int main()
{
    vector<T> data = 
    {0b11, 
     0b10, 
     0b01, 
     0b00};
  
    radix_sort_msd(data);
    
    for (auto& v : data) {
        cout << bitset<2>(v) << endl;
    }
    
    for (auto& v : data) {
        cout << v << " ";
    }
    cout << endl;
    
    return 0;
}

@KirillLykov
Copy link
Author

KirillLykov commented Apr 26, 2021

Next steps:
1. test properly on random numbers
2. benchmark
3. check out known optimizations:
3.1 Skip trivial
3.2 precompute histograms? --> looks impossible
3.4 prefetch?
3.5 Avoid recursion?
3.6 check out asm in godbolt quickly?
3.7 perf with bmks?
3.8 Try MSD until size of the range is not in L1 where we use LSD link

@KirillLykov
Copy link
Author

KirillLykov commented Apr 26, 2021

Add skip trivial steps check

Code

#include <algorithm>
#include <memory>
#include <array>
#include <assert.h>
#include <string.h>
#include <vector>

namespace msd_impl
{
using T = uint64_t;

const size_t    RADIX_BITS   = 8;
const size_t    RADIX_SIZE   = (size_t)1 << RADIX_BITS;
const size_t    RADIX_LEVELS = (63 / RADIX_BITS) + 1;
const T  RADIX_MASK   = RADIX_SIZE - 1;

const size_t    DO_FINAL_SWAP = RADIX_LEVELS & 1; // if RADIX_LEVELS is even, buf contains the result

auto partFunc = [](T v, size_t i) -> int { return (v >> i) & RADIX_MASK; }; // it looks to be inlined

static bool is_trivial(size_t freqs[RADIX_SIZE], size_t count) {
    for (size_t i = 0; i < RADIX_SIZE; i++) {
        auto freq = freqs[i];
        if (freq != 0) {
            return freq == count;
        }
    }
    assert(count == 0); // we only get here if count was zero
    return true;
}
// pass <- [7, 0]
void radix_msd_rec(std::vector<T>& from, std::vector<T>& to, size_t lo, size_t hi, size_t pass) {
    //cout << "ITER " << lo << " " << hi << ", " << pass << endl;
    size_t shift = pass * RADIX_BITS;

    size_t freq[RADIX_SIZE] = {};
    for (size_t i = lo; i < hi; ++i) {
        T value = from[i];
        ++freq[partFunc(value, shift)];
    }
    if (is_trivial(freq, hi - lo)) {
        if (pass != 0) // at least one element to sort
            radix_msd_rec(from, to, lo, hi, pass - 1);
        return;
    }

    T* queue_ptrs[RADIX_SIZE];
    queue_ptrs[0] = &to[lo];
    for (size_t i = 1; i < RADIX_SIZE; ++i)
        queue_ptrs[i] = queue_ptrs[i-1] + freq[i-1];

    for (size_t i = lo; i < hi; ++i) {
        T value = from[i];
        size_t index = partFunc(value, shift);
        *queue_ptrs[index] = value;
        ++queue_ptrs[index];
    }

    auto newLo = lo;
    auto newHi = lo + freq[0];
    for (size_t i = 1; i < RADIX_SIZE; ++i) {
        if (newHi - newLo > 1 && pass != 0) // at least one element to sort
            radix_msd_rec(to, from, newLo, newHi, pass - 1);

        newLo = newHi;
        newHi += freq[i];
    }
    if (newHi - newLo > 1 && pass != 0)
        radix_msd_rec(to, from, newLo, newHi, pass - 1);
}
}

void radix_sort_msd(std::vector<msd_impl::T>& data)
{
    std::vector<msd_impl::T> buf(data.size());
    msd_impl::radix_msd_rec(data, buf,  0, data.size(), msd_impl::RADIX_LEVELS-1);
    if (msd_impl::DO_FINAL_SWAP)
        swap(data, buf);
}

Benchmarks

CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 3.91, 3.01, 2.60
-------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------
SortingBmk_random_wholeRange/MSDRadixSort/10000         150 us          150 us        18002
SortingBmk_random_wholeRange/MSDRadixSort/60000        1012 us         1010 us         2727
SortingBmk_random_wholeRange/MSDRadixSort/110000       2067 us         2061 us         1365
SortingBmk_random_wholeRange/MSDRadixSort/160000       2988 us         2981 us          942
SortingBmk_random_wholeRange/MSDRadixSort/210000       4280 us         4265 us          659
SortingBmk_random_wholeRange/MSDRadixSort/260000       5393 us         5380 us          514
SortingBmk_random_wholeRange/MSDRadixSort/310000       6694 us         6673 us          426
SortingBmk_random_wholeRange/MSDRadixSort/360000       7788 us         7767 us          357

@KirillLykov
Copy link
Author

KirillLykov commented Apr 26, 2021

Use stable_sort for small arrays (under the hood it is insertion sort, see [1], [2])

Code

// Radix sort is taken from https://github.com/travisdowns/sort-bench
// Unknown license

#include <algorithm>
#include <memory>
#include <array>
#include <assert.h>
#include <string.h>
#include <assert.h>

namespace msd_impl
{
using T = uint64_t;

const size_t    RADIX_BITS   = 8;
const size_t    RADIX_SIZE   = (size_t)1 << RADIX_BITS;
const size_t    RADIX_LEVELS = (63 / RADIX_BITS) + 1;
const T  RADIX_MASK   = RADIX_SIZE - 1;

const size_t    DO_FINAL_SWAP = RADIX_LEVELS & 1; // if RADIX_LEVELS is even, buf contains the result

constexpr auto partFunc = [](T v, size_t i) -> int { return (v >> i) & RADIX_MASK; }; // it looks to be inlined

static bool is_trivial(size_t freqs[RADIX_SIZE], size_t count) {
    for (size_t i = 0; i < RADIX_SIZE; i++) {
        auto freq = freqs[i];
        if (freq != 0) {
            return freq == count;
        }
    }
    assert(count == 0); // we only get here if count was zero
    return true;
}
// pass <- [7, 0]
void radix_msd_rec(std::vector<T>& from, std::vector<T>& to, size_t lo, size_t hi, size_t pass) {
    //cout << "ITER " << lo << " " << hi << ", " << pass << endl;
    if (hi - lo < 128) { // there will be insertion sort under the hood
        auto s = &from.front() + lo;
        auto e = &from.front() + hi;
        std::stable_sort(s, e);
        return;
    }

    size_t shift = pass * RADIX_BITS;

    size_t freq[RADIX_SIZE] = {};
    for (size_t i = lo; i < hi; ++i) {
        T value = from[i];
        ++freq[partFunc(value, shift)];
    }
    if (is_trivial(freq, hi - lo)) {
        if (pass != 0) // at least one element to sort
            radix_msd_rec(from, to, lo, hi, pass - 1);
        return;
    }

    T* queue_ptrs[RADIX_SIZE];
    queue_ptrs[0] = &to[lo];
    for (size_t i = 1; i < RADIX_SIZE; ++i)
        queue_ptrs[i] = queue_ptrs[i-1] + freq[i-1];

    for (size_t i = lo; i < hi; ++i) {
        T value = from[i];
        size_t index = partFunc(value, shift);
        *queue_ptrs[index] = value;
        ++queue_ptrs[index];
    }

    auto newLo = lo;
    auto newHi = lo + freq[0];
    for (size_t i = 1; i < RADIX_SIZE; ++i) {
        if (newHi - newLo > 1 && pass != 0) {// at least one element to sort
            radix_msd_rec(to, from, newLo, newHi, pass - 1);
        }
        newLo = newHi;
        newHi += freq[i];
    }
    if (newHi - newLo > 1 && pass != 0)
        radix_msd_rec(to, from, newLo, newHi, pass - 1);
}
}

void radix_sort_msd(std::vector<T>& data)
{
    std::vector<T> buf(data.size());
    msd_impl::radix_msd_rec(data, buf,  0, data.size(), RADIX_LEVELS-1);
    if (msd_impl::DO_FINAL_SWAP)
        swap(data, buf);
}

Benchmarks -- actually the same as without this check.
In particular, if we use stable sort for arrays smaller than 128 we have the same perfromance as we always use MSD radix.
Starting from 256, the performance is worse

CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 2.25, 2.05, 2.01
-------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------
SortingBmk_random_wholeRange/MSDRadixSort/10000         141 us          140 us        19797
SortingBmk_random_wholeRange/MSDRadixSort/60000         868 us          866 us         3193
SortingBmk_random_wholeRange/MSDRadixSort/110000       1696 us         1693 us         1666
SortingBmk_random_wholeRange/MSDRadixSort/160000       2438 us         2432 us         1155
SortingBmk_random_wholeRange/MSDRadixSort/210000       3340 us         3334 us          815
SortingBmk_random_wholeRange/MSDRadixSort/260000       4790 us         4719 us          655
SortingBmk_random_wholeRange/MSDRadixSort/310000       5420 us         5407 us          512
SortingBmk_random_wholeRange/MSDRadixSort/360000       7038 us         7007 us          424

[1] gcc stable_sort
[2] llvm stable_sort

@KirillLykov
Copy link
Author

KirillLykov commented Apr 27, 2021

From perf, some hot places

           │     constexpr auto partFunc = [](T v, size_t i) -> int { return (v >> i) & RADIX_MASK; }; // it looks to be inlined                                                   
2.3368:   mov    (%rdi),%rax                                                                                                                                              
0.03add    $0x8,%rdi                                                                                                                                                
2.23shr    %cl,%rax                                                                                                                                                 
           │     _ZN8msd_impl13radix_msd_recERSt6vectorImSaImEES3_mmm():                                                                                                           
++freq[partFunc(value, shift)];                                                                                                                           
0.02   │       movzbl %al,%eax                                                                                                                                                 
36.88 │       addq   $0x1,-0x1030(%rbp,%rax,8)                                                                                                                                
           │         for (size_t i = lo; i < hi; ++i) {                                                                                                                            
...
*queue_ptrs[index] = value;                                                                                                                               ◆
  2.31mov    -0x830(%rbp,%rdi,8),%r8
  5.55mov    %r11,(%r8)

So it happens because we take values every time from the random positions in the array.
It was optimized in LSD sort by Travis by precomputing for each number all the frequencies:

  for (size_t i = 0; i < count; i++) {
      uint64_t value = a[i];
      for (size_t pass = 0; pass < RADIX_LEVELS; pass++) {
          freqs[pass][value & RADIX_MASK]++;
          value >>= RADIX_BITS;
      }
  }

Not sure if there is anyway to do the same here

@KirillLykov
Copy link
Author

KirillLykov commented Apr 27, 2021

need to consider negative integers
link to save somehwere
flipping sign bits

@KirillLykov
Copy link
Author

On the size of radix -- 11-bit radix gives 15% improvement in performance

For 8 bits:

Benchmark                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------
SortingBmk_random_wholeRange/MSDRadixSort/10000         199 us          198 us        14005
SortingBmk_random_wholeRange/MSDRadixSort/60000        1203 us         1201 us         2327
SortingBmk_random_wholeRange/MSDRadixSort/110000       2437 us         2431 us         1155
SortingBmk_random_wholeRange/MSDRadixSort/160000       3527 us         3518 us          797
SortingBmk_random_wholeRange/MSDRadixSort/210000       4623 us         4612 us          607
SortingBmk_random_wholeRange/MSDRadixSort/260000       5705 us         5692 us          491
SortingBmk_random_wholeRange/MSDRadixSort/310000       6780 us         6764 us          413
SortingBmk_random_wholeRange/MSDRadixSort/360000       7912 us         7893 us          355
SortingBmk_random_wholeRange/MSDRadixSort/410000       9026 us         9004 us          311
SortingBmk_random_wholeRange/MSDRadixSort/460000      10059 us        10035 us          279
SortingBmk_random_wholeRange/MSDRadixSort/510000      11185 us        11158 us          250
SortingBmk_random_wholeRange/MSDRadixSort/560000      12217 us        12188 us          230

For 11 bits

-------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------
SortingBmk_random_wholeRange/MSDRadixSort/10000         170 us          170 us        16416
SortingBmk_random_wholeRange/MSDRadixSort/60000        1023 us         1021 us         2738
SortingBmk_random_wholeRange/MSDRadixSort/110000       1956 us         1952 us         1436
SortingBmk_random_wholeRange/MSDRadixSort/160000       2868 us         2861 us          973
SortingBmk_random_wholeRange/MSDRadixSort/210000       3782 us         3773 us          740
SortingBmk_random_wholeRange/MSDRadixSort/260000       4674 us         4663 us          601
SortingBmk_random_wholeRange/MSDRadixSort/310000       5559 us         5546 us          505
SortingBmk_random_wholeRange/MSDRadixSort/360000       6509 us         6493 us          431
SortingBmk_random_wholeRange/MSDRadixSort/410000       7428 us         7410 us          377
SortingBmk_random_wholeRange/MSDRadixSort/460000       8333 us         8313 us          336
SortingBmk_random_wholeRange/MSDRadixSort/510000       9243 us         9221 us          304
SortingBmk_random_wholeRange/MSDRadixSort/560000      10265 us        10240 us          274

@KirillLykov
Copy link
Author

From the above, due to possible optimizations for histogram LSD is faster. Why we are interested in MSD -- because it better for bandwidth.
So there are 3 questions

  1. How to measure bandwidth
  2. Different in bandwidth for LSD and MSD
  3. Can we have a combined MSD and LSD version which has benefits of both?
  4. 46% of runtime is spent on histogram, any idea how to optimize?

@KirillLykov
Copy link
Author

KirillLykov commented Apr 28, 2021

Code might be bound in one of the 3 ways:

  1. Memory bound
    1.1 Bandwidth bound
    1.2 Latency bound
  2. CPU Bound

In our case, it is memory bound and I think latency is coming from random access by index.

Here, I'm interested if LSD is more memory bound than MSD.

Below are summary of the known ways

  1. LLC_MISS * 64. Eg "last level cache miss". Note, doesn't include prefetch misses

LSD utilises read bandwidth more by (90783-55512)/55512=63%
LSD utilises write bandwidth more by (11483-3638)/3638=210%

CPU Caches:
  L1 Data 32K (x28)
  L1 Instruction 32K (x28)
  L2 Unified 1024K (x28)
  L3 Unified 19712K (x2)
Load Average: 0.23, 0.29, 0.40
-------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------
SortingBmk_random_wholeRange/MSDRadixSort/310000       6750 us         6734 us          409

 Performance counter stats for './radix --benchmark_min_time=2 --benchmark_filter=SortingBmk_random_wholeRange/MSDRadixSort/310000':

             55512      LLC-load-misses:u
              3638      LLC-store-misses:u

       3.557161343 seconds time elapsed
...

-------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations
-------------------------------------------------------------------------------------------
SortingBmk_random_wholeRange/LSDRadixSort/310000       4063 us         4053 us          690

 Performance counter stats for './radix --benchmark_min_time=2 --benchmark_filter=SortingBmk_random_wholeRange/LSDRadixSort/310000':

             90783      LLC-load-misses:u
             11483      LLC-store-misses:u

       3.289039239 seconds time elapsed
  1. MEMORY_BW_READS. Error that it is unsupported on my proc. details
    The command is
perf stat -e uncore_imc_0/event=0x4,umask=0x3/,uncore_imc_1/event=0x4,umask=0x3/,uncore_imc_4/event=0x4,umask=0x3/,uncore_imc_5/event=0x4,umask=0x3/ <exec>

@KirillLykov
Copy link
Author

Looks like #cache misses is lower for MSD but i/c is also lower.

Performance counter stats for './radix --benchmark_min_time=2 --benchmark_filter=SortingBmk_random_wholeRange/LSDRadixSort/310000':

       10828156746      cycles:u
       25202335777      instructions:u            #    2.33  insn per cycle
          33397168      cache-misses:u
-----
 Performance counter stats for './radix --benchmark_min_time=2 --benchmark_filter=SortingBmk_random_wholeRange/MSDRadixSort/310000':

       11723789045      cycles:u
       19503092141      instructions:u            #    1.66  insn per cycle
          17758484      cache-misses:u

       3.592622470 seconds time elapsed

@KirillLykov
Copy link
Author

LLC-sotres 27784186(MSD) vs 386000(LSD)

@KirillLykov
Copy link
Author

KirillLykov commented Apr 29, 2021

If I use LSD when the size of the subarray is smaller than 2^14:

    if (hi - lo < (1<<14)) {
        radix_lsd(&from[0] + lo, &to[0] + lo, hi - lo, pass + 1); // radix_sort6 by Travis
        return;
    }

I get:

SortingBmk_random_wholeRange/MSDRadixSort/310000       5988 us         5974 us          469

 Performance counter stats for './radix --benchmark_min_time=2 --benchmark_filter=SortingBmk_random_wholeRange/MSDRadixSort/310000':

             40792      LLC-load-misses:u
              5726      LLC-store-misses:u

       3.511111146 seconds time elapsed

-----

SortingBmk_random_wholeRange/LSDRadixSort/310000       4107 us         4097 us          691

 Performance counter stats for './radix --benchmark_min_time=2 --benchmark_filter=SortingBmk_random_wholeRange/LSDRadixSort/310000':

            229121      LLC-load-misses:u
             59772      LLC-store-misses:u

       3.324133220 seconds time elapsed

So it looks promising. This happens only with 2^14 and not for smaller values. On this processor L1 cache size is 32K (does half taken for instructions?)

Now performance

SortingBmk_random_wholeRange/MSDRadixSort/310000       6031 us         6017 us          465
SortingBmk_random_wholeRange/LSDRadixSort/310000       4107 us         4097 us          684

Although it looks slower than LSD, it faster than it was before by 10%:

SortingBmk_random_wholeRange/MSDRadixSort/310000       6791 us         6775 us          413

@KirillLykov
Copy link
Author

Next steps:

  1. Stick with the current implementation.
  2. Add tests for different distributions for different arrays sizes
  3. Check that the current implementation performs consistently better in all of them
  4. Add negative numbers
  5. Add tests for other data types (int32, int16)
  6. Add template code for parameters, interface, etc
  7. Polish code, repo, make nice readme

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment