
Results for Sebastian Aaltonen's buffer tester https://github.com/sebbbi/perftest
Measured on Intel Haswell GT2 (i3-4010U). I had to reduce the threadgroup count to 64x64, down from 1024, or the test would trigger a TDR (the driver's timeout detection and recovery).
Load R8 invariant: 2.106ms
Load R8 linear: 13.438ms
Load R8 random: 6.053ms
Load RG8 invariant: 2.105ms
Load RG8 linear: 12.763ms
Load RG8 random: 6.229ms
Load RGBA8 invariant: 2.105ms
#include <string>
#include <fstream>
#include <istream>
#include <sstream>
#include <boost/tokenizer.hpp>
#include <boost/timer/timer.hpp>
using namespace std;
static const __m128i SHUFFLE_TABLE[16] = {
_mm_setr_epi8(12,13,14,15, 8, 9,10,11, 4, 5, 6, 7, 0, 1, 2, 3),
_mm_setr_epi8( 0, 1, 2, 3,12,13,14,15, 8, 9,10,11, 4, 5, 6, 7),
_mm_setr_epi8( 4, 5, 6, 7,12,13,14,15, 8, 9,10,11, 0, 1, 2, 3),
_mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7,12,13,14,15, 8, 9,10,11),
_mm_setr_epi8( 8, 9,10,11,12,13,14,15, 4, 5, 6, 7, 0, 1, 2, 3),
_mm_setr_epi8( 0, 1, 2, 3, 8, 9,10,11,12,13,14,15, 4, 5, 6, 7),
_mm_setr_epi8( 4, 5, 6, 7, 8, 9,10,11,12,13,14,15, 0, 1, 2, 3),
@jbarczak
jbarczak / Reorder_with_shuffle_LUT
Created June 14, 2015 02:43
ray reordering with shuffle lut
// Tried this, and it was marginally slower
//
// Some notes about this:
// 1. Separate hit/miss arrays force me to use a lot more stack than I did before, and
// probably don't use the cache quite as well.
// 2. The prefetching of the rays doesn't fit in quite as neatly, and doesn't help anymore if I stick it in there.
// It might make more sense to move that elsewhere anyway.
// 3. The LUT is 256 bytes. Not too bad, but it's probably knocking a few rays out of the cache.
// 4. Reordering can produce at least one packet that is partially miss and partially hit.
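For readers unfamiliar with the trick, here is a minimal scalar sketch of a mask-indexed shuffle LUT. It is not the gist's code: the names `LaneOrder`, `BuildLaneTable`, and `PermuteLanes` are hypothetical, and the ordering rule (hit lanes first in ascending order, then miss lanes in descending order) is only inferred from the seven `SHUFFLE_TABLE` entries visible in this listing. A real version would presumably hold each entry as a 16-byte `__m128i` shuffle control (16 entries x 16 bytes = the 256 bytes mentioned in note 3) and apply it with a byte shuffle; plain arrays stand in for that here so the sketch stays portable.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Lane order for one 4-bit hit mask (hypothetical type, for illustration only).
using LaneOrder = std::array<std::uint8_t, 4>;

// Builds all 16 lane orders. Assumed rule, inferred from the visible table
// entries: hit lanes first (ascending), then miss lanes (descending).
std::array<LaneOrder, 16> BuildLaneTable()
{
    std::array<LaneOrder, 16> table{};
    for (unsigned mask = 0; mask < 16; ++mask) {
        std::size_t n = 0;
        for (int lane = 0; lane < 4; ++lane)       // hits, ascending
            if (mask & (1u << lane))
                table[mask][n++] = static_cast<std::uint8_t>(lane);
        for (int lane = 3; lane >= 0; --lane)      // misses, descending
            if (!(mask & (1u << lane)))
                table[mask][n++] = static_cast<std::uint8_t>(lane);
    }
    return table;
}

// Scalar stand-in for the byte shuffle: reorders four 32-bit ray IDs
// according to the table entry selected by the hit mask.
std::array<std::uint32_t, 4> PermuteLanes(const std::array<std::uint32_t, 4>& ids,
                                          unsigned hitMask,
                                          const std::array<LaneOrder, 16>& table)
{
    std::array<std::uint32_t, 4> out{};
    const LaneOrder& order = table[hitMask & 15u];
    for (std::size_t i = 0; i < 4; ++i)
        out[i] = ids[order[i]];
    return out;
}
```

With a hit mask of `0b0101` (lanes 0 and 2 hit), the IDs come out in lane order 0, 2, 3, 1, so the hits are packed at the front of the group.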
jbarczak / PrefixSum
Last active August 29, 2015 14:22
Prefix sum improvements suggested by ryg
static void __fastcall ReorderRays( StackFrame& frame, size_t nGroups )
{
RayPacket** pPackets = frame.pActivePackets;
uint32 pIDs[MAX_TRACER_SIZE];
size_t nHitLoc = 0;
size_t nMissLoc = 8*nGroups;
const char* pRays = (const char*) frame.pRays;
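The preview cuts off here, so the rest of `ReorderRays` is not shown. The initial values of `nHitLoc` (0) and `nMissLoc` (the end of the ID range) suggest a two-cursor partition into a single buffer: hits written forward from the front, misses written backward from the end. The sketch below illustrates only that general pattern, under that assumption; `PartitionByHit` and its parameters are hypothetical and do not come from the gist.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Writes hit IDs forward from slot 0 and miss IDs backward from the last
// slot of a single output buffer. Returns the number of hits, which is also
// the index where the miss region begins.
// Assumes ids.size() == hit.size().
std::size_t PartitionByHit(const std::vector<std::uint32_t>& ids,
                           const std::vector<bool>& hit,
                           std::vector<std::uint32_t>& out)
{
    out.resize(ids.size());
    std::size_t hitLoc  = 0;            // next free slot at the front
    std::size_t missLoc = ids.size();   // one past the next free slot at the back
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (hit[i])
            out[hitLoc++] = ids[i];
        else
            out[--missLoc] = ids[i];
    }
    return hitLoc;
}
```

One consequence of this layout is that misses end up in reverse order relative to the input, which is harmless if only the hit/miss grouping matters.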
The Microsoft compiler appears to ignore prefetches inside a loop.
Tested this on MSVC 2013 Express Edition. Microsoft's Connect site says I am not authorized to submit feedback, for who knows what reason, or else I'd send it there directly.
Code I used:
void Foo( char* p, int* q )
{
for( size_t i=0; i<8; i++ )