anonymous
anonymous / -
Created September 12, 2016 12:54
Not all of these events have been tested and they may be broken
USE AT YOUR OWN RISK!
CBO (Last Level Cache Slice) CACHE Events
CBO.LLC_LOOKUP Cache Lookups
CBO.LLC_LOOKUP.ANY Cache Lookups
CBO.LLC_LOOKUP.DATA_READ Cache Lookups
CBO.LLC_LOOKUP.NID Cache Lookups
CBO.LLC_LOOKUP.READ Cache Lookups
@androm3da
androm3da / build.sh
Created February 15, 2017 21:42
build clang
#!/bin/bash -ex
CC="clang"
CXX="clang++"
SRCTOP=$(readlink -f ${PWD})
INSTALL=${SRCTOP}/install
if [[ ! -d "${SRCTOP}/llvm" ]]; then
    echo "Expected to find the source in ${SRCTOP}/llvm but it is missing"
    exit 3
fi
#!/bin/bash -ex
CC="clang"
CXX="clang++"
export PATH=/local/mnt/workspace/install/binutils-2.27/bin:${PATH}
SRCTOP=$(readlink -f ${PWD})
INSTALL=${1-${SRCTOP}/install}
if [[ ! -d "${SRCTOP}/llvm" ]]; then
    echo "Expected to find the source in ${SRCTOP}/llvm but it is missing"
@martinmoene
martinmoene / catch-main.cpp
Last active September 6, 2017 01:43
CATCH - Small complete multi-file example.
// This tells Catch to provide a main() - only do this in one cpp file:
#define CATCH_CONFIG_MAIN
#include "catch.hpp"
@rygorous
rygorous / fma3_codegen.txt
Last active November 19, 2017 06:45
FMA3 codegen
So the general idea with FMA3 is that it's designed to have enough degrees of
freedom so you can just have a more general FMA4 op in the IR and pick the right
FMA3 op *very* late (ideally, after register allocation!).
The generalized FMA op looks like this:
dst = FMA ±src0, ±src1, ±src2
where at most one of the src's can be a memory operand. This computes
(±src0 * ±src1) ± src2. The two signs for src0 and src1 combine
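The preview cuts off here. As a rough illustration of the sign algebra only (not of fused rounding, and not taken from the gist itself), the sketch below models the generalized op as plain arithmetic and folds the two multiplicand signs into one product sign. The function names and the `X86_FAMILY` table are hypothetical helpers; the four mnemonic families are the standard x86 FMA ones.

```python
from itertools import product


def generalized_fma(s0, a, s1, b, s2, c):
    """Evaluate (s0*a * s1*b) + s2*c, where each sign s is +1 or -1.

    This is ordinary (unfused) arithmetic; it models the value, not the rounding.
    """
    return (s0 * a) * (s1 * b) + s2 * c


def canonical_form(s0, s1, s2):
    """Fold the two multiplicand signs into a single product sign.

    Returns (product_sign, addend_sign), each +1 or -1.
    """
    return (s0 * s1, s2)


# Hypothetical lookup table: (product_sign, addend_sign) -> x86 mnemonic family.
X86_FAMILY = {
    (+1, +1): "VFMADD",   #  (a*b) + c
    (+1, -1): "VFMSUB",   #  (a*b) - c
    (-1, +1): "VFNMADD",  # -(a*b) + c
    (-1, -1): "VFNMSUB",  # -(a*b) - c
}


def classify(s0, s1, s2):
    """Map a sign pattern of the generalized op to an x86 FMA family."""
    return X86_FAMILY[canonical_form(s0, s1, s2)]


def distinct_families():
    """All families reachable from the eight possible sign patterns."""
    return {classify(*signs) for signs in product((+1, -1), repeat=3)}
```

For example, `classify(-1, -1, +1)` lands in the same family as `classify(+1, +1, +1)`, because negating both multiplicands leaves the product unchanged. The eight sign patterns therefore collapse to four families.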
@nkurz
nkurz / Results Haswell
Created July 12, 2016 03:37
Differences in macro- and micro-fusion performance Skylake vs Haswell
nate@haswell:~/src$ likwid-perfctr -m -g UOPS_ISSUED_ANY:PMC0,UOPS_EXECUTED_CORE:PMC1,UOPS_RETIRED_ALL:PMC2,BR_INST_RETIRED_NEAR_TAKEN:PMC3 -C 1 fusion
-------------------------------------------------------------
-------------------------------------------------------------
CPU type: Intel Core Haswell processor
CPU clock: 3.39 GHz
-------------------------------------------------------------
fusion
two_micro_two_macro: sum1=10000000, sum2=9999999
one_micro_two_macro: sum1=10000000, sum2=9999999
one_micro_one_macro: sum1=10000000, sum2=9999999
#!/usr/bin/env python
# Copyright 2017 Ryan Stortz (@withzombies)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
@rygorous
rygorous / box_pruning_notes.txt
Created February 17, 2017 00:41
Note on changes to the box pruning code.
A brief explanation of what I did to get the speed-up, and the thought process behind it.
The original code went:
EnterLoop:
movaps xmm3, xmmword ptr [edx+ecx*2] // Box1YZ
cmpnltps xmm3, xmm2
movmskps eax, xmm3
cmp eax, 0Ch
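The excerpt stops before the branch that consumes this comparison, so what the loop does with the result is not shown. As a reading aid, here is a minimal scalar model of the three instructions after the load (`cmpnltps`, `movmskps`, `cmp eax, 0Ch`). The Python function names mirror the mnemonics but are otherwise hypothetical, and the lists stand in for the four float lanes of an XMM register.

```python
def cmpnltps(a, b):
    """Per-lane 'not less than': True where not (a[i] < b[i]).

    Like the instruction, an unordered compare (a NaN operand) yields True,
    because `nan < x` is False in Python as well.
    """
    return [not (x < y) for x, y in zip(a, b)]


def movmskps(lanes):
    """Pack per-lane predicates into an integer: lane i becomes bit i.

    The real instruction gathers sign bits; after a compare each lane is
    all-ones or all-zeros, so the sign bit equals the predicate.
    """
    mask = 0
    for i, flag in enumerate(lanes):
        if flag:
            mask |= 1 << i
    return mask


def mask_equals_0ch(a, b):
    """Model of `cmp eax, 0Ch`: true when only lanes 2 and 3 pass the compare."""
    return movmskps(cmpnltps(a, b)) == 0x0C
```

Since `0Ch` is binary `1100`, the comparison matches only when lanes 2 and 3 are not-less-than and lanes 0 and 1 are less-than. For instance, `mask_equals_0ch([1, 1, 5, 5], [2, 2, 3, 3])` is true.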
@vegard
vegard / quantize-tikz.cc
Created April 15, 2020 11:34
Float to byte quantisation
#if 0
(g++-9 $0 || g++ $0) && \
./a.out > output.tex && \
pdflatex output && \
exec convert -density 400 -flatten output.pdf -resize 25% output.png
exit 1
#endif
#include <cmath>
#include <cstdio>