Travis Downs travisdowns

## tinymembench-i7-4770.txt
==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==

## hsl_vs_skl_decode.txt
Results of HSW vs SKL decode tests, for different patterns of multi-byte nops.

Tests run with:

./uarch-bench.sh --timer=libpfc --test-name=misc/decode* --extra-events=inst_retired,lsd.uops,idq.dsb_uops,idq.mite_uops --precision=3

Haswell (i7-4770) Results:

** Running group misc : Miscellaneous tests **
                               Benchmark     Cycles     INST_R     LSD:UO     IDQ:DS     IDQ:MI

## DecodeSim.java
package com.example.scrap;

import java.util.ArrayList;
import java.util.stream.Collectors;

import com.google.common.collect.Iterables;
import com.google.common.collect.Iterators;
import com.google.common.collect.PeekingIterator;

public class DecodeSim {

## store_fwd_litmus.txt
An example of an allowed outcome not explained solely by StoreLoad reordering (i.e., could not occur on
a system with a total store order and only StoreLoad reordering).

all memory initially zero

thread 1:
mov [x],   1
mov   a, [x]
mov   b, [y]

## parser.java
// Output created by jacc on Sun Apr 07 22:44:19 COT 2019

package travisdowns.github.io;

class RegexParser extends ParserBase implements RegexTokens {
    private int yyss = 100;
    private int yytok;
    private int yysp = 0;
    private int[] yyst;
    protected int yyerrno = (-1);

## nop-helper.h
// if NOPCOUNT is not defined, use 0 which makes NOPS a no-op
#ifndef NOPCOUNT
#define NOPCOUNT 0
#endif

#define NOPS_HELPER2(N) asm(".rept " #N ";nop;.endr");
#define NOPS_HELPER1(N) NOPS_HELPER2(N)
#define NOPS NOPS_HELPER1(NOPCOUNT)

## CNL_stores.txt
Driver: intel_pstate, governor: performance
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i3-8121U CPU @ 2.20GHz
loading msr kernel module
intel_pstate/no_turbo reports that turbo is already disabled
Using timer: clock
Welcome to uarch-bench (b6d37f9)
Supported CPU features: SSE3 PCLMULQDQ VMX EST TM2 SSSE3 FMA CX16 SSE4_1 SSE4_2 MOVBE POPCNT AES AVX RDRND TSC_ADJ BMI1 AVX2 BMI2 ERMS MPX AVX512F AVX512DQ RDSEED ADX AVX512IFMA CLFLUSHOPT INTEL_PT AVX512CD SHA AVX512BW AVX512VL
Pinned to CPU 0
Median CPU speed: 2.194 GHz

## Shuffle.java
package com.example.scrap;

import java.util.Random;
import java.util.function.Consumer;
import java.util.stream.IntStream;

public class Shuffle {

    private static final Random r = new Random();
    private static final int TRIALS = 100000;

## lcd-osaca.md

      
Last active October 17, 2019 00:53

In a recent paper, a method is described for calculating the loop-carried dependencies (LCD), if any, of an assembly-level loop, as follows:

> OSACA can detect LCDs by creating a DAG of a code comprising two back-to-back copies of the loop body. It can thus analyze all possible paths from each vertex of the first kernel section and detect cyclic LCDs if there exists a dependency chain from one instruction form to its corresponding duplicate.

However, I don't think this is sufficient to catch all loop-carried dependencies. In particular, you can have a case where a dependency is loop-carried, but where no vertex (instruction) in the second iteration depends on the same vertex¹ in the first. Rather, some other vertex in the second depends on the first, and in some subsequent iteration, the cycle is completed.
An example:
add eax, 1  ; A (deps: E-previous)
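
Separately from the assembly example above, the general point can be checked with a small sketch. It assumes a simplified model in which each instruction is just a destination register plus a list of source registers (no memory, flags, or partial-register effects), and it is not OSACA's actual implementation. The three-instruction `body` and the helper names `build_edges`, `reachable`, and `has_lcd` are hypothetical, chosen only to show a cycle that needs two iterations to close.

```python
def build_edges(body, copies):
    """Unroll `body` `copies` times and return producer -> consumers edges.

    Nodes are (copy, index) pairs. `body` is a list of (dest, [sources]).
    This is a simplified register-only dependency model (an assumption).
    """
    last_writer = {}  # register -> node that most recently wrote it
    edges = {}
    for c in range(copies):
        for i, (dest, srcs) in enumerate(body):
            node = (c, i)
            for s in srcs:
                if s in last_writer:
                    edges.setdefault(last_writer[s], set()).add(node)
            last_writer[dest] = node
    return edges


def reachable(edges, start):
    """Return all nodes reachable from `start` (excluding `start` itself)."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        for m in edges.get(n, ()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen


def has_lcd(body, copies):
    """True if some instruction in copy 0 reaches its duplicate in a later copy."""
    edges = build_edges(body, copies)
    for i in range(len(body)):
        r = reachable(edges, (0, i))
        if any((c, i) in r for c in range(1, copies)):
            return True
    return False


# Hypothetical loop body: each value rotates through three registers.
body = [
    ("a", ["b"]),  # I1: a = f(b), b comes from I2 of the previous iteration
    ("b", ["c"]),  # I2: b = g(c), c comes from I3 of the previous iteration
    ("c", ["a"]),  # I3: c = h(a), a comes from I1 of this iteration
]

print(has_lcd(body, 2))  # two back-to-back copies
print(has_lcd(body, 3))  # three back-to-back copies
```

In this model the chain is I1 → I3 (same iteration) → I2 (next iteration) → I1 (the iteration after that), so with only two copies no instruction reaches its own duplicate, while a third copy exposes the cycle.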


## cache-counters-rant.md

      
Created October 13, 2019 16:46

Discussion of x86 L1D related cache counters

The counters that are the easiest to understand and the best for making ratios that are internally consistent (i.e., always fall in the range 0.0 to 1.0) are the mem_load_retired events, e.g., mem_load_retired.l1_hit and mem_load_retired.l1_miss.
These count at the instruction level, i.e., the universe of retired instructions. For example, you could make a reasonable hit ratio from mem_load_retired.l1_hit / mem_inst_retired.all_loads and it will be sane (never indicate a hit rate more than 100%, for example).
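
As a minimal sketch of that ratio: the event names below come from the text, but the counter values are made up for illustration, and the function name `l1_hit_ratio` is hypothetical. How the counts are collected is left open.

```python
def l1_hit_ratio(l1_hit, all_loads):
    """L1 hit ratio over retired loads; None if no loads were counted."""
    if all_loads == 0:
        return None
    return l1_hit / all_loads


# Hypothetical counter readings (not measured data).
counts = {
    "mem_load_retired.l1_hit": 950_000,
    "mem_inst_retired.all_loads": 1_000_000,
}

ratio = l1_hit_ratio(
    counts["mem_load_retired.l1_hit"],
    counts["mem_inst_retired.all_loads"],
)
print(f"L1 hit ratio: {ratio:.3f}")
```

Because both events count retired load instructions, the numerator should never exceed the denominator, which is what keeps this ratio within 0.0 to 1.0.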
That one isn't perfect though, in that it may not reflect the true costs of cache misses and the behavior of the program for at least the following reasons:

- It applies only to loads and can't catch misses imposed by stores (AFAICT there is no event that counts store misses).
- It only counts loads that retire - a lot of the load activity in your process may be due to loads on a speculative path that never retire. Loads on a speculative path may bring in data that is never used, causing misses and d