View experiment-log.txt
run: 1
Sat Feb 6 10:45:27 UTC 2016
Reset branch 'master+master'
basic1-100e6 32.1 -
packetblaster-64 12.65 -
packetblaster-synth-64 12.91 -
snabbnfv-iperf-1500 0 -
snabbnfv-iperf-jumbo 0 -
snabbnfv-loadgen-dpdk 5.012 -
run: 2
This is initial work-in-progress code for exploring an efficient
design for a "blitter" written in assembler.
The idea is to take a large number of memory copy operations, for
example 100, sort them into buckets based on length (in cache lines),
and then execute several of them in parallel. This should be
efficient for copies that are bounded by the memory subsystem
(e.g. L3 cache latency) and don't achieve the maximum throughput (~32
bytes/cycle) when executed serially with memcpy.
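The bucketing step described above can be sketched in C. This is only an illustration under assumptions, not the assembler blitter itself. The names `blitter_add` and `blitter_flush`, the 64-byte cache line, and the `MAX_LINES` and `MAX_PENDING` limits are all hypothetical. Interleaving the copies one cache line at a time only approximates "in parallel": it gives the CPU independent memory accesses to overlap, where real assembler would schedule the loads and stores explicitly. Overlapping source and destination regions are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_LINE  64   /* assumed cache line size in bytes */
#define MAX_LINES   32   /* hypothetical cap: longer copies go straight to memcpy */
#define MAX_PENDING 128  /* hypothetical queue capacity */

struct copy_op { void *dst; const void *src; size_t len; };
struct blitter { struct copy_op ops[MAX_PENDING]; size_t count; };

/* Length of a copy in cache lines, rounded up. */
static size_t lines_for(size_t len) {
    return (len + CACHE_LINE - 1) / CACHE_LINE;
}

/* Queue a copy; returns -1 if the queue is full. */
int blitter_add(struct blitter *b, void *dst, const void *src, size_t len) {
    if (b->count == MAX_PENDING) return -1;
    b->ops[b->count++] = (struct copy_op){ dst, src, len };
    return 0;
}

/* Execute all queued copies, grouped by length in cache lines. */
void blitter_flush(struct blitter *b) {
    size_t start[MAX_LINES + 2] = {0};  /* start[n] = first slot of bucket n */
    size_t next[MAX_LINES + 1];         /* next free slot in each bucket */
    size_t order[MAX_PENDING];          /* op indices sorted by bucket */
    assert(b->count <= MAX_PENDING);

    /* Pass 1: count bucket sizes; oversized copies are done serially. */
    for (size_t i = 0; i < b->count; i++) {
        size_t n = lines_for(b->ops[i].len);
        if (n > MAX_LINES)
            memcpy(b->ops[i].dst, b->ops[i].src, b->ops[i].len);
        else
            start[n + 1]++;
    }
    /* Prefix sum turns counts into bucket start offsets. */
    for (size_t n = 1; n <= MAX_LINES + 1; n++) start[n] += start[n - 1];
    for (size_t n = 0; n <= MAX_LINES; n++) next[n] = start[n];

    /* Pass 2: place each op in its bucket (counting sort). */
    for (size_t i = 0; i < b->count; i++) {
        size_t n = lines_for(b->ops[i].len);
        if (n <= MAX_LINES) order[next[n]++] = i;
    }

    /* Pass 3: within a bucket, copy line 0 of every op, then line 1, ... */
    for (size_t n = 1; n <= MAX_LINES; n++) {
        for (size_t line = 0; line < n; line++) {
            size_t off = line * CACHE_LINE;
            for (size_t k = start[n]; k < start[n + 1]; k++) {
                const struct copy_op *op = &b->ops[order[k]];
                size_t chunk = op->len - off;   /* last line may be partial */
                if (chunk > CACHE_LINE) chunk = CACHE_LINE;
                memcpy((char *)op->dst + off, (const char *)op->src + off, chunk);
            }
        }
    }
    b->count = 0;
}
```

A caller would queue copies with `blitter_add` and call `blitter_flush` once the batch is full. Whether this ordering actually beats serial memcpy is exactly what the experiment is meant to measure.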
View cachemiss.txt
output from
mcode dump: p1
7f939af17000 48C7C100E1F505 mov rcx, 0x05f5e100
7f939af17007 48BE2020F8410000. mov rsi, 0x0000000041f82020
7f939af17011 49C7C000000000 mov r8, 0x0
7f939af17018 4E8B04C6 mov r8, [rsi+r8*8]
7f939af1701c 4130C8 xor r8b, cl
7f939af1701f 4883E901 sub rcx, +0x01
View log.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script type="text/javascript" src="jquery.min.js"></script><script type="text/javascript" src="jquery-ui.min.js"></script><script type="text/javascript" src="treebits.js"></script><link rel="stylesheet" href="logfile.css" type="text/css">
<title>Log File</title>
<h1>VM build log</h1>
<p><a href="javascript:" class="logTreeExpandAll">Expand all</a> |
View readme.txt
$ sudo ./snabb snsh -jdump=+rs,x -jp=vi1 counter2.lua
100% TRACE 3 ->loop counter2.lua:5
$ sudo ./snabb snsh -jdump=+rs,x -jtprof counter2.lua
traceprof report (recorded 10567/10567 samples):
62% TRACE 3 counter2.lua:5
26% TRACE 4 (3/5) counter2.lua:9
10% TRACE 3:LOOP counter2.lua:5
View luajit-profiler-separate-loops.patch
diff --git a/lib/luajit/src/jit/p.lua b/lib/luajit/src/jit/p.lua
index d894bb7..b619517 100644
--- a/lib/luajit/src/jit/p.lua
+++ b/lib/luajit/src/jit/p.lua
@@ -71,7 +71,8 @@ local map_vmmode = {
-- Profiler callback.
-local function prof_cb(th, samples, vmmode)
+local function prof_cb(th, samples, vmmode, ip)
if [ "$1" == "1" ]; then
elif [ "$1" == "2" ]; then
View 0checksum.lua
local ffi = require("ffi")
local C = ffi.C
local pmu = require("lib.pmu")
-- Use /etc/passwd contents as a dummy packet
local data = core.lib.readfile("/etc/passwd", "*all")
local ptr = ffi.cast("unsigned char *", data)
local len = #data
local loops = 100000
View readme.txt
CPU PMU (Performance Monitoring Unit) support
I have geeked out on a new piece of hardware :-)
This time it is the Performance Monitoring Unit built into the CPU. This is a hardware capability to track fine-grained events inside the processor and give visibility into things like cache misses, branch mispredictions, utilization of internal CPU resources,
Turns out that you only need two special CPU instructions to drive this - WRMSR to set up a counter, RDPMC to read it - and a simple but interesting benchmarking tool is only 500 lines of code. This was also a good opportunity to use our new ability to write Lua code that generates machine code at runtime.
View pmu.lua
-- This module counts and reports on CPU events such as cache misses,
-- branch mispredictions, utilization of internal CPU resources such
-- as execution units, and so on.
-- Hundreds of low-level counters are available. The exact list
-- depends on CPU model. See pmu_cpu.lua for our definitions.
-- API: