Profiling and data analysis resources

Goals

Looking at software from a different angle, as we do during profiling and data analysis, has numerous benefits. It requires and increases our understanding of the problems being solved by the software. It exposes us to unexpected discoveries and insights. It allows us to suggest improvements.

Since a great deal of skill is involved, and this skill is rarely taught in Computer Science degrees, practice plays an important role in acquiring it.

Exercises

Ex1: sequential processing

Reading sequentially and processing each element from:

  • Case 1: an array `Foo arr[N]`
  • Case 2: an array `Foo *arr[N];` (pointers to “random” locations)
  • Case 3: linked list of Foos.
  1. Order the cases
  2. Explain the ordering
  3. Measure

Ex1.1: measure again while another Core continuously exercises Case 1 in a loop.
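
A minimal C sketch of the three cases, assuming a hypothetical `Foo` with an integer payload; allocation, initialization, and timing are left out. The point is the memory access pattern each traversal produces.

```c
#include <stddef.h>

typedef struct Foo { int value; struct Foo *next; } Foo;

// Case 1: contiguous array of Foo -- sequential loads through memory.
long sum_array(const Foo *arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += arr[i].value;
    return sum;
}

// Case 2: array of pointers to Foo at "random" locations -- the pointer
// array streams nicely, but each element is a dependent, scattered load.
long sum_pointer_array(Foo *const *arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += arr[i]->value;
    return sum;
}

// Case 3: linked list of Foos -- every load depends on the previous one.
long sum_list(const Foo *head) {
    long sum = 0;
    for (const Foo *p = head; p != NULL; p = p->next) sum += p->value;
    return sum;
}
```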

Suggested reading list

Essential

Other

People

Tips

Reasons

Rob Pike Rules

From http://users.ece.utexas.edu/~adnan/pike.html:

  • Rule 1. You can’t tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don’t try to second guess and put in a speed hack until you’ve proven that’s where the bottleneck is.
  • Rule 2. Measure. Don’t tune for speed until you’ve measured, and even then don’t unless one part of the code overwhelms the rest.
  • Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don’t get fancy. (Even if n does get big, use Rule 2 first.)
  • Rule 4. Fancy algorithms are buggier than simple ones, and they’re much harder to implement. Use simple algorithms as well as simple data structures.
  • Rule 5. Data dominates. If you’ve chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

From Reality to Model, rather than Model to Reality

Software creation is often presented as the creation of a model that will run “in reality.”

This activity of profiling and analyzing data + code is about reversing that arrow:

From http://realworldrisk.com/:

When and if we model, we go from reality to models, not from models to reality

Tools

Tricks

Adding some overhead to the thing under measurement (especially if its cost is diffuse) is a nice way to see its impact, for example by adding a repeat loop around a section.
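
A minimal sketch of this trick, assuming a hypothetical `suspect_section()` standing in for the code under investigation:

```c
enum { REPEAT = 100 };

void suspect_section(void);  // hypothetical: the code whose impact we want to see

void amplify_suspect(void) {
    // Repeat the suspect section so its (possibly diffuse) cost is amplified
    // and becomes visible in a profile or a wall-clock measurement.
    for (int i = 0; i < REPEAT; i++) {
        suspect_section();
    }
    // If total runtime grows roughly in proportion to REPEAT, the section's
    // cost is real and its share can be estimated; if it barely moves, the
    // cost lies elsewhere.
}
```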

Rico Mariani: “Performance Culture Best Practices”

@url: https://www.facebook.com/notes/rico-marianis-performance-tidbits/performance-culture-best-practices/473526169665226/

Quoting:

Two Rules You Must Follow

Rule #1: Measure

Just thinking about what to measure will help you do a good job. If you don't measure, you can be sure it will be slow, big, or whatever else you don't want. If you haven't measured, your job's not finished.

Rule #2: Do your homework

Good engineering requires you to understand your raw materials. What are the key properties of your Framework? Your processor? Your target system?

Three Steps To Success

Budget

An exercise to assess the value of a new feature and the cost your customer would be willing to pay, not a technical assessment of what is possible

Plan

Design and validate against the budget; this is a design plan and a risk assessment

Verify

Measure the final results, discard failures without remorse or penalty, don’t make your customers live with them

Fabian Giesen @rygorous

Top 3 benchmarking mistakes/gotchas:

  1. Are you using a relevant test set (are you testing the right thing)?

Example: If you're interested in some string processing thing and most strings you care about are short, test on 100 000 short strings, not one 1MB string. Very different workloads!

  2. Does your test give consistent results if run repeatedly?

If you don't verify this and try to iterate on something, you're just looking at noise.

  3. Have you verified that you're measuring what you think you're measuring?

a) In microbenchmarks especially, look at what the compiler does with it. If your test consists of repeated executions of a pure function with the same argument, and you never do anything with the result, many compilers will get rid of the loop entirely (see the sketch after this list).

b) don't run microbenchmarks to gather data, period. They're OK for hypothesis testing when trying to confirm an expected result but if you have no idea what the result should be you also have no way of detecting the many, many ways to screw microbenchmarks up.

  4. If you're using a recursive Fibonacci "benchmark" for any purpose other than as the compiler writer's equivalent of "Hello, world" to make sure your function calls work, you're in a state of sin.

Using a Fib benchmark result in any debate results in instant ban. It's the law.
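
As a concrete illustration of point 3a, a minimal C sketch, assuming a hypothetical `pure_function()` under test; varying the argument and writing the result to a volatile sink keeps the optimizer from discarding the loop:

```c
#include <stdint.h>

int pure_function(int x);          // hypothetical: the pure function being benchmarked
volatile int64_t benchmark_sink;   // observable side effect the optimizer must preserve

void microbench(int iterations) {
    int64_t acc = 0;
    for (int i = 0; i < iterations; i++) {
        acc += pure_function(i);   // vary the input instead of reusing one constant
    }
    benchmark_sink = acc;          // the result escapes, so the loop cannot be removed
}
```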

“How NOT to Measure Latency” by Gil Tene

@url: https://www.youtube.com/watch?v=lJ8ydIuPFeU “How NOT to Measure Latency” by Gil Tene.

Presenter

Gil Tene, CTO and co-founder of Azul Systems (builders of high-performance JVMs). Previously: Nortel Networks, Shasta Networks, Check Point Software Technologies.

@url: http://stuff-gil-says.blogspot.de/
@url: http://latencytipoftheday.blogspot.de/
@url: https://twitter.com/giltene

Summary

This talk has been given in multiple forms. It usually gets an “Oh shit!” reaction from its audience. Gil Tene presents himself as an experienced system builder.

The topic is latency, or response time: the time it takes for one operation to happen.

More accurately, as we’re dealing with not just one operation but many, the topic is how to study the latency behavior of a system: How does latency in general behave?

Pretty, misleading charts

If there’s one thing you should and can measure and plot, it’s the max response time. This is a genuine signal of something that happened.

Often, however, the question we ask ourselves is what the most common response time was. This can lead to misleading visualizations, unfortunately present in many monitoring and measuring systems.

see @url: http://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-q-whats-wrong-with_21.html

Percentile
the value below which a given percentage of observations falls; the 95th percentile of requests is the response time that 95 out of every 100 requests stay under.
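
A small C sketch (not from the talk) of extracting a percentile from collected latency samples, using a simple rounded-rank method:

```c
#include <stdlib.h>
#include <string.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

// Returns the p-th percentile (0 < p <= 100) of n latency samples.
// (Error handling omitted for brevity.)
double percentile(const double *samples, size_t n, double p) {
    double *sorted = malloc(n * sizeof *sorted);
    memcpy(sorted, samples, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_double);
    size_t rank = (size_t)(p / 100.0 * (double)n + 0.5); // rounded rank
    if (rank == 0) rank = 1;
    if (rank > n) rank = n;
    double result = sorted[rank - 1];
    free(sorted);
    return result;
}
```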

A typical chart contains:

  • response time of a system over time (2h)
  • latency as one series per percentile

Beware of graphs that plot only what’s convenient. Gil Tene shows a graph that displays response times only up to the 95th percentile of requests. What happens to the 5% of requests that were left out?

The presenter then runs a thought experiment: what is the user experience like?

He takes the example of users visiting a website. A typical session is about 5 page visits, each of which makes about 40 requests. Every request matters, and a single slow one can lead to a bad user experience.

Let’s figure out how many users only ever see requests at or below the 95th percentile. Just 0.95^200 ≈ 0.0035% of your users, which is basically nobody; nearly everyone hits the slower 5% at least once.

So why should we look at the 95th percentile and below? This chart was really misleading in that it gave the impression that we were looking at a common case!

Let’s do the experiment in reverse. What percentile of requests should you look at to have an idea of what 99% of the users experience?

We want y such that y^200 = 0.99, i.e. log(0.99) = 200·log(y), so y = exp(log(0.99)/200) ≈ 99.9950%.
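
A small C check of both directions of this arithmetic, assuming the 5 pages × 40 requests ≈ 200 requests per session from the thought experiment:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double requests_per_session = 200.0;
    // Fraction of sessions that never hit anything above the 95th percentile:
    printf("sessions fully inside the 95th percentile: %.4f%%\n",
           100.0 * pow(0.95, requests_per_session));        // ~0.0035%
    // Per-request percentile needed so that 99% of sessions stay inside it:
    printf("percentile needed to cover 99%% of sessions: %.4f%%\n",
           100.0 * exp(log(0.99) / requests_per_session));   // ~99.9950%
    return 0;
}
```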

Gil Tene suggests collecting and plotting the response time against percentile (x axis: percentile, y axis: response time) to get an overview of the response time at various percentiles.

Sales pitch: HdrHistogram. @url: https://github.com/HdrHistogram/HdrHistogram

The coordinated omission problem.

A problem caused by measurement methodology that induces bad conclusions about performance.

Under a serial load generator that emits operations one after the other at a fixed rate, an unresponsive system that freezes for a long time will give misleading read-outs, underestimating the extent of the freeze:

As the system becomes unresponsive (completely) for a long time, we get 1 bad response time, and n very good ones once it has become responsive again.

A naive load generator does that if it waits and does not emit requests until the previous one has responded.

In code measurement the same occurs: as we collect traces of operations, we won’t get new traces of what was wrong because we simply were unresponsive. This results in only keeping the “good” numbers.
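
A hedged C sketch (not from the talk) of such a naive serial load generator; `now_ns()`, `sleep_until()`, and `send_request_blocking()` are hypothetical helpers:

```c
#include <stddef.h>
#include <stdint.h>

uint64_t now_ns(void);               // hypothetical monotonic clock
void sleep_until(uint64_t t_ns);     // hypothetical pacing helper
void send_request_blocking(void);    // hypothetical: returns when the response arrives

void naive_load_generator(size_t n, uint64_t interval_ns, uint64_t *latency_ns) {
    uint64_t next = now_ns();
    for (size_t i = 0; i < n; i++) {
        sleep_until(next);                 // pace at the intended rate...
        uint64_t start = now_ns();
        send_request_blocking();           // ...but block on each response
        latency_ns[i] = now_ns() - start;
        next += interval_ns;
        // If the system freezes for 10 s here, we record ONE 10 s sample and
        // then catch up with a burst of fast ones. The requests we *meant* to
        // send during the freeze were never issued, so their terrible
        // would-be latencies never enter the data set.
    }
}
```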

Service Time vs Response Time.

The serial load generator/measurement shown in the previous part was bad in that it only measured service time and did not include waiting time.

However latency is really composed of the total waiting + service times:

+---------------------+
|    Response Time    |
+------+--------------+
|      | Service Time |
+------+--------------+
| Wait | Work         |
+------+--------------+
  • Wait is pure waste. (Queuing to get service?)
  • Work is productive. (Getting serviced)

Tip to verify whether your load generator/monitoring system measures actual response time:

  • Push it hard. After a while the response time should grow linearly.
  • Linear response is when the system reaches fragility.

Sustainable Throughput

Latency relating to throughput and load.

Figuring out how to avoid “saturation” is better than trying to study what happens at or after saturation. Figure out how fast we can go while still being safe.

Comparing response time or latency behaviors

Differential comparison at various workloads (e.g. 40K/85K ok, 90K bad). Looking at requirements can help compare two setups/systems and assess which ones are more economical than others for the requirement.

Plot the max

Is a good visual summary for comparing two systems.

@title: Scylla’s Approach to Improve Performance for CPU-bound workloads

@url: https://www.scylladb.com/2017/07/06/scyllas-approach-improve-performance-cpu-bound-workloads/

Funny. Today almost everything is CPU-bound (if you don't draw a sharp distinction between CPU and memory).

I mostly skimmed through the article because I was familiar with some of it. Things that stood out:

knowing how much processor time was spent in particular functions may not be enough. When the CPU is busy it may be so because it is actually doing some work or because it is stalled waiting for memory

@see: https://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html

completed instructions per cycle: IPC

If it is low, less than 1, it is very likely that the application is, in fact, memory bound.
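
A hedged sketch of this rule of thumb; `read_retired_instructions()` and `read_cpu_cycles()` are hypothetical stand-ins for whatever PMU interface is available (toplev and V-Tune, mentioned below, read these counters for you):

```c
#include <stdint.h>
#include <stdio.h>

uint64_t read_retired_instructions(void); // hypothetical PMU counter read
uint64_t read_cpu_cycles(void);           // hypothetical PMU counter read
void workload(void);                      // hypothetical: the code under study

void report_ipc(void) {
    uint64_t i0 = read_retired_instructions(), c0 = read_cpu_cycles();
    workload();
    uint64_t i1 = read_retired_instructions(), c1 = read_cpu_cycles();
    double ipc = (double)(i1 - i0) / (double)(c1 - c0);
    // Rule of thumb from the article: IPC well below 1 suggests the CPU is
    // mostly stalled (often waiting on memory) rather than doing useful work.
    printf("IPC = %.2f\n", ipc);
}
```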

modern processors have a Performance Monitoring Unit (PMU)

@see: https://github.com/andikleen/pmu-tools/wiki/toplev-manual
@see: https://sites.google.com/site/analysismethods/yasin-pubs
@see: https://drive.google.com/file/d/0B_SDNxjh2Wbcc0lWemFNSGMzLTA/view?resourcekey=0-eHHOKhPrwPN8uq-t_L9yLg (Top-Down analysis paper)

toplev is a tool for Linux, so it is not directly usable here on Windows. It appears that Intel has implemented the Top-Down analysis paper in V-Tune: @see: https://software.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/methodologies/top-down-microarchitecture-analysis-method.html

@see: https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler/download.html

The V-Tune profiler appears to be available to all without fee these days, which is good news! I've heard of it for so long and never got to try it out.

I guess the interesting idea in this article, beyond sharing information about CPU architecture, is the idea behind the Top-Down paper: arranging bottlenecks into hierarchies, which can help with the analysis.

Another thing I like is that it highlighted how ICache misses are quite often an issue. Can be alleviated by batching and creating
