rogarg

## vectorized-atan2f.cpp
// Branchless, vectorized `atan2f`. Several functions of increasing
// performance are presented. The fastest version is ~50x faster than libc
// on batch workloads, outputting a result every ~2 clock cycles, compared to
// ~110 for libc. The functions all use the same `atan` approximation, and their
// max error is roughly 1/10000 of a degree.
//
// They also do not handle inf / -inf
// and the origin as an input as they should -- in our case these are a sign
// that something is wrong anyway. Moreover, manual_2 does not handle NaN
// correctly (it drops them silently), and all the auto_ functions do not

## latency.markdown

      
        
          
            
              
hellerbarde / latency.markdown, created May 31, 2012 13:16, forked from jboner/latency.txt

Latency numbers every programmer should know

L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns
Compress 1K bytes with Zippy ............. 3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns  =  20 µs
SSD random read ........................ 150,000 ns  = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns  = 250 µs
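
The figures above are easier to compare when expressed relative to the cheapest operation. The following is a minimal sketch, not part of the original gist, that copies the nine values listed so far into a table and reports each one as a multiple of an L1 cache reference. The names `Latency`, `kLatencies`, `l1_equivalents`, `ns_to_us`, and `print_relative_costs` are hypothetical and chosen only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdio>

// One row of the latency table: a label and its cost in nanoseconds.
struct Latency {
    const char* name;
    double ns;
};

// Values copied from the table above (illustrative container, not from the gist).
constexpr std::array<Latency, 9> kLatencies{{
    {"L1 cache reference", 0.5},
    {"Branch mispredict", 5.0},
    {"L2 cache reference", 7.0},
    {"Mutex lock/unlock", 25.0},
    {"Main memory reference", 100.0},
    {"Compress 1K bytes with Zippy", 3000.0},
    {"Send 2K bytes over 1 Gbps network", 20000.0},
    {"SSD random read", 150000.0},
    {"Read 1 MB sequentially from memory", 250000.0},
}};

// Number of L1 cache references that fit into `ns` nanoseconds.
constexpr double l1_equivalents(double ns) {
    return ns / kLatencies[0].ns;
}

// Convert nanoseconds to microseconds, as in the right-hand column of the table.
constexpr double ns_to_us(double ns) {
    return ns / 1000.0;
}

// Print every row together with its cost relative to an L1 cache reference.
inline void print_relative_costs() {
    for (const Latency& entry : kLatencies) {
        assert(entry.ns > 0.0);  // sanity check: latencies are positive
        std::printf("%-36s %12.1f ns  = %10.0f x L1 reference\n",
                    entry.name, entry.ns, l1_equivalents(entry.ns));
    }
}
```

Calling `print_relative_costs()` from `main` prints one line per row. By the table's own numbers, a main memory reference costs as much as 200 L1 cache references, and reading 1 MB sequentially from memory costs as much as 500,000.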