@kazuki
Last active March 5, 2017 02:45
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
stepping : 2
microcode : 0x38
cpu MHz : 1199.920
cache size : 15360 KB
physical id : 0
siblings : 12
core id : 0
cpu cores : 6
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
bugs :
bogomips : 6997.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
DDR4-2133 ECC DR 16GB x 4 modules
BIOS Information
Vendor: American Megatrends Inc.
Version: P3.20
Release Date: 08/16/2016
BIOS Revision: 5.11
Base Board Information
Manufacturer: ASRock
Product Name: X99M Extreme4
Memory Device
Array Handle: 0x000E
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 72 bits
Size: 16384 MB
Form Factor: RIMM
Set: None
Locator: DIMM_A1
Bank Locator: NODE 1
Type: <OUT OF SPEC>
Type Detail: Synchronous
Speed: 2133 MHz
Manufacturer: Micron
Asset Tag: DIMM_A1_AssetTag
Part Number: 36ASF2G72PZ-2G1A2
Rank: 2
Configured Clock Speed: 2133 MHz
$ OMP_NUM_THREADS=6 ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.014684 msec, 14.995488 GFLOPS.
On-board: 0.009763 msec, 22.554633 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.277897 msec, 7.244636 GFLOPS.
On-board: 0.268828 msec, 7.489059 GFLOPS.
$ OMP_NUM_THREADS=12 ./fftw
3-D FFT: 128 x 128 x 128
On-board: 0.010458 msec, 21.055685 GFLOPS.
On-board: 0.008329 msec, 26.439007 GFLOPS.
3-D FFT: 256 x 256 x 256
On-board: 0.279447 msec, 7.204465 GFLOPS.
On-board: 0.272284 msec, 7.393980 GFLOPS.
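The GFLOPS figures above appear to follow the common FFT benchmarking convention of counting 5 N log2(N) floating-point operations for a complex transform of N points. This is inferred from the numbers, not confirmed by the benchmark source. Under that assumption, the printed times reproduce the printed rates only when read as seconds, despite the `msec` label. A minimal check in Python (the function name is hypothetical):

```python
import math

def fft_gflops(n_per_dim: int, seconds: float) -> float:
    """GFLOPS for a cubic 3-D complex FFT, assuming 5*N*log2(N) flops."""
    n_total = n_per_dim ** 3                      # total number of points N
    flops = 5.0 * n_total * math.log2(n_total)    # assumed operation count
    return flops / seconds / 1e9

# Times from the first line of each size in the OMP_NUM_THREADS=6 run,
# interpreted as seconds (an assumption; the log labels them "msec").
print(f"128^3: {fft_gflops(128, 0.014684):.3f} GFLOPS")
print(f"256^3: {fft_gflops(256, 0.277897):.3f} GFLOPS")
```

Both values agree with the logged rates to within the rounding of the printed times.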
M 10240 N 10240 K 10240 al -1 b 1
Dgemm start
memory use 2.34375 GB
3:3:16 initialize done.
254.995 GFlops, 1.73491
Sgemm start
3:3:51 initialize done.
511.707 GFlops, 5.02656
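For context, the GEMM rates can be compared with a nominal peak. A Haswell core can issue two 256-bit FMA instructions per cycle, which is 16 double-precision or 32 single-precision flops per cycle. The sketch below assumes all 6 cores run at the nominal 3.5 GHz; the clock actually sustained under AVX2 load is not recorded here and may differ, so the percentages are only indicative.

```python
def peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Nominal peak = cores * clock (GHz) * flops per cycle per core."""
    return cores * ghz * flops_per_cycle

CORES = 6            # "cpu cores" from /proc/cpuinfo
NOMINAL_GHZ = 3.5    # assumption: nominal clock, not the measured AVX clock
DP_FLOPS_PER_CYCLE = 16   # 2 FMA units * 4 doubles * 2 ops (mul + add)
SP_FLOPS_PER_CYCLE = 32   # 2 FMA units * 8 floats * 2 ops

dp_peak = peak_gflops(CORES, NOMINAL_GHZ, DP_FLOPS_PER_CYCLE)
sp_peak = peak_gflops(CORES, NOMINAL_GHZ, SP_FLOPS_PER_CYCLE)

# Measured rates taken from the Dgemm and Sgemm lines above.
print(f"DGEMM: {254.995 / dp_peak:.1%} of {dp_peak:.0f} GFLOPS nominal peak")
print(f"SGEMM: {511.707 / sp_peak:.1%} of {sp_peak:.0f} GFLOPS nominal peak")
```

Under these assumptions both kernels reach roughly three quarters of the nominal peak.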
$ mpirun -np 1 ./lu2_mpi -n 32768
(omitted)
Error = 5.930995e-08 1.890872e-09
update time= 2.91993e+11 ops/cycle= 80.0177
update matmul time= 2.77436e+11 ops/cycle= 37.9941
update swap+bcast time= 1.27753e+10 ops/cycle= 0.0344883
total time= 6.09222e+11 ops/cycle= 38.5019
rfact time= 3.14754e+11 ops/cycle= 0.00179896
ldmul time= 3.53161e+10 ops/cycle= 58.3752
colum dec with trans time= 1.60102e+11 ops/cycle= 0.00335411
colum dec right time= 8.19711e+10 ops/cycle= 6.95994
colum dec left time= 3.63658e+08 ops/cycle= 0.0919088
rowtocol time= 2.27866e+10 ops/cycle= 0.0235665
column dec in trans time= 1.24084e+11 ops/cycle= 0.0346219
coltorow time= 1.32282e+10 ops/cycle= 0.0405952
dgemm8 time= 5.89311e+09 ops/cycle= 0.728812
dgemm16 time= 5.34285e+09 ops/cycle= 1.60774
dgemm32 time= 5.71966e+09 ops/cycle= 3.00365
dgemm64 time= 6.63841e+09 ops/cycle= 5.1759
dgemm128 time= 5.24282e+09 ops/cycle= 13.1074
main dgemm time= 3.07159e+11 ops/cycle= 69.4599
col trsm time= 8.79853e+09 ops/cycle= 5.64071
col update time= 7.27637e+10 ops/cycle= 0.629603
col r dgemm time= 5.02651e+10 ops/cycle= 21.7888
col right misc time= 2.24946e+10 ops/cycle= 0.0954667
backsub time= 8.72095e+08 ops/cycle= 1.23122
col dec total time= 2.42443e+11 ops/cycle= 0.00233553
DGEMM2k time= 3.39549e+11 ops/cycle= 68.906
DGEMM1k time= 1.12077e+10 ops/cycle= 58.2487
DGEMM512 time= 8.232e+09 ops/cycle= 37.5653
DGEMMrest time= 3.83344e+10 ops/cycle= 7.44828
TRSM U time= 5.40459e+09 ops/cycle= 29.6683
col r swap/scale time= 4.05956e+08 ops/cycle= 0.164665
rfact ex coldec time= 7.22276e+10 ops/cycle= 0
ld_phase1 time= 59263 ops/cycle= 1132.39
rfact misc1 time= 5.00091e+09 ops/cycle= 0
rfact misc2 time= 3.16592e+10 ops/cycle= 0
rfact misc3 time= 2.75225e+09 ops/cycle= 0
rfact misc4 time= 3.28153e+10 ops/cycle= 0
copysubmats time= 1.78958e+10 ops/cycle= 0.128906
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR01R2C32 32768 2048 1 1 174.36 1.345e+02
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0348843 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0280407 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0048949 ...... PASSED
============================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
----------------------------------------------------------------------------
End of Tests.
============================================================================
cpusec = 1578.84 wsec=174.364 134.525 Gflops
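The summary line is consistent with counting 2N^3/3 operations for the LU factorization and dividing by the wall-clock time. Dividing the same count by the `total time` value reproduces its `ops/cycle` figure, which suggests that the `time=` columns are in clock cycles. Both readings are inferred from the numbers, not taken from the lu2_mpi source. A minimal check:

```python
N = 32768                    # matrix order from "-n 32768"
WSEC = 174.364               # wall-clock seconds from the summary line
TOTAL_CYCLES = 6.09222e11    # "total time" from the breakdown above

def lu_flops(n: int) -> float:
    """Leading-order operation count of LU factorization (assumed: 2n^3/3)."""
    return 2.0 * n ** 3 / 3.0

gflops = lu_flops(N) / WSEC / 1e9            # compare with 134.525 Gflops
ops_per_cycle = lu_flops(N) / TOTAL_CYCLES   # compare with ops/cycle= 38.5019
implied_ghz = TOTAL_CYCLES / WSEC / 1e9      # cycle counter rate, if the reading holds

print(f"{gflops:.3f} Gflops, {ops_per_cycle:.4f} ops/cycle, {implied_ghz:.3f} GHz")
```

The implied counter rate of about 3.49 GHz is in line with the processor's nominal 3.50 GHz.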
$ uname -a
Linux vivace 4.10.0-gentoo #1 SMP Tue Feb 21 00:33:47 JST 2017 x86_64 Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz GenuineIntel GNU/Linux
$ gcc --version
gcc (Gentoo 6.3.0 p1.0) 6.3.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
fftw: 3.3.4
openblas: 0.2.19
openmpi: 2.0.2
$ ./stream_cxx.out --arraysize 409600000
-------------------------------------------------------------
STREAM version $Revision : 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array Size = 409600000 (elements), Offset = 0 (elements)
Memory per array = 3125 MiB (= 3.05176 GiB).
Total Memory required = 9375 MiB (= 9.15527 GiB).
Each kernel will be executed 10 times.
Function Best Rate MB/s Avg time Min time Max time
Copy: 30946.3 0.214962 0.211773 0.228922
Scale: 31006.5 0.213416 0.211362 0.216842
Add: 33919.8 0.290569 0.289813 0.292838
Triad: 33978.6 0.291207 0.289311 0.295740
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
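The STREAM rates follow from the bytes each kernel moves per iteration divided by the minimum time: Copy and Scale touch two arrays, Add and Triad touch three. The sketch below recomputes two of the rates and compares Triad with a nominal DDR4-2133 peak. The peak assumes the four DIMMs populate four independent 64-bit channels, which the dmidecode excerpt above does not confirm.

```python
ELEMENTS = 409_600_000     # --arraysize
BYTES_PER_ELEMENT = 8      # "This system uses 8 bytes per array element."

def stream_rate_mb_s(arrays_touched: int, min_time_s: float) -> float:
    """STREAM rate in MB/s (1 MB = 1e6 bytes) from the minimum kernel time."""
    return arrays_touched * BYTES_PER_ELEMENT * ELEMENTS / min_time_s / 1e6

copy_rate = stream_rate_mb_s(2, 0.211773)    # Copy: read a, write c
triad_rate = stream_rate_mb_s(3, 0.289311)   # Triad: read b and c, write a

# Assumption: 4 channels * 8 bytes per transfer * 2133 MT/s.
nominal_peak_mb_s = 4 * 8 * 2133

print(f"Copy:  {copy_rate:.1f} MB/s")
print(f"Triad: {triad_rate:.1f} MB/s, "
      f"{triad_rate / nominal_peak_mb_s:.1%} of nominal peak")
```

Under the four-channel assumption, Triad reaches about half of the nominal peak. The log does not say how many threads STREAM used.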