@mwarusz
mwarusz / Float32 results
Created April 13, 2021 23:28
volumerhs indexing vs performance
┌ Info: volumerhs_orig! details:
│ - 171 registers, max 256 threads
│ - 0 bytes local memory,
│   1.113 KiB shared memory,
└   0 bytes constant memory
Trial(7.074 ms)
┌ Info: volumerhs_ijk! details:
│ - 136 registers, max 384 threads
│ - 0 bytes local memory,
│   1.113 KiB shared memory,
+---------+-------------------------+-------------------+-----------------+----------+
| FT      | split_explicit_implicit | penalty on linear | remainder_model | result   |
+---------+-------------------------+-------------------+-----------------+----------+
| Float32 | false                   | yes               | -               | stable   |
| Float32 | true                    | yes               | fully discrete  | unstable |
| Float64 | true                    | yes               | fully discrete  | stable   |
| Float64 | false                   | no                | -               | unstable |
| Float64 | true                    | no                | fully discrete  | unstable |
| Float64 | true                    | no                | single flux     | stable   |
mwarusz / CuArrays.patch
Last active May 2, 2020 17:41
GPUArrays and CuArrays patches for Broadcasted
diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index 14bcfe1..ad2da8c 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -132,14 +132,15 @@ function partial_mapreduce_grid(f, op, neutral, Rreduce, Rother, shuffle, R, As.
end
## COV_EXCL_STOP
-
-NVTX.@range function GPUArrays.mapreducedim!(f, op, R::CuArray{T}, As::AbstractArray...; init=nothing) where T
Invocations  Metric Name           Metric Description                 Min     Max     Avg
Device "Tesla V100-SXM2-16GB (0)"
    Kernel: ptxcall___gpu_transpose_kernel_naive__426_1
         10  l2_tex_read_hit_rate  L2 Hit Rate (Texture Reads)        90.44%  90.50%  90.48%
         10  global_hit_rate       Global Hit Rate in unified l1/tex  0.54%   0.55%   0.54%
    Kernel: ptxcall_transpose_cuda__5
         10  l2_tex_read_hit_rate  L2 Hit Rate (Texture Reads)        50.11%  50.12%  50.12%
         10  global_hit_rate       Global Hit Rate in unified l1/tex  77.75%  77.75%  77.75%
    Kernel: ptxcall___gpu_transpose_kernel_naive_ldg__429_2
         10  l2_tex_read_hit_rate  L2 Hit Rate (Texture Reads)        90.53%  90.56%  90.54%
==4461== Profiling application: julia kernel_transpose.jl
==4461== Profiling result:
==4461== Metric result:
Invocations  Metric Name           Metric Description                 Min     Max     Avg
Device "GeForce MX150 (0)"
    Kernel: ptxcall___gpu_transpose_kernel_naive__426_1
         10  l2_tex_read_hit_rate  L2 Hit Rate (Texture Reads)        0.08%   0.10%   0.09%
         10  global_hit_rate       Global Hit Rate in unified l1/tex  0.00%   0.00%   0.00%
    Kernel: ptxcall_transpose_cuda__5
         10  l2_tex_read_hit_rate  L2 Hit Rate (Texture Reads)        87.50%  87.50%  87.50%
mwarusz / naive_transpose.jl
Created February 18, 2020 21:26
naive transpose performance
using KernelAbstractions
using GPUifyLoops
using CUDAnative, CuArrays, CUDAdrv

# Naive transpose: one work item per element, no shared-memory tiling.
@kernel function transpose_kernel_naive!(b, a)
  I = @index(Global, Cartesian)
  i, j = Tuple(I)
  @inbounds b[i, j] = a[j, i]
end
mwarusz / 1_master.txt
Last active October 23, 2019 22:59
DYCOMS cpu
 ─────────────────────────────────────────────────────────────────────────────
                                      Time                   Allocations
                              ──────────────────────   ───────────────────────
       Tot / % measured:           573s / 100%            242GiB / 100%
 Section              ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────
 dostep!                 100     572s   100%   5.72s    242GiB   100%  2.42GiB
 facerhs!                500     241s  42.2%   483ms    124GiB  51.1%   253MiB
mwarusz / CLIMA.trace
Created September 14, 2019 00:29
DifferentialEquations Carpenter Kennedy LSRK vs CLIMA
==64089== Profiling application: julia --project=../../../env/gpu isentropicvortex.jl
==64089== Profiling result:
Start     Duration  Grid Size     Block Size  Regs*  SSMem*    DSMem*  Device           Context  Stream  Name
ms        ms                                         KB        B
1.15e+05  1.838327  (125000 1 1)  (125 1 1)   40     0.000000  0       Tesla V100-SXM2  1        7       ptxcall_knl_nodal_update_aux__6 [136]
1.22e+05  5.682147  (125000 1 1)  (5 5 5)     150    14.89844  0       Tesla V100-SXM2  1        7       ptxcall_volumerhs__7 [147]
1.25e+05  9.771470  (125000 1 1)  (25 1 1)    106    0.000000  0       Tesla V100-SXM2  1        7       ptxcall_facerhs__8 [158]
1.25e+05  3.028208  (305176 1 1)  (256 1 1)   16     0.000000  0       Tesla V100-SXM2  1        7       ptxcall_update__9 [169]
1.25e+05  1.840918  (125000 1 1)  (125 1 1)   40     0.000000  0
diff --git a/Project.toml b/Project.toml
index 104d043..d15c4ad 100644
--- a/Project.toml
+++ b/Project.toml
@@ -5,6 +5,7 @@ version = "0.2.8"
[deps]
Cassette = "7057c7e9-c182-5462-911a-8362d720325c"
+Cthulhu = "f68482b8-f384-11e8-15f7-abe071a5a75f"
Requires = "ae029012-a4dd-5104-9daa-d747884805df"
//
// Generated by LLVM NVPTX Back-End
//
.version 6.0
.target sm_61
.address_size 64
.extern .func (.param .b32 func_retval0) vprintf
(