- Make it relevant to the audience.
  - Share a couple of specific examples, and identify the general shape of the problem.
- Describe how this problem is easy to parallelize, in particular on a GPU.
- Introduce the specific shape of our problem: why are we using bitmaps?
  - Storing data compactly on a GPU.
[2020-02-21T03:00:32Z ERROR gfx_backend_vulkan]
VALIDATION [UNASSIGNED-CoreValidation-Shader-InconsistentSpirv (0)] : SPIR-V module not valid: Invalid SPIR-V binary version 1.3 for target environment SPIR-V 1.0 (under Vulkan 1.0 semantics).
object info: (type: UNKNOWN, hndl: 0)
[2020-02-21T03:00:32Z ERROR gfx_backend_vulkan]
VALIDATION [UNASSIGNED-CoreValidation-Shader-ExceedDeviceLimit (0)] : Shader requires flag VK_SHADER_STAGE_COMPUTE_BIT set in VkPhysicalDeviceSubgroupProperties::supportedStages but it is not set on the device
object info: (type: UNKNOWN, hndl: 0)
[2020-02-21T03:00:32Z ERROR gfx_backend_vulkan]
VALIDATION [UNASSIGNED-CoreValidation-Shader-ExceedDeviceLimit (0)] : Shader requires flag VK_SUBGROUP_FEATURE_BASIC_BIT set in VkPhysicalDeviceSubgroupProperties::supportedOperations but it is not set on the device
transpose-threadgroup-WGS=(1,32) kernel already compiled...
GPU results verified!
task name:Vk-Threadgroup-TG=32
device: Radeon RX 570 Series
num BMs: 4096, TG size: 32
CPU loops: 1001, GPU loops: 5001
timestamp stats (N = 1001): 20.19 +/- 0.65 ms
instant stats (N = 1001): 20.83 +/- 0.66 ms
transpose-threadgroup-WGS=(2,32) kernel already compiled...
#version 450
#extension GL_KHR_shader_subgroup_shuffle: enable
#define WORKGROUP_SIZE ~WG_SIZE~
// Unlike the threadgroup case, the Y-dimension of the workgroup size is not used.
// This is because the Y-dimension will be implicit in the number of subgroups in a workgroup.
layout(local_size_x = WORKGROUP_SIZE) in;
layout(set = 0, binding = 0) buffer BM {
compiling kernel transpose-shuffle-WGS=(64,1)...
num bms: 4096, num dispatch groups: 2048
GPU results verified!
task name:Vk-ShuffleAMD-WG=64
device: Radeon RX 570 Series
num BMs: 4096, TG size: 64
CPU loops: 1001, GPU loops: 5001
timestamp stats (N = 1001): 8.53 +/- 0.54 ms
instant stats (N = 1001): 9.13 +/- 0.59 ms
#version 450
#define WORKGROUP_SIZE ~WG_SIZE~
layout(local_size_x = WORKGROUP_SIZE, local_size_y = 1) in;
layout(set = 0, binding = 0) buffer BM {
    uint[32] bms[];
};
struct Uniforms
{
transpose-hybrid-shuffle-WGS=(32,1) kernel already compiled...
num bms: 4096, num dispatch groups: 4096
GPU results verified!
task name:Vk-HybridShuffle-TG=32
device: Intel(R) HD Graphics 520
num BMs: 4096, TG size: 32
CPU loops: 101, GPU loops: 1001
timestamp stats (N = 101): 57.83 +/- 1.31 ms
instant stats (N = 101): 58.53 +/- 1.26 ms
compiling kernel transpose-threadgroup-WGS=(1,32)...
num bms: 4096, num dispatch groups: 4096
GPU results verified!
task name:Vk-Threadgroup-TG=32
device: Intel(R) HD Graphics 520
num BMs: 4096, TG size: 32
CPU loops: 101, GPU loops: 1001
timestamp stats (N = 101): 81.46 +/- 1.37 ms
instant stats (N = 101): 82.24 +/- 1.35 ms
There might be a collection of tasks and a collection of processors, and we want to associate each task with a particular processor according to some rule. Or, in an N-body physics simulation, we may want to figure out which of the N particles are close enough to interact with some particular particle. These problems, and many others, have a similar shape: determine whether some objects in one collection are related in some way to objects in another collection.
It is easy to write parallelized solutions to this problem. Suppose we have a collection of objects A and a collection of objects B, related by a boolean function p: (A, B) -> bool, and for each object in A we want to determine which objects in B are related to it via p. We have a GPU with many individual processors, so we can associate each object a_i in A with the i-th processor. Each such processor can then loop over the elements b_j of B and store the result p(a_i, b_j) to output.
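The scheme above can be sketched on the CPU, with one OS thread standing in for each GPU processor. This is only an illustration under stated assumptions, not the kernel used in the benchmarks: the function name relate_rows is hypothetical, and the sketch assumes B has at most 32 elements so that the results for each a_i pack into the bits of a single u32.

```rust
use std::thread;

/// Hypothetical helper: returns one packed row per element of `a`.
/// Bit j of row i is set when p(a[i], b[j]) is true.
/// Assumption: b.len() <= 32, so a row fits in one u32.
fn relate_rows<A, B, P>(a: &[A], b: &[B], p: P) -> Vec<u32>
where
    A: Sync,
    B: Sync,
    P: Fn(&A, &B) -> bool + Sync,
{
    assert!(b.len() <= 32, "one u32 row holds at most 32 results");
    let mut rows = vec![0u32; a.len()];
    thread::scope(|s| {
        // One thread per a_i plays the role of the i-th processor.
        for (a_i, row) in a.iter().zip(rows.iter_mut()) {
            let p = &p;
            s.spawn(move || {
                // Each "processor" loops over every b_j and records p(a_i, b_j).
                for (j, b_j) in b.iter().enumerate() {
                    if p(a_i, b_j) {
                        *row |= 1u32 << j;
                    }
                }
            });
        }
    });
    rows
}
```

For example, calling relate_rows with a = [1, 2, 3], b = [1, 2, 3, 4, 6] and the predicate "a_i divides b_j" yields three rows, one per element of a, each marking the multiples found in b. Spawning a thread per element is only reasonable for small inputs; on a GPU the same mapping comes from the dispatch itself.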
Data on a GPU should b
impl Drop for ExampleBase {
    fn drop(&mut self) {
        unsafe {
            self.device.device_wait_idle().unwrap();
            self.device
                .destroy_semaphore(self.present_complete_semaphore, None);
            self.device
                .destroy_semaphore(self.rendering_complete_semaphore, None);
            self.device.free_memory(self.depth_image_memory, None);
            self.device.destroy_image_view(self.depth_image_view, None);