Brian Merchant (bzm3r)
[2020-02-21T03:00:32Z ERROR gfx_backend_vulkan]
VALIDATION [UNASSIGNED-CoreValidation-Shader-InconsistentSpirv (0)] : SPIR-V module not valid: Invalid SPIR-V binary version 1.3 for target environment SPIR-V 1.0 (under Vulkan 1.0 semantics).
object info: (type: UNKNOWN, hndl: 0)
[2020-02-21T03:00:32Z ERROR gfx_backend_vulkan]
VALIDATION [UNASSIGNED-CoreValidation-Shader-ExceedDeviceLimit (0)] : Shader requires flag VK_SHADER_STAGE_COMPUTE_BIT set in VkPhysicalDeviceSubgroupProperties::supportedStages but it is not set on the device
object info: (type: UNKNOWN, hndl: 0)
[2020-02-21T03:00:32Z ERROR gfx_backend_vulkan]
VALIDATION [UNASSIGNED-CoreValidation-Shader-ExceedDeviceLimit (0)] : Shader requires flag VK_SUBGROUP_FEATURE_BASIC_BIT set in VkPhysicalDeviceSubgroupProperties::supportedOperations but it is not set on the device
transpose-threadgroup-WGS=(1,32) kernel already compiled...
GPU results verified!
task name:Vk-Threadgroup-TG=32
device: Radeon RX 570 Series
num BMs: 4096, TG size: 32
CPU loops: 1001, GPU loops: 5001
timestamp stats (N = 1001): 20.19 +/- 0.65 ms
instant stats (N = 1001): 20.83 +/- 0.66 ms
transpose-threadgroup-WGS=(2,32) kernel already compiled...
#version 450
#extension GL_KHR_shader_subgroup_shuffle: enable
#define WORKGROUP_SIZE ~WG_SIZE~
// Unlike the threadgroup case, the Y-dimension of the workgroup size is not used.
// This is because the Y-dimension will be implicit in the number of subgroups in a workgroup.
layout(local_size_x = WORKGROUP_SIZE) in;
layout(set = 0, binding = 0) buffer BM {
compiling kernel transpose-shuffle-WGS=(64,1)...
num bms: 4096, num dispatch groups: 2048
GPU results verified!
task name:Vk-ShuffleAMD-WG=64
device: Radeon RX 570 Series
num BMs: 4096, TG size: 64
CPU loops: 1001, GPU loops: 5001
timestamp stats (N = 1001): 8.53 +/- 0.54 ms
instant stats (N = 1001): 9.13 +/- 0.59 ms
#version 450
#define WORKGROUP_SIZE ~WG_SIZE~
layout(local_size_x = WORKGROUP_SIZE, local_size_y = 1) in;
layout(set = 0, binding = 0) buffer BM {
uint[32] bms[];
};
struct Uniforms
{
transpose-hybrid-shuffle-WGS=(32,1) kernel already compiled...
num bms: 4096, num dispatch groups: 4096
GPU results verified!
task name:Vk-HybridShuffle-TG=32
device: Intel(R) HD Graphics 520
num BMs: 4096, TG size: 32
CPU loops: 101, GPU loops: 1001
timestamp stats (N = 101): 57.83 +/- 1.31 ms
instant stats (N = 101): 58.53 +/- 1.26 ms
compiling kernel transpose-threadgroup-WGS=(1,32)...
num bms: 4096, num dispatch groups: 4096
GPU results verified!
task name:Vk-Threadgroup-TG=32
device: Intel(R) HD Graphics 520
num BMs: 4096, TG size: 32
CPU loops: 101, GPU loops: 1001
timestamp stats (N = 101): 81.46 +/- 1.37 ms
instant stats (N = 101): 82.24 +/- 1.35 ms

Outline

  • Make it relevant to the audience.

    • share a couple of specific examples, and identify the general shape of the problem
  • Describe how this problem is easy to parallelize, in particular, on a GPU.

  • Introduce the specific shape of our problem: why are we using bitmaps?

    • Storing data compactly on a GPU.

There might be a collection of tasks and a collection of processors, and we want to associate each task with a particular processor according to some rule. Or, in an N-body physics simulation, we may want to figure out which of the N particles are close enough to interact with some particular particle. These problems, and many others, have a similar shape: determine whether some objects in one collection are related in some way to objects in another collection.

It is easy to write parallelized solutions to this problem. Suppose we have a collection of objects A and a collection of objects B, related by a boolean function p: (A, B) -> bool, and for each object in A we want to determine which objects in B are related to it via p. We have a GPU with many individual processors, so we can associate each object a_i in A with the ith processor; each such processor loops over the elements b_j in B and stores the result p(a_i, b_j) to the output.
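The scheme just described can be sketched on the CPU with one host thread standing in for each GPU invocation. This is only an illustrative model (the names `relate`, `a_i`, `b_j` are invented for the sketch); on the GPU, the outer "loop" is implicit in the dispatch.

```rust
// CPU sketch of the parallel relation check: thread i owns a_i and loops
// over every b_j, recording p(a_i, b_j). On a GPU, each spawned thread
// would instead be one shader invocation.
fn relate<A: Sync, B: Sync>(
    a: &[A],
    b: &[B],
    p: impl Fn(&A, &B) -> bool + Sync,
) -> Vec<Vec<bool>> {
    let p = &p;
    std::thread::scope(|s| {
        let handles: Vec<_> = a
            .iter()
            .map(|a_i| {
                // One "processor" per element of A.
                s.spawn(move || b.iter().map(|b_j| p(a_i, b_j)).collect::<Vec<bool>>())
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    // Example relation: "x divides y".
    let out = relate(&[1u32, 2, 3], &[2u32, 4], |x, y| y % x == 0);
    assert_eq!(
        out,
        vec![vec![true, true], vec![true, true], vec![false, false]]
    );
    println!("{:?}", out);
}
```

Note that each row of the output is a vector of booleans, one per element of B — which is exactly the data we will want to store compactly as bitmaps.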

Data on a GPU should be stored compactly.
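For boolean results, compact storage means bit-packing: a row of 32 results fits in a single 32-bit word, which is why the shader buffers above hold `uint[32]` blocks. A minimal sketch of the packing (illustrative helper names, not the gist's actual code):

```rust
// Illustrative sketch: pack a row of 32 booleans into one u32, so a
// 32x32 block of relation results occupies 32 words (128 bytes) instead
// of 1024 separately stored booleans.
fn pack_row(bits: [bool; 32]) -> u32 {
    bits.iter()
        .enumerate()
        .fold(0u32, |acc, (j, &b)| acc | ((b as u32) << j))
}

// Read bit j back out of a packed row.
fn get_bit(row: u32, j: usize) -> bool {
    (row >> j) & 1 == 1
}

fn main() {
    let mut bits = [false; 32];
    bits[0] = true;
    bits[5] = true;
    let row = pack_row(bits);
    assert_eq!(row, 0b10_0001);
    assert!(get_bit(row, 5) && !get_bit(row, 4));
}
```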

impl Drop for ExampleBase {
    fn drop(&mut self) {
        unsafe {
            self.device.device_wait_idle().unwrap();
            self.device
                .destroy_semaphore(self.present_complete_semaphore, None);
            self.device
                .destroy_semaphore(self.rendering_complete_semaphore, None);
            self.device.free_memory(self.depth_image_memory, None);
            self.device.destroy_image_view(self.depth_image_view, None);