Brian Merchant bzm3r

## vulkan-validation-layer-errors
[2020-02-21T03:00:32Z ERROR gfx_backend_vulkan]
VALIDATION [UNASSIGNED-CoreValidation-Shader-InconsistentSpirv (0)] : SPIR-V module not valid: Invalid SPIR-V binary version 1.3 for target environment SPIR-V 1.0 (under Vulkan 1.0 semantics).
object info: (type: UNKNOWN, hndl: 0)

[2020-02-21T03:00:32Z ERROR gfx_backend_vulkan]
VALIDATION [UNASSIGNED-CoreValidation-Shader-ExceedDeviceLimit (0)] : Shader requires flag VK_SHADER_STAGE_COMPUTE_BIT set in VkPhysicalDeviceSubgroupProperties::supportedStages but it is not set on the device
object info: (type: UNKNOWN, hndl: 0)

[2020-02-21T03:00:32Z ERROR gfx_backend_vulkan]
VALIDATION [UNASSIGNED-CoreValidation-Shader-ExceedDeviceLimit (0)] : Shader requires flag VK_SUBGROUP_FEATURE_BASIC_BIT set in VkPhysicalDeviceSubgroupProperties::supportedOperations but it is not set on the device

## gist:1e2c8de27548e23975239367c14b9b6a
transpose-threadgroup-WGS=(1,32) kernel already compiled...
GPU results verified!
task name:Vk-Threadgroup-TG=32
device: Radeon RX 570 Series
num BMs: 4096, TG size: 32
CPU loops: 1001, GPU loops: 5001
timestamp stats (N = 1001): 20.19 +/- 0.65 ms
instant stats (N = 1001): 20.83 +/- 0.66 ms

transpose-threadgroup-WGS=(2,32) kernel already compiled...

## shuffle.comp
#version 450
#extension GL_KHR_shader_subgroup_shuffle: enable

#define WORKGROUP_SIZE ~WG_SIZE~

// Unlike the threadgroup case, the Y-dimension of the workgroup size is not used.
// This is because the Y-dimension will be implicit in the number of subgroups in a workgroup.
layout(local_size_x = WORKGROUP_SIZE) in;

layout(set = 0, binding = 0) buffer BM {

## gist:6e65bc1c2d642586bd6abfbca919a017
compiling kernel transpose-shuffle-WGS=(64,1)...
num bms: 4096, num dispatch groups: 2048
GPU results verified!
task name:Vk-ShuffleAMD-WG=64
device: Radeon RX 570 Series
num BMs: 4096, TG size: 64
CPU loops: 1001, GPU loops: 5001
timestamp stats (N = 1001): 8.53 +/- 0.54 ms
instant stats (N = 1001): 9.13 +/- 0.59 ms

## gist:9078999cbc209af2cd059e3d5b0536e0
#version 450
#define WORKGROUP_SIZE ~WG_SIZE~

layout(local_size_x = WORKGROUP_SIZE, local_size_y = 1) in;

layout(set = 0, binding = 0) buffer BM {
    uint[32] bms[];
};
struct Uniforms
{

## intel-hybrid-shuffle-results
transpose-hybrid-shuffle-WGS=(32,1) kernel already compiled...
num bms: 4096, num dispatch groups: 4096
GPU results verified!
task name:Vk-HybridShuffle-TG=32
device: Intel(R) HD Graphics 520
num BMs: 4096, TG size: 32
CPU loops: 101, GPU loops: 1001
timestamp stats (N = 101): 57.83 +/- 1.31 ms
instant stats (N = 101): 58.53 +/- 1.26 ms

## intel-threadgroup-results
compiling kernel transpose-threadgroup-WGS=(1,32)...
num bms: 4096, num dispatch groups: 4096
GPU results verified!
task name:Vk-Threadgroup-TG=32
device: Intel(R) HD Graphics 520
num BMs: 4096, TG size: 32
CPU loops: 101, GPU loops: 1001
timestamp stats (N = 101): 81.46 +/- 1.37 ms
instant stats (N = 101): 82.24 +/- 1.35 ms

## transpose-timings-writeup-outline.md

      
Last active April 1, 2020 01:46

    Outline


Make it relevant to the audience.

Share a couple of specific examples, and identify the general shape of the problem.


Describe how this problem is easy to parallelize, in particular, on a GPU.


Introduce the specific shape of our problem: why are we using bitmaps?

Storing data compactly on a GPU.


## transpose-timings-writeup.md

      
Created April 1, 2020 01:47

There might be a collection of tasks and a collection of processors, and we want to associate tasks with a particular processor depending on some rule. Or, in an N-body physics simulation, we may want to figure out which of the N particles are close enough to interact with some particular particle. These problems, and many others, have a similar shape: determine whether some objects in one collection are related in some way to objects in another collection.
It is easy to write parallelized solutions to this problem. Suppose we have a collection of objects A and a collection of objects B, related by a boolean function p: (a, b) -> bool, and for each object in A we want to determine which objects in B are related to it via p. We have a GPU with many individual processors, so we can associate each object a_i in A with the ith processor; each such processor then loops over the elements b_j of B and stores the result p(a_i, b_j) to the output.
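
The per-object loop described above can be sketched on the CPU. This is only an illustration, not the GPU kernel from the gists: the function name `relate`, the `Vec<Vec<bool>>` output layout, and the distance-based relation in `main` are all assumptions made for the example. Each iteration of the outer loop stands in for the work the ith processor would do.

```rust
/// CPU sketch of the scheme: row `i` of the result holds p(a_i, b_j) for every j.
/// On a GPU, each row would be computed by its own processor; here the outer
/// iterator simply visits the rows one after another.
/// `relate` and its output layout are hypothetical, chosen for illustration.
fn relate<A, B>(a: &[A], b: &[B], p: impl Fn(&A, &B) -> bool) -> Vec<Vec<bool>> {
    a.iter()
        .map(|a_i| b.iter().map(|b_j| p(a_i, b_j)).collect())
        .collect()
}

fn main() {
    // Hypothetical relation: two 1-D positions "interact" if they are within 1.0.
    let a = [0.0_f32, 5.0];
    let b = [0.5_f32, 4.5, 10.0];
    let out = relate(&a, &b, |x, y| (x - y).abs() <= 1.0);
    println!("{:?}", out); // [[true, false, false], [false, true, false]]
}
```

Because no row depends on any other row, the outer loop can be distributed across processors without synchronization, which is what makes the problem easy to parallelize.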
Data on a GPU should b

  
## test-drop-impl.rs
impl Drop for ExampleBase {
    fn drop(&mut self) {
        unsafe {
            self.device.device_wait_idle().unwrap();
            self.device
                .destroy_semaphore(self.present_complete_semaphore, None);
            self.device
                .destroy_semaphore(self.rendering_complete_semaphore, None);
            self.device.free_memory(self.depth_image_memory, None);
            self.device.destroy_image_view(self.depth_image_view, None);