@itzmeanjan
Last active March 17, 2024 12:51
😎 Parallel Matrix Multiplication on GPGPU, using Vulkan Compute API 🚴🏼
#version 450
#pragma shader_stage(compute)

layout(local_size_x = 8, local_size_y = 4, local_size_z = 1) in;

layout(set = 0, binding = 0) buffer readonly MatrixA {
    int[1<<20] matrix_a;
};
layout(set = 0, binding = 1) buffer readonly MatrixB {
    int[1<<20] matrix_b;
};
layout(set = 0, binding = 2) buffer writeonly MatrixC {
    int[1<<20] matrix_c;
};

void main() {
    const uint row = gl_GlobalInvocationID.x;
    const uint col = gl_GlobalInvocationID.y;

    if(row >= 1024 || col >= 1024) {
        return;
    }

    int sum = 0;
    for(uint i = 0; i < 1024; i++) {
        sum += matrix_a[row * 1024 + i] * matrix_b[i * 1024 + col];
    }
    matrix_c[row * 1024 + col] = sum;
}
extern crate rand;
extern crate vulkano;
extern crate vulkano_shaders;

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use std::sync::Arc;
use std::time::Instant;
use vulkano::buffer::{BufferUsage, CpuAccessibleBuffer, ImmutableBuffer};
use vulkano::command_buffer::{AutoCommandBufferBuilder, CommandBufferUsage, PrimaryCommandBuffer};
use vulkano::descriptor::descriptor_set::PersistentDescriptorSet;
use vulkano::device::{Device, DeviceExtensions, Features};
use vulkano::instance::PhysicalDevice;
use vulkano::instance::{Instance, InstanceExtensions};
use vulkano::pipeline::{ComputePipeline, ComputePipelineAbstract};
use vulkano::sync::GpuFuture;
use vulkano::Version;

const N: u32 = 1 << 20;

fn main() {
    let instance = Instance::new(None, Version::V1_2, &InstanceExtensions::none(), None)
        .expect("failed to create instance !");
    let physical_device = PhysicalDevice::enumerate(&instance)
        .next()
        .expect("failed to enumerate physical devices");
    println!(
        "Device: {}\nVulkan API: {}",
        physical_device.properties().device_name.as_ref().unwrap(),
        physical_device.api_version()
    );
    for i in physical_device.queue_families() {
        println!(
            "Queue Count: {}\tCompute: {}\tGraphics: {}",
            i.queues_count(),
            i.supports_compute(),
            i.supports_graphics()
        );
    }

    let queue_family = physical_device
        .queue_families()
        .find(|&v| v.supports_compute())
        .expect("failed to find compute supported queue family");
    let mut ext = DeviceExtensions::none();
    ext.khr_storage_buffer_storage_class = true;
    let (logical_device, mut queues) = Device::new(
        physical_device,
        &Features::none(),
        &ext,
        [(queue_family, 0.5)].iter().cloned(),
    )
    .expect("failed to create logical logical_device");
    let queue = queues.next().expect("failed to find associated queue");

    let matrix_a = generate_square_matrix(Some(13));
    let matrix_b = generate_square_matrix(Some(17));
    let matrix_c = generate_square_matrix(None);

    // Matrix A --- stored in GPU accessible memory, CPU can't access it
    let (matrix_a_buf, _) = ImmutableBuffer::from_iter(matrix_a, BufferUsage::all(), queue.clone())
        .expect("failed to create uniform buffer");
    // Matrix B --- stored in GPU accessible memory, CPU can't access it
    let (matrix_b_buf, _) = ImmutableBuffer::from_iter(matrix_b, BufferUsage::all(), queue.clone())
        .expect("failed to create uniform buffer");
    // Matrix C --- resulting matrix can be accessed by both CPU, GPU
    let matrix_c_buf =
        CpuAccessibleBuffer::from_iter(logical_device.clone(), BufferUsage::all(), false, matrix_c)
            .expect("failed to create storage buffer");

    // loading compute shader, including shader compilation
    // abstracted with macro!
    let shader = cs::Shader::load(logical_device.clone()).unwrap();
    // preparing compute pipeline
    let compute_pipeline = Arc::new(
        ComputePipeline::new(
            logical_device.clone(),
            &shader.main_entry_point(),
            &(),
            None,
        )
        .unwrap(),
    );

    // adding descriptors as per layout, into compute pipeline
    let layout = compute_pipeline.layout().descriptor_set_layout(0).unwrap();
    let set = Arc::new(
        PersistentDescriptorSet::start(layout.clone())
            .add_buffer(matrix_a_buf.clone())
            .unwrap()
            .add_buffer(matrix_b_buf.clone())
            .unwrap()
            .add_buffer(matrix_c_buf.clone())
            .unwrap()
            .build()
            .unwrap(),
    );

    // create command buffer & start recording commands in it
    let mut builder = AutoCommandBufferBuilder::primary(
        logical_device.clone(),
        queue.family(),
        CommandBufferUsage::OneTimeSubmit,
    )
    .unwrap();
    // only single command recorded in command buffer
    builder
        .dispatch(
            [1024 / 8, 1024 / 4, 1],
            compute_pipeline.clone(),
            set.clone(),
            (),
            std::iter::empty(),
        )
        .unwrap();
    // ending command recording
    let command_buffer = builder.build().unwrap();

    // Computing Matrix Multiplication on GPU
    let start = Instant::now();
    let finished = command_buffer.execute(queue.clone()).unwrap();
    finished
        .then_signal_fence_and_flush()
        .unwrap()
        .wait(None)
        .unwrap();
    let gpu_tm = start.elapsed();
    println!("GPU matrix multiply: {:?}", gpu_tm);

    let r_matrix_a = generate_square_matrix(Some(13)).collect::<Vec<i32>>();
    let r_matrix_b = generate_square_matrix(Some(17)).collect::<Vec<i32>>();
    // reading GPU-computed matrix multiplication result
    let gpu_result = matrix_c_buf.read().unwrap();

    // Computing Matrix Multiplication on CPU, and asserting !
    let start = Instant::now();
    for i in 0..1024 {
        for j in 0..1024 {
            let mut sum = 0i32;
            for k in 0..1024 {
                sum += r_matrix_a[i * 1024 + k] * r_matrix_b[k * 1024 + j];
            }
            assert_eq!(sum, gpu_result[i * 1024 + j]);
        }
    }
    println!(
        "CPU matrix multiply: {:?}\nSpeed Up: {}",
        start.elapsed(),
        start.elapsed().as_nanos() / gpu_tm.as_nanos()
    );
}

// reproducible random matrix generator, as single dimensional iterator
fn generate_square_matrix(seed: Option<u64>) -> Box<dyn std::iter::ExactSizeIterator<Item = i32>> {
    match seed {
        Some(seed) => {
            let mut rng = StdRng::seed_from_u64(seed);
            Box::new((0..N).map(move |_| rng.gen::<i32>()))
        }
        None => Box::new((0..N).map(|_| 0)),
    }
}

mod cs {
    // does shader compilation
    vulkano_shaders::shader! {
        ty: "compute",
        path: "./matrix_multiply.glsl",
        vulkan_version: "1.2",
    }
}
@itzmeanjan
Author

itzmeanjan commented Sep 6, 2021

Background

This code snippet accompanies a post; if you haven't read the post yet, start there.

Usage

  • Download this gist
  • Create the project directory tree by running
cargo init
  • Add the following dependencies to Cargo.toml
vulkano = "0.24.0"
vulkano-shaders = "0.24.0"
rand = "0.8.4"
  • Paste the Rust code into the generated project's src/main.rs
  • Copy the compute shader code into a file with the same name as in the gist (the shader! macro expects ./matrix_multiply.glsl), in the root of the cargo project
  • Build & run the project
cargo run --release
  • Your results will differ from mine
Device: Intel(R) HD Graphics 5500 (BDW GT2)
Vulkan API: 1.2.0
Queue Count: 1  Compute: true   Graphics: true
Subgroup Size: 32
GPU matrix multiply: 381.724958ms
CPU matrix multiply: 7.428704881s
Speed Up: 19
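
A note on why the run command uses --release: the matrices are filled with full-range random i32 values, so the `sum += a * b` in the CPU check overflows almost immediately. Release builds wrap on overflow by default, which (as far as I know) matches what GLSL `int` arithmetic does in the shader, while a plain debug build would panic instead. Below is a minimal sketch of a CPU reference that wraps explicitly and so behaves the same in both build modes. The function name `matmul_wrapping` is hypothetical and not part of the gist.

```rust
/// Multiply two row-major `n x n` matrices of `i32`, wrapping on overflow.
/// (Hypothetical helper, written for illustration only.)
fn matmul_wrapping(a: &[i32], b: &[i32], n: usize) -> Vec<i32> {
    assert_eq!(a.len(), n * n);
    assert_eq!(b.len(), n * n);
    let mut c = vec![0i32; n * n];
    for i in 0..n {
        for j in 0..n {
            let mut sum = 0i32;
            for k in 0..n {
                // wrapping_* keeps the low 32 bits in debug and release builds alike
                sum = sum.wrapping_add(a[i * n + k].wrapping_mul(b[k * n + j]));
            }
            c[i * n + j] = sum;
        }
    }
    c
}
```

With the gist's data this would be called as `matmul_wrapping(&r_matrix_a, &r_matrix_b, 1024)` and the result compared against the GPU buffer element by element.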

@seddonm1

Thank you for this code.

Updated code for latest versions:

vulkano = "0.32.1"
vulkano-shaders = "0.32.0"
rand = "0.8.4"

matrix_multiply.rs

use rand::{rngs::StdRng, Rng, SeedableRng};
use std::time::Instant;
use vulkano::{
    buffer::{BufferUsage, CpuAccessibleBuffer, DeviceLocalBuffer},
    command_buffer::{
        allocator::StandardCommandBufferAllocator, AutoCommandBufferBuilder, CommandBufferUsage,
        PrimaryCommandBufferAbstract,
    },
    descriptor_set::{
        allocator::StandardDescriptorSetAllocator, PersistentDescriptorSet, WriteDescriptorSet,
    },
    device::{
        physical::PhysicalDeviceType, Device, DeviceCreateInfo, DeviceExtensions, QueueCreateInfo,
        QueueFlags,
    },
    instance::{Instance, InstanceCreateInfo},
    memory::allocator::StandardMemoryAllocator,
    pipeline::{Pipeline, PipelineBindPoint},
    sync::GpuFuture,
    VulkanLibrary,
};

const N: u32 = 1 << 20;

fn main() {
    let library = VulkanLibrary::new().unwrap();
    let instance = Instance::new(
        library,
        InstanceCreateInfo {
            // Enable enumerating devices that use non-conformant vulkan implementations. (ex. MoltenVK)
            enumerate_portability: true,
            ..Default::default()
        },
    )
    .unwrap();

    let device_extensions = DeviceExtensions {
        khr_storage_buffer_storage_class: true,
        ..DeviceExtensions::empty()
    };
    let (physical_device, queue_family_index) = instance
        .enumerate_physical_devices()
        .unwrap()
        .filter(|p| p.supported_extensions().contains(&device_extensions))
        .filter_map(|p| {
            p.queue_family_properties()
                .iter()
                .position(|q| {
                    q.queue_flags.intersects(&QueueFlags {
                        compute: true,
                        ..QueueFlags::empty()
                    })
                })
                .map(|i| (p, i as u32))
        })
        .min_by_key(|(p, _)| match p.properties().device_type {
            PhysicalDeviceType::DiscreteGpu => 0,
            PhysicalDeviceType::IntegratedGpu => 1,
            PhysicalDeviceType::VirtualGpu => 2,
            PhysicalDeviceType::Cpu => 3,
            PhysicalDeviceType::Other => 4,
            _ => 5,
        })
        .unwrap();

    println!(
        "Using device: {} (type: {:?}). API version: {}",
        physical_device.properties().device_name,
        physical_device.properties().device_type,
        physical_device.api_version()
    );

    let (device, mut queues) = Device::new(
        physical_device,
        DeviceCreateInfo {
            enabled_extensions: device_extensions,
            queue_create_infos: vec![QueueCreateInfo {
                queue_family_index,
                ..Default::default()
            }],
            ..Default::default()
        },
    )
    .unwrap();

    let queue = queues.next().unwrap();

    let memory_allocator = StandardMemoryAllocator::new_default(device.clone());
    let descriptor_set_allocator = StandardDescriptorSetAllocator::new(device.clone());
    let command_buffer_allocator =
        StandardCommandBufferAllocator::new(device.clone(), Default::default());

    let mut builder = AutoCommandBufferBuilder::primary(
        &command_buffer_allocator,
        queue.queue_family_index(),
        CommandBufferUsage::OneTimeSubmit,
    )
    .unwrap();

    // Deterministically produce the GPU matrices
    let matrix_a = generate_square_matrix(Some(13));
    let matrix_b = generate_square_matrix(Some(17));
    let matrix_c = generate_square_matrix(None);

    // Matrix A --- stored in GPU accessible memory, CPU can't access it
    let matrix_a_buf = DeviceLocalBuffer::from_iter(
        &memory_allocator,
        matrix_a,
        BufferUsage {
            storage_buffer: true,
            ..BufferUsage::empty()
        },
        &mut builder,
    )
    .expect("failed to create uniform buffer");

    // Matrix B --- stored in GPU accessible memory, CPU can't access it
    let matrix_b_buf = DeviceLocalBuffer::from_iter(
        &memory_allocator,
        matrix_b,
        BufferUsage {
            storage_buffer: true,
            ..BufferUsage::empty()
        },
        &mut builder,
    )
    .expect("failed to create uniform buffer");

    // Matrix C --- resulting matrix can be accessed by both CPU, GPU
    let matrix_c_buf = CpuAccessibleBuffer::from_iter(
        &memory_allocator,
        BufferUsage {
            storage_buffer: true,
            ..BufferUsage::empty()
        },
        false,
        matrix_c,
    )
    .expect("failed to create storage buffer");

    // loading compute shader, including shader compilation
    // abstracted with macro!
    let cs = cs::load(device.clone()).unwrap();

    // Create the compute pipeline from the shader's `main` entry point.
    let compute_pipeline = vulkano::pipeline::ComputePipeline::new(
        device.clone(),
        cs.entry_point("main").unwrap(),
        &(),
        None,
        |_| {},
    )
    .expect("Failed to create compute shader");

    // adding descriptors as per layout, into compute pipeline
    let layout = compute_pipeline.layout().set_layouts().get(0).unwrap();
    let set = PersistentDescriptorSet::new(
        &descriptor_set_allocator,
        layout.clone(),
        [
            WriteDescriptorSet::buffer(0, matrix_a_buf.clone()),
            WriteDescriptorSet::buffer(1, matrix_b_buf.clone()),
            WriteDescriptorSet::buffer(2, matrix_c_buf.clone()),
        ],
    )
    .unwrap();

    // bind the pipeline and descriptor set, then record the dispatch
    // (the builder also holds the upload copies recorded by DeviceLocalBuffer::from_iter above)
    builder
        .bind_pipeline_compute(compute_pipeline.clone())
        .bind_descriptor_sets(
            PipelineBindPoint::Compute,
            compute_pipeline.layout().clone(),
            0,
            set,
        )
        .dispatch([1024 / 8, 1024 / 4, 1])
        .unwrap();
    let command_buffer = builder.build().unwrap();

    // Computing Matrix Multiplication on GPU
    let start = Instant::now();
    let finished = command_buffer.execute(queue.clone()).unwrap();
    finished
        .then_signal_fence_and_flush()
        .unwrap()
        .wait(None)
        .unwrap();
    let gpu_tm = start.elapsed();

    // reading GPU-computed matrix multiplication result
    let gpu_result = matrix_c_buf.read().unwrap();

    // Deterministically produce the CPU matrices
    let r_matrix_a = generate_square_matrix(Some(13)).collect::<Vec<i32>>();
    let r_matrix_b = generate_square_matrix(Some(17)).collect::<Vec<i32>>();

    // Computing Matrix Multiplication on CPU, and asserting !
    let start = Instant::now();
    for i in 0..1024 {
        for j in 0..1024 {
            let mut sum = 0i32;
            for k in 0..1024 {
                sum += r_matrix_a[i * 1024 + k] * r_matrix_b[k * 1024 + j];
            }
            assert_eq!(sum, gpu_result[i * 1024 + j]);
        }
    }
    let cpu_tm = start.elapsed();

    println!(
        "GPU matrix multiply: {:?}\nCPU matrix multiply: {:?}\nSpeed Up: {}x",
        gpu_tm,
        cpu_tm,
        cpu_tm.as_nanos() / gpu_tm.as_nanos()
    );
}

// reproducible random matrix generator, as single dimensional iterator
fn generate_square_matrix(seed: Option<u64>) -> Box<dyn std::iter::ExactSizeIterator<Item = i32>> {
    match seed {
        Some(seed) => {
            let mut rng = StdRng::seed_from_u64(seed);
            Box::new((0..N).map(move |_| rng.gen::<i32>()))
        }
        None => Box::new((0..N).map(|_| 0)),
    }
}

mod cs {
    // does shader compilation
    vulkano_shaders::shader! {
        ty: "compute",
        path: "./matrix_multiply.glsl",
        vulkan_version: "1.2",
    }
}

Result:

Using device: AMD Radeon Pro 5500M (type: DiscreteGpu). API version: 1.2.231
GPU matrix multiply: 41.683247ms
CPU matrix multiply: 1.453689134s
Speed Up: 34x
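
For anyone adapting the dispatch call: the shader declares a local size of 8 x 4 x 1, and the host dispatches [1024 / 8, 1024 / 4, 1] workgroups, which yields exactly 1024 x 1024 invocations, one per output element. The division is exact only because 1024 is a multiple of both 8 and 4. A small sketch of the general case follows, assuming you round up and rely on the shader's `row >= 1024 || col >= 1024` guard to discard the extra invocations. The helper names `workgroups` and `dispatch_size` are made up for this example.

```rust
/// Number of workgroups needed along one axis so that
/// `groups * local_size >= extent` (ceiling division).
fn workgroups(extent: u32, local_size: u32) -> u32 {
    assert!(local_size > 0);
    (extent + local_size - 1) / local_size
}

/// Dispatch dimensions for a `rows x cols` output, given the shader's local size.
/// Only the x and y local sizes are used; z is fixed to a single workgroup.
fn dispatch_size(rows: u32, cols: u32, local: [u32; 3]) -> [u32; 3] {
    [workgroups(rows, local[0]), workgroups(cols, local[1]), 1]
}
```

`dispatch_size(1024, 1024, [8, 4, 1])` gives the same [128, 256, 1] that the code above passes to `.dispatch(...)`.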

@itzmeanjan
Author

Thanks @seddonm1

@mjaric

mjaric commented Jan 14, 2023

The comparison is not fair: the CPU is doing the calculation in a single thread. A dot product of those two arrays (which is more than the shader does) using ndarray takes 275 microseconds on a 16 vCore CPU.

Using device: Radeon RX 590 Series (type: DiscreteGpu). API version: 1.3.209
GPU matrix multiply: 28.4929ms
CPU matrix multiply: 548.6µs <-- dot product using ndarray
Speed Up: 0x

Anyhow, the example is nice :) Thanks!
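
The "Speed Up: 0x" line in that output comes from the integer division `cpu_tm.as_nanos() / gpu_tm.as_nanos()`, which truncates any ratio below 1 to zero. A minimal sketch of a floating-point ratio instead, where the function name `speed_up` is hypothetical:

```rust
use std::time::Duration;

/// Speed-up of the GPU run over the CPU run as a floating-point ratio.
/// Values below 1.0 mean the CPU run was faster.
fn speed_up(cpu: Duration, gpu: Duration) -> f64 {
    cpu.as_secs_f64() / gpu.as_secs_f64()
}
```

It could be printed with something like `println!("Speed Up: {:.3}x", speed_up(cpu_tm, gpu_tm))`, so that a slower GPU run shows up as a fraction rather than as 0.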

@macchky

macchky commented Jul 13, 2023

Tested on an RX 6700 XT and a Ryzen 7 5800X3D

Device: AMD Radeon RX 6700 XT
Vulkan API: 1.2.0
Queue Count: 1  Compute: true   Graphics: true
Queue Count: 2  Compute: true   Graphics: false
Queue Count: 2  Compute: false  Graphics: false
GPU matrix multiply: 10.1086ms
CPU matrix multiply: 2.7680681s
Speed Up: 273

@itzmeanjan
Author

Thanks for reporting this @macchky.

@itzmeanjan
Author

Totally agreed with you @mjaric.
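
To make the single-thread point concrete, here is a rough sketch of a multi-threaded CPU baseline that uses only the standard library (`std::thread::scope`, available since Rust 1.63). It is not the ndarray approach mentioned above, and the function name `matmul_parallel` is hypothetical. It splits the rows of the result across worker threads and uses wrapping arithmetic so it also runs in debug builds.

```rust
use std::thread;

/// Multiply two row-major `n x n` `i32` matrices, splitting the rows of the
/// result across up to `threads` scoped worker threads.
fn matmul_parallel(a: &[i32], b: &[i32], n: usize, threads: usize) -> Vec<i32> {
    assert_eq!(a.len(), n * n);
    assert_eq!(b.len(), n * n);
    let mut c = vec![0i32; n * n];
    if n == 0 {
        return c;
    }
    // rows handled by each worker, rounded up so every row is covered
    let workers = threads.max(1);
    let rows_per_chunk = ((n + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // each worker gets a disjoint, mutable block of output rows
        for (chunk_idx, chunk) in c.chunks_mut(rows_per_chunk * n).enumerate() {
            s.spawn(move || {
                let first_row = chunk_idx * rows_per_chunk;
                for (r, row) in chunk.chunks_mut(n).enumerate() {
                    let i = first_row + r;
                    for (j, out) in row.iter_mut().enumerate() {
                        let mut sum = 0i32;
                        for k in 0..n {
                            sum = sum.wrapping_add(a[i * n + k].wrapping_mul(b[k * n + j]));
                        }
                        *out = sum;
                    }
                }
            });
        }
    });
    c
}
```

A reasonable thread count to pass in is `std::thread::available_parallelism()`. How close this gets to the GPU timings will depend on the machine, so no numbers are claimed here.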

@minghuaw

I was getting really poor performance on an RTX 2070 Super (~650 ms). Does running this in WSL affect performance?

@itzmeanjan
Author

> I was getting really poor performance on an RTX 2070 Super (~650 ms). Does running this in WSL affect performance?

There could be many reasons why it ran slowly on a given platform. As I've never tried running it on WSL, I can't really say anything.

@minghuaw

minghuaw commented Jul 24, 2023

> > I was getting really poor performance on an RTX 2070 Super (~650 ms). Does running this in WSL affect performance?
>
> There could be many reasons why it ran slowly on a given platform. As I've never tried running it on WSL, I can't really say anything.

It seems WSL is indeed the cause of the low performance. The same code takes 10 ms to complete on Windows but over 600 ms in WSL.

@itzmeanjan
Author

> > > I was getting really poor performance on an RTX 2070 Super (~650 ms). Does running this in WSL affect performance?
> >
> > There could be many reasons why it ran slowly on a given platform. As I've never tried running it on WSL, I can't really say anything.
>
> It seems WSL is indeed the cause of the low performance. The same code takes 10 ms to complete on Windows but over 600 ms in WSL.

Totally possible.

@airlied

airlied commented Mar 15, 2024

Random comment: you should rework this to use VK_KHR_cooperative_matrix; it would make a good example for comparing against the native shader.

@itzmeanjan
Author

> Random comment: you should rework this to use VK_KHR_cooperative_matrix; it would make a good example for comparing against the native shader.

Thanks for the suggestion, though I'm not actively maintaining this. You may send me a patch and I'll update the gist.
