Bitonic Sort on CUDA. On a quick benchmark it was 10x faster than the CPU version.
/*
* Parallel bitonic sort using CUDA.
* Compile with
* nvcc -arch=sm_11 bitonic_sort.cu
* Based on http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm
* License: BSD 3
*/
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
/* Every thread gets exactly one value in the unsorted array. */
#define THREADS 512 // 2^9
#define BLOCKS 32768 // 2^15
#define NUM_VALS THREADS*BLOCKS
void print_elapsed(clock_t start, clock_t stop)
{
  double elapsed = ((double) (stop - start)) / CLOCKS_PER_SEC;
  printf("Elapsed time: %.3fs\n", elapsed);
}

float random_float()
{
  return (float)rand()/(float)RAND_MAX;
}

void array_print(float *arr, int length)
{
  int i;
  for (i = 0; i < length; ++i) {
    printf("%1.3f ", arr[i]);
  }
  printf("\n");
}

void array_fill(float *arr, int length)
{
  srand(time(NULL));
  int i;
  for (i = 0; i < length; ++i) {
    arr[i] = random_float();
  }
}
__global__ void bitonic_sort_step(float *dev_values, int j, int k)
{
  unsigned int i, ixj; /* Sorting partners: i and ixj */
  i = threadIdx.x + blockDim.x * blockIdx.x;
  ixj = i^j;

  /* The threads with the lowest ids sort the array. */
  if ((ixj)>i) {
    if ((i&k)==0) {
      /* Sort ascending */
      if (dev_values[i]>dev_values[ixj]) {
        /* exchange(i,ixj); */
        float temp = dev_values[i];
        dev_values[i] = dev_values[ixj];
        dev_values[ixj] = temp;
      }
    }
    if ((i&k)!=0) {
      /* Sort descending */
      if (dev_values[i]<dev_values[ixj]) {
        /* exchange(i,ixj); */
        float temp = dev_values[i];
        dev_values[i] = dev_values[ixj];
        dev_values[ixj] = temp;
      }
    }
  }
}
/**
 * Inplace bitonic sort using CUDA.
 */
void bitonic_sort(float *values)
{
  float *dev_values;
  size_t size = NUM_VALS * sizeof(float);

  cudaMalloc((void**) &dev_values, size);
  cudaMemcpy(dev_values, values, size, cudaMemcpyHostToDevice);

  dim3 blocks(BLOCKS,1);   /* Number of blocks   */
  dim3 threads(THREADS,1); /* Number of threads  */

  int j, k;
  /* Major step */
  for (k = 2; k <= NUM_VALS; k <<= 1) {
    /* Minor step */
    for (j=k>>1; j>0; j=j>>1) {
      bitonic_sort_step<<<blocks, threads>>>(dev_values, j, k);
    }
  }
  cudaMemcpy(values, dev_values, size, cudaMemcpyDeviceToHost);
  cudaFree(dev_values);
}
int main(void)
{
  clock_t start, stop;

  float *values = (float*) malloc(NUM_VALS * sizeof(float));
  array_fill(values, NUM_VALS);

  start = clock();
  bitonic_sort(values); /* Inplace */
  stop = clock();

  print_elapsed(start, stop);

  free(values);
  return 0;
}
DanyStinson commented Jan 4, 2016

Hi!
Do you have the CPU version code?

mre commented Oct 8, 2016

@DanyStinson sorry, just saw your message.
You've probably already found an implementation yourself, but in case somebody else is looking for bitonic sort in C, here are two projects:
https://github.com/steremma/Bitonic
https://github.com/orestisf1993/mpi_bitonic_sort
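
For anyone who just wants something small to benchmark against, here is a minimal sequential sketch in plain C (not the original CPU benchmark code); it mirrors the k/j loop structure of the CUDA kernel above and, like the kernel, assumes a power-of-two length:

/* Sequential bitonic sort sketch; n must be a power of two. */
void bitonic_sort_cpu(float *values, int n)
{
  for (int k = 2; k <= n; k <<= 1) {          /* major step */
    for (int j = k >> 1; j > 0; j >>= 1) {    /* minor step */
      for (int i = 0; i < n; ++i) {
        int ixj = i ^ j;
        if (ixj > i) {
          int ascending = ((i & k) == 0);
          if (( ascending && values[i] > values[ixj]) ||
              (!ascending && values[i] < values[ixj])) {
            float temp = values[i];
            values[i] = values[ixj];
            values[ixj] = temp;
          }
        }
      }
    }
  }
}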

ADITYA-TALAVIYA commented Feb 14, 2017

How do I give my own integers as input, and can you tell me what the length is?

mre commented Apr 27, 2017

There's a function called bitonic_sort and it has one parameter: the list of values.

Call it like this:

bitonic_sort(your_values);

The values need to be floating point (not integer).
The length is NUM_VALS.
If you know the number of values, you can adjust the settings at the top:

#define THREADS 512 // 2^9
#define BLOCKS 32768 // 2^15
#define NUM_VALS THREADS*BLOCKS

Modify THREADS and BLOCKS so that THREADS*BLOCKS is the total number of your values (it has to be a power of two).
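
A hypothetical sketch (my_ints and its contents are made up) of sorting your own integers by converting them to floats first:

/* Assumes THREADS and BLOCKS have been adjusted so that NUM_VALS
 * matches the number of your values (a power of two). */
int my_ints[NUM_VALS];                          /* your data */
float *values = (float*) malloc(NUM_VALS * sizeof(float));
for (int i = 0; i < NUM_VALS; ++i) {
  values[i] = (float) my_ints[i];               /* convert to float */
}
bitonic_sort(values);                           /* values is now sorted ascending */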

ADITYA-TALAVIYA commented Apr 28, 2017

OK, thanks.
But I have 29468 values, which is not in 2^n form, so I've padded it to 29472 floating-point values. Can you tell me the proper sizes of THREADS and BLOCKS to divide it into?

byteunit commented Jun 5, 2017

I'm sorry, but this benchmark only measures how fast the kernel launch is.

As stated here (https://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc/), you need an additional
cudaDeviceSynchronize();
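
For reference, a sketch of the event-based timing described in that post, timing the whole bitonic_sort call (i.e. including the host/device copies); cudaEventElapsedTime reports milliseconds:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
bitonic_sort(values);            /* kernel launches + memcpys */
cudaEventRecord(stop);

cudaEventSynchronize(stop);      /* block until the stop event has completed */
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Elapsed time: %.3f ms\n", milliseconds);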

gggprojects commented Jul 5, 2017

@byteunit That's not true. cudaMemcpy is a blocking memory operation that causes synchronization of the device, so the measured time includes the kernel execution.

kosalape commented Sep 19, 2017

Hi,
This ONLY makes a bitonic sequence but NOT a bitonic sort, right?

twmht commented Jun 28, 2018

@mre

I have tried that, but it's slow on a GTX 1080.

In the original settings,

#define THREADS 512 // 2^9
#define BLOCKS 32768 // 2^15

I got 0.3s.

With the new settings

#define THREADS 4
#define BLOCKS 250

I get 0.178s.

I have no idea why it's slow.


etale-cohomology commented Sep 26, 2019

Does it work on NON-power-of-two arrays?

mre commented Sep 26, 2019

No. That's a limitation of the algorithm. You could fill up the rest of the array with "tombstone data" that you would later discard again.
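
A rough sketch of that padding idea (pad_to_pow2, next_pow2, and the FLT_MAX sentinel are just one possible choice, not part of the gist): round the length up to the next power of two, pad with FLT_MAX so the padding sorts to the end, and read back only the first n elements afterwards.

#include <float.h>
#include <string.h>

/* Smallest power of two >= n. */
size_t next_pow2(size_t n)
{
  size_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

/* Copy n values into a power-of-two sized buffer padded with FLT_MAX,
 * which acts as the tombstone value and sorts to the end. */
float *pad_to_pow2(const float *values, size_t n, size_t *padded_out)
{
  size_t padded = next_pow2(n);
  float *buf = (float*) malloc(padded * sizeof(float));
  memcpy(buf, values, n * sizeof(float));
  for (size_t i = n; i < padded; ++i) {
    buf[i] = FLT_MAX;
  }
  *padded_out = padded;
  return buf;
}
/* After sorting buf (with THREADS*BLOCKS == padded), only the first
 * n elements are your data; the FLT_MAX padding can be discarded. */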

etale-cohomology commented Sep 27, 2019

Great to know! I was worried I was missing something...

penglz commented Dec 27, 2019

Half of the blocks are idle all the time? Wouldn't it be more efficient if NUM_VALS = 2*THREADS*BLOCKS?
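
For what it's worth, here is a hedged sketch of that idea (bitonic_sort_step_packed is not part of the gist): launch only NUM_VALS/2 threads and let each one reconstruct its own compare-exchange pair, so no thread sits idle.

/* Hypothetical variant: one compare-exchange per thread. Launch with half
 * as many threads overall (e.g. halve BLOCKS), since each thread handles
 * exactly one (i, ixj) pair instead of filtering on ixj > i. */
__global__ void bitonic_sort_step_packed(float *dev_values, int j, int k)
{
  unsigned int t = threadIdx.x + blockDim.x * blockIdx.x;
  /* Rebuild the lower index i of the pair by inserting a 0 bit at the
   * position of j (j is a power of two). */
  unsigned int i = 2 * (t & ~(unsigned int)(j - 1)) + (t & (unsigned int)(j - 1));
  unsigned int ixj = i ^ j;                 /* the upper partner, i + j */
  int ascending = ((i & k) == 0);
  int out_of_order = ascending ? (dev_values[i] > dev_values[ixj])
                               : (dev_values[i] < dev_values[ixj]);
  if (out_of_order) {
    float temp = dev_values[i];
    dev_values[i] = dev_values[ixj];
    dev_values[ixj] = temp;
  }
}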
