@JonathanRaiman
JonathanRaiman / some_file.cu
Created April 23, 2016 21:17
Get reduction over all dimensions to work in mshadow
/*
Reduction over all dimensions in mshadow. Requires changing the structs in
mshadow expressions to store their input expressions by value instead of
by reference.
Installation:
nvcc some_file.cu -std=c++11 -O3 -w -o some_file -I /usr/local/include
Usage:
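For reference, a minimal sketch (plain CUDA, not the gist's mshadow-based code) of what a reduction over all dimensions boils down to: every element of a flat buffer folded into a single scalar, here with a grid-stride loop plus a shared-memory tree reduction.
#include <cstdio>
#include <cuda_runtime.h>

// Sum every element of `in` into *out (assumed zero-initialized).
__global__ void reduce_all(const float* in, float* out, int n) {
    extern __shared__ float buf[];
    int tid = threadIdx.x;
    float acc = 0.0f;
    // grid-stride loop: each thread folds a strided subset of the input
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x) {
        acc += in[i];
    }
    buf[tid] = acc;
    __syncthreads();
    // shared-memory tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, buf[0]);
}
// launch example: reduce_all<<<64, 256, 256 * sizeof(float)>>>(d_in, d_out, n);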
@JonathanRaiman
JonathanRaiman / array.cu
Created July 8, 2016 07:43
CUDA / CPU one file Array library
#include <vector>
#include <string>
#include <memory>
#include <sstream>
#include <iostream>
#define XINLINE __device__ __host__
#define MAX_DIM 10
#define INDENT_INCREMENT 2
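To illustrate what the XINLINE macro buys (a sketch, not the gist's actual Array class): any method tagged __device__ __host__ compiles for both CPU and GPU, so one shape type can be used inside kernels and in host code alike.
struct Shape {
    int sizes[MAX_DIM];
    int ndim;
    // usable from both host code and CUDA kernels thanks to XINLINE
    XINLINE int numel() const {
        int total = 1;
        for (int i = 0; i < ndim; ++i) total *= sizes[i];
        return total;
    }
    XINLINE int& operator[](int i) { return sizes[i]; }
};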
@JonathanRaiman
JonathanRaiman / awesome_scan.py
Created July 11, 2016 18:50
Scan multi arg in tensorflow
def listify(x):
    if isinstance(x, tuple):
        return list(x)
    return x

def awesome_scan(fn, elems, initializer=None, parallel_iterations=10, back_prop=True,
                 swap_memory=False, name=None):
    """scan on the list of tensors unpacked from `elems` on dimension 0.
    This scan operator repeatedly applies the callable `fn` to a sequence
    of elements from first to last. The elements are made of the tensors
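A hypothetical usage sketch (assuming awesome_scan keeps tf.scan's calling convention while letting both `elems` and the accumulator be tuples of tensors):
import tensorflow as tf

xs = tf.constant([1., 2., 3., 4.])
ys = tf.constant([10., 20., 30., 40.])
# the accumulator is a (running_sum, running_max) pair scanned jointly over (xs, ys)
sums_and_maxes = awesome_scan(
    lambda acc, elem: (acc[0] + elem[0], tf.maximum(acc[1], elem[1])),
    (xs, ys),
    initializer=(tf.constant(0.), tf.constant(0.)))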
@JonathanRaiman
JonathanRaiman / gemm_parallel.cpp
Created July 20, 2016 17:32
Explicit and implicit BLAS gemm parallelism.
/*
Comparing explicit BLAS parallelism using a thread pool
with vendor-implemented parallelism.
The program prints the runtime, averaged over 100 runs, of a matrix
multiply between two float matrices.
To run:
./gemm_parallel [<int> USE_EXPLICIT_PARALLELISM 0/1] [<int> LEADING_DIMENSION]
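A sketch of the two strategies being compared, assuming a CBLAS implementation such as OpenBLAS is linked (not the gist's exact code): explicit parallelism slices the output rows across std::threads, each issuing its own sgemm on a slice, while implicit parallelism hands the full multiply to the vendor library's internal threads.
#include <cblas.h>
#include <algorithm>
#include <thread>
#include <vector>

// C[m x n] = A[m x k] * B[k x n], row-major
void gemm_rows(const float* A, const float* B, float* C, int m, int n, int k) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, A, k, B, n, 0.0f, C, n);
}

// explicit parallelism: each thread multiplies a contiguous block of A's rows
// (the vendor library's own thread count should be pinned to 1 in this mode)
void explicit_parallel_gemm(const float* A, const float* B, float* C,
                            int m, int n, int k, int num_threads) {
    std::vector<std::thread> pool;
    int rows_per_thread = (m + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        int row0 = t * rows_per_thread;
        int rows = std::min(rows_per_thread, m - row0);
        if (rows <= 0) break;
        pool.emplace_back(gemm_rows, A + row0 * k, B, C + row0 * n, rows, n, k);
    }
    for (auto& th : pool) th.join();
}

// implicit parallelism: a single call, the BLAS library decides how to thread it
// gemm_rows(A, B, C, m, n, k);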
@JonathanRaiman
JonathanRaiman / array.h
Created September 3, 2016 23:36
Runtime Compilation of CUDA kernels
#ifndef RTC_ARRAY_H
#define RTC_ARRAY_H
#include <vector>
#include <string>
#include <memory>
#include <sstream>
#include <iostream>
#define XINLINE __device__ __host__
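A minimal sketch of the runtime-compilation path a header like this wraps (assuming NVRTC and the CUDA driver API, with error checking and context setup omitted; not the gist's actual Array code): compile a kernel source string to PTX at run time, load it as a module, and fetch the function handle.
#include <nvrtc.h>
#include <cuda.h>
#include <string>

const char* kScaleSource = R"(
extern "C" __global__ void scale(float* data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
})";

CUfunction compile_scale_kernel(CUmodule* module) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kScaleSource, "scale.cu", 0, nullptr, nullptr);
    const char* opts[] = {"--std=c++11"};
    nvrtcCompileProgram(prog, 1, opts);            // compile source -> PTX
    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::string ptx(ptx_size, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);
    cuModuleLoadData(module, ptx.c_str());         // load PTX into the current context
    CUfunction kernel;
    cuModuleGetFunction(&kernel, *module, "scale");
    return kernel;                                 // launch later with cuLaunchKernel
}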
@JonathanRaiman
JonathanRaiman / openmp.cpp
Created October 18, 2016 04:45
Cute openmp timing tests
// clang++ openmp.cpp -o openmp -fopenmp -O3 -std=c++11
#include <cassert>
#include <stdlib.h>
#include <cmath>
#include <omp.h>
#include <iostream>
template<typename T>
struct Vector {
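A small timing sketch in the same spirit (not the gist's Vector-based code): compare a serial and an OpenMP-parallel sum of a large array, using omp_get_wtime for wall-clock timing.
#include <omp.h>
#include <vector>
#include <iostream>

int main() {
    std::vector<double> data(1 << 24, 1.0);

    double t0 = omp_get_wtime();
    double serial_sum = 0.0;
    for (size_t i = 0; i < data.size(); ++i) serial_sum += data[i];
    double t1 = omp_get_wtime();

    double parallel_sum = 0.0;
    #pragma omp parallel for reduction(+:parallel_sum)
    for (long i = 0; i < (long)data.size(); ++i) parallel_sum += data[i];
    double t2 = omp_get_wtime();

    std::cout << "serial:   " << (t1 - t0) << "s sum=" << serial_sum << "\n"
              << "parallel: " << (t2 - t1) << "s sum=" << parallel_sum << "\n";
    return 0;
}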
@JonathanRaiman
JonathanRaiman / viterbi.py
Last active February 4, 2020 11:30
tensorflow_viterbi_decode
import tensorflow as tf

def batch_gather_3d(values, indices):
    return tf.gather(tf.reshape(values, [-1, tf.shape(values)[2]]),
                     tf.range(0, tf.shape(values)[0]) * tf.shape(values)[1] +
                     indices)

def batch_gather_2d(values, indices):
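An illustrative check of what batch_gather_3d computes (not part of the gist): for each batch row b it selects values[b, indices[b], :], which the TensorFlow code obtains by flattening the first two axes and offsetting each per-row index by b * time_steps.
import numpy as np

values = np.arange(2 * 3 * 4).reshape(2, 3, 4).astype(np.float32)  # [batch, time, states]
indices = np.array([2, 0])                                         # one time index per batch row
expected = np.stack([values[b, indices[b], :] for b in range(2)])
# batch_gather_3d(tf.constant(values), tf.constant(indices)) evaluates to `expected`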
@JonathanRaiman
JonathanRaiman / faux_cudnn.py
Last active March 22, 2019 11:49
Convert CUDNN LSTM to Dynamic RNN
"""
Little script demonstrating how to run cudnn rnns
without cudnn using dynamic rnn with the same weights
(e.g. train on cudnn, use with dynamic rnn on cpu).
Note: this will run slower than cudnn on a gpu (see below).
Tested on Titan X Pascal:
With cudnn 3.5s vs. with dynamic_rnn 8s to run through 79 batches
with batch size 128.
Network: input size: 127, 2 layer bidirectional LSTM with num_units 200.
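A minimal sketch of the cudnn-free side such a conversion targets (a hypothetical single bidirectional layer with the sizes quoted above; the weight transplanting itself is what the gist implements):
import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, None, 127])   # [batch, time, input_size]
cell_fw = tf.nn.rnn_cell.LSTMCell(200)
cell_bw = tf.nn.rnn_cell.LSTMCell(200)
outputs, _ = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, inputs, dtype=tf.float32)
# the gist's conversion step would then assign the cudnn-trained parameters into
# the cells' kernel/bias variables before running this graph on CPU.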
import random
from deap import algorithms, base, creator, tools
import numpy as np
domains = 100
num_entities = 10000
entity_num_domains = 5
num_mentions = 200
classifications = np.random.binomial(
    1, np.ones(domains) * entity_num_domains / domains, size=(num_entities, domains)
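A minimal DEAP skeleton consistent with the imports and variables above (the fitness here, covering as many entities as possible with a subset of domains, is a hypothetical stand-in, not necessarily the gist's objective):
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.attr_bool, n=domains)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

def coverage(individual):
    chosen = np.array(individual, dtype=bool)
    # an entity counts as covered if it falls under at least one chosen domain
    return (classifications[:, chosen].sum(axis=1) > 0).sum(),

toolbox.register("evaluate", coverage)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)

pop = toolbox.population(n=50)
algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=10, verbose=False)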
@JonathanRaiman
JonathanRaiman / plan.py
Last active November 27, 2018 02:25
Dali graph transformation Plan
"""
Micro-dali JIT Plan:
- contains gemm, operator fusion, elementwise/reduction ops.
- supports tensordot
- supports 'jit'
- supports conversion from gemm + im2col to conv2d (NHWC)
- supports 'optimization' passes
- supports 'implementation' registries for specialization
  (e.g. int vs float)
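The "gemm + im2col to conv2d (NHWC)" item above refers to the standard trick of lowering a convolution to a single matrix multiply; a small numpy sketch of that equivalence (stride 1, no padding, hypothetical shapes, not Dali's code):
import numpy as np

def im2col_nhwc(x, kh, kw):
    # x: (N, H, W, C) -> patches: (N, H-kh+1, W-kw+1, kh*kw*C)
    n, h, w, c = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((n, out_h, out_w, kh * kw * c), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i, j, :] = x[:, i:i + kh, j:j + kw, :].reshape(n, -1)
    return cols

def conv2d_via_gemm(x, filters):
    # filters: (kh, kw, C, F); one gemm over the flattened patches gives conv2d
    kh, kw, c, f = filters.shape
    cols = im2col_nhwc(x, kh, kw)
    out = cols.reshape(-1, kh * kw * c) @ filters.reshape(-1, f)
    return out.reshape(x.shape[0], x.shape[1] - kh + 1, x.shape[2] - kw + 1, f)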