青子守歌 aokomoriuta

## 倍精度.txt
Double-precision results for https://github.com/aokomoriuta/StudiesOfOpenCLWithCloo/tree/master/VectorAddition/HeavyWorkItem.
Same as single precision (changing the amount of work per work item does not make it any faster).

= Vector addition test =
Vary the amount of work done by one work item

Platform: NVIDIA CUDA (OpenCL 1.1 CUDA 4.1.1)
Number of devices: 2
* GeForce GTX 295 (NVIDIA Corporation)
* GeForce GTX 295 (NVIDIA Corporation)
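
To make "amount of work per work item" concrete, here is a minimal CPU-side sketch of the idea, not the kernel from the repository. The names `AddWorkItem`, `AddVectors`, and `perItem` are hypothetical, and the work items are simply run one after another in a loop; on a GPU each call to `AddWorkItem` would correspond to one work item.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical emulation of a single work item that adds
// `perItem` consecutive elements starting at globalId * perItem.
void AddWorkItem(std::size_t globalId, std::size_t perItem,
                 const std::vector<double>& a,
                 const std::vector<double>& b,
                 std::vector<double>& c)
{
    const std::size_t begin = globalId * perItem;
    for (std::size_t k = 0; k < perItem; ++k)
    {
        const std::size_t i = begin + k;
        if (i < c.size()) // the last work item may run past the end
        {
            c[i] = a[i] + b[i];
        }
    }
}

// Emulates launching ceil(n / perItem) work items, serially on the CPU.
std::vector<double> AddVectors(const std::vector<double>& a,
                               const std::vector<double>& b,
                               std::size_t perItem)
{
    assert(a.size() == b.size());
    assert(perItem > 0);

    const std::size_t n = a.size();
    const std::size_t globalSize = (n + perItem - 1) / perItem;

    std::vector<double> c(n);
    for (std::size_t id = 0; id < globalSize; ++id)
    {
        AddWorkItem(id, perItem, a, b, c);
    }
    return c;
}
```

Calling `AddVectors(a, b, 1)` and `AddVectors(a, b, 4)` gives the same result; only the number of work items and the elements handled by each one change, which is the parameter this test varies.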

## 倍精度.txt
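Double-precision version of https://github.com/aokomoriuta/StudiesOfOpenCLWithCloo/tree/master/VectorAddition/MultiGpu.

With double precision the speedup is even larger (x6 compared with a single CPU).
I expected double-precision arithmetic to be slower, but this is probably because the computation is memory-bound.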
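
A rough back-of-the-envelope check of the memory-bound guess, under stated assumptions: each element of c = a + b costs one floating-point addition, two loads, and one store, with no data reuse, and elements are 4 bytes (single) or 8 bytes (double). The symbol I (arithmetic intensity, FLOP per byte moved) is introduced here for illustration only.

$$
I_{\text{single}} = \frac{1\ \text{FLOP}}{3 \times 4\ \text{bytes}} = \frac{1}{12} \approx 0.083\ \text{FLOP/byte},
\qquad
I_{\text{double}} = \frac{1\ \text{FLOP}}{3 \times 8\ \text{bytes}} = \frac{1}{24} \approx 0.042\ \text{FLOP/byte}
$$

Both values are very low, so the run time is plausibly set by memory traffic rather than by arithmetic throughput in either precision. This does not by itself explain why the speedup over the CPU is larger in double precision; it only suggests that slower double-precision arithmetic need not show up in this test.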

= Vector addition test =
Use multiple GPUs

Platform: NVIDIA CUDA (OpenCL 1.1 CUDA 4.1.1)
Number of devices: 2

## 倍精度.txt
Double-precision results for https://github.com/aokomoriuta/StudiesOfOpenCLWithCloo/tree/master/VectorAddition/UseHostPointer.

The trend is the same as in single precision.
Again, however, the speedup is larger in double precision.

= Vector addition test =
Comparison with and without the host pointer

Platform: NVIDIA CUDA (OpenCL 1.1 CUDA 4.1.1)
Number of devices: 2

## Length2.cpp
#include<iostream>

// 2D vectors on the CPU
void Length2()
{
    // Number of elements
    const int N = 5;

    // x and y components
    double x[N] = {0, 1, 2, 3, 4};

## ReplaceFile.ps1
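# Change the current directory to the target directory
cd "target directory";

# Process each file
# (the script block must open on the same line as ForEach-Object)
Get-ChildItem | ForEach-Object {
    # For example, split the file name on '-'
    $data = $_.Name.Split('-');

    # For example, skip files whose names start with "H"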

## Parallel.cs
Parallel.For(0, 200, i=>
{
	result[i] = 0;
	for(int j = 0; j<200; j++)
	{
		result[i] += Math.Sqrt(Math.Sin(i + j));
	}
});

## cuda.log
----------------------------------------------
               Device Info
----------------------------------------------

----------------------------------------------
----------------------------------------------
## Benchmark :: Dense Matrix-Matrix product
----------------------------------------------

   -------------------------------

## particles.txt
1410 particles
8 threads
#00000: t=  0.0000 (00000) @ 11/04 21:31:23 (    0.00)
#00001: t=  0.0010 (00002) @ 11/04 21:31:25 (    1.57)

2460 particles
8 threads
#00000: t=  0.0000 (00000) @ 11/04 21:30:26 (    0.00)
#00001: t=  0.0010 (00002) @ 11/04 21:30:29 (    3.21)

## MsMpiBoost.cpp
#define MSMPI_NO_DEPRECATE_20

#include <iostream>
#include <boost/mpi.hpp>

int main()
{
    // MPI environment (calls MPI_Init and MPI_Finalize for us)
    boost::mpi::environment  env(true);

## apu.txt
===================================================
GPU Caps Viewer v1.20.1.1
http://www.ozone3d.net/gpu_caps_viewer/
===================================================


===================================[ System / CPU ]
- CPU Name: AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
- CPU Core Speed: 3718 MHz
- CPU logical cores: 4