C++ AMP: Introduction and Best Practices
Introduction
This sample contains 46 simple examples showing different C++ AMP applications and best practices, from device acquisition, to array and array_view, to exception handling, to correct performance measurement. All examples are commented, but I will explain some of the concepts in this article as well.
Building the Sample
There are no special requirements. This is a Console application.
Description
The first step in working with AMP is selecting the device on which you want to run your code. Ideally you want a device that supports DirectX 11 and is not dedicated to rendering to the display. Below is the output of the available accelerator properties on my machine. Note that Has display is true in this output; ideally it would be false, which is the case for a GPU that is not dedicated to rendering to the display. Please make sure your GPU driver is up to date: I could not create the NVIDIA accelerator until I installed the latest driver.
C++
/*
Accelerators and their properties.
-- Description: NVIDIA Quadro 5000M
Device path: PCI\VEN_10DE&DEV_06DA&SUBSYS_1520103C&REV_A3\4&ADCCE93&0&0018
Version: 11.0
Dedicated memory: 2047424 KB
Supports double precision: true
Limited double precision: true
Has display: true
Is emulated: false
Is debug: true
*/
Here is the code that produces the output above. It enumerates all devices available on the machine and prints the properties of each one.
C++
void AmpExamples::AcceleratorProperties()
{
cout << "\nAccelerators and their properties.\n";
vector<accelerator> list = accelerator::get_all();
for_each(list.begin(), list.end(), [](const accelerator a) {
wcout << " -- Description: " << a.description << endl;
wcout << " Device path: " << a.device_path << endl;
wcout << " Version: " << (a.version >> 16) << '.' << (a.version & 0xFFFF) << endl;
wcout << " Dedicated memory: " << a.dedicated_memory << " KB" << endl;
wcout << " Supports double precision: " << ((a.supports_double_precision) ? "true" : "false") << endl; // Note that full double precision is required by the concurrency::precise_math functions in <amp_math.h>
wcout << " Limited double precision: " << ((a.supports_limited_double_precision) ? "true" : "false") << endl;
wcout << " Has display: " << ((a.has_display) ? "true" : "false") << endl;
wcout << " Is emulated: " << ((a.is_emulated) ? "true" : "false") << endl;
wcout << " Is debug: " << ((a.is_debug) ? "true" : "false") << endl;
wcout << endl;
});
#ifndef _DEBUG
// Requires ref accelerator for debugging on GPU!!!
bool r = PickAccelerator();
#endif
// Now that we have a GPU accelerator, we can create views to other accelerators
accelerator_view warp = accelerator(accelerator::direct3d_warp).default_view;
wcout << L"\nAcquired another accelerator: " << warp.accelerator.description << endl;
// While default view is on the gpu
accelerator_view gpu = accelerator().default_view;
wcout << L" Default view: " << gpu.accelerator.description << endl;
}
bool AmpExamples::PickAccelerator()
{
bool success = false;
vector<accelerator> list = accelerator::get_all();
auto result = find_if(list.begin(), list.end(), [](const accelerator& a) {
return !a.is_emulated
&& a.supports_double_precision
//&& !a.has_display
;
});
if (result != list.end())
{
accelerator gpu = *result;
success = accelerator::set_default(gpu.device_path);
if (success)
{
wcout << "\nAccelerator for the process: " << gpu.description << endl;
return true;
}
}
accelerator warp(L"direct3d\\warp");
success = accelerator::set_default(warp.device_path);
wcout << "\nAccelerator for the process: " << warp.description << endl;
return success;
}
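The selection logic in PickAccelerator boils down to "take the first device that matches a predicate, otherwise use a fallback". The following is a minimal sketch of that pattern in standard C++, so it can be tried without a GPU. The DeviceInfo struct and the pick_device function are hypothetical stand-ins for concurrency::accelerator and are not part of C++ AMP.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the accelerator properties used in PickAccelerator.
struct DeviceInfo
{
    std::wstring device_path;
    bool is_emulated;
    bool supports_double_precision;
};

// Returns the path of the first non-emulated device with full double
// precision, or fallback_path when no such device is present.
std::wstring pick_device(const std::vector<DeviceInfo>& devices,
                         const std::wstring& fallback_path)
{
    auto it = std::find_if(devices.begin(), devices.end(),
        [](const DeviceInfo& d) {
            return !d.is_emulated && d.supports_double_precision;
        });
    return (it != devices.end()) ? it->device_path : fallback_path;
}
```

In the real code the returned path would be passed to accelerator::set_default, as PickAccelerator does above; adding !has_display to the predicate corresponds to the condition that is commented out there.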
If you are targeting specific hardware, you can create the accelerator directly by passing the accelerator constructor the system-wide unique path of the device, if you know it (that is, the “Device Instance Path” property of the device in Device Manager), e.g. accelerator a(L"PCI\\VEN_10DE&DEV_06DA&SUBSYS_1520103C&REV_A3\\4&ADCCE93&0&0018"). Note that the backslashes must be escaped in a C++ string literal.
Note that in order to debug GPU code (that is, to set breakpoints in AMP kernels) you must run on the ref device. Its properties on my machine are:
C++
/*
-- Description: Software Adapter
Device path: direct3d\ref
Version: 11.1
Dedicated memory: 0 KB
Supports double precision: true
Limited double precision: true
Has display: true
Is emulated: true
Is debug: true
*/
Hence I wrapped device selection in an #ifndef _DEBUG block, so that it runs only in non-debug builds:
C++
#ifndef _DEBUG
// Requires ref accelerator for debugging on GPU!!!
bool r = PickAccelerator();
#endif
You can also print messages from AMP code to the output window; there is an example in the solution.
An accelerator_view is the view of a device on which your AMP code is executed. Passing one is optional, but I recommend specifying it explicitly in your code.
C++
void AmpExamples::AcceleratorViewProperties()
{
cout << "\nAccelerator Views and their properties.\n";
vector<accelerator> list = accelerator::get_all();
for_each(list.begin(), list.end(), [](accelerator a) {
accelerator_view av = a.create_view();
wcout << " -- Description: " << av.accelerator.description << endl;
wcout << " Version: " << (av.version >> 16) << '.' << (av.version & 0xFFFF) << endl;
wcout << " Is debug: " << ((av.is_debug) ? "true" : "false") << endl;
wcout << " Queuing mode: " << ((av.queuing_mode == queuing_mode::queuing_mode_automatic) ? "automatic" : "immediate") << endl;
wcout << endl;
});
}
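Both listings print the version as (version >> 16) followed by (version & 0xFFFF): the major number is taken from the high 16 bits of the value and the minor number from the low 16 bits. A small standard C++ sketch of that decoding follows; decode_version and encode_version are hypothetical helper names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Splits a packed version value into (major, minor): the major number is in
// the high 16 bits and the minor number in the low 16 bits.
std::pair<unsigned, unsigned> decode_version(std::uint32_t packed)
{
    return std::pair<unsigned, unsigned>(packed >> 16, packed & 0xFFFFu);
}

// Inverse helper (hypothetical, for illustration only).
std::uint32_t encode_version(unsigned major, unsigned minor)
{
    assert(major <= 0xFFFFu && minor <= 0xFFFFu);
    return (static_cast<std::uint32_t>(major) << 16) | minor;
}
```

With this layout the value 0x000B0000 decodes to 11.0 and 0x000B0001 to 11.1, which matches the Version lines in the outputs shown earlier.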
The next several examples demonstrate basic concepts of AMP. Although I time the execution of some of the code, the intention is only to understand what AMP is doing under the hood and where it spends time. You will notice a long pause the first time AMP is accessed: the runtime is initializing the AMP framework. In addition, each kernel (restrict(amp) code) must be compiled, which causes an initial performance penalty. The last two examples in the solution show how to measure kernel code correctly: before measuring performance, the code executes a small warm-up routine to force JIT compilation of the kernel.
But let's get back to basics. Most of the time you will be working with array_view, which is a view over the underlying data rather than a container that owns it. The array type is used mostly when you want to measure performance, interoperate with DirectX, or store data on the device. AMP automatically handles transferring data between the host and the device when you use array_view.
You can call functions from within the kernel. In that case the function must be marked restrict(amp).
C++
void AmpExamples::AddElementsInternal(
index<1> idx,
array_view<int, 1> sum,
const array_view<const int, 1> a,
const array_view<const int, 1> b
) restrict(amp)
{
sum[idx] = a[idx] + b[idx];
}
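When experimenting with a kernel like this, it can help to compare its output against a plain host-side computation. The sketch below is one possible reference in standard C++; add_elements_host is a hypothetical helper and is not part of the sample.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Host-side reference: sum[i] = a[i] + b[i], the same arithmetic that
// AddElementsInternal performs for a single index on the device.
std::vector<int> add_elements_host(const std::vector<int>& a,
                                   const std::vector<int>& b)
{
    assert(a.size() == b.size());
    std::vector<int> sum(a.size());
    std::transform(a.begin(), a.end(), b.begin(), sum.begin(), std::plus<int>());
    return sum;
}
```

Comparing the vector behind the sum array_view with this result after synchronization is a simple way to catch indexing mistakes.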
Several examples, starting with ArrayViewOps, show what should not be done in AMP code. The compiler will not allow you to cast away const, for example, but in some cases you can still run into trouble with pointers.
C++
void AmpExamples::PointerRestrictions()
{
int p[] = { 1, 2, 3, 4, 5 };
const int size = ARRAYSIZE(p);
array_view<int, 1> a(size, p);
parallel_for_each(a.extent, [=](index<1> idx) restrict(amp)
{
struct A
{
bool flag;
int data;
};
A a;
bool* p1 = &(a.flag);
//bool* p2 = p1++; // error C3599: '++' : cannot perform pointer arithmetic on pointer to bool in amp restricted code
//bool b = *(p2);
// Compiler Error:
// base class, data member or array element must be at least 4-byte aligned for amp-restricted function
//
/*
struct B
{
bool flag;
bool data;
};
B b;
*/
// Solution to B is to align the member
struct C
{
bool flag;
__declspec(align(4)) bool data; // Note that the alignment is only applied to the field
};
C c;
// To align a structure
typedef __declspec(align(4))
struct D
{
bool flag;
}
ALIGNED_BOOL;
ALIGNED_BOOL d[10]; // Now we can create an array of aligned fields
}
);
}
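The layout rules behind these errors can be inspected on the host. The sketch below uses the standard alignas specifier as an analogue of __declspec(align(4)); that substitution is my assumption for portability, and the struct names are hypothetical. The offsets hold on mainstream compilers where sizeof(bool) is 1.

```cpp
#include <cassert>
#include <cstddef>

// Two bools packed next to each other: the second member sits at offset 1,
// which is not 4-byte aligned (the situation of struct B above).
struct PackedFlags
{
    bool flag;
    bool data;
};

// Counterpart of struct C: only the second member is aligned.
struct FieldAligned
{
    bool flag;
    alignas(4) bool data;
};

// Counterpart of ALIGNED_BOOL: the whole struct is aligned, so every
// element of an array of it starts on a 4-byte boundary.
struct alignas(4) AlignedBool
{
    bool flag;
};

static_assert(offsetof(FieldAligned, data) % 4 == 0, "data must start on a 4-byte boundary");
static_assert(alignof(AlignedBool) == 4, "elements must be 4-byte aligned");
static_assert(sizeof(AlignedBool) % 4 == 0, "array stride must be a multiple of 4");
```

The static_assert lines document the property the AMP compiler checks for: each data member and array element begins at an offset that is a multiple of 4.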
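Please note that your kernel code must execute quickly, because Windows resets a GPU that stays busy for too long. On most systems the limit is 2 seconds; on my machine I read it as 7 seconds from the DCI Timeout value in the PowerShell session below. Be aware that the delay used by Timeout Detection and Recovery is documented as the TdrDelay value under the same GraphicsDrivers key, which defaults to 2 seconds when it is not set, so DCI Timeout may not be the value that actually applies.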
C++
/*
PS HKLM:\> dir
Hive: HKEY_LOCAL_MACHINE
Name Property
---- --------
BCD00000000
HARDWARE
SAM
dir : Requested registry access is not allowed.
At line:1 char:1
+ dir
+ ~~~
+ CategoryInfo : PermissionDenied: (HKEY_LOCAL_MACHINE\SECURITY:String) [Get-ChildItem], SecurityException
+ FullyQualifiedErrorId : System.Security.SecurityException,Microsoft.PowerShell.Commands.GetChildItemCommand
SOFTWARE
SYSTEM
PS HKLM:\> cd System\CurrentControlSet\Control\GraphicsDrivers
PS HKLM:\System\CurrentControlSet\Control\GraphicsDrivers> dir
Hive: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
Name Property
---- --------
AdditionalModeLists
Configuration
Connectivity
DCI Timeout : 7
UseNewKey
*/
This behavior is controlled by Windows. If you have a GPU that is not dedicated to rendering, you can disable Timeout Detection and Recovery (TDR) by creating the accelerator_view from a DirectX 11 ID3D11Device. To do that, pass the D3D11_CREATE_DEVICE_DISABLE_GPU_TIMEOUT flag to the D3D11CreateDevice function.
C++
void AmpExamples::DisableTDR()
{
cout << "\nDisable TDR.\n";
unsigned int flags = D3D11_CREATE_DEVICE_DISABLE_GPU_TIMEOUT; // DISABLE TDR!!!
#if _DEBUG
flags |= D3D11_CREATE_DEVICE_DEBUG;
#endif
ID3D11Device* device = nullptr;
ID3D11DeviceContext* context = nullptr;
D3D_DRIVER_TYPE driverTypes[] =
{
D3D_DRIVER_TYPE_HARDWARE,
D3D_DRIVER_TYPE_WARP,
D3D_DRIVER_TYPE_REFERENCE
};
D3D_FEATURE_LEVEL featureLevels[] =
{
D3D_FEATURE_LEVEL_11_0,
D3D_FEATURE_LEVEL_10_1,
D3D_FEATURE_LEVEL_10_0
};
D3D_FEATURE_LEVEL feature;
// http://msdn.microsoft.com/en-us/library/windows/desktop/ff476877(v=vs.85).aspx
//IDXGIAdapter* adapter = nullptr;
HRESULT hr = S_OK;
for (UINT i = 0; i < ARRAYSIZE(driverTypes); ++i)
{
D3D_DRIVER_TYPE driverType = driverTypes[i];
hr = D3D11CreateDevice(
nullptr, // dxgi adapter
driverType, // driver type
nullptr, // software rasterizer
flags, // flags
featureLevels, // feature levels
ARRAYSIZE(featureLevels), // number of feature levels
D3D11_SDK_VERSION, // sdk version
&device,
&feature,
&context
);
if (SUCCEEDED(hr))
{
break;
}
}
if (FAILED(hr) ||
((feature != D3D_FEATURE_LEVEL_11_1) && (feature != D3D_FEATURE_LEVEL_11_0))
)
{
cerr << " Failed to create Direct3D 11 device." << endl;
return;
}
// This accelerator_view will not time-out
accelerator_view av = create_accelerator_view(device);
wcout << " -- Description: " << av.accelerator.description << endl;
wcout << " Version: " << (av.version >> 16) << '.' << (av.version & 0xFFFF) << endl;
wcout << " Is debug: " << ((av.is_debug) ? "true" : "false") << endl;
wcout << " Queuing mode: " << ((av.queuing_mode == queuing_mode::queuing_mode_automatic) ? "automatic" : "immediate") << endl;
wcout << endl;
}
I have already mentioned that you can output information from an AMP kernel to the debug output window. You do this with the direct3d_printf function. An example follows:
C++
// void direct3d_printf(const char *_Format_string, ...) restrict(amp)
// (Parameters) _Format_string: The format string; ...: An optional list of parameters of variable count.
// This function accepts a format string and an optional list of parameters of variable count.
// It prints formatted output from a kernel to the Visual Studio output window.
//
// D3D11 MESSAGE: Reference Rasterizer: view[0,0] = 2 [ SHADER MESSAGE #2097410: SHADER_MESSAGE]
// D3D11 MESSAGE: Reference Rasterizer: view[0,1] = 3 [ SHADER MESSAGE #2097410: SHADER_MESSAGE]
// D3D11 MESSAGE: Reference Rasterizer: view[1,0] = 4 [ SHADER MESSAGE #2097410: SHADER_MESSAGE]
// D3D11 MESSAGE: Reference Rasterizer: view[1,1] = 5 [ SHADER MESSAGE #2097410: SHADER_MESSAGE]
//
//
// void direct3d_errorf(char *_Format_string, ...) restrict(amp)
// This function has identical characteristics and usage to the direct3d_printf function,
// in that a message is printed to the output window. Additionally the C++ AMP runtime
// will raise a runtime_exception on the host with the same error message passed to the direct3d_errorf call.
//
// D3D11 ERROR: Reference Rasterizer: errorf: av[idx] = 2 [ SHADER ERROR #2097411: SHADER_ERROR]
// D3D11 ERROR: Reference Rasterizer: errorf: av[idx] = 3 [ SHADER ERROR #2097411: SHADER_ERROR]
// D3D11 ERROR: Reference Rasterizer: errorf: av[idx] = 4 [ SHADER ERROR #2097411: SHADER_ERROR]
// D3D11 ERROR: Reference Rasterizer: errorf: av[idx] = 5 [ SHADER ERROR #2097411: SHADER_ERROR]
//
//
// void direct3d_abort() restrict(amp)
// This function aborts the execution of a kernel. When the abort is detected by the runtime,
// it raises a runtime_exception on the host with the error message, “Reference Rasterizer: Shader abort instruction hit”.
//
// D3D11 ERROR: Reference Rasterizer: Shader abort instruction hit at IP 462 [ EXECUTION ERROR #2097409: SHADER_ABORT]
// D3D11 ERROR: Reference Rasterizer: Shader abort instruction hit at IP 462 [ EXECUTION ERROR #2097409: SHADER_ABORT]
// D3D11 ERROR: Reference Rasterizer: Shader abort instruction hit at IP 462 [ EXECUTION ERROR #2097409: SHADER_ABORT]
// D3D11 ERROR: Reference Rasterizer: Shader abort instruction hit at IP 462 [ EXECUTION ERROR #2097409: SHADER_ABORT]
//
void AmpExamples::DebugHelpers()
{
cout << "\nDebugging Support in AMP.\n";
const int width = 2;
const int height = 2;
const int size = width * height;
vector<int> data(size);
int i = 0;
generate(data.begin(), data.end(), [&i]{ return ++i; });
// In DEBUG mode with GPU only selected, av will be ref!
// In other build configurations, these helpers will be replaced with NOOP.
//accelerator_view av = accelerator().create_view();
accelerator_view av = accelerator(accelerator::direct3d_ref).default_view;
wcout << L"\ndevice: " << av.get_accelerator().description << endl;
concurrency::extent<2> ext(width, height);
array_view<int, 2> view(ext, data);
// printf
parallel_for_each(av, ext, [=](index<2> idx) restrict(amp) {
view[idx]++;
direct3d_printf(" view[%d,%d] = %d\n", idx[0], idx[1], view[idx]); // Limit is 7 parameters, will throw exception in RELEASE
});
view.synchronize();
// errorf
try
{
parallel_for_each(av, ext, [=](index<2> idx) restrict(amp) {
direct3d_errorf(" errorf: av[idx] = %d\n", view[idx]);
view[idx] *= 10;
});
view.synchronize();
}
catch(runtime_exception& e)
{
cout << "\nerrorf caused runtime exception: " << e.what() << endl;
}
// abort
try
{
parallel_for_each(av, ext, [=](index<2> idx) restrict(amp) {
view[idx] *= 10;
direct3d_abort(); // This will terminate the program when debugging in GPU only mode
});
view.synchronize();
}
catch(runtime_exception& e)
{
cout << "\naborted: " << e.what() << endl;
}
}
Please note that the direct3d_printf calls must be commented out or removed once you have finished debugging: when I ran my solution in Release mode, they caused the process to crash.
We've made it to measuring AMP performance. To get proper results, either use array, or call wait() on the accelerator_view when using array_view. Still, I got some funny numbers when I ran my code. One important thing to understand is that parallel_for_each is asynchronous, although it appears synchronous to the host. Roughly speaking, once you invoke it, execution is scheduled on the device and control returns to the host immediately; the results are guaranteed to be available only after kernel execution completes, so accessing the data from the host waits for that. I skipped the topic of synchronization between the device and the host, but you will find well-commented examples in the solution.
C++
/*
Measuring Performance 2. (In release mode)
device: NVIDIA Quadro 5000M
0 executed in 66991us ( 66ms) : copy-in 0us, kernel 66033us, copy 2 958us
1 executed in 19012us ( 19ms) : copy-in 994us, kernel 17023us, copy 2 994us
2 executed in 30010us ( 30ms) : copy-in 0us, kernel 29039us, copy 2 970us
3 executed in 48035us ( 48ms) : copy-in 1003us, kernel 45059us, copy 2 1972us
4 executed in 68055us ( 68ms) : copy-in 996us, kernel 66019us, copy 2 1039us
5 executed in 96026us ( 96ms) : copy-in 999us, kernel 93051us, copy 2 1975us
6 executed in 140032us (140ms) : copy-in 2008us, kernel 136021us, copy 2 2002us
7 executed in 192019us (192ms) : copy-in 1997us, kernel 187037us, copy 2 2984us
8 executed in 230028us (230ms) : copy-in 2996us, kernel 224043us, copy 2 2989us
9 executed in 281015us (281ms) : copy-in 3000us, kernel 275015us, copy 2 2998us
Measuring Performance 2. (One more run)
device: NVIDIA Quadro 5000M
0 executed in 10997us ( 10ms) : copy-in 975us, kernel 9048us, copy 2 973us
1 executed in 18967us ( 18ms) : copy-in 1003us, kernel 16987us, copy 2 976us
2 executed in 31041us ( 31ms) : copy-in 999us, kernel 29009us, copy 2 1033us
3 executed in 48008us ( 48ms) : copy-in 0us, kernel 45996us, copy 2 2011us
4 executed in 68004us ( 68ms) : copy-in 0us, kernel 67023us, copy 2 980us
5 executed in 96010us ( 96ms) : copy-in 1000us, kernel 92998us, copy 2 2011us
6 executed in 140985us (140ms) : copy-in 2002us, kernel 137009us, copy 2 1973us
7 executed in 194028us (194ms) : copy-in 2003us, kernel 190041us, copy 2 1984us
8 executed in 228027us (228ms) : copy-in 3025us, kernel 223019us, copy 2 1983us
9 executed in 280027us (280ms) : copy-in 2996us, kernel 274053us, copy 2 2977us
Measuring Performance 2. (Ran outside of VS)
device: NVIDIA Quadro 5000M
0 executed in 15565us ( 15ms) : copy-in 0us, kernel 15565us, copy 2 0us
1 executed in 15615us ( 15ms) : copy-in 0us, kernel 15615us, copy 2 0us
2 executed in 31177us ( 31ms) : copy-in 0us, kernel 31177us, copy 2 0us
3 executed in 46790us ( 46ms) : copy-in 0us, kernel 46790us, copy 2 0us
4 executed in 77949us ( 77ms) : copy-in 0us, kernel 62401us, copy 2 15547us
5 executed in 109174us (109ms) : copy-in 0us, kernel 93627us, copy 2 15546us
6 executed in 140438us (140ms) : copy-in 0us, kernel 124847us, copy 2 15591us
7 executed in 187180us (187ms) : copy-in 0us, kernel 187180us, copy 2 0us
8 executed in 234037us (234ms) : copy-in 0us, kernel 218435us, copy 2 15602us
9 executed in 280803us (280ms) : copy-in 0us, kernel 265210us, copy 2 15593us
*/
void AmpExamples::MeasurePerformance2()
{
cout << "\nMeasuring Performance 2.\n";
time_point<system_clock> start = system_clock::now();
time_point<system_clock> stop = system_clock::now();
time_point<system_clock> tmStart = system_clock::now();
accelerator_view device = accelerator().default_view;
wcout << L" device: " << device.accelerator.description << endl << endl;
WarmUp(device);
// 10 samples increasing amount of data by i in each loop
for (int i = 0; i < 10; ++i)
{
// Number of rows and columns for both matrices
const int r1 = 300 + 100 * i;
const int c1 = 500 + 100 * i;
const int r2 = c1;
const int c2 = 400 + 100 * i;
assert(c1 == r2); // columns in m1 == rows in m2
vector<float> va(r1 * c1);
vector<float> vb(r2 * c2);
vector<float> vc(r1 * c2); // resultant matrix
RandomFill(va);
RandomFill(vb);
concurrency::extent<2> ea(r1, c1);
concurrency::extent<2> eb(r2, c2);
concurrency::extent<2> ec(r1, c2);
// Using arrays only to measure performance. If using array_view, we have
// to force synchronization manually because parallel_for_each is async:
// when parallel_for_each returns, the computation is only scheduled on the device.
// To make sure it has executed, call wait() on the accelerator_view.
array<float, 2> a(ea);
array<float, 2> b(eb);
array<float, 2> c(ec);
// Copy underlying data to the device
tmStart = system_clock::now();
start = system_clock::now();
copy(va.begin(), a);
copy(vb.begin(), b);
stop = system_clock::now();
long long tmCopy1 = duration_cast<microseconds>(stop - start).count();
// Run kernel
start = system_clock::now();
MultiplyMatrices(device, c, a, b);
device.wait(); // Ensure that kernel completed execution!!!
stop = system_clock::now();
long long tmKernel = duration_cast<microseconds>(stop - start).count();
// Copy data back to the host
start = system_clock::now();
copy(c, vc.begin());
stop = system_clock::now();
long long tmCopy2 = duration_cast<microseconds>(stop - start).count();
long long ms = duration_cast<milliseconds>(stop - tmStart).count();
long long us = duration_cast<microseconds>(stop - tmStart).count();
cout << " " << i << " executed in " << us << "us (" << ms << "ms)"
<< " : copy-in " << tmCopy1
<< "us, kernel " << tmKernel
<< "us, copy 2 " << tmCopy2
<< "us" << endl;
}
}
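The measurement pattern used above (warm up once, then time each run including the wait for completion) can be factored into a small helper. The following is a sketch in standard C++ under a few assumptions: time_with_warm_up is a hypothetical name, and it uses std::chrono::steady_clock instead of the system_clock used in the sample, because steady_clock is monotonic and therefore better suited to measuring intervals. My reading of the output above is that the steps of roughly 1 ms, and roughly 15.6 ms outside Visual Studio, likely reflect the resolution of the clock rather than real differences between runs.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs work() once untimed (warm-up), then times `samples` further runs.
// Returns one duration in microseconds per timed run.
std::vector<long long> time_with_warm_up(const std::function<void()>& work,
                                         int samples)
{
    using namespace std::chrono;
    assert(samples >= 0);

    work(); // warm-up: pays one-off costs such as initialization and JIT

    std::vector<long long> results;
    results.reserve(static_cast<std::size_t>(samples));
    for (int i = 0; i < samples; ++i)
    {
        const steady_clock::time_point start = steady_clock::now();
        work(); // for a GPU kernel, work must include the wait for completion
        const steady_clock::time_point stop = steady_clock::now();
        results.push_back(duration_cast<microseconds>(stop - start).count());
    }
    return results;
}
```

For an AMP kernel, the callable passed as work would launch the kernel and then call wait() on the accelerator_view, as MeasurePerformance2 does, so that the timed interval covers the whole execution and not only the scheduling.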
Enjoy!
Source Code Files
AmpExamples - contains 46 functions, each demonstrating a different C++ AMP concept