# General-purpose programming on GPU Asynchronous operations

#### Eugenio Rustico rustico@dmi.unict.it

D.M.I. - Università di Catania

## Updated: May 20, 2011







| Introduction |    |  |  |
|--------------|----|--|--|
|              |    |  |  |
|              |    |  |  |
| Overvi       | ew |  |  |

- 2 Asynchronous operations
- 3 Page-locked memory
- 4 Streams
- 5 Stream behavior
- 6 Mapped memory



| Introduction |  |  |  |  |
|--------------|--|--|--|--|
|              |  |  |  |  |
|              |  |  |  |  |



| Introduction |  |  |  |  |
|--------------|--|--|--|--|
|              |  |  |  |  |
|              |  |  |  |  |

Last step for 100% of GPU-based programs is downloading some data back to the host.



Eugenio Rustico rustico@dmi.unict.it General-purpose programming on GPU Università di Catania

| Introduction |  |  |  |  |
|--------------|--|--|--|--|
|              |  |  |  |  |
|              |  |  |  |  |

Last step for 100% of GPU-based programs is downloading some data back to the host.

Is it in general negligible? (hint: think of a video streaming application...)

| Introduction |  |  |  |  |
|--------------|--|--|--|--|
|              |  |  |  |  |
|              |  |  |  |  |
|              |  |  |  |  |

Last step for 100% of GPU-based programs is downloading some data back to the host.

Is it in general negligible? (hint: think of a video streaming application...)

Moreover: in multi-GPU applications with minimum problem interdependence, say *n* the number of GPUs, a typical need is  $4 \cdot (n-1)$  transfer requests per frame (why?)

| Introduction |  |  |  |
|--------------|--|--|--|
|              |  |  |  |
|              |  |  |  |

There are two techniques that help us partically covering these latencies:



Eugenio Rustico rustico@dmi.unict.it General-purpose programming on GPU Università di Catania

| Introduction |  |  |  |  |
|--------------|--|--|--|--|
|              |  |  |  |  |
|              |  |  |  |  |

There are two techniques that help us partically covering these latencies:

Asynchronous memory operations



Università di Catania

| Introduction |  |  |  |  |
|--------------|--|--|--|--|
|              |  |  |  |  |
|              |  |  |  |  |

There are two techniques that help us partically covering these latencies:

- Asynchronous memory operations
- Mapped memory

| troduction Asynchronous operations |  |  |  |
|------------------------------------|--|--|--|
|                                    |  |  |  |

- < ロ > < 団 > < 臣 > < 臣 > 三 - の <

Università di Catania

Eugenio Rustico rustico@dmi.unict.it

General-purpose programming on GPU

 Any CUDA device: kernel execution and host code (unless enviroment variable CUDA\_LAUNCH\_BLOCKING is set)



Università di Catania

Eugenio Rustico rustico@dmi.unict.it

General-purpose programming on GPU

|  |  | ior |  |
|--|--|-----|--|
|  |  |     |  |
|  |  |     |  |

- Any CUDA device: kernel execution and host code (unless enviroment variable CUDA\_LAUNCH\_BLOCKING is set)
- Any CUDA device: intra-device memcpys and host code



| ln' |  |  |  |  |  |
|-----|--|--|--|--|--|
|     |  |  |  |  |  |
|     |  |  |  |  |  |

- Any CUDA device: kernel execution and host code (unless enviroment variable CUDA\_LAUNCH\_BLOCKING is set)
- Any CUDA device: intra-device memcpys and host code
- Any CUDA device: host↔device memcpys of less than 64Kb



Università di Catania

Eugenio Rustico rustico@dmi.unict.it

General-purpose programming on GPU

Some pairs of operations can be performed *simultaneously*:

- Any CUDA device: kernel execution and host code (unless enviroment variable CUDA\_LAUNCH\_BLOCKING is set)
- Any CUDA device: intra-device memcpys and host code
- Any CUDA device: host $\leftrightarrow$ device memcpys of less than 64Kb
- Capability  $\geq$  1.1: kernel execution and host $\leftrightarrow$ device memcpy

Some pairs of operations can be performed *simultaneously*:

- Any CUDA device: kernel execution and host code (unless enviroment variable CUDA\_LAUNCH\_BLOCKING is set)
- Any CUDA device: intra-device memcpys and host code
- Any CUDA device: host $\leftrightarrow$ device memcpys of less than 64Kb
- Capability  $\geq$  1.1: kernel execution and host $\leftrightarrow$ device memcpy
- Some devices with capability  $\geq$  2.0: concurrent execution of two different kernels

Some pairs of operations can be performed *simultaneously*:

- Any CUDA device: kernel execution and host code (unless enviroment variable CUDA\_LAUNCH\_BLOCKING is set)
- Any CUDA device: intra-device memcpys and host code
- Any CUDA device: host $\leftrightarrow$ device memcpys of less than 64Kb
- Capability  $\geq$  1.1: kernel execution and host $\leftrightarrow$ device memcpy
- Some devices with capability  $\geq$  2.0: concurrent execution of two different kernels
- Some devices with capability ≥ 2.0: concurrent execution of two memcpys in different directions (PCI is full duplex!)

Some pairs of operations can be performed *simultaneously*:

- Any CUDA device: kernel execution and host code (unless enviroment variable CUDA\_LAUNCH\_BLOCKING is set)
- Any CUDA device: intra-device memcpys and host code
- Any CUDA device: host $\leftrightarrow$ device memcpys of less than 64Kb
- Capability  $\geq$  1.1: kernel execution and host $\leftrightarrow$ device memcpy
- Some devices with capability  $\geq$  2.0: concurrent execution of two different kernels
- Some devices with capability ≥ 2.0: concurrent execution of two memcpys in different directions (PCI is full duplex!)

See deviceQuery output to check yours (at runtime, check specific booleans in the deviceProperties).

Some pairs of operations can be performed *simultaneously*:

- Any CUDA device: kernel execution and host code (unless enviroment variable CUDA\_LAUNCH\_BLOCKING is set)
- Any CUDA device: intra-device memcpys and host code
- $\blacksquare$  Any CUDA device: host  $\leftrightarrow$  device memcpys of less than 64Kb
- $\blacksquare$  Capability  $\geq$  1.1: kernel execution and host  $\leftrightarrow device$  memcpy
- $\blacksquare$  Some devices with capability  $\geq$  2.0: concurrent execution of two different kernels
- Some devices with capability ≥ 2.0: concurrent execution of two memcpys in different directions (PCI is full duplex!)

See deviceQuery output to check yours (at runtime, check specific booleans in the deviceProperties).

Programming guide states that when an application is run via a CUDA debugger or profiler all launches are synchronous, but...

|  | Asynchronous operations |  |  |
|--|-------------------------|--|--|
|  |                         |  |  |
|  |                         |  |  |
|  |                         |  |  |

Concurrent kernel and  $\mathsf{host} {\leftrightarrow} \mathsf{device}$  memory transfer is particularly interesting:





Concurrent kernel and host  $\leftrightarrow \mbox{device}$  memory transfer is particularly interesting:

Perfect to cover most transfer latencies



Università di Catania

|  | Asynchronous operations |  |  |
|--|-------------------------|--|--|
|  |                         |  |  |
|  |                         |  |  |
|  |                         |  |  |
|  |                         |  |  |

Concurrent kernel and host  $\leftrightarrow$  device memory transfer is particularly interesting:

- Perfect to cover most transfer latencies
- Available since capability 1.1





Concurrent kernel and host  $\leftrightarrow$  device memory transfer is particularly interesting:

- Perfect to cover most transfer latencies
- Available since capability 1.1
- Not too complicated APIs



Concurrent kernel and host  $\leftrightarrow$  device memory transfer is particularly interesting:

- Perfect to cover most transfer latencies
- Available since capability 1.1
- Not too complicated APIs

There are three requirements for async memcpys: **page-locked** host memory, use of **streams** and -async calls.

|  | Page-locked memory |  |  |
|--|--------------------|--|--|
|  |                    |  |  |
|  |                    |  |  |
|  |                    |  |  |

Virtual allocable memory on host is bigger than physical memory (RAM). This is possible through a *paging* mechanism that swaps pages from RAM to disk and vice-versa.



|  | Page-locked memory |  |  |
|--|--------------------|--|--|
|  |                    |  |  |
|  |                    |  |  |
|  |                    |  |  |

Virtual allocable memory on host is bigger than physical memory (RAM). This is possible through a *paging* mechanism that swaps pages from RAM to disk and vice-versa.

Asynchronous memcpys require host memory to be page-locked: even if calling thread is paused, the host memory area subject of transfer should not be paged.

|  | Page-locked memory |  |  |
|--|--------------------|--|--|
|  |                    |  |  |
|  |                    |  |  |
|  |                    |  |  |

Virtual allocable memory on host is bigger than physical memory (RAM). This is possible through a *paging* mechanism that swaps pages from RAM to disk and vice-versa.

Asynchronous memcpys require host memory to be page-locked: even if calling thread is paused, the host memory area subject of transfer should not be paged.

Allocating too much page-locked memory may decrease overall system performance; it is critical to allocate less space than physical memory.

|  | Page-locked memory |  |  |
|--|--------------------|--|--|
|  |                    |  |  |
|  |                    |  |  |
|  |                    |  |  |

CUDA offers two simple methods to easily allocate page-locked memory, and one to free it:

```
cudaError_t cudaMallocHost(void **ptr,
    size_t size [, unsigned int flags]);
cudaError_t cudaHostAlloc(void **pHost,
    size_t size, unsigned int flags);
cudaError_t cudaFreeHost(void *ptr);
```

|  | Page-locked memory |  |  |
|--|--------------------|--|--|
|  |                    |  |  |
|  |                    |  |  |
|  |                    |  |  |

CUDA offers two simple methods to easily allocate page-locked memory, and one to free it:

```
cudaError_t cudaMallocHost(void **ptr,
    size_t size [, unsigned int flags]);
cudaError_t cudaHostAlloc(void **pHost,
    size_t size, unsigned int flags);
cudaError_t cudaFreeHost(void *ptr);
```

cudaMallocHost() is a special case of cudaHostAlloc() with default parameters: in reference manual, there is no mention to flags for cudaMallocHost() (official reason: C/C++ interoperability?).

|  | Page-locked memory |  |  |
|--|--------------------|--|--|
|  |                    |  |  |
|  |                    |  |  |
|  |                    |  |  |
|  |                    |  |  |

Flags:

cudaHostAllocWriteCombined : disable caching of mapped memory

|              |                             | Page-locked memory     |           |            |       |
|--------------|-----------------------------|------------------------|-----------|------------|-------|
|              |                             |                        |           |            |       |
|              |                             |                        |           |            |       |
|              |                             |                        |           |            |       |
|              |                             |                        |           |            |       |
| Exa          | imple:                      |                        |           |            |       |
|              | F -                         |                        |           |            |       |
| 1 #d         | <mark>efine</mark> DIM (102 | 4*1024)                |           |            |       |
| 2 <b>f</b> l | oat *harray, *              | <pre>harray_map;</pre> |           |            |       |
|              | r = cudaMalloc              |                        | . size    | of(float)* | DIM); |
|              | r = cudaHostAl              |                        | -         |            |       |
| 5            | sizeof(float                | <b>v</b> –             |           |            |       |
|              | cudaHostAllo                |                        | and a H a | -+ 4 7 7 M |       |
| 6            | CUDAHOSTAIIO                | cPortable              | сидано    | stAllocMap | ped); |
|              |                             |                        |           |            |       |
|              |                             |                        |           |            |       |
|              |                             |                        |           |            |       |
| 7 er         | ror = cudaFree              | Host(harray)           | );        |            |       |
| 8 er         | ror = cudaFree              | Host (harray           | man):     |            |       |

error = cudaFreeHost(harray\_map); В

◆□▶ ◆□▶ ◆□▶ ◆□▶ □ のへで

Eugenio Rustico rustico@dmi.unict.it

General-purpose programming on GPU

|         |    | Streams |  |
|---------|----|---------|--|
|         |    |         |  |
|         |    |         |  |
| Overvie | ew |         |  |

- 2 Asynchronous operations
- 3 Page-locked memory

### 4 Streams

5 Stream behavior

#### 6 Mapped memory

|  |  | Streams |  |
|--|--|---------|--|
|  |  |         |  |
|  |  |         |  |
|  |  |         |  |

Streams are ideal structures used to communicate to the runtime the depencencies/parallelisms of memory operations and kernels.



Eugenio Rustico rustico@dmi.unict.it General-purpose programming on GPU Università di Catania

|  | Streams |  |
|--|---------|--|
|  |         |  |
|  |         |  |
|  |         |  |
|  |         |  |

Streams are ideal structures used to communicate to the runtime the depencencies/parallelisms of memory operations and kernels.

A stream is a sequence of operations to be executed in order; to achieve concurrent kernel/memcpy execution, one has to use at least two streams.



Università di Catania

|  | Streams |  |
|--|---------|--|
|  |         |  |
|  |         |  |
|  |         |  |
|  |         |  |

Streams are ideal structures used to communicate to the runtime the depencencies/parallelisms of memory operations and kernels.

A stream is a sequence of operations to be executed in order; to achieve concurrent kernel/memcpy execution, one has to use at least two streams.

When stream is not specified to a kernel or memcpy operation, the default one (0) is used and operations are not concurrent.

|  | Streams |  |
|--|---------|--|
|  |         |  |
|  |         |  |
|  |         |  |

It is possible to enqueue in a stream, other than kernel launches and memory transfers, also CUDA events; they are used as separators for timing and inter-stream dependency purposes.



|  | Streams |  |
|--|---------|--|
|  |         |  |
|  |         |  |
|  |         |  |

It is possible to enqueue in a stream, other than kernel launches and memory transfers, also CUDA events; they are used as separators for timing and inter-stream dependency purposes.

It not possible to create a real dependency graph, but with an appropriate usage of events it is possible to ensure quite complicated dependencies.





|               |                |                      |                    |                             | Streams                     |         |                       |         |
|---------------|----------------|----------------------|--------------------|-----------------------------|-----------------------------|---------|-----------------------|---------|
| Stre          | am 1           |                      |                    |                             | _                           |         | stripe                | 1       |
|               | stripe 1<br>→D | work on<br>stripe 1  |                    | copy stripe 1<br>result D→H | event 1                     |         | stripe                |         |
| Stre          | am 2           |                      |                    |                             |                             |         | stripe                | 3       |
|               |                | copy stripe 2<br>H→D |                    | vork on<br>stripe 2         | copy stripe 2<br>result D→H | event 2 |                       |         |
| Stre          | am 3           |                      |                    |                             |                             |         |                       |         |
|               |                |                      | copy stripe<br>H→D | 3                           | work on<br>stripe 3         |         | v stripe 3<br>ult D→H | event 3 |
| •••• <u>t</u> | ime            |                      |                    |                             | $\rightarrow$               |         |                       |         |

Eugenio Rustico rustico@dmi.unict.it

General-purpose programming on GPU



One kernel at a time is executing

Università di Catania

<ロト < 団ト < 団ト < 団ト

|      |                |                      |             |                             | Streams                     |         |          |         |
|------|----------------|----------------------|-------------|-----------------------------|-----------------------------|---------|----------|---------|
|      |                |                      |             |                             |                             |         |          |         |
| Stre | am 1           |                      |             |                             |                             |         |          |         |
|      |                | work on              |             |                             |                             |         | stripe   | e 1     |
|      | stripe 1<br>→D | stripe 1             |             | copy stripe 1<br>result D→H | event 1                     |         | stripe   | e 2     |
| Stre | am 2           |                      |             |                             |                             |         | stripe   | e 3     |
|      |                |                      |             | wards an                    |                             |         |          |         |
|      |                | copy stripe 2<br>H→D |             | vork on<br>stripe 2         | copy stripe 2<br>result D→H | event 2 |          |         |
| Stre | am 3           |                      |             |                             |                             |         |          |         |
|      |                |                      | copy stripe | 3                           | work on                     | con     | stripe 3 |         |
|      |                |                      | H→D         |                             | stripe 3                    |         | ult D→H  | event 3 |
| t    | ime            |                      |             |                             |                             |         |          |         |
|      |                |                      |             |                             |                             |         |          |         |

- One kernel at a time is executing
- One memcpy at a time is executing

|        |          |               |             |               | Streams       |         |          |         |
|--------|----------|---------------|-------------|---------------|---------------|---------|----------|---------|
|        |          |               |             |               |               |         |          |         |
| Ctro   | am 1     |               |             |               |               |         |          |         |
| Stre   | amı      |               |             |               |               |         | stripe   | 1       |
| copy s | stripe 1 | work on       |             | copy stripe 1 | event 1       |         | Stripe   | -       |
| H      | →D       | stripe 1      |             | result D→H    | event 1       |         | stripe   | 2       |
|        |          |               |             |               |               |         |          |         |
| Stre   | am 2     |               |             |               |               |         | stripe   | 3       |
|        |          | copy stripe 2 | V           | work on       | copy stripe 2 |         |          |         |
|        |          | H→D           | 5           | stripe 2      | result D→H    | event 2 |          |         |
|        |          |               |             |               |               |         |          |         |
| Stre   | am 3     |               |             |               |               |         |          |         |
|        |          |               | copy stripe | з             | work on       | con     | stripe 3 |         |
|        |          |               | H→D         | 5             | stripe 3      |         | ult D→H  | event 3 |
|        |          |               |             |               |               |         |          |         |
| ••• t  | ime      |               |             |               |               |         |          |         |
|        |          |               |             |               |               |         |          |         |

- One kernel at a time is executing
- One memcpy at a time is executing
- Kenel and memcpys in different streams execute concurrently!



Kernel total time is the same as it was non concurrent; but download and upload times are partially covered, reducing total transfer time from  $2 \cdot t$  to  $2 \cdot \frac{t}{3}$ . See timeline profiling of SDK samepl simpleStreams  $\mathbf{x} \in \mathbb{R}$ ,  $\mathbf{x} \in \mathbb{R}$ ,  $\mathbf{x} \in \mathbb{R}$ 

To enqueue a kernel launch in a given stream with high level API, pass the stream as the fourth parameter:

```
1 cudaStreamCreate(&mystream);
2 ...
3 my_kernel <<< numBlocks, numThreads,
4 0, mystream>>> ( [args] );
```



Image: Image:

To enqueue a kernel launch in a given stream with high level API, pass the stream as the fourth parameter:

```
1 cudaStreamCreate(&mystream);
2 ...
3 my_kernel <<< numBlocks, numThreads,
4 0, mystream>>> ( [args] );
```

Memory transfer methods are identical but with a stream parameter and -async suffix:

```
cudaError_t cudaMemcpyAsync(
    void *dst,
    const void *src,
    size_t count,
    enum cudaMemcpyKind kind,
    cudaStream_t stream=0);
```

To enqueue a kernel launch in a given stream with high level API, pass the stream as the fourth parameter:

```
1 cudaStreamCreate(&mystream);
2 ...
3 my_kernel <<< numBlocks, numThreads,
4 0, mystream>>> ( [args] );
```

Memory transfer methods are identical but with a stream parameter and -async suffix:

```
cudaError_t cudaMemcpyAsync(
    void *dst,
    const void *src,
    size_t count,
    enum cudaMemcpyKind kind,
    cudaStream_t stream=0);
```

There is a *-async* version of every memcpy method. Note default null stream

|  |  | Streams |  |
|--|--|---------|--|
|  |  |         |  |
|  |  |         |  |

Let's see a typical usage example.

### **Creation:**

```
1 #define NSTREAMS 4
2 cudaStream_t stream[NSTREAMS];
3 for (int i = 0; i < NSTREAMS; ++i)
4 cudaStreamCreate(&stream[i])</pre>
```



Università di Catania

Eugenio Rustico rustico@dmi.unict.it

General-purpose programming on GPU

|  |  | Streams |  |
|--|--|---------|--|
|  |  |         |  |
|  |  |         |  |
|  |  |         |  |

Università di Catania

Let's see a typical usage example.

## **Creation:**

```
1 #define NSTREAMS 4
2 cudaStream_t stream[NSTREAMS];
3 for (int i = 0; i < NSTREAMS; ++i)
4 cudaStreamCreate(&stream[i])</pre>
```

#### **Destruction:**

```
5 for (int i = 0; i < NSTREAMS; ++i)
6 cudaStreamDestroy(stream[i]);</pre>
```

Eugenio Rustico rustico@dmi.unict.it

General-purpose programming on GPU

・ロト ・回ト ・ヨト ・ヨ

# **Enqueueing:**

| 1  | <pre>// use multiple for cycles: prefer breadth first</pre> |
|----|-------------------------------------------------------------|
| 2  | <pre>for (int i = 0; i &lt; NSTREAMS; ++i)</pre>            |
| 3  | cudaMemcpyAsync(inputDevPtr + i * size,                     |
| 4  | hostPtr + i * size, size,                                   |
| 5  | <pre>cudaMemcpyHostToDevice, stream[i]);</pre>              |
| 6  |                                                             |
| 7  | <pre>for (int i = 0; i &lt; NSTREAMS; ++i)</pre>            |
| 8  | MyKernel <<<100, 512, 0, stream[i]>>>                       |
| 9  | (outputDevPtr + i * size,                                   |
| 10 | inputDevPtr + i * size, size);                              |
| 11 |                                                             |
| 12 | <pre>for (int i = 0; i &lt; NSTREAMS; ++i)</pre>            |
| 13 | cudaMemcpyAsync(hostPtr + i * size,                         |
| 14 | outputDevPtr + i * size, size,                              |
| 15 | <pre>cudaMemcpyDeviceToHost, stream[i]);</pre>              |

```
Streams
                                                        Mapped memory
Synchronization:
// Wait for compute device to finish
cudaError_t cudaThreadSynchronize();
// Wait for a stream to complete everything
cudaError_t cudaStreamSynchronize(cudaStream_t stream);
// Waits for an event to complete
cudaError_t cudaEventSynchronize(cudaEvent_t event);
// Makes the given stream to wait for given
// event before any future operation is
// starte (inter-stream synchronization)
cudaError_t cudaStreamWaitEvent(cudaStream_t stream,
    cudaEvent_t event, unsigned int flags);
See the programming guide for more methods (e.g. queries) and implicit
```

synchronization mechanisms

|  |  | Stream behavior |  |
|--|--|-----------------|--|
|  |  |                 |  |
|  |  |                 |  |
|  |  |                 |  |

The GPU scheduler keeps two separate queues for kernels and memory operations. When an operation is in progress, it checks the other queue for possibly concurrent operations.

|  |  | Stream behavior |  |
|--|--|-----------------|--|
|  |  |                 |  |
|  |  |                 |  |
|  |  |                 |  |

The GPU scheduler keeps two separate queues for kernels and memory operations. When an operation is in progress, it checks the other queue for possibly concurrent operations.

No operations are checked but the first in every queue. This means that **operations enqueued with a depth-first policy will be executed serially**.

|  |  | Stream behavior |  |
|--|--|-----------------|--|
|  |  |                 |  |
|  |  |                 |  |

(queue scheme)



Università di Catania

Eugenio Rustico rustico@dmi.unict.it

General-purpose programming on GPU

|  |  | Mapped memory |
|--|--|---------------|
|  |  |               |
|  |  |               |
|  |  |               |

Some GPUs, especially in notebooks, are integrated on the mainboard and their global memory is a part of the system RAM "shared" with the CPU.



|  |  | Mapped memory |
|--|--|---------------|
|  |  |               |
|  |  |               |
|  |  |               |

Some GPUs, especially in notebooks, are integrated on the mainboard and their global memory is a part of the system RAM "shared" with the CPU.

In these special cases, do we really need explicit transfers?



|  |  | Mapped memory |
|--|--|---------------|
|  |  |               |
|  |  |               |
|  |  |               |

If we allocate a host buffer with flag cudaHostAllocMapped the buffer is prepared to be *mapped* to a subrange of the address space of thedevice; then, we can obtain a device pointer, pointing to the same physical address, with cudaHostGetDevicePointer():



Università di Catania

|  |  | Mapped memory |
|--|--|---------------|
|  |  |               |
|  |  |               |
|  |  |               |

If we allocate a host buffer with flag cudaHostAllocMapped the buffer is prepared to be *mapped* to a subrange of the address space of thedevice; then, we can obtain a device pointer, pointing to the same physical address, with cudaHostGetDevicePointer():

```
// flags must be 0
cudaError_t cudaHostGetDevicePointer(
    void **pDevice, void *pHost,
    unsigned int flags);
```

```
Mapped memory
1 // alloc buffer
  cudaHostAlloc((void**)&host_array,
2
       sizeof(float)*SIZE, cudaHostAllocMapped);
3
4
5
   // init data
   for(i=0; i < SIZE; i++)</pre>
6
       host_array[i]=(float)rand();
7
8
   // get mapped pointer (flags must be 0)
9
   cudaHostGetDevicePointer((void**)&device_pointer,
10
       (void*)host_array, 0);
11
12
13
   // launch kernel: direct access to buffer!
   vectorAdd <<<nblocks, nthreads>>>
14
       (device_pointer, SIZE);
15
```

|  |  | Mapped memory |
|--|--|---------------|
|  |  |               |
|  |  |               |
|  |  |               |

The mapped buffer is cached. We can disable caching using the flag cudaHostAllocWriteCombined when allocating.



Eugenio Rustico rustico@dmi.unict.it General-purpose programming on GPU Università di Catania

|  |  | Mapped memory |
|--|--|---------------|
|  |  |               |
|  |  |               |
|  |  |               |
|  |  |               |

The mapped buffer is cached. We can disable caching using the flag cudaHostAllocWriteCombined when allocating.

Standing on the programming guide, reading from a "write combined" buffer may be much faster (up to 40%), but writing on it from host may be expensive. It is recommended only when device reads alot, host writes only once.

|  |  | Mapped memory |
|--|--|---------------|
|  |  |               |
|  |  |               |

## Hands on code: aynchronous operations & timeline profiling



Eugenio Rustico rustico@dmi.unict.it General-purpose programming on GPU Università di Catania