@jefflarkin
Last active August 10, 2023 15:34
OpenACC Unified memory & async clarifications

Background

OpenACC defines data according to whether it resides in discrete or shared memory. For data in discrete memory, specific data operations are specified and implicit data clauses are defined. For data in shared memory, data clauses may be ignored if they are present; as an optimization, an implementation may wish to use them as hints. I have historically thought of these in terms of CUDA Unified/Managed Memory with preferred-location and prefetching hints. A few cases were brought to my attention that are potentially interesting examples of how this thinking may not be sufficient.
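
To make that mental model concrete, here is a rough CUDA-runtime sketch, purely illustrative and not what any OpenACC implementation is required to do, of treating a copyin-style clause on shared (managed) memory as a preferred-location plus prefetch hint rather than as an allocation and copy. All names below are mine.

#include <stddef.h>
#include <cuda_runtime.h>

// Illustrative only: one way a data clause *might* be lowered to hints when
// the data lives in CUDA managed memory.
void hint_copyin(double *a, size_t n, int dev, cudaStream_t stream)
{
  // Advise the driver that the data should preferably live on the device
  // that is about to use it (requires a managed allocation)...
  cudaMemAdvise(a, n * sizeof(double), cudaMemAdviseSetPreferredLocation, dev);
  // ...and start migrating it ahead of the kernel launch on the same stream.
  cudaMemPrefetchAsync(a, n * sizeof(double), dev, stream);
}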

Modifying an allocation during an asynchronous region

I have been made aware of an application that extensively uses the pattern below. A temporary array is allocated locally (in the example below it is an automatic array), and dynamic data lifetimes are used to expose it to the device asynchronously. It is possible that the function returns, deallocating the automatic array, before all operations on that array have completed. Supporting this pattern requires either that memory allocation and deallocation are stream-ordered or that some sort of garbage collection is implemented to clean up the present table lazily after all operations have completed.

subroutine work(A, N)
  integer :: i, N
  real, dimension(N), intent(inout) :: A
  real, dimension(N) :: B
  ! On a discrete memory system, a device copy of B is created here.
  !$acc enter data create(B(:)) async(1)
  !$acc kernels async(1)
  B(:) = 1.0
  !$acc end kernels
  !$acc parallel loop present(A(1:N),B(1:N)) async(1)
  do i=1,N
    A(i) = A(i) + B(i)
  end do
  !$acc exit data delete(B) async(1)
  ! No synchronization here, so B is immediately deallocated on the host
  ! and (presumably) removed from the present table, deallocating it on
  ! the device too, unless the implementation delays the deallocation or
  ! uses stream-ordered memory allocation and freeing.
end subroutine work

In my opinion, this is a non-conforming program, since the lifetime of B may end before the references to B have completed. However, there are clearly ways to make this work on a discrete memory system. What relevant spec text would we refer to here?
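
One such way, sketched below purely as a possible implementation strategy rather than anything a compiler is known to do, is to back the enter data create / exit data delete pair with CUDA's stream-ordered allocator, so the device-side free is ordered after the kernels already enqueued on the same queue even though the host call returns immediately:

#include <stddef.h>
#include <cuda_runtime.h>

// Hypothetical lowering of "enter data create(B) async(1)" ...
// "exit data delete(B) async(1)" onto cudaMallocAsync/cudaFreeAsync
// (CUDA 11.2+). Names are mine; the kernel launches are elided.
void lowered_work(float *dev_A, size_t n, cudaStream_t q1)
{
  float *dev_B;
  // enter data create(B) async(1): allocation is enqueued on the stream.
  cudaMallocAsync((void **)&dev_B, n * sizeof(float), q1);

  // ... launch the two queue-1 kernels here ...

  // exit data delete(B) async(1): the free is also enqueued, so it cannot
  // complete before the kernels that precede it on q1, even though the
  // host call returns immediately.
  cudaFreeAsync(dev_B, q1);
}

Note that this only covers the device allocation; the host automatic array and the present-table entry still go away when the routine returns, which is where the conformance question comes in.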

Async with live stack variables

In the case of stack variables, an asynchronous compute region may require that some variables live beyond the end of the subroutine unless a wait is used before the end of the routine. Similar to above, though, if a variable is copied into discrete memory the implementation could keep that copy alive until the compute region that uses the variable completes. In this case, ignoring data clauses (or using them only for prefetching) is insufficient to ensure correct execution. For scalars and other small variables, firstprivate would expose them to the device for the lifetime of the region, but for larger variables this might not be a good option.

void do_stuff_async(double *input, int N)
{
  // Assume filter is too large to use firstprivate
  double filter[3] = { -1, 0, 1 };
  #pragma acc parallel loop copyin(filter[0:3]) copy(input[0:N]) async
  for ( int i = 0; i < N; i++ )
  {
    // apply filter
  }
  // no synchronization
} // filter no longer exists and its stack address may be reused

The above probably qualifies as a bad idea, but it is often non-trivial to recognize all stack usage. The compiler might not even see this case if the variable was put on the stack by a calling function; that would not trigger an issue at this point, but it would if that routine returned before this asynchronous region completed. Assuming the data is in shared memory, the defined behavior is to take no data actions, even if an explicit data clause exists, but placing the data in discrete memory that is deleted only when the region is done would make it possible to run this code.
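
For reference, a version of the example that should be well defined in either memory mode simply waits on the queue before the stack frame that owns filter can be torn down (the function name below is mine):

void do_stuff_async_safe(double *input, int N)
{
  double filter[3] = { -1, 0, 1 };
  #pragma acc parallel loop copyin(filter[0:3]) copy(input[0:N]) async
  for ( int i = 0; i < N; i++ )
  {
    // apply filter
  }
  // The wait keeps filter (and the async copy back of input) alive until
  // the region completes, at the cost of losing the asynchrony past this
  // routine, which is exactly the tension described above.
  #pragma acc wait
}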

@jefflarkin (author) commented:

2.6.3, lines 1329-1330: "Data in shared memory is accessible from the current device as well as to the local thread. Such data is available to the accelerator for the lifetime of the variable."

"If the variable is defined as global or file or function static, it must appear in a declare directive." - Is this true for shared memory?

Lines 359-362: "Programmers need to be very careful that the program uses appropriate synchronization to ensure that an assignment or modification by a thread on any device to data in shared memory is complete and available before that data is used by another thread on the same or another device."

@jdenny-ornl commented Aug 10, 2023:

2.6.3, lines 1329-1330: "Data in shared memory is accessible from the current device as well as to the local thread. Such data is available to the accelerator for the lifetime of the variable."

"If the variable is defined as global or file or function static, it must appear in a declare directive." - Is this true for shared memory?

Some more text later in that paragraph:

"A data lifetime is the duration from when the data is first made available to the accelerator until it becomes unavailable."

and:

"For data not in shared memory, the data lifetime begins when it is made present and ends when it is no longer present."

The definition of "present data" from the glossary:

"data for which the sum of the structured and dynamic reference counters is greater than zero in a single device memory section".

As we discussed today, reference counting is synchronous with the host. Thus, in the first example above, the ref count of B becomes zero at the acc exit data, so it is no longer present after that, so its data lifetime ends, so it is not available to the accelerator, so the kernel accessing B is invalid because it might be executed after this point due to its async clause.

In other words, that program is invalid even for discrete memory. I'm just not sure any implementation will detect that violation.

My understanding of what the spec says here keeps changing. Am I still misunderstanding? Does the spec need to be changed?
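
To spell that out, under this reading the wait has to come before the exit data so that B is still present whenever the queue-1 work actually executes. A minimal C paraphrase of the first example (the VLA stand-in and the placement of the wait are mine):

void work(float *A, int N)
{
  float B[N]; // stands in for the Fortran automatic array
  // A is assumed to already be present from an enclosing data region.
  #pragma acc enter data create(B[0:N]) async(1)
  #pragma acc parallel loop present(B[0:N]) async(1)
  for (int i = 0; i < N; i++) B[i] = 1.0f;
  #pragma acc parallel loop present(A[0:N], B[0:N]) async(1)
  for (int i = 0; i < N; i++) A[i] += B[i];
  #pragma acc wait(1)                  // complete all queue-1 work first...
  #pragma acc exit data delete(B[0:N]) // ...so the reference count drops only now
}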
