@jefflarkin
Last active August 10, 2023 15:34
OpenACC Unified memory & async clarifications

Background

OpenACC defines data according to whether it resides in discrete or shared memory. For data in discrete memory, specific data operations are specified and implicit data clauses are defined. For data in shared memory, data clauses may be ignored if they are present; as an optimization, an implementation may wish to use them as hints. I have historically thought of these in terms of CUDA Unified/Managed Memory with preferred-location and prefetching hints. A few cases were brought to my attention that are potentially interesting examples of how this thinking may not be sufficient.
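
To make that mental model concrete, here is a rough CUDA-runtime sketch, purely illustrative and not what any OpenACC implementation is required to do, of treating a copyin-style clause on shared (managed) memory as a preferred-location plus prefetch hint rather than as an allocation and copy. All names below are mine.

#include <stddef.h>
#include <cuda_runtime.h>

// Illustrative only: one way a data clause *might* be lowered to hints when
// the data lives in CUDA managed memory.
void hint_copyin(double *a, size_t n, int dev, cudaStream_t stream)
{
  // Advise the driver that the data should preferably live on the device
  // that is about to use it (requires a managed allocation)...
  cudaMemAdvise(a, n * sizeof(double), cudaMemAdviseSetPreferredLocation, dev);
  // ...and start migrating it ahead of the kernel launch on the same stream.
  cudaMemPrefetchAsync(a, n * sizeof(double), dev, stream);
}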

Modifying an allocation during an asynchronous region

I have been made aware of an application that extensively uses the pattern below. A temporary array is allocated locally (in the example below it is an automatic array), and dynamic data lifetimes are used to expose it to the device asynchronously. It is possible that the function returns, deallocating the automatic array, before all operations on that array have completed. Supporting this pattern requires either that memory allocation and deallocation are stream-ordered or that some sort of garbage collection is implemented to clean up the present table lazily after all operations have completed.

subroutine work(A, N)
  integer :: i, N
  real, dimension(N), intent(inout) :: A
  real, dimension(N) :: B
  ! On a discrete memory system, a device copy of B is created here.
  !$acc enter data create(B(:)) async(1)
  !$acc kernels async(1)
  B(:) = 1.0
  !$acc end kernels
  !$acc parallel loop present(A(1:N),B(1:N)) async(1)
  do i=1,N
    A(i) = A(i) + B(i)
  end do
  !$acc exit data delete(B) async(1)
  ! No synchronization here, so B is immediately deallocated on the host
  ! and (presumably) removed from the present table, deallocating it on
  ! the device too, unless the implementation delays the deallocation or
  ! uses stream-ordered memory allocation and freeing.
end subroutine work

In my opinion, this is a non-conforming program, since the lifetime of B may end before the references to B have completed. However, there are clearly ways to make this work on a discrete memory system. What relevant spec text would we refer to here?
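
One such way, sketched below purely as a possible implementation strategy rather than anything a compiler is known to do, is to back the enter data create / exit data delete pair with CUDA's stream-ordered allocator, so the device-side free is ordered after the kernels already enqueued on the same queue even though the host call returns immediately:

#include <stddef.h>
#include <cuda_runtime.h>

// Hypothetical lowering of "enter data create(B) async(1)" ...
// "exit data delete(B) async(1)" onto cudaMallocAsync/cudaFreeAsync
// (CUDA 11.2+). Names are mine; the kernel launches are elided.
void lowered_work(float *dev_A, size_t n, cudaStream_t q1)
{
  float *dev_B;
  // enter data create(B) async(1): allocation is enqueued on the stream.
  cudaMallocAsync((void **)&dev_B, n * sizeof(float), q1);

  // ... launch the two queue-1 kernels here ...

  // exit data delete(B) async(1): the free is also enqueued, so it cannot
  // complete before the kernels that precede it on q1, even though the
  // host call returns immediately.
  cudaFreeAsync(dev_B, q1);
}

Note that this only covers the device allocation; the host automatic array and the present-table entry still go away when the routine returns, which is where the conformance question comes in.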

Async with live stack variables

In the case of stack variables, an asynchronous compute region may require that some variables live beyond the end of the subroutine unless a wait is used before the end of the routine. Similar to above, though, if a variable is copied into discrete memory the implementation could keep that copy alive until the compute region that uses the variable completes. In this case, ignoring data clauses (or using them only for prefetching) is insufficient to ensure correct execution. For scalars and other small variables, firstprivate would expose them to the device for the lifetime of the region, but for larger variables this might not be a good option.

void do_stuff_async(double *input, int N)
{
  // Assume filter is too large to use firstprivate
  double filter[3] = { -1, 0, 1 };
  #pragma acc parallel loop copyin(filter[0:3]) copy(input[0:N]) async
  for ( int i = 0; i < N; i++ )
  {
    // apply filter
  }
  // no synchronization
} // filter no longer exists and its stack address may be reused

The above probably qualifies as a bad idea, but it is often non-trivial to recognize all stack usage. The compiler might not even see this case if the variable was put on the stack by a calling function; that would not trigger an issue at this point, but it would if that routine returned before this asynchronous region completed. Assuming the data is in shared memory, the defined behavior is to take no data actions, even if an explicit data clause exists, but placing the data in discrete memory that is deleted only when the region is done would make it possible to run this code.
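
For reference, a version of the example that should be well defined in either memory mode simply waits on the queue before the stack frame that owns filter can be torn down (the function name below is mine):

void do_stuff_async_safe(double *input, int N)
{
  double filter[3] = { -1, 0, 1 };
  #pragma acc parallel loop copyin(filter[0:3]) copy(input[0:N]) async
  for ( int i = 0; i < N; i++ )
  {
    // apply filter
  }
  // The wait keeps filter (and the async copy back of input) alive until
  // the region completes, at the cost of losing the asynchrony past this
  // routine, which is exactly the tension described above.
  #pragma acc wait
}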

@jefflarkin (author) commented:

2.6.3, lines 1329-1330: "Data in shared memory is accessible from the current device as well as to the local thread. Such data is available to the accelerator for the lifetime of the variable."

"If the variable is defined as global or file or function static, it must appear in a declare directive." - Is this true for shared memory?

Lines 359-362: "Programmers need to be very careful that the program uses appropriate synchronization to ensure that an assignment or modification by a thread on any device to data in shared memory is complete and available before that data is used by another thread on the same or another device."

@jdenny-ornl commented Aug 10, 2023:

2.6.3, lines 1329-1330: "Data in shared memory is accessible from the current device as well as to the local thread. Such data is available to the accelerator for the lifetime of the variable."

"If the variable is defined as global or file or function static, it must appear in a declare directive." - Is this true for shared memory?

Some more text later in that paragraph:

"A data lifetime is the duration from when the data is first made available to the accelerator until it becomes unavailable."

and:

"For data not in shared memory, the data lifetime begins when it is made present and ends when it is no longer present."

The definition of "present data" from the glossary:

"data for which the sum of the structured and dynamic reference counters is greater than zero in a single device memory section".

As we discussed today, reference counting is synchronous with the host. Thus, in the first example above, the ref count of B becomes zero at the acc exit data, so it is no longer present after that, so its data lifetime ends, so it is not available to the accelerator, so the kernel accessing B is invalid because it might be executed after this point due to its async clause.

In other words, that program is invalid even for discrete memory. I'm just not sure any implementation will detect that violation.

My understanding of what the spec says here keeps changing. Am I still misunderstanding? Does the spec need to be changed?
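
To spell that out, under this reading the wait has to come before the exit data so that B is still present whenever the queue-1 work actually executes. A minimal C paraphrase of the first example (the VLA stand-in and the placement of the wait are mine):

void work(float *A, int N)
{
  float B[N]; // stands in for the Fortran automatic array
  // A is assumed to already be present from an enclosing data region.
  #pragma acc enter data create(B[0:N]) async(1)
  #pragma acc parallel loop present(B[0:N]) async(1)
  for (int i = 0; i < N; i++) B[i] = 1.0f;
  #pragma acc parallel loop present(A[0:N], B[0:N]) async(1)
  for (int i = 0; i < N; i++) A[i] += B[i];
  #pragma acc wait(1)                  // complete all queue-1 work first...
  #pragma acc exit data delete(B[0:N]) // ...so the reference count drops only now
}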
