paniq/gpu_edge_flip.md Secret

## gpu_edge_flip.md

      
    Raw
  

              gpu_edge_flip.md
            
          
    Delaunay transformation using edge flips on the GPU w/o compute or atomics

This is the result of thinking about how to transform a triangle mesh into a constrained Delaunay triangulation on the GPU.
I need this functionality for the prototype. What happens specifically is that the vertices of the 2D mesh on which the prototype world runs are continuously deformed and refined as part of a physical simulation. Inverted triangles are used to detect and repair collisions. But for this to work, triangles representing air or vacuum must be continuously refined. (Here is a video of that technique in action: https://www.youtube.com/watch?v=dIIYP8e2tKo)
A very similar problem is the task of maintaining a proximity or pathfinding map for a set of points; The map is also set up as a triangle mesh, and, if Delaunay conforming, has the nice property that each point can enumerate all points that would be equivalent to the closest cells on a voronoi diagram. It's an excellent structure to have for AI logic and other gameplay relevant spatial queries.
I have a solution that operates on the CPU, but it is (shocking, I know) not fast enough. So the question is, how to parallelize this on the GPU, preferably in a way that is backwards compatible to GL 3.2 hardware.
There are two ways to optimize the mesh; The first one is to remove all triangles and Delaunay triangulate the vertices from scratch. The second one is to flip edges until the mesh is Delaunay conforming.
So far I've been considering the second method, as it's nearly always a substep of the first method, easier to implement and requires no spatial accelleration structures like grids or trees.
There is a paper that aims to exclusively solve this problem (http://swp.dcc.uchile.cl/TR/2011/TR_DCC-20110315-003.pdf). What I find most attractive about this is that the solution is GPU only and the refined data is ready to be rendered by GL, but the solution requires scatter-based evaluation, that is: compute shaders and atomics, which I don't necessarily have. Unfortunately this is the only solution I could find.
So I've been thinking about a gather-based approach to the same problem. It's good to know the invariants in this procedure beforehand: vertex positions, vertex count, edge count and triangle count don't change, so this appears like a straight forward 1:1 triangle index mapping algorithm. Edge flips can also be described as rotating two triangles within a quadric patch; only their indices change, nothing else.
It is known that the mesh can not be perfectly optimized in a single pass; if two edges qualify for a flip, but share a common triangle, the flip must be executed sequentially. In Navarro et al's solution, the algorithm is executed until the mesh is refined; We don't need this kind of precision, as we do one optimization pass per frame, and vertices are altered deterministically and localized; I'd speculate that 30-60 passes per second are enough to keep the mesh reasonably clean, and the worst that can happen is that an object appears to collide with an invisible wall, because the mesh could not be refined on time to prevent a triangle inversion. That is fine. So we can implicitly defer subsequent edge flips to the next frame.
Delaunay refinement requires at least two steps: Step 1 is identifying the edges that need to be flipped, step 2 is re-mapping triangle indices. The paper also has a third step, where broken references are fixed. Navarro et al use atomics in Step 1 to decide on which edges are flipped first, and this is the part I have to replace with something else.
Considering step 1: Let's say we process each (full-)edge of the mesh; The edge reads its two participating and opposing vertices to retrieve the quadric shape of which it forms the center, and uses a cost function to determine whether a flip would improve the quality of both triangles (the formula is in the air mesh paper); if it does, the edge is tagged. There is duplicate work here, as the quality for each triangle is computed three times, but let's ignore that for now.
Step 2: each triangle checks if an edge is tagged, and, if it does, reads the fourth vertex needed to rotate within the quadric and changes its indices accordingly (the rotation is either clockwise or counter-clockwise by convention.) This step is complicated by the fact that more than one edge of a triangle can be tagged; This messes with our assumption that both triangles sharing a tagged edge would perform a symmetrical rotation that leaves the mesh intact. When a triangle has more than one choice to rotate, however, two neighboring triangles could end up making conflicting moves, and destroy the mesh.
No simple greedy heuristic seems to work here: preferring the edge with the lowest index does not work, as the opposing triangle might see an edge that has an even lower index. Taking the edge indices of neighboring triangles into account, however, it appears that it is possible to establish priority without conflict: if a tagged edge that two triangles share has the lowest index of all tagged connected edges (max 4), then it will be flipped. I did a few tests on paper and could not imagine a corner case for this rule, so there might very well be one I'm unaware of.
The fact that we need to access neighboring triangles and full-edges in particular requires thinking about data structures. A vertex / triangle soup usually has no requirement for an explicit edge structure; But this paper would need one, which incurs additional synchronization and mapping issues. Could we do without one?
How about, instead of storing edges, we extend the triangle structure. Each triangle stores its three adjacent vertices a0 a1 a2 (either analog to GL_TRIANGLES_ADJACENCY, or into a separate parallel array), and the three analog triangle buffer indices t0 t1 t2, so that a(x) == triangles[t(x)], which are the same as the half-edge indices. The remaining vertex indices of the opposite half-edge can be calculated from (t(x) - (t(x) % 3)) + ((t(x) + n) % 3) where n is 1 or 2.
Pass 1 would then map triangles to a tag buffer, at the disadvantage that each tag would be computed and stored twice (as we can only map half-edges this way), but we end up with a nice 1:1 half-edge -> tag map (can be simple U8, with three flagbits per value);
Pass 2 then generates a new index buffer, using tags and the adjacent lowest edge index heuristic (using the vertex end points for comparison) to decide whether and in what direction a triangle's indices need to be rotated. The indices are swapped for a minimum footprint, as Navarro et al describe it.
After pass 2, just as with the edge structure in the paper, the adjacency information is partially broken. A third pass is needed to restore this information. Just as Navarro et al do it, the adjacent, now updated half-edge indices can be compared against the stored vertex indices. If a(x) != triangles[t(x)], then the correct values must be found in the indices of the triangles that were rotated together. This makes it necessary to fill a second U32 array in Pass 2 mapping a rotated triangle against its pair, or itself if no flip was performed.
I believe this should cover everything. Whether it works depends completely on the lowest edge index heuristic working as intended.
Here's a list of the structures required:
- array of V vertices (V*? bytes)
- array of F (~= 2*V) triangles (each triangle 3 * vertex index (uint32), F*3*4 bytes)
- array of F adjacent vertex triplets (each triangle 3 * vertex index (uint32), F*3*4 bytes)
- array of F adjacent half-edge triplets (each triangle 3 * edge index (uint32), F*3*4 bytes)
(edge index = F*3 + i, where i=0,1,2)
- array of F flip flag triplets (each triangle 3 bit flags (uint8), F bytes)
- array of F flipped pair triangle indices (each triangle 1 triangle index (uint32), F*4 bytes)
= 29 bytes extra per triangle.