@d4rkc0d3r
Last active September 12, 2024 14:49

Here is what I learned about Unity skinned meshes and blendshapes and how they are handled on the GPU.
I use Nvidia Nsight Graphics to profile performance and read buffer sizes on the GPU. Performance numbers are for a 4090 clocked at 2310 MHz core with stock memory. Unity version is 2019.4.31f1.

TLDR

For Unity 2019, split meshes with blendshapes into two meshes: one that has the blendshapes and one without.

This does not help much with memory usage, but it helps a lot with performance. The only VRAM saving is one fewer copy of POSITION, NORMAL, TANGENT.

For Unity 2021+, merge skinned meshes regardless of whether they have blendshapes. There are no more extra copies, and bone skinning now always uses a fast compute shader.

Strip unused blendshapes from the mesh. This can save a lot of VRAM.

Blendshape buffer on GPU

  • Always contains all blendshapes of the mesh, even if they are not used.
  • Always contains delta normals and tangents, even if they are all 0.
  • Stores an entry for a vertex only if that vertex has a non-zero delta in the blendshape.
struct Blendshape
{
    uint vertexIndex;
    float3 deltaPosition;
    float3 deltaNormal;
    float3 deltaTangent;
};
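
As a rough check of the layout above, this Python sketch measures one entry with the standard `struct` module and estimates how many entries fit in a buffer of a given size. It assumes tightly packed 32-bit fields with no padding, which agrees with the 40 bytes per moved vertex mentioned in the comments below. The helper name `entries_in_buffer` is made up for illustration.

```python
import struct

# One blendshape entry: uint vertexIndex + three float3 deltas.
# Assumption: tightly packed little-endian 32-bit fields, no padding.
ENTRY_FORMAT = "<I3f3f3f"
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)  # 4 + 3 * 12 bytes

def entries_in_buffer(size_mib):
    """Approximate number of (vertex, blendshape) entries in size_mib MiB."""
    return int(size_mib * 1024 * 1024 // ENTRY_SIZE)

print(ENTRY_SIZE)
print(entries_in_buffer(8.05))   # the 24-blendshape test mesh
print(entries_in_buffer(20.15))  # the 61-blendshape test mesh
```

Under that assumption, the 8.05 MiB buffer of the 24-blendshape test mesh would hold a bit over 200K entries, which is more than the mesh has vertices because each blendshape stores its own entry for every vertex it moves.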

Blendshape skinning on GPU

  1. Copy POSITION, NORMAL, TANGENT to temporary buffer A
  2. For each active blendshape, apply its deltas multiplied by the weight in place to temporary buffer A, with one compute shader call per blendshape
    • This is only done for vertices that have a non-zero delta, so a blendshape that moves fewer vertices is faster.
  3. Copy temporary buffer A to temporary buffer B
  4. Do regular bone skinning in a vertex shader with temporary buffer B as input and buffer A as stream output
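
Steps 1 and 2 can be illustrated on the CPU with a small Python sketch. This is not Unity's shader code, only a model of the behaviour described above: it handles positions only (normals and tangents would work the same way) and leaves out the second copy and the bone skinning of steps 3 and 4. The function name and the sample data are hypothetical.

```python
def apply_blendshapes(base_positions, blendshapes, weights):
    """CPU illustration of steps 1 and 2: copy, then add weighted sparse deltas."""
    # Step 1: copy the source data into temporary buffer A.
    buffer_a = [list(p) for p in base_positions]
    # Step 2: one pass per active blendshape (one compute call each on the GPU).
    for name, weight in weights.items():
        if weight == 0.0:
            continue  # inactive blendshapes are skipped entirely
        # Only vertices with a non-zero delta are stored, so only those are touched.
        for vertex_index, delta in blendshapes[name]:
            for axis in range(3):
                buffer_a[vertex_index][axis] += weight * delta[axis]
    return buffer_a

base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shapes = {
    "smile": [(1, (0.0, 0.5, 0.0))],   # sparse: moves only vertex 1
    "blink": [(2, (0.0, -1.0, 0.0))],  # sparse: moves only vertex 2
}
result = apply_blendshapes(base, shapes, {"smile": 0.5, "blink": 0.0})
print(result)
```

The inner loop runs once per stored delta rather than once per mesh vertex, which is why a blendshape that moves fewer vertices is cheaper.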

My test meshes

First variant is a single mesh with all blendshapes.

  • Mesh 155K polygons 115K vertices
    • 4.42 MiB for POSITION, NORMAL, TANGENT + 2 copies of it
    • 1.32 MiB for COLOR, TEXCOORD0
    • 3.53 MiB for BLENDINDICES, BLENDWEIGHTS
    • 8.05 MiB for 24 blendshapes
    • or 20.15 MiB for 61 blendshapes
  • Total of 26.16 MiB with 24 blendshapes and 38.26 MiB with 61 blendshapes
  • Thry VRAM calculator says 49.27 MiB and 73.48 MiB respectively
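
The totals for this variant can be reproduced from the listed components. The sketch below assumes that 4.42 MiB is the size of a single copy of POSITION, NORMAL, TANGENT and that the original plus the 2 extra copies are all counted; the constant and function names are only for illustration.

```python
# Sizes in MiB, taken from the list above.
PNT = 4.42        # assumed: one copy of POSITION, NORMAL, TANGENT
PNT_COPIES = 3    # original + 2 temporary copies
COLOR_UV = 1.32   # COLOR, TEXCOORD0
BONE_DATA = 3.53  # BLENDINDICES, BLENDWEIGHTS

def total_mib(blendshape_mib):
    """Sum of all buffers for the merged mesh, rounded to two decimals."""
    return round(PNT * PNT_COPIES + COLOR_UV + BONE_DATA + blendshape_mib, 2)

print(total_mib(8.05))   # 24 blendshapes
print(total_mib(20.15))  # 61 blendshapes
```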

Second variant is the mesh split in two: one with the blendshapes and one without.

  • Mesh 139K polygons 105K vertices
    • 4.02 MiB for POSITION, NORMAL, TANGENT + 1 copy of it
    • 1.20 MiB for COLOR, TEXCOORD0
    • 3.21 MiB for BLENDINDICES, BLENDWEIGHTS
  • Mesh 16K polygons 10K vertices
    • 410 KiB for POSITION, NORMAL, TANGENT + 2 copies of it
    • 123 KiB for COLOR, TEXCOORD0
    • 164 KiB for BLENDINDICES, BLENDWEIGHTS (only 2 bones per vertex)
    • 8.05 MiB for 24 blendshapes
    • or 20.15 MiB for 61 blendshapes
  • Total of 22.01 MiB with 24 blendshapes and 34.11 MiB with 61 blendshapes
  • Thry VRAM calculator says 48.52 MiB and 72.74 MiB respectively
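
The same bookkeeping for the split variant, as a sketch: it assumes the large mesh keeps the original plus 1 copy of POSITION, NORMAL, TANGENT while the small blendshape mesh keeps the original plus 2. The recomputed totals come out a few hundredths of a MiB below the listed ones, presumably because the listed component sizes are rounded.

```python
KIB_PER_MIB = 1024

# Large mesh without blendshapes (MiB): original + 1 copy of POSITION, NORMAL, TANGENT.
large_mesh = 4.02 * 2 + 1.20 + 3.21
# Small mesh with blendshapes (listed in KiB): original + 2 copies.
small_mesh = (410 * 3 + 123 + 164) / KIB_PER_MIB

for blendshape_mib, listed_total in ((8.05, 22.01), (20.15, 34.11)):
    recomputed = large_mesh + small_mesh + blendshape_mib
    print(f"recomputed {recomputed:.2f} MiB, listed {listed_total:.2f} MiB")
```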

Mesh import settings have the legacy blendshape normals option enabled.

Performance

  • Merged mesh no active blendshapes
    • 0.04 ms for vertex shader
  • Merged mesh 1 active blendshape
    • 0.01 ms compute shader
    • 0.37 ms copy to temporary buffer
    • 0.37 ms for vertex shader
    • total 0.75 ms
  • Merged mesh 10 active blendshapes
    • 0.09 ms compute shader (10 calls)
    • 0.36 ms copy to temporary buffer
    • 0.41 ms for vertex shader
    • total 0.86 ms
  • Split mesh no active blendshapes
    • 0.005 ms for vertex shader
    • 0.035 ms for vertex shader
    • total of 0.04 ms
  • Split mesh 1 active blendshape
    • 0.01 ms compute shader
    • 0.035 ms copy to temporary buffer
    • 0.01 ms for vertex shader
    • 0.035 ms for vertex shader
    • total of 0.09 ms
  • Split mesh 10 active blendshapes
    • 0.09 ms compute shader (10 calls)
    • 0.035 ms copy to temporary buffer
    • 0.01 ms for vertex shader
    • 0.035 ms for vertex shader
    • total of 0.17 ms
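
To put the Unity 2019 numbers side by side, this sketch sums the per-stage timings listed above and prints the ratio of merged to split frame cost. The dictionary layout and the function name are illustrative only.

```python
# Stage timings in ms, copied from the Unity 2019 lists above.
# Keys are the number of active blendshapes.
merged = {
    0: [0.04],
    1: [0.01, 0.37, 0.37],
    10: [0.09, 0.36, 0.41],
}
split = {
    0: [0.005, 0.035],
    1: [0.01, 0.035, 0.01, 0.035],
    10: [0.09, 0.035, 0.01, 0.035],
}

def time_ratios(merged_timings, split_timings):
    """Ratio of merged total to split total per active-blendshape count."""
    return {
        count: sum(merged_timings[count]) / sum(split_timings[count])
        for count in merged_timings
    }

ratios = time_ratios(merged, split)
for count, ratio in ratios.items():
    print(f"{count} active blendshapes: merged takes {ratio:.1f}x the split time")
```

With no active blendshapes both variants cost the same; with 1 active blendshape the merged mesh takes roughly 8 times as long as the split one, and with 10 roughly 5 times.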

Extra test on Unity 2021.3.21f1

  • Merged mesh no active blendshapes
    • 0.025 ms for compute shader (bones)
  • Merged mesh 1 active blendshape
    • 0.005 ms for compute shader (blendshape)
    • 0.025 ms for compute shader (bones)
    • total 0.03 ms
  • Merged mesh 10 active blendshapes
    • 0.080 ms for compute shaders (blendshapes)
    • 0.025 ms for compute shader (bones)
    • total 0.105 ms

Notes about the captured performance data

In this profiling run, the first copy of POSITION, NORMAL, TANGENT does not show up in the timing data. When I did these tests on a 1080 Ti, it did show up. I am not certain that the timings correspond one-to-one to the events they are attached to. For example, the final vertex shader is identical for blendshape and non-blendshape skinning, yet it is much slower when any blendshapes are active. The time taken for the copies is also suspiciously large. Copying 4.42 MiB of data at the 1 TB/s bandwidth of a 4090 should take 0.0042 ms, so roughly 100x faster than measured. I expect most of the time is just massive stalls in the GPU pipeline.
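
The bandwidth estimate can be checked with a few lines of Python. This sketch reads "1 TB/s" as 1 TiB/s, which is an assumption that reproduces the 0.0042 ms figure, and it counts only the 4.42 MiB being copied, not the separate read and write traffic a copy generates.

```python
SIZE_MIB = 4.42            # one copy of POSITION, NORMAL, TANGENT
BANDWIDTH_TIB_PER_S = 1.0  # assumption: "1 TB/s" read as 1 TiB/s
MEASURED_COPY_MS = 0.37    # merged mesh, 1 active blendshape

MIB_PER_TIB = 1024 * 1024
ideal_ms = SIZE_MIB / (BANDWIDTH_TIB_PER_S * MIB_PER_TIB) * 1000
slowdown = MEASURED_COPY_MS / ideal_ms

print(f"ideal copy time: {ideal_ms:.4f} ms")
print(f"measured copy is about {slowdown:.0f}x slower")
```

Even if the ideal time is doubled to account for reading and writing every byte, the measured copy remains well over an order of magnitude slower than a bandwidth-bound copy.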

It is definitely this slow though: creating 20 copies of the merged mesh test avatar with a blendshape active predictably drops the FPS, while the split mesh doesn't suffer the same massive drop.

@hostilelogout

hostilelogout commented Apr 30, 2023

With Thry's VRAM calculator, which shows how much VRAM the body uses etc., all the blendshapes together affected roughly 90k vertices and only took up 14 MB of VRAM. However, I do see a noticeable difference in render thread time: it jumps from 0.1 ms to 0.5 ms with all of them active. A quick and dirty test in the Unity editor with 40 copies of the same avatar, materials, etc. took 25 ms of CPU and 17.5 ms of GPU. Ouch, that's a lot. Mind you, that's 90 blendshapes active all at once, which is unlikely.

@d4rkc0d3r
Author

d4rkc0d3r commented Apr 30, 2023

If you use the up to date version of Thry's VRAM calculator then the code to calculate mesh VRAM is written by me based on this research here.
As said, blendshape consumption varies by how many verts they affect (non 0 delta values).

@hostilelogout

> If you use the up to date version of Thry's VRAM calculator then the code to calculate mesh VRAM is written by me based on this research here. As said, blendshape consumption varies by how many verts they affect (non 0 delta values).

I know. That's also why I said I find my results very odd, since by that definition it should take up more than 60 MB of VRAM, but it doesn't. And again, in total those 90 blendshapes affect roughly 90k vertices.

@d4rkc0d3r
Author

If a blendshape only moves a single vertex it only takes up 40 bytes of vram. affected verts != verts in the mesh the blendshape is on.

@hostilelogout

> If a blendshape only moves a single vertex it only takes up 40 bytes of vram. affected verts != verts in the mesh the blendshape is on.

Again, I am aware, hence why I said they affect 90k in total. How many each one affects I can't be bothered to find out, ngl, but it's possibly somewhere around 1-4k at best.
