Here is what I learned about Unity skinned meshes and blendshapes and how they are handled on the GPU.
I use Nvidia Nsight Graphics to profile performance and read buffer sizes on the GPU. Performance numbers are for a 4090 clocked at 2310 MHz core and stock memory. The Unity version is 2019.4.31f1.
For Unity 2019, split meshes with blendshapes into two meshes: one that has the blendshapes and one without.
This does not help much with memory usage, but it does help a lot with performance. The only VRAM saving is one fewer copy of POSITION, NORMAL, TANGENT.
For Unity 2021+, merge skinned meshes regardless of whether they have blendshapes. There are no more extra copies, and bone skinning now always uses a fast compute shader.
Strip unused blendshapes from the mesh. This can save a lot of VRAM.
- Always has all blendshapes of the mesh, even if they are not used.
- Always has delta normals and tangents even if they are all 0.
- The buffer stores deltas only for vertices whose delta is non-zero.
    struct Blendshape
    {
        uint vertexIndex;
        float3 deltaPosition;
        float3 deltaNormal;
        float3 deltaTangent;
    };
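As a rough size check on the struct above, here is a minimal Python sketch. It assumes the struct is tightly packed with a 4-byte index and 4-byte floats, which I did not verify beyond the buffer sizes read in Nsight. The helper name `blendshape_buffer_bytes` and the vertex counts in the example are made up for illustration.

```python
import struct

# Assumed tightly packed layout: 1 uint32 index + three float3 deltas (9 floats).
BLENDSHAPE_ENTRY_FORMAT = "<I9f"
ENTRY_SIZE = struct.calcsize(BLENDSHAPE_ENTRY_FORMAT)  # 40 bytes under this assumption


def blendshape_buffer_bytes(nonzero_vertex_counts):
    """Estimate the buffer size from the number of moved vertices per blendshape.

    Only vertices with a non-zero delta are stored, so the size depends on how
    many vertices each shape moves, not on the vertex count of the mesh.
    """
    return sum(nonzero_vertex_counts) * ENTRY_SIZE


# Hypothetical example: three blendshapes moving 5000, 1200 and 300 vertices.
example_bytes = blendshape_buffer_bytes([5000, 1200, 300])
print(f"{ENTRY_SIZE} bytes per entry, {example_bytes / 1024:.1f} KiB total")
```

Under that assumption, every vertex a blendshape moves costs 40 bytes, even when its normal and tangent deltas are all zero.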
- Copy POSITION, NORMAL, TANGENT to temporary buffer A
- For each active blendshape apply deltas multiplied by weight in place to temporary buffer A with one compute shader call each
- This is only done for vertices that have a non-zero delta, so a blendshape that moves fewer vertices is faster.
- Copy temporary buffer A to temporary buffer B
- Run the regular bone skinning vertex shader with temporary buffer B as input and buffer A as stream output
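The blendshape part of these steps can be sketched on the CPU. This is only a reference for the logic, not Unity's actual shader code: it handles positions only (normals and tangents would be treated the same way), leaves out the bone skinning pass, and the function name and data layout are my own.

```python
def apply_blendshapes(base_positions, blendshapes, weights):
    """CPU reference for the blendshape pass described above.

    base_positions: list of (x, y, z) tuples, one per vertex.
    blendshapes:    list of sparse shapes, each a list of
                    (vertex_index, (dx, dy, dz)) for moved vertices only.
    weights:        one weight per blendshape; 0 means inactive.
    """
    # Copy the source stream into temporary buffer A.
    buffer_a = [list(p) for p in base_positions]

    # One pass per active blendshape, touching only the stored deltas.
    for shape, weight in zip(blendshapes, weights):
        if weight == 0.0:
            continue  # inactive shapes are skipped entirely
        for vertex_index, delta in shape:
            for axis in range(3):
                buffer_a[vertex_index][axis] += delta[axis] * weight

    # Bone skinning would read a copy of this buffer next; omitted here.
    return [tuple(p) for p in buffer_a]


base = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
smile = [(1, (0.0, 0.5, 0.0))]  # moves only vertex 1
blink = [(2, (0.0, 0.0, 1.0))]  # moves only vertex 2, but is inactive below
blended = apply_blendshapes(base, [smile, blink], [0.5, 0.0])
print(blended)
```

The inner loop runs once per stored delta, which is why a shape that moves fewer vertices is cheaper.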
First variant is single mesh with all blendshapes.
- Mesh 155K polygons 115K vertices
- 4.42 MiB for POSITION, NORMAL, TANGENT + 2 copies of it
- 1.32 MiB for COLOR, TEXCOORD0
- 3.53 MiB for BLENDINDICES, BLENDWEIGHTS
- 8.05 MiB for 24 blendshapes
- or 20.15 MiB for 61 blendshapes
- Total of 26.16 MiB and 38.26 MiB
- Thry VRAM calculator says 49.27 MiB and 73.48 MiB respectively
Second variant is mesh split in two, one for blendshapes and one without.
- Mesh 139K polygons 105K vertices
- 4.02 MiB for POSITION, NORMAL, TANGENT + 1 copy of it
- 1.20 MiB for COLOR, TEXCOORD0
- 3.21 MiB for BLENDINDICES, BLENDWEIGHTS
- Mesh 16K polygons 10K vertices
- 410 KiB for POSITION, NORMAL, TANGENT + 2 copies of it
- 123 KiB for COLOR, TEXCOORD0
- 164 KiB for BLENDINDICES, BLENDWEIGHTS (only 2 bones per vertex)
- 8.05 MiB for 24 blendshapes
- or 20.15 MiB for 61 blendshapes
- Total of 22.01 MiB and 34.11 MiB
- Thry VRAM calculator says 48.52 MiB and 72.74 MiB respectively
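The totals for both variants can be reproduced from the individual buffer sizes. The sketch below just adds up the numbers listed above, reading "+ 2 copies" as three instances of the vertex streams and "+ 1 copy" as two. The helper name is mine.

```python
KIB = 1 / 1024  # one KiB expressed in MiB


def total_mib(stream_mib, copies, other_mib, blendshape_mib=0.0):
    """Vertex streams times number of copies, plus everything stored once."""
    return stream_mib * copies + sum(other_mib) + blendshape_mib


results = {}
for count, shapes_mib in ((24, 8.05), (61, 20.15)):
    # Single mesh: original streams + 2 copies.
    merged = total_mib(4.42, 3, [1.32, 3.53], shapes_mib)
    # Split: large mesh without blendshapes (+ 1 copy), small mesh with them (+ 2 copies).
    split = (total_mib(4.02, 2, [1.20, 3.21])
             + total_mib(410 * KIB, 3, [123 * KIB, 164 * KIB], shapes_mib))
    results[count] = (merged, split)
    print(f"{count} blendshapes: merged {merged:.2f} MiB, split {split:.2f} MiB")
```

The split totals come out about 0.03 MiB below the figures quoted above, which is within the rounding of the individual sizes.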
Mesh import settings have the legacy blendshape normals option enabled.
- Merged mesh no active blendshapes
- 0.04 ms for vertex shader
- Merged mesh 1 active blendshape
- 0.01 ms compute shader
- 0.37 ms copy to temporary buffer
- 0.37 ms for vertex shader
- total 0.75 ms
- Merged mesh 10 active blendshapes
- 0.09 ms compute shader (10 calls)
- 0.36 ms copy to temporary buffer
- 0.41 ms for vertex shader
- total 0.86 ms
- Split mesh no active blendshapes
- 0.005 ms for vertex shader
- 0.035 ms for vertex shader
- total of 0.04 ms
- Split mesh 1 active blendshape
- 0.01 ms compute shader
- 0.035 ms copy to temporary buffer
- 0.01 ms for vertex shader
- 0.035 ms for vertex shader
- total of 0.09 ms
- Split mesh 10 active blendshapes
- 0.09 ms compute shader (10 calls)
- 0.035 ms copy to temporary buffer
- 0.01 ms for vertex shader
- 0.035 ms for vertex shader
- total of 0.17 ms
Extra test on Unity 2021.3.21f1
- Merged mesh no active blendshapes
- 0.025 ms for compute shader (bones)
- Merged mesh 1 active blendshape
- 0.005 ms for compute shader (blendshape)
- 0.025 ms for compute shader (bones)
- total 0.03 ms
- Merged mesh 10 active blendshapes
- 0.080 ms for compute shaders (blendshapes)
- 0.025 ms for compute shader (bones)
- total 0.105 ms
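To make the timing lists easier to compare, here is a small sketch that expresses each total relative to the same setup with no active blendshapes. The numbers are the totals listed above. The variant labels and the function name are mine, and the 2019 and 2021 runs are different Unity versions, so treat cross-version comparisons with care.

```python
# Total GPU time in ms, keyed by number of active blendshapes.
totals_ms = {
    "2019 merged": {0: 0.04, 1: 0.75, 10: 0.86},
    "2019 split": {0: 0.04, 1: 0.09, 10: 0.17},
    "2021 merged": {0: 0.025, 1: 0.03, 10: 0.105},
}


def slowdown(variant, active):
    """Cost relative to the same variant with no active blendshapes."""
    return totals_ms[variant][active] / totals_ms[variant][0]


for variant in totals_ms:
    ratios = ", ".join(f"{n} active: {slowdown(variant, n):.1f}x" for n in (1, 10))
    print(f"{variant}: {ratios}")
```

On 2019, a single active blendshape makes the merged mesh roughly 19 times more expensive than with none, while the split mesh only gets a little over twice as expensive.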
In this profiling run the first copy of POSITION, NORMAL, TANGENT does not show up in the timing data. When I did these tests on a 1080 Ti, it did show up. I am not certain that the timings correspond one to one to the events they are attached to. For example, the final vertex shader is identical for blendshape and non-blendshape skinning, yet it is much slower when any blendshapes are active. The time taken for the copies is also suspiciously large. Copying 4.42 MiB of data with the 1 TB/s bandwidth of a 4090 should take about 0.0042 ms, so roughly 100x faster. I expect most of the time is just massive stalls in the GPU pipeline.
It is definitely this slow though, since creating 20 copies of the merged mesh test avatar with a blendshape active predictably drops the fps, while the split mesh doesn't suffer the same massive drop.
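The bandwidth estimate above works out as follows. This is a back-of-the-envelope sketch that assumes the copy is limited only by memory bandwidth and uses a round 1 TiB/s figure. The function name is hypothetical, and `read_and_write` is an extra knob of mine for counting both the read and the write traffic of a copy.

```python
MIB = 1024 ** 2


def ideal_copy_ms(size_mib, bandwidth_bytes_per_s, read_and_write=False):
    """Lower bound for a buffer copy limited only by memory bandwidth."""
    traffic_bytes = size_mib * MIB * (2 if read_and_write else 1)
    return traffic_bytes / bandwidth_bytes_per_s * 1000.0


# Assumption: 1 TiB/s, a round number close to the 4090's roughly 1 TB/s.
bandwidth = 1024 ** 4
ideal = ideal_copy_ms(4.42, bandwidth)
measured = 0.37  # ms, copy to temporary buffer on the merged mesh
print(f"ideal {ideal:.4f} ms, measured is {measured / ideal:.0f}x slower")
```

Even when the read and the write are both counted, the ideal copy stays under 0.01 ms, so raw bandwidth alone does not explain the 0.37 ms measured.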
Fantastic read. I always hear people complaining about how terrible blendshapes are (especially recently), so it's nice to finally see some numbers. Do please keep up the excellent work!