OpenGL Black Bible

Author: Dale Weiler

Preface

What follows is a writeup of things I've independently discovered or been told
over the years about how to use OpenGL effectively. I don't guarantee that
everything in here is factually correct, though several others share similar
ideas. Take these at your own peril; like everything else, nothing is absolute,
so profile to be sure and check the standards if something here looks incorrect.
The practices presented here are ones I've come to accept as a safe subset of
OpenGL that is supported practically everywhere, and a safe bet if you don't
want to get caught up in the details.

Chapter 1. Things to avoid

Versions and Profiles

OpenGL has too many versions, most of which are garbage and should be avoided.
A safe bet is the 3.2 Core Profile. Newer versions may be appealing, but unless
you're planning on producing a AAA-quality title, all those new features can be
ignored. There is something to be said for compute shaders, but those are still
reasonably accessible through extensions. Another benefit of 3.2 in particular
is that it's mostly the basis of GL ES 3 and WebGL, so not much effort is needed
to reach even more targets. Support for GL 3.x also looks fair if you check the
usual suspects, [Steam Hardware Survey] and [Wolfire's Support Matrix]. When
requesting this version of OpenGL, the concept of "Core Profile" is pretty
important, since Compatibility Profiles contain all the junk from previous
versions of OpenGL. You would think avoiding that junk is strictly a good thing.
However, and here is the first claim: Compatibility Profiles run faster on
Nvidia. I've never confirmed this, but I've read it in several places, including
from former Nvidia driver developers. So it may be beneficial to request Core on
everything but Nvidia.

Geometry shaders

One of the important features of our selected version of OpenGL is that it gained
support for geometry shaders. Geometry shaders are a neat way to generate
geometry, and they sit in a part of the pipeline that permits this. However,
unless you're on Intel, geometry shaders are going to be considerably slower than
doing the equivalent work on the CPU with a streaming vertex buffer. The reason
is that both Nvidia and AMD require geometry shader output to make a round trip
through memory, because GL requires that the output of a geometry shader be
rendered in input order. This does not map well to Nvidia or AMD hardware, where
the fixed-function hardware that does the rendering must consume geometry shader
outputs serially. That serial consumption creates a synchronization point, which
is a bottleneck given the parallel nature of GPUs, so the output has to be
buffered instead. There are very few places where this data can safely be
buffered: either on-chip cache, which is limited in size, or off-chip DRAM. AMD
uses on-chip cache but has to do a lot of work to deal with wrapping, which means
geometry-heavy shaders often cause huge stall points, whereas Nvidia uses
off-chip DRAM, which has high latency that cannot be hidden. Intel does not have
these problems because its threads, unlike AMD's and Nvidia's, have their own
huge register file. There is also the lack of support for geometry shaders in
GLES and WebGL to consider, so by avoiding them you make your code base that much
more portable to different flavors of OpenGL too.
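
As a rough illustration of doing the equivalent work on the CPU, the sketch
below expands points into two triangles each, filling a plain array that would
then be uploaded to a streaming vertex buffer. The `vertex` layout, the
`expand_points_to_quads` name and the axis-aligned quads are assumptions made
for this example; a real billboard would use the camera's right and up vectors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vertex layout for this sketch: position plus texcoord. */
typedef struct {
    float x, y, z;
    float u, v;
} vertex;

/* Expand each point (3 floats: x, y, z) into two counter-clockwise triangles
 * forming an axis-aligned quad in the XY plane. Writes 6 vertices per point
 * into `out`, which must have room for 6 * point_count vertices, and returns
 * the number of vertices written. */
size_t expand_points_to_quads(const float *points, size_t point_count,
                              float half_size, vertex *out)
{
    /* Corner signs for the two triangles of one quad. */
    static const float corner[6][2] = {
        {-1.0f, -1.0f}, { 1.0f, -1.0f}, {-1.0f,  1.0f},
        {-1.0f,  1.0f}, { 1.0f, -1.0f}, { 1.0f,  1.0f}
    };
    size_t written = 0;

    assert(points != NULL && out != NULL);

    for (size_t i = 0; i < point_count; ++i) {
        const float *p = &points[i * 3];
        for (size_t c = 0; c < 6; ++c) {
            vertex *v = &out[written++];
            v->x = p[0] + corner[c][0] * half_size;
            v->y = p[1] + corner[c][1] * half_size;
            v->z = p[2];
            /* Map corner signs from [-1, 1] to texcoords in [0, 1]. */
            v->u = corner[c][0] * 0.5f + 0.5f;
            v->v = corner[c][1] * 0.5f + 0.5f;
        }
    }
    return written;
}
```

The returned count is what would be passed to the draw call after the array has
been uploaded; whether this beats a geometry shader on a given GPU is something
to profile rather than assume.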

Tessellation shaders

The evaluation and control shaders for tessellation suffer from similar problems
to geometry shaders. They're also incredibly specific in terms of what they're
meant to be used for; the fact that they're shaders at all is a testament to
what happens when you shoehorn in ways to achieve popular rendering techniques.
On a personal level, these things complicate the pipeline and introduce a cost
even when not used, because their interaction with other features adds
validation cost.

Cubemaps

Unless you're only planning on supporting Nvidia hardware, cubemaps are going to
give you a bad time. There is no longer any fixed-function hardware for cubemaps
on modern GPUs; they are emulated in an unrolled fashion by the driver, using an
implementation-defined packing strategy. Cubemaps are definitely useful when
sampling in a shader for environment reflections, skybox rendering and a few
other common things, but all of those can still be done the manual way with cube
geometry and individual faces. Lots of people use cubemaps for shadow mapping
with omni-directional lights. This is rather wrong, and it still surprises me
that people continue to do it, since you need border pixels in the cubemap for
filter taps, which is not possible without an extension. You're far better off
doing this manually; at least then you control the layout and can exploit it for
a marginal performance gain.
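
To make "the manual way" a little more concrete, here is a minimal sketch of
picking one of six individual 2D faces, and a coordinate within it, from a
direction vector. It follows the major-axis selection table from the cube map
section of the OpenGL specification; the `cube_lookup` type and the function
name are made up for this example, and how the six faces are stored (separate
textures, an atlas) is left open.

```c
#include <assert.h>

/* Face indices in the same order as GL_TEXTURE_CUBE_MAP_POSITIVE_X + i. */
enum { FACE_POS_X, FACE_NEG_X, FACE_POS_Y, FACE_NEG_Y, FACE_POS_Z, FACE_NEG_Z };

typedef struct {
    int   face; /* which of the six 2D faces to sample */
    float u, v; /* coordinate within that face, in [0, 1] */
} cube_lookup;

static float abs_float(float value)
{
    return value < 0.0f ? -value : value;
}

cube_lookup cube_face_from_direction(float x, float y, float z)
{
    float ax = abs_float(x), ay = abs_float(y), az = abs_float(z);
    float sc, tc, ma;
    cube_lookup result;

    assert(ax > 0.0f || ay > 0.0f || az > 0.0f); /* direction must be non-zero */

    if (ax >= ay && ax >= az) {   /* X is the major axis */
        result.face = x > 0.0f ? FACE_POS_X : FACE_NEG_X;
        sc = x > 0.0f ? -z : z;
        tc = -y;
        ma = ax;
    } else if (ay >= az) {        /* Y is the major axis */
        result.face = y > 0.0f ? FACE_POS_Y : FACE_NEG_Y;
        sc = x;
        tc = y > 0.0f ? z : -z;
        ma = ay;
    } else {                      /* Z is the major axis */
        result.face = z > 0.0f ? FACE_POS_Z : FACE_NEG_Z;
        sc = z > 0.0f ? x : -x;
        tc = -y;
        ma = az;
    }

    /* Project onto the face and remap from [-1, 1] to [0, 1]. */
    result.u = 0.5f * (sc / ma + 1.0f);
    result.v = 0.5f * (tc / ma + 1.0f);
    return result;
}
```

The same arithmetic ports directly to GLSL. Once the face and coordinate are
explicit like this, adding border texels around each face for filter taps
becomes a layout decision of your own rather than something the driver hides.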

Uniform Buffer Objects

Uniform buffer objects are a misused feature in modern-day OpenGL. There is a
way to use them correctly, but chances are you're never going to hit a realistic
scenario where doing so offers a performance advantage over the traditional
uniform calls. This is not surprising, because UBOs were never meant to replace
uniform calls. They were created for two reasons: 1) updating very large uniform
data sets, since they can hold far more data than plain uniforms, and 2) sharing
the same uniform data between more than one shader program. I've also heard
stories that several vendors implement UBOs with a texture, which is far more
costly to read (due to DRAM latency) than the register-resident values of
typical uniforms.

MapBuffer

There is no safe way to use glMapBuffer without the matching glUnmapBuffer call
before a draw command, because drawing from a mapped buffer is undefined. For
this reason, glMapBuffer is literally useless, since it forces a synchronization
point between client and server. Just avoid it at all costs. glMapBufferRange
with the appropriate access flags to map the buffer unsynchronized
(GL_MAP_UNSYNCHRONIZED_BIT) is far faster. Never unmap it; keep it resident
forever and deal with synchronization manually through the use of fence objects.
The standard (and safe) approach is to have as many fence objects as you do
mapped buffers (for double buffering) and just query the completion state of the
fence; don't actually wait. Fine-tune the buffer count and size for the workload,
or treat one range as "staging" and the other as "source". There are lots of
ways to misuse this, so be careful.
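
The fence bookkeeping can be sketched without any GL calls at all. In the
hedged example below, a mapped buffer is split into a few regions, each guarded
by one fence, and a region is only handed out for writing once its fence
reports completion. All names here are hypothetical; the `fence_done_fn` hook
stands in for a zero-timeout glClientWaitSync query, and the sketch assumes the
buffer can legally stay mapped while it is drawn from (on desktop GL that
formally means persistent mapping from ARB_buffer_storage, which is newer than
3.2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_REGIONS 3
#define RING_BUSY ((size_t)-1)

/* Hypothetical hook: returns true once the GPU has passed the fence.
 * With real GL this would wrap glClientWaitSync(fence, 0, 0) and check for
 * GL_ALREADY_SIGNALED or GL_CONDITION_SATISFIED: a query, not a wait. */
typedef bool (*fence_done_fn)(void *fence);

typedef struct {
    void  *fence[RING_REGIONS]; /* NULL: region has no pending GPU work */
    size_t region_size;         /* bytes per region of the mapped buffer */
    size_t next;                /* region to hand out next */
} stream_ring;

/* Returns the byte offset of a region that is safe to write, or RING_BUSY
 * if the GPU may still be reading it. Never blocks. */
size_t ring_acquire(stream_ring *ring, fence_done_fn done)
{
    assert(ring != NULL && done != NULL);
    void *pending = ring->fence[ring->next];
    if (pending != NULL && !done(pending))
        return RING_BUSY;
    /* A real implementation would glDeleteSync the finished fence here. */
    ring->fence[ring->next] = NULL;
    return ring->next * ring->region_size;
}

/* Call after issuing the draw that sources the region. `fence` would be the
 * result of glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0). */
void ring_submit(stream_ring *ring, void *fence)
{
    assert(ring != NULL && fence != NULL);
    ring->fence[ring->next] = fence;
    ring->next = (ring->next + 1) % RING_REGIONS;
}

/* Stand-in fence for trying the logic without a GL context: the "fence" is
 * a pointer to a flag that the caller flips when the work is "done". */
bool flag_fence_done(void *fence)
{
    return *(const bool *)fence;
}
```

When `ring_acquire` returns `RING_BUSY`, the options are to skip the update for
this frame, grow the region count, or fall back to another upload path; which
one is right depends on the workload.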

State changes

This is pretty much an understood concept in OpenGL; however, it's not quite as
simple as avoiding state changes. Not all state changes are equal, and in nearly
all cases state changes are deferred until much later, typically right up to the
draw call itself. This is actually what people are complaining about when they
say draw calls are expensive. Issuing the draw call itself is literally
costless; what is costly is all the state changes "queued" up before the call,
plus the validation on top to ensure that series of state changes even
constitutes a valid state in which to issue a draw command. In either case, a
good general rule is not only to avoid state changes, but to organize draw calls
and data in such a way that you avoid having to make changes at all, for
instance by batching by material. This can only be taken so far, though, since
draw call order does matter for things like alpha blending.
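
Batching by material can be as simple as sorting the draw list before
submitting it. The sketch below is one way to do that, assuming a hypothetical
`draw_item` record where a single `material` id stands in for everything that
would need a state change (program, textures, blend state and so on). It only
applies to opaque draws; blended draws still have to keep their depth order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical draw record for this sketch. */
typedef struct {
    unsigned material; /* stands in for all state that must be rebound */
    unsigned mesh;     /* what to draw */
} draw_item;

static int compare_by_material(const void *a, const void *b)
{
    const draw_item *da = a, *db = b;
    return (da->material > db->material) - (da->material < db->material);
}

/* Sort opaque draws so that items sharing a material end up adjacent. */
void sort_draws(draw_item *items, size_t count)
{
    assert(items != NULL || count == 0);
    if (count > 1)
        qsort(items, count, sizeof *items, compare_by_material);
}

/* Count how many times the material would have to be bound if the list
 * were submitted in its current order. */
size_t count_material_switches(const draw_item *items, size_t count)
{
    size_t switches = 0;
    for (size_t i = 0; i < count; ++i)
        if (i == 0 || items[i].material != items[i - 1].material)
            ++switches;
    return switches;
}
```

Comparing `count_material_switches` before and after `sort_draws` gives a cheap
estimate of how much state churn the sort removes for a given frame.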

One example of a nasty state change to avoid is depth/stencil mask and test
state. This one is particularly nasty because changing it often results in a
shader recompilation. The most common place both of these are changed, and often
scissor too, is when clearing the render target, because the depth/stencil mask
(as well as scissor) is respected by glClear. The problem is that glClear is
often implemented internally by the driver as a fullscreen quad, which has a
shader of its own. So when this state is changed, the shader used for the
glClear operation gets recompiled. Instead, it's best to ensure your last few
draw commands switch the state back to what glClear needs.

Chapter 2. Things to watch out for

Framebuffer objects

Framebuffer objects, the de facto way to do offscreen rendering to a texture,
are something you need to tread very carefully with. In particular, not all
vendors get attachments right. It's also very easy to accidentally misuse
attachments and have it silently work on one platform and fail on another.
Never, under any circumstances, have multiple color attachments with different
formats: if you have one color attachment of GL_RGBA8, then all other color
attachments should be GL_RGBA8. You're still allowed to have depth and stencil
attachments; however, it's good practice to avoid individual depth and stencil
attachments in favor of combined depth-stencil formats, which is what the
hardware uses. The general rule is: if you plan on clearing depth and stencil
individually, for whatever reason, do not use a combined depth-stencil format;
if you plan on clearing them at the same time, always use a combined
depth-stencil format. Not following this simple rule will result in some really
annoying stalls on a variety of hardware, especially tiled-deferred hardware.

Speaking of tiled-deferred hardware, it's never a good idea to render to an FBO
and then immediately source the attachment textures in a draw call. That will
always result in a synchronization point due to the nature of how tiled-deferred
rendering works. Using more FBOs, so that the results are sourced much later, is
more beneficial, but can chew up video memory quickly; a good example is shadow
mapping. Atlasing is going to be a win for tiled-deferred (TDR) / mobile
hardware here, but will likely be slower on desktop. So you may want different
rendering paths if you want the best performance on both types of hardware.

Read backs

It's never safe to read back depth or stencil. The standard does permit you to
do this, but Intel in particular is notoriously bad at it, and you'll pretty
much always get inconsistent results across Intel driver versions and hardware.
If you need to read back depth or stencil, sampling it in a shader and writing it
out to a color attachment via a fullscreen quad or triangle may be the better
approach, plus it's a nice place to linearize depth too.
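
For reference, the depth linearization mentioned above comes down to a single
expression. It is written here in C so it can be checked on the CPU; in
practice the same arithmetic would live in the fragment shader that copies
depth to the color attachment. The sketch assumes a standard perspective
projection and the default depth range of [0, 1]; the function name is made up
for this example.

```c
#include <assert.h>

/* Convert a depth buffer value in [0, 1] to a positive eye-space distance.
 * Assumes a standard perspective projection with the given near and far
 * planes and the default glDepthRange of [0, 1]. */
float linearize_depth(float depth, float z_near, float z_far)
{
    assert(z_near > 0.0f && z_far > z_near);
    float z_ndc = depth * 2.0f - 1.0f; /* window depth back to NDC [-1, 1] */
    return (2.0f * z_near * z_far) /
           (z_far + z_near - z_ndc * (z_far - z_near));
}
```

A depth of 0 maps to the near plane and a depth of 1 to the far plane;
everything in between is spread non-linearly, which is why the raw values are
awkward to consume directly.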

If you're streaming read backs for things like light injection passes for global
illumination, radiosity, or just recording frames for video, always use pixel
buffer objects; never call glReadPixels straight into client memory. PBOs allow
you to do a non-stalling, asynchronous read back in a safe and consistent
manner. They're also good for taking a non-hitching screenshot, which is nice
when someone accidentally hits the screenshot keybind in your game: it isn't a
jarring experience.

Program binary

One of the extensions supported by our choice of GL is the program binary
extension. It is very neat in that it lets you serialize program state into a
binary representation, which you can reload and reuse to avoid the cost of
compiling shaders on subsequent runs. The problem is that the output isn't
compatible with anything but the machine that produced it, and sometimes even a
driver version change can break the produced binaries. This is easy enough to
protect against and is expected by people who know the extension. What is less
known is that even if a specific hardware and software configuration claims to
support the extension, that doesn't mean it actually does. You have to query the
number of "program binary formats" supported, and if that is zero, then program
binaries are not supported. As of writing, the only configuration that does this
is Intel with Mesa, but it's still something to watch out for.
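
Both checks can be sketched in a few lines. In the example below,
`program_binary_usable` expects the count read with
glGetIntegerv(GL_NUM_PROGRAM_BINARY_FORMATS, ...), and `program_cache_key` is
one possible way to protect against stale binaries: hash the GL_VENDOR,
GL_RENDERER and GL_VERSION strings and discard the cache whenever the key
changes. The function names and the choice of FNV-1a are assumptions made for
this sketch, not something the extension prescribes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* `has_extension`: the program binary extension is advertised.
 * `num_formats`: value of GL_NUM_PROGRAM_BINARY_FORMATS. */
bool program_binary_usable(bool has_extension, int num_formats)
{
    return has_extension && num_formats > 0;
}

/* Fold one string into a running 64-bit FNV-1a hash, followed by a
 * separator byte so that adjacent strings cannot run together. */
static uint64_t fnv1a_append(uint64_t hash, const char *text)
{
    for (; *text != '\0'; ++text) {
        hash ^= (unsigned char)*text;
        hash *= 1099511628211ULL;
    }
    hash ^= 0xFFu;
    hash *= 1099511628211ULL;
    return hash;
}

/* Key identifying the driver configuration that produced a binary. The
 * arguments would come from glGetString(GL_VENDOR / GL_RENDERER / GL_VERSION). */
uint64_t program_cache_key(const char *vendor, const char *renderer,
                           const char *version)
{
    assert(vendor != NULL && renderer != NULL && version != NULL);
    uint64_t hash = 14695981039346656037ULL; /* FNV-1a offset basis */
    hash = fnv1a_append(hash, vendor);
    hash = fnv1a_append(hash, renderer);
    hash = fnv1a_append(hash, version);
    return hash;
}
```

Storing the key next to each cached binary and recompiling from source when it
no longer matches covers driver updates; a failed link status after loading a
binary should still be handled as a fallback.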

Samplers

Sampler objects are fine to use; I encourage them, since they map more
appropriately to other graphics APIs. That's not what this section is about,
though. It's about specifying sampler slots for a shader with glUniform*. For
some reason or another, Intel continues to only support specifying a sampler
slot with glUniform1i. If you try to use glUniform1iv, which I have in the past
because I was feeling weird one day, it won't work.

Chapter 3. Streaming

General rules about streaming

If you're streaming any type of buffer data, the only two usage hints you care
about are GL_DYNAMIC_DRAW and GL_STREAM_DRAW. The general rule is that if you're
creating some data and want to render it only once, as soon as you're finished
uploading it, use GL_STREAM_DRAW. If you're going to be reusing the same buffer
but changing its data a lot, use GL_DYNAMIC_DRAW. Never, under any
circumstances, use GL_STATIC_DRAW for streaming contents.

When streaming data, it may be beneficial to double buffer or triple buffer your
contents, that is, to prepare the data for the next frame or the frame after it.
When doing things this way, orphan the buffer just sourced to inform the driver
of this behavior, that is, use glBufferData with null data and zero size. Do not
use glBufferSubData to orphan. The same is also true for textures.

Through various personal experiments, I've concluded there is no guaranteed best
way to do partial buffer or texture updates that is performant everywhere. What
I've discovered instead is that for a complete replacement of an existing
buffer, glBufferData wins on Nvidia, whereas glBufferSubData wins on AMD and
Intel. What is more concerning is that for partial updates, glBufferData is
sometimes faster too, depending on how large the partial update is.
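
If you want to encode that observation as a starting point, a small vendor
check is enough. The sketch below picks an upload path for full-buffer
replacement from the string returned by glGetString(GL_VENDOR). The enum, the
function name and the substring match on "NVIDIA" are assumptions made for this
example, and the result is only a default to be confirmed by profiling.

```c
#include <assert.h>
#include <string.h>

typedef enum {
    UPLOAD_BUFFER_DATA,    /* respecify the whole buffer with glBufferData */
    UPLOAD_BUFFER_SUB_DATA /* overwrite the contents with glBufferSubData */
} upload_path;

/* `vendor` is the string returned by glGetString(GL_VENDOR). */
upload_path choose_full_replace_path(const char *vendor)
{
    assert(vendor != NULL);
    return strstr(vendor, "NVIDIA") != NULL ? UPLOAD_BUFFER_DATA
                                            : UPLOAD_BUFFER_SUB_DATA;
}
```

Keeping the decision in one function makes it easy to swap in measured results
per driver later, or to expose it as a tunable.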

What continues to be true regardless of vendor is that if you're packing a lot
of data inside a vertex buffer sourced by several different draws, it's best to
always keep your vertices aligned on a natural 16-byte boundary; this is even
true on mobile. This is less of a concern for GL_STATIC_DRAW, where the
implementation appears to do its own optimization of the initially specified
data anyway. Keep this in mind for streaming.
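
Keeping sub-allocations on a 16-byte boundary is a one-line rounding step. The
following minimal sketch shows it for a shared streaming buffer, with a
hypothetical `cursor` tracking the next free byte; the helper names are made up
for this example.

```c
#include <assert.h>
#include <stddef.h>

#define VERTEX_ALIGNMENT ((size_t)16)

/* Round `offset` up to the next multiple of `alignment` (a power of two). */
size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Reserve `size` bytes in a shared vertex buffer. Returns the 16-byte aligned
 * start offset and advances `cursor` past the reservation. */
size_t reserve_vertices(size_t *cursor, size_t size)
{
    assert(cursor != NULL);
    size_t start = align_up(*cursor, VERTEX_ALIGNMENT);
    *cursor = start + size;
    return start;
}
```

For example, reserving 20 bytes and then 12 bytes from an empty buffer yields
offsets 0 and 32: the second allocation skips the 12 bytes of padding needed to
land on the next boundary.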