fincs/notes.md Secret

## notes.md

      
Random PICA200 RE Notes

blah blah
Procedural Textures

PICA200 has Texture Unit 3, which implements configurable procedural textures. The texture coordinate inputs (u,v) undergo the following steps:

The absolute value is taken: (u,v) = (abs(u), abs(v))
Noise generation: basic Perlin-ish 2D noise is added to both inputs.
The absolute value is taken again.
Inputs are shifted according to one of 3 fixed modes.
Clipping/repeating/mirroring to make sure both inputs are in the [0,1] range.
Combining both inputs to a single value using a fixed mathematical function.
1D mapping function which transforms the combined input using a user-provided LUT.
Color table lookup, which is essentially a 1D color texture.

Noise generation

Input coordinates are modified according to this formula:
int prng17(int v)
{
    const int h[] = int[](0,4,10,8,4,9,7,12,5,15,13,14,11,15,2,11);
    return ((v%9+2)*3 & 0xF) ^ h[(v/9) & 0xF];
}

float picarand(vec2 point)
{
    const int h[] = int[](10,2,15,8,0,7,4,5,5,13,2,6,13,9,3,14);
    int u2 = prng17(int(point.x));
    int v2 = prng17(int(point.y));
    v2 += ((u2 & 3) == 1) ? 4 : 0;
    v2 ^= (u2 & 1) * 6;
    v2 += 10 + u2;
    v2 &= 0xF;
    v2 ^= h[u2];
    return -1.0 + float(v2)*2.0/15.0;
}

float noise_coef(vec2 x, vec2 freq, vec2 phase)
{
    vec2 grid  = 9.0*freq*abs(x + phase);
    vec2 point = floor(grid);
    vec2 frac  = grid-point;
    float v00  = picarand(point);
    float v01  = picarand(point + vec2(0.0,1.0));
    float v10  = picarand(point + vec2(1.0,0.0));
    float v11  = picarand(point + vec2(1.0,1.0));
    
    float g0 = v00*(frac.x + frac.y);
    float g1 = v10*(frac.x + frac.y - 1.0);
    float g2 = v01*(frac.x + frac.y - 1.0);
    float g3 = v11*(frac.x + frac.y - 2.0);
    float x0 = mix(g0,g1, noiselut(frac.x));
    float x1 = mix(g2,g3, noiselut(frac.x));
    return mix(x0,x1, noiselut(frac.y));
}

vec2 noise_engine(vec2 x, vec2 freq, vec2 ampl, vec2 phase)
{
    float noise = noise_coef(x, freq, phase);
    return x + ampl*noise;
}
Random numbers in the range [-1,1] are generated on a discrete grid using a PRNG algorithm. The offset and spacing of the PRNG grid are set by the phase and frequency parameters, respectively. Continuous noise values are interpolated from the nearest grid values using the user-provided Noise LUT (which usually contains the function 3x²-2x³). The interpolation is essentially classic 2D Perlin noise.
The PRNG algorithm internally generates 4-bit unsigned numbers, which are then expanded to the output range using the formula -1.0 + val*2.0/15.0. It has a period of 144 (16*9) in both directions. The PRNG algorithm was reverse-engineered through the joint efforts of @MerryMage, @wwylele and @fincs.
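The GLSL above can be cross-checked on the CPU. What follows is a minimal Python transcription, not a hardware-accurate model. It assumes non-negative grid coordinates and frequencies, and it assumes the Noise LUT holds exactly 3x²-2x³; the `noiselut` and `mix` helpers are stand-ins for the real table lookup and the GLSL built-in.

```python
import math

H1 = (0, 4, 10, 8, 4, 9, 7, 12, 5, 15, 13, 14, 11, 15, 2, 11)
H2 = (10, 2, 15, 8, 0, 7, 4, 5, 5, 13, 2, 6, 13, 9, 3, 14)

def prng17(v):
    # 1D hash producing a 4-bit value; v must be a non-negative integer.
    return (((v % 9 + 2) * 3) & 0xF) ^ H1[(v // 9) & 0xF]

def picarand(px, py):
    # 2D hash of an integer grid point, expanded from 4 bits to [-1, 1].
    u2 = prng17(px)
    v2 = prng17(py)
    v2 += 4 if (u2 & 3) == 1 else 0
    v2 ^= (u2 & 1) * 6
    v2 += 10 + u2
    v2 &= 0xF
    v2 ^= H2[u2]
    return -1.0 + v2 * 2.0 / 15.0

def noiselut(t):
    # Assumption: the user-provided Noise LUT contains 3t^2 - 2t^3.
    return t * t * (3.0 - 2.0 * t)

def mix(a, b, t):
    # Same meaning as GLSL mix(): linear blend from a to b.
    return a * (1.0 - t) + b * t

def noise_coef(x, freq, phase):
    # x, freq and phase are (u, v) pairs; freq is assumed non-negative.
    gx = 9.0 * freq[0] * abs(x[0] + phase[0])
    gy = 9.0 * freq[1] * abs(x[1] + phase[1])
    px, py = math.floor(gx), math.floor(gy)
    fx, fy = gx - px, gy - py
    g0 = picarand(px, py) * (fx + fy)
    g1 = picarand(px + 1, py) * (fx + fy - 1.0)
    g2 = picarand(px, py + 1) * (fx + fy - 1.0)
    g3 = picarand(px + 1, py + 1) * (fx + fy - 2.0)
    x0 = mix(g0, g1, noiselut(fx))
    x1 = mix(g2, g3, noiselut(fx))
    return mix(x0, x1, noiselut(fy))
```

With these definitions, `prng17(v + 144) == prng17(v)` for every non-negative `v`, which matches the stated period, and `noise_coef` evaluates to zero whenever the input lands exactly on a grid point.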
Shifting

Inputs are shifted according to the following modes:

GPU_PT_NONE: No shifting occurs.
GPU_PT_ODD: u += offset * ((floor(v) / 2) % 2) (and vice versa)
GPU_PT_EVEN: u += offset * (((floor(v) + 1) / 2) % 2) (and vice versa)

The shifting offset is 1 for mirrored repeat mode (see below), and 0.5 otherwise. The value of the opposite coordinate (which goes through the floor operation) is taken from the stage prior to applying noise.
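The shift rules can be sketched as a small Python helper. This is an illustration only: the mode strings are stand-ins for the GPU_PT_* constants, the function name is made up, and the division by 2 is assumed to be an integer division (otherwise the `% 2` would not yield 0 or 1).

```python
import math

def shift_offset(mode, other, mirrored):
    """Return the amount added to one coordinate, given the opposite one.

    mode:     "NONE", "ODD" or "EVEN" (stand-ins for GPU_PT_NONE/ODD/EVEN).
    other:    value of the opposite coordinate, taken before noise is applied.
    mirrored: True when the clipping mode is mirrored repeat.
    """
    offset = 1.0 if mirrored else 0.5
    row = math.floor(other)
    if mode == "ODD":
        # Assumption: integer division, so the factor is 0 or 1.
        return offset * ((row // 2) % 2)
    if mode == "EVEN":
        return offset * (((row + 1) // 2) % 2)
    return 0.0  # "NONE": no shifting
```

For example, with `other = 1.5` the ODD mode yields no shift, while the EVEN mode shifts by the full offset.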
Clipping

Inputs undergo clipping. There exist several modes:

GPU_PT_CLAMP_TO_ZERO: u,v get clamped to 0 if they are greater than 1.
GPU_PT_CLAMP_TO_EDGE: u,v get clamped to 1 if they are greater than 1.
GPU_PT_REPEAT: The fractional part of u,v is used (ignoring the integer part).
GPU_PT_MIRRORED_REPEAT: For both u,v: (integer_part%2)==0 ? fractional_part : (1.0-fractional_part)
GPU_PT_PULSE: Produces 1 if the input is greater than (or equal to?) 0.5; 0 otherwise.
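The clipping modes listed above can be written out per coordinate as follows. This is a sketch under assumptions: the mode strings stand in for the GPU_PT_* constants, inputs are taken to be non-negative (the absolute value was applied earlier), and the pulse comparison is treated as strict because the notes leave the equality case open.

```python
import math

def clamp_coord(mode, x):
    """Bring one non-negative coordinate into the [0, 1] range."""
    whole = math.floor(x)
    frac = x - whole
    if mode == "CLAMP_TO_ZERO":
        return 0.0 if x > 1.0 else x
    if mode == "CLAMP_TO_EDGE":
        return 1.0 if x > 1.0 else x
    if mode == "REPEAT":
        return frac
    if mode == "MIRRORED_REPEAT":
        return frac if whole % 2 == 0 else 1.0 - frac
    if mode == "PULSE":
        # Assumption: strict comparison; the ">= 0.5" case is unverified.
        return 1.0 if x > 0.5 else 0.0
    raise ValueError(mode)
```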

Combiner

A fixed function that maps the two inputs to a single output can be selected from the following candidates:

GPU_PT_U: u is used as the output.
GPU_PT_U2: u² is used as the output.
GPU_PT_V: v is used as the output.
GPU_PT_V2: v² is used as the output.
GPU_PT_ADD: (u+v)/2
GPU_PT_ADD2: (u²+v²)/2
GPU_PT_SQRT2: sqrt(u²+v²)
GPU_PT_MIN: min(u,v)
GPU_PT_MAX: max(u,v)
GPU_PT_RMAX: the average of GPU_PT_ADD and GPU_PT_SQRT2.

Note that outputs greater than 1 are clamped to 1 (this can happen in GPU_PT_SQRT2 and GPU_PT_RMAX).
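As a quick reference, the combiner functions can be expressed in Python as below. The mode strings are stand-ins for the GPU_PT_* constants, and the sketch assumes that GPU_PT_RMAX averages the unclamped GPU_PT_SQRT2 value, with the clamp to 1 applied once at the end.

```python
import math

# One entry per combiner mode; each maps (u, v) to a single value.
COMBINERS = {
    "U":     lambda u, v: u,
    "U2":    lambda u, v: u * u,
    "V":     lambda u, v: v,
    "V2":    lambda u, v: v * v,
    "ADD":   lambda u, v: (u + v) / 2.0,
    "ADD2":  lambda u, v: (u * u + v * v) / 2.0,
    "SQRT2": lambda u, v: math.sqrt(u * u + v * v),
    "MIN":   lambda u, v: min(u, v),
    "MAX":   lambda u, v: max(u, v),
    # Average of ADD and SQRT2 (assumed to use the unclamped SQRT2).
    "RMAX":  lambda u, v: ((u + v) / 2.0 + math.sqrt(u * u + v * v)) / 2.0,
}

def combine(mode, u, v):
    """Combine u and v (both in [0, 1]) and clamp the result to 1."""
    return min(COMBINERS[mode](u, v), 1.0)
```

Only SQRT2 and RMAX can exceed 1 for inputs in [0,1], so the final `min` is a no-op for the other modes.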
Mapper

The combined value in the range [0,1] is mapped to another value in the range [0,1] using a user-defined LUT. This can be used e.g. for creating smooth bands. The hardware supports two modes: either a single LUT is used to map values, or two LUTs are used to create two different values, one of which is used later to generate the RGB color (passed through the 1D color texture) and the other for the alpha (used directly).
1D color texture

The PICA200 has internal 1D color RAM that can store data for 256 colors (as well as deltas between them for use by linear interpolation). With the proctex registers, the user specifies the starting index within the color LUT for the 1D color texture, as well as its length (which can be any number). There also exists a mode involving 1D mipmaps, however it hasn't been tested yet.
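To illustrate how a color table with stored deltas lends itself to linear interpolation, here is a rough Python sketch. The function names are hypothetical, and the way the [0,1] input is scaled to an index (`t * (length - 1)`) is an assumption that has not been checked against hardware.

```python
def build_color_table(colors):
    """Return (values, deltas) where deltas[i] = colors[i + 1] - colors[i].

    The last entry gets a zero delta since it has no successor.
    """
    deltas = [tuple(b - a for a, b in zip(c0, c1))
              for c0, c1 in zip(colors, colors[1:])]
    deltas.append((0.0,) * len(colors[-1]))
    return list(colors), deltas

def color_lookup(values, deltas, start, length, t):
    """Sample the window [start, start + length) of the table at t in [0, 1]."""
    pos = t * (length - 1)          # assumed index scaling
    i = min(int(pos), length - 1)   # integer entry within the window
    f = pos - i                     # fractional position between entries
    base, delta = values[start + i], deltas[start + i]
    return tuple(b + f * d for b, d in zip(base, delta))
```

Storing the deltas means each lookup needs only one multiply-add per channel rather than reading two entries and subtracting.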
Shadow Mapping

PICA200 has a special mode for rendering shadow maps. This is basically the classic depth texture render algorithm, where the scene in light-space coordinates is rendered to a depth buffer. However, the PICA200 has some custom extensions that allow for soft shadow rendering.

Shadow texture rendering mode is enabled with GPU_FRAGOPMODE_SHADOW.
In this mode, the color buffer is used as depth buffer, and the depth buffer must be disabled/unbound (otherwise output will be glitched).
Color buffer format is specified as "RGBA8", although the real format is 3 bytes of big-endian depth information + 1 byte of soft shadow strength (same as D24S8 basically). Shadow strength is 0 for 100% shadow and 255 for 0% shadow.
The fragment stage outputs a color, but only the green component has any effect. When green is zero, hard shadows are rendered (the fragment depth field is updated). When green is non-zero, soft shadows are rendered (the green value is written to the soft-shadow-strength field, and the depth field is not updated at all).
Output stage settings appear to be completely ignored. Depth test appears to be hardcoded to GPU_LESS (therefore requiring appropriate settings of the depth map registers in order to convert [-1,0] into [0,1] and function properly). It's also necessary to disable perspective divide for the depth value (i.e. "W-buffering").
In order to render soft shadows, the DMP silhouette geometry shader is usually used.
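The texel layout and the hard/soft write rule described above can be modelled as follows. This is a simplified sketch with several assumptions: the three depth bytes are taken to come first in memory, followed by the strength byte; the function names are invented for illustration; and the soft-shadow path ignores the penumbra scaling and any depth test, since the notes do not pin those down for this step.

```python
def pack_shadow_texel(depth24, strength):
    # Assumed byte order: 3 bytes of big-endian depth, then 1 strength byte.
    return depth24.to_bytes(3, "big") + bytes([strength])

def unpack_shadow_texel(texel):
    return int.from_bytes(texel[:3], "big"), texel[3]

def write_shadow_fragment(texel, frag_depth24, green):
    """Apply one fragment to a shadow texel and return the new texel."""
    depth, strength = unpack_shadow_texel(texel)
    if green == 0:
        # Hard shadow: depth test hardcoded to GPU_LESS, then depth update.
        if frag_depth24 < depth:
            depth = frag_depth24
    else:
        # Soft shadow: only the strength field changes (simplified).
        strength = green
    return pack_shadow_texel(depth, strength)
```

Starting from a texel cleared to the farthest depth and strength 255 (0% shadow), a hard-shadow fragment only moves the depth closer, while a soft-shadow fragment leaves the depth untouched.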

When rendering with fragment lighting, it is possible to specify which texture unit should be the source of shadow map information, however only texture unit 0 supports shadow textures. If a non-shadow texture is used as shadow map, the fragment light color is instead multiplied directly by the color coming from the non-shadow texture; therefore bypassing the entire shadow mapping logic.
Several parameters are configurable, such as Z bias (used to prevent shadow acne when doing depth comparisons in fragment lighting), and "penumbra bias/scale" (which affects soft shadow rendering in shadow fragop mode). The latter works by multiplying the soft shadow strength by a factor with the formula 1 / (L + H * zDraw/zRef), where H/L are the higher/lower halfwords in GPUREG_FRAGOP_SHADOW, zDraw is the raw depth value of the fragment being rendered, and zRef is the depth value currently stored in the shadowmap. Alternatively this formula can be expressed as 1 / (bias + scale * (1 - zDraw/zRef)), since normally H = -scale and L = scale + bias.
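The two forms of the penumbra factor can be checked against each other numerically. The Python sketch below treats L and H as plain real numbers; how they are actually encoded in the halfwords of GPUREG_FRAGOP_SHADOW is not covered here, and the function names are made up for illustration.

```python
def penumbra_factor_raw(low, high, z_draw, z_ref):
    # Register form: 1 / (L + H * zDraw / zRef)
    return 1.0 / (low + high * z_draw / z_ref)

def penumbra_factor(bias, scale, z_draw, z_ref):
    # Equivalent form: 1 / (bias + scale * (1 - zDraw / zRef))
    return 1.0 / (bias + scale * (1.0 - z_draw / z_ref))

def to_register_values(bias, scale):
    # Normal mapping stated in the notes: L = scale + bias, H = -scale.
    return scale + bias, -scale
```

Substituting L = scale + bias and H = -scale gives L + H*r = bias + scale*(1 - r) with r = zDraw/zRef, so both functions agree. When zDraw equals zRef the factor reduces to 1 / bias.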
The texture unit receives the shadow texture coordinates through the usual means (texcoord0). These are the same light-space coordinates used in the shadowmap rendering stage; however, since they are also texture coordinates, they are in the [0,1] range instead of [-1,1] (so they must be converted accordingly by the vertex shader). The third texture component (texcoord0w) is used as the light-space depth for comparison with the value stored in the shadow texture, and it is also in the [0,1] range (for orthographic light-space projection matrices it suffices to pass -z). Perspective divide can optionally be enabled for shadow lookups in order to divide u and v by w. However, w is still used as the light-space depth, so the same scaling used during shadowmap rendering must still be applied to the w coordinate. The scaling must also be applied to u and v so that it cancels out in the division: (u/k)/(w/k) = u/w.
Shadow cubemap textures are also supported, and since they are cube textures the full set of texture coordinates (u,v,w) is used as a vector. The vector is used to select a texel across the 6 faces of the cubemap, as well as the depth value used for comparison (which corresponds to the greatest component in the vector). An important thing to mention is that in general, the PICA200 needs the faces of cube textures to be stored upside down for some reason (so this must be taken into account when rendering shadow cubemaps).
(Some information was figured out and kindly provided by @wwylele)
Gas

PICA200 has support for accumulative gas rendering. DMP has filed a patent that attempts to describe their algorithm; however, the document is really messy, suffers from OCR errors and is entirely in Japanese, so some effort is needed to decipher it.
...todo...
Random Quirks

This is a list of random PICA200 quirks that have been found while REing:

Changing the shader program (i.e. writing to vsh/gsh code/opdesc memory) requires reconfiguring GPUREG_ATTRIBBUFFERS_FORMAT_LOW / GPUREG_VSH_INPUTBUFFER_CONFIG / GPUREG_VSH_NUM_ATTR / GPUREG_VSH_ATTRIBUTES_PERMUTATION_LOW afterwards; otherwise the PICA can crash.
Two dummy writes to GPUREG_PRIMITIVE_CONFIG at the end of a DrawElements command sequence are absolutely required; otherwise the PICA can also crash when the shader configuration is changed.
Trying to update LUTs while their associated feature is turned off appears to fail. This was discovered while trying to preload proctex LUTs while the proctex unit was disabled. It is currently not known how each LUT is affected, so this should be tested more thoroughly.

Sources of PICA200 Information

The following sources were used in the task of REing the PICA200:

DMP papers:

Ohbuchi and Unno - "A Real-Time Configurable Shader Based on Lookup Tables"
Kazakov and Ohbuchi - "Primitive Processing and Advanced Shading Architecture for Embedded Space"
The gas rendering patent mentioned in the Gas section.
CEDEC 2012 Procedural Texture lecture


Games/applications:

Steel Diver Sub Wars - this game contains a list of symbols. Most of DMP/Nintendo's OpenGL ES 2.x driver was mapped using this game; including a full list of reserved DMP shader uniforms used to configure the various fixed function registers of the PICA200. The best part about this game is that it's a free download on the eShop so it's very easy to get it.
Brunswick Pro Bowling - this game contains an ELF with full DWARF2 information, including numerous structures and enumerations. However this game is a US-only retail title, so it is harder to obtain.
CTRAging - this is the factory testing program; some New 3DS units shipped with its full leftover data, so it could be recovered completely. This program is a goldmine of GPU information since it contains GPU tests for the more obscure features such as gas rendering, subdivision, particles, etc., including shaders and resources. This program doesn't have symbols, but a binary diff tool can be used to port over the (largely shared) OpenGL driver symbols for easy RE and xref in IDA.


Other sources:

A pastebin containing the full list of reserved DMP-suffixed GLenum values (useful for REing the games listed above) is easily found via searching for "GL_GEOMETRY_PRIMITIVE_DMP" (currently it's the second result on Google).