drhelius/Gameboy's video hardware

## Gameboy's video hardware
Nitty Gritty Gameboy Cycle Timing
                  ---------------------------------


A document about the down and dirty timing of the Gameboy's video hardware.

Written by:  Kevin Horton
Version: 0.01 (preliminary)

My findings here are based on the original DMG, Super Gameboy, and GB
Pocket.  All three appear to behave identically during testing, and the
SGB was used for all the reverse engineering.

An HP54645D mixed signal oscilloscope/logic analyzer was connected to the
SGB using a 16 wire pod.  A 20 pin ribbon cable with IDC plug on the end
was soldered to various points on the SGB, and a pin header was inserted
into the IDC plug so that the pod connectors could be plugged in to
monitor the goings-on of the hardware.

---


I have discovered some interesting things about how the Gameboy fetches
VRAM data in general.  First, it will actually stop clocking the LCD and
stall it if it needs to fetch something and is not ready to send the
data out quite yet.

Secondly, the window function is a restarting of the data fetching
state machine, which is used to read the background tiles.  The window
is triggered N clocks after the start of rendering, where N is determined
by the value in the xwindow register.

So, without further delay...

Scanline timing
---------------

During the discussion of scanline timing, I will be ignoring Y timing
totally, since Y timing is unrelated to VRAM access patterns.  The Y
timing only affects WHICH KIND of VRAM access occurs, and does not
affect it in any other way.

There are a couple cases that will be discussed, from simplest to most
complex.

* * * * *

The first case is what I call the degenerate case: xwindow is set to 0ffh
which disables the window totally, and then xscroll is adjusted.

There are only 8 different possible cases.

These types of access take from 173.5 to 180.5 cycles.  The reasoning for
the half cycle will be described later.

The access pattern looks like this:

B01   - (6 cycles) fetch Background nametable byte, then 2 tile planes
B01s  - (167.5 + (xscroll % 7) cycles) fetch another tile and sprite
			window.

Where:
B = reading the background tile # (i.e. out of 1800h from the first
nametable)
0,1 = where the tile graphics are fetched.  bitplanes 0 and 1.
s = sprite window.  the sprite hardware will insert reads here if needed.

Each access to VRAM (B, 0, 1, s) takes 2 cycles to occur.  A "cycle" is
exactly 1 period of the main input clock to the gameboy CPU chip.  This
is nominally 4.19MHz approximately.

The last four accesses (B01s) is repeated until the proper number of
cycles has elapsed.

In the xscroll = 0 case, it will run for 167.5 cycles.  This has an
interesting side effect- any tile access that is not complete just
gets unceremoniously cut off. This means that there will be 20 complete
tile accesses (B01s) and then 7.5 clocks worth of a 21st access,
cutting off the last half cycle of the sprite window.

In the xscroll = 2 case, it will run for 169.5 cycles. Similar to above,
this will result in 21 complete B01s accesses, and then 1.5 cycles of the
B fetch on a 22nd access.

This pattern repeats until xscroll = 7 (taking a total of 180.5 cycles)
until snapping back to 173.5 cycles when xscroll = 8.

The total number of cycles taken is (173.5 + (xscroll % 7)).

Now, you have been wondering what this extra 1/2 cycle business is about.
Well, it has to do with how the display clock is generated.  The display
clock is generated via inverting the main input clock.  I suspect it was
done so that the video hardware can get the data to the LCD ready on the
falling edge of the main clock.  The display clocks data in on the RISING
edge of the display clock, thus necessitating the inverted display clock
relative to the main clock.

That causes the vram access pattern to be extended 1/2 clock on the end
to accommodate the inverted clock.

* * * * *

So, that takes care of the easy case.  Now for what happens when the
window is used.

NOTE: When xwindow = 00h or xwindow = 0a6h, different things happen.
I will explain them later on.  For now, the following information
holds when ((xwindow > 00h) && (xwindow < 0a6h)).

Interestingly, adding the window does not change a whole lot- in fact,
it simply restarts the whole fetch sequence over again, no matter where
it was!  The timing is generated like so:

B01   (6 cycles) fetch background tile nametable+bitplanes
B01_  (1 to 172 cycles) ((xscroll % 7) + xwindow + 1)
W01   (6 cycles) fetch window tile nametable+bitplanes
W01_  (1.5 to 166.5 cycles)  (166.5 - xwindow)

As can be seen, it's very similar to the first case.  Only now,
the number of B12_ accesses is controlled by the xscroll value
and the xwindow value.  As before, when the number of cycles has
elapsed, the access pattern is just cut short, and the W12 (W = window
nametable entry) access starts.

This window access pattern is identical to the background one, except
the window nametable is being accessed instead.

Turning on the window incurs a 6 cycle penalty, so the total number of
cycles taken is (173.5 + 6 + (xscroll % 7)).

* * * * *

OK, now things get slightly strange.  When xwindow = 0, some slightly
different rules come into effect.

When (xscroll % 7) = 0 to 6, things work a bit different.  Timing
looks like this:

B01B (7 cycles) technically the last B is part of the B01s pattern.
W01  (6 cycles) as above, the start of the window access pattern
W01s (167.5 to 173.5 cycles) (167.5 + (scroll % 7))

As before, the W01s pattern is repeated for the required number of
cycles.  When the count has expired, the access pattern is just cut off.

This takes 180.5 to 186.5 cycles.

When (xscroll % 7) = 7, then the timing is slightly modified version
of the above.  The access pattern is identical to when (xscroll % 7) = 6
except an extra cycle is inserted in the first sprite window, causing
the total amount to be 187.5 cycles.

* * * * *

When xwindow = 0a6h, then timing is identical to when the window is
disabled, i.e. 173.5 to 180.5 cycles.  The difference is the window
nametable is used instead of the background nametable.  Rendering
starts from the SECOND tile of each line, however.  The net effect of
this is, the window register appears to be scrolled 8 pixels to the right
(if xscroll = 0).

Only the lower 3 bits of the xscroll register are used in this mode.
It shifts the the window left 0-7 cycles.   Taking into account the first
paragraph above about the nametable, the net effect is the window appears
to be xscrolled 8 to 15 pixels left.

Effective xscroll = (xscroll % 7) + 8

The top scanline of the screen is ALWAYS reflecting the very first
scanline of the window, when ywindow is less than or equal to 08fh.

The second scanline of the screen will reflect the second scanline of
the window, but ONLY when ywindow = 00h.  Any other ywindow value will
result in the background showing for the first scanline.

The other scanlines of the screen (lines 3-144) will show the window,
*starting from the third window scanline* depending on the ywindow value.

This is hard to describe, but the effect is simple:

GB line:   (ywindow = 0)
1   window 1
2   window 2
3   window 3
4   window 4

GB line:   (ywindow = 1)
1   window 1
2   background 2
3   window 3
4   window 4

GB line:   (ywindow = 2)
1   window 1
2   background 2
3   background 3
4   window 3
5   window 4
6   window 5

GB line:   (ywindow = 3)
1   window 1
2   background 2
3   background 3
4   background 4
5   window 3
6   window 4

----


What addresses are read during rendering
----------------------------------------

Referring back to the fetch patterns above, I will go through what
addresses are read.

For the degenerate case, the access pattern looks like this:

B01
B01s (repeated 20-22x)

Assuming that xscroll = 0, and yscroll = 0, and we have the background
reading the 9800h nametable:

B01  (reads 9800h)
B01s (reads 9800h, 9801h, 9802h, 9803h...9814h)

This is fairly simple: it just starts at the very upper left char of the
nametable and starts reading.   It ends up reading 9800h TWICE.  The
first access is just thrown away and is never used.  It's here, because
it helps during windowing (I will describe later in the window section).

The way the characters are read is performed something like this:

At the start of the scanline:

1) latch the current character address
2) read a character from the address
3) read another character from the SAME address
4) increment address
5) read another character
6) repeat 4 and 5 20-22 times.

In step 1, the address we latch is calculated like this:

yscroll is the yscroll register on the GB CPU
xscroll is the xscroll register on the GB CPU
ycounter is the current scanline we are rendering (0-143)
whichnt is bit 3 of the LCD control reg on the GB CPU

ybase = (yscroll + ycounter)   // calculates the effective vis. scanline

charaddr = (0x9800 | (whichnt << 10) | ((ybase & 0xf8) << 2) |
		   ((xscroll & 0xf8) >> 3)

Another way to represent this address:

15                0
-------------------
1001 1NYY YYYX XXXX

N = nametable #
Y = upper 5 bits of ybase
X = upper 5 bits of xscroll (which is then incremented between chars)


In step 2, we read from charaddr,  and throw the result away
In step 3, we read from charaddr again and use it for the first vis. char
In step 4, *only the lower 5 bits* of charaddr is incremented
In step 5, we read the next character

Then, we repeat it enough times to fill out the scanline.

Once the nametable entry is fetched, we have to fetch the tile planes.

Depending on the state of the "BG & Window Tile Data Select" register,
which is LCD control bit 4, tile accesses are done one of two ways.


ntbyte = the nametable byte we read from the above NT address.

if (lcdcontrol[4]) tileaddr = (ntbyte << 4) | ((ysub & 0x7) << 1)
else tileaddr = (0x1000 - (ntbyte << 4)) | ((ysub & 0x7) << 1)

We will read the desired bytes for the tile data from tileaddr and
tileaddr+1

Notice that xscroll's lower 3 bits don't SEEM to play into any of the
calculations above... this is because xscroll[2:0] does not affect
which characters are fetched in any way.  Fine xscroll (lower 3 bits
of xscroll) only adjust the timing during LCD writing (explained
below).


* * * * *

So, this is all fine and good.. but what happens during windowing?
It's not much different than the above.

During a typical VRAM access pattern with the window, it looks something
like this:

B01
B01s (repeated N times)
W01
W01s (repeated M times)

The background rendering sequence is identical to the background only
sequence described previously.

When the window accesses start, the address calculation is similar...

First, a typical reading sequence:

B01  (9800h)
B01s (9800h, 9801h, 9802h, ...)
W01  (9C00h)
W01s (9C01h, 9C02h, ...)

The first change is that the first W01 access is NOT thrown away.
There is no duplicated read here as in the background read.

The nametable read is calculated like so:

windnt = the window nametable (LCD control bit 6)

basew = (ycount - yscroll)   // calculates the effective window scanline

charaddr = (0x9800 | (windnt << 10) | ((basew & 0xf8) << 2)

Another way to represent this address:

15                0
-------------------
1001 1NYY YYY0 0000

N = nametable #
Y = upper 5 bits of basew


We simply read characters starting from the charaddr address, and
increment it each time we read a character until the scanline is
finished.

The tile plane address is calculated the same as it is calculated for
the background reads.


That wraps up the actual VRAM access patterns.

---

LCD write timing
----------------

Before I can describe how the LCD timing works, I have to first explain
how the LCD itself works.

The LCD is composed of a 2 bit wide by 159 bit deep shift register,
where the input pixels are shifted.  Each rising edge of the display
clock, data is shifted one stage down the register.

When the display latch signal is activated, this shift register's value
is latched into the LCD column drivers.

The shift register is only 159 bits- the input data is used as the 160th
bit for latching into the LCD column drivers.

* The first pixel shifted into the register appears on the first column
* The last pixel shifted in appears on the second to last column
* The input pixel data on the input lines appears on the first column

Terrible ASCII:


DCLK:    the display pixel clock
data0/1: the 2 bit pixel data
lat:     the latch signal.  when pulsed, latches the shift reg. data
bias:    the LCD bias voltage (contrast wheel adjusts this)
inv:     the LCD inverse signal (explained later)


                 +-----------------+
LCD DCLK o-------|CLK              |
                 |                 |
                 |  159*2 bit S/R  |
LCD data0 o---*--|D0               |
LCD data1 o-*-+--|D1               |
            | |  |   pix 1 ->  159 |
            | |  +-----------------+
            | |    | ........... |
          +--------------------------+
          | pix 0     pix 1 ->  159  |
          |                          |
LCD lat o-|latch    160*2 bit latch  |
          |                          |
          |      160*2 outputs       |
          +--------------------------+
            | .................... |
          +--------------------------+
          |                          |
          |    160 output drivers    |
LCDbias o-|bias                      |
LCD inv o-|invert                    |
          |                          |
          |        LCD outputs       |
          +--------------------------+
            ||||||||||||||||||||||||
       +---------------------------------+
       |      col 159   <-    col 0      |
       |           LCD columns           |
       |                                 |
       |                                 |
       |           LCD display           |
       |                                 |
       |                                 |
       |                                 |
       |                                 |
       |                                 |
       +---------------------------------+


So now that the LCD column driving and latching has been explained,
the display timing that follows should make a bit more sense.

Because the LCD is clocked, unlike a CRT, this means that the hardware
has the ability to stop clocking the LCD for awhile if it feels like it.
The GB video hardware indeed does do this, and even uses it to advantage
during scrolling, sprite fetching, and starting the window rendering.

When windowing is disabled, the display clock always runs for the last
159 cycles.  This is very interesting to me, because that means the
video hardware is actively shifting pixels out to the display, but
some of these pixels DO NOT HAVE A CORRESPONDING DCLK!  This is how
fine X scroll is achieved- the first 0 to 7 pixels are just thrown away.
They get shifted out, but since the display clock is not running, they
do not get shifted in.  By delaying these cycles, the display data will
shift left from 0 to 7 pixels.

The windowing function works the same way- the pixel where the window
is started will restart the rendering engine and thus allow single pixel
precision on where the window starts on the LCD.

During the first 6 cycles of the window fetch, the LCD clock is stalled.
This lets the pipeline fill and then display clocking resumes after the
6 cycle delay.

That takes care of the timing of display data clocking.

Now for the interesting part about how data is read and shifted out the
LCD data pins:

When the nametable fetch starts, the LAST tile data read will be latched
into two 8 bit shift registers, and then shift out the data pins one
pixel per clock.  No matter what.

So, referring to the access pattern again:

B01   read the first tile
B01s  latch the pixel data into the output shift registers
B01s  latch the pixel data into the output shift registers
B01s...

Since each B01s access takes exactly 8 cycles, the output shift registers
will be exactly refilled when they are empty, and continue the output
data sending without interruption.

Fine xscroll is effected by controlling the point in this process where
the display clocking is started relative to the start of the rendering
phase.  The data will always shift out the pixel data pins at the same
point in the render cycle, but since the DCLK is started earlier or
later, the point where the LCD starts latching data changes relative to
the data.  This causes a 0-7 shift in the data on the LCD.

The afore-mentioned output shift registers will blindly shift out their
8 pixels of data without stopping, except when the LCD hardware is
stalled by a sprite fetch (described later).  Thus, the timing of the
VRAM reads determines the amount of fine xscroll on the background
and on the window.

---
	Nitty Gritty Gameboy Cycle Timing
	---------------------------------


	A document about the down and dirty timing of the Gameboy's video hardware.

	Written by: Kevin Horton
	Version: 0.01 (preliminary)

	My findings here are based on the original DMG, Super Gameboy, and GB
	Pocket. All three appear to behave identically during testing, and the
	SGB was used for all the reverse engineering.

	An HP54645D mixed signal oscilloscope/logic analyzer was connected to the
	SGB using a 16 wire pod. A 20 pin ribbon cable with IDC plug on the end
	was soldered to various points on the SGB, and a pin header was inserted
	into the IDC plug so that the pod connectors could be plugged in to
	monitor the goings-on of the hardware.

	---


	I have discovered some interesting things about how the Gameboy fetches
	VRAM data in general. First, it will actually stop clocking the LCD and
	stall it if it needs to fetch something and is not ready to send the
	data out quite yet.

	Secondly, the window function is a restarting of the data fetching
	state machine, which is used to read the background tiles. The window
	is triggered N clocks after the start of rendering, where N is determined
	by the value in the xwindow register.

	So, without further delay...

	Scanline timing
	---------------

	During the discussion of scanline timing, I will be ignoring Y timing
	totally, since Y timing is unrelated to VRAM access patterns. The Y
	timing only affects WHICH KIND of VRAM access occurs, and does not
	affect it in any other way.

	There are a couple cases that will be discussed, from simplest to most
	complex.

	* * * * *

	The first case is what I call the degenerate case: xwindow is set to 0ffh
	which disables the window totally, and then xscroll is adjusted.

	There are only 8 different possible cases.

	These types of access take from 173.5 to 180.5 cycles. The reasoning for
	the half cycle will be described later.

	The access pattern looks like this:

	B01 - (6 cycles) fetch Background nametable byte, then 2 tile planes
	B01s - (167.5 + (xscroll % 7) cycles) fetch another tile and sprite
	window.

	Where:
	B = reading the background tile # (i.e. out of 1800h from the first
	nametable)
	0,1 = where the tile graphics are fetched. bitplanes 0 and 1.
	s = sprite window. the sprite hardware will insert reads here if needed.

	Each access to VRAM (B, 0, 1, s) takes 2 cycles to occur. A "cycle" is
	exactly 1 period of the main input clock to the gameboy CPU chip. This
	is nominally 4.19MHz approximately.

	The last four accesses (B01s) is repeated until the proper number of
	cycles has elapsed.

	In the xscroll = 0 case, it will run for 167.5 cycles. This has an
	interesting side effect- any tile access that is not complete just
	gets unceremoniously cut off. This means that there will be 20 complete
	tile accesses (B01s) and then 7.5 clocks worth of a 21st access,
	cutting off the last half cycle of the sprite window.

	In the xscroll = 2 case, it will run for 169.5 cycles. Similar to above,
	this will result in 21 complete B01s accesses, and then 1.5 cycles of the
	B fetch on a 22nd access.

	This pattern repeats until xscroll = 7 (taking a total of 180.5 cycles)
	until snapping back to 173.5 cycles when xscroll = 8.

	The total number of cycles taken is (173.5 + (xscroll % 7)).

	Now, you have been wondering what this extra 1/2 cycle business is about.
	Well, it has to do with how the display clock is generated. The display
	clock is generated via inverting the main input clock. I suspect it was
	done so that the video hardware can get the data to the LCD ready on the
	falling edge of the main clock. The display clocks data in on the RISING
	edge of the display clock, thus necessitating the inverted display clock
	relative to the main clock.

	That causes the vram access pattern to be extended 1/2 clock on the end
	to accommodate the inverted clock.

	* * * * *

	So, that takes care of the easy case. Now for what happens when the
	window is used.

	NOTE: When xwindow = 00h or xwindow = 0a6h, different things happen.
	I will explain them later on. For now, the following information
	holds when ((xwindow > 00h) && (xwindow < 0a6h)).

	Interestingly, adding the window does not change a whole lot- in fact,
	it simply restarts the whole fetch sequence over again, no matter where
	it was! The timing is generated like so:

	B01 (6 cycles) fetch background tile nametable+bitplanes
	B01_ (1 to 172 cycles) ((xscroll % 7) + xwindow + 1)
	W01 (6 cycles) fetch window tile nametable+bitplanes
	W01_ (1.5 to 166.5 cycles) (166.5 - xwindow)

	As can be seen, it's very similar to the first case. Only now,
	the number of B12_ accesses is controlled by the xscroll value
	and the xwindow value. As before, when the number of cycles has
	elapsed, the access pattern is just cut short, and the W12 (W = window
	nametable entry) access starts.

	This window access pattern is identical to the background one, except
	the window nametable is being accessed instead.

	Turning on the window incurs a 6 cycle penalty, so the total number of
	cycles taken is (173.5 + 6 + (xscroll % 7)).

	* * * * *

	OK, now things get slightly strange. When xwindow = 0, some slightly
	different rules come into effect.

	When (xscroll % 7) = 0 to 6, things work a bit different. Timing
	looks like this:

	B01B (7 cycles) technically the last B is part of the B01s pattern.
	W01 (6 cycles) as above, the start of the window access pattern
	W01s (167.5 to 173.5 cycles) (167.5 + (scroll % 7))

	As before, the W01s pattern is repeated for the required number of
	cycles. When the count has expired, the access pattern is just cut off.

	This takes 180.5 to 186.5 cycles.

	When (xscroll % 7) = 7, then the timing is slightly modified version
	of the above. The access pattern is identical to when (xscroll % 7) = 6
	except an extra cycle is inserted in the first sprite window, causing
	the total amount to be 187.5 cycles.

	* * * * *

	When xwindow = 0a6h, then timing is identical to when the window is
	disabled, i.e. 173.5 to 180.5 cycles. The difference is the window
	nametable is used instead of the background nametable. Rendering
	starts from the SECOND tile of each line, however. The net effect of
	this is, the window register appears to be scrolled 8 pixels to the right
	(if xscroll = 0).

	Only the lower 3 bits of the xscroll register are used in this mode.
	It shifts the the window left 0-7 cycles. Taking into account the first
	paragraph above about the nametable, the net effect is the window appears
	to be xscrolled 8 to 15 pixels left.

	Effective xscroll = (xscroll % 7) + 8

	The top scanline of the screen is ALWAYS reflecting the very first
	scanline of the window, when ywindow is less than or equal to 08fh.

	The second scanline of the screen will reflect the second scanline of
	the window, but ONLY when ywindow = 00h. Any other ywindow value will
	result in the background showing for the first scanline.

	The other scanlines of the screen (lines 3-144) will show the window,
	starting from the third window scanline depending on the ywindow value.

	This is hard to describe, but the effect is simple:

	GB line: (ywindow = 0)
	1 window 1
	2 window 2
	3 window 3
	4 window 4

	GB line: (ywindow = 1)
	1 window 1
	2 background 2
	3 window 3
	4 window 4

	GB line: (ywindow = 2)
	1 window 1
	2 background 2
	3 background 3
	4 window 3
	5 window 4
	6 window 5

	GB line: (ywindow = 3)
	1 window 1
	2 background 2
	3 background 3
	4 background 4
	5 window 3
	6 window 4

	----


	What addresses are read during rendering
	----------------------------------------

	Referring back to the fetch patterns above, I will go through what
	addresses are read.

	For the degenerate case, the access pattern looks like this:

	B01
	B01s (repeated 20-22x)

	Assuming that xscroll = 0, and yscroll = 0, and we have the background
	reading the 9800h nametable:

	B01 (reads 9800h)
	B01s (reads 9800h, 9801h, 9802h, 9803h...9814h)

	This is fairly simple: it just starts at the very upper left char of the
	nametable and starts reading. It ends up reading 9800h TWICE. The
	first access is just thrown away and is never used. It's here, because
	it helps during windowing (I will describe later in the window section).

	The way the characters are read is performed something like this:

	At the start of the scanline:

	1) latch the current character address
	2) read a character from the address
	3) read another character from the SAME address
	4) increment address
	5) read another character
	6) repeat 4 and 5 20-22 times.

	In step 1, the address we latch is calculated like this:

	yscroll is the yscroll register on the GB CPU
	xscroll is the xscroll register on the GB CPU
	ycounter is the current scanline we are rendering (0-143)
	whichnt is bit 3 of the LCD control reg on the GB CPU

	ybase = (yscroll + ycounter) // calculates the effective vis. scanline

	charaddr = (0x9800 \| (whichnt << 10) \| ((ybase & 0xf8) << 2) \|
	((xscroll & 0xf8) >> 3)

	Another way to represent this address:

	15 0
	-------------------
	1001 1NYY YYYX XXXX

	N = nametable #
	Y = upper 5 bits of ybase
	X = upper 5 bits of xscroll (which is then incremented between chars)


	In step 2, we read from charaddr, and throw the result away
	In step 3, we read from charaddr again and use it for the first vis. char
	In step 4, only the lower 5 bits of charaddr is incremented
	In step 5, we read the next character

	Then, we repeat it enough times to fill out the scanline.

	Once the nametable entry is fetched, we have to fetch the tile planes.

	Depending on the state of the "BG & Window Tile Data Select" register,
	which is LCD control bit 4, tile accesses are done one of two ways.


	ntbyte = the nametable byte we read from the above NT address.

	if (lcdcontrol[4]) tileaddr = (ntbyte << 4) \| ((ysub & 0x7) << 1)
	else tileaddr = (0x1000 - (ntbyte << 4)) \| ((ysub & 0x7) << 1)

	We will read the desired bytes for the tile data from tileaddr and
	tileaddr+1

	Notice that xscroll's lower 3 bits don't SEEM to play into any of the
	calculations above... this is because xscroll[2:0] does not affect
	which characters are fetched in any way. Fine xscroll (lower 3 bits
	of xscroll) only adjust the timing during LCD writing (explained
	below).


	* * * * *

	So, this is all fine and good.. but what happens during windowing?
	It's not much different than the above.

	During a typical VRAM access pattern with the window, it looks something
	like this:

	B01
	B01s (repeated N times)
	W01
	W01s (repeated M times)

	The background rendering sequence is identical to the background only
	sequence described previously.

	When the window accesses start, the address calculation is similar...

	First, a typical reading sequence:

	B01 (9800h)
	B01s (9800h, 9801h, 9802h, ...)
	W01 (9C00h)
	W01s (9C01h, 9C02h, ...)

	The first change is that the first W01 access is NOT thrown away.
	There is no duplicated read here as in the background read.

	The nametable read is calculated like so:

	windnt = the window nametable (LCD control bit 6)

	basew = (ycount - yscroll) // calculates the effective window scanline

	charaddr = (0x9800 \| (windnt << 10) \| ((basew & 0xf8) << 2)

	Another way to represent this address:

	15 0
	-------------------
	1001 1NYY YYY0 0000

	N = nametable #
	Y = upper 5 bits of basew


	We simply read characters starting from the charaddr address, and
	increment it each time we read a character until the scanline is
	finished.

	The tile plane address is calculated the same as it is calculated for
	the background reads.


	That wraps up the actual VRAM access patterns.

	---

	LCD write timing
	----------------

	Before I can describe how the LCD timing works, I have to first explain
	how the LCD itself works.

	The LCD is composed of a 2 bit wide by 159 bit deep shift register,
	where the input pixels are shifted. Each rising edge of the display
	clock, data is shifted one stage down the register.

	When the display latch signal is activated, this shift register's value
	is latched into the LCD column drivers.

	The shift register is only 159 bits- the input data is used as the 160th
	bit for latching into the LCD column drivers.

	* The first pixel shifted into the register appears on the first column
	* The last pixel shifted in appears on the second to last column
	* The input pixel data on the input lines appears on the first column

	Terrible ASCII:


	DCLK: the display pixel clock
	data0/1: the 2 bit pixel data
	lat: the latch signal. when pulsed, latches the shift reg. data
	bias: the LCD bias voltage (contrast wheel adjusts this)
	inv: the LCD inverse signal (explained later)


	+-----------------+
	LCD DCLK o-------\|CLK \|
	\| \|
	\| 159*2 bit S/R \|
	LCD data0 o---*--\|D0 \|
	LCD data1 o-*-+--\|D1 \|
	\| \| \| pix 1 -> 159 \|
	\| \| +-----------------+
	\| \| \| ........... \|
	+--------------------------+
	\| pix 0 pix 1 -> 159 \|
	\| \|
	LCD lat o-\|latch 160*2 bit latch \|
	\| \|
	\| 160*2 outputs \|
	+--------------------------+
	\| .................... \|
	+--------------------------+
	\| \|
	\| 160 output drivers \|
	LCDbias o-\|bias \|
	LCD inv o-\|invert \|
	\| \|
	\| LCD outputs \|
	+--------------------------+
	\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|\|
	+---------------------------------+
	\| col 159 <- col 0 \|
	\| LCD columns \|
	\| \|
	\| \|
	\| LCD display \|
	\| \|
	\| \|
	\| \|
	\| \|
	\| \|
	+---------------------------------+


	So now that the LCD column driving and latching has been explained,
	the display timing that follows should make a bit more sense.

	Because the LCD is clocked, unlike a CRT, this means that the hardware
	has the ability to stop clocking the LCD for awhile if it feels like it.
	The GB video hardware indeed does do this, and even uses it to advantage
	during scrolling, sprite fetching, and starting the window rendering.

	When windowing is disabled, the display clock always runs for the last
	159 cycles. This is very interesting to me, because that means the
	video hardware is actively shifting pixels out to the display, but
	some of these pixels DO NOT HAVE A CORRESPONDING DCLK! This is how
	fine X scroll is achieved- the first 0 to 7 pixels are just thrown away.
	They get shifted out, but since the display clock is not running, they
	do not get shifted in. By delaying these cycles, the display data will
	shift left from 0 to 7 pixels.

	The windowing function works the same way- the pixel where the window
	is started will restart the rendering engine and thus allow single pixel
	precision on where the window starts on the LCD.

	During the first 6 cycles of the window fetch, the LCD clock is stalled.
	This lets the pipeline fill and then display clocking resumes after the
	6 cycle delay.

	That takes care of the timing of display data clocking.

	Now for the interesting part about how data is read and shifted out the
	LCD data pins:

	When the nametable fetch starts, the LAST tile data read will be latched
	into two 8 bit shift registers, and then shift out the data pins one
	pixel per clock. No matter what.

	So, referring to the access pattern again:

	B01 read the first tile
	B01s latch the pixel data into the output shift registers
	B01s latch the pixel data into the output shift registers
	B01s...

	Since each B01s access takes exactly 8 cycles, the output shift registers
	will be exactly refilled when they are empty, and continue the output
	data sending without interruption.

	Fine xscroll is effected by controlling the point in this process where
	the display clocking is started relative to the start of the rendering
	phase. The data will always shift out the pixel data pins at the same
	point in the render cycle, but since the DCLK is started earlier or
	later, the point where the LCD starts latching data changes relative to
	the data. This causes a 0-7 shift in the data on the LCD.

	The afore-mentioned output shift registers will blindly shift out their
	8 pixels of data without stopping, except when the LCD hardware is
	stalled by a sprite fetch (described later). Thus, the timing of the
	VRAM reads determines the amount of fine xscroll on the background
	and on the window.

	---