@sliminality
Created November 14, 2018 06:40
Documentation for the Across Lite *.puz format, reformatted from https://code.google.com/archive/p/puz/wikis/FileFormat.wiki

.puz file format documentation

Reformatted from original source: https://code.google.com/archive/p/puz/wikis/FileFormat.wiki

Header

Define a short to be a little-endian two-byte integer. The file header is then described in the following tables.

| Component | Offset | End | Length | Type | Description |
|---|---|---|---|---|---|
| Checksum | 0x00 | 0x01 | 0x2 | short | overall file checksum |
| File Magic | 0x02 | 0x0D | 0xC | string | NUL-terminated constant string: 4143 524f 5353 2644 4f57 4e00 ("ACROSS&DOWN") |

The following checksums are described in more detail in a separate section below.

| Component | Offset | End | Length | Type | Description |
|---|---|---|---|---|---|
| CIB Checksum | 0x0E | 0x0F | 0x2 | short | (defined later) |
| Masked Low Checksums | 0x10 | 0x13 | 0x4 | | A set of checksums, XOR-masked against a magic string. |
| Masked High Checksums | 0x14 | 0x17 | 0x4 | | A set of checksums, XOR-masked against a magic string. |

| Component | Offset | End | Length | Type | Description |
|---|---|---|---|---|---|
| Version String(?) | 0x18 | 0x1B | 0x4 | string | e.g. "1.2\0" |
| Reserved1C(?) | 0x1C | 0x1D | 0x2 | ? | In many files, this is uninitialized memory |
| Scrambled Checksum | 0x1E | 0x1F | 0x2 | short | In scrambled puzzles, a checksum of the real solution (details below). Otherwise, 0x0000. |
| Width | 0x2C | 0x2C | 0x1 | byte | The width of the board |
| Height | 0x2D | 0x2D | 0x1 | byte | The height of the board |
| # of Clues | 0x2E | 0x2F | 0x2 | short | The number of clues for this board |
| Unknown Bitmask | 0x30 | 0x31 | 0x2 | short | A bitmask. Operations unknown. |
| Scrambled Tag | 0x32 | 0x33 | 0x2 | short | 0 for unscrambled puzzles. Nonzero (often 4) for scrambled puzzles. |
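
As a rough illustration of the header layout, here is a minimal Python sketch that unpacks the fields with the standard struct module. The field names and the parse_header helper are made up for this example. Bytes 0x20-0x2B are not described in the tables above, so the sketch simply skips them.

```python
import struct

# Field layout follows the header tables above. Bytes 0x20-0x2B are not
# described there, so this sketch skips them with a 12-byte pad ("12x").
HEADER_FORMAT = "<H12sH4s4s4s2sH12xBBHHH"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 0x34 bytes in total
FIELD_NAMES = (
    "checksum", "magic", "cib_checksum", "masked_low", "masked_high",
    "version", "reserved_1c", "scrambled_checksum",
    "width", "height", "n_clues", "bitmask", "scrambled_tag",
)


def parse_header(data: bytes) -> dict:
    """Unpack the fixed-size header into a dict (hypothetical helper)."""
    values = struct.unpack_from(HEADER_FORMAT, data, 0)
    header = dict(zip(FIELD_NAMES, values))
    if header["magic"] != b"ACROSS&DOWN\x00":
        raise ValueError("bad file magic; not a .puz file?")
    return header
```

The "<" prefix selects little-endian values with no padding, which matches the definition of a short given above.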

Puzzle Layout and State

Next come the board solution and player state. (If a player works on a puzzle and then saves their game, the cells they've filled are stored in the state. Otherwise the state is all blank cells and contains a subset of the information in the solution.)

Boards are stored as a single string of ASCII, with one character per cell of the board beginning at the top-left and scanning in reading order, left to right then top to bottom. We'll use this board as a running example (where # represents a black cell, and the letters are the filled-in solution).

C A T
# # A
# # R

At the end of the header (offset 0x34) comes the solution to the puzzle. Non-playable (i.e., black) cells are denoted by a period (.). So for this example, the board is stored as nine bytes: CAT..A..R

Next comes the player state, stored similarly. Empty cells are stored as a hyphen (-), so the example board before any cells have been filled in is stored as these nine bytes: ---..-..-
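
A small Python sketch of reading the two boards, assuming the width and height have already been taken from the header. The helper names read_boards and rows are invented for this example.

```python
HEADER_SIZE = 0x34  # the solution board starts right after the header


def read_boards(data: bytes, width: int, height: int):
    """Return (solution, state) as strings of width*height characters."""
    n_cells = width * height
    solution = data[HEADER_SIZE:HEADER_SIZE + n_cells].decode("ascii")
    state = data[HEADER_SIZE + n_cells:HEADER_SIZE + 2 * n_cells].decode("ascii")
    return solution, state


def rows(board: str, width: int):
    """Split a flat board string into one string per row."""
    return [board[i:i + width] for i in range(0, len(board), width)]


# The running example: a zeroed stand-in header followed by both boards.
example = bytes(HEADER_SIZE) + b"CAT..A..R" + b"---..-..-"
solution, state = read_boards(example, 3, 3)
```

Because the boards are stored in reading order, the cell at column x and row y sits at index y * width + x of either string.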

Strings Section

Immediately following the boards come the strings. All strings are encoded in ISO-8859-1 and end with a NUL. Even if a string is empty, its trailing NUL still appears in the file. In order, the strings are:

| Description | Example |
|---|---|
| Title | Theme: .PUZ format |
| Author | J. Puz / W. Shortz |
| Copyright | (c) 2007 J. Puz |
| Clue#1 | Cued, in pool |
| ... | ...more clues... |
| Clue#n | Quiet |
| Notes | http://mywebsite |

These first three example strings would appear in the file as the following, where \0 represents a NUL: Theme: .PUZ format\0J. Puz / W. Shortz\0(c) 2007 J. Puz\0

In some NYT puzzles, a "Note" has been included in the title instead of using the designated notes field. In all the examples we've seen, the note has been separated from the title by a space (ASCII 0x20) and begins with the string "NOTE:" or "Note:". It's not known if this is flagged anywhere else in the file. It doesn't seem that Across Lite handles these notes - they are just included with the title (which looks ugly).
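
If you want to separate such a note from the title yourself, one heuristic based only on the pattern described above looks like the following sketch. The function name and the sample title are hypothetical, and a title that legitimately contains " Note:" would be split incorrectly.

```python
def split_title_note(title: str):
    """Split 'Title NOTE: text' into (title, note); note is '' if absent."""
    for marker in (" NOTE:", " Note:"):
        idx = title.find(marker)
        if idx != -1:
            # Drop the separating space, keep the marker with the note.
            return title[:idx], title[idx + 1:]
    return title, ""
```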

The clues are arranged numerically. When two clues have the same number, the Across clue comes before the Down clue.
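
Here is a minimal Python sketch of reading the strings section. It assumes the section begins at offset 0x34 + 2 * width * height, directly after the two boards, and that n_clues comes from the header. The read_strings helper is made up for this example.

```python
def read_strings(data: bytes, offset: int, n_clues: int) -> dict:
    """Read title, author, copyright, n_clues clues, and notes, in that order."""
    fields = []
    pos = offset
    for _ in range(3 + n_clues + 1):
        end = data.index(b"\x00", pos)  # raises ValueError if a NUL is missing
        fields.append(data[pos:end].decode("iso-8859-1"))
        pos = end + 1  # step over the terminating NUL
    return {
        "title": fields[0],
        "author": fields[1],
        "copyright": fields[2],
        "clues": fields[3:3 + n_clues],
        "notes": fields[3 + n_clues],
    }


# The example strings from the table, with two clues and a notes field.
blob = (b"Theme: .PUZ format\x00J. Puz / W. Shortz\x00(c) 2007 J. Puz\x00"
        b"Cued, in pool\x00Quiet\x00http://mywebsite\x00")
strings = read_strings(blob, 0, 2)
```

An empty string still occupies one byte (its NUL), so the loop handles it without special cases.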

Clue Assignment

Nowhere in the file does it specify which cells get numbers or which clues correspond to which numbers. These are instead derived from the shape of the puzzle.

Here's a sketch of one way to assign numbers and clues to cells. First, some helper functions:

# Returns True if the cell at (x, y) gets an "across" clue number.
def cell_needs_across_number(x, y):
	# Check that there is no playable cell to the left of us
	if x == 0 or is_black_cell(x-1, y):
		# Check that there is space (at least two cells) for a word here
		if x+1 < width and not is_black_cell(x+1, y):
			return True
	return False

# Returns True if the cell at (x, y) gets a "down" clue number.
def cell_needs_down_number(x, y):
	# As above, but on the y axis
	if y == 0 or is_black_cell(x, y-1):
		if y+1 < height and not is_black_cell(x, y+1):
			return True
	return False

And then the actual assignment code:

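# A list mapping across clues to their clue numbers.
# So across_numbers[2] == 7 means that the 3rd across clue starts at the cell numbered 7.
across_numbers = []
down_numbers = []
# cell_numbers[x][y] is the number shown in the cell at (x, y), or 0 if it has none.
cell_numbers = [[0] * height for _ in range(width)]
cur_cell_number = 1
# Iterate through the cells in reading order: left to right, then top to bottom.
for y in range(height):
	for x in range(width):
		if is_black_cell(x, y):
			continue

		assigned_number = False
		if cell_needs_across_number(x, y):
			across_numbers.append(cur_cell_number)
			cell_numbers[x][y] = cur_cell_number
			assigned_number = True
		if cell_needs_down_number(x, y):
			down_numbers.append(cur_cell_number)
			cell_numbers[x][y] = cur_cell_number
			assigned_number = True
		if assigned_number:
			cur_cell_number += 1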
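
To make the numbering concrete, here is a self-contained Python sketch of the same idea run against the CAT example board. It assumes a word needs at least two cells. The names number_board and clue_slots are invented for this example.

```python
def number_board(solution: str, width: int, height: int):
    """Return (across_numbers, down_numbers) for a flat solution string."""
    def is_black(x, y):
        return solution[y * width + x] == "."

    def starts_across(x, y):
        return ((x == 0 or is_black(x - 1, y))
                and x + 1 < width and not is_black(x + 1, y))

    def starts_down(x, y):
        return ((y == 0 or is_black(x, y - 1))
                and y + 1 < height and not is_black(x, y + 1))

    across, down = [], []
    number = 1
    for y in range(height):
        for x in range(width):
            if is_black(x, y):
                continue
            is_across, is_down = starts_across(x, y), starts_down(x, y)
            if is_across:
                across.append(number)
            if is_down:
                down.append(number)
            if is_across or is_down:
                number += 1
    return across, down


def clue_slots(across, down):
    """Order slots the way clues appear in the file: by number, across first."""
    slots = [(n, "A") for n in across] + [(n, "D") for n in down]
    return sorted(slots)  # "A" sorts before "D" when the numbers tie


across, down = number_board("CAT..A..R", 3, 3)
```

On the example board, CAT is 1-Across and TAR is 2-Down, so the first clue string in the file belongs to 1-Across and the second to 2-Down.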

Checksums

The file format uses a variety of checksums.

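The checksumming routine used in PUZ is described in the original documentation as a variant of CRC-16. In practice it is a 16-bit rotate-and-add checksum: for each byte, the running value is rotated right by one bit and the byte is then added. To checksum a region of memory, the following is used: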

unsigned short cksum_region(unsigned char *base, int len, unsigned short cksum) {
	int i;
	for (i = 0; i < len; i++) {
		if (cksum & 0x0001)
			cksum = (cksum >> 1) + 0x8000;
		else
			cksum = cksum >> 1;
		cksum += *(base+i);
	}
	return cksum;
}
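
For readers working outside C, here is a Python port of the routine as a sketch. Python integers do not wrap, so the 16-bit truncation that a C unsigned short gives for free is made explicit with a mask.

```python
def cksum_region(data: bytes, cksum: int = 0) -> int:
    """Rotate-right-and-add checksum over data, seeded with cksum."""
    for byte in data:
        # Rotate the 16-bit value right by one bit.
        if cksum & 0x0001:
            cksum = (cksum >> 1) + 0x8000
        else:
            cksum = cksum >> 1
        # Add the byte, wrapping at 16 bits like a C unsigned short.
        cksum = (cksum + byte) & 0xFFFF
    return cksum
```

Because the only state carried between bytes is the running value, checksumming two regions in sequence (passing the first result as the seed for the second) gives the same result as checksumming their concatenation.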

The CIB checksum (which appears as its own field in the header as well as elsewhere) is a checksum over eight bytes of the header starting at the board width:

c_cib = cksum_region(data + 0x2C, 8, 0);

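The primary board checksum (the overall file checksum stored at offset 0x00) starts from the CIB checksum and folds in the other data. Here solution and grid are the solution and player-state boards, and w and h are the board width and height: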

cksum = c_cib;
cksum = cksum_region(solution, w*h, cksum);
cksum = cksum_region(grid, w*h, cksum);

if (strlen(title) > 0)
	cksum = cksum_region(title, strlen(title)+1, cksum);

if (strlen(author) > 0)
	cksum = cksum_region(author, strlen(author)+1, cksum);

if (strlen(copyright) > 0)
	cksum = cksum_region(copyright, strlen(copyright)+1, cksum);

for (i = 0; i < num_of_clues; i++)
	cksum = cksum_region(clue[i], strlen(clue[i]), cksum);

if (strlen(notes) > 0)
	cksum = cksum_region(notes, strlen(notes)+1, cksum);
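
The same chain can be sketched in Python. This assumes every string is passed as its raw ISO-8859-1 bytes without the trailing NUL. cib_checksum and overall_checksum are names made up for this example, and cksum_region is repeated in compact form so the block stands alone.

```python
def cksum_region(data: bytes, cksum: int = 0) -> int:
    """Compact port of the C routine: rotate right one bit, then add."""
    for byte in data:
        cksum = (cksum >> 1) | ((cksum & 1) << 15)
        cksum = (cksum + byte) & 0xFFFF
    return cksum


def cib_checksum(file_data: bytes) -> int:
    """Checksum the eight header bytes starting at the board width (0x2C)."""
    return cksum_region(file_data[0x2C:0x34])


def overall_checksum(cib, solution, grid, title, author, copyright_, clues, notes):
    """Chain the checksum over the same regions as the C code above."""
    cksum = cksum_region(cib)
    cksum = cksum_region(solution, cksum)
    cksum = cksum_region(grid, cksum)
    for text in (title, author, copyright_):
        if text:  # empty strings are skipped entirely
            cksum = cksum_region(text + b"\x00", cksum)
    for clue in clues:  # clues are checksummed without their NUL
        cksum = cksum_region(clue, cksum)
    if notes:
        cksum = cksum_region(notes + b"\x00", cksum)
    return cksum
```

Note the asymmetry carried over from the C code: title, author, copyright, and notes include their terminating NUL when non-empty, while clues never do.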

Masked Checksums

The values from 0x10-0x17 are a real pain to generate. They are the result of masking off and XORing four checksums; 0x10-0x13 are the low bytes, while 0x14-0x17 are the high bytes.

To calculate these bytes, we must first calculate four checksums:

  • CIB Checksum:

     c_cib = cksum_region(CIB, 0x08, 0x0000);
  • Solution Checksum:

     c_sol = cksum_region(solution, w*h, 0x0000);
  • Grid Checksum:

     c_grid = cksum_region(grid, w*h, 0x0000);
  • A partial board checksum:

     c_part = 0x0000;
     if (strlen(title) > 0)
     	c_part = cksum_region(title, strlen(title)+1, c_part);
     if (strlen(author) > 0)
     	c_part = cksum_region(author, strlen(author)+1, c_part);
     if (strlen(copyright) > 0)
     	c_part = cksum_region(copyright, strlen(copyright)+1, c_part);
     for (int i = 0; i < n_clues; i++)
     	c_part = cksum_region(clue[i], strlen(clue[i]), c_part);
     if (strlen(notes) > 0)
     	c_part = cksum_region(notes, strlen(notes)+1, c_part);

Once these four checksums are obtained, they're stuffed into the file thusly:

file[0x10] = 0x49 ^ (c_cib & 0xFF);
file[0x11] = 0x43 ^ (c_sol & 0xFF);
file[0x12] = 0x48 ^ (c_grid & 0xFF);
file[0x13] = 0x45 ^ (c_part & 0xFF);

file[0x14] = 0x41 ^ ((c_cib & 0xFF00) >> 8);
file[0x15] = 0x54 ^ ((c_sol & 0xFF00) >> 8);
file[0x16] = 0x45 ^ ((c_grid & 0xFF00) >> 8);
file[0x17] = 0x44 ^ ((c_part & 0xFF00) >> 8);

Note that these hex values in ASCII are the string "ICHEATED".
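
A compact Python sketch of the masking step, assuming the four checksums have already been computed as 16-bit integers. The function names are hypothetical. The inverse is included because XOR masking undoes itself, which is handy when validating an existing file.

```python
MASK = b"ICHEATED"


def masked_checksums(c_cib: int, c_sol: int, c_grid: int, c_part: int) -> bytes:
    """Return the eight bytes stored at offsets 0x10-0x17."""
    sums = (c_cib, c_sol, c_grid, c_part)
    low = bytes(m ^ (c & 0xFF) for m, c in zip(MASK[:4], sums))
    high = bytes(m ^ ((c >> 8) & 0xFF) for m, c in zip(MASK[4:], sums))
    return low + high


def unmask_checksums(stored: bytes):
    """Recover (c_cib, c_sol, c_grid, c_part) from the eight stored bytes."""
    plain = bytes(s ^ m for s, m in zip(stored, MASK))
    return tuple(plain[i] | (plain[i + 4] << 8) for i in range(4))
```

When all four checksums are zero, the stored bytes are exactly the mask string.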

@ssksameer56

Check that there is space (at least two cells) for a word here
if x+1 < width and is_black_cell(x+1):

Shouldn't the condition check x+2, since we want words to be at least 2 cells long? For example, at 6,8 we want to check whether 8,8 is black or not?

@martinfitzgibbons

Hi,
I need to write an import method to get the words and clues from several .puz files I made when I was a teacher. I know you have broken it down above, but I am not used to working with binary files (mainly just text). How do I get just the words and clues out of the file?

@ssksameer56

I don't think I am the right person for this, but any application or utility that supports puz files should allow you to provide a path to the file. The application would open it and find the words and clues. I wrote a utility for this in Go.
