A proof-of-concept golfing language, where the structure of a program and its operators are distinct.
A structure comes before the operator array.
Op count to number of structs:
0: 1
1: 3
2: 9
3: 36
4: 162
5: 783
6: 3969
7: 20817
8: 112023
9: 615033
10: 3431403
11: 19398690
12: 110880900
13: 639730305
14: 3720657807
15: 21790419444
16: 128398625658
17: 760668489729
18: 4528069760691
19: 27070491820644
20: 162464919528222
21: 978463778897637
22: 5911727071716891
23: 35821932198013812
24: 217642657066130500
Op count to required bits for structure and total bits:
0: 0 0
1: 2 8
2: 4 16
3: 6 24
4: 8 32
5: 10 40
6: 12 48
7: 15 57
8: 17 65
9: 20 74
10: 22 82
11: 25 91
12: 27 99
13: 30 108
14: 32 116
15: 35 125
16: 37 133
17: 40 142
18: 43 151
19: 45 159
20: 48 168
21: 50 176
22: 53 185
23: 55 193
24: 58 202
Bits to ops:
0: 0 (1)
1: 0 (3)
2: 0 (7)
3: 0 (15)
4: 0 (31)
5: 0 (63)
6: 1 (1)
7: 1 (3)
8: 1 (7)
9: 1 (15)
10: 1 (31)
11: 1 (63)
12: 2 (1)
13: 2 (3)
14: 2 (7)
15: 2 (15)
16: 2 (31)
17: 2 (63)
18: 3 (1)
19: 3 (3)
20: 3 (7)
21: 3 (15)
22: 3 (31)
23: 3 (63)
24: 4 (1)
25: 4 (3)
26: 4 (7)
27: 4 (15)
28: 4 (31)
29: 4 (63)
30: 4 (127)
31: 4 (255)
32: 5 (4)
33: 5 (12)
34: 5 (28)
35: 5 (60)
36: 5 (124)
37: 5 (252)
38: 5 (508)
39: 5 (1020)
40: 6 (16)
41: 6 (48)
42: 6 (112)
43: 6 (240)
44: 6 (496)
45: 6 (1008)
46: 6 (2032)
47: 6 (4080)
48: 7 (64)
A structure comes before the operator reference. It starts with a number of 1
bits, then a 0
bit. Then, there are three data bits, and an additional three bits for each 1
at the start. Then, f(n) - 1
where f(n) = 8 ** n + f(n - 1)
and f(0) = 1
is added to the number represented by the data bits, where n
is the number of 1
bits at the start, giving the index in the sequence of program structures. The structure takes approximately (n - 2) / 4
bytes, where n
is the number of operators.
The sequence of program structures represents all possible structures as ternary numbers, where each digit is the child count of a node, in prefix notation. E.g., 2010
represents a program like 1 N 2 +
in postfix notation. These ternary numbers are sorted first by length, shortest first, then by the ratios of 0
s, 1
s, and 2
s. If x
, y
, and z
represented the amounts of 0
s, 1
s, and 2
s respectively, the smallest (y - z) ** 2
would come first, or in the event of a tie, the smallest ((x * 2) - (y + z)) ** 2
. In the event that there is still a tie, the larger number comes first (this ensures 0
s are farther to the right). The first 200 terms in this sequence are:
0: 0
1: 10
2: 200
3: 110
4: 2100
5: 2010
6: 1200
7: 1110
8: 21100
9: 21010
10: 20110
11: 12100
12: 12010
13: 11200
14: 22000
15: 20200
16: 11110
17: 221000
18: 220100
19: 220010
20: 212000
21: 210200
22: 202100
23: 202010
24: 201200
25: 122000
26: 120200
27: 211100
28: 211010
29: 210110
30: 201110
31: 121100
32: 121010
33: 120110
34: 112100
35: 112010
36: 111200
37: 111110
38: 2211000
39: 2210100
40: 2210010
41: 2201100
42: 2201010
43: 2200110
44: 2121000
45: 2120100
46: 2120010
47: 2112000
48: 2110200
49: 2102100
50: 2102010
51: 2101200
52: 2021100
53: 2021010
54: 2020110
55: 2012100
56: 2012010
57: 2011200
58: 1221000
59: 1220100
60: 1220010
61: 1212000
62: 1210200
63: 1202100
64: 1202010
65: 1201200
66: 1122000
67: 1120200
68: 2111100
69: 2111010
70: 2110110
71: 2101110
72: 2011110
73: 1211100
74: 1211010
75: 1210110
76: 1201110
77: 1121100
78: 1121010
79: 1120110
80: 1112100
81: 1112010
82: 1111200
83: 2220000
84: 2202000
85: 2200200
86: 2022000
87: 2020200
88: 1111110
89: 22111000
90: 22110100
91: 22110010
92: 22101100
93: 22101010
94: 22100110
95: 22011100
96: 22011010
97: 22010110
98: 22001110
99: 21211000
100: 21210100
101: 21210010
102: 21201100
103: 21201010
104: 21200110
105: 21121000
106: 21120100
107: 21120010
108: 21112000
109: 21110200
110: 21102100
111: 21102010
112: 21101200
113: 21021100
114: 21021010
115: 21020110
116: 21012100
117: 21012010
118: 21011200
119: 20211100
120: 20211010
121: 20210110
122: 20201110
123: 20121100
124: 20121010
125: 20120110
126: 20112100
127: 20112010
128: 20111200
129: 12211000
130: 12210100
131: 12210010
132: 12201100
133: 12201010
134: 12200110
135: 12121000
136: 12120100
137: 12120010
138: 12112000
139: 12110200
140: 12102100
141: 12102010
142: 12101200
143: 12021100
144: 12021010
145: 12020110
146: 12012100
147: 12012010
148: 12011200
149: 11221000
150: 11220100
151: 11220010
152: 11212000
153: 11210200
154: 11202100
155: 11202010
156: 11201200
157: 11122000
158: 11120200
159: 22210000
160: 22201000
161: 22200100
162: 22200010
163: 22120000
164: 22102000
165: 22100200
166: 22021000
167: 22020100
168: 22020010
169: 22012000
170: 22010200
171: 22002100
172: 22002010
173: 22001200
174: 21220000
175: 21202000
176: 21200200
177: 21022000
178: 21020200
179: 20221000
180: 20220100
181: 20220010
182: 20212000
183: 20210200
184: 20202100
185: 20202010
186: 20201200
187: 20122000
188: 20120200
189: 12220000
190: 12202000
191: 12200200
192: 12022000
193: 12020200
194: 21111100
195: 21111010
196: 21110110
197: 21101110
198: 21011110
199: 20111110
Strings are written using a special monad (or a dyad for format strings if supported). These take a tree and, rather than treating it as code, use the operator list and formatting to generate binary for the string's contents. For the format part, this comes from the same op-count-to-number-of-structs table as the structure. E.g., there are 783 trees which use 5 operators. These can simply be multiplied together, giving 783 * (64 ** 5) possible strings with 5 operators, which is 840739848192 or about 40 bits. A number will be added to the numeric representation of the string, that being the number of strings possible with fewer operators, removing any possibility of duplicates.
Catstruct's strings are bit-based, so a way to turn the string indices into bits is required. This is done using bijective base 2, with the digits 0 and 1 rather than 1 and 2, as the string specified by 0001 is not the same as 000000001. The first few numbers in this system would be:
0:
1: 0
2: 1
3: 00
4: 01
5: 10
6: 11
7: 000
8: 001
9: 010
10: 011
11: 100
12: 101
13: 110
14: 111
15: 0000
16: 0001
17: 0010
18: 0011
19: 0100
20: 0101
21: 0110
22: 0111
23: 1000
24: 1001
Catstruct's strings consist of groups of bits with headers indicating their function:
00
: A huffman-coded corpus character010
: A 14-bit Unicode character011
: A 21-bit Unicode character1
: A corpus word
Corpus words can have the following codes:
0
: Title case if preceded by a.
,?
, or!
, preceding space if not at start of string1000
: No spacing1001
: Opposite of normal casing1010
: No spacing, opposite of normal casing1011
: No spacing, all capitals1100
: Add,
1101
: Add.
1110
: Add?
1111
: Add!
c