Idea for temporal bitnet optimization: one could perhaps avoid needing float accumulators during training. Say we flip a random subset of weights each pass (for a 1-bit bitnet that is always an XOR). The XOR mask can be reconstructed from a global random seed, so if we memorize only the resulting loss delta and the seeds of the past N steps, we can compute a moving average for each XOR bit with 1/N sub-bit precision.
With this, one could try doing N prospective measurements and then accumulate the results (which cost very little space, so N could be high) to get a smooth, low-noise gradient estimate; that approach might be able to replace Adam entirely.
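The bookkeeping for this is cheap enough to sketch in a few lines. A minimal illustration, assuming numpy; the flip probability, the function names `xor_mask` and `flip_scores`, and the scoring rule (mean loss delta over the steps in which a bit was flipped) are all my assumptions, not a worked-out method:

```python
import numpy as np

def xor_mask(seed, n_weights, flip_prob=0.01):
    """Reconstruct an XOR perturbation mask from a seed alone.

    Nothing but the scalar seed needs to be stored: the full bit
    mask is regenerated on demand.
    """
    rng = np.random.default_rng(seed)
    return rng.random(n_weights) < flip_prob

def flip_scores(seeds, loss_deltas, n_weights):
    """Per-bit moving average over the last N steps.

    For each weight, average the scalar loss deltas of the steps in
    which that bit was flipped. Resolution is 1/N (sub-bit precision),
    yet only N seeds and N scalars are kept in memory.
    """
    score = np.zeros(n_weights)
    count = np.zeros(n_weights)
    for seed, delta in zip(seeds, loss_deltas):
        mask = xor_mask(seed, n_weights)  # recomputed, never stored
        score += np.where(mask, delta, 0.0)
        count += mask
    return score / np.maximum(count, 1)  # bits never flipped stay at 0
```

A negative score then suggests flipping that bit tended to lower the loss, which is the signal the accumulated "gradient" would be built from.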
Come to think of it, the approach might even work for float weights via Forward Gradient Optimization, since the same seed can also be used to generate Gaussian-distributed random perturbation directions.
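For the float-weight case, a forward-gradient step with a seed-reconstructable Gaussian tangent might look like this. A sketch only: I use a central finite difference as a stand-in for a true Jacobian-vector product, and `tangent`, `eps`, and `lr` are illustrative names and values:

```python
import numpy as np

def tangent(seed, n):
    """Gaussian direction, reconstructable from the seed alone."""
    return np.random.default_rng(seed).standard_normal(n)

def forward_gradient_step(loss, w, seed, eps=1e-4, lr=1e-2):
    """One forward-gradient update.

    Estimate the directional derivative of `loss` along a seeded
    Gaussian tangent v (finite differences stand in for a real JVP
    here), then step along v scaled by that scalar. As in the bitnet
    case, only the seed and one scalar per step would need storing.
    """
    v = tangent(seed, w.size)
    d = (loss(w + eps * v) - loss(w - eps * v)) / (2 * eps)  # approx grad(L) . v
    return w - lr * d * v
```

Averaging `d * v` over many seeds is an unbiased estimate of the true gradient, which is what would let the stored (seed, scalar) history play the role of an optimizer state.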