@paniq
Last active March 8, 2024 12:24

Idea for temporal bitnet optimization: one could perhaps avoid needing float accumulators during training. Say we alter a random selection of weights each pass (for a 1-bit bitnet, that alteration is always an xor). The xor bits can be reconstructed from a global random seed value, so if we memorize the resulting loss delta and the seeds of the past N steps, we can compute a moving average for each xor bit with 1/N sub-bit precision.
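
A minimal sketch of that bookkeeping, assuming a small boolean weight vector and a caller-supplied `loss_fn`; the names `flip_prob`, `record_step` and `per_bit_score` are illustrative, not from the original note:

```python
import numpy as np

def flip_mask(seed, n_weights, flip_prob=0.01):
    # The xor mask is reconstructed from the global seed alone; nothing else is stored.
    rng = np.random.default_rng(seed)
    return rng.random(n_weights) < flip_prob

def record_step(weights, loss_fn, seed, history):
    # Apply the seeded xor perturbation, measure the loss delta, and keep
    # only the (seed, delta) pair -- a few bytes per step.
    mask = flip_mask(seed, weights.size)
    base = loss_fn(weights)
    perturbed = np.logical_xor(weights, mask)
    history.append((seed, loss_fn(perturbed) - base))

def per_bit_score(history, n_weights):
    # Moving average over the last N recorded steps: for each bit, average the
    # loss deltas of the steps in which that bit was flipped (~1/N resolution).
    score = np.zeros(n_weights)
    count = np.zeros(n_weights)
    for seed, delta in history:
        mask = flip_mask(seed, n_weights)
        score[mask] += delta
        count[mask] += 1
    return np.divide(score, count, out=np.zeros_like(score), where=count > 0)
```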

With this, one could try doing N prospective measurements and then accumulating the results (which cost very little space, so N could be high!) to get a really good and smooth gradient estimate; that approach might be able to replace Adam completely.
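
A sketch of that accumulation step, reusing `record_step` and `per_bit_score` from the sketch above; the threshold-at-zero update rule is an assumption for illustration:

```python
def update(weights, loss_fn, n_probes=256, base_seed=0):
    # Run N prospective (discarded) perturbations, then commit one update that
    # flips only the bits whose averaged score says flipping lowers the loss.
    history = []
    for i in range(n_probes):
        record_step(weights, loss_fn, base_seed + i, history)
    score = per_bit_score(history, weights.size)
    return np.logical_xor(weights, score < 0)
```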

Come to think of it, that approach might even work for float weights and Forward Gradient Optimization, since the seed can also be used to generate random floats with a Gaussian distribution.
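
A sketch of the float-weight variant, assuming the per-step directional derivatives (JVPs) are computed elsewhere, e.g. by forward-mode autodiff; only (seed, jvp) pairs need to be stored, and the Gaussian tangent is regenerated from the seed when accumulating:

```python
import numpy as np

def gaussian_tangent(seed, n_weights):
    # The random direction is fully determined by the seed.
    return np.random.default_rng(seed).standard_normal(n_weights)

def accumulate_forward_gradient(seeds, jvps, n_weights):
    # Forward-gradient estimate: average of (directional derivative) * tangent,
    # which is an unbiased estimate of the true gradient for Gaussian tangents.
    grad = np.zeros(n_weights)
    for seed, jvp in zip(seeds, jvps):
        grad += jvp * gaussian_tangent(seed, n_weights)
    return grad / len(seeds)
```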
