# [IN-DEV currently]
# Maintained/Initially created by Fern. Say hi to me and feel free to ask any questions as needed! <3 :'))))
# If anything here is self-cited/has no citation, that means it's a conclusion I arrived at over time, or by deriving it from the basics. However, there may be
# existing work elaborating on it in further detail (feel free to comment if there's an especially relevant link).
# Misc
- LayerNorm/RMSNorm might be acting as lateral inhibition, a paradigm attempted in many ML papers from the 2000s and surrounding years (Fern, {relevant sources needed})
- 'Soft' (pre-determined or pre-compiled) architectures in the weights of your network can greatly improve convergence times and/or generalization.
- Downcasting dtypes to a lower bit depth in your dot products can be a 'free' efficiency improvement in some circumstances (see the sketch after this list).
- GELU appears to be the most efficient (as of this hardware generation) runtime-vs-result activation function with one input. SwiGLU appears to be the most efficient for two inputs.
- GELU(x) - GELU(-x) = x; this identity can be used in a variety of creative ways (a quick numerical check follows this list).
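
A minimal sketch of the dtype-downcasting point above, assuming a CUDA GPU with bfloat16 support; only the dot product is downcast, the surrounding code stays in fp32:

```python
import torch

a = torch.randn(4096, 4096, device='cuda')
b = torch.randn(4096, 4096, device='cuda')

# Downcast only the matmul: tensor cores pick this up, often a near-'free' speedup.
c_fast = (a.to(torch.bfloat16) @ b.to(torch.bfloat16)).float()
c_ref = a @ b
print((c_fast - c_ref).abs().max() / c_ref.abs().max())  # relative error; check it's acceptable for your use case
```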
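And a quick numerical check of the GELU identity above:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4096)
# GELU(x) - GELU(-x) = x*Phi(x) + x*Phi(-x) = x*(Phi(x) + Phi(-x)) = x
print(torch.allclose(F.gelu(x) - F.gelu(-x), x, atol=1e-5))  # True, up to float error
```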
# Tips, tricks, and advice for developing neural networks.
There are 3 (general) phases in building a neural network from the ground up: the initial build phase, the debugging phase (which starts as soon as the network functionally runs without syntax/etc. errors), and tuning (getting the network to actually perform well, and efficiently).
(TODO TODO TODO: Sort into areas? <3 :')))) )
## Initial Build
- Start with what works: pick something that already works well and is simple.
## Debugging Phase
## Tuning
## General Long-term Ideology
- For anything other than raw text, your dataloader is likely your bottleneck. Rewrite it to run on the GPU (see the sketch after this list).
- When in doubt about something, try to phrase it in terms of a dot product.
- Very few preprocessing ops are actually more efficient on the CPU; your GPU supports multiple streams, so use them.
- Start with the simplest thing possible. If it doesn't physically hurt to do, it's not simple enough.
- A terminal is the simplest logging output for early-stage development.
- A method developed without using a validation set will just go in circles.
- Starting with a pre-existing method will likely save you time over developing from scratch. Assume it has at least 17 bugs, 4 of which are very serious.
- Loss spikes permanently harm your network, fix them ASAP. I believe this can extend smoothly to high variance as well, in some contexts.
- Always plot or print the grad norms of your network when debugging (a small helper follows this list). This can tell you if certain layers are being used, if you're overfitting, etc. Sudden phase changes are no bueno.
- Your network's performance is effectively limited by the length (in 'hops') of the shortest path from your first network layer to your loss. Techniques like residuals and dense connections effectively shorten this.
- If there's a bug that you don't understand, your test case is likely too complicated, simplify until it resolves.
- Don't chase the new shiny. Have an actual reason if you really need to implement a method that just came out this last week/month/year.
- Print and/or chart everything you can when first developing a new feature. Physically read the raw numbers in different tensors during the first stage of development; statistics don't always tell the whole story.
- Prune unused logging ASAP when you're done with it; it's okay to delete and rewrite log statements for cleanliness.
- Walk through the network steps for a single iteration in your head or on paper to try to catch dataflow and/or tensor magnitude issues from a different angle.
- Establish a good baseline; sometimes removing a single component can make an entire method fail. Premature optimization is tempting and can waste a lot of time.
- Additionally, multivariate exploration with fallback/resets can be a good exploration strategy -- change several ops together and see how they evolve. Go back to your baseline (for now) if they don't improve.
- Developing without a solid, tiny, rapidly-converging proxy task and network is probably the quickest way to waste your company's/organization's/personal funding.
- Networks with a solid scaling law are worth using.
- Complexity is a cost factor, the earlier you add complexity, the more exponentially expensive it gets. Minimize your complexity:performance ratio if possible.
- If manually stepping a hyperparameter in one direction fails, step the same amount in the opposite direction, no matter how silly. Then watch performance, linearly interpolate between them to the predicted-best value, and start again.
- If manually stepping a hyperparameter in some direction works well, keep going, no matter how silly it gets.
- Some hyperparameters are entangled; knowing this can greatly speed development. Sometimes these entanglements can be simplified/removed.
- If you can't think of anything, do one of the first dumb things you can think of on a proxy task, no matter how dumb. Then 3-5 more dumb things. This method works in creative writing, and oddly enough can result in new solutions.
- Automated hyperparameter tuning can be great, and saves human hours, but you don't get the same visceral 'feel' for performance as when manually tuning a network. Both have benefits and tradeoffs.
- Use Adam (or an appropriate, domain-established variant) if you're prototyping or only seeing your data once; use SGD later on, and if you're seeing your data multiple times.
- Linear OneCycle can take you far, but can clash with Adam's beta2-governed second-moment estimate being initialized at zero. RAdam is one cheap way around this, for example (see the sketch after this list).
- Boil down what you are doing to the pith of the method, and the fundamental operations that can do it. Keep aggressively simplifying until you find that irreducible complexity.
- Ask yourself if there's a creative way to functionally do the same general thing as a given method in fewer kernel launches (i.e. unique operations). This can take a while to learn, many operations are capable of implicitly doing several things at once.
- Write your code to be hacked into in the future; optimize for both future you and future experimentation. Argument-passing chains of 2, 3, or more between functions become a nightmare very quickly.
- Whiten your data in-network if possible. Among other things, this also allows you to (potentially) freeze your first layer, greatly accelerating convergence (a rough sketch follows this list).
- Use torch.channels_last when possible with 4D, convolution-based data to avoid transposes for tensor cores on NVIDIA GPUs (see the example after this list).
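
A minimal sketch of the GPU-resident dataloader point above, assuming a small (CIFAR-sized) dataset that fits in GPU memory; the shapes and augmentation are placeholders:

```python
import torch

# Keep the whole dataset resident on the GPU and do the 'dataloading'
# (shuffling, normalization, a simple augmentation) as plain tensor ops there.
images = torch.rand(50_000, 3, 32, 32, device='cuda')   # stand-in for the real data
labels = torch.randint(0, 10, (50_000,), device='cuda')
mean = images.mean(dim=(0, 2, 3), keepdim=True)
std = images.std(dim=(0, 2, 3), keepdim=True)

def get_batch(batch_size: int = 512):
    idx = torch.randint(0, images.shape[0], (batch_size,), device='cuda')
    x = (images[idx] - mean) / std                       # normalize on-GPU
    flip = torch.rand(batch_size, 1, 1, 1, device='cuda') < 0.5
    x = torch.where(flip, x.flip(-1), x)                 # random horizontal flip, also on-GPU
    return x, labels[idx]
```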
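For the grad-norm bullet, a hypothetical helper (the name is mine); call it after loss.backward() and before optimizer.step():

```python
import torch

def print_grad_norms(model: torch.nn.Module):
    # Per-parameter gradient norms: dead layers show ~0, exploding layers stick out immediately.
    for name, p in model.named_parameters():
        if p.grad is not None:
            print(f"{name:60s} {p.grad.norm().item():.3e}")
```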
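A sketch of the OneCycle/RAdam pairing mentioned above; the model and loss are dummies, only the optimizer/scheduler wiring is the point:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
opt = torch.optim.RAdam(model.parameters(), lr=1e-3)     # RAdam sidesteps the noisy early second-moment estimates
sched = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=1e-3, total_steps=1_000,
                                            anneal_strategy='linear')
for step in range(1_000):
    loss = model(torch.randn(32, 128)).square().mean()   # dummy loss standing in for the real task
    loss.backward()
    opt.step()
    opt.zero_grad()
    sched.step()
```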
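A rough sketch of the in-network whitening bullet, via a frozen first conv layer (ZCA-style, assuming out_channels == in_channels * kh * kw and no bias); treat this as illustrative rather than a tuned recipe:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def init_whitening_conv(conv: nn.Conv2d, images: torch.Tensor, eps: float = 1e-2):
    c, (kh, kw) = images.shape[1], conv.kernel_size
    # Collect all kh x kw patches and compute their covariance.
    patches = images.unfold(2, kh, 1).unfold(3, kw, 1)                   # N, C, H', W', kh, kw
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, c * kh * kw)
    patches = patches - patches.mean(dim=0, keepdim=True)
    cov = patches.T @ patches / (patches.shape[0] - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    # ZCA whitening matrix; each row becomes one (frozen) conv filter.
    whitener = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.T
    conv.weight.copy_(whitener.reshape(-1, c, kh, kw))
    conv.weight.requires_grad_(False)                                    # freeze the whitening layer

conv = nn.Conv2d(3, 3 * 3 * 3, kernel_size=3, bias=False)
init_whitening_conv(conv, torch.rand(512, 3, 32, 32))                    # stand-in for real training images
```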
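Finally, a minimal channels_last example (assumes a CUDA device; autocast added since tensor cores want low precision as well):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.GELU()).cuda()
model = model.to(memory_format=torch.channels_last)                       # store conv weights as NHWC
x = torch.randn(8, 3, 32, 32, device='cuda').contiguous(memory_format=torch.channels_last)
with torch.autocast('cuda', dtype=torch.bfloat16):
    out = model(x)                                                        # no NCHW<->NHWC transposes needed
```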
# Postulations
- Batchnorm greatly accelerates network convergence by biasing the second order (step-to-step variance of the variances) of the network to zero.
- LayerNorm/RMSNorm might be acting as lateral inhibition in LLMs, a cheap way to communicate temporal information (and a paradigm attempted in many ML papers from the 2000s and surrounding years) ({relevant sources needed})
- Grokking is likely just a failure mode where the entropy of the network does not match the entropy of the data. It's (generally) not an ideal thing to have, fix it ASAP.
# Paper-based results (by year)
## 2023
- (TODO check method description) Ranking LLM output values in order via cross-entropy greatly outperforms an RL-based reward head. (https://arxiv.org/abs/2305.18290)
- You can remove residuals in some networks by initializing all layers with an identity init and exploiting the nearly-linear area of some activation functions (a toy sketch follows).
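
A toy illustration of the identity-init idea above, using tanh as an example of an activation that is nearly linear (slope ~1) around zero; this is my own sketch, not any paper's exact recipe:

```python
import torch
import torch.nn as nn

layers = []
for _ in range(16):
    lin = nn.Linear(256, 256)
    nn.init.eye_(lin.weight)            # each layer starts as the identity map
    nn.init.zeros_(lin.bias)
    layers += [lin, nn.Tanh()]          # tanh(x) ~= x for small |x|
net = nn.Sequential(*layers)

x = torch.randn(8, 256) * 1e-2          # keep activations in the nearly-linear region
print(torch.allclose(net(x), x, atol=1e-3))  # the residual-free stack is ~identity at init
```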
## 2022
- Teaching a sequential language model to externalize its reasoning steps greatly increases its performance on a lot of difficult, much more nonlinear tasks. (https://arxiv.org/pdf/2201.11903.pdf)
## 2017
- Batchnorm induces a train-(validation/test) domain gap, annealing over time towards the validation/test mode seems to reduce this effect somewhat (https://arxiv.org/pdf/1702.03275.pdf)
## TODOs (to sort, papers by year....)
- Smaller networks are less tolerant to data noise; larger networks are more tolerant.