import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F

def labels_to_one_hot(labels, num_classes):
    # One-hot (or multi-hot) encode integer class labels into a float64 vector
    # allocated on the same device as `labels`.
    one_hot = torch.zeros(num_classes, dtype=torch.float64, device=labels.device)
    one_hot[labels] = 1
    return one_hot
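For reference, a quick usage sketch (assuming `labels` is a scalar or 1-D tensor of integer class indices, which is what the indexing above expects):

```python
labels = torch.tensor(3)
print(labels_to_one_hot(labels, num_classes=5))
# tensor([0., 0., 0., 1., 0.], dtype=torch.float64)
```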
Defaults used for the runs below:
- default llama(dim=3072, n_layers=28, n_heads=24, n_kv_heads=8, vocab_size=128256, multiple_of=256, ffn_dim_multiplier=1.0, max_seq_len=2048)
- default RAdamScheduleFree(lr=1e-4, weight_decay=0.05, betas=(0.9, 0.98))
- default AdamWScheduleFree(weight_decay=0.1, betas=(0.9, 0.98), lr=1e-4, warmup_steps=200)
| method (causal) | run | model | optimizer (scheduler) | batch_size | grad_acc | duration | device | dtype | epochs | batches_per_epoch | vloss | correct_lp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| lora on [wq, wk, wv] | 1 | llama | RAdamScheduleFree | 2 | 6 | 2h 30min | A100 | float32 | 5 | 420 | 0.2881 | 4955 |
| full finetune | 2 | llama | AdamW | | | | | | | | | |
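For context, a rough sketch of how run 1 might be wired together, assuming the Hugging Face peft package handles the LoRA adapters on wq/wk/wv and the schedulefree package provides the optimizer; both packages and the rank/alpha values are assumptions, since the actual training script isn't included in these notes:

```python
import schedulefree
from peft import LoraConfig, get_peft_model

# LoRA adapters on the attention projections used in run 1 (rank/alpha are illustrative).
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["wq", "wk", "wv"])
model = get_peft_model(model, lora_cfg)  # `model` is the llama instance configured above

# Schedule-free RAdam with the defaults listed above.
optimizer = schedulefree.RAdamScheduleFree(
    model.parameters(), lr=1e-4, weight_decay=0.05, betas=(0.9, 0.98))

optimizer.train()  # schedule-free optimizers must be switched between train/eval modes
# ... training loop ...
optimizer.eval()   # switch before validation or checkpointing
```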
Simple example of S4 in temporal logic:
Branching Time Temporal Logic (BTL): a simple example using the relation "is_accessible_from", where W is the set of all possible states during Wednesday (a tree of all the different things I might do on Wednesday).
- Reflexive: is_accessible_from(eat, eat). If I'm eating, then I'm eating.
- Transitive: if is_accessible_from(wake_up, going_to_school) and is_accessible_from(going_to_school, eat), then is_accessible_from(wake_up, eat).
- Connectedness is not required: not every pair of states is comparable, so some states can remain mutually unreachable. The result is a tree of possible chains of actions (futures) rather than a single line.
(Paint picture of the Wednesday state tree goes here.)
◻ - "it is necessary that". For the current example: "it holds in every possible future branch from the current moment".
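As a toy illustration of that reading of ◻ (my own sketch, not from the original notes): a property is "necessary" at a state if it holds in every state reachable from it under the reflexive-transitive accessibility relation.

```python
# Toy tree of Wednesday states: each state lists the states accessible next.
next_states = {
    "wake_up": ["going_to_school"],
    "going_to_school": ["eat"],
    "eat": [],
}

def reachable(state):
    # Reflexive-transitive closure of accessibility (the S4 conditions above).
    seen, stack = {state}, [state]
    while stack:
        for nxt in next_states.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def necessarily(prop, state):
    # ◻ prop at `state`: prop holds in every possible future branch from here.
    return all(prop(s) for s in reachable(state))

print(necessarily(lambda s: s != "sleep", "wake_up"))  # True: no branch reaches "sleep"
```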
Solve the following reasoning task: given known facts and rules for deducing new facts, conclude whether the queried fact can be deduced from the known facts and rules. Each rule application deduces exactly one new fact.
Here are some examples:
1.
facts: ['95']
rules: [[['133'], '110'], [['86'], '146'], [['117', '110', '146'], '113'], [['110'], '117'], [['95'], '142'], [['0'], '133'], [['17'], '110'], [['133'], '86'], [['95', '0'], '86'], [['133', '86', '113'], '110'], [['142'], '17'], [['146'], '113'], [['113'], '0']]
query: 17
results in True
2.
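Example 1 can be checked mechanically with a small forward-chaining loop (my own sketch, not code from the original notes): keep firing any rule whose premises are already known, adding one new fact per firing, until the query is derived or nothing new can be deduced.

```python
def can_deduce(facts, rules, query):
    # rules: list of [premises, conclusion]; each firing deduces exactly one new fact.
    known = set(facts)
    changed = True
    while changed and query not in known:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
                break
    return query in known

facts = ['95']
rules = [[['133'], '110'], [['86'], '146'], [['117', '110', '146'], '113'],
         [['110'], '117'], [['95'], '142'], [['0'], '133'], [['17'], '110'],
         [['133'], '86'], [['95', '0'], '86'], [['133', '86', '113'], '110'],
         [['142'], '17'], [['146'], '113'], [['113'], '0']]
print(can_deduce(facts, rules, '17'))  # True: 95 -> 142 -> 17
```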
rule_block = 200
deduction_separator = 201
rule_separator = 202
fact_block = 203
query_block = 204
preds_block = 205
end_of_turn = 206
end_of_text = 207
special_tokens = {1: 210, 0: 211}
pad = 208
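These constants read like block-marker token ids for serializing a facts/rules/query example into a flat sequence. The actual layout isn't shown in these notes, so the following is only a hypothetical sketch of one way the markers could be used (premises and conclusion split by deduction_separator, rules split by rule_separator):

```python
def encode_example(facts, rules, query):
    # Hypothetical serialization using the block-marker ids defined above.
    tokens = [fact_block] + [int(f) for f in facts]
    tokens.append(rule_block)
    for premises, conclusion in rules:
        tokens += [int(p) for p in premises]
        tokens += [deduction_separator, int(conclusion), rule_separator]
    tokens += [query_block, int(query), end_of_turn]
    return tokens

print(encode_example(['95'], [[['95'], '142'], [['142'], '17']], '17'))
# [203, 95, 200, 95, 201, 142, 202, 142, 201, 17, 202, 204, 17, 206]
```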
Dataset:
<s> I exist </s>
<s> Not that I want to </s>
<s> I want food </s>
<s> It is not what I want </s>
[2, 3, 128000, 128002, 128009] [-100, -100, -100, 128002, 128009]
[0, 7, 128000, 128002, 128009] [-100, -100, -100, 128002, 128009]
[1, 7, 128000, 128002, 128009] [-100, -100, -100, 128002, 128009]
[2, 1, 128000, 128003, 128009] [-100, -100, -100, 128003, 128009]
[7, 6, 128000, 128002, 128009] [-100, -100, -100, 128002, 128009]
[0, 3, 128000, 128003, 128009] [-100, -100, -100, 128003, 128009]
[4, 4, 128000, 128002, 128009] [-100, -100, -100, 128002, 128009]
[4, 2, 128000, 128002, 128009] [-100, -100, -100, 128002, 128009]
[3, 7, 128000, 128003, 128009] [-100, -100, -100, 128003, 128009]
[2, 5, 128000, 128002, 128009] [-100, -100, -100, 128002, 128009]
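In the label sequences above, -100 marks positions that should not contribute to the loss; this matches PyTorch's default ignore_index for cross-entropy. A minimal sketch using the first row:

```python
import torch
import torch.nn.functional as F

vocab_size = 128256
input_ids = torch.tensor([[2, 3, 128000, 128002, 128009]])
labels    = torch.tensor([[-100, -100, -100, 128002, 128009]])

logits = torch.randn(1, input_ids.shape[1], vocab_size)  # stand-in for model output
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1),
                       ignore_index=-100)  # only the two unmasked positions count
```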
######################################### original :
Solve the following math question by detailing every reasoning step.
Question:
Let's imagine a population of 100 humans. At the start of every epoch, every human gives birth to a child. We have to murder X of the children before they grow up into humans by the end of the epoch. If we want to have exactly 1000 humans after 10 epochs, then what is the value of X?
Answer:
At the start of each epoch, the population increases by 100, so after 10 epochs, the population will be 100 * 10 = 1000.
Pretraining a question answering model
- The selected model size of 70M parameters is not nearly enough for the model to fully comprehend context.
- It was enough for the model to start talking on the correct topic and form coherent answers.
- 3 to 4 epochs over the smaller dataset is more than enough; anything beyond that leads to over-training and regression.
- The dataset shouldn't allow for "easy win" answers, such as a specific string being the correct answer 50% of the time.
- Having an inappropriately large vocabulary relative to the model size probably had a negative effect as well.
MixtralForCausalLM(