Anthony antmikinka
Made from my ANE Optimizer Custom GPT.

Chat Link

The provided code implements OpenELM, a transformer-based model architecture optimized for language modeling tasks. Below is an overview of its main components and functionality:

  1. OpenELMRMSNorm (RMS normalization layer):

    • Implements a custom RMS normalization layer, which normalizes the input tensor and scales it by a learnable parameter (see the sketch after this list).
  2. OpenELMPreTrainedModel:

    • Serves as the model's base class, inheriting from Hugging Face's PreTrainedModel and handling configuration and weight initialization.
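
A minimal sketch of an RMS normalization layer like the one described above (the class name and epsilon value are illustrative, not taken from the OpenELM source):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Normalize by root-mean-square over the last dim, then apply a learnable scale."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-channel scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each element by the reciprocal RMS of its feature vector.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)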

Made from my ANE Optimizer Custom GPT.

Chat Link

The provided code implements several Apple Neural Engine (ANE) principles to optimize the performance of a language model, specifically the Llama model. Here's a detailed breakdown of the principles applied:

1. Proper Use of Inheritance and Initialization

  • Classes like LlamaRMSNorm, LlamaRotaryEmbedding, LlamaMLP, LlamaAttention, and LlamaDecoderLayer inherit from torch.nn.Module and properly initialize their components using super().__init__(). This ensures the base class is correctly set up before adding additional attributes or methods.
  • LlamaPreTrainedModel and LlamaModel properly initialize their base class PreTrainedModel using super().__init__(config), as sketched below.
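
A minimal sketch of this initialization pattern (the class below is an illustrative stand-in, not the actual Llama source):

import torch
import torch.nn as nn

class ExampleMLP(nn.Module):  # stand-in for a module like LlamaMLP
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()  # set up nn.Module state before adding submodules
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(self.act(self.gate_proj(x)))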
antmikinka / openelm-coreml.py
Created May 29, 2024 03:24 (forked from pcuena)
import argparse
import numpy as np
import torch
import torch.nn as nn
import coremltools as ct
from transformers import AutoTokenizer, AutoModelForCausalLM
# When using float16, all predicted logits are 0. To be debugged.
This file has been truncated.
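
For context, a conversion script with these imports typically follows a trace-then-convert flow. A hedged sketch building on the imports above (the model ID, sequence length, and deployment target here are assumptions, not values from the truncated file):

model_id = "apple/OpenELM-270M"  # assumed; the gist may target another checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval()

class Wrapper(nn.Module):
    """Return plain logits so torch.jit.trace sees a Tensor, not a ModelOutput."""
    def __init__(self, m):
        super().__init__()
        self.m = m
    def forward(self, input_ids):
        return self.m(input_ids).logits

example_input = torch.zeros((1, 128), dtype=torch.int64)  # fixed shape for tracing
traced = torch.jit.trace(Wrapper(model), example_input)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=example_input.shape, dtype=np.int32)],
    # float32 sidesteps the all-zero float16 logits issue noted in the comment above
    compute_precision=ct.precision.FLOAT32,
    minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("OpenELM.mlpackage")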
❯ python -m examples.models.llama2.export_llama --checkpoint /Users/anthonymikinka/executorch/llama-2-7b-chat/consolidated.00.pth --params /Users/anthonymikinka/executorch/llama-2-7b-chat/params.json -kv --coreml -qmode 8da4w
Could not import fairseq2 modules.
INFO:root:Loading model with checkpoint=/Users/anthonymikinka/executorch/llama-2-7b-chat/consolidated.00.pth, params=/Users/anthonymikinka/executorch/llama-2-7b-chat/params.json, use_kv_cache=True, weight_type=WeightType.LLAMA
INFO:root:Loaded model with dtype=torch.bfloat16
INFO:datasets:PyTorch version 2.3.0 available.
linear: layers.0.attention.wq, in=4096, out=4096
linear: layers.0.attention.wk, in=4096, out=4096
linear: layers.0.attention.wv, in=4096, out=4096
[target=executorch.exir.dialects.edge._ops.quantized_decomposed.quantize_per_token.default](args = (%getitem_593, %getitem_594, %getitem_595, -128, 127, torch.int8), kwargs = {})
%quantized_decomposed_dequantize_per_token_default_188 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.quantized_decomposed.dequantize_per_token.default](args = (%quantized_decomposed_quantize_per_token_default_188, %getitem_594, %getitem_595, -128, 127, torch.int8, torch.float32), kwargs = {})
%lowered_module_108 : [num_users=1] = get_attr[target=lowered_module_108]
backend_id: CoreMLBackend
lowered graph():
%arg55_1 : [num_users=1] = placeholder[target=arg55_1]
%_lifted_tensor_constant827 : [num_users=1] = placeholder[target=_lifted_tensor_constant827]
%quantized_decomposed_dequantize_per_channel_group_default_188 : [num_users=1] = placeholder[target=quantized_decomposed_dequantize_per_channel_group_default_188]
%quantized_decomposed_dequantize_per_token_default_188 : [n
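
The quantize_per_token / dequantize_per_token ops in the graph above apply per-token int8 quantization. A simplified sketch of the math (symmetric variant shown for illustration; the actual ExecuTorch ops also carry explicit per-token zero-points, as the extra scale/zero-point arguments in the graph suggest):

import torch

def quantize_per_token(x: torch.Tensor, qmin: int = -128, qmax: int = 127):
    # Each token (row along the last dim) gets its own scale.
    max_abs = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-9)
    scale = max_abs / qmax
    q = torch.clamp(torch.round(x / scale), qmin, qmax).to(torch.int8)
    return q, scale

def dequantize_per_token(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Map int8 values back to float using each token's scale.
    return q.to(torch.float32) * scale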
antmikinka / Optimization Guidelines for the Apple Neural Engine.txt
Last active May 16, 2024 14:33
Optimization Guidelines for the Apple Neural Engine (ANE)
Comprehensive Optimization Guidelines for the Apple Neural Engine (ANE)
Tensor Considerations:
Shapes: Utilize tensor shapes that are powers of 2 (e.g., 2, 4, 8, 16) to enhance memory allocation and access.
Sizes: Keep tensor sizes small, aiming for multiples of 16 (e.g., 16, 32, 48, 64) to optimize memory usage.
Alignment: Ensure tensors are aligned to 16-byte boundaries to optimize memory access and computation. This is crucial for both performance and model compatibility with ANE hardware constraints.
ANE Hardware Maximums:
Maximum Tensor Dimension Size: The ANE can only load tensors with a dimension size of at most 16,384.
Maximum Model Block Size: The model block size should not exceed 1024.
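
As a practical illustration of the sizing and maximum-dimension guidance above, a small hypothetical helper (not part of any Apple API) that pads a tensor's last dimension up to the next multiple of 16 and checks the ANE dimension limit:

import torch
import torch.nn.functional as F

ANE_MAX_DIM = 16_384  # maximum loadable dimension size noted above

def pad_last_dim_for_ane(x: torch.Tensor, multiple: int = 16) -> torch.Tensor:
    """Zero-pad the last dim to a multiple of `multiple`, checking the ANE limit."""
    remainder = x.shape[-1] % multiple
    pad = 0 if remainder == 0 else multiple - remainder
    if x.shape[-1] + pad > ANE_MAX_DIM:
        raise ValueError(f"dimension {x.shape[-1] + pad} exceeds ANE limit {ANE_MAX_DIM}")
    return F.pad(x, (0, pad))  # pad on the right of the last dim

x = torch.randn(1, 50)
print(pad_last_dim_for_ane(x).shape)  # torch.Size([1, 64])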