Anthony antmikinka
Made from my ANE Optimizer Custom GPT.

Chat Link

The provided code implements OpenELM, a transformer-based model architecture optimized for language modeling tasks. Below is an overview of its main components and functionality:

  1. OpenELMRMSNorm (RMS normalization layer):

    • Implements a custom RMS normalization layer, which normalizes the input tensor and scales it by a learnable parameter (see the sketch after this list).
  2. OpenELMPreTrainedModel:

    • Serves as the model's base class, inheriting from Hugging Face's PreTrainedModel and handling configuration and weight initialization.
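
A minimal sketch of an RMS normalization layer like the one described above (the class name and epsilon value are illustrative, not taken from the OpenELM source):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Normalize by root-mean-square over the last dim, then apply a learnable scale."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-channel scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each element by the reciprocal RMS of its feature vector.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)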

Made from my ANE Optimizer Custom GPT.

Chat Link

The provided code implements several Apple Neural Engine (ANE) principles to optimize the performance of a language model, specifically the Llama model. Here's a detailed breakdown of the principles applied:

1. Proper Use of Inheritance and Initialization

  • Classes like LlamaRMSNorm, LlamaRotaryEmbedding, LlamaMLP, LlamaAttention, and LlamaDecoderLayer inherit from torch.nn.Module and properly initialize their components using super().__init__(). This ensures the base class is correctly set up before adding additional attributes or methods.
  • LlamaPreTrainedModel and LlamaModel properly initialize their base class PreTrainedModel using super().__init__(config), as sketched below.
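
A minimal sketch of this initialization pattern (the class below is an illustrative stand-in, not the actual Llama source):

import torch
import torch.nn as nn

class ExampleMLP(nn.Module):  # stand-in for a module like LlamaMLP
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()  # set up nn.Module state before adding submodules
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(self.act(self.gate_proj(x)))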
antmikinka / openelm-coreml.py
Created May 29, 2024 03:24 (forked from pcuena)
import argparse
import numpy as np
import torch
import torch.nn as nn
import coremltools as ct
from transformers import AutoTokenizer, AutoModelForCausalLM
# When using float16, all predicted logits are 0. To be debugged.
This file has been truncated.
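
For context, a conversion script with these imports typically follows a trace-then-convert flow. A hedged sketch building on the imports above (the model ID, sequence length, and deployment target here are assumptions, not values from the truncated file):

model_id = "apple/OpenELM-270M"  # assumed; the gist may target another checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval()

class Wrapper(nn.Module):
    """Return plain logits so torch.jit.trace sees a Tensor, not a ModelOutput."""
    def __init__(self, m):
        super().__init__()
        self.m = m
    def forward(self, input_ids):
        return self.m(input_ids).logits

example_input = torch.zeros((1, 128), dtype=torch.int64)  # fixed shape for tracing
traced = torch.jit.trace(Wrapper(model), example_input)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=example_input.shape, dtype=np.int32)],
    # float32 sidesteps the all-zero float16 logits issue noted in the comment above
    compute_precision=ct.precision.FLOAT32,
    minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("OpenELM.mlpackage")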
❯ python -m examples.models.llama2.export_llama --checkpoint /Users/anthonymikinka/executorch/llama-2-7b-chat/consolidated.00.pth --params /Users/anthonymikinka/executorch/llama-2-7b-chat/params.json -kv --coreml -qmode 8da4w
Could not import fairseq2 modules.
INFO:root:Loading model with checkpoint=/Users/anthonymikinka/executorch/llama-2-7b-chat/consolidated.00.pth, params=/Users/anthonymikinka/executorch/llama-2-7b-chat/params.json, use_kv_cache=True, weight_type=WeightType.LLAMA
INFO:root:Loaded model with dtype=torch.bfloat16
INFO:datasets:PyTorch version 2.3.0 available.
linear: layers.0.attention.wq, in=4096, out=4096
linear: layers.0.attention.wk, in=4096, out=4096
linear: layers.0.attention.wv, in=4096, out=4096
[target=executorch.exir.dialects.edge._ops.quantized_decomposed.quantize_per_token.default](args = (%getitem_593, %getitem_594, %getitem_595, -128, 127, torch.int8), kwargs = {})
%quantized_decomposed_dequantize_per_token_default_188 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.quantized_decomposed.dequantize_per_token.default](args = (%quantized_decomposed_quantize_per_token_default_188, %getitem_594, %getitem_595, -128, 127, torch.int8, torch.float32), kwargs = {})
%lowered_module_108 : [num_users=1] = get_attr[target=lowered_module_108]
backend_id: CoreMLBackend
lowered graph():
%arg55_1 : [num_users=1] = placeholder[target=arg55_1]
%_lifted_tensor_constant827 : [num_users=1] = placeholder[target=_lifted_tensor_constant827]
%quantized_decomposed_dequantize_per_channel_group_default_188 : [num_users=1] = placeholder[target=quantized_decomposed_dequantize_per_channel_group_default_188]
%quantized_decomposed_dequantize_per_token_default_188 : [n
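
The quantize_per_token / dequantize_per_token ops in the graph above apply per-token int8 quantization. A simplified sketch of the math (symmetric variant shown for illustration; the actual ExecuTorch ops also carry explicit per-token zero-points, as the extra scale/zero-point arguments in the graph suggest):

import torch

def quantize_per_token(x: torch.Tensor, qmin: int = -128, qmax: int = 127):
    # Each token (row along the last dim) gets its own scale.
    max_abs = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-9)
    scale = max_abs / qmax
    q = torch.clamp(torch.round(x / scale), qmin, qmax).to(torch.int8)
    return q, scale

def dequantize_per_token(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Map int8 values back to float using each token's scale.
    return q.to(torch.float32) * scale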
antmikinka / Optimization Guidelines for the Apple Neural Engine.txt
Last active May 16, 2024 14:33
Optimization Guidelines for the Apple Neural Engine (ANE)
Comprehensive Optimization Guidelines for the Apple Neural Engine (ANE)
Tensor Considerations:
Shapes: Utilize tensor shapes that are powers of 2 (e.g., 2, 4, 8, 16) to enhance memory allocation and access.
Sizes: Keep tensor sizes small, aiming for multiples of 16 (e.g., 16, 32, 48, 64) to optimize memory usage.
Alignment: Ensure tensors are aligned to 16-byte boundaries to optimize memory access and computation. This is crucial for both performance and model compatibility with ANE hardware constraints.
ANE Hardware Maximums:
Maximum Tensor Dimension Size: The ANE can only load tensors with a dimension size of at most 16,384.
Maximum Model Block Size: The model block size should not exceed 1024.
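
As a practical illustration of the sizing and maximum-dimension guidance above, a small hypothetical helper (not part of any Apple API) that pads a tensor's last dimension up to the next multiple of 16 and checks the ANE dimension limit:

import torch
import torch.nn.functional as F

ANE_MAX_DIM = 16_384  # maximum loadable dimension size noted above

def pad_last_dim_for_ane(x: torch.Tensor, multiple: int = 16) -> torch.Tensor:
    """Zero-pad the last dim to a multiple of `multiple`, checking the ANE limit."""
    remainder = x.shape[-1] % multiple
    pad = 0 if remainder == 0 else multiple - remainder
    if x.shape[-1] + pad > ANE_MAX_DIM:
        raise ValueError(f"dimension {x.shape[-1] + pad} exceeds ANE limit {ANE_MAX_DIM}")
    return F.pad(x, (0, pad))  # pad on the right of the last dim

x = torch.randn(1, 50)
print(pad_last_dim_for_ane(x).shape)  # torch.Size([1, 64])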