executorch llama2

Initial failures on the base model downloads are tracked in pytorch/executorch#2907

Command

python -m examples.models.llama2.export_llama --checkpoint $MODEL_PATH/consolidated.00.pth --params $MODEL_PATH/params.json -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32

Error

Could not import fairseq2 modules.
INFO:root:Loading model with checkpoint=/Users/gchauhan/dev/llama-fast/checkpoints/meta-llama/Llama-2-7b/consolidated.00.pth, params=/Users/gchauhan/dev/llama-fast/checkpoints/meta-llama/Llama-2-7b/params.json, use_kv_cache=True, weight_type=WeightType.LLAMA
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/gchauhan/dev/executorch/examples/models/llama2/export_llama.py", line 30, in <module>
    main()  # pragma: no cover
    ^^^^^^
  File "/Users/gchauhan/dev/executorch/examples/models/llama2/export_llama.py", line 26, in main
    export_llama(modelname, args)
  File "/Users/gchauhan/dev/executorch/examples/models/llama2/export_llama_lib.py", line 408, in export_llama
    return _export_llama(modelname, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gchauhan/dev/executorch/examples/models/llama2/export_llama_lib.py", line 529, in _export_llama
    builder_exported_to_edge = _prepare_for_llama_export(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gchauhan/dev/executorch/examples/models/llama2/export_llama_lib.py", line 486, in _prepare_for_llama_export
    load_llama_model(
  File "/Users/gchauhan/dev/executorch/examples/models/llama2/builder.py", line 83, in load_llama_model
    model, example_inputs, _ = EagerModelFactory.create_model(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gchauhan/dev/executorch/examples/models/model_factory.py", line 44, in create_model
    model = model_class(**kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gchauhan/dev/executorch/examples/models/llama2/model.py", line 139, in __init__
    self.model_ = Transformer(model_args)
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/et/lib/python3.11/site-packages/executorch/examples/models/llama2/llama_transformer.py", line 418, in __init__
    self.tok_embeddings = nn.Embedding(params.vocab_size, params.dim)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/et/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 143, in __init__
    self.weight = Parameter(torch.empty((num_embeddings, embedding_dim), **factory_kwargs),
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/et/lib/python3.11/site-packages/torch/utils/_device.py", line 78, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Trying to create tensor with negative dimension -1: [-1, 4096]
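The failure is nn.Embedding being constructed with vocab_size = -1 (the second dimension, 4096, is the 7B model dim). The stock Llama 2 params.json ships with "vocab_size": -1, and the exporter evidently does not infer the real value from the checkpoint. A minimal workaround sketch, assuming that is the cause (32000 is the Llama 2 tokenizer vocabulary size):

import json, os

# Same $MODEL_PATH as in the export command above.
params_file = os.path.expandvars("$MODEL_PATH/params.json")
with open(params_file) as f:
    params = json.load(f)

if params.get("vocab_size", -1) == -1:   # -1 is the placeholder in the stock file
    params["vocab_size"] = 32000         # Llama 2 tokenizer vocabulary size
    with open(params_file, "w") as f:
        json.dump(params, f, indent=2)

After patching, rerun the export command unchanged.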

chauhang commented Apr 7, 2024

Unable to run on Android Emulator

adb push of the 4 GB .pte file hangs or crashes the emulator.

Run model on Android (an actual device worked)

Copy files

adb push xnnpack_llama2.pte /data/local/tmp/
adb push tokenizer.bin /data/local/tmp/
adb push cmake-out-android/examples/models/llama2/llama_main /data/local/tmp/
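Since the push silently hung on the emulator, a quick size check after each push confirms the files landed intact. A minimal sketch, assuming adb is on PATH and the device paths match the push commands above:

import subprocess

for f in ["xnnpack_llama2.pte", "tokenizer.bin", "llama_main"]:
    out = subprocess.run(
        ["adb", "shell", "ls", "-l", f"/data/local/tmp/{f}"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())  # compare the size column against the local file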

Run model on device

adb shell "cd /data/local/tmp && ./llama_main --model_path ./xnnpack_llama2.pte --tokenizer_path ./tokenizer.bin --prompt \"Once upon a time\" --seq_len 120"
I 00:00:00.003152 executorch:cpuinfo_utils.cpp:61] Reading file /sys/devices/soc0/image_version
I 00:00:00.003479 executorch:cpuinfo_utils.cpp:77] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003550 executorch:cpuinfo_utils.cpp:157] Number of efficient cores 4
I 00:00:00.003586 executorch:main.cpp:65] Resetting threadpool with num threads = 4
I 00:00:00.008603 executorch:runner.cpp:49] Creating LLaMa runner: model_path=./xnnpack_llama2.pte, tokenizer_path=./tokenizer.bin
I 00:00:12.040637 executorch:runner.cpp:64] Reading metadata from model
I 00:00:12.041047 executorch:runner.cpp:123] get_vocab_size: 32000
I 00:00:12.041061 executorch:runner.cpp:123] get_bos_id: 1
I 00:00:12.041077 executorch:runner.cpp:123] get_eos_id: 2
I 00:00:12.041089 executorch:runner.cpp:123] get_n_bos: 1
I 00:00:12.041095 executorch:runner.cpp:123] get_n_eos: 1
I 00:00:12.041100 executorch:runner.cpp:123] get_max_seq_len: 128
I 00:00:12.041105 executorch:runner.cpp:123] use_kv_cache: 0
I 00:00:12.041110 executorch:runner.cpp:123] use_sdpa_with_kv_cache: 0
I 00:00:12.041114 executorch:runner.cpp:123] append_eos_to_prompt: 0
Once upon a time, there was a beautiful city called Baghdad.istration, the Iraqi government is working to remove the names of all Americans and allies in Iraq from the terrorist list. The Iraqi government wants the world to believe that the new Iraq will be a peaceful, democratic place where all the world's people can feel safe. But, if the world's people believe in that new Iraq, they have to believe that the Iraqi government will put the names of all Iraqis who are terrorists on their terrorist list
I 00:25:22.676836 executorch:runner.cpp:411] 	Prompt Tokens: 2    Generated Tokens: 117
I 00:25:22.677070 executorch:runner.cpp:417] 	Model Load Time:		12.051000 (seconds)
I 00:25:22.677151 executorch:runner.cpp:427] 	Total inference time:		1510.609000 (seconds)		 Rate: 	0.077452 (tokens/second)
I 00:25:22.677205 executorch:runner.cpp:435] 		Prompt evaluation:	4.939000 (seconds)		 Rate: 	0.404940 (tokens/second)
I 00:25:22.677380 executorch:runner.cpp:446] 		Generated 117 tokens:	1505.670000 (seconds)		 Rate: 	0.077706 (tokens/second)
I 00:25:22.677457 executorch:runner.cpp:454] 	Time to first generated token:	8.448000 (seconds)
I 00:25:22.677507 executorch:runner.cpp:461] 	Sampling time over 119 tokens:	0.136000 (seconds)

PyTorchObserver {"prompt_tokens":2,"generated_tokens":117,"model_load_start_ms":1712522260853,"model_load_end_ms":1712522272904,"inference_start_ms":1712522272904,"inference_end_ms":1712523783513,"prompt_eval_end_ms":1712522277843,"first_token_ms":1712522281352,"aggregate_sampling_time_ms":136,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
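The PyTorchObserver line is machine-readable JSON, so the headline numbers can be recomputed from it as a sanity check. A minimal sketch; the field names come straight from the output above:

import json

line = '{"prompt_tokens":2,"generated_tokens":117,"model_load_start_ms":1712522260853,"model_load_end_ms":1712522272904,"inference_start_ms":1712522272904,"inference_end_ms":1712523783513,"prompt_eval_end_ms":1712522277843,"first_token_ms":1712522281352,"aggregate_sampling_time_ms":136,"SCALING_FACTOR_UNITS_PER_SECOND":1000}'
stats = json.loads(line)

scale = stats["SCALING_FACTOR_UNITS_PER_SECOND"]  # ms per second
infer_s = (stats["inference_end_ms"] - stats["inference_start_ms"]) / scale
gen_s = (stats["inference_end_ms"] - stats["prompt_eval_end_ms"]) / scale
print(f"total inference: {infer_s:.3f} s")                                   # 1510.609, matches the log
print(f"generation rate: {stats['generated_tokens'] / gen_s:.6f} tokens/s")  # ~0.0777, matches the log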
