@alex4o
Created February 4, 2024 10:20
Vulkan Phi 2 Android
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "c3bdc368-aa45-4d59-b18c-b441ea5dd011",
"metadata": {},
"outputs": [],
"source": [
"import llama_cpp"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "53043669-5ce4-497c-a081-1d8ea69b1041",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"llama_model_loader: loaded meta data with 21 key-value pairs and 325 tensors from ../phi-2-orange.Q4_K_M.gguf (version GGUF V3 (latest))\n",
"llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.\n",
"llama_model_loader: - kv 0: general.architecture str = phi2\n",
"llama_model_loader: - kv 1: general.name str = Phi2\n",
"llama_model_loader: - kv 2: phi2.context_length u32 = 2048\n",
"llama_model_loader: - kv 3: phi2.embedding_length u32 = 2560\n",
"llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 10240\n",
"llama_model_loader: - kv 5: phi2.block_count u32 = 32\n",
"llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32\n",
"llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32\n",
"llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010\n",
"llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32\n",
"llama_model_loader: - kv 10: general.file_type u32 = 15\n",
"llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false\n",
"llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2\n",
"llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = [\"!\", \"\\\"\", \"#\", \"$\", \"%\", \"&\", \"'\", ...\n",
"llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\n",
"llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = [\"Ġ t\", \"Ġ a\", \"h e\", \"i n\", \"r e\",...\n",
"llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256\n",
"llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50295\n",
"llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256\n",
"llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 50256\n",
"llama_model_loader: - kv 20: general.quantization_version u32 = 2\n",
"llama_model_loader: - type f32: 195 tensors\n",
"llama_model_loader: - type q4_K: 81 tensors\n",
"llama_model_loader: - type q5_K: 32 tensors\n",
"llama_model_loader: - type q6_K: 17 tensors\n",
"llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).\n",
"llm_load_print_meta: format = GGUF V3 (latest)\n",
"llm_load_print_meta: arch = phi2\n",
"llm_load_print_meta: vocab type = BPE\n",
"llm_load_print_meta: n_vocab = 51200\n",
"llm_load_print_meta: n_merges = 50000\n",
"llm_load_print_meta: n_ctx_train = 2048\n",
"llm_load_print_meta: n_embd = 2560\n",
"llm_load_print_meta: n_head = 32\n",
"llm_load_print_meta: n_head_kv = 32\n",
"llm_load_print_meta: n_layer = 32\n",
"llm_load_print_meta: n_rot = 32\n",
"llm_load_print_meta: n_embd_head_k = 80\n",
"llm_load_print_meta: n_embd_head_v = 80\n",
"llm_load_print_meta: n_gqa = 1\n",
"llm_load_print_meta: n_embd_k_gqa = 2560\n",
"llm_load_print_meta: n_embd_v_gqa = 2560\n",
"llm_load_print_meta: f_norm_eps = 1.0e-05\n",
"llm_load_print_meta: f_norm_rms_eps = 0.0e+00\n",
"llm_load_print_meta: f_clamp_kqv = 0.0e+00\n",
"llm_load_print_meta: f_max_alibi_bias = 0.0e+00\n",
"llm_load_print_meta: n_ff = 10240\n",
"llm_load_print_meta: n_expert = 0\n",
"llm_load_print_meta: n_expert_used = 0\n",
"llm_load_print_meta: rope scaling = linear\n",
"llm_load_print_meta: freq_base_train = 10000.0\n",
"llm_load_print_meta: freq_scale_train = 1\n",
"llm_load_print_meta: n_yarn_orig_ctx = 2048\n",
"llm_load_print_meta: rope_finetuned = unknown\n",
"llm_load_print_meta: model type = 3B\n",
"llm_load_print_meta: model ftype = Q4_K - Medium\n",
"llm_load_print_meta: model params = 2.78 B\n",
"llm_load_print_meta: model size = 1.66 GiB (5.14 BPW) \n",
"llm_load_print_meta: general.name = Phi2\n",
"llm_load_print_meta: BOS token = 50256 '<|endoftext|>'\n",
"llm_load_print_meta: EOS token = 50295 '<|im_end|>'\n",
"llm_load_print_meta: UNK token = 50256 '<|endoftext|>'\n",
"llm_load_print_meta: PAD token = 50256 '<|endoftext|>'\n",
"llm_load_print_meta: LF token = 128 'Ä'\n",
"llm_load_tensors: ggml ctx size = 0.25 MiB\n",
"llm_load_tensors: offloading 32 repeating layers to GPU\n",
"llm_load_tensors: offloading non-repeating layers to GPU\n",
"llm_load_tensors: offloaded 33/33 layers to GPU\n",
"llm_load_tensors: Vulkan buffer size = 1634.32 MiB\n",
"llm_load_tensors: CPU buffer size = 70.31 MiB\n",
"..........................................................................................\n",
"llama_new_context_with_model: n_ctx = 256\n",
"llama_new_context_with_model: freq_base = 10000.0\n",
"llama_new_context_with_model: freq_scale = 1\n",
"llama_kv_cache_init: Vulkan KV buffer size = 80.00 MiB\n",
"llama_new_context_with_model: KV self size = 80.00 MiB, K (f16): 40.00 MiB, V (f16): 40.00 MiB\n",
"llama_new_context_with_model: Vulkan_Host input buffer size = 2.75 MiB\n",
"llama_new_context_with_model: Vulkan compute buffer size = 60.50 MiB\n",
"llama_new_context_with_model: Vulkan_Host compute buffer size = 2.75 MiB\n",
"llama_new_context_with_model: graph splits (measure): 3\n",
"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | \n",
"Model metadata: {'general.quantization_version': '2', 'tokenizer.ggml.padding_token_id': '50256', 'tokenizer.ggml.unknown_token_id': '50256', 'tokenizer.ggml.eos_token_id': '50295', 'tokenizer.ggml.bos_token_id': '50256', 'tokenizer.ggml.model': 'gpt2', 'tokenizer.ggml.add_bos_token': 'false', 'phi2.attention.head_count_kv': '32', 'general.file_type': '15', 'phi2.attention.head_count': '32', 'phi2.rope.dimension_count': '32', 'phi2.attention.layer_norm_epsilon': '0.000010', 'phi2.block_count': '32', 'phi2.embedding_length': '2560', 'phi2.context_length': '2048', 'phi2.feed_forward_length': '10240', 'general.name': 'Phi2', 'general.architecture': 'phi2'}\n"
]
}
],
"source": [
"model = llama_cpp.Llama(\n",
"    model_path=\"../phi-2-orange.Q4_K_M.gguf\",\n",
"    chat_format=\"chatml\",\n",
"    seed=123123123,\n",
"    n_gpu_layers=33,\n",
"    n_ctx=256\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "67b8dd90-a83b-4f8d-87e1-93fcb7bd7b61",
"metadata": {},
"outputs": [],
"source": [
"def ask(q):\n",
"    \"\"\"Stream a chat completion for question q, printing tokens as they arrive.\"\"\"\n",
"    for m in model.create_chat_completion(\n",
"        messages=[{\n",
"            \"role\": \"user\",\n",
"            \"content\": q\n",
"        }],\n",
"        stream=True\n",
"    ):\n",
"        if \"content\" in m[\"choices\"][0][\"delta\"]:\n",
"            print(m[\"choices\"][0][\"delta\"][\"content\"], end=\"\")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "53a357b7-8f48-4c1b-8ac3-338886ca7653",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The Eiffel Tower is located in Paris, France. It stands on the Champ de Mars in the 7th arrondissement of Paris. The iconic landmark was designed by Gustave Eiffel and completed in 1889 for the Exposition Universelle (World's Fair). Today, it remains one of the most recognizable structures worldwide and a symbol of both Paris and France. The Eiffel Tower attracts millions of visitors each year who come to admire its architectural beauty and enjoy panoramic views from its observation decks."
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 26815.85 ms\n",
"llama_print_timings: sample time = 199.17 ms / 108 runs ( 1.84 ms per token, 542.24 tokens per second)\n",
"llama_print_timings: prompt eval time = 26815.43 ms / 27 tokens ( 993.16 ms per token, 1.01 tokens per second)\n",
"llama_print_timings: eval time = 40759.91 ms / 107 runs ( 380.93 ms per token, 2.63 tokens per second)\n",
"llama_print_timings: total time = 71708.37 ms / 134 tokens\n"
]
}
],
"source": [
"ask(\"In which city can we find the Eiffel Tower?\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "ebdd3410-42b0-4e69-8bfc-2b560e873349",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Llama.generate: prefix-match hit\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"San Francisco, California is known for its diverse culinary scene and unique seafood offerings. Here are five of the best burrito places to try in the city:\n",
"\n",
"1. The Burrito Company (The Burrito Company) - Located at 1610 Mission St., this popular chain offers a variety of delicious burritos with fresh ingredients like grilled shrimp, avocado, and cilantro. Their menu also includes vegetarian options like black bean, corn, and avocado burritos. With locations across the Bay Area, you can easily find one near you.\n",
"\n",
"2. Burrito Bell (Burrito Bell) - This family-owned restaurant offers a wide selection of burritos with a focus on quality ingredients like fresh fish, grilled vegetables, and homemade tortillas. Their menu includes popular options like the Fish & Veggie Burrito and the Chicken & Rice Burrito. With locations throughout San Francisco, you can easily find one nearby.\n",
"\n",
"3. Burrito Fiesta (Burrito Fiesta) - Known for their delicious fish tacos and burritos, this popular restaurant offers a variety of options like the Fish & Rice Burrito and"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 26815.85 ms\n",
"llama_print_timings: sample time = 446.29 ms / 228 runs ( 1.96 ms per token, 510.88 tokens per second)\n",
"llama_print_timings: prompt eval time = 26309.46 ms / 19 tokens ( 1384.71 ms per token, 0.72 tokens per second)\n",
"llama_print_timings: eval time = 97429.58 ms / 227 runs ( 429.21 ms per token, 2.33 tokens per second)\n",
"llama_print_timings: total time = 131404.34 ms / 246 tokens\n"
]
}
],
"source": [
"ask(\"What are the 5 best burrito places in San Francisco, CA?\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f82fa2f-4d60-4bf9-b87f-93429629b345",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
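For reference, the per-token figures that llama.cpp prints in the `llama_print_timings` lines above can be recomputed directly from the raw millisecond totals and run counts. A minimal sketch in plain Python (no llama.cpp dependency; the `throughput` helper here is our own illustration, not part of llama-cpp-python):

```python
def throughput(total_ms: float, runs: int) -> tuple[float, float]:
    """Return (ms per token, tokens per second) from a timing total and run count."""
    ms_per_token = total_ms / runs
    tokens_per_second = runs / (total_ms / 1000.0)
    return ms_per_token, tokens_per_second

# Eval phase of the first question: 40759.91 ms over 107 runs.
ms_tok, tok_s = throughput(40759.91, 107)
print(f"{ms_tok:.2f} ms per token, {tok_s:.2f} tokens per second")
# → 380.93 ms per token, 2.63 tokens per second (matches the log line)
```

Roughly 2.6 tokens per second of generation on this device, with prompt eval much slower (~1 token/s) despite all 33 layers being offloaded to the Vulkan backend.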