% ./main -m ../mixtral/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -t 8 -n 1 -p "<|system|>You are a helpful assistant that only ever answers Yes or No. You answer one word then immediately stop, with nothing else in the response, not even punctuation or space, just one word Yes or no.<|im_end|><|im_start|>Is 11134 an even number? Answer yes or no, nothing else.</s><|assistant|>"
Log start
main: build = 2392 (bb6d00bb)
main: built with Apple clang version 14.0.3 (clang-1403.0.22.14.1) for arm64-apple-darwin23.4.0
main: seed = 1713222923
llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from ../mixtral/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mixtral-8x7b-instruct-v0.1
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.expert_count u32 = 8
llama_model_loader: - kv 10: llama.expert_used_count u32 = 2
llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: general.file_type u32 = 17
llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 32 tensors
llama_model_loader: - type q8_0: 64 tensors
llama_model_loader: - type q5_K: 833 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 8
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: causal attm = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 30.02 GiB (5.52 BPW)
llm_load_print_meta: general.name = mistralai_mixtral-8x7b-instruct-v0.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.76 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 8192.00 MiB, offs = 0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 8192.00 MiB, offs = 8482390016
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 8192.00 MiB, offs = 16964780032
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 6381.27 MiB, offs = 25447170048, (30957.33 / 10922.67)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal buffer size = 30649.56 MiB
llm_load_tensors: CPU buffer size = 85.94 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Pro
ggml_metal_init: picking default device: Apple M2 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/verdagon/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 64.00 MiB, (31023.14 / 10922.67)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llama_kv_cache_init: Metal KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: CPU input buffer size = 10.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 114.53 MiB, (31137.67 / 10922.67)ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llama_new_context_with_model: Metal compute buffer size = 114.53 MiB
llama_new_context_with_model: CPU compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 2
ggml_metal_graph_compute: command buffer 6 failed with status 5
system_info: n_threads = 8 / 12 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = 1, n_keep = 1
<|system|>You are a helpful assistant that only ever answers Yes or No. You answer one word then immediately stop, with nothing else in the response, not even punctuation or space, just one word Yes or no.<|im_end|><|im_start|>Is 11134 an even number? Answer yes or no, nothing else.<|assistant|>ggml_metal_graph_compute: command buffer 6 failed with status 5
llama_print_timings: load time = 167843.41 ms
llama_print_timings: sample time = 0.54 ms / 1 runs ( 0.54 ms per token, 1869.16 tokens per second)
llama_print_timings: prompt eval time = 192299.17 ms / 86 tokens ( 2236.04 ms per token, 0.45 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 192319.20 ms / 87 tokens
ggml_metal_free: deallocating
Log end
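
A likely reading of this failure, based only on the warnings in the log above: "command buffer 6 failed with status 5" is Metal's generic command-buffer error status, and it appears right after ggml reports 30957.33 MiB allocated against a recommendedMaxWorkingSetSize of 11453.25 MB. The Q5_K_M Mixtral weights (30.02 GiB) simply do not fit in this M2 Pro's Metal working set, so the GPU run fails rather than producing a token (note eval time = 0.00 ms). One workaround sketch, not verified on this machine: keep the layers off the GPU with llama.cpp's -ngl / --n-gpu-layers flag, accepting much slower CPU inference (the model must still fit in system RAM plus swap).

# Same invocation, but with -ngl 0 so no layers are offloaded to Metal
# and the ~11 GiB GPU working-set limit no longer applies.
# $PROMPT stands for the same prompt string used in the run above.
% ./main -m ../mixtral/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -t 8 -n 1 -ngl 0 -p "$PROMPT"

A middle ground would be a small partial offload (e.g. -ngl 8) sized so the offloaded layers stay under recommendedMaxWorkingSetSize; whether any Mixtral quantization fits entirely on a 16 GB machine is doubtful, since even the smallest common quants of a 46.70 B-parameter model exceed 11 GiB.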