@willy-liu
Last active June 28, 2025 06:44
(bitnet-cpp) willy@linux2025:~/Desktop/linux2025/term-project/BitNet$ uftrace record ./build/bin/llama-cli -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "why sky blue" -n 50 --temp 0
(bitnet-cpp) willy@linux2025:~/Desktop/linux2025/term-project/BitNet$ uftrace report
  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================
     1.057 m  118.955 us          10  ggml_graph_compute_secondary_thread
     1.056 m   85.570 ms         306  ggml_graph_compute_thread
     1.050 m  134.310 ms      341496  ggml_compute_forward
     1.049 m  453.917 ms       82926  ggml_compute_forward_mul_mat
     1.047 m    1.953 ms         730  std::condition_variable::wait
     1.006 m     6.459 s      715668  ggml_compute_forward_mul_mat_one_chunk
    59.627 s    8.976 us           2  llama_load_model_from_file
    53.581 s    9.249 ms       53676  std::__invoke_impl
    53.532 s   53.532 s          475  linux:schedule
    53.522 s  488.271 us           1  main
    53.511 s    0.394 us           1  std::thread::_State_impl::_M_run
    53.511 s    0.243 us           1  std::thread::_Invoker::operator()
    53.511 s    0.233 us           1  std::thread::_Invoker::_M_invoke
    53.511 s    0.222 us           1  std::__invoke
    53.511 s  356.595 us           1  common_log::resume::$_0::operator()
    41.933 s   51.806 us         102  llama_decode
    40.352 s   37.444 s     37528064  ggml_vec_dot_i2_i8_s
    38.564 s   20.057 s       424062  ggml_barrier
    30.567 s   10.666 us           1  common_init_from_params
    29.813 s    3.619 us           1  llama_model_load
    25.760 s  655.968 ms           1  llm_load_vocab
    23.430 s   23.362 s    219431233  ggml_thread_cpu_relax
    20.966 s  565.998 us          51  llama_decode_internal
    20.692 s  188.142 us         265  ggml_graph_compute_check_for_work
    20.691 s   12.213 s          265  ggml_graph_compute_poll_for_work
    19.392 s  134.185 us          51  llama_graph_compute
    19.392 s   14.396 us          51  ggml_backend_sched_graph_compute_async
    19.392 s   43.438 us          51  ggml_backend_sched_compute_splits
    19.392 s   17.850 us          51  ggml_backend_graph_compute_async
    19.392 s   53.413 us          51  ggml_backend_cpu_graph_compute
    19.343 s   36.156 us          51  ggml_graph_compute
    18.876 s   18.866 s     11763456  ggml_vec_dot_f16
    10.305 s  242.974 ms      280147  std::map::emplace
     7.129 s   27.913 ms      283667  std::map::lower_bound
     7.101 s   70.411 ms      283667  std::_Rb_tree::lower_bound
     7.087 s    4.670 s        11538  quantize_row_i8_s
     6.977 s    1.007 s       287758  std::_Rb_tree::_M_lower_bound
     6.603 s   20.279 ms      128256  llama_token_to_piece::cxx11
     6.583 s   13.725 ms      130209  llama_token_to_piece
     6.569 s   68.857 ms      130155  llama_token_to_piece_impl
     6.351 s  338.351 ms      129591  llama_decode_text
     6.254 s    2.722 s     17604382  std::operator<
     4.376 s   76.622 ms      871921  unicode_utf8_to_byte
     4.298 s   77.186 ms      872268  std::unordered_map::at
     4.221 s  132.559 ms      872268  std::__detail::_Map_base::at
     4.174 s  565.350 ms     5865830  std::less::operator()
     4.174 s  508.574 ms      409750  unicode_cpts_from_utf8
     3.874 s  380.694 ms      873171  std::_Hashtable::find
     3.770 s    1.000 ms           1  llama_model_loader::llama_model_loader
     3.540 s  431.597 ms          44  gguf_kv_to_str::cxx11
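The report is sorted by total time, but the self-time column is what pinpoints the hot loops: `ggml_vec_dot_i2_i8_s` alone accounts for 37.444 s of self time, with `ggml_thread_cpu_relax` (spin-waiting) and `ggml_barrier` close behind. A minimal Python sketch — illustrative only, not part of the original session — that parses report rows like the ones above and re-ranks functions by self time:

```python
import re

# Unit multipliers for uftrace time values ("m" is minutes).
UNITS = {"m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6}

# One report row: total time, self time, call count, function name.
# Longer unit alternatives come first so "ms"/"us" win over "m"/"s".
LINE = re.compile(
    r"\s*([\d.]+)\s+(ms|us|m|s)"   # total time value + unit
    r"\s+([\d.]+)\s+(ms|us|m|s)"   # self time value + unit
    r"\s+(\d+)\s+(\S+)"            # call count, function name
)

def to_seconds(value, unit):
    return float(value) * UNITS[unit]

def top_self(report, n=3):
    """Rank functions by self time (in seconds), largest first."""
    rows = []
    for line in report.splitlines():
        m = LINE.match(line)
        if m:  # header and separator lines simply don't match
            _tot, _tu, self_v, self_u, calls, func = m.groups()
            rows.append((func, to_seconds(self_v, self_u), int(calls)))
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:n]
```

Fed the report above, this surfaces the i2_s dot-product kernel and the thread spin/barrier overhead as the dominant self-time consumers, which is where quantized-matmul optimization effort would pay off.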