@andriihomiak
Last active February 28, 2024 06:58
llama.cpp on cato report

Table of Contents

  1. Run CPU inference of LLaMA on cato-poc
    1. CLI setup
    2. Run ubuntu container and start transmission
    3. llama.cpp setup
      1. llama.cpp job
      2. Model Conversion
      3. Inference
      4. Model quantization
      5. Smaller models
      6. Performance summary

Run CPU inference of LLaMA on cato-poc

CLI setup

neuro config switch-cluster cato-poc
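
To double-check which cluster the CLI is pointed at before launching jobs, the current configuration can be printed (a hedged extra step, not part of the original run; assumes the standard neuro config subcommand):

neuro config show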

Run ubuntu container and start transmission

Using cpu-huge preset:

neuro run --preset cpu-huge --detach -v storage://cato-poc/andriikhomiak/llama/data:/var/data --name transmission ubuntu -- sleep infinity

Installing dependencies:

neuro exec transmission -- /bin/bash -c 'apt update && apt upgrade -y && apt install -y transmission-daemon tmux'

Starting transmission-daemon:

neuro exec transmission -- tmux new-session -s transmission-daemon -d 'transmission-daemon -w /var/data -f'

Adding torrent:

neuro exec transmission -- transmission-remote --add "magnet:?xt=urn:btih:ZXXDAUWYLRUXXBHUYEMS6Q5CE5WA3LVA&dn=LLaMA"

Start the torrent:

neuro exec transmission -- transmission-remote -t LLaMA -s

To view the progress:

neuro exec transmission -- transmission-remote -l
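
For a periodically refreshing view (a convenience sketch, not used in the original run), the listing can be wrapped in a plain shell loop:

neuro exec transmission -- bash -c 'while true; do transmission-remote -l; sleep 60; done'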

Once the torrent has finished, kill the transmission tmux session:

neuro exec transmission -- tmux kill-session -t transmission-daemon

Stop the transmission job:

neuro kill transmission
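
Before moving on, the downloaded weights can be confirmed on storage (a hedged check, not part of the original run; assumes the neuro CLI's storage listing syntax):

neuro ls storage://cato-poc/andriikhomiak/llama/data/LLaMA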

llama.cpp setup

llama.cpp job

neuro run --name llama-cpp --preset cpu-huge --detach -v storage://cato-poc/andriikhomiak/llama/data/LLaMA:/models --entrypoint "/bin/bash" ghcr.io/ggerganov/llama.cpp:full -c "sleep infinity"

Install some utilities for resource-usage monitoring, plus git:

neuro exec llama-cpp -- apt install -y htop git tmux

First we need to rebuild the binaries (otherwise we get segmentation faults):

neuro exec llama-cpp -- make -B
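
A quick sanity check that the rebuilt binary starts at all (a hedged extra step, not part of the original report) is to print its usage text:

neuro exec llama-cpp -- ./main --help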

Model Conversion

  1. 16-bit

    Start f16 model conversion:

    neuro exec llama-cpp -- tmux new-session -s convert-f16 -d 'python3 ./convert-pth-to-ggml.py /models/65B 1'
    

    To view the 16-bit conversion progress:

    neuro exec llama-cpp -- tmux a -t convert-f16
    
  2. 32-bit (TAKES A LOT OF SPACE)

    Due to storage limitations of the current cato-poc setup, the f32 conversion was not used.

    Start f32 model conversion:

    neuro exec llama-cpp -- tmux new-session -s convert-f32 -d 'python3 ./convert-pth-to-ggml.py /models/65B 0'
    

    To view the 32-bit conversion progress:

    neuro exec llama-cpp -- tmux a -t convert-f32
    
  3. Misc

    Monitor storage usage:

    neuro exec llama-cpp -- watch -n 10 -d du -hs /models/65B/*
    

Inference

Resource usage can then be monitored with htop:

neuro exec llama-cpp -- htop

To run inference:

neuro exec llama-cpp -- bash -c './main -m /models/65B/ggml-model-f16.bin -p "Hello!" -s 42 -t $(nproc) -n 256'

main: build = 0 (2d13786)
main: seed  = 42
llama.cpp: loading model from /models/65B/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 169.45 KB
llama_model_load_internal: mem required  = 128109.20 MB (+ 5120.00 MB per state)
llama_init_from_file: kv self size  = 1280.00 MB

system_info: n_threads = 56 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0


 Hello! Welcome to the site of my new novel, The Paper Cell.
Following a successful crowdfunding campaign on Publishizer, I am now under contract with Pegasus Elliot Mackenzie Publishers to publish this book in 2018. More details will follow shortly. [end of text]

llama_print_timings:        load time =  9968.28 ms
llama_print_timings:      sample time =    36.64 ms /    63 runs   (    0.58 ms per run)
llama_print_timings: prompt eval time =  3834.07 ms /     3 tokens ( 1278.02 ms per token)
llama_print_timings:        eval time = 231021.07 ms /    62 runs   ( 3726.15 ms per run)
llama_print_timings:       total time = 241053.44 ms
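
Since a single 65B f16 generation takes several minutes, the run can also be wrapped in tmux like the conversion steps, with output redirected to a log file (a hedged variant, not used for the timings above; the log path is illustrative):

neuro exec llama-cpp -- tmux new-session -s infer-65B-f16 -d './main -m /models/65B/ggml-model-f16.bin -p "Hello!" -s 42 -t $(nproc) -n 256 > /models/65B/infer-f16.log 2>&1'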

Model quantization

  1. Q4_0

    To start quantization of the f16 model to q4_0 format:

    neuro exec llama-cpp -- tmux new-session -s quantize-f16-q4_0 -d './quantize /models/65B/ggml-model-f16.bin q4_0 $(nproc)'
    

    To monitor the process of quantization:

    neuro exec llama-cpp -- tmux a -t quantize-f16-q4_0
    

    Inference of the quantized model:

    neuro exec llama-cpp -- bash -c './main -m /models/65B/ggml-model-q4_0.bin -p "Hello!" -s 42 -t $(nproc) -n 256'
    
    main: build = 0 (2d13786)
    main: seed  = 42
    llama.cpp: loading model from /models/65B/ggml-model-q4_0.bin
    llama_model_load_internal: format     = ggjt v1 (latest)
    llama_model_load_internal: n_vocab    = 32000
    llama_model_load_internal: n_ctx      = 512
    llama_model_load_internal: n_embd     = 8192
    llama_model_load_internal: n_mult     = 256
    llama_model_load_internal: n_head     = 64
    llama_model_load_internal: n_layer    = 80
    llama_model_load_internal: n_rot      = 128
    llama_model_load_internal: ftype      = 2 (mostly Q4_0)
    llama_model_load_internal: n_ff       = 22016
    llama_model_load_internal: n_parts    = 1
    llama_model_load_internal: model size = 65B
    llama_model_load_internal: ggml ctx size = 169.45 KB
    llama_model_load_internal: mem required  = 42501.70 MB (+ 5120.00 MB per state)
    llama_init_from_file: kv self size  = 1280.00 MB
    
    system_info: n_threads = 56 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
    sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
    generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0
    
    
     Hello! We are very excited to be starting our 3rd year here at the Learning Lodge.
    Our mission is "To provide a safe learning environment in which each child can experience growth spiritually, socially, emotionally and physically through creative play."
    Our goal is to help prepare your children for their future by giving them tools to learn, create & explore. We will be using the A Beka curriculum along with supplemental lessons in a classroom setting.
    The Learning Lodge has been serving the Prescott Valley community since 1987. Our Director, Dianne Frost, holds an Early Childhood Diploma and is certified in CPR & First Aid for children and adults as well as MAT (medication administration training). She also attends many workshops to keep up with the latest trends in childcare/education. Her staff are all fingerprint cleared, have been background checked and hold current First Aid, CPR & Medication certifications.
    Our goal is to help prepare your children for their future by giving them tools to learn, create & explore through a variety of developmentally appropriate activities in a classroom setting
    llama_print_timings:        load time = 15482.74 ms
    llama_print_timings:      sample time =   150.95 ms /   256 runs   (    0.59 ms per run)
    llama_print_timings: prompt eval time =  1864.05 ms /     3 tokens (  621.35 ms per token)
    llama_print_timings:        eval time = 334274.16 ms /   255 runs   ( 1310.88 ms per run)
    llama_print_timings:       total time = 349994.59 ms
    
  2. Q8_0

    To start quantization of the f16 model to q8_0 format:

    neuro exec llama-cpp -- tmux new-session -s quantize-f16-q8_0 -d './quantize /models/65B/ggml-model-f16.bin q8_0 $(nproc)'
    

    To monitor the process of quantization:

    neuro exec llama-cpp -- tmux a -t quantize-f16-q8_0
    

    Inference of the quantized model:

    neuro exec llama-cpp -- bash -c './main -m /models/65B/ggml-model-q8_0.bin -p "Hello!" -s 42 -t $(nproc) -n 256'
    
    main: build = 1 (2d13786)
    main: seed  = 42
    llama.cpp: loading model from /models/65B/ggml-model-q8_0.bin
    llama_model_load_internal: format     = ggjt v1 (latest)
    llama_model_load_internal: n_vocab    = 32000
    llama_model_load_internal: n_ctx      = 512
    llama_model_load_internal: n_embd     = 8192
    llama_model_load_internal: n_mult     = 256
    llama_model_load_internal: n_head     = 64
    llama_model_load_internal: n_layer    = 80
    llama_model_load_internal: n_rot      = 128
    llama_model_load_internal: ftype      = 7 (mostly Q8_0)
    llama_model_load_internal: n_ff       = 22016
    llama_model_load_internal: n_parts    = 1
    llama_model_load_internal: model size = 65B
    llama_model_load_internal: ggml ctx size = 169.45 KB
    llama_model_load_internal: mem required  = 73631.70 MB (+ 5120.00 MB per state)
    llama_init_from_file: kv self size  = 1280.00 MB
    
    system_info: n_threads = 56 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
    sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
    generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0
    
    
     Hello! Welcome to the site of author, blogger and freelance writer D.E. Haggerty, who also writes under the pen name A.R. Winters. Whether you’re here for book reviews, author interviews or just a bit of writing fun, grab your favorite beverage and make yourself at home. I’m delighted to have you visit!
    When you were young did anyone ever tell you that you couldn’t do something? Or maybe they said that it wasn’t lady-like, manly or appropriate for someone of your age? Then I hope you’ll be inspired by my middle grade novel My Life in Middles – A Story about a Girl Who Lived on the Border of Nothing.
    Molly Bennett moves from one town to another with alarming regularity. As soon as she makes friends and puts down roots, her mother packs up Molly’s things and they move once again. When Molly finds herself in Middles, a small town of 613 people in northern Wisconsin, it might be the final straw. Her mother promises that this will be their last move (for now) as long as everything goes well with her job at the university – teaching
    llama_print_timings:        load time = 174643.77 ms
    llama_print_timings:      sample time =   158.91 ms /   256 runs   (    0.62 ms per run)
    llama_print_timings: prompt eval time =  2714.20 ms /     3 tokens (  904.73 ms per token)
    llama_print_timings:        eval time = 567679.63 ms /   255 runs   ( 2226.19 ms per run)
    llama_print_timings:       total time = 742582.40 ms
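
After quantization, the on-disk sizes of the three 65B variants can be compared directly (a hedged check, not part of the original run):

neuro exec llama-cpp -- bash -c 'ls -lh /models/65B/ggml-model-*.bin'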
    

Smaller models

  1. 30B

    1. Convert

      neuro exec llama-cpp -- tmux new-session -s convert-f16-30B -d 'python3 ./convert-pth-to-ggml.py /models/30B 1'
      

      Monitor progress:

      neuro exec llama-cpp -- tmux a -t convert-f16-30B
      
    2. Quantize

      1. Q4_0

        neuro exec llama-cpp -- tmux new-session -s quantize-f16-30B-q4_0 -d './quantize /models/30B/ggml-model-f16.bin q4_0 $(nproc)'
        

        Monitor progress:

        neuro exec llama-cpp -- tmux a -t quantize-f16-30B-q4_0
        
      2. Q8_0

        neuro exec llama-cpp -- tmux new-session -s quantize-f16-30B-q8_0 -d './quantize /models/30B/ggml-model-f16.bin q8_0 $(nproc)'
        

        Monitor progress:

        neuro exec llama-cpp -- tmux a -t quantize-f16-30B-q8_0
        
    3. Inference

      1. f16

        neuro exec llama-cpp -- bash -c './main -m /models/30B/ggml-model-f16.bin -p "Hello!" -s 42 -t $(nproc) -n 256'
        
        main: build = 1 (95078cc)
        main: seed  = 42
        llama.cpp: loading model from /models/30B/ggml-model-f16.bin
        llama_model_load_internal: format     = ggjt v1 (latest)
        llama_model_load_internal: n_vocab    = 32000
        llama_model_load_internal: n_ctx      = 512
        llama_model_load_internal: n_embd     = 6656
        llama_model_load_internal: n_mult     = 256
        llama_model_load_internal: n_head     = 52
        llama_model_load_internal: n_layer    = 60
        llama_model_load_internal: n_rot      = 128
        llama_model_load_internal: ftype      = 1 (mostly F16)
        llama_model_load_internal: n_ff       = 17920
        llama_model_load_internal: n_parts    = 1
        llama_model_load_internal: model size = 30B
        llama_model_load_internal: ggml ctx size = 127.27 KB
        llama_model_load_internal: mem required  = 64349.70 MB (+ 3124.00 MB per state)
        llama_init_from_file: kv self size  =  780.00 MB
        
        system_info: n_threads = 56 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
        sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
        generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0
        
        
         Hello! Welcome to the world of Blog!
        I have been working for a non-profit organization for 2 years now as a volunteer. I never thought I could be so committed and dedicated to something that does not even benefit me financially. In fact, it has costed me money – money that I can put to better use for my education.
        I have been blessed with the opportunity to help many people by doing fundraisers and events for the organization. I was able to share about our mission and vision and how we are committed to serving our community in the best possible way. And that is what makes me happy. Making a difference in this world and helping others who are less fortunate than myself.
        I have always been passionate about volunteering. As a result, I believe it is my responsibility to give back to society as much as I can, and being able to serve in the community gives me that opportunity. There are many great things we could do to help our world become a better place by simply helping the underprivileged members of our society. And by doing this, not only am I giving them hope and changing their lives for the better – but I am also changing my own life in a positive way as
        llama_print_timings:        load time = 144145.99 ms
        llama_print_timings:      sample time =   184.64 ms /   256 runs   (    0.72 ms per run)
        llama_print_timings: prompt eval time =  2165.80 ms /     3 tokens (  721.93 ms per token)
        llama_print_timings:        eval time = 486918.37 ms /   255 runs   ( 1909.48 ms per run)
        llama_print_timings:       total time = 631348.62 ms
        
      2. Q4_0

        neuro exec llama-cpp -- bash -c './main -m /models/30B/ggml-model-q4_0.bin -p "Hello!" -s 42 -t $(nproc) -n 256'
        
        main: build = 1 (95078cc)
        main: seed  = 42
        llama.cpp: loading model from /models/30B/ggml-model-q4_0.bin
        llama_model_load_internal: format     = ggjt v1 (latest)
        llama_model_load_internal: n_vocab    = 32000
        llama_model_load_internal: n_ctx      = 512
        llama_model_load_internal: n_embd     = 6656
        llama_model_load_internal: n_mult     = 256
        llama_model_load_internal: n_head     = 52
        llama_model_load_internal: n_layer    = 60
        llama_model_load_internal: n_rot      = 128
        llama_model_load_internal: ftype      = 2 (mostly Q4_0)
        llama_model_load_internal: n_ff       = 17920
        llama_model_load_internal: n_parts    = 1
        llama_model_load_internal: model size = 30B
        llama_model_load_internal: ggml ctx size = 127.27 KB
        llama_model_load_internal: mem required  = 21695.48 MB (+ 3124.00 MB per state)
        llama_init_from_file: kv self size  =  780.00 MB
        
        system_info: n_threads = 56 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
        sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
        generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0
        
        
         Hello! Welcome to the Mental Health and Wellbeing page for the Student Guild. In this section you will find all of the events, campaigns and opportunities that I am organising with my team on behalf of students at the University of Manchester.
        Throughout the year we will be running campaigns to raise awareness and promote positive mental health amongst the student population. This academic year our main focus is on reducing stigma around mental illness, by tackling negative perceptions through informing and educating students.
        I am very passionate about my portfolio area and I hope that you will be too! Please feel free to contact me at any point with your feedback, ideas or if you would like some support. [end of text]
        
        llama_print_timings:        load time = 55627.24 ms
        llama_print_timings:      sample time =    99.31 ms /   154 runs   (    0.64 ms per run)
        llama_print_timings: prompt eval time =  1514.71 ms /     3 tokens (  504.90 ms per token)
        llama_print_timings:        eval time = 93902.14 ms /   153 runs   (  613.74 ms per run)
        llama_print_timings:       total time = 149679.96 ms
        
      3. Q8_0

        neuro exec llama-cpp -- bash -c './main -m /models/30B/ggml-model-q8_0.bin -p "Hello!" -s 42 -t $(nproc) -n 256'
        
        main: build = 1 (95078cc)
        main: seed  = 42
        llama.cpp: loading model from /models/30B/ggml-model-q8_0.bin
        llama_model_load_internal: format     = ggjt v1 (latest)
        llama_model_load_internal: n_vocab    = 32000
        llama_model_load_internal: n_ctx      = 512
        llama_model_load_internal: n_embd     = 6656
        llama_model_load_internal: n_mult     = 256
        llama_model_load_internal: n_head     = 52
        llama_model_load_internal: n_layer    = 60
        llama_model_load_internal: n_rot      = 128
        llama_model_load_internal: ftype      = 7 (mostly Q8_0)
        llama_model_load_internal: n_ff       = 17920
        llama_model_load_internal: n_parts    = 1
        llama_model_load_internal: model size = 30B
        llama_model_load_internal: ggml ctx size = 127.27 KB
        llama_model_load_internal: mem required  = 37206.10 MB (+ 3124.00 MB per state)
        llama_init_from_file: kv self size  =  780.00 MB
        
        system_info: n_threads = 56 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
        sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
        generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0
        
        
         Hello! Welcome to the last post of 2018. I am so glad we made it another year, and I am looking forward to what lies ahead in 2019.
        On today’s podcast, I am giving you a recap of my year in blogging. It was not easy for me to do this, but here goes nothing! [end of text]
        
        llama_print_timings:        load time = 89717.80 ms
        llama_print_timings:      sample time =    56.47 ms /    76 runs   (    0.74 ms per run)
        llama_print_timings: prompt eval time =  1178.37 ms /     3 tokens (  392.79 ms per token)
        llama_print_timings:        eval time = 80210.79 ms /    75 runs   ( 1069.48 ms per run)
        llama_print_timings:       total time = 170012.43 ms
        
  2. 13B

    1. Convert

      neuro exec llama-cpp -- tmux new-session -s convert-f16-13B -d 'python3 ./convert-pth-to-ggml.py /models/13B 1'
      

      Monitor progress:

      neuro exec llama-cpp -- tmux a -t convert-f16-13B
      
    2. Quantize

      1. Q4_0

        neuro exec llama-cpp -- tmux new-session -s quantize-f16-13B-q4_0 -d './quantize /models/13B/ggml-model-f16.bin q4_0 $(nproc)'
        

        Monitor progress:

        neuro exec llama-cpp -- tmux a -t quantize-f16-13B-q4_0
        
      2. Q8_0

        neuro exec llama-cpp -- tmux new-session -s quantize-f16-13B-q8_0 -d './quantize /models/13B/ggml-model-f16.bin q8_0 $(nproc)'
        

        Monitor progress:

        neuro exec llama-cpp -- tmux a -t quantize-f16-13B-q8_0
        
    3. Inference

      1. f16

        neuro exec llama-cpp -- bash -c './main -m /models/13B/ggml-model-f16.bin -p "Hello!" -s 42 -t $(nproc) -n 256'
        
        main: build = 1 (95078cc)
        main: seed  = 42
        llama.cpp: loading model from /models/13B/ggml-model-f16.bin
        llama_model_load_internal: format     = ggjt v1 (latest)
        llama_model_load_internal: n_vocab    = 32000
        llama_model_load_internal: n_ctx      = 512
        llama_model_load_internal: n_embd     = 5120
        llama_model_load_internal: n_mult     = 256
        llama_model_load_internal: n_head     = 40
        llama_model_load_internal: n_layer    = 40
        llama_model_load_internal: n_rot      = 128
        llama_model_load_internal: ftype      = 1 (mostly F16)
        llama_model_load_internal: n_ff       = 13824
        llama_model_load_internal: n_parts    = 1
        llama_model_load_internal: model size = 13B
        llama_model_load_internal: ggml ctx size =  85.08 KB
        llama_model_load_internal: mem required  = 26874.67 MB (+ 1608.00 MB per state)
        llama_init_from_file: kv self size  =  400.00 MB
        
        system_info: n_threads = 56 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
        sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
        generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0
        
        
         Hello! We are back from our trip to the UK. It was quite a crazy trip and we had some really good, but also bad experiences... But this post is not about that.
        I've been away for almost 3 weeks and I haven't posted in all that time! The reason is simple, it took me quite long to get back into the routine of blogging again. When I got back from a holiday, I usually spend at least a week catching up with my normal routines. And then I need some time to relax and unwind... So now that I have caught up (more or less) with everything, it is time for me to get started on the blog again!
        Today's post is about the new release from the Born Pretty Store - 10 Colors Nail Pen.
        I've already used this nail pen in a previous video and showed you how to use it, but I wanted to give it another try to see if it works better for me than before. For some reason, maybe the nails were too short, the first time didn't turn out that well... But after today's manicure, I definitely think this is a great product!
        I
        llama_print_timings:        load time = 53105.75 ms
        llama_print_timings:      sample time =   181.55 ms /   256 runs   (    0.71 ms per run)
        llama_print_timings: prompt eval time =   871.91 ms /     3 tokens (  290.64 ms per token)
        llama_print_timings:        eval time = 209508.93 ms /   255 runs   (  821.60 ms per run)
        llama_print_timings:       total time = 262883.21 ms
        
      2. Q4_0

        neuro exec llama-cpp -- bash -c './main -m /models/13B/ggml-model-q4_0.bin -p "Hello!" -s 42 -t $(nproc) -n 256'
        
        main: build = 1 (95078cc)
        main: seed  = 42
        llama.cpp: loading model from /models/13B/ggml-model-q4_0.bin
        llama_model_load_internal: format     = ggjt v1 (latest)
        llama_model_load_internal: n_vocab    = 32000
        llama_model_load_internal: n_ctx      = 512
        llama_model_load_internal: n_embd     = 5120
        llama_model_load_internal: n_mult     = 256
        llama_model_load_internal: n_head     = 40
        llama_model_load_internal: n_layer    = 40
        llama_model_load_internal: n_rot      = 128
        llama_model_load_internal: ftype      = 2 (mostly Q4_0)
        llama_model_load_internal: n_ff       = 13824
        llama_model_load_internal: n_parts    = 1
        llama_model_load_internal: model size = 13B
        llama_model_load_internal: ggml ctx size =  85.08 KB
        llama_model_load_internal: mem required  = 9807.48 MB (+ 1608.00 MB per state)
        llama_init_from_file: kv self size  =  400.00 MB
        
        system_info: n_threads = 56 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
        sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
        generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0
        
        
         Hello! We are back from our trip to the Philippines. It was fun and so tiring. I will share with you some pics of my outfits while traveling, shopping for pasalubong (for us and for friends) and exploring a new place.
        My first look is one of my go-to style when going on trip – comfort meets fashion 🙂 I wore this comfy white shirt from Zara with printed black shorts which was very handy to wear because of the air conditioned rooms (haha). To add some color, I accessorized it with my favorite pair of pink studded shoes from Penshoppe and a yellow bag that I bought in Japan.
        My second look is one of my casual outfits when going around town or exploring the city 🙂 It’s nice to get away from your usual dressing up, but still be stylish by wearing bright colors with prints. This printed maxi skirt was so fun and comfy – I can run, sit, jump in it! But my favorite outfit are these two pieces: a tangerine orange jacket which looks really cool on me for its color
        llama_print_timings:        load time = 20817.56 ms
        llama_print_timings:      sample time =   169.65 ms /   256 runs   (    0.66 ms per run)
        llama_print_timings: prompt eval time =   437.21 ms /     3 tokens (  145.74 ms per token)
        llama_print_timings:        eval time = 94067.15 ms /   255 runs   (  368.89 ms per run)
        llama_print_timings:       total time = 115140.10 ms
        
      3. Q8_0

        neuro exec llama-cpp -- bash -c './main -m /models/13B/ggml-model-q8_0.bin -p "Hello!" -s 42 -t $(nproc) -n 256'
        
        main: build = 1 (95078cc)
        main: seed  = 42
        llama.cpp: loading model from /models/13B/ggml-model-q8_0.bin
        llama_model_load_internal: format     = ggjt v1 (latest)
        llama_model_load_internal: n_vocab    = 32000
        llama_model_load_internal: n_ctx      = 512
        llama_model_load_internal: n_embd     = 5120
        llama_model_load_internal: n_mult     = 256
        llama_model_load_internal: n_head     = 40
        llama_model_load_internal: n_layer    = 40
        llama_model_load_internal: n_rot      = 128
        llama_model_load_internal: ftype      = 7 (mostly Q8_0)
        llama_model_load_internal: n_ff       = 13824
        llama_model_load_internal: n_parts    = 1
        llama_model_load_internal: model size = 13B
        llama_model_load_internal: ggml ctx size =  85.08 KB
        llama_model_load_internal: mem required  = 16013.73 MB (+ 1608.00 MB per state)
        llama_init_from_file: kv self size  =  400.00 MB
        
        system_info: n_threads = 56 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
        sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
        generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0
        
        
         Hello! Welcome to the world of A.R.T.
        You might be familiar with some of our work already, but even if you’re not, rest assured that there are many more ways we can help you and your business succeed than you may have previously thought possible.
        Here at A.R.T., we are passionate about branding. We know from experience how important it is to get the right message across and in a style which engages and resonates with your target audience, but more than that, what’s just as important is making sure you stand out from your competitors and get noticed for all the right reasons.
        We create powerful visual communication through design, whether it be corporate identity, print materials or websites. We are a full service creative agency with over 20 years experience in branding and design. At A.R.T. we believe that the success of your business is dependent on how well you can communicate both internally and externally – and this means to us, being strategic about what needs to be said as much as it does about how you say it. Our creative team use these skills daily to help our clients engage their audiences.
        You can rely on A.R
        llama_print_timings:        load time = 27751.71 ms
        llama_print_timings:      sample time =   175.95 ms /   256 runs   (    0.69 ms per run)
        llama_print_timings: prompt eval time =   614.81 ms /     3 tokens (  204.94 ms per token)
        llama_print_timings:        eval time = 142526.13 ms /   255 runs   (  558.93 ms per run)
        llama_print_timings:       total time = 170539.20 ms
        
  3. 7B

    1. Convert

      neuro exec llama-cpp -- tmux new-session -s convert-f16-7B -d 'python3 ./convert-pth-to-ggml.py /models/7B 1'
      

      Monitor progress:

      neuro exec llama-cpp -- tmux a -t convert-f16-7B
      
    2. Quantize

      1. Q4_0

        neuro exec llama-cpp -- tmux new-session -s quantize-f16-7B-q4_0 -d './quantize /models/7B/ggml-model-f16.bin q4_0 $(nproc)'
        

        Monitor progress:

        neuro exec llama-cpp -- tmux a -t quantize-f16-7B-q4_0
        
      2. Q8_0

        neuro exec llama-cpp -- tmux new-session -s quantize-f16-7B-q8_0 -d './quantize /models/7B/ggml-model-f16.bin q8_0 $(nproc)'
        

        Monitor progress:

        neuro exec llama-cpp -- tmux a -t quantize-f16-7B-q8_0
        
    3. Inference

      1. f16

        neuro exec llama-cpp -- bash -c './main -m /models/7B/ggml-model-f16.bin -p "Hello!" -s 42 -t $(nproc) -n 256'
        
        main: build = 1 (95078cc)
        main: seed  = 42
        llama.cpp: loading model from /models/7B/ggml-model-f16.bin
        llama_model_load_internal: format     = ggjt v1 (latest)
        llama_model_load_internal: n_vocab    = 32000
        llama_model_load_internal: n_ctx      = 512
        llama_model_load_internal: n_embd     = 4096
        llama_model_load_internal: n_mult     = 256
        llama_model_load_internal: n_head     = 32
        llama_model_load_internal: n_layer    = 32
        llama_model_load_internal: n_rot      = 128
        llama_model_load_internal: ftype      = 1 (mostly F16)
        llama_model_load_internal: n_ff       = 11008
        llama_model_load_internal: n_parts    = 1
        llama_model_load_internal: model size = 7B
        llama_model_load_internal: ggml ctx size =  68.20 KB
        llama_model_load_internal: mem required  = 14645.08 MB (+ 1026.00 MB per state)
        llama_init_from_file: kv self size  =  256.00 MB
        
        system_info: n_threads = 56 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
        sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
        generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0
        
        
         Hello! Welcome to the Plymouth County Sheriff's Office website. We are here to answer any questions you may have about our office or services we provide, and most importantly to let you know that we truly care about your safety and well-being in our community.
        We are proud of the hard work and accomplishments made by all of those involved with this department in making Plymouth County a safe place to live, work, and raise families. Our law enforcement professionals do a great job every day keeping you and your family safe. Their tireless efforts and dedication to the community are second-to-none.
        As Sheriff of Plymouth County I hope that you find this website informative and helpful in accessing information about our agency, our services, and the communities we serve. Please feel free to contact us if there is anything we can do for you.
        Sheriff Mark A. Devlin [end of text]
        
        llama_print_timings:        load time =  5303.98 ms
        llama_print_timings:      sample time =   131.45 ms /   194 runs   (    0.68 ms per run)
        llama_print_timings: prompt eval time =   502.76 ms /     3 tokens (  167.59 ms per token)
        llama_print_timings:        eval time = 94994.38 ms /   193 runs   (  492.20 ms per run)
        llama_print_timings:       total time = 100494.23 ms
        
      2. Q4_0

        neuro exec llama-cpp -- bash -c './main -m /models/7B/ggml-model-q4_0.bin -p "Hello!" -s 42 -t $(nproc) -n 256'
        
        main: build = 1 (95078cc)
        main: seed  = 42
        llama.cpp: loading model from /models/7B/ggml-model-q4_0.bin
        llama_model_load_internal: format     = ggjt v1 (latest)
        llama_model_load_internal: n_vocab    = 32000
        llama_model_load_internal: n_ctx      = 512
        llama_model_load_internal: n_embd     = 4096
        llama_model_load_internal: n_mult     = 256
        llama_model_load_internal: n_head     = 32
        llama_model_load_internal: n_layer    = 32
        llama_model_load_internal: n_rot      = 128
        llama_model_load_internal: ftype      = 2 (mostly Q4_0)
        llama_model_load_internal: n_ff       = 11008
        llama_model_load_internal: n_parts    = 1
        llama_model_load_internal: model size = 7B
        llama_model_load_internal: ggml ctx size =  68.20 KB
        llama_model_load_internal: mem required  = 5809.33 MB (+ 1026.00 MB per state)
        llama_init_from_file: kv self size  =  256.00 MB
        
        system_info: n_threads = 56 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
        sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
        generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0
        
        
         Hello! Welcome to the Cranberry Township Public Library. We are located at 30 Township Boulevard in Cranberry Township, PA. Whether you're looking for a good read or to learn something new, your local library is here to help. We offer access to books, movies and music; research databases with thousands of magazines and journals; online learning platforms and eBooks; local history collections; and much more!
        Our goal is to be your source for ideas and inspiration as you learn, explore, and grow.
        Get connected by joining our email newsletter list or following us on social media: Facebook | Twitter | Instagram | Pinterest
        30 Township Boulevard • Cranberry Township, PA 16066 • (724) 795-1248 [end of text]
        
        llama_print_timings:        load time = 10977.61 ms
        llama_print_timings:      sample time =   116.03 ms /   177 runs   (    0.66 ms per run)
        llama_print_timings: prompt eval time =   283.26 ms /     3 tokens (   94.42 ms per token)
        llama_print_timings:        eval time = 42488.67 ms /   176 runs   (  241.41 ms per run)
        llama_print_timings:       total time = 53642.97 ms
        
      3. Q8_0

        neuro exec llama-cpp -- bash -c './main -m /models/7B/ggml-model-q8_0.bin -p "Hello!" -s 42 -t $(nproc) -n 256'
        
        main: build = 1 (95078cc)
        main: seed  = 42
        llama.cpp: loading model from /models/7B/ggml-model-q8_0.bin
        llama_model_load_internal: format     = ggjt v1 (latest)
        llama_model_load_internal: n_vocab    = 32000
        llama_model_load_internal: n_ctx      = 512
        llama_model_load_internal: n_embd     = 4096
        llama_model_load_internal: n_mult     = 256
        llama_model_load_internal: n_head     = 32
        llama_model_load_internal: n_layer    = 32
        llama_model_load_internal: n_rot      = 128
        llama_model_load_internal: ftype      = 7 (mostly Q8_0)
        llama_model_load_internal: n_ff       = 11008
        llama_model_load_internal: n_parts    = 1
        llama_model_load_internal: model size = 7B
        llama_model_load_internal: ggml ctx size =  68.20 KB
        llama_model_load_internal: mem required  = 9022.33 MB (+ 1026.00 MB per state)
        llama_init_from_file: kv self size  =  256.00 MB
        
        system_info: n_threads = 56 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
        sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
        generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0
        
        
         Hello! Welcome to the Plymouth County Sheriff's Office website. We are here to keep you informed about what is going on in our area and to help you obtain information and services from us more easily.
        The Plymouth County Sheriff's Office is a full service law enforcement agency serving the residents of Plymouth County, Iowa and surrounding areas since 1854. The Sheriff’s Department is responsible for providing patrol services to all unincorporated areas of the county. In addition, it has the responsibility of protecting life and property in incorporated cities of Plymouth County such as Le Mars, Merrill, Akron, Kingsley, Homer, Ida Grove, Meriden, Westfield, Struble, Correctionville, Peterson, Granville, Laurens, Hinton, and Remsen. The Sheriff’s Department provides a full range of law enforcement services to the county.
        The Plymouth County Sheriff's Office is committed to providing quality law enforcement service to Plymouth County residents 24 hours a day, seven days a week. The Plymouth County Sheriff's Office has an authorized staff
        llama_print_timings:        load time = 17504.76 ms
        llama_print_timings:      sample time =   171.81 ms /   256 runs   (    0.67 ms per run)
        llama_print_timings: prompt eval time =   476.97 ms /     3 tokens (  158.99 ms per token)
        llama_print_timings:        eval time = 86557.12 ms /   255 runs   (  339.44 ms per run)
        llama_print_timings:       total time = 104318.80 ms
        

Performance summary

Model                      Eval time per run, ms   Mem required
65B/ggml-model-f16.bin     3726.15                 128109.20 MB (+ 5120.00 MB per state)
65B/ggml-model-q8_0.bin    2226.19                 73631.70 MB (+ 5120.00 MB per state)
65B/ggml-model-q4_0.bin    1310.88                 42501.70 MB (+ 5120.00 MB per state)
30B/ggml-model-f16.bin     1909.48                 64349.70 MB (+ 3124.00 MB per state)
30B/ggml-model-q8_0.bin    1069.48                 37206.10 MB (+ 3124.00 MB per state)
30B/ggml-model-q4_0.bin    613.74                  21695.48 MB (+ 3124.00 MB per state)
13B/ggml-model-f16.bin     821.60                  26874.67 MB (+ 1608.00 MB per state)
13B/ggml-model-q8_0.bin    558.93                  16013.73 MB (+ 1608.00 MB per state)
13B/ggml-model-q4_0.bin    368.89                  9807.48 MB (+ 1608.00 MB per state)
7B/ggml-model-f16.bin      492.20                  14645.08 MB (+ 1026.00 MB per state)
7B/ggml-model-q8_0.bin     339.44                  9022.33 MB (+ 1026.00 MB per state)
7B/ggml-model-q4_0.bin     241.41                  5809.33 MB (+ 1026.00 MB per state)
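
The timings above were collected one invocation at a time; the same sweep could be scripted in a single loop (a hedged sketch using the same prompt and seed as above, not part of the original run):

neuro exec llama-cpp -- bash -c 'for m in /models/*/ggml-model-*.bin; do echo "== $m =="; ./main -m "$m" -p "Hello!" -s 42 -t $(nproc) -n 256; done'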