
Install

git clone https://github.com/bigcode-project/bigcode-evaluation-harness
cd bigcode-evaluation-harness
pip install -e .
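
The commands below reference a few environment variables. A minimal example setup (the paths, output directory, and batch size are placeholders, adjust to your environment):

export WORKDIR=./bigcode-evaluation-harness    # path to the cloned harness
export LMID=mistralai/Mistral-7B-v0.1          # Hugging Face model ID under evaluation
export OUTDIR=./results/mistral-7b-humaneval   # where metrics and generations are written
export BS=1                                    # batch size; keep 1 for the greedy run below
mkdir -p $OUTDIR
# --use_auth_token assumes you are logged in via huggingface-cli login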

Deterministic Generation

mistralai/Mistral-7B-v0.1 should score "pass@1": 0.29878 with this setting; the paper reports 30.5%, a gap of roughly 0.6%.

accelerate launch $WORKDIR/main.py \
    --model $LMID \
    --max_length_generation 512 \
    --tasks humaneval \
    --batch_size $BS \
    --n_samples 1 \
    --no_do_sample \
    --temperature 0.0 \
    --top_p 1.0 \
    --precision bf16 \
    --allow_code_execution \
    --use_auth_token \
    --metric_output_path $OUTDIR/evaluation_results.json \
    --save_generations \
    --save_generations_path $OUTDIR/generated.json \
    # --limit 5

# Unsure how to disable top-p directly; it probably never matters here since --no_do_sample is set.
# With --no_do_sample, --n_samples N just produces N identical completions (it is greedy decoding after all), so set it to 1;
#      beam search might behave differently - need to find out.
# BS must be 1 for the greedy case per the current implementation.
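
To sanity-check a finished run, the metric file can be inspected directly. A minimal sketch, assuming the harness wrote valid JSON to the path given above (the exact key layout can vary by harness version, so check your own output):

cat $OUTDIR/evaluation_results.json                    # raw metrics
python -m json.tool $OUTDIR/evaluation_results.json    # pretty-printed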

Default Evaluation with the BigCode Open Leaderboard Settings

mistralai/Mistral-7B-v0.1

    "pass@1": 0.2825609756097561,
    "pass@10": 0.41052352768889244
accelerate launch $WORKDIR/main.py \
    --model $LMID \
    --max_length_generation 512 \
    --tasks humaneval \
    --batch_size $BS \
    --n_samples 50 \
    --do_sample \
    --temperature 0.2 \
    --top_p 0.95 \
    --precision bf16 \
    --allow_code_execution \
    --use_auth_token \
    --metric_output_path $OUTDIR/evaluation_results.json \
    --save_generations \
    --save_generations_path $OUTDIR/generated.json \
    # --limit 5
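
The pass@1 and pass@10 numbers above are computed with the standard unbiased pass@k estimator from the HumanEval (Codex) paper: generate n samples per problem (--n_samples 50 here), count the number c that pass the unit tests, and average

    pass@k = 1 - C(n - c, k) / C(n, k)

over all problems. As a worked example, a problem with c = 10 passing samples out of n = 50 contributes 1 - C(40, 1)/C(50, 1) = 10/50 = 0.2 to pass@1.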

Other Information

  1. BigCode models leaderboard: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
  2. Good discussion of Mistral-7B HumanEval reproducibility: bigcode-project/bigcode-evaluation-harness#165
  3. Quick read on the pass@k metric: https://deepgram.com/learn/humaneval-llm-benchmark