
Install

git clone https://github.com/bigcode-project/bigcode-evaluation-harness
cd bigcode-evaluation-harness
pip install -e .
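
The commands below reference a few environment variables. A minimal example setup (the paths, output directory, and batch size are placeholders, adjust to your environment):

export WORKDIR=./bigcode-evaluation-harness    # path to the cloned harness
export LMID=mistralai/Mistral-7B-v0.1          # Hugging Face model ID under evaluation
export OUTDIR=./results/mistral-7b-humaneval   # where metrics and generations are written
export BS=1                                    # batch size; keep 1 for the greedy run below
mkdir -p $OUTDIR
# --use_auth_token assumes you are logged in via huggingface-cli login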

Deterministic Generation

mistralai/Mistral-7B-v0.1 should score "pass@1": 0.29878 with this setting; the paper reports 30.5%, a gap of roughly 0.6%.

accelerate launch $WORKDIR/main.py \
    --model $LMID \
    --max_length_generation 512 \
    --tasks humaneval \
    --batch_size $BS \
    --n_samples 1 \
    --no_do_sample \
    --temperature 0.0 \
    --top_p 1.0 \
    --precision bf16 \
    --allow_code_execution \
    --use_auth_token \
    --metric_output_path $OUTDIR/evaluation_results.json \
    --save_generations \
    --save_generations_path $OUTDIR/generated.json \
    # --limit 5

# Unsure how to disable top-p directly; it probably never matters here since --no_do_sample is set.
# With --no_do_sample, --n_samples N just produces N identical completions (it is greedy decoding after all), so set it to 1;
#      beam search might behave differently - need to find out.
# BS must be 1 for the greedy case per the current implementation.
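
To sanity-check a finished run, the metric file can be inspected directly. A minimal sketch, assuming the harness wrote valid JSON to the path given above (the exact key layout can vary by harness version, so check your own output):

cat $OUTDIR/evaluation_results.json                    # raw metrics
python -m json.tool $OUTDIR/evaluation_results.json    # pretty-printed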

Default Evaluation with the BigCode Open Leaderboard Settings

mistralai/Mistral-7B-v0.1

    "pass@1": 0.2825609756097561,
    "pass@10": 0.41052352768889244
accelerate launch $WORKDIR/main.py \
    --model $LMID \
    --max_length_generation 512 \
    --tasks humaneval \
    --batch_size $BS \
    --n_samples 50 \
    --do_sample \
    --temperature 0.2 \
    --top_p 0.95 \
    --precision bf16 \
    --allow_code_execution \
    --use_auth_token \
    --metric_output_path $OUTDIR/evaluation_results.json \
    --save_generations \
    --save_generations_path $OUTDIR/generated.json \
    # --limit 5
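
The pass@1 and pass@10 numbers above are computed with the standard unbiased pass@k estimator from the HumanEval (Codex) paper: generate n samples per problem (--n_samples 50 here), count the number c that pass the unit tests, and average

    pass@k = 1 - C(n - c, k) / C(n, k)

over all problems. As a worked example, a problem with c = 10 passing samples out of n = 50 contributes 1 - C(40, 1)/C(50, 1) = 10/50 = 0.2 to pass@1.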

Other Information

  1. BigCode models leaderboard: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
  2. Good discussion of Mistral-7B HumanEval reproducibility: bigcode-project/bigcode-evaluation-harness#165
  3. Quick read on the pass@k metric: https://deepgram.com/learn/humaneval-llm-benchmark