@cedrickchee
Created March 12, 2023 12:04
Installing 8/4-bit LLaMA with text-generation-webui on Linux

Installing 8-bit LLaMA with text-generation-webui

Linux

  1. Follow the text-generation-webui installation instructions under "Installation option 1: conda"

  2. Download the desired Hugging Face-converted LLaMA model

  3. Copy the entire model folder, for example llama-13b-hf, into text-generation-webui/models

  4. Run the following command in your conda environment: python server.py --model llama-13b-hf --load-in-8bit
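
For reference, step 4 makes the web UI load the model in 8-bit through bitsandbytes. The following is a minimal sketch of a roughly equivalent direct transformers call, assuming transformers, accelerate, and bitsandbytes are installed in the same conda environment; the model path is illustrative and this is not text-generation-webui's exact loading code.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/llama-13b-hf"  # the folder copied into text-generation-webui/models in step 3

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",   # let Accelerate place layers on the available GPU(s)
    load_in_8bit=True,   # quantize the weights to int8 via bitsandbytes
)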

8-bit LLaMA with Split Offloading

  1. Edit text-generation-webui/modules/models.py, changing line 102 from params.append("load_in_8bit=True"... to params.extend(["load_in_8bit=True", "llm_int8_enable_fp32_cpu_offload=True"]) (the sketch after step 3 shows what this passes down to transformers)

  2. In ./local/python/envs/textgen/lib/site-packages/transformers/modeling_utils.py search for:

if isinstance(device_map, str):
    if model._no_split_modules is None:
        raise ValueError(
            f"{model.__class__.__name__} does not support device_map='{device_map}' yet."
        )
    no_split_modules = model._no_split_modules

    if device_map not in ["auto", "balanced", "balanced_low_0", "sequential"]:
        raise ValueError(
            "If passing a string for device_map, please choose 'auto', 'balanced', 'balanced_low_0' or "
            "'sequential'."
        )
    elif device_map in ["balanced", "balanced_low_0"] and get_balanced_memory is None:
        raise ValueError(
            f"device_map={device_map} requires a source install of Accelerate."
        )

    if device_map != "sequential" and get_balanced_memory is not None:
        max_memory = get_balanced_memory(
            model,
            max_memory=max_memory,
            no_split_module_classes=no_split_modules,
            dtype=torch_dtype,
            low_zero=(device_map == "balanced_low_0"),
        )

    model.tie_weights()
    device_map = infer_auto_device_map(
        model,
        no_split_module_classes=no_split_modules,
        dtype=torch_dtype if not load_in_8bit else torch.int8,
        max_memory=max_memory,
    )

Cut this entire section and paste it right before the line "# Extend the modules to not convert to keys that are supposed to be offloaded to cpu or disk"

  3. Run the following command in your conda environment: python server.py --model llama-13b-hf --load-in-8bit --auto-devices
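
Taken together, the edit in step 1 and the relocated block let the 8-bit loader offload whatever does not fit on the GPU to the CPU in fp32. The following is a minimal sketch of what this amounts to in plain transformers, assuming a version recent enough to expose BitsAndBytesConfig; the quantization_config route, the model path, and the memory budget are assumptions, not the web UI's exact call.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # keep CPU-offloaded modules in fp32
)

model = AutoModelForCausalLM.from_pretrained(
    "models/llama-13b-hf",                     # illustrative path
    device_map="auto",                         # Accelerate decides the GPU/CPU split
    max_memory={0: "10GiB", "cpu": "30GiB"},   # assumed budget; adjust to your hardware
    quantization_config=quant_config,
)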

Installing 4-bit LLaMA with text-generation-webui

Linux

  1. Follow the text-generation-webui installation instructions under "Installation option 1: conda"

  2. Continue with the 4-bit-specific instructions here

Tips and Output Settings in text-generation-webui

  • For a ChatGPT/CharacterAI style chat, pass --cai-chat to server.py
  • You can use examples in character cards to guide responses toward a desired output.
  • For a more creative chat, use: temperature 0.72, repetition_penalty 1.1, top_k 0, and top_p 0.73
  • For a more precise chat, use: temperature 0.7, repetition_penalty 1.1764705882352942, top_k 40, and top_p 0.1
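
If you drive the model from a script instead of the web UI, these presets map directly onto transformers generate() sampling arguments. Below is a minimal sketch using the "precise" preset, assuming model and tokenizer were loaded as in the earlier sketches; the prompt and max_new_tokens value are illustrative.

prompt = "Common name: llama\nScientific name:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    do_sample=True,                          # sampling must be enabled for these knobs to apply
    temperature=0.7,
    repetition_penalty=1.1764705882352942,
    top_k=40,
    top_p=0.1,
    max_new_tokens=200,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))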

Resources used for this guide
