@cedrickchee
Created March 12, 2023 12:04
Installing 8/4-bit LLaMA with text-generation-webui on Linux

Installing 8-bit LLaMA with text-generation-webui

Linux

  1. Follow the text-generation-webui installation instructions under "Installation option 1: conda"

  2. Download the desired Hugging Face-converted LLaMA model

  3. Copy the entire model folder, for example llama-13b-hf, into text-generation-webui/models

  4. Run the following command in your conda environment: python server.py --model llama-13b-hf --load-in-8bit
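
For reference, step 4 makes the web UI load the model in 8-bit through bitsandbytes. The following is a minimal sketch of a roughly equivalent direct transformers call, assuming transformers, accelerate, and bitsandbytes are installed in the same conda environment; the model path is illustrative and this is not text-generation-webui's exact loading code.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/llama-13b-hf"  # the folder copied into text-generation-webui/models in step 3

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",   # let Accelerate place layers on the available GPU(s)
    load_in_8bit=True,   # quantize the weights to int8 via bitsandbytes
)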

8-bit LLaMA with Split Offloading

  1. Edit text-generation-webui/modules/models.py, changing line 102 from params.append("load_in_8bit=True"... to params.extend(["load_in_8bit=True", "llm_int8_enable_fp32_cpu_offload=True"]) (the sketch after step 3 shows what this passes down to transformers)

  2. In ./local/python/envs/textgen/lib/site-packages/transformers/modeling_utils.py search for:

if isinstance(device_map, str):
    if model._no_split_modules is None:
        raise ValueError(
            f"{model.__class__.__name__} does not support device_map='{device_map}' yet."
        )
    no_split_modules = model._no_split_modules

    if device_map not in ["auto", "balanced", "balanced_low_0", "sequential"]:
        raise ValueError(
            "If passing a string for device_map, please choose 'auto', 'balanced', 'balanced_low_0' or "
            "'sequential'."
        )
    elif device_map in ["balanced", "balanced_low_0"] and get_balanced_memory is None:
        raise ValueError(
            f"device_map={device_map} requires a source install of Accelerate."
        )

    if device_map != "sequential" and get_balanced_memory is not None:
        max_memory = get_balanced_memory(
            model,
            max_memory=max_memory,
            no_split_module_classes=no_split_modules,
            dtype=torch_dtype,
            low_zero=(device_map == "balanced_low_0"),
        )

    model.tie_weights()
    device_map = infer_auto_device_map(
        model,
        no_split_module_classes=no_split_modules,
        dtype=torch_dtype if not load_in_8bit else torch.int8,
        max_memory=max_memory,
    )

Cut this entire section and paste it right before the line "# Extend the modules to not convert to keys that are supposed to be offloaded to cpu or disk"

  3. Run the following command in your conda environment: python server.py --model llama-13b-hf --load-in-8bit --auto-devices
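
Taken together, the edit in step 1 and the relocated block let the 8-bit loader offload whatever does not fit on the GPU to the CPU in fp32. The following is a minimal sketch of what this amounts to in plain transformers, assuming a version recent enough to expose BitsAndBytesConfig; the quantization_config route, the model path, and the memory budget are assumptions, not the web UI's exact call.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # keep CPU-offloaded modules in fp32
)

model = AutoModelForCausalLM.from_pretrained(
    "models/llama-13b-hf",                     # illustrative path
    device_map="auto",                         # Accelerate decides the GPU/CPU split
    max_memory={0: "10GiB", "cpu": "30GiB"},   # assumed budget; adjust to your hardware
    quantization_config=quant_config,
)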

Installing 4-bit LLaMA with text-generation-webui

Linux

  1. Follow the text-generation-webui installation instructions under "Installation option 1: conda"

  2. Continue with the 4-bit-specific instructions here

Tips and Output Settings in text-generation-webui

  • For a ChatGPT/CharacterAI style chat, pass --cai-chat to server.py
  • You can use examples in character cards to guide responses toward a desired output.
  • For a more creative chat, use: temperature 0.72, repetition_penalty 1.1, top_k 0, and top_p 0.73
  • For a more precise chat, use: temperature 0.7, repetition_penalty 1.1764705882352942, top_k 40, and top_p 0.1
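
If you drive the model from a script instead of the web UI, these presets map directly onto transformers generate() sampling arguments. Below is a minimal sketch using the "precise" preset, assuming model and tokenizer were loaded as in the earlier sketches; the prompt and max_new_tokens value are illustrative.

prompt = "Common name: llama\nScientific name:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    do_sample=True,                          # sampling must be enabled for these knobs to apply
    temperature=0.7,
    repetition_penalty=1.1764705882352942,
    top_k=40,
    top_p=0.1,
    max_new_tokens=200,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))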

Resources used for this guide
