- Follow the instructions here under "Installation option 1: conda"
- Download the desired Hugging Face converted model for LLaMA here
- Copy the entire model folder, for example `llama-13b-hf`, into `text-generation-webui/models` (see the layout sketch below)
- Run the following command in your conda environment: `python server.py --model llama-13b-hf --load-in-8bit`
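For reference, a Hugging Face converted LLaMA folder contains the config, tokenizer, and sharded weight files; the exact shard names below are illustrative, not from this guide:

```
text-generation-webui/
└── models/
    └── llama-13b-hf/
        ├── config.json
        ├── tokenizer.model
        ├── tokenizer_config.json
        ├── pytorch_model-00001-of-00003.bin
        ├── ...
        └── pytorch_model.bin.index.json
```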
- Edit `text-generation-webui/modules/models.py` line 102: change `params.append("load_in_8bit=True"...` to `params.extend(["load_in_8bit=True", "llm_int8_enable_fp32_cpu_offload=True"])`, as sketched below
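For orientation, a minimal sketch of what that edit does, assuming `models.py` collects `from_pretrained` keyword arguments as strings in a `params` list (the real surrounding code may differ):

```python
params = []
load_in_8bit = True  # stands in for the --load-in-8bit command-line flag

if load_in_8bit:
    # Before the edit: params.append("load_in_8bit=True")
    # After the edit, int8 modules offloaded to the CPU run in fp32 instead:
    params.extend(["load_in_8bit=True", "llm_int8_enable_fp32_cpu_offload=True"])

print(", ".join(params))  # the web UI splices these into the from_pretrained call
```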
- In `./local/python/envs/textgen/lib/site-packages/transformers/modeling_utils.py`, search for:
```python
if isinstance(device_map, str):
    if model._no_split_modules is None:
        raise ValueError(
            f"{model.__class__.__name__} does not support device_map='{device_map}' yet."
        )
    no_split_modules = model._no_split_modules
    if device_map not in ["auto", "balanced", "balanced_low_0", "sequential"]:
        raise ValueError(
            "If passing a string for device_map, please choose 'auto', 'balanced', 'balanced_low_0' or "
            "'sequential'."
        )
    elif device_map in ["balanced", "balanced_low_0"] and get_balanced_memory is None:
        raise ValueError(
            f"device_map={device_map} requires a source install of Accelerate."
        )
    if device_map != "sequential" and get_balanced_memory is not None:
        max_memory = get_balanced_memory(
            model,
            max_memory=max_memory,
            no_split_module_classes=no_split_modules,
            dtype=torch_dtype,
            low_zero=(device_map == "balanced_low_0"),
        )
    model.tie_weights()
    device_map = infer_auto_device_map(
        model,
        no_split_module_classes=no_split_modules,
        dtype=torch_dtype if not load_in_8bit else torch.int8,
        max_memory=max_memory,
    )
```
Cut this entire section and paste it right before the comment line "# Extend the modules to not convert to keys that are supposed to be offloaded to cpu or disk".
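Moving this block makes the `device_map` get inferred before transformers builds its list of modules to exclude from int8 conversion, so layers offloaded to cpu or disk are kept in fp32 rather than breaking the 8-bit load. For comparison, newer transformers releases expose the same behavior through the public API; a rough sketch, with a placeholder memory budget:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder limits: fit what we can on GPU 0, offload the rest to system RAM.
model = AutoModelForCausalLM.from_pretrained(
    "models/llama-13b-hf",
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "30GiB"},
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        # Run the CPU-offloaded modules in fp32 rather than int8:
        llm_int8_enable_fp32_cpu_offload=True,
    ),
)
```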
- Run the following command in your conda environment: `python server.py --model llama-13b-hf --load-in-8bit --auto-devices`
- For 4-bit mode, follow the instructions here under "Installation option 1: conda"
- Continue with the 4-bit specific instructions here
- For a ChatGPT/CharacterAI-style chat, pass `--cai-chat` to `server.py`
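For example, combined with the 8-bit flags from earlier in this guide: `python server.py --model llama-13b-hf --load-in-8bit --cai-chat`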
- You can use examples in character cards to guide responses toward a desired output.
- For a more creative chat, use `temp 0.72`, `rep pen 1.1`, `top_k 0`, and `top_p 0.73`
- For a more precise chat, use `temp 0.7`, `repetition_penalty 1.1764705882352942`, `top_k 40`, and `top_p 0.1`
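Both presets map onto standard sampling parameters. A minimal sketch (assumed model path, prompt, and output length) of the creative preset through transformers' `generate()`, with the precise-preset values noted in comments:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/llama-13b-hf")
model = AutoModelForCausalLM.from_pretrained("models/llama-13b-hf", device_map="auto")

inputs = tokenizer("Below is a conversation with an assistant.", return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    do_sample=True,          # sampling must be enabled for these knobs to apply
    temperature=0.72,        # precise preset: 0.7
    repetition_penalty=1.1,  # precise preset: 1.1764705882352942
    top_k=0,                 # 0 disables top-k filtering; precise preset: 40
    top_p=0.73,              # precise preset: 0.1
    max_new_tokens=200,      # assumed; not part of either preset
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```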