@python273
Created April 7, 2023 22:04
MALLOC(3) Linux Programmer's Manual MALLOC(3)
NAME
malloc, free, calloc, realloc, reallocarray - allocate and free dynamic memory
SYNOPSIS
#include <stdlib.h>
void *malloc(size_t size);
void free(void *ptr);
void *calloc(size_t nmemb, size_t size);
void *realloc(void *ptr, size_t size);
void *reallocarray(void *ptr, size_t nmemb, size_t size);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
reallocarray():
Since glibc 2.29:
_DEFAULT_SOURCE
Glibc 2.28 and earlier:
_GNU_SOURCE
DESCRIPTION
The malloc() function allocates size bytes and returns a pointer to the allocated memory. The
memory is not initialized. If size is 0, then malloc() returns either NULL, or a unique
pointer value that can later be successfully passed to free().
The free() function frees the memory space pointed to by ptr, which must have been returned by
a previous call to malloc(), calloc(), or realloc(). Otherwise, or if free(ptr) has already
been called before, undefined behavior occurs. If ptr is NULL, no operation is performed.
The calloc() function allocates memory for an array of nmemb elements of size bytes each and
returns a pointer to the allocated memory. The memory is set to zero. If nmemb or size is 0,
then calloc() returns either NULL, or a unique pointer value that can later be successfully
passed to free(). If the multiplication of nmemb and size would result in integer overflow,
then calloc() returns an error. By contrast, an integer overflow would not be detected in the
following call to malloc(), with the result that an incorrectly sized block of memory would be
allocated:
malloc(nmemb * size);
The realloc() function changes the size of the memory block pointed to by ptr to size bytes.
The contents will be unchanged in the range from the start of the region up to the minimum of
the old and new sizes. If the new size is larger than the old size, the added memory will not
be initialized. If ptr is NULL, then the call is equivalent to malloc(size), for all values of
size; if size is equal to zero, and ptr is not NULL, then the call is equivalent to free(ptr).
Unless ptr is NULL, it must have been returned by an earlier call to malloc(), calloc(), or re‐
alloc(). If the area pointed to was moved, a free(ptr) is done.
The reallocarray() function changes the size of the memory block pointed to by ptr to be large
enough for an array of nmemb elements, each of which is size bytes. It is equivalent to the
call
realloc(ptr, nmemb * size);
However, unlike that realloc() call, reallocarray() fails safely in the case where the multi‐
plication would overflow. If such an overflow occurs, reallocarray() returns NULL, sets errno
to ENOMEM, and leaves the original block of memory unchanged.
RETURN VALUE
The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably
aligned for any built-in type. On error, these functions return NULL. NULL may also be re‐
turned by a successful call to malloc() with a size of zero, or by a successful call to cal‐
loc() with nmemb or size equal to zero.
The free() function returns no value.
The realloc() function returns a pointer to the newly allocated memory, which is suitably
aligned for any built-in type, or NULL if the request failed. The returned pointer may be the
same as ptr if the allocation was not moved (e.g., there was room to expand the allocation in-
place), or different from ptr if the allocation was moved to a new address. If size was equal
to 0, either NULL or a pointer suitable to be passed to free() is returned. If realloc()
fails, the original block is left untouched; it is not freed or moved.
On success, the reallocarray() function returns a pointer to the newly allocated memory. On
failure, it returns NULL and the original block of memory is left untouched.
(nnexp) kirill@quiet:~/3rdworkspace/llama.cpp$ make
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native
I LDFLAGS:
I CC: cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX: g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native -c examples/common.cpp -o common.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o -o main
==== Run ./main -h for help. ====
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/quantize/quantize.cpp ggml.o llama.o -o quantize
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding
(nnexp) kirill@quiet:~/3rdworkspace/llama.cpp$ sudo ./main -m ./models/7B/ggml-model-q4_0.bin --mlock --ctx_size 2048 -b 512 -t 16 --color -n 64 -f /home/kirill/3rdworkspace/GPTQ-for-LLaMa/bench_prompt.txt
main: seed = 1680904413
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 2048
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/7B/ggml-model-q4_0.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 1024.00 MB
system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 512, n_predict = 64, n_keep = 0
MALLOC(3) Linux Programmer's Manual MALLOC(3)
NAME
malloc, free, calloc, realloc, reallocarray - allocate and free dynamic memory
SYNOPSIS
#include <stdlib.h>
void *malloc(size_t size);
void free(void *ptr);
void *calloc(size_t nmemb, size_t size);
void *realloc(void *ptr, size_t size);
void *reallocarray(void *ptr, size_t nmemb, size_t size);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
reallocarray():
Since glibc 2.29:
_DEFAULT_SOURCE
Glibc 2.28 and earlier:
_GNU_SOURCE
DESCRIPTION
The malloc() function allocates size bytes and returns a pointer to the allocated memory. The
memory is not initialized. If size is 0, then malloc() returns either NULL, or a unique
pointer value that can later be successfully passed to free().
The free() function frees the memory space pointed to by ptr, which must have been returned by
a previous call to malloc(), calloc(), or realloc(). Otherwise, or if free(ptr) has already
been called before, undefined behavior occurs. If ptr is NULL, no operation is performed.
The calloc() function allocates memory for an array of nmemb elements of size bytes each and
returns a pointer to the allocated memory. The memory is set to zero. If nmemb or size is 0,
then calloc() returns either NULL, or a unique pointer value that can later be successfully
passed to free(). If the multiplication of nmemb and size would result in integer overflow,
then calloc() returns an error. By contrast, an integer overflow would not be detected in the
following call to malloc(), with the result that an incorrectly sized block of memory would be
allocated:
malloc(nmemb * size);
The realloc() function changes the size of the memory block pointed to by ptr to size bytes.
The contents will be unchanged in the range from the start of the region up to the minimum of
the old and new sizes. If the new size is larger than the old size, the added memory will not
be initialized. If ptr is NULL, then the call is equivalent to malloc(size), for all values of
size; if size is equal to zero, and ptr is not NULL, then the call is equivalent to free(ptr).
Unless ptr is NULL, it must have been returned by an earlier call to malloc(), calloc(), or re‐
alloc(). If the area pointed to was moved, a free(ptr) is done.
The reallocarray() function changes the size of the memory block pointed to by ptr to be large
enough for an array of nmemb elements, each of which is size bytes. It is equivalent to the
call
realloc(ptr, nmemb * size);
However, unlike that realloc() call, reallocarray() fails safely in the case where the multi‐
plication would overflow. If such an overflow occurs, reallocarray() returns NULL, sets errno
to ENOMEM, and leaves the original block of memory unchanged.
RETURN VALUE
The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably
aligned for any built-in type. On error, these functions return NULL. NULL may also be re‐
turned by a successful call to malloc() with a size of zero, or by a successful call to cal‐
loc() with nmemb or size equal to zero.
The free() function returns no value.
The realloc() function returns a pointer to the newly allocated memory, which is suitably
aligned for any built-in type, or NULL if the request failed. The returned pointer may be the
same as ptr if the allocation was not moved (e.g., there was room to expand the allocation in-
place), or different from ptr if the allocation was moved to a new address. If size was equal
to 0, either NULL or a pointer suitable to be passed to free() is returned. If realloc()
fails, the original block is left untouched; it is not freed or moved.
On success, the reallocarray() function returns a pointer to the newly allocated memory. On
failure, it returns NULL and the original block of memory is left untouched.
DESCRIPTION
The malloc() function allocates size bytes and returns a pointer to the allocated memory.
The memory is not initialized. If size is 0, then malloc() returns either NULL, or a unique
pointer value that can later be successfully passed to
llama_print_timings: load time = 12524.92 ms
llama_print_timings: sample time = 27.27 ms / 64 runs ( 0.43 ms per run)
llama_print_timings: prompt eval time = 27051.89 ms / 1151 tokens ( 23.50 ms per token)
llama_print_timings: eval time = 7044.22 ms / 63 runs ( 111.81 ms per run)
llama_print_timings: total time = 35084.64 ms
(nnexp) kirill@quiet:~/3rdworkspace/llama.cpp$
(nnexp) kirill@quiet:~/3rdworkspace/llama.cpp$ LLAMA_OPENBLAS=1 make
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native
I LDFLAGS: -lopenblas
I CC: cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX: g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native -c examples/common.cpp -o common.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o -o main -lopenblas
==== Run ./main -h for help. ====
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/quantize/quantize.cpp ggml.o llama.o -o quantize -lopenblas
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity -lopenblas
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding -lopenblas
(nnexp) kirill@quiet:~/3rdworkspace/llama.cpp$ sudo ./main -m ./models/7B/ggml-model-q4_0.bin --mlock --ctx_size 2048 -b 512 -t 16 --color -n 64 -f /home/kirill/3rdworkspace/GPTQ-for-LLaMa/bench_prompt.txt
main: seed = 1680904475
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 2048
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/7B/ggml-model-q4_0.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 1024.00 MB
system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 512, n_predict = 64, n_keep = 0
MALLOC(3) Linux Programmer's Manual MALLOC(3)
NAME
malloc, free, calloc, realloc, reallocarray - allocate and free dynamic memory
SYNOPSIS
#include <stdlib.h>
void *malloc(size_t size);
void free(void *ptr);
void *calloc(size_t nmemb, size_t size);
void *realloc(void *ptr, size_t size);
void *reallocarray(void *ptr, size_t nmemb, size_t size);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
reallocarray():
Since glibc 2.29:
_DEFAULT_SOURCE
Glibc 2.28 and earlier:
_GNU_SOURCE
DESCRIPTION
The malloc() function allocates size bytes and returns a pointer to the allocated memory. The
memory is not initialized. If size is 0, then malloc() returns either NULL, or a unique
pointer value that can later be successfully passed to free().
The free() function frees the memory space pointed to by ptr, which must have been returned by
a previous call to malloc(), calloc(), or realloc(). Otherwise, or if free(ptr) has already
been called before, undefined behavior occurs. If ptr is NULL, no operation is performed.
The calloc() function allocates memory for an array of nmemb elements of size bytes each and
returns a pointer to the allocated memory. The memory is set to zero. If nmemb or size is 0,
then calloc() returns either NULL, or a unique pointer value that can later be successfully
passed to free(). If the multiplication of nmemb and size would result in integer overflow,
then calloc() returns an error. By contrast, an integer overflow would not be detected in the
following call to malloc(), with the result that an incorrectly sized block of memory would be
allocated:
malloc(nmemb * size);
The realloc() function changes the size of the memory block pointed to by ptr to size bytes.
The contents will be unchanged in the range from the start of the region up to the minimum of
the old and new sizes. If the new size is larger than the old size, the added memory will not
be initialized. If ptr is NULL, then the call is equivalent to malloc(size), for all values of
size; if size is equal to zero, and ptr is not NULL, then the call is equivalent to free(ptr).
Unless ptr is NULL, it must have been returned by an earlier call to malloc(), calloc(), or re‐
alloc(). If the area pointed to was moved, a free(ptr) is done.
The reallocarray() function changes the size of the memory block pointed to by ptr to be large
enough for an array of nmemb elements, each of which is size bytes. It is equivalent to the
call
realloc(ptr, nmemb * size);
However, unlike that realloc() call, reallocarray() fails safely in the case where the multi‐
plication would overflow. If such an overflow occurs, reallocarray() returns NULL, sets errno
to ENOMEM, and leaves the original block of memory unchanged.
RETURN VALUE
The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably
aligned for any built-in type. On error, these functions return NULL. NULL may also be re‐
turned by a successful call to malloc() with a size of zero, or by a successful call to cal‐
loc() with nmemb or size equal to zero.
The free() function returns no value.
The realloc() function returns a pointer to the newly allocated memory, which is suitably
aligned for any built-in type, or NULL if the request failed. The returned pointer may be the
same as ptr if the allocation was not moved (e.g., there was room to expand the allocation in-
place), or different from ptr if the allocation was moved to a new address. If size was equal
to 0, either NULL or a pointer suitable to be passed to free() is returned. If realloc()
fails, the original block is left untouched; it is not freed or moved.
On success, the reallocarray() function returns a pointer to the newly allocated memory. On
failure, it returns NULL and the original block of memory is left untouched. The behavior on an
overflow is defined by the compiler vendor.
DESCRIPTION OF THE EXCEPTIONS THAT MAY BE THROWN
The malloc() function may throw a malloc failure (which is not to be confused with the standard
malloc() failure
llama_print_timings: load time = 14944.75 ms
llama_print_timings: sample time = 35.45 ms / 64 runs ( 0.55 ms per run)
llama_print_timings: prompt eval time = 40494.75 ms / 1151 tokens ( 35.18 ms per token)
llama_print_timings: eval time = 7111.35 ms / 63 runs ( 112.88 ms per run)
llama_print_timings: total time = 48636.15 ms
(nnexp) kirill@quiet:~/3rdworkspace/llama.cpp$
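To make the two llama.cpp runs above easier to compare, the reported timings can be converted into tokens per second. This is a small sketch and not part of the original benchmark: the dictionary layout and the `per_second` and `summarize` helpers are names chosen here for illustration, and the numbers are copied from the `llama_print_timings` lines.

```python
# Timings copied from the llama_print_timings output of the two runs above.
# The dictionary layout and helper names are illustrative assumptions.
RUNS = {
    "default build": {
        "prompt_ms": 27051.89, "prompt_tokens": 1151,
        "eval_ms": 7044.22, "eval_runs": 63,
    },
    "LLAMA_OPENBLAS=1 build": {
        "prompt_ms": 40494.75, "prompt_tokens": 1151,
        "eval_ms": 7111.35, "eval_runs": 63,
    },
}


def per_second(count, elapsed_ms):
    """Return how many items were processed per second."""
    return count / (elapsed_ms / 1000.0)


def summarize(run):
    """Return (prompt tokens/s, generated tokens/s) for one run."""
    prompt_rate = per_second(run["prompt_tokens"], run["prompt_ms"])
    eval_rate = per_second(run["eval_runs"], run["eval_ms"])
    return prompt_rate, eval_rate


if __name__ == "__main__":
    for name, run in RUNS.items():
        prompt_rate, eval_rate = summarize(run)
        print(f"{name}: prompt {prompt_rate:.1f} tok/s, eval {eval_rate:.1f} tok/s")
```

On these numbers, prompt evaluation took roughly 1.5 times longer in the OpenBLAS build, while per-token generation time was nearly the same. A single run per configuration is not enough to say why.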
import time
import traceback

from transformers import (
    LlamaForCausalLM, LlamaTokenizer, BitsAndBytesConfig, TextStreamer,
    GenerationConfig
)
import torch

cuda_is_available = torch.cuda.is_available()
gpu_count = torch.cuda.device_count()
if not cuda_is_available or gpu_count == 0:
    print("CUDA is not available")
    exit(1)
print("CUDA version: " + torch.version.cuda)

quantization_config = BitsAndBytesConfig(
    llm_int8_skip_modules=["lm_head"],
    load_in_8bit=True,
)

start_model = time.perf_counter_ns()
model_name = "./models/7B/"
tokenizer = LlamaTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer)
model: LlamaForCausalLM = LlamaForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    cache_dir="cache"
)


def main():
    with open("/home/kirill/3rdworkspace/GPTQ-for-LLaMa/bench_prompt.txt") as f:
        prompt = f.read()
    gen_in = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()
    with torch.no_grad():
        generation_config = GenerationConfig(
            max_new_tokens=64,
            min_new_tokens=64,
            do_sample=True,
            use_cache=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2,
            temperature=0.7,
            top_k=40,
            top_p=1.0,
            # early_stopping=True,
        )
        try:
            print('generation start')
            start_gen = time.perf_counter_ns()
            generated_ids = model.generate(
                gen_in,
                generation_config=generation_config,
                streamer=streamer,
                # stopping_criteria=
            )
            end_gen = time.perf_counter_ns()
        except Exception:
            print("\033[91m")
            traceback.print_exc()
            print("\033[0m")
            exit(1)
    # print(repr(generated_ids))
    # generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    # print(repr(generated_text))
    gen_time_ms = (end_gen - start_gen) / 1e6
    print(f"generation time: {gen_time_ms} ms")
    # Note: len(gen_in) is the batch dimension (1), not the prompt length, and
    # generated_ids[0] already holds the prompt followed by the new tokens.
    total_tokens = len(gen_in) + len(generated_ids[0])
    print(f"tokens consumed: {total_tokens}")
    print(f"gen time per token: {gen_time_ms / total_tokens} ms")
    print(f"total time: {(end_gen - start_model) / 1e9} s")


if __name__ == "__main__":
    main()
from pathlib import Path
import time
import traceback

from transformers import (
    LlamaForCausalLM, LlamaTokenizer, BitsAndBytesConfig, TextStreamer,
    GenerationConfig
)
import torch
from llama_inference import load_quant

import warnings
warnings.filterwarnings("ignore")

cuda_is_available = torch.cuda.is_available()
gpu_count = torch.cuda.device_count()
if not cuda_is_available or gpu_count == 0:
    print("CUDA is not available")
    exit(1)
print("CUDA version: " + torch.version.cuda)

DEV = torch.device('cuda:0')

start_model = time.perf_counter_ns()
model_name = Path("~/workspace/llama-gpu/models/7B/").expanduser()
tokenizer = LlamaTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer)
model = load_quant(model_name, "llama7b-4bit-128g.safetensors", 4, 128, -1)
model.to(DEV)


def main():
    with open("/home/kirill/3rdworkspace/GPTQ-for-LLaMa/bench_prompt.txt") as f:
        prompt = f.read()
    gen_in = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()
    with torch.no_grad():
        generation_config = GenerationConfig(
            max_new_tokens=64,
            min_new_tokens=64,
            do_sample=True,
            use_cache=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2,
            temperature=0.7,
            top_k=40,
            top_p=1.0,
            # early_stopping=True,
        )
        try:
            print('generation start')
            start_gen = time.perf_counter_ns()
            generated_ids = model.generate(
                gen_in,
                generation_config=generation_config,
                streamer=streamer,
                # stopping_criteria=
            )
            end_gen = time.perf_counter_ns()
        except Exception:
            print("\033[91m")
            traceback.print_exc()
            print("\033[0m")
            exit(1)
    # print(repr(generated_ids))
    # generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    # print(repr(generated_text))
    gen_time_ms = (end_gen - start_gen) / 1e6
    print(f"generation time: {gen_time_ms} ms")
    # Note: len(gen_in) is the batch dimension (1), not the prompt length, and
    # generated_ids[0] already holds the prompt followed by the new tokens.
    total_tokens = len(gen_in) + len(generated_ids[0])
    print(f"tokens consumed: {total_tokens}")
    print(f"gen time per token: {gen_time_ms / total_tokens} ms")
    print(f"total time: {(end_gen - start_model) / 1e9} s")


if __name__ == "__main__":
    main()
CUDA version: 11.8
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /home/kirill/miniconda3/envs/llama-1/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/kirill/miniconda3/envs/llama-1/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
generation start
<s>MALLOC(3) Linux Programmer's Manual MALLOC(3)
NAME
malloc, free, calloc, realloc, reallocarray - allocate and free dynamic memory
SYNOPSIS
#include <stdlib.h>
void *malloc(size_t size);
void free(void *ptr);
void *calloc(size_t nmemb, size_t size);
void *realloc(void *ptr, size_t size);
void *reallocarray(void *ptr, size_t nmemb, size_t size);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
reallocarray():
Since glibc 2.29:
_DEFAULT_SOURCE
Glibc 2.28 and earlier:
_GNU_SOURCE
DESCRIPTION
The malloc() function allocates size bytes and returns a pointer to the allocated memory. The
memory is not initialized. If size is 0, then malloc() returns either NULL, or a unique
pointer value that can later be successfully passed to free().
The free() function frees the memory space pointed to by ptr, which must have been returned by
a previous call to malloc(), calloc(), or realloc(). Otherwise, or if free(ptr) has already
been called before, undefined behavior occurs. If ptr is NULL, no operation is performed.
The calloc() function allocates memory for an array of nmemb elements of size bytes each and
returns a pointer to the allocated memory. The memory is set to zero. If nmemb or size is 0,
then calloc() returns either NULL, or a unique pointer value that can later be successfully
passed to free(). If the multiplication of nmemb and size would result in integer overflow,
then calloc() returns an error. By contrast, an integer overflow would not be detected in the
following call to malloc(), with the result that an incorrectly sized block of memory would be
allocated:
malloc(nmemb * size);
The realloc() function changes the size of the memory block pointed to by ptr to size bytes.
The contents will be unchanged in the range from the start of the region up to the minimum of
the old and new sizes. If the new size is larger than the old size, the added memory will not
be initialized. If ptr is NULL, then the call is equivalent to malloc(size), for all values of
size; if size is equal to zero, and ptr is not NULL, then the call is equivalent to free(ptr).
Unless ptr is NULL, it must have been returned by an earlier call to malloc(), calloc(), or re‐
alloc(). If the area pointed to was moved, a free(ptr) is done.
The reallocarray() function changes the size of the memory block pointed to by ptr to be large
enough for an array of nmemb elements, each of which is size bytes. It is equivalent to the
call
realloc(ptr, nmemb * size);
However, unlike that realloc() call, reallocarray() fails safely in the case where the multi‐
plication would overflow. If such an overflow occurs, reallocarray() returns NULL, sets errno
to ENOMEM, and leaves the original block of memory unchanged.
RETURN VALUE
The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably
aligned for any built-in type. On error, these functions return NULL. NULL may also be re‐
turned by a successful call to malloc() with a size of zero, or by a successful call to cal‐
loc() with nmemb or size equal to zero.
The free() function returns no value.
The realloc() function returns a pointer to the newly allocated memory, which is suitably
aligned for any built-in type, or NULL if the request failed. The returned pointer may be the
same as ptr if the allocation was not moved (e.g., there was room to expand the allocation in-
place), or different from ptr if the allocation was moved to a new address. If size was equal
to 0, either NULL or a pointer suitable to be passed to free() is returned. If realloc()
fails, the original block is left untouched; it is not freed or moved.
On success, the reallocarray() function returns a pointer to the newly allocated memory. On
failure, it returns NULL and the original block of memory is left untouched.
NOTES
1) The C standard does not mandate automatic storage management. Some implementations use
manual bookkeeping while others provide automatic storage management through a system
library routine. When used automatically, most libraries define some form of "free"; see
man page malloc.*
generation time: 5748.10601 ms
tokens consumed: 1217
gen time per token: 4.723176672144619 ms
total time: 11.363934879 s
CUDA version: 11.8
Loading model ...
Done.
generation start
<s>MALLOC(3) Linux Programmer's Manual MALLOC(3)
NAME
malloc, free, calloc, realloc, reallocarray - allocate and free dynamic memory
SYNOPSIS
#include <stdlib.h>
void *malloc(size_t size);
void free(void *ptr);
void *calloc(size_t nmemb, size_t size);
void *realloc(void *ptr, size_t size);
void *reallocarray(void *ptr, size_t nmemb, size_t size);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
reallocarray():
Since glibc 2.29:
_DEFAULT_SOURCE
Glibc 2.28 and earlier:
_GNU_SOURCE
DESCRIPTION
The malloc() function allocates size bytes and returns a pointer to the allocated memory. The
memory is not initialized. If size is 0, then malloc() returns either NULL, or a unique
pointer value that can later be successfully passed to free().
The free() function frees the memory space pointed to by ptr, which must have been returned by
a previous call to malloc(), calloc(), or realloc(). Otherwise, or if free(ptr) has already
been called before, undefined behavior occurs. If ptr is NULL, no operation is performed.
The calloc() function allocates memory for an array of nmemb elements of size bytes each and
returns a pointer to the allocated memory. The memory is set to zero. If nmemb or size is 0,
then calloc() returns either NULL, or a unique pointer value that can later be successfully
passed to free(). If the multiplication of nmemb and size would result in integer overflow,
then calloc() returns an error. By contrast, an integer overflow would not be detected in the
following call to malloc(), with the result that an incorrectly sized block of memory would be
allocated:
malloc(nmemb * size);
The realloc() function changes the size of the memory block pointed to by ptr to size bytes.
The contents will be unchanged in the range from the start of the region up to the minimum of
the old and new sizes. If the new size is larger than the old size, the added memory will not
be initialized. If ptr is NULL, then the call is equivalent to malloc(size), for all values of
size; if size is equal to zero, and ptr is not NULL, then the call is equivalent to free(ptr).
Unless ptr is NULL, it must have been returned by an earlier call to malloc(), calloc(), or re‐
alloc(). If the area pointed to was moved, a free(ptr) is done.
The reallocarray() function changes the size of the memory block pointed to by ptr to be large
enough for an array of nmemb elements, each of which is size bytes. It is equivalent to the
call
realloc(ptr, nmemb * size);
However, unlike that realloc() call, reallocarray() fails safely in the case where the multi‐
plication would overflow. If such an overflow occurs, reallocarray() returns NULL, sets errno
to ENOMEM, and leaves the original block of memory unchanged.
RETURN VALUE
The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably
aligned for any built-in type. On error, these functions return NULL. NULL may also be re‐
turned by a successful call to malloc() with a size of zero, or by a successful call to cal‐
loc() with nmemb or size equal to zero.
The free() function returns no value.
The realloc() function returns a pointer to the newly allocated memory, which is suitably
aligned for any built-in type, or NULL if the request failed. The returned pointer may be the
same as ptr if the allocation was not moved (e.g., there was room to expand the allocation in-
place), or different from ptr if the allocation was moved to a new address. If size was equal
to 0, either NULL or a pointer suitable to be passed to free() is returned. If realloc()
fails, the original block is left untouched; it is not freed or moved.
On success, the reallocarray() function returns a pointer to the newly allocated memory. On
failure, it returns NULL and the original block of memory is left untouched.
NOTES
This interface is compatible with both ISO C99 and POSIX specifications.
The implementation adopts the principle "first available wins" when returning pointers obtained
through calls to malloc(). Specifically, if an application invokes malloc() repeatedly within one
segment
generation time: 21058.956373 ms
tokens consumed: 1217
gen time per token: 17.303990446179128 ms
total time: 23.015911156 s
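The two GPU runs above report the same token count but very different generation times. The sketch below recomputes an overall throughput from those reported numbers. It is an illustration and not part of the original scripts: `REPORTED`, `NEW_TOKENS`, `split_token_count` and `tokens_per_second` are hypothetical names, and the run labels are inferred from the log banners. It also assumes that `len(gen_in)` evaluates to the batch size of 1 and that `generated_ids[0]` holds the prompt followed by the 64 new tokens, which is how `generate()` typically behaves for decoder-only models.

```python
# Numbers copied from the two script outputs above. The labels are inferred
# from the log banners (bitsandbytes vs. "Loading model ..."), so they are
# assumptions rather than something the logs state directly.
REPORTED = {
    "8-bit (bitsandbytes)": {"gen_ms": 5748.10601, "tokens_consumed": 1217},
    "4-bit (GPTQ)": {"gen_ms": 21058.956373, "tokens_consumed": 1217},
}
NEW_TOKENS = 64  # max_new_tokens == min_new_tokens == 64 in both scripts


def split_token_count(tokens_consumed, new_tokens=NEW_TOKENS, batch_size=1):
    """Split the reported count into (prompt tokens, full sequence length).

    Assumes tokens_consumed == batch_size + len(generated_ids[0]).
    """
    sequence_len = tokens_consumed - batch_size
    return sequence_len - new_tokens, sequence_len


def tokens_per_second(gen_ms, sequence_len):
    """Overall rate across prompt processing and generation combined."""
    return sequence_len / (gen_ms / 1000.0)


if __name__ == "__main__":
    for name, run in REPORTED.items():
        prompt_len, sequence_len = split_token_count(run["tokens_consumed"])
        rate = tokens_per_second(run["gen_ms"], sequence_len)
        print(f"{name}: prompt {prompt_len} tokens, "
              f"sequence {sequence_len} tokens, {rate:.1f} tok/s overall")
```

Because the timer wraps the whole `generate()` call, this rate mixes prompt processing with token-by-token generation, so it is not directly comparable to the separate prompt eval and eval figures that llama.cpp prints.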