MALLOC(3) Linux Programmer's Manual MALLOC(3) | |
NAME | |
malloc, free, calloc, realloc, reallocarray - allocate and free dynamic memory | |
SYNOPSIS | |
#include <stdlib.h> | |
void *malloc(size_t size); | |
void free(void *ptr); | |
void *calloc(size_t nmemb, size_t size); | |
void *realloc(void *ptr, size_t size); | |
void *reallocarray(void *ptr, size_t nmemb, size_t size); | |
Feature Test Macro Requirements for glibc (see feature_test_macros(7)): | |
reallocarray(): | |
Since glibc 2.29: | |
_DEFAULT_SOURCE | |
Glibc 2.28 and earlier: | |
_GNU_SOURCE | |
DESCRIPTION | |
The malloc() function allocates size bytes and returns a pointer to the allocated memory. The | |
memory is not initialized. If size is 0, then malloc() returns either NULL, or a unique | |
pointer value that can later be successfully passed to free(). | |
The free() function frees the memory space pointed to by ptr, which must have been returned by | |
a previous call to malloc(), calloc(), or realloc(). Otherwise, or if free(ptr) has already | |
been called before, undefined behavior occurs. If ptr is NULL, no operation is performed. | |
The calloc() function allocates memory for an array of nmemb elements of size bytes each and | |
returns a pointer to the allocated memory. The memory is set to zero. If nmemb or size is 0, | |
then calloc() returns either NULL, or a unique pointer value that can later be successfully | |
passed to free(). If the multiplication of nmemb and size would result in integer overflow, | |
then calloc() returns an error. By contrast, an integer overflow would not be detected in the | |
following call to malloc(), with the result that an incorrectly sized block of memory would be | |
allocated: | |
malloc(nmemb * size); | |
The realloc() function changes the size of the memory block pointed to by ptr to size bytes. | |
The contents will be unchanged in the range from the start of the region up to the minimum of | |
the old and new sizes. If the new size is larger than the old size, the added memory will not | |
be initialized. If ptr is NULL, then the call is equivalent to malloc(size), for all values of | |
size; if size is equal to zero, and ptr is not NULL, then the call is equivalent to free(ptr). | |
Unless ptr is NULL, it must have been returned by an earlier call to malloc(), calloc(), or re‐ | |
alloc(). If the area pointed to was moved, a free(ptr) is done. | |
The reallocarray() function changes the size of the memory block pointed to by ptr to be large | |
enough for an array of nmemb elements, each of which is size bytes. It is equivalent to the | |
call | |
realloc(ptr, nmemb * size); | |
However, unlike that realloc() call, reallocarray() fails safely in the case where the multi‐ | |
plication would overflow. If such an overflow occurs, reallocarray() returns NULL, sets errno | |
to ENOMEM, and leaves the original block of memory unchanged. | |
RETURN VALUE | |
The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably | |
aligned for any built-in type. On error, these functions return NULL. NULL may also be re‐ | |
turned by a successful call to malloc() with a size of zero, or by a successful call to cal‐ | |
loc() with nmemb or size equal to zero. | |
The free() function returns no value. | |
The realloc() function returns a pointer to the newly allocated memory, which is suitably | |
aligned for any built-in type, or NULL if the request failed. The returned pointer may be the | |
same as ptr if the allocation was not moved (e.g., there was room to expand the allocation in- | |
place), or different from ptr if the allocation was moved to a new address. If size was equal | |
to 0, either NULL or a pointer suitable to be passed to free() is returned. If realloc() | |
fails, the original block is left untouched; it is not freed or moved. | |
On success, the reallocarray() function returns a pointer to the newly allocated memory. On | |
failure, it returns NULL and the original block of memory is left untouched. |
(nnexp) kirill@quiet:~/3rdworkspace/llama.cpp$ make | |
I llama.cpp build info: | |
I UNAME_S: Linux | |
I UNAME_P: x86_64 | |
I UNAME_M: x86_64 | |
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native | |
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native | |
I LDFLAGS: | |
I CC: cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0 | |
I CXX: g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0 | |
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native -c ggml.c -o ggml.o | |
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native -c llama.cpp -o llama.o | |
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native -c examples/common.cpp -o common.o | |
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o -o main | |
==== Run ./main -h for help. ==== | |
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/quantize/quantize.cpp ggml.o llama.o -o quantize | |
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity | |
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding | |
(nnexp) kirill@quiet:~/3rdworkspace/llama.cpp$ sudo ./main -m ./models/7B/ggml-model-q4_0.bin --mlock --ctx_size 2048 -b 512 -t 16 --color -n 64 -f /home/kirill/3rdworkspace/GPTQ-for-LLaMa/bench_prompt.txt | |
main: seed = 1680904413 | |
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ... | |
llama_model_load: n_vocab = 32000 | |
llama_model_load: n_ctx = 2048 | |
llama_model_load: n_embd = 4096 | |
llama_model_load: n_mult = 256 | |
llama_model_load: n_head = 32 | |
llama_model_load: n_layer = 32 | |
llama_model_load: n_rot = 128 | |
llama_model_load: f16 = 2 | |
llama_model_load: n_ff = 11008 | |
llama_model_load: n_parts = 1 | |
llama_model_load: type = 1 | |
llama_model_load: ggml map size = 4017.70 MB | |
llama_model_load: ggml ctx size = 81.25 KB | |
llama_model_load: mem required = 5809.78 MB (+ 1026.00 MB per state) | |
llama_model_load: loading tensors from './models/7B/ggml-model-q4_0.bin' | |
llama_model_load: model size = 4017.27 MB / num tensors = 291 | |
llama_init_from_file: kv self size = 1024.00 MB | |
system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | | |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000 | |
generate: n_ctx = 2048, n_batch = 512, n_predict = 64, n_keep = 0 | |
MALLOC(3) Linux Programmer's Manual MALLOC(3) | |
NAME | |
malloc, free, calloc, realloc, reallocarray - allocate and free dynamic memory | |
SYNOPSIS | |
#include <stdlib.h> | |
void *malloc(size_t size); | |
void free(void *ptr); | |
void *calloc(size_t nmemb, size_t size); | |
void *realloc(void *ptr, size_t size); | |
void *reallocarray(void *ptr, size_t nmemb, size_t size); | |
Feature Test Macro Requirements for glibc (see feature_test_macros(7)): | |
reallocarray(): | |
Since glibc 2.29: | |
_DEFAULT_SOURCE | |
Glibc 2.28 and earlier: | |
_GNU_SOURCE | |
DESCRIPTION | |
The malloc() function allocates size bytes and returns a pointer to the allocated memory. The | |
memory is not initialized. If size is 0, then malloc() returns either NULL, or a unique | |
pointer value that can later be successfully passed to free(). | |
The free() function frees the memory space pointed to by ptr, which must have been returned by | |
a previous call to malloc(), calloc(), or realloc(). Otherwise, or if free(ptr) has already | |
been called before, undefined behavior occurs. If ptr is NULL, no operation is performed. | |
The calloc() function allocates memory for an array of nmemb elements of size bytes each and | |
returns a pointer to the allocated memory. The memory is set to zero. If nmemb or size is 0, | |
then calloc() returns either NULL, or a unique pointer value that can later be successfully | |
passed to free(). If the multiplication of nmemb and size would result in integer overflow, | |
then calloc() returns an error. By contrast, an integer overflow would not be detected in the | |
following call to malloc(), with the result that an incorrectly sized block of memory would be | |
allocated: | |
malloc(nmemb * size); | |
The realloc() function changes the size of the memory block pointed to by ptr to size bytes. | |
The contents will be unchanged in the range from the start of the region up to the minimum of | |
the old and new sizes. If the new size is larger than the old size, the added memory will not | |
be initialized. If ptr is NULL, then the call is equivalent to malloc(size), for all values of | |
size; if size is equal to zero, and ptr is not NULL, then the call is equivalent to free(ptr). | |
Unless ptr is NULL, it must have been returned by an earlier call to malloc(), calloc(), or re‐ | |
alloc(). If the area pointed to was moved, a free(ptr) is done. | |
The reallocarray() function changes the size of the memory block pointed to by ptr to be large | |
enough for an array of nmemb elements, each of which is size bytes. It is equivalent to the | |
call | |
realloc(ptr, nmemb * size); | |
However, unlike that realloc() call, reallocarray() fails safely in the case where the multi‐ | |
plication would overflow. If such an overflow occurs, reallocarray() returns NULL, sets errno | |
to ENOMEM, and leaves the original block of memory unchanged. | |
RETURN VALUE | |
The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably | |
aligned for any built-in type. On error, these functions return NULL. NULL may also be re‐ | |
turned by a successful call to malloc() with a size of zero, or by a successful call to cal‐ | |
loc() with nmemb or size equal to zero. | |
The free() function returns no value. | |
The realloc() function returns a pointer to the newly allocated memory, which is suitably | |
aligned for any built-in type, or NULL if the request failed. The returned pointer may be the | |
same as ptr if the allocation was not moved (e.g., there was room to expand the allocation in- | |
place), or different from ptr if the allocation was moved to a new address. If size was equal | |
to 0, either NULL or a pointer suitable to be passed to free() is returned. If realloc() | |
fails, the original block is left untouched; it is not freed or moved. | |
On success, the reallocarray() function returns a pointer to the newly allocated memory. On | |
failure, it returns NULL and the original block of memory is left untouched. | |
DESCRIPTION | |
The malloc() function allocates size bytes and returns a pointer to the allocated memory. | |
The memory is not initialized. If size is 0, then malloc() returns either NULL, or a unique | |
pointer value that can later be successfully passed to | |
llama_print_timings: load time = 12524.92 ms
llama_print_timings: sample time = 27.27 ms / 64 runs ( 0.43 ms per run)
llama_print_timings: prompt eval time = 27051.89 ms / 1151 tokens ( 23.50 ms per token)
llama_print_timings: eval time = 7044.22 ms / 63 runs ( 111.81 ms per run)
llama_print_timings: total time = 35084.64 ms
(nnexp) kirill@quiet:~/3rdworkspace/llama.cpp$
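The timing lines above report milliseconds per item; throughput in items per second is often easier to compare across tools. The following is a minimal sketch, not part of the original benchmark: the regular expression and the `throughput` helper are illustrative assumptions about the `llama_print_timings` line layout as it appears in this log.

```python
import re

# Timing lines copied from the run above.
LOG = """\
llama_print_timings: load time = 12524.92 ms
llama_print_timings: sample time = 27.27 ms / 64 runs ( 0.43 ms per run)
llama_print_timings: prompt eval time = 27051.89 ms / 1151 tokens ( 23.50 ms per token)
llama_print_timings: eval time = 7044.22 ms / 63 runs ( 111.81 ms per run)
llama_print_timings: total time = 35084.64 ms
"""

# Matches "<name> time = <ms> ms / <count>"; lines without a count do not match.
PATTERN = re.compile(
    r"llama_print_timings:\s+(?P<name>[\w ]+?) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<count>\d+)"
)


def throughput(log: str) -> dict[str, float]:
    """Return items per second for every timing line that reports a count."""
    rates = {}
    for match in PATTERN.finditer(log):
        seconds = float(match["ms"]) / 1000.0
        rates[match["name"]] = int(match["count"]) / seconds
    return rates


rates = throughput(LOG)
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} per second")
```

For this run the prompt is processed at roughly 42.5 tokens per second, while generation proceeds at roughly 8.9 tokens per second.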
(nnexp) kirill@quiet:~/3rdworkspace/llama.cpp$ LLAMA_OPENBLAS=1 make | |
I llama.cpp build info: | |
I UNAME_S: Linux | |
I UNAME_P: x86_64 | |
I UNAME_M: x86_64 | |
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native -DGGML_USE_OPENBLAS -I/usr/local/include/openblas | |
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native | |
I LDFLAGS: -lopenblas | |
I CC: cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0 | |
I CXX: g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0 | |
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -c ggml.c -o ggml.o | |
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native -c llama.cpp -o llama.o | |
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native -c examples/common.cpp -o common.o | |
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o -o main -lopenblas | |
==== Run ./main -h for help. ==== | |
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/quantize/quantize.cpp ggml.o llama.o -o quantize -lopenblas | |
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity -lopenblas | |
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding -lopenblas | |
(nnexp) kirill@quiet:~/3rdworkspace/llama.cpp$ sudo ./main -m ./models/7B/ggml-model-q4_0.bin --mlock --ctx_size 2048 -b 512 -t 16 --color -n 64 -f /home/kirill/3rdworkspace/GPTQ-for-LLaMa/bench_prompt.txt | |
main: seed = 1680904475 | |
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ... | |
llama_model_load: n_vocab = 32000 | |
llama_model_load: n_ctx = 2048 | |
llama_model_load: n_embd = 4096 | |
llama_model_load: n_mult = 256 | |
llama_model_load: n_head = 32 | |
llama_model_load: n_layer = 32 | |
llama_model_load: n_rot = 128 | |
llama_model_load: f16 = 2 | |
llama_model_load: n_ff = 11008 | |
llama_model_load: n_parts = 1 | |
llama_model_load: type = 1 | |
llama_model_load: ggml map size = 4017.70 MB | |
llama_model_load: ggml ctx size = 81.25 KB | |
llama_model_load: mem required = 5809.78 MB (+ 1026.00 MB per state) | |
llama_model_load: loading tensors from './models/7B/ggml-model-q4_0.bin' | |
llama_model_load: model size = 4017.27 MB / num tensors = 291 | |
llama_init_from_file: kv self size = 1024.00 MB | |
system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | | |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000 | |
generate: n_ctx = 2048, n_batch = 512, n_predict = 64, n_keep = 0 | |
MALLOC(3) Linux Programmer's Manual MALLOC(3) | |
NAME | |
malloc, free, calloc, realloc, reallocarray - allocate and free dynamic memory | |
SYNOPSIS | |
#include <stdlib.h> | |
void *malloc(size_t size); | |
void free(void *ptr); | |
void *calloc(size_t nmemb, size_t size); | |
void *realloc(void *ptr, size_t size); | |
void *reallocarray(void *ptr, size_t nmemb, size_t size); | |
Feature Test Macro Requirements for glibc (see feature_test_macros(7)): | |
reallocarray(): | |
Since glibc 2.29: | |
_DEFAULT_SOURCE | |
Glibc 2.28 and earlier: | |
_GNU_SOURCE | |
DESCRIPTION | |
The malloc() function allocates size bytes and returns a pointer to the allocated memory. The | |
memory is not initialized. If size is 0, then malloc() returns either NULL, or a unique | |
pointer value that can later be successfully passed to free(). | |
The free() function frees the memory space pointed to by ptr, which must have been returned by | |
a previous call to malloc(), calloc(), or realloc(). Otherwise, or if free(ptr) has already | |
been called before, undefined behavior occurs. If ptr is NULL, no operation is performed. | |
The calloc() function allocates memory for an array of nmemb elements of size bytes each and | |
returns a pointer to the allocated memory. The memory is set to zero. If nmemb or size is 0, | |
then calloc() returns either NULL, or a unique pointer value that can later be successfully | |
passed to free(). If the multiplication of nmemb and size would result in integer overflow, | |
then calloc() returns an error. By contrast, an integer overflow would not be detected in the | |
following call to malloc(), with the result that an incorrectly sized block of memory would be | |
allocated: | |
malloc(nmemb * size); | |
The realloc() function changes the size of the memory block pointed to by ptr to size bytes. | |
The contents will be unchanged in the range from the start of the region up to the minimum of | |
the old and new sizes. If the new size is larger than the old size, the added memory will not | |
be initialized. If ptr is NULL, then the call is equivalent to malloc(size), for all values of | |
size; if size is equal to zero, and ptr is not NULL, then the call is equivalent to free(ptr). | |
Unless ptr is NULL, it must have been returned by an earlier call to malloc(), calloc(), or re‐ | |
alloc(). If the area pointed to was moved, a free(ptr) is done. | |
The reallocarray() function changes the size of the memory block pointed to by ptr to be large | |
enough for an array of nmemb elements, each of which is size bytes. It is equivalent to the | |
call | |
realloc(ptr, nmemb * size); | |
However, unlike that realloc() call, reallocarray() fails safely in the case where the multi‐ | |
plication would overflow. If such an overflow occurs, reallocarray() returns NULL, sets errno | |
to ENOMEM, and leaves the original block of memory unchanged. | |
RETURN VALUE | |
The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably | |
aligned for any built-in type. On error, these functions return NULL. NULL may also be re‐ | |
turned by a successful call to malloc() with a size of zero, or by a successful call to cal‐ | |
loc() with nmemb or size equal to zero. | |
The free() function returns no value. | |
The realloc() function returns a pointer to the newly allocated memory, which is suitably | |
aligned for any built-in type, or NULL if the request failed. The returned pointer may be the | |
same as ptr if the allocation was not moved (e.g., there was room to expand the allocation in- | |
place), or different from ptr if the allocation was moved to a new address. If size was equal | |
to 0, either NULL or a pointer suitable to be passed to free() is returned. If realloc() | |
fails, the original block is left untouched; it is not freed or moved. | |
On success, the reallocarray() function returns a pointer to the newly allocated memory. On | |
failure, it returns NULL and the original block of memory is left untouched. The behavior on an | |
overflow is defined by the compiler vendor. | |
DESCRIPTION OF THE EXCEPTIONS THAT MAY BE THROWN | |
The malloc() function may throw a malloc failure (which is not to be confused with the standard | |
malloc() failure | |
llama_print_timings: load time = 14944.75 ms
llama_print_timings: sample time = 35.45 ms / 64 runs ( 0.55 ms per run)
llama_print_timings: prompt eval time = 40494.75 ms / 1151 tokens ( 35.18 ms per token)
llama_print_timings: eval time = 7111.35 ms / 63 runs ( 112.88 ms per run)
llama_print_timings: total time = 48636.15 ms
(nnexp) kirill@quiet:~/3rdworkspace/llama.cpp$
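The two llama.cpp runs differ only in the `LLAMA_OPENBLAS=1` build flag, so their timings can be compared directly. The sketch below is a worked calculation on the figures printed in the two logs; the dictionary keys and the `slowdown` helper are illustrative names, and each configuration was run once, so the ratios are indicative rather than statistically robust.

```python
# Milliseconds copied from the two llama_print_timings blocks above.
default_build = {"prompt_eval_ms": 27051.89, "eval_ms": 7044.22, "total_ms": 35084.64}
openblas_build = {"prompt_eval_ms": 40494.75, "eval_ms": 7111.35, "total_ms": 48636.15}


def slowdown(baseline_ms: float, candidate_ms: float) -> float:
    """Ratio above 1.0 means the candidate took longer than the baseline."""
    return candidate_ms / baseline_ms


ratios = {
    key: slowdown(default_build[key], openblas_build[key]) for key in default_build
}
for key, ratio in ratios.items():
    print(f"{key}: OpenBLAS / default = {ratio:.2f}x")
```

On this machine the OpenBLAS build took about 1.5 times as long for prompt evaluation, while per-token generation time was essentially unchanged.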
import sys
import time
import traceback

from transformers import (
    LlamaForCausalLM, LlamaTokenizer, BitsAndBytesConfig, TextStreamer,
    GenerationConfig
)
import torch

cuda_is_available = torch.cuda.is_available()
gpu_count = torch.cuda.device_count()
if not cuda_is_available or gpu_count == 0:
    print("CUDA is not available")
    sys.exit(1)
print("CUDA version: " + torch.version.cuda)

# Load the weights in 8-bit via bitsandbytes; lm_head is skipped (not quantized).
quantization_config = BitsAndBytesConfig(
    llm_int8_skip_modules=["lm_head"],
    load_in_8bit=True,
)

# "total time" is measured from here, so it includes tokenizer and model loading.
start_model = time.perf_counter_ns()

model_name = "./models/7B/"
tokenizer = LlamaTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer)
model: LlamaForCausalLM = LlamaForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    cache_dir="cache"
)


def main():
    with open("/home/kirill/3rdworkspace/GPTQ-for-LLaMa/bench_prompt.txt") as f:
        prompt = f.read()

    gen_in = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()

    with torch.no_grad():
        # Force exactly 64 new tokens so runs are comparable.
        generation_config = GenerationConfig(
            max_new_tokens=64,
            min_new_tokens=64,
            do_sample=True,
            use_cache=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2,
            temperature=0.7,
            top_k=40,
            top_p=1.0,
            # early_stopping=True,
        )
        try:
            print('generation start')
            start_gen = time.perf_counter_ns()
            generated_ids = model.generate(
                gen_in,
                generation_config=generation_config,
                streamer=streamer,
                # stopping_criteria=
            )
            end_gen = time.perf_counter_ns()
        except Exception:
            # Print the traceback in red, then reset the terminal color.
            print("\033[91m")
            traceback.print_exc()
            print("\033[0m")
            sys.exit(1)

    # print(repr(generated_ids))
    # generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    # print(repr(generated_text))

    gen_time_ms = (end_gen - start_gen) / 1e6
    print(f"generation time: {gen_time_ms} ms")
    # Note: gen_in has shape (1, n_prompt), so len(gen_in) is the batch size (1),
    # and generated_ids[0] already holds the prompt followed by the new tokens.
    # The per-token figure below is therefore averaged over prompt + new tokens.
    total_tokens = len(gen_in) + len(generated_ids[0])
    print(f"tokens consumed: {total_tokens}")
    print(f"gen time per token: {gen_time_ms / total_tokens} ms")
    print(f"total time: {(end_gen - start_model) / 1e9} s")


if __name__ == "__main__":
    main()
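The token count in the script above adds `len(gen_in)` to `len(generated_ids[0])`. The sketch below illustrates what that sum measures, using nested lists as stand-ins for batch-first tensors so it runs without a GPU. The prompt length of 1152 is an assumption inferred from the total of 1217 reported in the 8-bit run's output, not a value the script prints.

```python
# Nested lists stand in for batch-first tensors of shape (batch, sequence).
PROMPT_TOKENS = 1152  # assumption: inferred from the reported total of 1217
NEW_TOKENS = 64       # max_new_tokens == min_new_tokens in the script

gen_in = [[0] * PROMPT_TOKENS]                        # shape (1, 1152)
generated_ids = [[0] * (PROMPT_TOKENS + NEW_TOKENS)]  # prompt followed by new tokens

# What the script computes: batch size (1) plus the full output length.
reported = len(gen_in) + len(generated_ids[0])

# Counts taken along the sequence dimension instead.
prompt_count = len(gen_in[0])
new_count = len(generated_ids[0]) - prompt_count

print(reported, prompt_count, new_count)
```

Separating `prompt_count` from `new_count` makes it possible to report prompt processing and generation separately, as the llama.cpp logs do.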
from pathlib import Path
import sys
import time
import traceback
import warnings

from transformers import (
    LlamaForCausalLM, LlamaTokenizer, BitsAndBytesConfig, TextStreamer,
    GenerationConfig
)
import torch

# load_quant comes from the GPTQ-for-LLaMa repository (llama_inference.py).
from llama_inference import load_quant

warnings.filterwarnings("ignore")

cuda_is_available = torch.cuda.is_available()
gpu_count = torch.cuda.device_count()
if not cuda_is_available or gpu_count == 0:
    print("CUDA is not available")
    sys.exit(1)
print("CUDA version: " + torch.version.cuda)

DEV = torch.device('cuda:0')

# "total time" is measured from here, so it includes tokenizer and model loading.
start_model = time.perf_counter_ns()

model_name = Path("~/workspace/llama-gpu/models/7B/").expanduser()
tokenizer = LlamaTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer)
# Presumably 4-bit weights with group size 128, matching the checkpoint name.
model = load_quant(model_name, "llama7b-4bit-128g.safetensors", 4, 128, -1)
model.to(DEV)


def main():
    with open("/home/kirill/3rdworkspace/GPTQ-for-LLaMa/bench_prompt.txt") as f:
        prompt = f.read()

    gen_in = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()

    with torch.no_grad():
        # Force exactly 64 new tokens so runs are comparable.
        generation_config = GenerationConfig(
            max_new_tokens=64,
            min_new_tokens=64,
            do_sample=True,
            use_cache=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2,
            temperature=0.7,
            top_k=40,
            top_p=1.0,
            # early_stopping=True,
        )
        try:
            print('generation start')
            start_gen = time.perf_counter_ns()
            generated_ids = model.generate(
                gen_in,
                generation_config=generation_config,
                streamer=streamer,
                # stopping_criteria=
            )
            end_gen = time.perf_counter_ns()
        except Exception:
            # Print the traceback in red, then reset the terminal color.
            print("\033[91m")
            traceback.print_exc()
            print("\033[0m")
            sys.exit(1)

    # print(repr(generated_ids))
    # generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    # print(repr(generated_text))

    gen_time_ms = (end_gen - start_gen) / 1e6
    print(f"generation time: {gen_time_ms} ms")
    # Note: gen_in has shape (1, n_prompt), so len(gen_in) is the batch size (1),
    # and generated_ids[0] already holds the prompt followed by the new tokens.
    # The per-token figure below is therefore averaged over prompt + new tokens.
    total_tokens = len(gen_in) + len(generated_ids[0])
    print(f"tokens consumed: {total_tokens}")
    print(f"gen time per token: {gen_time_ms / total_tokens} ms")
    print(f"total time: {(end_gen - start_model) / 1e9} s")


if __name__ == "__main__":
    main()
CUDA version: 11.8 | |
===================================BUG REPORT=================================== | |
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues | |
================================================================================ | |
CUDA SETUP: CUDA runtime path found: /home/kirill/miniconda3/envs/llama-1/lib/libcudart.so | |
CUDA SETUP: Highest compute capability among GPUs detected: 7.5 | |
CUDA SETUP: Detected CUDA version 118 | |
CUDA SETUP: Loading binary /home/kirill/miniconda3/envs/llama-1/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cuda118.so... | |
generation start | |
<s>MALLOC(3) Linux Programmer's Manual MALLOC(3) | |
NAME | |
malloc, free, calloc, realloc, reallocarray - allocate and free dynamic memory | |
SYNOPSIS | |
#include <stdlib.h> | |
void *malloc(size_t size); | |
void free(void *ptr); | |
void *calloc(size_t nmemb, size_t size); | |
void *realloc(void *ptr, size_t size); | |
void *reallocarray(void *ptr, size_t nmemb, size_t size); | |
Feature Test Macro Requirements for glibc (see feature_test_macros(7)): | |
reallocarray(): | |
Since glibc 2.29: | |
_DEFAULT_SOURCE | |
Glibc 2.28 and earlier: | |
_GNU_SOURCE | |
DESCRIPTION | |
The malloc() function allocates size bytes and returns a pointer to the allocated memory. The | |
memory is not initialized. If size is 0, then malloc() returns either NULL, or a unique | |
pointer value that can later be successfully passed to free(). | |
The free() function frees the memory space pointed to by ptr, which must have been returned by | |
a previous call to malloc(), calloc(), or realloc(). Otherwise, or if free(ptr) has already | |
been called before, undefined behavior occurs. If ptr is NULL, no operation is performed. | |
The calloc() function allocates memory for an array of nmemb elements of size bytes each and | |
returns a pointer to the allocated memory. The memory is set to zero. If nmemb or size is 0, | |
then calloc() returns either NULL, or a unique pointer value that can later be successfully | |
passed to free(). If the multiplication of nmemb and size would result in integer overflow, | |
then calloc() returns an error. By contrast, an integer overflow would not be detected in the | |
following call to malloc(), with the result that an incorrectly sized block of memory would be | |
allocated: | |
malloc(nmemb * size); | |
The realloc() function changes the size of the memory block pointed to by ptr to size bytes. | |
The contents will be unchanged in the range from the start of the region up to the minimum of | |
the old and new sizes. If the new size is larger than the old size, the added memory will not | |
be initialized. If ptr is NULL, then the call is equivalent to malloc(size), for all values of | |
size; if size is equal to zero, and ptr is not NULL, then the call is equivalent to free(ptr). | |
Unless ptr is NULL, it must have been returned by an earlier call to malloc(), calloc(), or re‐ | |
alloc(). If the area pointed to was moved, a free(ptr) is done. | |
The reallocarray() function changes the size of the memory block pointed to by ptr to be large | |
enough for an array of nmemb elements, each of which is size bytes. It is equivalent to the | |
call | |
realloc(ptr, nmemb * size); | |
However, unlike that realloc() call, reallocarray() fails safely in the case where the multi‐ | |
plication would overflow. If such an overflow occurs, reallocarray() returns NULL, sets errno | |
to ENOMEM, and leaves the original block of memory unchanged. | |
RETURN VALUE | |
The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably | |
aligned for any built-in type. On error, these functions return NULL. NULL may also be re‐ | |
turned by a successful call to malloc() with a size of zero, or by a successful call to cal‐ | |
loc() with nmemb or size equal to zero. | |
The free() function returns no value. | |
The realloc() function returns a pointer to the newly allocated memory, which is suitably | |
aligned for any built-in type, or NULL if the request failed. The returned pointer may be the | |
same as ptr if the allocation was not moved (e.g., there was room to expand the allocation in- | |
place), or different from ptr if the allocation was moved to a new address. If size was equal | |
to 0, either NULL or a pointer suitable to be passed to free() is returned. If realloc() | |
fails, the original block is left untouched; it is not freed or moved. | |
On success, the reallocarray() function returns a pointer to the newly allocated memory. On | |
failure, it returns NULL and the original block of memory is left untouched. | |
NOTES | |
1) The C standard does not mandate automatic storage management. Some implementations use | |
manual bookkeeping while others provide automatic storage management through a system | |
library routine. When used automatically, most libraries define some form of "free"; see | |
man page malloc.* | |
generation time: 5748.10601 ms
tokens consumed: 1217
gen time per token: 4.723176672144619 ms
total time: 11.363934879 s
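The "gen time per token" figure above divides the generation time by every counted token, prompt included. The sketch below is a worked calculation on the logged numbers; the value of 64 new tokens comes from `max_new_tokens` in the script, and the variable names are illustrative. Because the measured interval also covers prompt processing, the per-new-token figure is an upper bound on decode latency, not a direct measurement of it.

```python
generation_ms = 5748.10601  # "generation time" from the log above
reported_tokens = 1217      # "tokens consumed" from the log above
new_tokens = 64             # max_new_tokens in the benchmark script

# The figure the script prints: averaged over prompt and generated tokens.
per_counted_token_ms = generation_ms / reported_tokens

# The same interval attributed only to the 64 generated tokens.
per_new_token_ms = generation_ms / new_tokens
new_tokens_per_second = new_tokens / (generation_ms / 1000.0)

print(f"per counted token: {per_counted_token_ms:.2f} ms")
print(f"per new token (upper bound): {per_new_token_ms:.2f} ms")
print(f"new tokens per second (lower bound): {new_tokens_per_second:.2f}")
```

This separation matters when comparing against the llama.cpp logs, which report prompt evaluation and generation as distinct timings.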
CUDA version: 11.8 | |
Loading model ... | |
Done. | |
generation start | |
<s>MALLOC(3) Linux Programmer's Manual MALLOC(3) | |
NAME | |
malloc, free, calloc, realloc, reallocarray - allocate and free dynamic memory | |
SYNOPSIS | |
#include <stdlib.h> | |
void *malloc(size_t size); | |
void free(void *ptr); | |
void *calloc(size_t nmemb, size_t size); | |
void *realloc(void *ptr, size_t size); | |
void *reallocarray(void *ptr, size_t nmemb, size_t size); | |
Feature Test Macro Requirements for glibc (see feature_test_macros(7)): | |
reallocarray(): | |
Since glibc 2.29: | |
_DEFAULT_SOURCE | |
Glibc 2.28 and earlier: | |
_GNU_SOURCE | |
DESCRIPTION | |
The malloc() function allocates size bytes and returns a pointer to the allocated memory. The | |
memory is not initialized. If size is 0, then malloc() returns either NULL, or a unique | |
pointer value that can later be successfully passed to free(). | |
The free() function frees the memory space pointed to by ptr, which must have been returned by | |
a previous call to malloc(), calloc(), or realloc(). Otherwise, or if free(ptr) has already | |
been called before, undefined behavior occurs. If ptr is NULL, no operation is performed. | |
The calloc() function allocates memory for an array of nmemb elements of size bytes each and | |
returns a pointer to the allocated memory. The memory is set to zero. If nmemb or size is 0, | |
then calloc() returns either NULL, or a unique pointer value that can later be successfully | |
passed to free(). If the multiplication of nmemb and size would result in integer overflow, | |
then calloc() returns an error. By contrast, an integer overflow would not be detected in the | |
following call to malloc(), with the result that an incorrectly sized block of memory would be | |
allocated: | |
malloc(nmemb * size); | |
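To make the contrast concrete, the sketch below shows how the product wraps around before malloc() ever sees it, while calloc() receives both factors and can reject the request. The helper names wrapped_request() and alloc_array() are hypothetical, chosen only for this illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: the byte count that malloc(nmemb * size) would
 * actually receive. Unsigned multiplication wraps silently. */
size_t wrapped_request(size_t nmemb, size_t size)
{
    return nmemb * size;
}

/* Hypothetical helper: allocate a zeroed array and let calloc()
 * perform the overflow check on nmemb * size. */
void *alloc_array(size_t nmemb, size_t size)
{
    return calloc(nmemb, size);
}
```

For example, with nmemb equal to SIZE_MAX / 2 + 2 and size equal to 2, the true product does not fit in a size_t and wrapped_request() yields 2, so malloc() would return a 2-byte block for what the caller believes is a huge array. The same arguments passed to alloc_array() produce NULL.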
The realloc() function changes the size of the memory block pointed to by ptr to size bytes. | |
The contents will be unchanged in the range from the start of the region up to the minimum of | |
the old and new sizes. If the new size is larger than the old size, the added memory will not | |
be initialized. If ptr is NULL, then the call is equivalent to malloc(size), for all values of | |
size; if size is equal to zero, and ptr is not NULL, then the call is equivalent to free(ptr). | |
Unless ptr is NULL, it must have been returned by an earlier call to malloc(), calloc(), or re‐ | |
alloc(). If the area pointed to was moved, a free(ptr) is done. | |
The reallocarray() function changes the size of the memory block pointed to by ptr to be large | |
enough for an array of nmemb elements, each of which is size bytes. It is equivalent to the | |
call | |
realloc(ptr, nmemb * size); | |
However, unlike that realloc() call, reallocarray() fails safely in the case where the multi‐ | |
plication would overflow. If such an overflow occurs, reallocarray() returns NULL, sets errno | |
to ENOMEM, and leaves the original block of memory unchanged. | |
RETURN VALUE | |
The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably | |
aligned for any built-in type. On error, these functions return NULL. NULL may also be re‐ | |
turned by a successful call to malloc() with a size of zero, or by a successful call to cal‐ | |
loc() with nmemb or size equal to zero. | |
The free() function returns no value. | |
The realloc() function returns a pointer to the newly allocated memory, which is suitably | |
aligned for any built-in type, or NULL if the request failed. The returned pointer may be the | |
same as ptr if the allocation was not moved (e.g., there was room to expand the allocation in- | |
place), or different from ptr if the allocation was moved to a new address. If size was equal | |
to 0, either NULL or a pointer suitable to be passed to free() is returned. If realloc() | |
fails, the original block is left untouched; it is not freed or moved. | |
On success, the reallocarray() function returns a pointer to the newly allocated memory. On | |
failure, it returns NULL and the original block of memory is left untouched. | |
NOTES | |
This interface is compatible with both ISO C99 and POSIX specifications. | |
The implementation adopts the principle "first available wins" when returning pointers obtained | |
through calls to malloc(). Specifically, if an application invokes malloc() repeatedly within one | |
segment | |
generation time: 21058.956373 ms | |
tokens consumed: 1217 | |
gen time per token: 17.303990446179128 ms | |
total time: 23.015911156 s |