I have tried loading large language models in Python with ctransformers and llama-cpp-python, and now I am starting to learn llama.cpp itself. llama.cpp performs better than llama-cpp-python, and it is also the foundation of many important projects, so rather than keep going around in circles I want to study it in depth. My reason for learning C++ in the first place is to work with llama.cpp, so I will get right to it: analyze the llama.cpp source code bit by bit while using it, and pick up C++ along the way.
First I need to compile llama.cpp in Termux on Android. I won't go into the tedious details; just follow the instructions on GitHub. The compiled executables all end up in llama.cpp/build/bin, and from bash you can already reach all the basic functionality. For anything more advanced you have to modify the source code yourself and recompile, which is ultimately what I want to do.
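For reference, the build boils down to something like the commands below. This is only a sketch: the Termux package names and the OpenBLAS CMake options are my assumptions for the version I built, so check the GitHub README if they have changed.

# Build tools and OpenBLAS inside Termux (package names may vary)
pkg install git cmake clang libopenblas

# Fetch the source and do an out-of-tree CMake build with BLAS enabled
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
cmake --build . --config Release

# The binaries (main, server, ...) end up in llama.cpp/build/bin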
I started with the CLI and the web UI, running Llama-2-13B-Chat on my phone; at the moment it is the best large language model that both runs on a phone and allows commercial use. Initial tests average about 2.5 tokens per second, noticeably faster than the 1.6 tokens per second I got with llama-cpp-python, and the phone also runs cooler. The llama.cpp server additionally exposes an HTTP API, which makes it very convenient to integrate large language models into all kinds of projects.
Then, with OpenBLAS enabled:
brian@localhost:~$ llama.cpp/build/bin/main -m /sdcard/Documents/Pydroid3/llm/models/llama-2-13b-chat.ggmlv3.q4_K_M.bin --ctx_size 4096 --temp 0.8 --top_k 50 --top_p 0.9 --batch_size 4096 --repeat_penalty 1.17647 --color -p "List 5 fruits.
"
main: warning: base model only supports context sizes no greater than 2048 tokens (4096 specified)
main: build = 918 (7c529ce)
main: seed = 1690500323
llama.cpp: loading model from /sdcard/Documents/Pydroid3/llm/models/llama-2-13b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 8165.96 MB (+ 3200.00 MB per state)
llama_new_context_with_model: kv self size = 3200.00 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.176470, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 50, tfs_z = 1.000000, top_p = 0.900000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0
List 5 fruits.
1) Apple
2) Banana
3) Mango
4) Orange
5) Watermelon [end of text]
llama_print_timings: load time = 2558.52 ms
llama_print_timings: sample time = 15.02 ms / 25 runs ( 0.60 ms per token, 1664.56 tokens per second)
llama_print_timings: prompt eval time = 5010.56 ms / 9 tokens ( 556.73 ms per token, 1.80 tokens per second)
llama_print_timings: eval time = 12744.24 ms / 24 runs ( 531.01 ms per token, 1.88 tokens per second)
llama_print_timings: total time = 17774.46 ms
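The server run below goes through a small wrapper script, llserver.sh, so I don't have to retype the flags every time. The script itself is mine, not part of llama.cpp; a minimal sketch of what it could contain, with the model path, thread count, and port taken from the logs and the remaining flags assumed:

#!/data/data/com.termux/files/usr/bin/bash
# Launch the llama.cpp HTTP server example with the same 13B chat model
llama.cpp/build/bin/server \
  -m /sdcard/Documents/Pydroid3/llm/models/llama-2-13b-chat.ggmlv3.q4_K_M.bin \
  -c 4096 \
  -t 4 \
  --host localhost \
  --port 8899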
brian@localhost:~$ bash llserver.sh
{"timestamp":1690500355,"level":"INFO","function":"main","line":1123,"message":"build info","build":918,"commit":"7c529ce"}
{"timestamp":1690500355,"level":"INFO","function":"main","line":1125,"message":"system info","n_threads":4,"total_threads":8,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | "}
llama.cpp: loading model from /sdcard/Documents/Pydroid3/llm/models/llama-2-13b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 8165.96 MB (+ 3200.00 MB per state)
llama_new_context_with_model: kv self size = 3200.00 MB
llama server listening at http://localhost:8899
{"timestamp":1690500356,"level":"INFO","function":"main","line":1341,"message":"HTTP server listening","hostname":"localhost","port":8899}
{"timestamp":1690500356,"level":"INFO","function":"log_server_request","line":1090,"message":"request","remote_addr":"::1","remote_port":39078,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1690500357,"level":"INFO","function":"log_server_request","line":1090,"message":"request","remote_addr":"::1","remote_port":39078,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1690500357,"level":"INFO","function":"log_server_request","line":1090,"message":"request","remote_addr":"::1","remote_port":39080,"status":200,"method":"GET","path":"/completion.js","params":{}}
llama_print_timings: load time = 51890.58 ms
llama_print_timings: sample time = 122.20 ms / 44 runs ( 2.78 ms per token, 360.08 tokens per second)
llama_print_timings: prompt eval time = 51877.49 ms / 40 tokens ( 1296.94 ms per token, 0.77 tokens per second)
llama_print_timings: eval time = 22306.16 ms / 43 runs ( 518.75 ms per token, 1.93 tokens per second)
llama_print_timings: total time = 74354.03 ms
{"timestamp":1690500442,"level":"INFO","function":"log_server_request","line":1090,"message":"request","remote_addr":"::1","remote_port":38934,"status":200,"method":"POST","path":"/completion","params":{}}
^C
brian@localhost:~$
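The POST /completion entry near the end of the log is the API call the web UI makes. The same endpoint can be hit directly with curl; a minimal example against the server above, where the prompt and n_predict values are just placeholders:

curl --request POST \
  --url http://localhost:8899/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "List 5 fruits.", "n_predict": 64}'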
Now I recompiled llama-cpp-python again, this time with OpenBLAS as well. With both builds using OpenBLAS, it still seems a little slower than llama.cpp.
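For completeness, rebuilding llama-cpp-python against OpenBLAS means reinstalling it from source with CMake flags passed through pip. The exact flag names have shifted between versions, so treat this as an approximation rather than the exact command I used:

# Force a from-source rebuild of llama-cpp-python with BLAS enabled
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" FORCE_CMAKE=1 \
  pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir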
I have already integrated several features of my own, including Highlight and streaming files from SQL. Most importantly, I can fold whatever functions I develop into my own system whenever I need to; that kind of requirement is hard to satisfy when relying on someone else's interface.