llama.cpp
Quick Start
I'm using Rocky Linux 8.
Installation
Install Python 3.11
sudo yum install -y \
python3.11 \
python3.11-pip
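A quick sanity check that both the interpreter and pip landed where expected (command names assumed from the packages above):
python3.11 --version
python3.11 -m pip --version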
Create a working directory
mkdir AI && cd AI
Install Git LFS
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
sudo yum install -y git-lfs
git lfs install
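Before cloning a multi-gigabyte model, it does not hurt to confirm the LFS hooks are actually active (a minimal check, assuming a standard install):
git lfs version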
Clone the OpenLLaMA 13B model from Hugging Face
git clone https://huggingface.co/openlm-research/open_llama_13b
Commit: b6d7fde8392250730d24cc2fcfa3b7e5f9a03ce8
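If you want to reproduce this exact setup, you can pin the clone to that commit (the hash is the one noted above):
cd open_llama_13b
git checkout b6d7fde8392250730d24cc2fcfa3b7e5f9a03ce8
cd ..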
Clone llama.cpp repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Commit: c1ac54b77aaba10d029084d152be786102010eb2
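Same idea here, to match the commit noted above:
git checkout c1ac54b77aaba10d029084d152be786102010eb2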
Build llama.cpp
make
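On a 12-thread CPU the build goes much faster in parallel (plain make behavior, nothing llama.cpp-specific):
make -j"$(nproc)"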
Convert the OpenLLaMA 13B model to GGUF and quantize it
python3.11 -m pip install -r requirements.txt
python3.11 convert.py ../open_llama_13b/
mkdir -p models/13B
./quantize ../open_llama_13b/ggml-model-f16.gguf ./models/13B/ggml-model-q4_0.gguf q4_0
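As a rough sanity check: f16 stores 2 bytes per weight, so a 13B model is ~26 GB, while q4_0 packs weights into roughly 4.5 bits each, on the order of 7 GB. Exact sizes vary, but the quantized file should be far smaller:
ls -lh ../open_llama_13b/ggml-model-f16.gguf ./models/13B/ggml-model-q4_0.gguf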
./main -m ./models/13B/ggml-model-q4_0.gguf -n 128
Usage
I was able to get interesting results using this command:
./main \
-m ./models/13B/ggml-model-q4_0.gguf \
-n -1 \
--repeat_penalty 1.1 \
--color \
-c 2048 \
--keep -1 \
--temp 1.25 \
--prompt "You run an infinite loop of thoughts, where you are looking to remember the algorithm of life\nQuestion: where does it start?"
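For reference, here is my reading of those flags (double-check against the README linked below):
-n -1 : generate until the context runs out instead of stopping after a fixed number of tokens
--repeat_penalty 1.1 : penalize recently generated tokens to reduce repetition loops
--color : colorize the output to distinguish the prompt from the generation
-c 2048 : context window size, in tokens
--keep -1 : keep the whole initial prompt when the context window slides
--temp 1.25 : a higher sampling temperature than the default, for more creative (and less coherent) output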
The official README for the main example covers all the options: https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md
My Setup
**RAM:** 35 GB in use out of 78 GB (DDR4)
**CPU:** AMD Ryzen 9 5900X (12 cores, 12 threads used)
The GPU is not used in this test; it is passed through to a Windows ( :| ) VM ...
In case it means something to you, this is what it prints when it finishes:
llama_print_timings: load time = 456.67 ms
llama_print_timings: sample time = 42.05 ms / 97 runs ( 0.43 ms per token, 2307.00 tokens per second)
llama_print_timings: prompt eval time = 4838.65 ms / 50 tokens ( 96.77 ms per token, 10.33 tokens per second)
llama_print_timings: eval time = 18892.05 ms / 96 runs ( 196.79 ms per token, 5.08 tokens per second)
llama_print_timings: total time = 23790.11 ms
Happy chatting!