By Tommy Gingras

Last update 2023-08-27

AI

llama.cpp

Source:

Quick Start

I'm using Rocky Linux 8.

Installation

Install Python 3.11

sudo yum install -y \
    python3.11 \
    python3.11-pip
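
Optionally, a quick sanity check that the right interpreter and its pip are on the PATH before going further:

# Both should report a 3.11.x version
python3.11 --version
python3.11 -m pip --version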

Create a working directory

mkdir AI && cd AI

Install Git LFS

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
sudo yum install git-lfs

git lfs install
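
Optionally, confirm that the LFS extension is actually picked up by git before cloning multi-gigabyte model files:

# Should print the installed git-lfs version
git lfs version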

Clone the OpenLLaMA 13B model from Hugging Face

git clone https://huggingface.co/openlm-research/open_llama_13b

Commit: b6d7fde8392250730d24cc2fcfa3b7e5f9a03ce8
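
If you want to match the exact revision used in this post, you can pin the clone to the commit above:

# Pin the model repo to the commit listed above
cd open_llama_13b
git checkout b6d7fde8392250730d24cc2fcfa3b7e5f9a03ce8
cd ..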

Clone the llama.cpp repository

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Commit: c1ac54b77aaba10d029084d152be786102010eb2
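
Likewise, to build the same llama.cpp revision as above (run from inside the llama.cpp directory):

# Pin llama.cpp to the commit listed above
git checkout c1ac54b77aaba10d029084d152be786102010eb2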

Build llama.cpp

make
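
The single-threaded build takes a while; make accepts the usual -j flag if you want to use all your cores:

# Parallel build using every available CPU thread
make -j$(nproc)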

Convert the OpenLLaMA 13B model to GGUF, quantize it, and run a quick test

python3.11 -m pip install -r requirements.txt
python3.11 convert.py ../open_llama_13b/
mkdir models/13B
./quantize ../open_llama_13b/ggml-model-f16.gguf ./models/13B/ggml-model-q4_0.gguf q4_0
./main -m ./models/13B/ggml-model-q4_0.gguf -n 128
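
q4_0 is only one of the quantization types the quantize tool supports. As a hedged example, a q8_0 file should be larger on disk but closer to the original f16 weights; check the quantize tool's usage output for the full list supported by your build:

# Alternative quantization: larger file, closer to the f16 original
./quantize ../open_llama_13b/ggml-model-f16.gguf ./models/13B/ggml-model-q8_0.gguf q8_0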

Usage

I was able to get interesting results using this command:

./main \
-m ./models/13B/ggml-model-q4_0.gguf \
-n -1 \
--repeat_penalty 1.1 \
--color \
-c 2048 \
--keep -1 \
--temp 1.25 \
--prompt "You run an infinite loop of thoughs, where you are looking to remember the algorithm of life\nQuestion: where does it start ?"

The official README for the main example: https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md
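
For a back-and-forth chat rather than a one-shot completion, main also has an interactive mode. A minimal sketch, using flags documented in the README linked above (the prompt text is just an illustration, and keep in mind the base OpenLLaMA model is not instruction-tuned, so results may be rough; check ./main --help for your build):

# Interactive chat: generation pauses and hands control back whenever "User:" appears
./main \
-m ./models/13B/ggml-model-q4_0.gguf \
--color \
-c 2048 \
-i \
--interactive-first \
-r "User:" \
--prompt "Transcript of a chat between User and Assistant.\nUser:"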


My Setup

**RAM:** 35 / 78 GB (DDR4)
**CPU:** 12 threads (AMD Ryzen 9 5900X 12-Core Processor)
The GPU is not used for this test; it is passed through to a Windows ( :| ) VM ...

In case it means something to you, this is what it prints when it finishes:

llama_print_timings:        load time =   456.67 ms
llama_print_timings:      sample time =    42.05 ms /    97 runs   (    0.43 ms per token,  2307.00 tokens per second)
llama_print_timings: prompt eval time =  4838.65 ms /    50 tokens (   96.77 ms per token,    10.33 tokens per second)
llama_print_timings:        eval time = 18892.05 ms /    96 runs   (  196.79 ms per token,     5.08 tokens per second)
llama_print_timings:       total time = 23790.11 ms
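
For reference, the per-token figures are just the totals divided by the counts: 18892.05 ms / 96 runs ≈ 196.8 ms per generated token, i.e. roughly 5 tokens per second on CPU only.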

Happy chatting!