A step-by-step guide to running a local LLM with llama-cpp-python

Posted: April 02, 2024

As we enter the era of GenAI, LLMs can come with a hefty price tag. Fortunately, the open source community has introduced lightweight versions of these cutting-edge technologies, enabling experimentation without breaking your budget. This article will guide you through three simple steps to kickstart your journey with llama-cpp-python, the Python bindings for llama.cpp, which leverage the strengths of C++, Python, and innovative quantization techniques to run LLMs on modest hardware.

Guide

Step 1 - Download the model

Download the model file codellama-7b.Q4_0.gguf from Hugging Face and save it to <current working directory>/models/codellama-7b.Q4_0.gguf
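Before moving on, it can help to confirm the file landed where the later steps expect it. The snippet below is a minimal sketch, not part of llama-cpp-python: the helper name looks_like_gguf is hypothetical, and the check relies only on the fact that GGUF files begin with the four ASCII bytes "GGUF".

```python
from pathlib import Path

# Path used throughout this guide, relative to the current working directory.
MODEL_PATH = Path("models") / "codellama-7b.Q4_0.gguf"

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def looks_like_gguf(path: Path) -> bool:
    """Return True if `path` exists and starts with the GGUF magic bytes.

    Hypothetical helper for illustration; it does not validate the rest of the file.
    """
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(4) == GGUF_MAGIC
```

Calling looks_like_gguf(MODEL_PATH) from the directory that contains models/ should return True after a successful download. A False result usually points to a wrong path or to an error page saved in place of the model.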

Setting Up Your Environment

We'll need to create a conda environment so that we use a consistent version of Python and isolate our dependencies:

$ conda create -n llama python=3.9.16
$ conda activate llama

Install llama-cpp-python

Now, let's install llama-cpp-python and its dependencies:

$ pip install llama-cpp-python
$ pip install 'llama-cpp-python[server]'

Note: For MacBook Pro (Intel Chip), use the following commands:

$ CMAKE_ARGS="-DLLAMA_METAL=off -DLLAMA_CLBLAST=on" FORCE_CMAKE=1 pip install -U git+https://github.com/abetlen/llama-cpp-python.git --no-cache-dir
$ pip install 'llama-cpp-python[server]'
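Whichever install commands you used, a quick sanity check from inside the llama environment can confirm that the packages are visible to Python. This is a sketch using the standard library's importlib.metadata. The helper names are hypothetical, and listing uvicorn and fastapi as packages pulled in by the [server] extra is an assumption.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report(dist_names):
    """Map each distribution name to its version string or 'not installed'."""
    return {name: installed_version(name) or "not installed" for name in dist_names}
```

For example, print(report(["llama-cpp-python", "uvicorn", "fastapi"])) shows at a glance whether the core package and the assumed server dependencies are present.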

Run the model

With everything set up, we can run our model:

$ python3 -m llama_cpp.server --model models/codellama-7b.Q4_0.gguf --n_gpu_layers 1

Once the model is running, you can access the Swagger documentation at http://localhost:8000/docs
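Loading a 7B model can take a little while, so it is handy to check from a script whether the server is answering yet. The following is a rough sketch using only the standard library. The function name server_is_up is hypothetical, and it assumes the default address shown above.

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # address used in this guide


def server_is_up(base_url: str = BASE_URL, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/docs answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/docs", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

Calling server_is_up() in a loop with a short sleep is one simple way to wait for the model to finish loading before sending requests.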

Exploration

Now comes the fun part: exploring what the model can do! Navigate to http://localhost:8000/docs, where you’ll find the Swagger docs. Expand the view for the first API call, POST /v1/completions, click “Try it out”, and then click “Execute”.

After the API call has finished, we can view the response. The example request pre-filled in Swagger asks for the capital of France, so if everything is set up correctly, the response text should contain “Paris”.
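The same call can be made outside the browser. Below is a minimal sketch of a client for POST /v1/completions using only the standard library. The helper names are hypothetical, the prompt is an assumption chosen so that the expected answer is "Paris", and the response is assumed to follow the OpenAI-style shape with a choices list.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address used in this guide

# Assumed prompt, chosen so the expected answer is "Paris".
PROMPT = "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n"


def build_completion_request(prompt: str, max_tokens: int = 16) -> dict:
    """Build a JSON body for POST /v1/completions."""
    return {"prompt": prompt, "max_tokens": max_tokens, "stop": ["\n", "###"]}


def extract_text(response: dict) -> str:
    """Pull the generated text out of an OpenAI-style completion response."""
    return response["choices"][0]["text"].strip()


def complete(prompt: str, base_url: str = BASE_URL, timeout: float = 120.0) -> str:
    """Send the prompt to the running server and return the generated text."""
    body = json.dumps(build_completion_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_text(json.load(resp))
```

With the server running, print(complete(PROMPT)) sends the request and prints the generated text. Exact wording can vary between runs and models, so treat "Paris" as the likely answer and not a guaranteed one.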

I’m really excited to continue exploring the possibilities and tech behind local LLMs. Stay tuned for more updates!