Running Elyza models on GPU using llama-cpp-python

Motivation

Quantization is essential to run LLMs on a local workstation with 12-16 GB of GPU memory. In this post, I summarize my attempt to make the most of GPU resources using llama-cpp-python.

The content includes some of my mistakes, since I ran into trouble in areas I did not fully understand.

Background

To try out recent LLMs on a local workstation with limited GPU memory, quantization is a must. I introduced one aspect of this in an earlier post.

Since then, I have researched quantization on the Internet and tried a few things. There are currently two typical quantization approaches, AWQ (AutoAWQ) and GPTQ (AutoGPTQ). I wanted to use AutoAWQ, but found that it requires Compute Capability 7.5 or later (Turing and newer architectures). My environment has two types of GPU: an RTX A4000 (Compute Capability 8.6) and a TITAN V (Compute Capability 7.0). Since AutoAWQ would not work on the TITAN V, I gave up on it and tried AutoGPTQ instead.
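If you are unsure which Compute Capability your GPUs report, PyTorch (already part of my environment) can print it directly; a minimal sketch:

import torch

# Print the Compute Capability of each visible CUDA device,
# e.g. (8, 6) for an RTX A4000 and (7, 0) for a TITAN V
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"{torch.cuda.get_device_name(i)}: {major}.{minor}")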

While trying AutoGPTQ, I found that it sometimes worked, sometimes did not, and sometimes caused CUDA out-of-memory errors. The true cause was unclear; I suspect it was a combination of library versions. Furthermore, the TITAN V and the RTX A4000 produced different errors.

Then, recently, I kept seeing articles on the net about running LLMs with llama.cpp (llama-cpp-python).

For this reason, I decided to set up JupyterLab with llama-cpp-python and use the LLM from a notebook. As the LLM, I chose ELYZA, which has a good reputation and is relatively well covered on the net.

In fact, with llama-cpp-python I used a model file that had already been converted to GGUF format and uploaded to Hugging Face.

Create a Docker container for JupyterLab

Failure example

At first, I thought I could use the GPU from JupyterLab simply by building a Docker container with llama-cpp-python installed via pip and setting n_gpu_layers=-1 in the Llama constructor.

Here is the Dockerfile I created, still including some remnants from when I was experimenting with AutoGPTQ (lol):

# Docker image with JupyterLab available.
# Installed the necessary packages for the NoteBooks I have created so far.

FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

# Set bash as the default shell
ENV SHELL=/bin/bash

# Build with some basic utilities
RUN apt-get update && apt-get install -y \
    python3-pip apt-utils vim \
    git git-lfs \
    curl unzip wget

# alias python='python3'
RUN ln -s /usr/bin/python3 /usr/bin/python

RUN pip install --upgrade pip setuptools \
	&& pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 \
	--index-url https://download.pytorch.org/whl/cu121 \
	&& pip install jupyterlab matplotlib pandas scikit-learn ipywidgets \
	&& pip install transformers accelerate sentencepiece einops \
	&& pip install langchain bitsandbytes protobuf \
	&& pip install auto-gptq optimum \
	&& pip install llama-cpp-python

# Create a working directory
WORKDIR /workdir

# Port number on the container side
EXPOSE 8888

ENTRYPOINT ["jupyter-lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root", "--NotebookApp.token=''"]

CMD ["--notebook-dir=/workdir"]

The code I actually tried was the code from this page, except that the model I used was the following:

model_path="ELYZA-japanese-Llama-2-7b-fast-instruct-q4_K_M.gguf",
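For reference, the surrounding call looks roughly like this; a minimal sketch, not the exact code from that page, with a placeholder prompt (ELYZA's instruct models follow the Llama 2 chat prompt format):

from llama_cpp import Llama

llm = Llama(
    model_path="ELYZA-japanese-Llama-2-7b-fast-instruct-q4_K_M.gguf",
    n_gpu_layers=-1,  # intended to offload all layers to the GPU
    n_ctx=2048,       # context window size
)

# Llama 2 chat format used by the ELYZA instruct models
prompt = "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\nPlease summarize the following text: ... [/INST]"
output = llm(prompt, max_tokens=256)
print(output["choices"][0]["text"])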

The code did produce a summary, but it was slower than I expected, so I did some research on the net.

What I found out is that when installing llama-cpp-python, you need to specify build parameters: a plain pip install gives you a CPU-only build, so n_gpu_layers has no effect until the package is rebuilt with CUDA support enabled.
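One way to check whether the installed package actually supports GPU offloading is to load a model with verbose=True and watch the llama.cpp log (a sketch; the exact log lines vary between versions):

from llama_cpp import Llama

# With a CUDA-enabled build, the load log includes offload lines such as
# "llm_load_tensors: offloaded ... layers to GPU"; a CPU-only build
# loads all tensors into system RAM and prints no such line.
llm = Llama(
    model_path="ELYZA-japanese-Llama-2-7b-fast-instruct-q4_K_M.gguf",
    n_gpu_layers=-1,
    verbose=True,  # print llama.cpp load and timing logs to stderr
)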

GPU-enabled Dockerfile

In the failed Dockerfile, I removed “&& pip install llama-cpp-python” and added the following before “# Create a working directory”.

# Install llama-cpp-python, building it from source with CUDA (cuBLAS) enabled
RUN CUDACXX=/usr/local/cuda-12/bin/nvcc CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=all-major" FORCE_CMAKE=1 \
    pip install jupyterlab llama-cpp-python --no-cache-dir --force-reinstall --upgrade
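To build the image and start JupyterLab with the GPUs visible inside the container, I use the standard Docker CLI (the image name and mounted directory here are my own choices; --gpus all requires the NVIDIA Container Toolkit):

# Build the image and run JupyterLab with all GPUs exposed
docker build -t llama-jupyterlab .
docker run --rm --gpus all -p 8888:8888 -v $(pwd):/workdir llama-jupyterlab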

Evaluation and future

Evaluation

I compared the total time with n_gpu_layers=-1 (all layers offloaded to the GPU) against the total time with that parameter commented out (CPU only, no GPU).

GPU use or not          total time (ms)
Use GPU in all layers          5,256.59
No GPU use                    82,449.76

Using the GPU is roughly 15 times faster (82,449.76 ms / 5,256.59 ms ≈ 15.7).
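For anyone reproducing the comparison, a wall-clock measurement from the notebook gives roughly the same picture as the total time reported in llama.cpp's timing log (a minimal sketch with a placeholder prompt):

import time
from llama_cpp import Llama

# Switch between n_gpu_layers=-1 and omitting the parameter (CPU only)
llm = Llama(model_path="ELYZA-japanese-Llama-2-7b-fast-instruct-q4_K_M.gguf", n_gpu_layers=-1)

start = time.perf_counter()
llm("[INST] Please summarize the following text: ... [/INST]", max_tokens=256)
print(f"total time: {(time.perf_counter() - start) * 1000:,.2f} ms")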

Future

I will try to build a system that can chat like ChatGPT using the ELYZA model. At that point, I would also like to try models other than ELYZA, such as CALM2-7B.