Motivation
Quantization is essential to run LLMs on a local workstation (12-16 GB of GPU memory). In this post, I summarize my attempts to make the most of GPU resources using llama-cpp-python.
The content includes some of my mistakes, since I stumbled in a few areas where my understanding was lacking.
Background
To try out recent LLMs on a local workstation with limited GPU memory, quantization is a must. I introduced one aspect of this in a previous post.
Since then, I have researched quantization on the Internet and tried a few things. Currently there are two typical quantization approaches, AutoAWQ and AutoGPTQ. I wanted to use AutoAWQ, but found that it requires Compute Capability 7.5 (Turing and later architectures). My environment has two types of GPU: an RTX A4000 (Compute Capability 8.6) and a TITAN V (Compute Capability 7.0). Since AutoAWQ would not work on the TITAN V, I gave up on it and tried AutoGPTQ instead.
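If you want to check the Compute Capability of your own GPUs before choosing a quantization library, a minimal sketch with PyTorch (assuming torch with CUDA support is installed, as in the Dockerfile below) looks like this:

import torch

# Print the name and Compute Capability of every visible GPU.
# AutoAWQ requires Compute Capability 7.5 or higher.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"{torch.cuda.get_device_name(i)}: Compute Capability {major}.{minor}")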
While trying AutoGPTQ, I found that it sometimes worked, sometimes did not, and sometimes caused CUDA out-of-memory errors. The true cause was unknown, but I suspect it was a combination of library versions. Furthermore, the TITAN V and the RTX A4000 produced different errors.
Then, recently, I started seeing many articles on the net about using LLMs with llama.cpp (llama-cpp-python).
For this reason, I decided to set up JupyterLab with llama-cpp-python and use an LLM from a notebook. As the LLM, I chose ELYZA, which has a good reputation and for which information is relatively easy to find.
With llama-cpp-python, I used a model file that had been converted to GGUF format and uploaded to Hugging Face.
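As a side note, a GGUF file like this can be fetched programmatically with huggingface_hub; the repo_id below is only an example of a GGUF conversion repository, so substitute the one you actually use:

from huggingface_hub import hf_hub_download

# Download a GGUF file from the Hugging Face Hub.
# repo_id is an example here; replace it with the repository that
# hosts the GGUF conversion you want to use.
model_path = hf_hub_download(
    repo_id="mmnga/ELYZA-japanese-Llama-2-7b-fast-instruct-gguf",  # example repo
    filename="ELYZA-japanese-Llama-2-7b-fast-instruct-q4_K_M.gguf",
)
print(model_path)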
Creating a Docker container for JupyterLab
Failure example
At first, I thought I could use the GPU from JupyterLab simply by building a Docker container with llama-cpp-python installed via pip and setting n_gpu_layers=-1 in the Llama parameters.
Here is the Dockerfile I created, including some remnants from when I was experimenting with AutoGPTQ (lol).
# Docker image with JupyterLab available.
# Installed the necessary packages for the NoteBooks I have created so far.
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
# Set bash as the default shell
ENV SHELL=/bin/bash
# Build with some basic utilities
RUN apt-get update && apt-get install -y \
python3-pip apt-utils vim \
git git-lfs \
curl unzip wget
# alias python='python3'
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN pip install --upgrade pip setuptools \
&& pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 \
--index-url https://download.pytorch.org/whl/cu121 \
&& pip install jupyterlab matplotlib pandas scikit-learn ipywidgets \
&& pip install transformers accelerate sentencepiece einops \
&& pip install langchain bitsandbytes protobuf \
&& pip install auto-gptq optimum \
&& pip install llama-cpp-python
# Create a working directory
WORKDIR /workdir
# Port number in container side
EXPOSE 8888
ENTRYPOINT ["jupyter-lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root", "--NotebookApp.token=''"]
CMD ["--notebook-dir=/workdir"]
The actual code I tried was taken from this page. However, the model I used was as follows:
model_path="ELYZA-japanese-Llama-2-7b-fast-instruct-q4_K_M.gguf",
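For reference, the model loading roughly corresponds to the following sketch (this is not the exact code from the page above; the prompt and the parameters other than model_path and n_gpu_layers are assumptions):

from llama_cpp import Llama

# Load the quantized GGUF model.
# n_gpu_layers=-1 asks llama.cpp to offload all layers to the GPU,
# which only works if llama-cpp-python was built with CUDA support.
llm = Llama(
    model_path="ELYZA-japanese-Llama-2-7b-fast-instruct-q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,  # context length; an assumption for this sketch
)

# A simple completion call; the prompt format is simplified here.
output = llm("Summarize the following text: ...", max_tokens=256)
print(output["choices"][0]["text"])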
The code produced a summary, but it was slower than I expected, so I did some research on the net.
What I found out is that when installing llama-cpp-python, you need to specify build parameters so that it is compiled with GPU (CUDA) support.
GPU-enabled Dockerfile
In the failed Dockerfile, I removed "&& pip install llama-cpp-python" and added the following before "# Create a working directory".
# Install llama-cpp-python
RUN CUDACXX=/usr/local/cuda-12/bin/nvcc CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=all-major" FORCE_CMAKE=1 \
pip install jupyterlab llama-cpp-python --no-cache-dir --force-reinstall --upgrade
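After rebuilding the image, a quick way to confirm that the CUDA build is actually used is to load the model with verbose=True and check the startup log; with a working GPU build, the log mentions the detected GPU and the number of offloaded layers (the exact wording depends on the llama.cpp version):

from llama_cpp import Llama

# verbose=True makes llama.cpp print its startup log.
# With a working CUDA build, the log should report the detected GPU
# and how many layers were offloaded to it.
llm = Llama(
    model_path="ELYZA-japanese-Llama-2-7b-fast-instruct-q4_K_M.gguf",
    n_gpu_layers=-1,
    verbose=True,
)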
Evaluation and future
Evaluation
I compared the total time with n_gpu_layers=-1 (all layers offloaded to the GPU) against the total time with that parameter commented out (CPU only, no GPU).
| GPU usage | Total time (ms) |
| --- | --- |
| GPU used for all layers | 5,256.59 |
| CPU only (no GPU) | 82,449.76 |
With the GPU, generation is about 15 times faster.
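As a rough way to reproduce this kind of comparison, a simple wall-clock measurement like the sketch below works (the prompt and parameters are placeholders):

import time
from llama_cpp import Llama

def run(n_gpu_layers: int) -> float:
    """Load the model, run one generation, and return elapsed wall-clock seconds."""
    llm = Llama(
        model_path="ELYZA-japanese-Llama-2-7b-fast-instruct-q4_K_M.gguf",
        n_gpu_layers=n_gpu_layers,
        verbose=False,
    )
    start = time.perf_counter()
    llm("Summarize the following text: ...", max_tokens=256)
    return time.perf_counter() - start

gpu_time = run(n_gpu_layers=-1)  # offload all layers to the GPU
cpu_time = run(n_gpu_layers=0)   # CPU only
print(f"GPU: {gpu_time:.2f} s, CPU: {cpu_time:.2f} s, speedup: {cpu_time / gpu_time:.1f}x")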
Future
Next, I will try to build a system that can hold a conversation like ChatGPT using the ELYZA model. Along the way, I would also like to try models other than ELYZA, such as CALM2-7B.