Running LLMs in a local environment using ollama

Motivation

In this post, I mentioned that the LLMs I could use to build knowledge graphs were those provided by OpenAI and Mistral (via API). On the Internet, however, I have seen examples of GraphRAG environments built with ollama, as in this post.

I would like to try building a knowledge graph with an LLM running in a local environment. In this post, I summarize the process of installing ollama.

Information sources

  1. ollama/ollama - Information from Docker Hub.
  2. Ollama is now available as an official Docker image - A post on ollama’s blog about running ollama as a Docker container.
  3. Experience with Ollama! Gemma2-2b-jpn opens up a new era of Japanese AI applications - Information on installing the Japanese version of Gemma 2 with ollama.
  4. Running Google’s Japanese version of Gemma 2 2B with Ollama - More information on installing the Japanese version of Gemma 2 with ollama.
  5. Select LLM in local environment and run it with LangChain. - The list of ollama CLI commands was helpful. I also tested four LLMs using the code in this article.
  6. OllamaLLM Connection refused… - I referred to this when the code I was checking in JupyterLab returned a connection refused error.
  7. From installation to Japanese model introduction - I referred to this when incorporating the GGUF file into ollama.

Install ollama

Create docker-compose.yml

I created the following YAML, referring to the container startup command lines in sources 1. and 2.

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ./ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: always

The host’s “./ollama” directory is under an NFS mount, the idea being that it can then be shared with another GPU machine.

$ pwd
/mnt/nfs2/workspace/ollama
$ ls -l
合計 8
-rw-rw-r-- 1 kenji kenji  325 11月 21 18:33 docker-compose.yml
drwxrwxr-x 3 kenji kenji 4096 11月 22 15:28 ollama
$ sudo docker compose up -d

Start the ollama container as described above.
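
After the container comes up, a quick way to confirm that the API is reachable is to query the version endpoint on the mapped port. Below is a minimal sketch in Python using only the standard library; the port comes from the docker-compose.yml above, and "localhost" assumes the check is run on the host machine itself.

import json
import urllib.request

# The ollama API is exposed on the host through the "11434:11434" port mapping above.
# GET /api/version returns the server version, e.g. {"version": "0.4.2"}.
with urllib.request.urlopen("http://localhost:11434/api/version") as resp:
    print(json.load(resp))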

(download and) run the LLM model

Enter the container started above and run the model with the ollama command as follows.

$ sudo docker exec -it ollama /bin/bash
# ollama --version
ollama version is 0.4.2
# ollama run schroneko/gemma-2-2b-jpn-it

Execution Results

I ran the sample prompt from source 3. as-is.

>>> 日本の四季について端的にわかりやすく教えて
日本の四季は、大きく分けて**春、夏、秋、冬**の4つの季節があります。

* **春:**  桜の花が咲き乱れる季節です。暖かくなり、植物が芽吹きます。
* **夏:**  暑く、日差しが強くなります。海や湖で泳ぎ、屋外で遊びます。
* **秋:**  紅葉が美しい季節です。気温が下がり、冷たい風が吹きます。
* **冬:**  寒く、雪が降ることがあります。冬スポーツを楽しんだり、暖房で体を温めたりし
ます。
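
(The prompt asks for a brief, easy-to-understand explanation of Japan’s four seasons, and the model answers with one bullet point each for spring, summer, autumn, and winter.) Because port 11434 is mapped to the host, the same prompt can also be sent through ollama’s HTTP API instead of the interactive prompt. A minimal sketch using the /api/generate endpoint, run from the host with only the standard library:

import json
import urllib.request

# Same prompt as above, sent from the host through the mapped port 11434.
payload = {
    "model": "schroneko/gemma-2-2b-jpn-it",
    "prompt": "日本の四季について端的にわかりやすく教えて",
    "stream": False,  # return the whole answer as one JSON object instead of a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])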

Using an ollama model from JupyterLab

I added “pip install ollama langchain-ollama” to the Dockerfile introduced in this post, then rebuilt the JupyterLab Docker container. The Dockerfile will eventually be revised, but more on that later.

Executable code

I ran the code from the JupyterLab container, referring to the code in source 5.

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM

model = OllamaLLM(base_url="http://192.168.11.4:11434", model="schroneko/gemma-2-2b-jpn-it")

# PROMPT
template = """Question: {question}
Answer: ステップバイステップで考えてみましょう。"""

prompt = ChatPromptTemplate.from_template(template)

# CHAIN
chain = prompt | model
result = chain.invoke({"question": "美味しいパスタの作り方は?"})
print(result)

The key point is the base_url argument of OllamaLLM(); this is based on source 6. If base_url is not specified, the client tries localhost inside the JupyterLab container, where no ollama server is running, so a connection refused error occurs.
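
The same consideration applies to the plain ollama Python client that was installed together with langchain-ollama. A quick connectivity check from the JupyterLab container might look like this (a sketch; the host address is the one used in the code above):

import ollama

# Without an explicit host, the client would try localhost inside the
# JupyterLab container, where no ollama server is running.
client = ollama.Client(host="http://192.168.11.4:11434")

# List the models that the ollama container currently has available.
print(client.list())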

Incorporate a GGUF file as a model

Referring to source 7., I downloaded the “Q8_0” file from this list on Hugging Face and prepared a Modelfile.

$ wget https://huggingface.co/mradermacher/Llama-3-neoAI-8B-Chat-v0.1-GGUF/resolve/main\
/Llama-3-neoAI-8B-Chat-v0.1.Q8_0.gguf
$ cat Modelfile
FROM ./Llama-3-neoAI-8B-Chat-v0.1.Q8_0.gguf

The above work is done under the ollama directory referenced in the docker-compose.yml, so the GGUF file and Modelfile can be accessed from inside the container.

Enter the ollama container (with the “docker exec” command already described) and create the model with the ollama command.

# ollama create neoai-8b-chat -f Modelfile
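
The model created this way can then be called from JupyterLab just like the pulled model, simply by changing the model name. A minimal sketch, reusing the setup from the code shown earlier:

from langchain_ollama.llms import OllamaLLM

# Same base_url as before; only the model name changes to the one
# given to "ollama create" above.
model = OllamaLLM(base_url="http://192.168.11.4:11434", model="neoai-8b-chat")

print(model.invoke("美味しいパスタの作り方は?"))  # "How do you make delicious pasta?"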

What I noticed

  1. Once a model has been pulled in the ollama container, it appears to be loaded and run automatically when the JupyterLab code connects to it via OllamaLLM, and I can access it.
  2. When the model is switched in JupyterLab, the ollama side appears to stop the old model and start the new one.
  3. When I enter the container and run “ollama stop <model>”, the GPU memory appears to be released. Which models are currently loaded can be checked as in the sketch after this list.
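
To see which models are currently loaded (and therefore holding GPU memory), the ollama API exposes a list of running models. A minimal sketch using the /api/ps endpoint, with the same host address as before:

import json
import urllib.request

# GET /api/ps lists the models currently loaded into memory,
# including their size and expiry time.
with urllib.request.urlopen("http://192.168.11.4:11434/api/ps") as resp:
    print(json.dumps(json.load(resp), indent=2, ensure_ascii=False))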

Review JupyterLab container

The ollama container can now handle LLM processing on the back end. Until now, I had embedded llama-cpp-python in the JupyterLab container; it is no longer required, which simplifies the JupyterLab container.

Dockerfile for JupyterLab container

The simplified Dockerfile is as follows

# Docker image with JupyterLab available.
# Installs the necessary packages for the notebooks I have created so far.

FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

# Set bash as the default shell
ENV SHELL=/bin/bash

# Build with some basic utilities
RUN apt update \
        && apt install -y \
        wget \
        bzip2 \
        git \
        git-lfs \
        curl \
        unzip \
        file \
        xz-utils \
        sudo \
        python3 \
        python3-pip && \
        apt-get autoremove -y && \
        apt-get clean && \
        rm -rf /usr/local/src/*

# alias python='python3'
RUN ln -s /usr/bin/python3 /usr/bin/python

RUN pip install --upgrade pip setuptools \
        && pip install torch torchvision torchaudio \
         --index-url https://download.pytorch.org/whl/cu121 \
        && pip install jupyterlab matplotlib pandas scikit-learn ipywidgets \
        && pip install transformers accelerate sentencepiece einops \
        && pip install auto-gptq optimum \
        && pip install langchain bitsandbytes protobuf \
        && pip install langchain-community langchain_openai wikipedia \
        && pip install langchain-huggingface unstructured html2text rank-bm25 janome \
        && pip install langchain-chroma sudachipy sudachidict_full \
        && pip install langchain-experimental neo4j \
        && pip install json-repair langchain-mistralai \
        && pip install mysql-connector-python \
        && pip install ragas datasets neo4j-graphrag \
        && pip install pypdf tiktoken sentence_transformers faiss-gpu trafilatura \
        && pip install ollama langchain-ollama

# Create a working directory
WORKDIR /workdir

# Port number in container side
EXPOSE 8888

ENTRYPOINT ["jupyter-lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root", "--NotebookApp.token=''"]

CMD ["--notebook-dir=/workdir"]

Summary

By separating the JupyterLab container and the ollama container, I was able to separate the coding part (front end) from the part that handles the LLM model (back end), which is structurally cleaner. I think this will pay off when building applications that use LLM models in the future.