Try RAG with LlamaIndex

Motivation

In the post where I tested Chatbot UI, I mentioned that one of my future challenges was to work with RAG (Retrieval-Augmented Generation). In this post, I summarize how to achieve RAG using LlamaIndex.

Actually, I already tried RAG using LangChain late last year. Since then, I have kept hearing the name LlamaIndex, so this time I decided to implement RAG with LlamaIndex.

Sources

  1. [How to execute RAG in a local environment using LlamaIndex](https://tech.dentsusoken.com/entry/2024/01/22/LlamaIndex%E3%82%92%E4%BD%BF%E3%81%A3%E3%81%A6%E3%83%AD%E3%83%BC%E3%82%AB%E3%83%AB%E7%92%B0%E5%A2%83%E3%81%A7RAG%E3%82%92%E5%AE%9F%E8%A1%8C%E3%81%99%E3%82%8B%E6%96%B9%E6%B3%95) I used the article's content (including the code and the supplementary material for RAG) almost verbatim. The article was written in January of this year, before the LlamaIndex version upgrade, so the affected parts of the code need to be changed.
  2. Tried RAG with LangChain using Elyza 7b. An article I referred to when I tried RAG during last year's year-end holidays.
  3. [ImportError: cannot import name ‘SimpleDirectoryReader’ from ‘llama_index’ (unknown location)](https://qiita.com/miyamotok0105/items/9f5d1fc8b92e3447a75f) The library was reorganized with the llama-index version change.
  4. Web Page Reader In source 1, text data was used as the data to be embedded; this page provides hints for using web page content as the embedding data instead.

Implementation in LlamaIndex

Dockerfile

The first half of the Dockerfile is based on source 1, and the part that installs the Python libraries is largely taken from the Dockerfile used in this post.

The two lines before “pip install llama-index” are necessary if llama_index is v0.10.0 or higher.

FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

# Set bash as the default shell
ENV SHELL=/bin/bash

# Build with some basic utilities
RUN apt update \
	&& apt install -y \
        wget \
        bzip2 \
        git \
        git-lfs \
        curl \
        unzip \
        file \
        xz-utils \
        sudo \
        python3 \
        python3-pip && \
        apt-get autoremove -y && \
        apt-get clean && \
        rm -rf /usr/local/src/*

# alias python='python3'
RUN ln -s /usr/bin/python3 /usr/bin/python

RUN pip install --upgrade pip setuptools \
	&& pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 \
	--index-url https://download.pytorch.org/whl/cu121 \
	&& pip install jupyterlab matplotlib pandas scikit-learn ipywidgets \
	&& pip install transformers accelerate sentencepiece einops \
	&& pip install langchain bitsandbytes protobuf \
	&& pip install auto-gptq optimum \
	&& pip install pypdf sentence-transformers \
	&& pip install llama-index-embeddings-huggingface llama-index-llms-openai \
	&& pip install llama-index-llms-llama-cpp \
	&& pip install llama-index

# Install llama-cpp-python[server] with cuBLAS on
RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
        pip install llama-cpp-python[server] --force-reinstall --no-cache-dir

# Create a working directory
WORKDIR /workdir

# Port number in container side
EXPOSE 8888

ENTRYPOINT ["jupyter-lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root", "--NotebookApp.token=''"]

CMD ["--notebook-dir=/workdir"]

An excerpt of the pip list for the container created in the above Dockerfile.

llama_cpp_python                        0.2.75
llama-index                             0.10.38
llama-index-agent-openai                0.2.5
llama-index-cli                         0.1.12
llama-index-core                        0.10.38.post2
llama-index-embeddings-huggingface      0.2.0
llama-index-embeddings-openai           0.1.10
llama-index-indices-managed-llama-cloud 0.1.6
llama-index-legacy                      0.9.48
llama-index-llms-llama-cpp              0.1.3
llama-index-llms-openai                 0.1.20
llama-index-multi-modal-llms-openai     0.1.6
llama-index-program-openai              0.1.6
llama-index-question-gen-openai         0.1.3
llama-index-readers-file                0.1.22
llama-index-readers-llama-parse         0.1.4

torch                                   2.2.2+cu121
torchaudio                              2.2.2+cu121
torchvision                             0.17.2+cu121

Python source changes

As mentioned in the Dockerfile section, the library structure of llama_index changed as of v0.10.0, so the previous code results in the following error.

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[1], line 5
      2 import os
      3 import sys
----> 5 from llama_index import (
      6     LLMPredictor,
      7     PromptTemplate,
      8     ServiceContext,
      9     SimpleDirectoryReader,
     10     VectorStoreIndex,
     11 )
     12 from llama_index.callbacks import CallbackManager, LlamaDebugHandler
     13 from llama_index.embeddings import HuggingFaceEmbedding

ImportError: cannot import name 'LLMPredictor' from 'llama_index' (unknown location)

So, I modified it as follows.

import logging
import os
import sys

"""
from llama_index import (
    LLMPredictor,
    PromptTemplate,
    ServiceContext,
    SimpleDirectoryReader,
    VectorStoreIndex,
)

from llama_index.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import LlamaCPP
"""
from llama_index.legacy import LLMPredictor
from llama_index.core.prompts import PromptTemplate
from llama_index.core import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.llama_cpp import LlamaCPP

# Set the log level
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, force=True)

The f-string specifying the LLM path is as follows in my environment.

model_path = f"../20240421_llamacpp/ELYZA-japanese-Llama-2-7b-fast-instruct-q4_K_M.gguf"

To use the GPU, n_gpu_layers should be set as follows:

    model_kwargs={"n_ctx": 4096, "n_gpu_layers": -1},

In addition, EMBEDDING_DEVICE should be set to:

EMBEDDING_DEVICE = "cuda"

Furthermore, the article's explanation states that chunk_size was set to 512, so I changed it as follows:

    chunk_size=512,

Other than that, I used the code from source 1 as-is. A rough sketch of how these modified pieces fit together is shown below.
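
For reference, here is a minimal sketch of how the modified parts come together, following the general flow of source 1 rather than reproducing it verbatim. The embedding model name and the generation parameters (temperature, max_new_tokens) are placeholders I chose for illustration, not necessarily the values used in source 1.

from llama_index.core import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.llama_cpp import LlamaCPP

model_path = "../20240421_llamacpp/ELYZA-japanese-Llama-2-7b-fast-instruct-q4_K_M.gguf"
EMBEDDING_DEVICE = "cuda"

# LLM: the GGUF-quantized Elyza model loaded via llama.cpp, fully offloaded to the GPU
llm = LlamaCPP(
    model_path=model_path,
    temperature=0.1,      # placeholder value
    max_new_tokens=256,   # placeholder value
    model_kwargs={"n_ctx": 4096, "n_gpu_layers": -1},
    verbose=True,
)

# Embedding model: the model name below is a placeholder; use the one from source 1
embed_model = HuggingFaceEmbedding(
    model_name="intfloat/multilingual-e5-large",
    device=EMBEDDING_DEVICE,
)

# Bundle the LLM and embedding model, with the chunk size set to 512
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    chunk_size=512,
)

# Load the local text data, build the index, and query it
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
print(query_engine.query("Ask a question about the indexed documents here."))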

How it works

When I ran the code, it took about 5 minutes to finish loading the LLM (the Elyza model quantized to GGUF) and the embedding model for RAG, and to build the index.

When I threw a question, the answer came back in a couple of seconds.

GPU utilization was around 60-70% while generating answers. The code ran on both an RTX A4000 (16GB) and a TITAN V (12GB), using about 9 GB of GPU memory.

Creating embedding data from a web page

In source 1, text data was used as the embedding data. Referring to source 4, I further tried using data from Wikipedia and other web pages.

I used the Wikipedia page on the r-process as the web page. The following cell

# ドキュメントの読み込み
documents = SimpleDirectoryReader("data").load_data()

was changed to the following:

target_url = "https://ja.wikipedia.org/wiki/R%E9%81%8E%E7%A8%8B"
# NOTE: the html_to_text=True option requires html2text to be installed
documents = SimpleWebPageReader(html_to_text=True).load_data(
    [target_url]
)
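
One thing to note: with llama-index v0.10 and later, SimpleWebPageReader is, as far as I can tell, no longer part of the core package but lives in the separate llama-index-readers-web package, and the html_to_text=True option additionally requires html2text. So an install and import along these lines are needed:

# pip install llama-index-readers-web html2text
from llama_index.readers.web import SimpleWebPageReader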

The execution results are as follows (the Japanese output has been translated into English).

## Question: what is the mechanism of the r-process in neutron star mergers?
(omitted).
## Answer:
 The r-process can occur as a result of neutron star mergers.

Neutron stars are very dense, and in them nuclei with unstable neutron capture are produced. This produced nucleus with unstable neutron capture repeatedly beta-decays and incorporates neutrons into nuclei such as nickel56, resulting in a process called the r-process.

## Question: explain the co-evolution of galaxies and black holes
(omitted)
## Answer:
 Information not in the context information will not be included in the answer.

Also, the answer to this question is "I don't know" because there is no information that the release of energy from neutron star mergers ultimately results in a black hole.

I was able to get a somewhat technical answer to the question about the r-process, but for questions about content and concepts not contained in the embedded data, such as the question about the co-evolution of galaxies and black holes, I did not get an answer.

Future

I would like to tackle the following challenges regarding loading data for the embedding index:

  • multiple websites and an entire website
  • PDFs
  • combinations of text files, PDFs, and websites

I would like to deepen my understanding of how LlamaIndex is structured and improve the accuracy of RAG-based answers.