Running rinna 3.6b in a Docker container

Motivation

I wanted to try out a large language model (LLM) for Japanese, so I used rinna, which was released in May. To save installation time, I ran rinna in a Docker container environment.

I ran into some problems in doing so, which are summarized below.

Sources

1. rinna press release - See here for details about rinna; the model has 3.6 billion parameters. The rinna model tried this time is referred to as rinna 3.6b below.
2. rinna introduction article - An article from INTERNET Watch. The sample code is adapted from this article.
3. "I created a story outline of the movie “My Neighbor Totoro” using the GPT Japanese language model (the first one)" - The material about the PyTorch container image is based on this article.

Execution result of rinna 3.6b

Python code to run

Save the sample code from source 2 as ${HOME}/workspace/rinna.py.

The contents of the sample code are as follows; of the six commented-out model-loading options, the fifth one (GPU, .to("cuda")) is uncommented and enabled.

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft", use_fast=False)
# Standard
#model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft")
# Automatic
#model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft", device_map='auto')
# Automatic, fp16 (works even with 16 GB VRAM or less, but 8 GB is not enough)
#model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft", torch_dtype=torch.float16, device_map='auto')
# CPU
#model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft").to("cpu")
# GPU
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft").to("cuda")
# GPU, fp16 (works even with 16 GB VRAM or less, but 8 GB is not enough)
#model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft", torch_dtype=torch.float16).to("cuda")


# Loop (exit with Ctrl+C)
prompt = ""
while True:
    # Enter a question
    question = input("質問をどうぞ: ")

    if question.lower() == 'clear':
        question = ""
        prompt = ""

    prompt = prompt+f"ユーザー: {question}<NL>システム: "

    # Start timing
    start = time.time()

    token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            token_ids.to(model.device),
            do_sample=True,
            max_new_tokens=128,
            temperature=0.7,
            pad_token_id=tokenizer.pad_token_id,
            bos_token_id=tokenizer.bos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    output = tokenizer.decode(output_ids.tolist()[0][token_ids.size(1):])
    output = output.replace("<NL>", "\n")

    # Print elapsed time
    end = time.time()
    print(end-start)

#    print(prompt)
    print(output)
    prompt = prompt+output+"<NL>"
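
The script switches loading modes by commenting lines in and out. As an aside (this is not part of the original script, and the RINNA_DEVICE variable name is my own invention), the same switch could be driven by an environment variable, roughly like this:

import os
import torch
from transformers import AutoModelForCausalLM

MODEL_NAME = "rinna/japanese-gpt-neox-3.6b-instruction-sft"
# Hypothetical switch, not in the original rinna.py: "cuda", "cuda-fp16", "cpu" or "auto"
mode = os.environ.get("RINNA_DEVICE", "cuda")

if mode == "auto":
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
elif mode == "cuda-fp16":
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16).to("cuda")
elif mode == "cpu":
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to("cpu")
else:
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to("cuda")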

Download from NGC CATALOG and run the container

Referring to source 3, pull (download) the PyTorch container from NGC CATALOG.

Looking at the Tags on the Catalog > Containers > PyTorch page, 23.07-py3 was the latest, so I downloaded and ran it right away. For more information about running docker in user mode, please refer to this page.

$ mkdir ${HOME}/workspace
$ docker pull nvcr.io/nvidia/pytorch:23.07-py3
$ docker run -d -it --gpus all --name gpt -v ${HOME}/workspace:/workspace/local nvcr.io/nvidia/pytorch:23.07-py3
$ docker exec -it gpt /bin/bash

Running on RTX A4000

Operating in a docker environment

From here, we will operate in the docker container launched above.

Lines starting with a # prompt indicate operations performed inside the container. Note that ${HOME}/workspace on the host is mounted at /workspace/local in the container, so rinna.py lives under /workspace/local (move there with cd local before running it).

# python -m pip install transformers sentencepiece accelerate
# python rinna.py

DeferredCudaCallError

The following error occurred while executing the above Python code.

raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=

I looked into it and found comments online saying this error can occur with MIG setups. My workstation has two GPUs, an RTX A4000 and a Quadro K600, so I figured I needed to specify which GPU to use, and did the following:

# python -m pip install transformers sentencepiece accelerate
# export CUDA_VISIBLE_DEVICES="0"
# python rinna.py
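
For reference, before setting CUDA_VISIBLE_DEVICES it helps to check which index belongs to which board. A minimal sketch for listing the mapping (my own snippet, not from the sources; note that PyTorch's index order can differ from nvidia-smi's unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set):

import torch

# List the GPUs visible to PyTorch; the printed index is what
# CUDA_VISIBLE_DEVICES filters on.
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))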

Execution Result

The transcript below shows the result; the text I typed follows the “質問をどうぞ: ” (“Ask a question: ”) prompt, and the number printed before each answer is the generation time in seconds.

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
質問をどうぞ: 日本の終戦記念日はいつですか
1.9004747867584229
1945年8月15日です。</s>
質問をどうぞ: 江戸幕府を開いたのは誰ですか
1.9013473987579346
徳川家康は1603年に亡くなり、1604年に将軍になりました。</s>
質問をどうぞ: 大化の改新について教えてください
5.68342661857605
大化の改新は、天智天皇が日本の政治改革を行い、中央集権政府を確立した革命的な政治改革でした。この改革\
は中央政府を強化し、律令制度を確立しました。</s>
質問をどうぞ: ありがとうございました
0.6675028800964355
どういたしまして</s>
質問をどうぞ:

The answer about the Edo Shogunate is a bit odd (it says Tokugawa Ieyasu died in 1603 and became shogun in 1604), but I’ll ignore the details.

Running on TITAN V

I ran the same thing on another workstation equipped with a TITAN V (plus a Quadro K2000).

As above, I pulled the PyTorch container from NGC CATALOG and ran the rinna.py stored in ${HOME}/workspace. The procedure is as follows.

$ docker run -d -it --gpus all --name gpt -v ${HOME}/workspace:/workspace/local \
nvcr.io/nvidia/pytorch:23.07-py3
$ docker exec -it gpt /bin/bash

### The following are operations in a container ###

# export CUDA_VISIBLE_DEVICES="0"
# python -m pip install transformers sentencepiece accelerate
# cd local
# python rinna.py

NVIDIA driver is too old

When I ran the above, I got the following error:

RuntimeError: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
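
As a hedged aside (my own reading, not from the sources), the "found version 11040" in this message looks like the CUDA driver API version, which by convention is encoded as 1000*major + 10*minor, i.e. CUDA 11.4:

# Decoding the reported driver version, assuming the usual CUDA convention
version = 11040
print(version // 1000, (version % 1000) // 10)  # -> 11 4, i.e. CUDA 11.4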

The same container that worked on the A4000 produced the “NVIDIA driver on your system is too old” error shown above on the TITAN V, and I don’t know why.

Assuming the cause was not the NVIDIA driver itself but rather that this PyTorch build does not support the TITAN V, I decided to try an older version of the PyTorch container.

For reference, the torch-related package versions in the 23.07-py3 PyTorch container were as follows.

# pip list | grep torch
pytorch-quantization      2.1.2
torch                     2.1.0a0+b5021ba
torch-tensorrt            1.5.0.dev0
torchdata                 0.7.0a0
torchtext                 0.16.0a0
torchvision               0.16.0a0
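
As for whether a given PyTorch build supports the TITAN V (a Volta GPU, compute capability 7.0, i.e. sm_70), one rough check is to ask the torch build itself from inside the container. This is my own check, not from the sources, and it only inspects the build, so it does not by itself explain the driver error above:

import torch

print(torch.__version__)           # PyTorch build inside the container
print(torch.version.cuda)          # CUDA version the build targets
print(torch.cuda.get_arch_list())  # sm_xx architectures compiled in (may be empty if CUDA cannot initialize)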

Trying a one-year-old PyTorch container

I decided to try an older PyTorch container and, for no particular reason, used the one-year-old 22.07-py3, as follows.

$ docker run -d -it --gpus all --name gpt -v /home/kenji/workspace:/workspace/local \
nvcr.io/nvidia/pytorch:22.07-py3
$ docker exec -it gpt /bin/bash

From here, the operations are performed in the container. The version of torch is 1.13.0, as shown below.

# pip list | grep torch
pytorch-quantization          2.1.2
torch                         1.13.0a0+08820cb
torch-tensorrt                1.2.0a0
torchtext                     0.13.0a0
torchvision                   0.14.0a0
# export CUDA_VISIBLE_DEVICES="0"
# python -m pip install transformers sentencepiece accelerate
# python rinna.py

CUDA out of memory

When the above was executed, the GPU ran out of VRAM capacity as shown below.

RuntimeError: CUDA out of memory. Tried to allocate 122.00 MiB (GPU 0; 11.78 GiB total capacity; 11.03 GiB already allocated; 4.00 MiB free; 11.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Since the TITAN V’s 12 GB of VRAM is not enough, I specified torch_dtype to shrink the model’s memory footprint (a rough estimate of the savings follows the code below).

Specifically, the fifth model-loading line in rinna.py (GPU, default precision) was commented out and the sixth (GPU with torch_dtype=torch.float16) was enabled.

# Standard
#model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft")
# Automatic
#model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft", device_map='auto')
# Automatic, fp16 (works even with 16 GB VRAM or less, but 8 GB is not enough)
#model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft", torch_dtype=torch.float16, device_map='auto')
# CPU
#model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft").to("cpu")
# GPU
#model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft").to("cuda")
# GPU, fp16 (works even with 16 GB VRAM or less, but 8 GB is not enough)
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft", torch_dtype=torch.float16).to("cuda")
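
As a rough sanity check on why float16 helps (my own back-of-the-envelope estimate, ignoring activations and allocator overhead): 3.6 billion parameters at 4 bytes each come to roughly 13.4 GiB, which cannot fit in the TITAN V’s ~11.8 GiB, while at 2 bytes each they come to roughly 6.7 GiB, which can.

# Back-of-the-envelope weight size for a 3.6B-parameter model
params = 3.6e9
print(f"fp32: {params * 4 / 2**30:.1f} GiB")  # ~13.4 GiB
print(f"fp16: {params * 2 / 2**30:.1f} GiB")  # ~6.7 GiB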

Execution Result

# python rinna.py
You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
質問をどうぞ: 日本の終戦記念日はいつですか
1.9047455787658691
1945年8月15日です。</s>
質問をどうぞ: 江戸幕府を開いたのは誰ですか
0.19092082977294922
徳川家康</s>
質問をどうぞ: 大化の改新について教えてください
4.939814329147339
大化の改新とは、中大兄皇子、後の天智天皇が政治改革を行った一連の改革です。これらの改革は、公正な競争\
のルールや、身分制度にとらわれない人々の平等な権利の確立など、その後の日本の政治システムの基盤を提供\
しました。</s>
質問をどうぞ: 南北朝時代になった原因を教えてください
1.415543794631958
歴史の授業で、南北朝時代がどのように起こったかを説明しているのですか?</s>
質問をどうぞ: 南北朝時代がどうして起こったのか教えてください
9.151806592941284
それは複雑な過程を経ています。一般的には、中国が北朝と南朝に分裂したと言えますが、それだけではありま\
せん。北朝は、13世紀後半に始まった第1次モンゴル軍の侵攻や、14世紀後半の明朝の拡大に対処しなければな\
らなかったため、大きく揺れ動いています。一方、南朝は、14世紀半ばの明朝による中国再統一の結果、安定し\
た状態にありました。しかしながら、清盛が平氏政権を樹立した後、北朝は弱体化し、最終的には明朝による統\
一によって終了することになりました。</s>
質問をどうぞ:

Summary

Since I was running in a Docker container, I expected things to go smoothly without any installation-related problems, but as described above I ran into a bit of trouble. Thanks to that trouble, though, I picked up a little Docker operational know-how and related knowledge.

The points where I got stuck are as follows. The causes are my own guesses and may differ from what actually happened.

  • DeferredCudaCallError - This can occur when multiple GPUs are installed; it was solved by specifying which GPU to use with CUDA_VISIBLE_DEVICES.
  • The NVIDIA driver on your system is too old - The PyTorch container in use is too new and (presumably) does not support the GPU’s architecture; use an older PyTorch container. If you know which PyTorch versions support which sm levels (NVIDIA architecture names), please let me know.
  • CUDA out of memory - The usual GPU VRAM shortage; reduce the model’s memory footprint. In this case, I specified torch_dtype=torch.float16.