Running a Japanese LLM with Quantization

Motivation

In a previous article, I ended on a negative note about quantization, but after a little research I came to see it as an interesting area and experimented with it, which I summarize here.

Reference

I used bitsandbytes for quantization this time, but there seem to be other methods as well. Related information is listed below.

Introduction

The code from “[ELYZA-japanese-Llama-2-13b] The University of Tokyo startup developed an LLM exceeding GPT-3.5! From usage to practice” was executed on an RTX A4000 (GPU memory: 16 GB). In that case I did get an answer, but with the message “WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.” Because of that, I gave up on running it on a TITAN X (GPU memory: 12 GB) and decided to investigate model parallelization with DeepSpeed.

After a little research, I learned that there is a method/library to compress models by quantization while maintaining accuracy, so I decided to give it a try right away.
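To get a feel for why this matters, here is a rough back-of-the-envelope estimate of the weight memory for a 13B-parameter model in bfloat16 versus 4-bit. This is an illustrative sketch only; it ignores activations, the KV cache, and quantization overhead.

# Rough weight-memory estimate for a 13B-parameter model (weights only).
params = 13e9

bf16_gb = params * 2 / 1e9    # bfloat16: 2 bytes per parameter -> about 26 GB
int4_gb = params * 0.5 / 1e9  # 4-bit:  0.5 bytes per parameter -> about 6.5 GB

print(f"bf16 weights : {bf16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")

Roughly 26 GB of bf16 weights clearly cannot fit in 12 GB of GPU memory, while roughly 6.5 GB of 4-bit weights can, which is consistent with the result described below.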

This time, I tried using the bitsandbytes library.

Note that the JupyterLab Docker container I used is the same as the one in this article.

Modified code

Original code

The original code, from “[ELYZA-japanese-Llama-2-13b] The University of Tokyo startup developed an LLM exceeding GPT-3.5! From usage to practice” in the Reference above, is as follows.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = "あなたは誠実で優秀な日本人のアシスタントです。"
text = "仕事の熱意を取り戻すためのアイデアを5つ挙げてください。"

model_name = "elyza/ELYZA-japanese-Llama-2-13b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    use_cache=True,
    device_map="auto",
    offload_folder="/content/ELYZA-japanese-Llama-2-13b-instruct",
    low_cpu_mem_usage=True,
)
model.eval()

I changed the above section as follows. To be honest, I do not yet fully understand what each argument means, but after some trial and error this is what worked for the time being.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = "あなたは誠実で優秀な日本人のアシスタントです。"
text = "仕事の熱意を取り戻すためのアイデアを5つ挙げてください。"

model_name = "elyza/ELYZA-japanese-Llama-2-13b-instruct"

# 4-bit quantization settings for bitsandbytes
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the weights in 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants to save a little more memory
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",                      # let accelerate place the layers automatically
    quantization_config=quantization_config,
).eval()

Execution Results

The fact that I am writing this article means that, with the above changes, I was able to run the model on the TITAN X with 12 GB of GPU memory.

The effect of quantization is impressive. Unfortunately, I still do not understand the details of the code; that is an issue for the future.
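For reference, the snippet below is one way to check how much memory the loaded model actually occupies. It reuses the model and torch from the code above; get_memory_footprint() is provided by transformers' PreTrainedModel, and the exact numbers depend on the environment (I have not recorded them here).

# Memory footprint reported by transformers for the quantized model.
print(f"model footprint      : {model.get_memory_footprint() / 1e9:.2f} GB")

# GPU memory currently allocated by PyTorch on the default CUDA device.
print(f"cuda memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")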

To be able to ask several questions in a row, I changed the code as follows, separating the model-loading part from the question part.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "elyza/ELYZA-japanese-Llama-2-13b-instruct"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
).eval()

Compared with the original code, max_new_tokens was increased to 1024, as follows.

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = "あなたは誠実で優秀な日本人のアシスタントです。"
text = "仕事の熱意を取り戻すためのアイデアを5つ挙げてください。"

while text != "":
    text = input("質問を入力してください:")
    prompt = "{bos_token}{b_inst} {system}{prompt} {e_inst} ".format(
        bos_token=tokenizer.bos_token,
        b_inst=B_INST,
        system=f"{B_SYS}{DEFAULT_SYSTEM_PROMPT}{E_SYS}",
        prompt=text,
        e_inst=E_INST,
    )
    token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            token_ids.to(model.device),
            max_new_tokens=1024,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    output = tokenizer.decode(output_ids.tolist()[0][token_ids.size(1) :], skip_special_tokens=True)
    print(output)
    print("\n\n\n")

Summary

As for quantization, as shown in the Hugging Face documentation listed in the Reference above, there are other methods as well, which I would like to try in the future.
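For example, bitsandbytes also supports 8-bit loading. The following is a minimal sketch that I have not run, assuming the same model as above; note that 8-bit weights for a 13B model come to roughly 13 GB, so this would probably not fit on the 12 GB TITAN X.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "elyza/ELYZA-japanese-Llama-2-13b-instruct"

# 8-bit quantization instead of 4-bit.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
).eval()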