Trying NVIDIA Modulus - Introduction to PINNs

Introduction

Over a month ago, I attended the seminar “How to Speed Up Simulation with AI Surrogate Models?” and became interested in NVIDIA Modulus, so I bought a book on it and started studying. As a prerequisite for that study, I installed Modulus in my environment, and I have summarized the installation process here as “Introduction to PINNs”.

Sources

NVIDIA Modulus for Beginners - The reference book I am using to learn PINNs with Modulus. I ordered it at the end of July, but it did not arrive until the end of August. Hereafter, I call this book “Modulus for Beginners”.
NVIDIA Modulus - The Modulus page on NVIDIA DEVELOPER.
Modulus Download Page - The Modulus page in the NGC catalog, linked from the “Download Now” section of the above page.

Run

Docker Installation

Follow “Modulus for Beginners” to pull a Docker container that already has Modulus built in and use the container.

Docker Installation - It is a bit old (and there is no English version), but the flow may be helpful. For the latest information, please check this page.
Building Rootless Docker - Setup to use docker in user mode (rootless).

I decided to install 23.05, which is one version before the latest. I found it on the page that opens when you click “Tags” on the Modulus Download Page above (it lists the containers that can be downloaded).

Download the container (docker pull)

Page 50 of “Modulus for Beginners”

$ docker pull nvcr.io/nvidia/modulus/modulus:23.05

Download a sample to see how it works after installation

Download the sample (modulus-sym-1.1.0.zip) from GitHub, following the instructions on pages 51-52 of “Modulus for Beginners”.

Unzip the zip file and copy its contents under the $HOME/NVIDIA-Modulus-2305 directory.
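The unzip-and-copy step can also be scripted. The following is a minimal sketch using only the Python standard library. The helper name `extract_sample` is hypothetical, and the download location `~/Downloads` is an assumption, so adjust it to wherever the zip file was saved.

```python
import zipfile
from pathlib import Path


def extract_sample(zip_path, dest_dir):
    """Extract the archive at zip_path into dest_dir and return dest_dir as a Path."""
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)  # create the target directory if needed
    with zipfile.ZipFile(Path(zip_path).expanduser()) as archive:
        archive.extractall(dest)
    return dest


if __name__ == "__main__":
    # Assumed download location; the file name comes from the text above.
    zip_file = Path("~/Downloads/modulus-sym-1.1.0.zip").expanduser()
    if zip_file.exists():
        print(extract_sample(zip_file, "~/NVIDIA-Modulus-2305"))
    else:
        print(f"{zip_file} not found; adjust the path")
```

If the archive contains a top-level modulus-sym-1.1.0 folder, the result matches the directory used in the docker run step.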

Start the container (docker run)

$ cd ~/NVIDIA-Modulus-2305/modulus-sym-1.1.0
$ docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}/examples:/examples -it nvcr.io/nvidia/modulus/modulus:23.05 bash

When I started the container with the above command line, I got the following error.

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting rlimits for ready process: error setting rlimit type 8: operation not permitted: unknown.
ERRO[0000] error waiting for container: error

Error handling during container init

This document, although written for a different Modulus version, states that the container works with CUDA 11.7 and NVIDIA Driver 515 or later. It goes on to say that on a datacenter GPU such as the T4, NVIDIA Driver 450.51 or later, 470.57 or later, or 510.45 or later can also be used.

According to nvidia-smi, my driver version is 470.199.02 and my CUDA version is 11.4, so based on the above description I think my environment is OK.
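The version comparison above can be written down as a small check. This is only a sketch of my reading of the requirement: the function names are hypothetical, and it assumes that each datacenter minimum (450.51, 470.57, 510.45) applies within its own driver branch, which the quoted document does not spell out.

```python
def parse_version(text):
    """Turn a dotted version string such as '470.199.02' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Minimum driver versions quoted in the text above.
GENERAL_MINIMUM = (515,)
DATACENTER_MINIMUMS = [(450, 51), (470, 57), (510, 45)]


def driver_is_supported(driver, datacenter_gpu=False):
    """Return True if the driver version meets the quoted requirement."""
    version = parse_version(driver)
    if version >= GENERAL_MINIMUM:
        return True
    if datacenter_gpu:
        # Assumption: a minimum only covers drivers from the same branch (same major number).
        return any(
            version[0] == minimum[0] and version >= minimum
            for minimum in DATACENTER_MINIMUMS
        )
    return False


if __name__ == "__main__":
    print(driver_is_supported("470.199.02", datacenter_gpu=True))
    print(driver_is_supported("470.199.02", datacenter_gpu=False))
```

Under these assumptions, driver 470.199.02 passes only when the GPU counts as a datacenter GPU, so the GPU model matters as much as the driver version.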

I tried Modulus 22.09, 22.07, and 22.03.1 with the same results.

I thought about it a lot and tried various docker run parameters, in case some setting did not fit my environment.

In the end, I don’t know why, but I was able to start the Modulus container once I removed --ulimit memlock=-1.

$ docker run --gpus all --shm-size=1g --ulimit stack=67108864 -v ${PWD}/examples:/examples -it nvcr.io/nvidia/modulus/modulus:23.05 bash
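A note on the error message: on Linux, rlimit type 8 is RLIMIT_MEMLOCK, the limit that --ulimit memlock=-1 tries to set to unlimited. One possible explanation, which is a guess and not something confirmed here, is that a rootless Docker setup lacks the privilege to raise this limit above the hard limit of the user running the daemon. The sketch below prints the current limits on the host using Python's standard `resource` module; the helper names are hypothetical.

```python
import resource


def format_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK values of the current process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("RLIMIT_MEMLOCK soft:", format_limit(soft))
    print("RLIMIT_MEMLOCK hard:", format_limit(hard))
```

If the hard limit printed on the host is a finite number, an unprivileged process cannot raise it to unlimited, which would be consistent with the "operation not permitted" error.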

Running the sample program

Run the sample program from the container command line.

# cd /examples
# cd helmholtz
# python helmholtz.py
・・（omitted）・・
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.

I ran into the same error before when using rinna 3.6b from Docker.

Specify the GPU that CUDA should use as follows:

# export CUDA_VISIBLE_DEVICES="0"
# python helmholtz.py
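The same setting can be made from Python instead of the shell, as long as it happens before CUDA is initialized. The following is a minimal sketch; `select_gpu` is a hypothetical helper, and the PyTorch calls are only attempted if PyTorch is installed.

```python
import os


def select_gpu(index=0):
    """Expose only the GPU with the given index to CUDA; call before CUDA is initialized."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


if __name__ == "__main__":
    select_gpu(0)
    try:
        import torch  # imported after the environment variable is set
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print("CUDA available:", torch.cuda.is_available())
        print("Visible GPUs:", torch.cuda.device_count())
```

Running this inside the container is a quick way to confirm that PyTorch can see the GPU before starting a long training run.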

Execution Results

# python helmholtz.py
/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
[03:34:53] - JIT using the NVFuser TorchScript backend
[03:34:53] - JitManager: {'_enabled': True, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[03:34:53] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
helmholtz.py:106: UserWarning: Directory validation/helmholtz.csv does not exist. Will skip adding validators. Please download the additional files from NGC https://catalog.ngc.nvidia.com/orgs/nvidia/teams/modulus/resources/modulus_sym_examples_supplemental_materials
  warnings.warn(
[03:34:55] - attempting to restore from: outputs/helmholtz
[03:34:55] - optimizer checkpoint not found
[03:34:55] - model wave_network.0.pth not found
[03:34:57] - [step:          0] record constraint batch time:  1.409e+00s
[03:34:57] - [step:          0] saved checkpoint to outputs/helmholtz
[03:34:57] - [step:          0] loss:  1.026e+04
[03:35:06] - Attempting cuda graph building, this may take a bit...
[03:35:10] - [step:        100] loss:  1.007e+04, time/iteration:  1.248e+02 ms
[03:35:14] - [step:        200] loss:  9.981e+01, time/iteration:  4.279e+01 ms
・・（omitted in the middle）・・
[03:49:45] - [step:      19800] loss:  5.507e-03, time/iteration:  4.338e+01 ms
[03:49:49] - [step:      19900] loss:  5.341e-03, time/iteration:  4.340e+01 ms
[03:49:55] - [step:      20000] record constraint batch time:  4.922e-02s
[03:49:55] - [step:      20000] saved checkpoint to outputs/helmholtz
[03:49:55] - [step:      20000] loss:  5.092e-03, time/iteration:  5.431e+01 ms
[03:49:55] - [step:      20000] reached maximum training steps, finished training!

An outputs directory was created in the helmholtz directory, containing the model (.pth) and 3D graphics data (.vtp).
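As a rough sanity check, the loss values can be pulled out of the training log to see how far they fell. The sketch below assumes the log format shown above; the names `LOSS_PATTERN` and `parse_losses` are hypothetical. A falling loss only shows that training converged, not that the solution is physically correct.

```python
import re

# Matches lines such as "[step:        100] loss:  1.007e+04, time/iteration: ..."
LOSS_PATTERN = re.compile(r"\[step:\s*(\d+)\]\s+loss:\s*([0-9.eE+-]+)")


def parse_losses(log_text):
    """Return a list of (step, loss) pairs found in the training log text."""
    return [(int(step), float(loss)) for step, loss in LOSS_PATTERN.findall(log_text)]


if __name__ == "__main__":
    # A few lines copied from the log above.
    sample = """\
[03:34:57] - [step:          0] record constraint batch time:  1.409e+00s
[03:34:57] - [step:          0] loss:  1.026e+04
[03:35:10] - [step:        100] loss:  1.007e+04, time/iteration:  1.248e+02 ms
[03:49:55] - [step:      20000] loss:  5.092e-03, time/iteration:  5.431e+01 ms
"""
    losses = parse_losses(sample)
    first_step, first_loss = losses[0]
    last_step, last_loss = losses[-1]
    print(f"step {first_step}: {first_loss:.3e} -> step {last_step}: {last_loss:.3e}")
```

Lines without a loss value, such as the "record constraint batch time" entries, are skipped by the pattern.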

Summary

Although I was able to run Modulus in a Docker container, I have not yet been able to verify whether the execution results above are correct.

I am trying to read chapter 2 onward of “Modulus for Beginners”, but I don’t understand the physics of the examples very well, so I am not making much progress.