GPU Usage

The Container Platform provides access to a worker node with 4 Nvidia Tesla V100 GPUs.

This is a non-exhaustive list of the supported libraries:

  • CUDA 9.0, 9.1, 9.2
  • CUDNN 7.1.4
  • OpenBlas
  • Tensorflow 1.9
  • TensorBoard
  • Keras
  • Theano
  • Caffe
  • Lasagne
  • Jupyter
  • Torch7
  • PyTorch
  • virtualenv
  • docker
  • numpy 1.15
  • scipy 1.1
  • scikit-learn
  • matplotlib
  • pandas
  • Cython 0.28
  • nolearn

Policy

Since the number of GPUs that the Container Platform can currently provide is limited (4 GPUs), and in order to allow more users to access them, the requested resources can be allocated for a maximum of 8 days.

Specifically, it is possible to request and allocate 1 GPU for a maximum of 8 days, 2 GPUs for a maximum of 4 days, and 3-4 GPUs for a maximum of 2 days.

When the allocation period expires (8, 4, or 2 days, depending on the number of GPUs requested), the user pod(s) that are using the GPU(s) will be deleted and the GPU resources will be reallocated.

Therefore, we recommend that you back up your most important data before the expiration date, so that it won't be lost when the pod(s) are deleted.
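
For example, assuming your results live under /workspace/results inside a pod named gpu-pod (both names are purely illustrative), you can copy them to your local machine with kubectl cp before the allocation expires:

$ kubectl cp gpu-pod:/workspace/results ./results-backup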

Getting a GPU

In order to obtain access to one or more GPUs, please send us a request via the web portal (Common requests -> Reserve GPU).

Each user request will be queued and satisfied in chronological order, as long as the requested GPUs are free and can be allocated.

Users will then receive a confirmation email stating that they can access the GPU(s), along with information about the time period during which the GPU(s) will be exclusively reserved for them.

Once the confirmation email has been received, it is sufficient to request the nvidia.com/gpu resource in the Pod deployment and add the tolerations section. The key in the tolerations section can be either ‘vgpu’ or ‘gpu’. For example, to deploy the digits container, put the following into a file named digits.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "vgpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
    - name: digits-container
      image: nvcr.io/nvidia/digits:19.12-tensorflow-py3
      resources:
        limits:
          nvidia.com/gpu: 1 # number of GPUs requested

Now you can deploy it with:

$ kubectl create -f digits.yaml

GPU state

To get the current status of the GPUs, issue:

$ kubectl exec gpu-pod -- nvidia-smi
Mon Jul 30 06:13:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37                 Driver Version: 396.37                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   24C    P0    35W / 250W |    427MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
...
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Controlling GPU usage

Since GPUs are limited and expensive, we invite you to use them sparingly. In particular, each user should only use one GPU at a time.

If you are using TensorFlow, make sure it does not allocate all of the GPU memory by setting this option when creating a session:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

If you use Keras, you must also pass it the TensorFlow session, using the function:

keras.backend.tensorflow_backend.set_session(session)
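
Putting the two snippets together, a minimal sketch (assuming TensorFlow 1.x and the standalone Keras package) looks like this:

import tensorflow as tf
import keras.backend.tensorflow_backend as ktf

# Let TensorFlow grow GPU memory on demand instead of reserving it all
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

# Register the session with Keras so it uses the same memory settings
ktf.set_session(sess)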