Welcome to Perlmutter Docs!

This page documents instructions to get started with Perlmutter and issues encountered while using it. The official documentation for Perlmutter can be found here.

Check live status: https://www.nersc.gov/live-status/motd/

Contents

SSH into Perlmutter

To SSH into Perlmutter:

ssh <username>@perlmutter.nersc.gov

Creating and Activating a Virtual Environment

python -m venv /path/to/new/virtual/environment
source /path/to/virtual/environment/bin/activate

Using NERSC PyTorch Modules

To load the PyTorch module, use the following command:

module load pytorch/2.0.1

Note: The default location of any additional packages installed using the python version accompanying this module is controlled by the environment variable #PYTHONUSERBASE.

export PYTHONUSERBASE="/pscratch/sd/s/<your scratch directory controlled by environment variable $SCRATCH"

Huggingface Cache and Credentials

Set Huggingface Cache:

export HF_DATASETS_CACHE="<path to directory where cache should be stored>"

Load Credentials:

huggingface-cli login
huggingface-cli whoami

Starting an Interactive Session on Perlmutter

Before starting an interactive session, it’s essential to ensure you’re using the right account for allocation. To find the account, you can use the env | grep ACCOUNT command, which will provide output similar to:

SALLOC_ACCOUNT=m2956_g
SBATCH_ACCOUNT=m2956_g
SLURM_JOB_ACCOUNT=m2956_g

To initiate an interactive session, use the following salloc command:

salloc --nodes 1 --qos interactive --time 02:00:00 --constraint gpu --gpus 4 --account=m2956_g

This command requests an interactive session with:

  • 1 node

  • Quality of Service set to “interactive”

  • A time limit of 2 hours

  • On a GPU node

  • Allocating 4 GPUs

  • Using the account “m2959_g”

Example Job Script

Here is an example of a job script:

#!/bin/bash
#SBATCH -A m2956
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 15:00:00
#SBATCH -N 1
#SBATCH -c 32

export HF_HOME=/pscratch/sd/s/sharma21/hf/
module load pytorch/2.0.1
export PYTHONUSERBASE="/pscratch/sd/s/sharma21" #to prevent diskerror due to installation of packages
cd $SCRATCH                                  #to avoid file lock issue
export OPENAI_API_KEY='YOUR KEY HERE'
export WANDB_API_KEY='YOUR KEY HERE'
source ~/.bashrc
source hpcenv/bin/activate

wandb login
huggingface-cli whoami

Note: Jobs may explicitly request to run on up to 256 GPU nodes which have 80 GB of GPU-attached memory instead of 40 GB. To request this, use -C gpu&hbm80g in your job script.

Troubleshooting

DiskError: Allocated 40GB space in homes directory used up

Loading pytorch using module load pytorch/2.0.1 sets the default location of that in the homes directory. Any additional packages installed using pip install take up space in the homes directory. The location can be found at $PYTHONUSERBASE. It is recommended to set the location to a file system with more space. I used $SCRATCH for now but $SCRATCH is temporary storage so it is recommended to explore other options as well.

Another option is to create a virtual environment using venv. Remember to load pytorch/2.0.1 first, then use python 3.9 that comes with it to create a virtual environment. Activate the virtual environment and pip install additional packages there. Make sure to do it in the correct order to avoid conflicts.

File lock issue while loading huggingface datasets/models (Eg. SentenceTransformer)

An issue arises when trying to load the SentenceTransformer model ‘paraphrase-MiniLM-L6-v2’. I have file lock issues on Perlmutter when my python code tries to download huggingface models/datasets. The symptom is hanging execution. To debug the issue, you have to run your job in an interative session, and use ctrl+c to stop the hangs. You will then see the execution runs some infinite looping to get file locks. .. code-block:: bash

Error Traceback: .. code-block:: python Add this in error File “/global/u2/s/sharma21/LM4HPC/Evaluation/open_ended_eval.py”, line 118, in <module>

accuracy, results = semantic_similarity_eval(open_ended_dataset, model_name, num_rows)

File “/global/u2/s/sharma21/LM4HPC/Evaluation/open_ended_eval.py”, line 36, in semantic_similarity_eval

embedder = SentenceTransformer(‘paraphrase-MiniLM-L6-v2’)

File “/global/homes/s/sharma21/.local/perlmutter/pytorch2.0.1/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py”, line 87, in __init__

snapshot_download(model_name_or_path,

File “/global/homes/s/sharma21/.local/perlmutter/pytorch2.0.1/lib/python3.9/site-packages/sentence_transformers/util.py”, line 491, in snapshot_download

path = cached_download(**cached_download_args)

File “/global/homes/s/sharma21/.local/perlmutter/pytorch2.0.1/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py”, line 118, in _inner_fn

return fn(*args, **kwargs)

File “/global/homes/s/sharma21/.local/perlmutter/pytorch2.0.1/lib/python3.9/site-packages/huggingface_hub/file_download.py”, line 770, in cached_download

with FileLock(lock_path):

File “/global/common/software/nersc/pm-2022q4/sw/pytorch/2.0.1/lib/python3.9/site-packages/filelock/_api.py”, line 260, in __enter__

self.acquire()

File “/global/common/software/nersc/pm-2022q4/sw/pytorch/2.0.1/lib/python3.9/site-packages/filelock/_api.py”, line 230, in acquire

time.sleep(poll_interval)

KeyboardInterrupt

Solution

https://docs.nersc.gov/performance/io/dvs/#do-not-use-file-locking DVS doesn’t support file locking. It’s turned off by default for most codes at NERSC (including HDF5). If you do need to use any kind of file locking, please use Perlmutter Scratch. Keep your entire code and environment in $SCRATCH directory and run code from there. However, keep in mind that the file system is purged, which may result in portions of the software stack being removed unexpectedly. You can back up your code at HPSS https://docs.nersc.gov/filesystems/archive/

Accessing wrong/old OpenAI API key from .bashrc

Despite updating the OPENAI_API_KEY environment variable in the .bashrc file, an older API key was being accessed when running jobs.

Solution Checked if any duplicate keys are present.

I set the environment variable in the script in both these ways and refresh the .bashrc everytime while running the jobs. Not exactly sure where the issue arises.

export OPENAI_API_KEY='YOUR KEY HERE'
echo "OPENAI_API_KEY='YOUR KEY HERE'" >> ~/.bashrc
source ~/.bashrc

Interactive mode times out while loading starchat-alpha model

Unable to test code on starchat-alpha in interactive mode as it takes too long to load. The model should be stored in huggingface cache. Looking into solutions <to be updated>

Resource allocation for interactive mode timed out

<username>@perlmutter:login34:/pscratch/sd/<folder>/<username>/LM4HPC/Evaluation> salloc --nodes 1 --qos
interactive --time 02:00:00 --constraint gpu --gpus 4 --account=m2956_g
salloc: Pending job allocation 17293015
salloc: job 17293015 queued and waiting for resources
salloc: error: Unable to allocate resources: Connection timed out