Virtual Environments with PyTorch

Virtual environments isolate each project's dependencies, so packages installed for one project cannot conflict with those of another. This guide explains how to create a virtual environment using Python’s built-in venv module, then uses it to install PyTorch and run a multi-GPU training example on Frontera.

Prerequisites

  • Python 3.3 or newer

  • A terminal or command prompt to execute commands.

Note

Some parts of this tutorial require Git, which is pre-installed on TACC systems.

Steps to Create and Activate a Virtual Environment

Step 1. Connect to our systems. If you are unfamiliar with how to do this, please review the section called Connecting to TACC.

Once connected, we can create the virtual environment.

Step 2. Move to Your $WORK Directory

It is best practice to host the environment in the $WORK directory: $SCRATCH is regularly purged, and $HOME does not have enough storage space for ML tasks.

cd $WORK

Step 3. Create the Virtual Environment

Run this command to create a virtual environment. You can replace ‘myenv’ with whatever you want to name your virtual environment.

python3 -m venv myenv

Step 4. Verify the Creation

After running the command, a new directory (e.g., myenv) will be created in your current location. This directory contains the files needed for the virtual environment.

(base) UserName@System myenv % ls
bin             include         lib             pyvenv.cfg

Understanding the Structure

The virtual environment directory contains:
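  • `bin` (`Scripts` on Windows): Contains the executables, including the environment's Python interpreter and the activate scripts.

  • `lib`: Contains the environment's own site-packages directory, where the packages you install are placed. The standard library itself stays with the base Python installation.

  • `pyvenv.cfg`: Configuration file for the virtual environment; among other things, it records which base Python installation the environment was created from.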
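To check this layout from Python, the sketch below uses the standard-library venv module to build a throwaway environment in a temporary directory and report its top-level entries and the key/value pairs stored in pyvenv.cfg. The helper names read_pyvenv_cfg and describe_env are hypothetical; to inspect your real environment, call describe_env("myenv") instead. Exact entries vary by platform and Python version (on 64-bit Linux, for example, you may also see a lib64 link).

```python
import tempfile
import venv
from pathlib import Path


def read_pyvenv_cfg(env_dir):
    """Parse the 'key = value' lines of <env_dir>/pyvenv.cfg into a dict."""
    config = {}
    for line in (Path(env_dir) / "pyvenv.cfg").read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            config[key.strip()] = value.strip()
    return config


def describe_env(env_dir):
    """Report the top-level entries, scripts folder, and config of a venv."""
    env = Path(env_dir)
    scripts = "Scripts" if (env / "Scripts").is_dir() else "bin"  # Windows vs POSIX
    return {
        "entries": sorted(p.name for p in env.iterdir()),
        "scripts_dir": scripts,
        "config": read_pyvenv_cfg(env),
    }


if __name__ == "__main__":
    # Build a throwaway environment (without pip, to keep it fast) and inspect it.
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "myenv"
        venv.create(str(env_dir), with_pip=False)
        info = describe_env(env_dir)
        print("entries:", info["entries"])
        print("scripts folder:", info["scripts_dir"])
        print("pyvenv.cfg:", info["config"])
```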

Step 5. Activate the Environment

source myenv/bin/activate

Upon activation, the name of your environment should appear in parentheses at the start of your shell prompt:

(myenv) login3.frontera(470)$
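Activation mainly puts the environment's bin directory first on your PATH, so python3 and pip3 resolve to the environment's copies. As an extra check, the minimal sketch below (saved under a hypothetical name such as check_env.py) asks the running interpreter whether it belongs to a virtual environment, using the standard sys.prefix and sys.base_prefix attributes. It reports on whichever interpreter runs it, so running myenv/bin/python directly also counts as being inside the environment.

```python
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the Python installation it was built from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

After activation, python3 check_env.py should print True and an interpreter path under myenv/bin.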

In the next section, we will test this virtual environment by installing PyTorch into it and running an example script.

Testing our Virtual Environment with multigpu_torchrun.py

To demonstrate how to use our virtual environment, we will download the multigpu_torchrun.py script from a GitHub repository, install PyTorch, and then run the script's example training job, all within our virtual environment.

multigpu_torchrun.py is a script from the official PyTorch examples repository that uses DistributedDataParallel (DDP) to spread training across multiple GPUs: each GPU holds a copy of the model and processes a different portion of the data, which shortens training time.
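DDP does not split the model itself; each process keeps a full replica and works on its own slice of the dataset, usually chosen by torch.utils.data.DistributedSampler. The pure-Python sketch below imitates how that sampler assigns indices with shuffling turned off (its default is shuffle=True, which permutes the indices first) and drop_last=False: the index list is padded by repeating early indices until it divides evenly, then rank r takes every world_size-th index starting at r. shard_indices is a hypothetical helper for illustration, not part of PyTorch.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Indices one rank would see, mimicking DistributedSampler(shuffle=False, drop_last=False)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    indices = list(range(dataset_len))
    num_samples = math.ceil(dataset_len / world_size)  # samples per rank
    total_size = num_samples * world_size
    padding = total_size - dataset_len
    if padding and indices:
        # Repeat indices from the start so every rank gets the same number of samples.
        indices += (indices * math.ceil(padding / len(indices)))[:padding]
    # Rank r takes every world_size-th index, starting at r.
    return indices[rank:total_size:world_size]


if __name__ == "__main__":
    # Ten samples split across four GPUs (one process per GPU).
    for rank in range(4):
        print(rank, shard_indices(10, 4, rank))
    # 0 [0, 4, 8]
    # 1 [1, 5, 9]
    # 2 [2, 6, 0]
    # 3 [3, 7, 1]
```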

The multigpu_torchrun.py script is in the examples/distributed/ddp-tutorial-series folder of the GitHub repository below:

https://github.com/pytorch/examples

Step 6. Download the repository containing code to run

You can download (clone) a GitHub repository from the command line with git clone:

git clone https://github.com/pytorch/examples.git

Step 7. Request a Node through idev

To run our example script, we need a single GPU node. Each node in Frontera's RTX queues has 4 GPUs, which is enough to run multigpu_torchrun.py’s example training job.

Begin your idev session by running the following in your virtual environment:

idev -N 1 -n 1 -p rtx-dev -t 02:00:00

This requests one compute node (-N 1) running one task (-n 1) in the rtx-dev partition/queue (-p) for two hours (-t 02:00:00). The rtx-dev queue serves Frontera's NVIDIA RTX 5000 GPU compute nodes, which support CUDA and therefore GPU-accelerated PyTorch. For the queues and hardware specifications of each TACC HPC system, see the TACC website.

After you request a node, idev waits while the job is queued. Once your idev session starts, your prompt changes to show the compute node, for example:

(myenv) c196-011[rtx](452)$

This prompt confirms that your idev session has begun. Make sure the (myenv) tag still appears at the start of the prompt; if it does not, activate your virtual environment again with source myenv/bin/activate from the directory where you created it.
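On TACC systems, idev runs your session as a Slurm job, so you can also confirm from Python that you are on a compute node rather than a login node. This minimal sketch assumes the idev shell inherits Slurm's standard job variables SLURM_JOB_ID, SLURM_JOB_PARTITION, and SLURM_JOB_NUM_NODES (these names come from Slurm, not from this guide); slurm_job_info is a hypothetical helper.

```python
import os

# Standard Slurm job variables (assumed to be inherited by the idev shell).
SLURM_VARS = ("SLURM_JOB_ID", "SLURM_JOB_PARTITION", "SLURM_JOB_NUM_NODES")


def slurm_job_info(environ=None):
    """Return the job's Slurm variables, or None when not running inside a Slurm job."""
    env = os.environ if environ is None else environ
    if "SLURM_JOB_ID" not in env:
        return None
    return {name: env.get(name) for name in SLURM_VARS}


if __name__ == "__main__":
    info = slurm_job_info()
    if info is None:
        print("Not inside a Slurm job: you are probably still on a login node.")
    else:
        print(info)
```

Inside the session requested above, you would expect the partition to be rtx-dev and the node count to be 1.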

Step 8. Install PyTorch into our Virtual Environment

To run multigpu_torchrun.py, we need to install PyTorch. Run the following pip command inside your virtual environment:

pip3 install torch torchvision torchaudio
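Before moving on, it is worth confirming that the PyTorch you just installed can see the node's GPUs. The sketch below (hypothetical helper name torch_report) uses only documented torch.cuda calls and still runs if torch is missing. Run it on the compute node inside your idev session; on an RTX node you would expect cuda_available to be True and gpu_count to be 4.

```python
import importlib.util


def torch_report():
    """Summarize the PyTorch build visible to the current interpreter."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch  # imported lazily so the check also works before installation

    cuda = torch.cuda.is_available()
    count = torch.cuda.device_count() if cuda else 0
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": cuda,
        "gpu_count": count,
        "gpu_names": [torch.cuda.get_device_name(i) for i in range(count)],
    }


if __name__ == "__main__":
    for key, value in torch_report().items():
        print(f"{key}: {value}")
```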

Step 9. Change into the DDP Tutorial Series Folder

The git clone command in Step 6 created a new directory called examples in the directory where you ran it.

cd into the following directory:

cd examples/distributed/ddp-tutorial-series

This directory contains multigpu_torchrun.py along with the other scripts in the DDP tutorial series.

Step 10. Run multigpu_torchrun.py

Within the virtual environment, use the torchrun command to launch the training script on the single node we requested:

torchrun --standalone --nproc_per_node=gpu multigpu_torchrun.py 5 10

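This launches one process per GPU on the node (--nproc_per_node=gpu). The --standalone flag tells torchrun that this is a single-node job, so it configures how the processes find each other automatically. The processes then use torch.distributed and DistributedDataParallel (DDP) to share the training workload. The two positional arguments are passed to the script: 5 is the total number of epochs to train, and 10 is how often, in epochs, to save a training snapshot.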
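To make the command line concrete, here is a hypothetical sketch of how a torchrun-launched script can read its two positional arguments and the per-process variables torchrun exports (LOCAL_RANK, RANK, and WORLD_SIZE). It is modeled on, but not copied from, multigpu_torchrun.py; the helper names are made up, and snapshot_epochs assumes a snapshot is saved whenever epoch % save_every == 0.

```python
import argparse
import os


def parse_cli(argv=None):
    """Parse the two positional arguments given after the script name, e.g. '5 10'."""
    parser = argparse.ArgumentParser(description="Sketch of a DDP training command line")
    parser.add_argument("total_epochs", type=int, help="Total epochs to train the model")
    parser.add_argument("save_every", type=int, help="Save a snapshot every N epochs")
    return parser.parse_args(argv)


def torchrun_context(environ=None):
    """Read the per-process variables that torchrun exports to each worker."""
    env = os.environ if environ is None else environ
    return {
        # Index of this process on its node; with one process per GPU, also its GPU index.
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        # Index of this process across the whole job.
        "rank": int(env.get("RANK", 0)),
        # Total number of processes in the job.
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def snapshot_epochs(total_epochs, save_every):
    """Epochs that would save a snapshot, assuming an 'epoch % save_every == 0' check."""
    return [epoch for epoch in range(total_epochs) if epoch % save_every == 0]


if __name__ == "__main__":
    # Use the same arguments as Step 10; a real script would call parse_cli() instead.
    args = parse_cli(["5", "10"])
    ctx = torchrun_context()  # falls back to a single process outside torchrun
    print(f"rank {ctx['rank']} of {ctx['world_size']} (local rank {ctx['local_rank']})")
    print("snapshot epochs:", snapshot_epochs(args.total_epochs, args.save_every))
```

Under that assumption, snapshot_epochs(5, 10) returns [0]: because training stops after 5 epochs, a 10-epoch interval only triggers at the first epoch.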

When run successfully, you should get a result like this:

[Image: example output of multigpu_torchrun.py]

Note

The task may take a few minutes to run.

Congratulations! You have now run a multi-GPU training job in a Python virtual environment.

Deactivating a Virtual Environment

When you’re done working in your virtual environment, you can deactivate it to return to the global Python environment:

  1. Simply run the following command in your terminal (works on all operating systems):

    deactivate
    
  2. You’ll notice the environment name disappears from your command line, confirming the environment has been deactivated.

Troubleshooting

  • If source myenv/bin/activate fails with a "No such file or directory" error, make sure you are in the directory where you created the virtual environment (for example, $WORK), or give the full path to the activate script.
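If you are unsure where the environment lives, the small sketch below checks a directory for the activate script in both the POSIX (bin/) and Windows (Scripts/) locations and prints the matching source command. find_activate_script is a hypothetical helper; pass the environment directory as the first argument (it defaults to myenv in the current directory).

```python
import sys
from pathlib import Path


def find_activate_script(env_dir):
    """Return the path of the shell activate script in env_dir, or None if absent."""
    env = Path(env_dir).expanduser()
    # POSIX environments keep it in bin/; Windows environments use Scripts/.
    for candidate in (env / "bin" / "activate", env / "Scripts" / "activate"):
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "myenv"
    script = find_activate_script(target)
    if script is None:
        print(f"No activate script under {Path(target).resolve()}; check the directory.")
    else:
        print(f"source {script.resolve()}")
```

For example, saving it under a hypothetical name such as find_activate.py and running python3 find_activate.py $WORK/myenv prints the exact source command to use.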

Congratulations! You now know how to activate, deactivate, and run code in a virtual environment to keep your Python projects organized and conflict-free.