Torchrun Errno: 98 - Address Already In Use


Encountering "torchrun errno: 98 - address already in use"? Here's what you need to know!

The error "torchrun errno: 98 - address already in use" is a common issue faced by users of PyTorch when attempting to run distributed training. This error arises when a process tries to bind to a specific port that is already in use by another process.

This scenario often occurs in multi-GPU setups where you might inadvertently launch multiple training processes competing for the same port.

Let's delve into the causes of this error and explore effective solutions to get your PyTorch distributed training back on track.

What Causes "torchrun errno: 98 - address already in use"?

Here's a breakdown of the primary reasons behind this error:

  • Port Conflicts: Your training process is trying to bind to a port that is already occupied by another process, such as a previously running training instance, another application, or even a background service.
  • Multiple torchrun Processes: Running multiple torchrun commands simultaneously without proper port management can lead to this error.
  • Previous Processes: Even if you've closed a previous training session, some processes might still hold onto the port, preventing new processes from using it.

How to Fix "torchrun errno: 98 - address already in use"

Here's a comprehensive guide to resolving this error:

  1. Check for Running Processes:

    • Using netstat:

      netstat -a -p -n | grep ':port_number'
      

      Replace port_number with the port number reported in the error message. This command will list processes that are using the port.

    • Using lsof:

      lsof -i :port_number
      

      This command provides detailed information about processes listening on the specified port.

  2. Kill Conflicting Processes:

    Once you've identified the process using the port, you can terminate it using the kill command:

    kill -9 <process_id>
    

    Replace <process_id> with the process ID obtained from netstat or lsof.
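
    If you want a one-liner, lsof can print just the PID so you can pass it straight to kill. This is a quick sketch assuming a Linux or macOS shell and that the conflicting port is 29500 (torchrun's default):

    # Print only the PID of whatever is listening on port 29500, and kill it.
    # If nothing is listening, lsof prints nothing and kill reports a usage error.
    kill -9 $(lsof -t -i :29500)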

  3. Modify the Port Number:

    The easiest way to avoid conflicts is to explicitly specify a different port when launching your training:

    torchrun --nproc_per_node=<number_of_gpus> --nnodes=<number_of_nodes> --rdzv_backend=<backend> --rdzv_endpoint=<host_name>:<port_number> your_training_script.py
    

    Replace <number_of_gpus>, <number_of_nodes>, <backend>, <host_name>, <port_number>, and your_training_script.py with your respective settings. Ensure the chosen port is not in use by other processes.
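
    Before launching, it can help to confirm that the new port is actually free. A minimal check, assuming lsof is installed and using 29501 purely as an example port:

    # Lists any process listening on port 29501; if nothing prints, the port is free.
    lsof -i :29501 || echo "port 29501 appears to be free"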

  4. Clean up Previous Processes:

    • Check for Background Processes: Ensure that no previous training instances are running in the background or as hidden processes. Use a task manager or process explorer to identify and terminate such processes.
    • Restart the Machine: If all else fails, restarting your machine often resolves the problem, as it releases all ports and resets the system.
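
    Before resorting to a reboot, you can also terminate lingering torchrun launchers by name. This is a blunt sketch that assumes no other torchrun jobs you care about are running on the machine:

    # Force-kill any process whose command line mentions "torchrun".
    # Orphaned worker processes (python running your script) may need to be killed separately.
    pkill -9 -f torchrun
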
  5. Adjust torchrun Settings:

    • --master_port: You can explicitly define the port used for communication by the master process:

      torchrun --master_port=<port_number> your_training_script.py
      
    • --rdzv_endpoint: For distributed training, you can specify the host and port for rendezvous communication:

      torchrun --rdzv_endpoint=<host_name>:<port_number> your_training_script.py
      
    • --rdzv_backend: If you're using a specific rendezvous backend (e.g., c10d or etcd), the rendezvous port comes from --rdzv_endpoint (or from the backend's own configuration), so make sure that endpoint points at a free port.
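
      For example, a single-node launch that uses the c10d rendezvous backend on a non-default port might look like this (a sketch; adjust the GPU count and script name to your setup):

      # Rendezvous over c10d on an explicitly chosen port (29502).
      torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29502 --nproc_per_node=2 your_training_script.py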

  6. Check System Resources:

    • GPU Availability: Confirm that your GPUs are available for use and not already occupied by another process.
    • Resource Limits: If you're running on a cluster or shared system, check resource limits that might be preventing your process from accessing the desired port.
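
    On NVIDIA systems, for instance, nvidia-smi shows which processes currently hold each GPU:

    # Display per-GPU memory usage and the PIDs of processes using each GPU.
    nvidia-smi
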
  7. Firewall Configuration:

    • Firewall Rules: Verify your firewall rules are not blocking communication on the necessary ports. If necessary, temporarily disable your firewall to check if it is causing the issue.
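
    As an example, on a system using ufw (adjust for your distribution's firewall), you could inspect the rules and explicitly allow the rendezvous port:

    # Show the current firewall rules (ufw example; other firewalls differ).
    sudo ufw status verbose
    # Allow TCP traffic on the port used for rendezvous, e.g. 29500.
    sudo ufw allow 29500/tcp
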
  8. Environment Variables:

    • TORCH_DISTRIBUTED_DEBUG: Setting this environment variable to INFO can provide detailed logging that might reveal clues about the port conflict.
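
      For example, you can enable the extra logging for a single launch (using --nproc_per_node=2 and your_training_script.py as placeholders):

      # Turn on verbose torch.distributed logging for this run only.
      TORCH_DISTRIBUTED_DEBUG=INFO torchrun --nproc_per_node=2 your_training_script.py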

Examples:

Here are a few examples to illustrate the fixes:

  • Killing a Process:

    netstat -a -p -n | grep ':29500'
    # Output: tcp        0      0 0.0.0.0:29500           0.0.0.0:*               LISTEN      1234/python3
    kill -9 1234 
    
  • Specifying a New Port:

    torchrun --nproc_per_node=2 --nnodes=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:29501 your_training_script.py
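
  • Picking a Free Port Automatically:

    A small sketch that assumes python3 is on your PATH: bind to port 0 so the operating system assigns an unused port, then hand that port to torchrun.

    # Ask the OS for a free port (there is a small race window before torchrun binds it).
    FREE_PORT=$(python3 -c "import socket; s = socket.socket(); s.bind(('', 0)); print(s.getsockname()[1]); s.close()")
    torchrun --nproc_per_node=2 --master_port=$FREE_PORT your_training_script.py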
    

Conclusion:

The "torchrun errno: 98 - address already in use" error is often related to port conflicts. By following the steps outlined above, you can effectively diagnose and resolve the issue, enabling you to smoothly run your PyTorch distributed training. Remember to check for running processes, kill conflicts, modify ports, and inspect environment variables. By diligently troubleshooting, you can overcome this error and unleash the power of distributed training for your deep learning endeavors.