Running inversions

DUBFI can be started from the command line:

MPI_WORKERS=2 python3 -m dubfi -c /path/to/config -t /path/to/target/directory

This command uses the default linear algebra implementation ("MPI"): it creates two MPI worker processes (configured via MPI_WORKERS=2) and runs the inversion. To see more options, use the command line argument --help.

When working on an HPC system, the MPI jobs should usually be invoked explicitly using mpirun. DUBFI works with one main process and multiple worker processes. Each worker process runs dubfi.linalg.mpi_worker_main and awaits instructions from the main process. To explicitly start an MPI job with 8 worker processes, use something like:

mpirun \
  -n 1 python3 -m dubfi -c /path/to/config.yml -t /path/to/target/directory : \
  -n 8 python3 -m dubfi.linalg.mpi_worker_main

When using a job scheduler, the following example can serve as a starting point for a job script:

#!/bin/sh

# This example shows what a job script for DUBFI can look like.
# Use the documentation of your HPC system (specifically for MPI) to
# adjust this to your needs.

# This example assumes that you use 8 worker processes plus one main
# process, each working on 2 CPU cores. Thus, you need 18 CPU cores.

# Memory requirements strongly depend on your setup. For typical
# inversion tasks, you can start testing with 16 GB memory
# (in this case: 2 GB per process or 1 GB per CPU core).


# 1. Scheduler settings.
# Add specific settings required by your scheduling system.
# In this example, you should ask for an MPI job with 18 CPU cores.

# 2. Load software modules.
# On many HPC systems you need to load software modules
# such as MPI, python3, netCDF4, ...

# 3. Configure environment variables
# For parallelization:
export OMP_NUM_THREADS=2  # threads for OMP parallelization
export NUMBA_NUM_THREADS=1  # threads used for numba parallelization
# Log output:
# export LOG_LEVEL=DEBUG  # log level: DEBUG, INFO, WARNING or ERROR
export NOCOLOR=1  # disable color log: recommended for output in log files

# 4. Define configuration file and output path.
CONFIG="/path/to/config.yml"
TARGET="/path/to/output_directory"

# 5. Run MPI job.
# This probably requires some adjustments depending on the HPC system and your setup.
# Run the main program on one process and create 8 worker processes that await
# instructions from the main process.

# for Open MPI:
mpirun \
  --map-by NUMA:PE=2 \
  -n 1 python3 -m dubfi -c "$CONFIG" -t "$TARGET" : \
  -n 8 python3 -m dubfi.linalg.mpi_worker_main

# for Intel MPI:
# export I_MPI_PIN_DOMAIN=2
# mpirun \
#   -n 1 python3 -m dubfi -c "$CONFIG" -t "$TARGET" : \
#   -n 8 python3 -m dubfi.linalg.mpi_worker_main
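As a concrete illustration of step 1: on a Slurm-based system, the scheduler settings for the 18-core layout described above might look like the following. The directives and values are hypothetical examples; consult the documentation of your HPC system for the correct settings.

```shell
#!/bin/sh
# Hypothetical Slurm directives for the layout described above:
# 9 MPI tasks (1 main + 8 workers), 2 CPU cores per task,
# 16 GB of memory in total.
#SBATCH --ntasks=9
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
```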

Performance considerations

Performance is mainly determined by the maximum number of observations per MPI worker process and by the parallelization within each worker process. Each worker process runs linear algebra operations whose parallelization depends on the BLAS implementation. When using OpenBLAS, the environment variable OPENBLAS_NUM_THREADS can be used to adjust the number of parallel threads.

Additionally, DUBFI uses numba's parallelization to compute the Hessian of the cost function. This speeds up what is typically the slowest part of the calculation. However, in some configurations (e.g., when using a solver that does not use the exact Hessian of the cost function), this type of parallelization is not used at all. You can usually adjust the numba parallelization using the environment variable NUMBA_NUM_THREADS.
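For example, both thread counts can be set in the shell before starting DUBFI. The values below are illustrative; the right numbers depend on how many CPU cores are available to each process.

```shell
# Example thread settings (illustrative values; match them to the
# cores available to each process):
export OPENBLAS_NUM_THREADS=2   # BLAS threads per process (OpenBLAS only)
export NUMBA_NUM_THREADS=4      # numba threads per process

# Verify the settings:
echo "$OPENBLAS_NUM_THREADS $NUMBA_NUM_THREADS"
```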