Inversion configuration

This section introduces the parameters in the configuration required for a flux inversion. An example file is provided in examples/config.yml.

Processing the configuration

Each inversion run has an input configuration file and a target directory ($TARGET in the following). The inversion reads the configuration file and the input data, for which the configuration file defines a source directory (entry mec_dir). The input configuration is extended to include information about the utilized data (entries ssh and dimensions). The extended configuration is then written to $TARGET/config.yml. When using MPI, the worker processes will read this extended configuration.

The inversion output will be written to $TARGET/inversion_result.nc. This file is created shortly after initialization and is then filled with data as the inversion run progresses.

General settings

species defines the type of gas considered. In the currently used data format, the species is part of some netCDF variable names (see Data Interfaces).
validation_sites defines a selection of observation sites that shall be excluded from the inversion, allowing them to be used for validation. This allows reading data with a far-field correction constructed without these stations.
input.bc_name selects the boundary conditions.
input.use_bc_correction selects whether a correction of the lateral boundary contribution is used. Note that this correction is not computed by DUBFI but must be provided in the input files. The implementation of the boundary correction used at DWD is not yet published.
input.vertical_coordinate defines the attribute of input netCDF files that is used as vertical coordinate of the observation site.
log sets the default log level and can be used to define paths to log files.

Input data filtering

The input is filtered by coordinate_filter and data_filter. coordinate_filter defines which stations, sampling heights, and times shall be used. data_filter defines rules for excluding data points based on wind speed or model-data mismatch. Because data_filter depends on the data, this filtering step leads to different results for different meteorologies (ensemble members), boundary conditions, or input data in general. For consistent filtering, make sure that data_filter is always applied to the same data.
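
One way to keep the filtering consistent is to evaluate the data-based rules once on a reference data set and reuse the resulting mask for all other data sets. The following sketch illustrates this for the min_wind rule only. The function names, the list-based data layout, and all numbers are assumptions made for illustration and are not part of DUBFI.

```python
def wind_mask(wind_speed, min_wind):
    """Return True for data points that pass the minimal wind speed rule."""
    return [w >= min_wind for w in wind_speed]


def apply_mask(values, mask):
    """Keep only the entries of values for which the mask is True."""
    return [v for v, keep in zip(values, mask) if keep]


# Hypothetical model wind speeds (m/s) of the reference run.
reference_wind = [0.4, 2.1, 3.0, 0.9, 1.5]
mask = wind_mask(reference_wind, min_wind=1.0)

# Two hypothetical ensemble members (mol/mol), filtered with the SAME mask
# so that both keep exactly the same data points.
member_a = [1.90e-6, 1.95e-6, 2.00e-6, 1.92e-6, 1.97e-6]
member_b = [1.91e-6, 1.94e-6, 2.02e-6, 1.93e-6, 1.96e-6]
filtered_a = apply_mask(member_a, mask)
filtered_b = apply_mask(member_b, mask)
```

Computing the mask separately from each member's own wind field would instead select different data points per member, which is the inconsistency described above.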

Uncertainty matrix R

Entries in uncertainty define the construction of the error covariance matrix R of the model data mismatch. This includes the correlation (cutoff) scales hscale_m, vscale_m, and tscale (all in uncertainty).
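
To illustrate how these scales enter, the following sketch computes a Gaussian correlation between two data points from their horizontal, vertical, and temporal separation. The functional form exp(-d^2 / (2 L^2)), the function name, and the use of hours as time unit are assumptions made for this example. The localization functions actually used by DUBFI (Gauss, Gaspari-Cohn-scaled, Gaspari-Cohn) may be defined differently.

```python
import math


def gauss_correlation(dh_m, dv_m, dt_h,
                      hscale_m=5.0e5, vscale_m=400.0, tscale_h=6.0):
    """Assumed Gaussian correlation of two data points.

    dh_m, dv_m: horizontal and vertical separation in meters.
    dt_h: temporal separation in hours.
    The default scales are taken from the example configuration.
    """
    exponent = ((dh_m / hscale_m) ** 2
                + (dv_m / vscale_m) ** 2
                + (dt_h / tscale_h) ** 2)
    return math.exp(-0.5 * exponent)


same_point = gauss_correlation(0.0, 0.0, 0.0)      # identical coordinates
one_scale = gauss_correlation(5.0e5, 0.0, 0.0)     # one horizontal scale apart
five_scales = gauss_correlation(0.0, 0.0, 30.0)    # five temporal scales apart
```

Under this assumed form, the correlation at a separation of five scale lengths is below 1e-5, so data points that far apart are effectively uncorrelated.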

Based on these scales, the cutoff scales in segment_buffer must be adjusted. These cutoff scales are used to select buffer data when distributing the data points to MPI worker processes. If the cutoff scales are too small, the inversion will likely converge to wrong results! To be safe, choose segment_buffer.t_cutoff = 5 * uncertainty.tscale / segment_buffer.buffer_prefactor and segment_buffer.h_cutoff_m = 5 * uncertainty.hscale_m. The cutoff scales in segment_buffer also strongly affect the computational effort of the inversion.
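
The rule of thumb for the cutoff scales can be evaluated with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and the duration parser assumes that the unit "m" means minutes and that value and unit are separated by a space, as in the example configuration.

```python
# Seconds per duration unit; "m" is assumed to mean minutes.
SECONDS_PER_UNIT = {"s": 1.0, "m": 60.0, "h": 3600.0, "D": 86400.0}


def parse_duration_s(text):
    """Convert a string such as "6 h" to seconds."""
    value, unit = text.split()
    return float(value) * SECONDS_PER_UNIT[unit]


def recommended_cutoffs(tscale, hscale_m, buffer_prefactor, factor=5.0):
    """Return (t_cutoff in seconds, h_cutoff in meters) for the safe choice."""
    t_cutoff_s = factor * parse_duration_s(tscale) / buffer_prefactor
    h_cutoff_m = factor * hscale_m
    return t_cutoff_s, h_cutoff_m


# Scales taken from the example configuration.
t_cutoff_s, h_cutoff_m = recommended_cutoffs("6 h", 5.0e5, 1.25)
t_cutoff_h = t_cutoff_s / 3600.0
print(f"t_cutoff: {t_cutoff_h:g} h, h_cutoff_m: {h_cutoff_m:g}")
```

For tscale = 6 h, hscale_m = 5.0e+5, and buffer_prefactor = 1.25 this yields t_cutoff = 24 h and h_cutoff_m = 2.5e+6. The example configuration uses the larger value t_cutoff = 30 h, which is more conservative at a higher computational cost.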

Another parameter affecting the matrix R is data_filter.outlier_threshold, which defines an uncertainty inflation for outlier data points.

Inversion

The inversion depends on the prior error covariance matrix defined in prior and cycle, and on parameters in inversion.
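
The comments in the example configuration contain a simplified prior with three flux categories and the resulting B matrix. The following sketch reproduces that matrix from the prior entries, assuming that variances are squared standard deviations and that covariances are correlation coefficient times both standard deviations. The function name and the nested-list representation are hypothetical and not the DUBFI implementation.

```python
def build_prior_b(categories, default_uncertainty, uncertainty, correlations):
    """Build a prior covariance matrix as a nested list.

    categories: ordered list of flux category names.
    uncertainty: per-category standard deviations overriding the default.
    correlations: nested dict {first: {second: correlation coefficient}}.
    """
    sigma = [uncertainty.get(name, default_uncertainty) for name in categories]
    index = {name: i for i, name in enumerate(categories)}
    n = len(categories)
    b = [[0.0] * n for _ in range(n)]
    for i in range(n):
        b[i][i] = sigma[i] ** 2  # variance on the diagonal
    for first, partners in correlations.items():
        for second, coef in partners.items():
            i, j = index[first], index[second]
            # symmetric off-diagonal covariance
            b[i][j] = b[j][i] = coef * sigma[i] * sigma[j]
    return b


# Simplified example from the comments in the example configuration.
b_matrix = build_prior_b(
    categories=["A", "B", "C"],
    default_uncertainty=0.4,
    uncertainty={"A": 0.8},
    correlations={"B": {"C": 0.5}},
)
for row in b_matrix:
    print("  ".join(f"{value:.2f}" for value in row))
```

Up to floating-point rounding, the diagonal is 0.64, 0.16, 0.16 and the only non-zero off-diagonal entry is the B-C covariance of 0.08.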

inversion.norm_prefactor plays a special role since its entries lead to repeated inversion runs. See Bias problem for details. The cost function in the Bayesian inversion contains a prefactor to a normalization term as a tuning parameter. inversion.norm_prefactor defines a list of these prefactors and the inversion will be repeated for each list entry. Note that an inversion for prefactor zero is significantly faster than for non-zero values.

The parameters inversion.solver_tol and inversion.solver_options define the targeted solver precision. By determining the number of required iterations of the solver, these parameters strongly influence the runtime.

Example

# Configuration of DUBFI.
#
# This configuration file is specific for one inversion setup.
# Adjust it to your needs. This configuration is the basis for the
# inversion and will be included in the output (without comments).

# Name of species, used to determine variable names in netCDF files
species: CH4

# MEC directory: path to directory containing combined model and observation data.
# This directory must contain files named "<SSH>_det_letkf.nc" where <SSH> is the
# 3-letter station code and the sampling height in meters, e.g. HPB_131.0.
# These files contain the far-field corrected data.
mec_dir: "/path/to/corrected/data/source"

# original MEC directory: contains input data without far-field correction.
# This directory is optional and defaults to be equal to mec_dir. It must contain
# files named "<SSH>_ens.nc" with model data from the meteorological ensemble.
orig_mec_dir: "/path/to/original/data/source"

# List of observation sites (3-letter codes) that shall be used for validation.
# When selecting far-field corrected model data, these stations shall be excluded.
# Currently, this is only implemented for at most one validation site.
validation_sites: []


log:
  # Level of logging output, can usually be overridden using the command line
  # argument --log or the environment variable LOG_LEVEL.
  # Valid values are DEBUG, INFO, WARNING, ERROR, CRITICAL
  level: INFO

  ## log files: Leave empty to use target_dir/parent.log and
  ## target_dir/child-{worker:02d}.log where target_dir is adjusted.
  # parent_file: "/path/to/parent.log"
  # child_file: "/path/to/child-{worker:02d}.log"


# General input selection.
input:

  # Select this entry along the bc_prior coordinate of the input data.
  bc_name: CamsInvOpt_v24r1_sfc

  # Overwrite bc_name when using ensemble data (optional). Use this if
  # the ensemble does not contain the boundary conditions used for the
  # deterministic run.
  bc_name_ens: CamsInvOpt_v24r1_sfc

  # Use boundary (or far-field) correction.
  # Note: When switching on or off the boundary correction, you should usually
  #       also adjust mec_dir.
  use_bc_correction: true

  # Restrict the meteorological ensemble to the following members (0-based indices).
  #use_meteo_ensemble_members: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

  # attribute name in input files that defines the height relevant for vertical localization
  vertical_coordinate: mec_height_agl


# Parameters for defining the uncertainty matrix (R).
uncertainty:

  # Additive i.i.d. Gaussian noise on all data points.
  # Note: This number is increased when multiple sampling heights are used.
  uncorrelated: 5.0e-9  # mol/mol

  # Override uncorrelated per site (optional)
  uncorrelated_per_site:
    IPR: 1.0e-8
    JUE_50.0: 1.0e-8

  # uncertainty inflation only for meteorological (transport) uncertainty
  meteo_inflation: 1.0

  # uncertainty inflation for total model-observation uncertainty
  global_inflation: 2.0

  # localization method (Gauss, Gaspari-Cohn-scaled, Gaspari-Cohn)
  localization: Gauss

  # horizontal localization scale, in meters
  hscale_m: 5.0e+5

  # vertical localization scale
  vscale_m: 400.0

  # temporal localization scale, allowed units are "s", "m", "h", "D"
  tscale: 6 h

  # prefactor for observation standard deviation in uncorrelated uncertainty
  obs_stdev_prefactor: 1.0

  # prefactor for model equivalent calculator (MEC) standard deviation in uncorrelated uncertainty
  mec_stdev_prefactor: 1.0

  # prefactor of deviation from original time series in uncorrelated uncertainty
  time_diff_prefactor: 0.0

  # only when using diagonal R matrix: ensemble uncertainty prefactor
  diag_R_ens_prefactor: 0.0

  # only when using diagonal R matrix: prefactor of model-predicted flux signal
  diag_R_signal_prefactor: 0.5

  # only when using diagonal R matrix: prefactor of boundary condition ensemble
  diag_R_bc_ens_prefactor: 1.0

  # extra uncertainty relative to the modeled signal of natural fluxes
  natural_signal_relative_uncertainty: 1.0

  # coordinate values in flux_total contributing to natural fluxes
  natural_signal_names: [tot_nat_lulucf, GFAS]


# Segmentation for MPI distributed computing.
# This configuration strongly affects the performance and the parallelization must be
# adjusted to the segments and segment buffer size.
segment_buffer:

  # temporal cutoff scale. Observations separated by longer time are assumed to
  # be completely independent.
  t_cutoff: 30 h

  # horizontal cutoff scale in meters
  h_cutoff_m: 2.5e+6

  # buffer time in segmentation will be buffer_prefactor * t_cutoff.
  buffer_prefactor: 1.25


# Coordinate-based filtering of observations.
coordinate_filter:

  # Inversion time window.
  # valid formats: "YYYY-MM-DD" or "YYYY-MM-DD hh:mm:ss" or "YYYY-MM-DDThh:mm:ss"
  start: 2020-01-01
  end: 2020-01-22

  # stations that shall be excluded entirely
  exclude_stations:
    - IPR
    - FKL
    - LMP
    - LIN_10.0
    - BIR_10.0
    - CRA_5.0
    - OHP_10.0
    - OPE_10.0
    - JUE_50.0

  # Alternatively, a positive list of stations may be provided:
  #only_use_stations: [STE, GAT, KIT, OXK, HPB, LIN]

  # The options exclude_stations and only_use_stations should not be combined.
  # exclude_stations will also apply for stations listed in only_use_stations.

  # Stations that shall only be included in the specified time intervals
  # and excluded otherwise. This must be in the format "YYYY-MM-DD - YYYY-MM-DD".
  station_seasons:
    BIR:
      - 2021-01-01 - 2021-04-01
      - 2021-09-04 - 2022-01-01
    LUT:
      - 2021-01-01 - 2021-11-01
    HUN:
      - 2021-03-01 - 2021-11-01

  # Maximum number of sampling heights per station.
  # The highest sampling heights will be selected.
  # Note: This is a pre-selection of the existing files. Sampling heights are
  # selected without checking if valid data are available for these heights.
  num_sampling_heights: 2

  # Use local mean time instead of UTC for the daily time window. This only
  # affects the configuration options start_window, end_window,
  # start_window_dict, and end_window_dict.
  use_local_time: true

  # beginning of daily time window per station (local mean time or UTC)
  start_window: 11 h

  # end of daily time window per station (local mean time or UTC)
  end_window: 17 h

  # override start_window per station, used to select different daily time windows for mountain stations
  start_window_dict:
    JFJ: 23 h
    CMN: 23 h
    PDM: 23 h
    ZSF: 23 h
    ZUG: 23 h
    PRS: 23 h
    KAS: 23 h

  # override end_window per station, used to select different daily time windows for mountain stations
  end_window_dict:
    JFJ: 5 h
    CMN: 5 h
    PDM: 5 h
    ZSF: 5 h
    ZUG: 5 h
    PRS: 5 h
    KAS: 5 h


# Data-based filtering of observations
data_filter:

  # allowed deviation, in number of standard deviations.
  # If the prior deviation is larger, the uncertainty will be increased.
  outlier_threshold: 3.0

  # minimal model wind speed (m/s). Observations with lower wind speed are discarded.
  min_wind: 1.0

  # maximal concentration from Nord Stream pipelines in any ensemble member (mol/mol)
  max_nordstream: 1.0e-7

  # half time window over which Nord Stream data shall be averaged before filtering
  nordstream_average: 4 h

  # maximal concentration from wildfires (GFAS contribution to CH4_flux_total, in mol/mol)
  max_wildfires: 1.0e-8

  # half time window over which wildfire data shall be averaged before filtering
  wildfires_average: 4 h

  # maximum amount by which observation may lie below model-predicted boundary
  # condition contribution (mol/mol)
  max_below_bc: 5.0e-9

  # half time window over which model and observation are averaged before
  # applying max_below_bc.
  # 9h means that for an observation at 12:00 we average from 3:00 to 21:00.
  below_bc_average: 9 h

  # ignore all outliers with a prior model-observation mismatch above this threshold:
  max_prior_deviation: 2.5e-7

  # ignore all days (00:00 to 24:00 local time) where the model-observation correlation
  # coefficient is below the following threshold at a minimal observed standard deviation:
  min_diurnal_correlation_coef: -0.05

  # only apply min_diurnal_correlation_coef if standard deviation of observations
  # over 24h exceeds this value:
  min_diurnal_correlation_std: 1.0e-8


# Prior scaling factor uncertainties.
# This defines the prior B matrix. Flux category names must be elements of the
# flux_cat coordinate in the input files.
#
# Simplified example:
# prior:
#   default_uncertainty: 0.4
#   uncertainty:
#     A: 0.8
#   correlations:
#     B:
#       C: 0.5
#
# Resulting prior B matrix:
#      A     B     C
# A   0.64   0     0
# B    0    0.16  0.08
# C    0    0.08  0.16
prior:

  # standard deviation of scaling factors (default value)
  default_uncertainty: 0.25

  # override standard deviation of scaling factors for specific flux categories
  uncertainty:
    A: 0.5
    B: 0.5
    C: 0.5

  # correlation coefficients of scaling factors.
  correlations:
    A:
      B: 0.5
      C: 0.5
    B:
      C: 0.5


# Define construction of prior B matrix from previous output when cycling.
cycle:

  # Inflate posterior uncertainty of previous cycle by this factor while
  # keeping correlation coefficients constant.
  inflation: 1.0

  # Next prior is a superposition of the initial prior and the previous
  # posterior. In this superposition, the initial prior is weighted by
  # this prefactor. (0: only use previous posterior; 1: no cycling)
  initial_prior_weight: 0.5



# Parameters of the Bayesian inversion.
inversion:

  # Prefactors of the norm in the cost function.
  # Larger values will generally lead to an underestimation of emissions.
  # Note: the numerical complexity is significantly reduced for 0.0.
  norm_prefactor: [0.0, 0.25, 0.5]

  # solver tolerance parameter
  solver_tol: 1.0e-6

  # solver tolerance parameter when norm_prefactor == 0
  solver_tol_simple: 1.0e-6  # optional

  # options passed directly to the solver
  solver_options:

    # When the norm of the gradient of the cost function is smaller than gtol,
    # the solver will stop and assume that the result is converged.
    gtol: 2.0e-4

    initial_trust_radius: 3.0
    max_trust_radius: 5.0
    eta: 0.1
    maxiter: 9

  # options passed directly to the solver when norm_prefactor == 0
  solver_options_simple:  # optional
    gtol: 1.0e-4
    initial_trust_radius: 3.0
    max_trust_radius: 5.0
    eta: 0.1
    maxiter: 15


###########################################################
###     EXAMPLE DATA AS CREATED BY INVERSION TOOLS      ###
###########################################################

# The following parts should be omitted in the configuration file
# since they are created automatically when copying the configuration
# to the target directory and preparing for the distributed calculation.

# Data dimensions, use these to check that parent and child use the same data.
dimensions:
  ens_size: 12  # number of meteorological ensemble members
  state_size: 55  # number of scalable flux categories
  obs_size: 1000  # total number of observation data points

# List all stations that shall be used in the inversion.
# This list is usually created automatically.
ssh:
  - JFJ_13.9
  - SSL_35.0
  - SSL_12.0
  - GAT_341.0
  - GAT_216.0
  - GAT_132.0
  # This list is incomplete!
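
The daily time window in coordinate_filter can cross midnight, as for the mountain stations in the example (start_window 23 h, end_window 5 h). The following sketch shows one way such a window could be evaluated. It assumes that local mean time is UTC plus one hour per 15 degrees of eastern longitude, and that the start of the window is inclusive and the end exclusive. The function names and the station longitude are hypothetical.

```python
def local_mean_hour(utc_hour, longitude_deg):
    """Hour of day in local mean time for a given UTC hour and longitude."""
    return (utc_hour + longitude_deg / 15.0) % 24.0


def in_daily_window(hour, start_h, end_h):
    """Check whether hour lies in [start_h, end_h), allowing wrap-around."""
    if start_h <= end_h:
        return start_h <= hour < end_h
    # The window crosses midnight, e.g. 23 h to 5 h.
    return hour >= start_h or hour < end_h


# Hypothetical station at 15 degrees east: 12:00 UTC is 13:00 local mean time.
hour_lowland = local_mean_hour(12.0, 15.0)
lowland_selected = in_daily_window(hour_lowland, 11.0, 17.0)

# Night-time window as used for mountain stations in the example.
night_selected = in_daily_window(2.0, 23.0, 5.0)
noon_selected = in_daily_window(12.0, 23.0, 5.0)
```

With use_local_time set to false, the same window check would be applied to the UTC hour directly.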