Data Interfaces
===============

This page describes the input and output netCDF files.

General
-------

DUBFI takes model equivalents and observation data as input.
Currently, it is only designed to work with in-situ concentration measurements.
The output consists of scaling factors for flux categories and uncertainties on these flux categories.
DUBFI will not output fluxes or emission estimates.

All data input and output is in netCDF files.
Each numerical input and output field should have a :code:`units` attribute.
All concentrations are provided in units of moles of tracer per mole of dry air (mol/mol).
For other quantities, SI units are preferred.


Input
-----

Data input consists of multiple netCDF files containing observed concentrations and model predictions (simulation results).
Additionally, when cycling the inversion, an output file of DUBFI will also serve as input.
For the following description, we will assume that the species (configuration entry :code:`species`) is CH4.

Directory tree
^^^^^^^^^^^^^^

Observation time series are distinguished by station and sampling height (ssh).
For each ssh, DUBFI requires two input files, one containing the deterministic run and one containing the ensemble run.

Input files are provided in directory :code:`mec_dir` or :code:`orig_mec_dir`, both defined in the configuration.
For an ssh identifier "JFJ\_13.9" (station Jungfraujoch, sampling 13.9 m above ground level), the required input files are
:code:`${mec_dir}/JFJ_13.9_det.nc` (deterministic run) and :code:`${orig_mec_dir}/JFJ_13.9_ens.nc` (ensemble run).

If :code:`orig_mec_dir` is not provided, it defaults to :code:`mec_dir`.
A separate :code:`mec_dir` allows the user to keep far-field corrected data for :code:`JFJ_13.9_det.nc` apart from the original data in :code:`orig_mec_dir`.
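
The path rules above can be sketched as follows. This is only an illustration of the naming scheme, not DUBFI's own code: the helper name :code:`input_paths` and the example directories are hypothetical.

.. code-block:: python

    from pathlib import PurePosixPath
    from typing import Optional, Tuple


    def input_paths(
        ssh: str, mec_dir: str, orig_mec_dir: Optional[str] = None
    ) -> Tuple[PurePosixPath, PurePosixPath]:
        """Return (deterministic, ensemble) input paths for one ssh identifier."""
        # orig_mec_dir falls back to mec_dir when it is not configured.
        ens_dir = orig_mec_dir if orig_mec_dir is not None else mec_dir
        det_path = PurePosixPath(mec_dir) / f"{ssh}_det.nc"
        ens_path = PurePosixPath(ens_dir) / f"{ssh}_ens.nc"
        return det_path, ens_path


    # Hypothetical directories: corrected deterministic data, original ensemble data.
    det, ens = input_paths("JFJ_13.9", "/data/mec_corrected", "/data/mec")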


Coordinates
^^^^^^^^^^^

The netCDF input files shall use the following coordinates:

- :code:`time`: observation time, must be a sorted array of times. All data (model and observation) must be provided on the same time coordinate.
- :code:`bc_prior`: multiple boundary conditions may be provided. The configuration entry :code:`input.bc_name` defines the label selected from this dimension.
- :code:`flux_cat`: labels of the flux categories. These should be short but ideally human-readable. This must include the label "NordStream".
- :code:`flux_total`: labels components that should be summed up to obtain the total contribution of the fluxes.
- :code:`ensmem`: meteorological ensemble member, only present in the ensemble data. Coordinates along this dimension are not used.

- if configuration entry :code:`input.use_bc_correction` is true (which is the default):

  - :code:`bc_ens_letkf`: ensemble dimension of the posterior boundary condition ensemble


Common variables
^^^^^^^^^^^^^^^^

- :code:`obs_CH4 (time)`: time series of observed CH4 concentration in mol/mol
- :code:`obs_stdev_CH4 (time) (optional)`: standard deviation of :code:`obs_CH4` (alternative name: :code:`obs_stddev_CH4`)
- :code:`windspeed ([ensmem,] time)`: model wind speed in m/s, used for filtering
- :code:`CH4_flux_total (flux_total, [ensmem,] time)`: additive components of the total contribution of all fluxes within the domain to the concentration.
- :code:`CH4_bc_prior (bc_prior, [ensmem,] time)`: contribution of boundary conditions to the observed CH4 concentration. Each :code:`bc_prior` must represent one alternative total boundary contribution.
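
As a rough illustration of how these fields relate, the sketch below builds a total model concentration for the deterministic case and compares it with the observations. It assumes that the total is the sum over :code:`flux_total` plus the boundary contribution selected by :code:`input.bc_name`; this is our reading of the field descriptions, not a statement about DUBFI's internals. All numbers and the labels :code:`bc_a` and :code:`bc_b` are made up.

.. code-block:: python

    import numpy as np

    # Synthetic stand-ins for the deterministic fields (mol/mol), 3 time steps.
    obs_ch4 = np.array([1.95e-6, 1.96e-6, 1.97e-6])
    ch4_flux_total = np.array(
        [
            [2.0e-8, 3.0e-8, 4.0e-8],  # first additive flux component
            [1.0e-8, 1.0e-8, 1.0e-8],  # second additive flux component
        ]
    )
    bc_labels = ["bc_a", "bc_b"]  # hypothetical bc_prior labels
    ch4_bc_prior = np.array(
        [
            [1.90e-6, 1.90e-6, 1.90e-6],
            [1.91e-6, 1.91e-6, 1.91e-6],
        ]
    )


    def model_equivalent(flux_total, bc_prior, labels, bc_name):
        """Sum the flux components and add the selected boundary contribution."""
        selected_bc = bc_prior[labels.index(bc_name)]
        return flux_total.sum(axis=0) + selected_bc


    model = model_equivalent(ch4_flux_total, ch4_bc_prior, bc_labels, "bc_a")
    residual = obs_ch4 - model  # observation minus model, per time step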


Deterministic run
^^^^^^^^^^^^^^^^^

- :code:`CH4_flux_cat (flux_cat, time)`: contribution of flux categories to total CH4 concentration
- :code:`CH4_bc_correction (time)`: correction that shall be added to :code:`CH4_bc_prior` before the inversion.
- :code:`CH4_bc_ens_letkf (bc_ens_letkf, time)`: ensemble of boundary conditions, posterior of the far-field correction. Currently, the ensemble mean is ignored.
- :code:`CH4_nordstream_ensmax (time)`: ensemble maximum of CH4 due to the Nord Stream explosion. This is (ideally) based on ensemble data but is included in the file for the deterministic run to simplify data handling.

Standard deviations estimated by the model equivalent calculator (optional). These estimate uncertainties in the interpolation to the observation coordinates:

- :code:`CH4_flux_cat_stdev (flux_cat, time)`
- :code:`CH4_flux_total_stdev (flux_total, time)`
- :code:`CH4_bc_prior_stdev (bc_prior, time)`


.. _input_format_ensemble:

Ensemble run
^^^^^^^^^^^^

The concentration of each flux category is estimated using the following fields:

- :code:`group2cat (flux_cat, flux_group) [units=1]`
- :code:`CH4_flux_cat_weights (flux_cat, time) [units=1]`
- :code:`CH4_flux_group (flux_group, ensmem, time) [units=mol mol-1]`

We define::

    CH4_flux_cat[i] = CH4_flux_cat_weights[i] * (group2cat[i] @ CH4_flux_group)

This reflects the approximations in the ensemble simulation, see :ref:`approx_ensemble_members`.
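
A minimal NumPy sketch of this definition, evaluated for all flux categories at once, is given below. The array sizes and values are synthetic, and the variable names simply mirror the fields listed above. The contraction over :code:`flux_group` is written with :code:`numpy.einsum`, and the result is assumed to have dimensions :code:`(flux_cat, ensmem, time)`.

.. code-block:: python

    import numpy as np

    n_cat, n_group, n_ens, n_time = 2, 3, 4, 5
    rng = np.random.default_rng(0)

    # (flux_cat, flux_group): category 0 collects groups 0 and 1, category 1 group 2.
    group2cat = np.array([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    # (flux_cat, time): dimensionless weights.
    ch4_flux_cat_weights = rng.uniform(0.5, 1.5, size=(n_cat, n_time))
    # (flux_group, ensmem, time): concentrations in mol/mol.
    ch4_flux_group = rng.uniform(0.0, 1e-8, size=(n_group, n_ens, n_time))

    # Contract over flux_group: result has shape (flux_cat, ensmem, time).
    summed = np.einsum("cg,get->cet", group2cat, ch4_flux_group)
    # Apply the per-category, per-time weights to every ensemble member.
    ch4_flux_cat = ch4_flux_cat_weights[:, np.newaxis, :] * summed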


Attributes
^^^^^^^^^^

- :code:`ssh`: station and sampling height identifier of the form "ABC\_123.4" where "ABC" is the 3-letter station code and "123.4" is the sampling height in meters above ground level.
- :code:`lon`, :code:`lat`: station coordinates in degrees east and degrees north
- The height in meters must be provided as an attribute whose name is defined by configuration entry :code:`uncertainty.vertical_coordinate`. This coordinate is used for the localization of correlations.
- :code:`ens_size`: number of meteorological ensemble members
- :code:`station_code` (optional): 3-letter station code, should be first three letters of :code:`ssh`
- :code:`is_auxiliary_to_data_in`:
  This attribute defines a path to a file from which missing fields shall be read.
  The typical use case is that multiple files with different far-field correction exist, but not all data need to be stored redundantly.
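
To illustrate the :code:`ssh` format, the following sketch splits an identifier into station code and sampling height. The function :code:`parse_ssh` is a hypothetical helper written for this page and is not part of DUBFI.

.. code-block:: python

    from typing import Tuple


    def parse_ssh(ssh: str) -> Tuple[str, float]:
        """Split an identifier like "JFJ_13.9" into ("JFJ", 13.9)."""
        station_code, _, height = ssh.partition("_")
        if len(station_code) != 3 or not height:
            raise ValueError(f"not of the form 'ABC_123.4': {ssh!r}")
        # float() raises ValueError if the height part is not numeric.
        return station_code, float(height)


    station_code, sampling_height = parse_ssh("JFJ_13.9")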


Examples
^^^^^^^^

Deterministic:

.. literalinclude:: ../../examples/input_det.cdl
   :language: cdl

Ensemble:

.. literalinclude:: ../../examples/input_ens.cdl
   :language: cdl


Output
------

The output of DUBFI mainly consists of scaling factors for flux categories and the error covariance matrices of these scaling factors. The vectors of scaling factors form the state space of the inversion.

Additionally, the output contains the sensitivity of scaling factors to observations as a connection between state space and observation space.
The output is based on the internal structure of the observation space, which is a one-dimensional vector without definite ordering.
For post-processing, it may be of interest to reproduce the observation filtering done in the inversion using :py:func:`dubfi.fluxes.readobs.data_from_config`.

DUBFI does not know about the fluxes and does not provide flux estimates.
The post-processing package used at :term:`DWD` is not yet published.


Directory tree
^^^^^^^^^^^^^^

The output directory specified for :py:mod:`dubfi.fluxes` will contain a file "inversion_result.nc", a configuration file "config.yml", and logfiles (when using the default configuration).

Coordinates
^^^^^^^^^^^

- :code:`flux_cat`: flux category (as in input)
- :code:`flux_cat_dual`: flux category, equivalent to :code:`flux_cat`, used for square matrices
- :code:`norm_prefactor`: prefactor of normalization term in cost function in Bayesian inversion (as in configuration entry :code:`inversion.norm_prefactor`)
- :code:`obs`: observation dimension combining time and ssh. All observation time series are combined in one long vector. Sorting by time or station is not guaranteed.
- :code:`ssh`: station and sampling height identifiers (strings)
- :code:`obs_time (obs)`: observation time
- :code:`ssh_idx (obs)`: 0-based index of ssh identifier along observation dimension. Observation at index :code:`i` has observation time :code:`obs_time[i]` and ssh identifier :code:`ssh[ssh_idx[i]]`.
- :code:`segment (optional)`: segment in :term:`MPI` parallelization
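
Because the observation vector has no guaranteed ordering, post-processing usually has to regroup it by ssh. The sketch below shows one way to do this with plain Python lists standing in for the netCDF variables. The data, the station :code:`ABC_10.0`, and the helper :code:`group_by_ssh` are made up for illustration.

.. code-block:: python

    from collections import defaultdict

    # Synthetic stand-ins for the output coordinates and one obs-space variable.
    ssh = ["JFJ_13.9", "ABC_10.0"]
    ssh_idx = [1, 0, 0, 1]
    obs_time = [
        "2022-09-27T06:00",
        "2022-09-27T12:00",
        "2022-09-27T06:00",
        "2022-09-27T00:00",
    ]
    mdm_post = [1.0e-9, -2.0e-9, 3.0e-9, 4.0e-9]


    def group_by_ssh(ssh, ssh_idx, obs_time, values):
        """Return {ssh identifier: [(time, value), ...]} sorted by time."""
        series = defaultdict(list)
        for i, idx in enumerate(ssh_idx):
            # Observation i belongs to ssh[ssh_idx[i]] at time obs_time[i].
            series[ssh[idx]].append((obs_time[i], values[i]))
        for points in series.values():
            points.sort()  # ISO-formatted time strings sort chronologically
        return dict(series)


    per_station = group_by_ssh(ssh, ssh_idx, obs_time, mdm_post)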


Data variables
^^^^^^^^^^^^^^

Output data variables are described by :py:data:`dubfi.fluxes.core.OUTPUT_METADATA`, which might be more up to date than this documentation.

- :code:`raw_config_utf8`:
  UTF-8 encoded :term:`YAML` configuration defining inversion parameters
- :code:`cost_function_post`:
  posterior inversion cost function
- :code:`s_prior_kalman`:
  Prior scaling factors in cycling with constant R. This is only present when using cycling.
  Values equal to zero indicate that s_prior describes the deviation from the prior emissions.
- :code:`s_prior`:
  Prior scaling factors. When cycling, this depends on the norm prefactor.
  Values equal to zero indicate that s_post describes the deviation from the prior emissions.
- :code:`s_post`:
  posterior scaling factors
- :code:`s_post_kalman`:
  Posterior scaling factors, assuming constant R.
  To be understood relative to s_prior. In these posterior scaling factors, the dependence of the uncertainty on the scaling factors was neglected.
- :code:`b_prior`:
  uncertainty (error covariance) matrix of s_prior
- :code:`b_post`:
  uncertainty (error covariance) matrix of s_post
- :code:`b_post_kalman`:
  uncertainty (error covariance) matrix of s_post_kalman
- :code:`sensitivity`:
  Linearized sensitivity of posterior scaling factors to observations.
  This is the derivative of posterior scaling factors w.r.t. observations.
- :code:`sensitivity_kalman`:
  Linearized sensitivity of posterior scaling factors to observations, assuming constant R.
  This is the derivative of s_post_kalman w.r.t. observations.
- :code:`averaging_kernel`:
  Averaging kernel estimate.
  This is the derivative of posterior scaling factors (dimension flux_cat_dual) w.r.t. true scaling factors (dimension flux_cat), estimated under assumption of a perfect transport model.
- :code:`mdm_prior`:
  observation minus prior model prediction (for scaling factors s_prior)
- :code:`mdm_post`:
  observation minus posterior model prediction (for scaling factors s_post)
- :code:`mdm_post_kalman`:
  observation minus model prediction for scaling factors s_post_kalman
- :code:`mdm_stdev_prior`:
  standard deviation assumed in inversion for mdm_prior (same for mdm_post_kalman), including uncertainty weighting and inflation
- :code:`mdm_stdev_post`:
  standard deviation assumed in inversion for mdm_post, including uncertainty weighting and inflation
- :code:`ssh`:
  station code and sampling height
- :code:`ssh_idx`:
  0-based index of ssh coordinate for observation data points
- :code:`obs_time`:
  observation time (UTC)
- :code:`lon`:
  station longitude (degrees east)
- :code:`lat`:
  station latitude (degrees north)
- :code:`height`:
  observation height coordinate as used for localization
- :code:`flux_cat`: flux category name
- :code:`segment_size`:
  unbuffered size of segments in :term:`MPI` parallelization
- :code:`buffered_segment_size`:
  buffered size of segments in :term:`MPI` parallelization
- :code:`obs_count`:
  number of observations per station and sampling height
- :code:`norm_prefactor`:
  prefactor of normalization term in cost function
- :code:`solver_nit`: number of solver iterations
- :code:`solver_nfev`:
  number of cost function calls in solver
- :code:`solver_njev`:
  number of calls to gradient of cost function in solver
- :code:`solver_nhev`:
  number of calls to Hessian of cost function in solver
- :code:`solver_status`: solver status, 0 means success
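
A common post-processing step is to convert an error covariance matrix such as :code:`b_post` into standard deviations and a correlation matrix. The sketch below does this for a single, made-up 2x2 matrix. It assumes the matrix has already been selected for one :code:`norm_prefactor` and has dimensions :code:`(flux_cat, flux_cat_dual)`.

.. code-block:: python

    import numpy as np

    # Synthetic covariance matrix for two flux categories.
    b_post = np.array([[0.04, 0.01], [0.01, 0.09]])

    # Standard deviation of each scaling factor: square root of the diagonal.
    s_post_stdev = np.sqrt(np.diag(b_post))

    # Correlation matrix: covariance normalized by the product of standard deviations.
    correlation = b_post / np.outer(s_post_stdev, s_post_stdev)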


Attributes
^^^^^^^^^^

- :code:`start_window`: start of inversion time window, ISO-formatted date and time string
- :code:`end_window`: end of inversion time window, ISO-formatted date and time string
- :code:`chi2`: :math:`\chi^2` value of the fit. Combined with attribute :code:`ddof`, it can be used to assess how well the assumed uncertainties agree with the actual deviations.
- :code:`next_norm_prefactor_idx`: integer reporting the progress in filling an existing file with data. In a complete file, this must be equal to the size of the dimension :code:`norm_prefactor`.
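
The completeness rule for :code:`next_norm_prefactor_idx` can be checked as sketched below. The function :code:`is_complete` is a hypothetical helper; in practice, both arguments would be read from the output file.

.. code-block:: python

    def is_complete(next_norm_prefactor_idx: int, n_norm_prefactor: int) -> bool:
        """Return True if results exist for every entry of norm_prefactor."""
        if not 0 <= next_norm_prefactor_idx <= n_norm_prefactor:
            raise ValueError("index outside the range of the norm_prefactor dimension")
        # A complete file has been filled up to the size of the dimension.
        return next_norm_prefactor_idx == n_norm_prefactor


    finished = is_complete(next_norm_prefactor_idx=3, n_norm_prefactor=3)
    partial = is_complete(next_norm_prefactor_idx=1, n_norm_prefactor=3)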