1. Timeseries analysis#

These scripts facilitate the validation of simulated timeseries. This can be done in two ways: either for GRDC-stations with the associated default GRDC file standard, in which case most of the meta-data can be extracted directly from file (note that the GRDC-script is limited to evaluating discharge simulations); or via an Excel-file together with a geojson-file providing the locations of all stations. With the Excel-script, any simulated variable can be evaluated.

1.1. Validation with GRDC-data#

Code documentation

Usage: pcru_eval_tims grdc [OPTIONS] NCF DATA_LOC OUT

Uses pcrglobwb_utils to validate simulated time series (currently only
discharge is supported) with observations (currently only GRDC) for one or
more stations. The station name and file with GRDC data need to be provided
in a separate yml-file. Per station, it is also possible to provide lat/lon
coordinates which will supersede those provided by GRDC. The script
facilitates resampling to other temporal resolutions.

Returns a csv-file with the evaluated time series (OBS and SIM), a csv-file
with the resulting scores (KGE, R2, RMSE, RRMSE, NSE), and if specified a
simple plot of the time series. If specified, it also returns a geojson-file
containing KGE values per station evaluated.

NCF: Path to the netCDF-file with simulations.

DATA_LOC: either yaml-file or folder with GRDC files.

OUT: Main output directory. Per station, a sub-directory will be created.

Options:
-v, --var-name TEXT             variable name in netCDF-file
-gc, --grdc-column TEXT         name of column in GRDC file to be read (only
                                used with -f option)
-e, --encoding TEXT             encoding of GRDC-files.
-sf, --selection-file TEXT      path to file produced by pcru_sel_grdc
                                function (only used with -f option)
-t, --time-scale TEXT           time scale at which analysis is performed if
                                resampling is desired. String needs to
                                follow pandas conventions.
-N, --number-processes INTEGER  number of processes to be used in
                                multiprocessing.Pool()- defaults to number
                                of CPUs in the system.
--verbose / --no-verbose        more or less print output.
--help                          Show this message and exit.

Settings

There are two options for using this function. What they have in common is that they read a variable --var-name from a netCDF-file NCF containing simulated data. The variable name defaults to ‘discharge’.

Also, the command line script will create individual sub-folders per evaluated station in the main output folder OUT. Per sub-folder, a csv-file with the computed metrics will be stored along with the underlying timeseries.
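
For instance, if the variable is stored in NCF under a name other than the default (here the hypothetical name ‘riverDischarge’), it can be specified explicitly:

$ pcru_eval_tims grdc path/to/model_output.nc path/to/data_loc ./OUT/ -v riverDischarge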

1.1.1. Option 1: Detailed analysis#

By providing one yml-file as DATA_LOC which has the following structure for each location to be analysed:

<location_name>:
    file: <path/to/GRDC_file>
    lat: <latitude value>
    lon: <longitude value>
    column: <column name>
<location_name>:
    file: <path/to/GRDC_file>
    lat: <latitude value>
    lon: <longitude value>
    column: <column name>

file needs to point to the GRDC file corresponding to this station. It can be a relative or absolute path.

lat, lon, and column are optional settings.

By default, pcrglobwb_utils retrieves latitude and longitude information from the meta-data stored in each GRDC-file, and performs a window search around this location to reduce the risk of a mismatch between GRDC coordinates and the location in the model output. In some cases, this may still not be sufficient and hence coordinates can be provided manually via the yaml-file.

GRDC-files often have multiple columns with data. pcrglobwb_utils uses ' Calculated' as default. If another column is supposed to be read, this can be specified here.

Example

In this example, we make use of a yml-file to validate discharge at the locations Obidos and Jatuarana (both located in the Amazon).

Obidos:
    file: 'path/to/files/3629000_Obidos.day'
    column: ' Original'

Jatuarana:
    file: 'path/to/files/3627000_Jatuarana.day'
    lon: -59.65
    lat: -3.05
    column: ' Calculated'

While we use the GRDC coordinates for Obidos, we specify them manually for Jatuarana. Also, the column to be read in the GRDC-file differs per station.

The daily values are resampled to monthly values in this example.

$ yaml_file='path/to/yaml_file.yml'
$ sim='path/to/model_discharge_output.nc'
$ out='./OUT/'
$ pcru_eval_tims grdc $sim $yaml_file $out -t M

1.1.2. Option 2: Batch analysis#

If a batch of stations is to be analysed, it is possible to provide a folder path where GRDC-files are stored as DATA_LOC. pcrglobwb_utils will then read all files, retrieve their meta-data, and perform the analysis. It is possible to select only those stations fulfilling certain requirements by providing a file containing selected stations with the option --selection-file. This has the advantage that not all files need to be specified in a yaml-file, but on the downside gives fewer possibilities to fine-tune the analysis. The only setting that can be provided is the column name to be read from the GRDC files via --grdc-column.

Note

To reduce the risk of stations not being located in the ‘right’ cell, a window search is automatically performed to find the best matching cell.

In both cases, it is possible to resample simulated and observed data to larger time steps with --time-scale.

To speed up computations, it is possible to parallelise the evaluation by specifying a number of cores with --number-processes. Note that the number of cores used may be scaled down to either the number of stations available or the number of cores available.

Example

In the example above, both GRDC files are stored in the folder path/to/files. Instead of specifying these files manually, we can just analyse the entire folder content.

When analysing many files, it may make sense to parallelise this process, here across 8 cores. And again, we want to perform the analysis at the monthly scale.

$ folder='path/to/files/'
$ sim='path/to/model_discharge_output.nc'
$ out='./OUT/'
$ pcru_eval_tims grdc $sim $folder $out -N 8 -t M
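
If only a selection of stations should be evaluated, a selection file produced beforehand with pcru_sel_grdc can additionally be passed, possibly together with a non-default GRDC column (the file name here is hypothetical):

$ pcru_eval_tims grdc $sim $folder $out -N 8 -t M -sf path/to/selected_stations.txt -gc ' Original'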

1.2. Validation with Excel-file#

If observations are not sourced from GRDC, they can alternatively be stored in an Excel-file.

Attention

This setting is by far less well tested than the use of GRDC data.

Settings

The key input is a netCDF-file containing simulated values (NCF). With the option --var-name, the variable name can be specified. By default, the variable ‘discharge’ will be read.

Observed values are provided with an Excel-file (XLS). The file needs to have two or more columns. The first column contains the dates of the observed values. All other columns then contain the observed values themselves. The first row must contain the names of the stations to be analysed (except for the first column, which does not need to have a header).
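
For illustration, such a file could be structured as follows (station names and values are placeholders):

            <station_1>    <station_2>
<date>      <value>        <value>
<date>      <value>        <value>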

The list of stations to be analysed is retrieved from a geojson-file (LOC). It contains the locations (lat/lon) of the stations and also a unique identifier per station which must be provided with --location-id.
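
A minimal geojson-file could look as follows, here assuming the unique identifier is stored in a property called ‘station’ (as in the example further below):

{
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"station": "<location_name>"},
            "geometry": {"type": "Point", "coordinates": [<longitude value>, <latitude value>]}
        }
    ]
}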

The command line script will create individual sub-folders per evaluated station in the main output folder OUT. Per sub-folder, a csv-file with the computed metrics will be stored along with the underlying timeseries.

With the --geojson / --no-geojson switch, the storing of a geojson-file to OUT containing KGE values per evaluated station can be activated or deactivated. Defaults to True.

The --plot switch activates printing of simple plots of the timeseries per evaluated station.
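
For instance, to write simple plots per station but skip the geojson output, both switches can be combined (paths and the identifier ‘station’ are placeholders):

$ pcr_utils_evaluate excel --plot --no-geojson -id station path/to/model_output.nc path/to/data.xlsx path/to/stations.geojson ./OUT/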

Note

While the GRDC script works only with simulated discharge, the Excel script provides more freedom and can be used to evaluate any timeseries and variable simulated with PCR-GLOBWB!

Code documentation

Usage: pcr_utils_evaluate excel [OPTIONS] NCF XLS LOC OUT

Uses pcrglobwb_utils to validate simulated time series with observations
for one or more stations. The station names and their locations need to be
provided via geojson-file. Observations are read from Excel-file and
analysis will be performed for all stations with matching names in Excel-
file columns and geojson-file. The Excel-file must have only one sheet
with first column being time stamps of observed values, and all other
columns are observed values per station. These columns must have a header
with the station name. The script facilitates resampling to other temporal
resolutions.

Returns a csv-file with the evaluated time series (OBS and SIM), a csv-
file with the resulting scores (KGE, r, RMSE, NSE), and if specified a
simple plot of the time series. If specified, it also returns a geojson-
file containing KGE values per station evaluated.

NCF: Path to the netCDF-file with simulations.

XLS: Path to Excel-file containing dates and values per station.

LOC: Path to geojson-file containing location and names of stations.

OUT: Main output directory. Per station, a sub-directory will be created.

Options:
    -v, --var-name TEXT             variable name in netCDF-file
    -id, --location-id TEXT         unique identifier in locations file.
    -t, --time-scale TEXT           time scale at which analysis is performed if upscaling is desired: month, year, quarter.
    --plot / --no-plot              simple output plots.
    --geojson / --no-geojson        create GeoJSON file with KGE per GRDC station.
    --verbose / --no-verbose        more or less print output.
    --help                          Show this message and exit.

Example

In this example, each station in the geojson-file with a unique identifier ‘station’ will be matched with the columns in the Excel-file to validate simulated sediment transport.

$ sim='path/to/model_output.nc'
$ excel='path/to/data.xlsx'
$ loc='path/to/stations.geojson'
$ out='./OUT/'
$ pcr_utils_evaluate excel -v sedimentTransport -id station $sim $excel $loc $out
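
If the analysis should additionally be performed at, for example, the yearly scale, the resampling option can be appended:

$ pcr_utils_evaluate excel -v sedimentTransport -id station -t year $sim $excel $loc $out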