.. _usage_timeseries:

Timeseries analysis
======================================

These scripts facilitate the validation of simulated timeseries. This can be done in two ways. Either for GRDC-stations and the associated default GRDC file standard; in this case, most of the meta-data can be extracted directly from file, but the GRDC-script is limited to evaluating discharge simulations. Or via an Excel-file together with a geojson-file providing the locations of all stations; with the Excel-script, any simulated variable can be evaluated.

Validation with GRDC-data
-------------------------

**Code documentation**

.. code-block:: console

    Usage: pcru_eval_tims grdc [OPTIONS] NCF DATA_LOC OUT

      Uses pcrglobwb_utils to validate simulated time series (currently only
      discharge is supported) with observations (currently only GRDC) for one
      or more stations. The station name and file with GRDC data need to be
      provided in a separate yml-file. Per station, it is also possible to
      provide lat/lon coordinates which will supersede those provided by GRDC.

      The script facilitates resampling to other temporal resolutions.

      Returns a csv-file with the evaluated time series (OBS and SIM), a
      csv-file with the resulting scores (KGE, R2, RMSE, RRMSE, NSE), and if
      specified a simple plot of the time series. If specified, it also
      returns a geojson-file containing KGE values per station evaluated.

      NCF: Path to the netCDF-file with simulations.

      DATA_LOC: either yaml-file or folder with GRDC files.

      OUT: Main output directory. Per station, a sub-directory will be
      created.

    Options:
      -v, --var-name TEXT             variable name in netCDF-file
      -gc, --grdc-column TEXT         name of column in GRDC file to be read
                                      (only used with -f option)
      -e, --encoding TEXT             encoding of GRDC-files.
      -sf, --selection-file TEXT      path to file produced by pcru_sel_grdc
                                      function (only used with -f option)
      -t, --time-scale TEXT           time scale at which analysis is
                                      performed if resampling is desired.
                                      String needs to follow pandas
                                      conventions.
      -N, --number-processes INTEGER  number of processes to be used in
                                      multiprocessing.Pool(); defaults to
                                      number of CPUs in the system.
      --verbose / --no-verbose        more or less print output.
      --help                          Show this message and exit.

**Settings**

There are two ways to use this function. What they have in common is that they read a variable ``--var-name`` from a netCDF-file ``NCF`` containing simulated data. The variable name defaults to 'discharge'. Also, the command line script will create individual sub-folders per evaluated station in the main output folder ``OUT``. Per sub-folder, a csv-file with the computed metrics will be stored along with the underlying timeseries.

Option 1: Detailed analysis
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By providing one yml-file as ``DATA_LOC`` which has the following structure for each location to be analysed:

.. code-block:: yaml

    station_name_A:
        file:
        lat:
        lon:
        column:

    station_name_B:
        file:
        lat:
        lon:
        column:

``file`` needs to point to the GRDC file corresponding to this station. It can be a relative or absolute path. ``lat``, ``lon``, and ``column`` are optional settings. By default, ``pcrglobwb_utils`` retrieves latitude and longitude information from the meta-data stored in each GRDC-file, and performs a window search around this location to reduce the risk of a mismatch between the GRDC coordinates and the location in the model output. In some cases, this may still not be sufficient and hence coordinates can be provided manually via the yaml-file. GRDC-files often have multiple columns with data. ``pcrglobwb_utils`` reads the column ``' Calculated'`` by default; if another column is supposed to be read, it can be specified via ``column``.
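For reference, GRDC daily files (``.day``) are semicolon-separated text files whose commented header carries the station meta-data (including latitude and longitude) that ``pcrglobwb_utils`` reads by default; the leading space in column names such as ``' Calculated'`` or ``' Original'`` stems from this delimiter. The excerpt below is purely illustrative and abbreviated; header fields, data columns, and values vary between GRDC file versions:

.. code-block:: console

    # GRDC-No.:              3629000
    # River:                 AMAZON RIVER
    # Station:               OBIDOS
    # Latitude (DD):         -1.95
    # Longitude (DD):        -55.51
    # DATA
    YYYY-MM-DD;hh:mm; Original; Calculated; Flag
    1997-01-01;--:--;  -999.000; 176000.000; 120
    1997-01-02;--:--;  -999.000; 178000.000; 120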
**Example**

In this example, we make use of a yml-file to validate discharge at the locations Obidos and Jatuarana (both located in the Amazon).

.. code-block:: yaml

    Obidos:
        file: 'path/to/files/3629000_Obidos.day'
        column: ' Original'

    Jatuarana:
        file: 'path/to/files/3627000_Jatuarana.day'
        lon: -59.65
        lat: -3.05
        column: ' Calculated'

While we use the GRDC coordinates for Obidos, we specify them manually for Jatuarana. Also, the column to be read from the GRDC-file differs per station. The daily values are resampled to monthly values in this example.

.. code-block:: console

    $ yaml_file='path/to/yaml_file.yml'
    $ sim='path/to/model_discharge_output.nc'
    $ out='./OUT/'
    $ pcru_eval_tims grdc $sim $yaml_file $out -t M

Option 2: Batch analysis
^^^^^^^^^^^^^^^^^^^^^^^^^

If a batch of stations is to be analysed, it is possible to provide a folder path where GRDC-files are stored as ``DATA_LOC``. ``pcrglobwb_utils`` will then read all files, retrieve the meta-data, and perform the analysis. It is possible to select only those stations fulfilling certain requirements by providing a file containing the selected stations with the option ``--selection-file``. This approach has the advantage that not all files need to be specified in a yaml-file, but on the downside gives fewer possibilities to fine-tune the analysis. The only setting that can be provided is the name of the column to be read from all GRDC files, via ``--grdc-column``.

.. note::

    To reduce the risk of stations not being located in the 'right' cell, a window search is automatically performed to find the best matching cell.

In both cases, it is possible to resample simulated and observed data to larger time steps with ``--time-scale``. To speed up computations, it is possible to parallelise the evaluation by specifying a number of cores with ``--number-processes``. Note that the number of cores used may be scaled down to either the number of stations available or the number of cores available.

**Example**

In the example above, both GRDC files are stored in the folder ``path/to/files``. Instead of specifying these files manually, we can just analyse the entire folder content. When analysing many files, it may make sense to parallelise this process, here across 8 cores. And again, we want to perform the analysis at the monthly scale.

.. code-block:: console

    $ folder='path/to/files/'
    $ sim='path/to/model_discharge_output.nc'
    $ out='./OUT/'
    $ pcru_eval_tims grdc $sim $folder $out -N 8 -t M

Validation with Excel-file
---------------------------

If observations are not sourced from GRDC, they can alternatively be stored in an Excel-file.

.. attention::

    This setting is far less well tested than the use of GRDC data.

**Settings**

Key input is a netCDF-file containing simulated values (``NCF``). With the option ``--var-name``, the variable name can be specified. By default, the variable 'discharge' will be read.

Observed values are provided with an Excel-file (``XLS``). The file needs to have two or more columns. The first column contains the dates of observed values. All other columns then contain the observed values themselves. The first row must contain the names of the stations to be analysed (except for the first column, which does not have to have a header).

The list of stations to be analysed is retrieved from a geojson-file (``LOC``). It contains the locations (lat/lon) of the stations and also a unique identifier per station which must be provided with ``--location-id``.
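For illustration, the sketch below shows how the first rows of such an Excel-sheet and a matching geojson-file could look. The station names, coordinates, and the identifier property ``station`` (which matches the ``--location-id`` used in the example further below) are placeholders chosen for this sketch, not values prescribed by ``pcrglobwb_utils``:

.. code-block:: console

    $ # first rows of the Excel-sheet: dates in the first column,
    $ # one column of observed values per station
    $ #
    $ #              StationA    StationB
    $ # 2000-01-31   123.4       56.7
    $ # 2000-02-29   110.2       61.0

    $ cat stations.geojson
    {"type": "FeatureCollection",
     "features": [
       {"type": "Feature",
        "properties": {"station": "StationA"},
        "geometry": {"type": "Point", "coordinates": [-55.5, -1.9]}},
       {"type": "Feature",
        "properties": {"station": "StationB"},
        "geometry": {"type": "Point", "coordinates": [-59.7, -3.1]}}]}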
The command line script will create individual sub-folders per evaluated station in the main output folder ``OUT``. Per sub-folder, a csv-file with the computed metrics will be stored along with the underlying timeseries.

With the ``--geojson / --no-geojson`` switch, a geojson-file containing KGE values per evaluated station will (or will not) be stored to ``OUT``. Defaults to True. The ``--plot`` switch activates the creation of simple plots of the timeseries per evaluated station.

.. note::

    While the GRDC script works only with simulated discharge, the Excel script provides more freedom and can be used to evaluate any timeseries and variable simulated with PCR-GLOBWB!

**Code documentation**

.. code-block:: console

    Usage: pcr_utils_evaluate excel [OPTIONS] NCF XLS LOC OUT

      Uses pcrglobwb_utils to validate simulated time series with observations
      for one or more stations. The station names and their locations need to
      be provided via geojson-file. Observations are read from Excel-file and
      analysis will be performed for all stations with matching names in
      Excel-file columns and geojson-file.

      The Excel-file must have only one sheet with the first column being time
      stamps of observed values, and all other columns are observed values per
      station. These columns must have a header with the station name.

      The script facilitates resampling to other temporal resolutions.

      Returns a csv-file with the evaluated time series (OBS and SIM), a
      csv-file with the resulting scores (KGE, r, RMSE, NSE), and if specified
      a simple plot of the time series. If specified, it also returns a
      geojson-file containing KGE values per station evaluated.

      NCF: Path to the netCDF-file with simulations.

      XLS: Path to Excel-file containing dates and values per station.

      LOC: Path to geojson-file containing location and names of stations.

      OUT: Main output directory. Per station, a sub-directory will be
      created.

    Options:
      -v, --var-name TEXT       variable name in netCDF-file
      -id, --location-id TEXT   unique identifier in locations file.
      -t, --time-scale TEXT     time scale at which analysis is performed if
                                upscaling is desired: month, year, quarter.
      --plot / --no-plot        simple output plots.
      --geojson / --no-geojson  create GeoJSON file with KGE per GRDC station.
      --verbose / --no-verbose  more or less print output.
      --help                    Show this message and exit.

**Example**

In this example, each station in the geojson-file with the unique identifier 'station' will be matched with the columns in the Excel-file to validate simulated sediment transport.

.. code-block:: console

    $ sim='path/to/model_output.nc'
    $ excel='path/to/data.xlsx'
    $ loc='path/to/stations.geojson'
    $ out='./OUT/'
    $ pcr_utils_evaluate excel -v sedimentTransport -id station $sim $excel $loc $out
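As described above, results are organised per evaluated station. Purely as an illustration of what to expect, the output directory could then be structured along the following lines; the file names shown here are hypothetical placeholders, since the actual names are set by the script:

.. code-block:: console

    $ ls -R ./OUT/
    ./OUT/:
    StationA  StationB  kge_per_station.geojson

    ./OUT/StationA:
    evaluated_timeseries.csv  metrics.csv  plot.png

    ./OUT/StationB:
    evaluated_timeseries.csv  metrics.csv  plot.png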