---
title: "Quality Control of SHARK Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Quality Control of SHARK Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Overview

The SHARK4R package provides a set of functions to perform quality control (QC) on SHARK data. These functions help identify missing or invalid values, spatial errors, and statistical outliers, and are mainly intended for internal data validation.

The tutorial covers:

- **Field validation** (required and datatype-specific fields)
- **Code validation** (project and platform codes)
- **Geospatial checks** (points on land, basins, proximity to shore)
- **Depth checks** (missing, negative, or implausible values)
- **Outlier detection** (parameter-specific thresholds)
- **Logical parameter checks** (parameter-specific rules)
- **Station matching** (station names and positions against the station register)
- **Interactive QC** with a Shiny app

This workflow helps ensure that SHARK data are consistent, valid, and ready for analysis. Several quality control components, originally developed by Provoost and Bosch (2018), have been adapted for compatibility with the SHARK format.

---

## Installation

You can install the latest version of `SHARK4R` from CRAN using:

```{r, eval=FALSE}
install.packages("SHARK4R")
```

Load the package along with `dplyr`:

```{r, include=FALSE}
suppressPackageStartupMessages({
  library(SHARK4R)
  library(dplyr)
})
```

```{r, eval=FALSE}
library(SHARK4R)
library(dplyr)
```

---

## Retrieve SHARK Data

You can fetch SHARK data using the same filtering options as the SHARK web interface.
Explore available options with:

```{r}
shark_options <- get_shark_options()
```

Filter datasets containing "Chlorophyll":

```{r}
# Filter names using grepl
chlorophyll_datasets <- shark_options$datasets[grepl("Chlorophyll", shark_options$datasets)]

# Select the first dataset for demonstration
selected_dataset <- chlorophyll_datasets[1]

# Print the name of the selected dataset
print(selected_dataset)
```

Download the selected dataset as a data frame:

```{r}
chlorophyll_data <- get_shark_datasets(selected_dataset,
                                       save_dir = tempdir(),
                                       return_df = TRUE,
                                       verbose = FALSE)

tibble(chlorophyll_data)
```

SHARK data can also be saved locally with the `save_dir` argument and later imported into R with `read_shark()`, which reads both ZIP archives and text files.

---

## Step 1: Check Required Fields

Validate mandatory fields:

- **Global fields**: `check_datatype()`
- **Datatype-specific fields**: `check_fields()` with optional `field_definitions`

```{r}
check_fields(data = chlorophyll_data, datatype = "Chlorophyll")
```

---

## Step 2: Validate Project and Platform Codes

Ensure metadata codes follow SHARK conventions:

```{r}
# Validate project codes
check_codes(chlorophyll_data)

# Validate ship/platform codes
check_codes(data = chlorophyll_data,
            field = "platform_code",
            code_type = "SHIPC",
            match_column = "Code")
```

---

## Step 3: Geospatial Checks

### Visualize Data Points

```{r}
plot_map_leaflet(chlorophyll_data)
```

### Identify Points on Land

```{r}
# check_onland() returns the rows located on land; count them with nrow()
rows_on_land <- check_onland(chlorophyll_data)
nrow(rows_on_land)
```

Optional geospatial QC functions:

- `positions_are_near_land()`
- `which_basin()` from the [iRfcb package](https://CRAN.R-project.org/package=iRfcb)

---

## Step 4: Depth Checks

Verify plausibility and consistency of depth values:

```{r}
check_depth(data = chlorophyll_data) # default columns: min/max depth
check_depth(data = chlorophyll_data, "water_depth_m")
```

Checks performed:

- Missing depth column or empty values (warnings)
- Non-numeric or negative values (errors)
- Depth exceeding bathymetry, or minimum depth greater than maximum depth (errors)

---

## Step 5: Outlier Detection

Retrieve reference statistics for your datatype:

```{r}
shark_statistics <- get_shark_statistics(datatype = "Chlorophyll",
                                         fromYear = 2020,
                                         toYear = 2024,
                                         verbose = FALSE)

tibble(shark_statistics)
```

Detect extreme values using thresholds (e.g., 99th percentile):

```{r}
check_outliers(data = chlorophyll_data,
               parameter = "Chlorophyll-a",
               datatype = "Chlorophyll",
               threshold_col = "P99",
               thresholds = shark_statistics)
```

Visualize anomalies:

```{r}
# Scatterplot with horizontal line at 99th percentile
scatterplot(chlorophyll_data, hline = shark_statistics$P99)
```

---

## Step 6: Logical Parameter Checks

Use `check_parameter_rules()` to flag measurements that violate parameter-specific or row-wise logical rules.

```{r}
check_parameter_rules(data = chlorophyll_data)
```

* `return_df = TRUE` gives a data frame of violations.
* `return_logical = TRUE` gives logical vectors for each parameter.
* Only parameters present in the dataset are checked; available parameters are listed if none match.
* You can define custom rules by providing your own `param_conditions` or `rowwise_conditions` lists.
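
As a rough illustration of what a parameter-specific rule does, the sketch below flags invalid values in plain base R. It does not use or reproduce the internals of `check_parameter_rules()`, and it is not the format expected by `param_conditions`. The data frame `qc_data`, its column names, the `rules` list, and the helper `flag_violations()` are all hypothetical and exist only for this example.

```r
# Hypothetical long-format data; the column names are assumptions
qc_data <- data.frame(
  station_name = c("A", "A", "B", "B"),
  parameter    = c("Chlorophyll-a", "Chlorophyll-a", "Chlorophyll-a", "Secchi depth"),
  value        = c(1.2, -0.3, 4.8, 6.0),
  stringsAsFactors = FALSE
)

# One illustrative rule per parameter: a function returning TRUE where a
# value is invalid (these are NOT the package's built-in rules)
rules <- list(
  "Chlorophyll-a" = function(x) x < 0,
  "Secchi depth"  = function(x) x <= 0
)

# Apply each rule to the rows of its parameter and return a logical flag.
# Only parameters present in the data are checked; NA values are not flagged.
flag_violations <- function(data, rules) {
  violation <- rep(FALSE, nrow(data))
  for (param in intersect(names(rules), unique(data$parameter))) {
    rows <- data$parameter == param
    violation[rows] <- rules[[param]](data$value[rows]) %in% TRUE
  }
  violation
}

qc_data$violation <- flag_violations(qc_data, rules)

# Keep only the rows that break a rule
violations <- qc_data[qc_data$violation, ]
violations
```

With this toy input, only the negative chlorophyll value is returned. Keeping the rules in a named list makes it straightforward to add or adjust a rule per parameter without touching the checking logic.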

---

## Step 7: Station Matching

Verify station names against the official SHARK registry:

```{r}
station_match <- match_station(chlorophyll_data$station_name)
head(station_match)
```

To plot stations and their distances from the station register in an interactive map:

```{r}
check_station_distance(data = chlorophyll_data, plot_leaflet = TRUE)
```

To check if stations are nominal (comparing unique coordinates per station):

```{r}
check_nominal_station(data = chlorophyll_data)
```

---

## Interactive QC with Shiny

For a more user-friendly interface, use the Shiny QC app:

```{r, eval=FALSE}
# Run the app
run_qc_app()

# Alternatively, download support files and knit documents locally
check_setup(path = tempdir()) # using a temp folder in this example
```

The app provides point-and-click access to the same QC checks described above.

---

## Recommended Workflow Summary

1. **Check Required Fields** (`check_datatype()`, `check_fields()`)
2. **Validate Codes** (`check_codes()`)
3. **Geospatial Checks** (`plot_map_leaflet()`, `check_onland()`, optional `positions_are_near_land()`, `which_basin()`)
4. **Depth Checks** (`check_depth()`)
5. **Outlier Detection** (`check_outliers()`, `scatterplot()`)
6. **Logical Parameter Checks** (`check_parameter_rules()`)
7. **Station Matching** (`match_station()`, `check_station_distance()`, `check_nominal_station()`)
8. **Final Review & Visualization** (interactive maps and scatterplots)

Following this order covers each QC check described above and prepares your SHARK data for analysis.

---

## Citation

```{r, echo=FALSE}
citation("SHARK4R")
```

## References

- Provoost P, Bosch S (2024). `obistools`: Tools for data enhancement and quality control. Ocean Biodiversity Information System. Intergovernmental Oceanographic Commission of UNESCO. R package version 0.1.0.