Ulrike Herzschuh, Luidmila A. Pestryakova, Birgit Heim, Elena Troeva, Stefan Kruse, Bringfried Pflug, Evgenij S. Zakharov, Femke van Geffen, Iuliia Shevtsova, Frederic Brieger, Simone Maria Stuenzi, Nadine Bernhardt, Rongwei Geng, and Luise Schulte
This data collection is an attempt to remedy the scarcity of tree-level forest structure data in the circum-boreal region, while also providing adjusted and labelled tree-level and vegetation-plot-level data for machine learning and upscaling practices. Publicly available comprehensive datasets on tree-level forest structure are rare because governmental agencies, public sectors, and private actors all influence the availability of these datasets. We present datasets of vegetation composition and tree- and plot-level forest structure for two important vegetation transition zones in Siberia, Russia: the summergreen–evergreen transition zone in central Yakutia and the tundra–taiga transition zone in Chukotka (NE Siberia). The SiDroForest collection contains a variety of data, mainly based on unmanned aerial vehicle (UAV) and field data collected from 64 vegetation plots during fieldwork jointly performed by the Alfred Wegener Institute for Polar and Marine Research (AWI) and the North-Eastern Federal University of Yakutsk (NEFU) during the Chukotka 2018 expedition to Siberia. The data collection consists of four separate datasets. The fieldwork locations are the anchors that bind the data types together, based on the location of the vegetation plot. i) The first dataset (Kruse et al., 2021, https://doi.pangaea.de/10.1594/PANGAEA.933263) provides UAV-borne data products covering the 64 vegetation plots surveyed during fieldwork, including structure from motion (SfM) point clouds and point-cloud products such as the Digital Elevation Model (DEM), Canopy Height Model (CHM), Digital Surface Model (DSM) and Digital Terrain Model (DTM) constructed from Red Green Blue (RGB) and Red Green Near Infrared (RGN) orthomosaics. Forest structure and vegetation composition data are crucial for assessing whether a forest will act as a carbon sink under changing climate conditions. 
Fieldwork and UAV products can provide such data in depth. ii) The second dataset contains spatial data in the form of point and polygon shapefiles of 872 labelled individual trees and shrubs that were recorded during fieldwork at the same vegetation plots, with information on tree height, crown diameter, and species (van Geffen et al., 2021c, https://doi.pangaea.de/10.1594/PANGAEA.932821). These labelled point and polygon shapefiles of individual trees and shrubs were generated and located on the UAV RGB orthoimages. The number assigned to each individual links to the information collected during the expedition, such as tree height, crown diameter and vitality, which is provided in table format. This dataset can be used to link individual trees in the SfM point clouds, providing unique insights into the vegetation composition and also allowing future monitoring of the individual trees and of the contents of the recorded vegetation plots at large. iii) The third dataset contains a synthesis of 10 000 generated images and masks that have the tree crowns of two species of larch (Larix gmelinii and Larix cajanderi) automatically extracted from the RGB UAV images in the Common Objects in Context (COCO) format (van Geffen et al., 2021a, https://doi.pangaea.de/10.1594/PANGAEA.932795). The synthetic dataset was created specifically to detect Siberian larch species. iv) While publicly available tree-level forest-structure datasets are rare for Siberia, ready-to-use tree- and plot-level data for machine learning approaches, for example in optimised data formats containing annotated vegetation categories, are rarer still. The fourth dataset contains Sentinel-2 Level-2 bottom-of-atmosphere labelled image patches with seasonal information and annotated vegetation categories covering the vegetation plots (van Geffen et al., 2021b, https://doi.pangaea.de/10.1594/PANGAEA.933268). 
The dataset was created with the aim of providing a small, ready-to-use validation and training dataset for various vegetation-related machine learning tasks. The SiDroForest data collection serves a variety of user communities. First of all, the UAV-derived top-of-canopy structure information, the orthomosaics and the detailed vegetation information in the labelled dataset provide detailed information on forest type, structure and composition for scientific communities with ecological and biological applications. The detailed land cover and vegetation structure information in the first two datasets is of use for the generation and validation of land cover remote sensing products in radar and optical remote sensing. In addition to providing information on forest structure and vegetation composition of the vegetation plots, parts of the SiDroForest collection are prepared to be used as training and validation data for machine learning purposes. For example, the synthetic tree-crown dataset is generated from the raw UAV images and optimised for use in neural networks. Furthermore, the fourth SiDroForest dataset contains standardised Sentinel-2 labelled image patches, with JSON labels provided, that supply training data on vegetation class categories for machine learning classification. The SiDroForest data collection serves as a basis to which data from future expeditions performed by the Alfred Wegener Institute can be added, creating a larger dataset in the upcoming years that can provide unique insights into remote, hard-to-reach boreal regions of Siberia.
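
As an illustration of how annotations in the COCO format, as used by the synthetic tree-crown dataset, might be inspected before training, the following minimal sketch counts annotated objects per category. It relies only on the standard top-level COCO keys ("images", "annotations", "categories"); the miniature in-memory example, its category names and its file names are hypothetical and are not taken from the published dataset, whose actual file layout should be checked against its documentation.

```python
import json
from collections import Counter


def count_annotations_per_category(coco: dict) -> dict:
    """Return {category name: number of annotated objects} for a COCO-style dict."""
    # Map numeric category ids to their human-readable names.
    id_to_name = {c["id"]: c["name"] for c in coco.get("categories", [])}
    # Count how often each category id occurs among the annotations.
    counts = Counter(a["category_id"] for a in coco.get("annotations", []))
    # Fall back to the raw id as a string if a category entry is missing.
    return {id_to_name.get(cid, str(cid)): n for cid, n in counts.items()}


# Hypothetical miniature annotation file (names and values are assumptions).
example = json.loads("""
{
  "images": [{"id": 1, "file_name": "synthetic_0001.png", "width": 448, "height": 448}],
  "categories": [{"id": 1, "name": "Larix gmelinii"}, {"id": 2, "name": "Larix cajanderi"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 50, 60], "area": 3000, "iscrowd": 0},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [100, 120, 40, 45], "area": 1800, "iscrowd": 0},
    {"id": 12, "image_id": 1, "category_id": 2, "bbox": [200, 220, 30, 35], "area": 1050, "iscrowd": 0}
  ]
}
""")

per_category = count_annotations_per_category(example)
```

For a real annotation file, the same function could be applied to the result of `json.load` on the opened file; a strongly unbalanced count between categories would be worth knowing about before training a detector.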