
Data de- and re-dimensioning for optimized brokering access

Authors :
Fuka, Daniel R.
MacAlister, Charlotte R.
Khalsa, Siri-Jodha S.
Walter, M. Todd
Fuka, Mary Z.
Pierce, Suzanne A.
Easton, Zachary M.
Publication Year :
2014
Publisher :
figshare, 2014.

Abstract

Data brokering systems aim to facilitate the exchange of data and models between disciplines in an increasingly transparent manner, thereby accelerating scientific discovery. Researchers from many different yet complementary geoscience disciplines need to access datasets from other fields that use significantly different data formats and, in most cases, different time and space dimensionality than their own field commonly uses. This causes problems with large datasets, because the mismatch in dimensionality often means the entire dataset must be read in order to provide the limited information the researcher is actually interested in. In this poster we present methods for removing the dimensionality from datasets, both physically on the data-serving side and virtually from a broker-based perspective, so that the data brokering system can quickly access the smaller subset of data in the correct dimensionality for any given scientific field.

What we did
We de- and re-dimensioned the large reanalysis dataset CFSR to test alternative data paradigms that enhance the performance of single-location extraction while maintaining the needed performance for spatial time-step extraction.

Results
Single-location access performance increased by a factor of 10,000, though spatial time-step access slowed by a factor of 10. Space requirements increased by a factor of 7.

Discussion
Cross-science data issues: Data brokering systems facilitate the exchange of data between disciplines, but the broker cannot be responsible for optimizing the data structures for every science. A data paradigm that is optimal for one science is likely inefficient for others, owing to the native data dimensionalities of each field.
Benchmark quality: Although the computational systems hosting each dataset are very dissimilar in this benchmark, it is fair to say the comparison is biased toward the CISL infrastructure.
Hosting space requirements: Storage is another consideration. Because short series of values compress less efficiently than dense grids, and locational data must be added to each point, the new paradigm in this study requires four times the storage space.
Re-dimensioned usage: Over 22,000 single-location and multi-grid-point requests have been served from the TAMU and VTech servers, estimated at roughly 3,000 day-equivalents of the RDA subset service, which would not have been possible with the available computational resources.
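
The following is a minimal sketch, not taken from the poster, of the basic re-dimensioning idea: a gridded reanalysis product stored as (time, lat, lon) is rewritten as one small time-series file per grid cell, so a single-location request reads only that file instead of scanning every spatial slice. It assumes Python with xarray; the file names and the variable name "precip" are hypothetical stand-ins for the actual CFSR products.

```python
# Minimal sketch (assumed workflow, not the poster's actual code) of
# "re-dimensioning" a gridded reanalysis product. The native (time, lat, lon)
# layout is efficient for spatial time-step maps but forces a full-grid scan
# for a single-location time series; per-cell files trade storage and spatial
# access speed for very fast point access.
# File names and the variable name "precip" are hypothetical.
import os
import xarray as xr

ds = xr.open_dataset("cfsr_precip_gridded.nc")  # hypothetical gridded file

# Native paradigm: a point extraction still walks the full (time, lat, lon) cube.
point_native = ds["precip"].sel(lat=42.45, lon=-76.48, method="nearest")

# Re-dimensioned paradigm: one time-series file per grid cell, so a broker
# can answer a single-location request by opening only that file.
os.makedirs("by_point", exist_ok=True)
for lat in ds["lat"].values:
    for lon in ds["lon"].values:
        series = ds["precip"].sel(lat=lat, lon=lon)
        series.to_netcdf(f"by_point/precip_{float(lat):.2f}_{float(lon):.2f}.nc")
```

In this layout each per-cell file carries its own coordinate information, which is consistent with the storage increase reported above, while a spatial snapshot now requires opening many small files, consistent with the slower spatial time-step access.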

Details

Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....60b8f2402154089f554c3cd2a84ed09a
Full Text :
https://doi.org/10.6084/m9.figshare.1157478