Back to Search
Start Over
ParquetDB: A Lightweight Python Parquet-Based Database
- Publication Year :
- 2025
-
Abstract
- Traditional data storage formats and databases often introduce complexities and inefficiencies that hinder rapid iteration and adaptability. To address these challenges, we introduce ParquetDB, a Python-based database framework that leverages the Parquet file format's optimized columnar storage. ParquetDB offers efficient serialization and deserialization, native support for complex and nested data types, reduced dependency on indexing through predicate pushdown filtering, and enhanced portability due to its file-based storage system. Benchmarks show that ParquetDB outperforms traditional databases like SQLite and MongoDB in managing large volumes of data, especially when using data formats compatible with PyArrow. We validate ParquetDB's practical utility by applying it to the Alexandria 3D Materials Database, efficiently handling approximately 4.8 million complex and nested records. By addressing the inherent limitations of existing data storage systems and continuously evolving to meet future demands, ParquetDB has the potential to significantly streamline data management processes and accelerate research development in data-driven fields.<br />Comment: 41 pages, 12 figures
- Subjects :
- Computer Science - Databases
Physics - Data Analysis, Statistics and Probability
Subjects
Details
- Database :
- arXiv
- Publication Type :
- Report
- Accession number :
- edsarx.2502.05311
- Document Type :
- Working Paper