github.com/Qbeast-io/qbeast-spark
Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
Contributors: 7
Lines of Code: 1,815
From: 2021-09-23
To: 2022-06-03
About Qbeast-io/qbeast-spark
Qbeast Spark is an Apache Spark extension designed to enhance data processing in data lakehouses by providing multi-dimensional filtering and efficient data sampling capabilities. Built on Delta Lake, it maintains ACID properties for data integrity while enabling faster queries through intelligent indexing and sampling strategies. The project targets big data environments where query performance and statistical accuracy are critical.
The extension implements multi-column indexing using the Qbeast format, allowing users to filter and sample data across multiple dimensions simultaneously. A key feature is the improved sampling operator, which can read statistically significant subsets of files, together with a table tolerance model that lets users trade off sampling fraction against query accuracy. In a demonstrated benchmark, a query using 1% sampling ran approximately 22 times faster than on Delta format alone, with only 0.034% error.
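The trade-off between sampling fraction and accuracy can be illustrated with a minimal, self-contained sketch. It is not Qbeast's sampling operator: it draws a uniform random sample from an in-memory list rather than selecting a subset of files, and the data, function names, and fractions are all hypothetical. It only shows how the relative error of an aggregate tends to shrink as the sampled fraction grows.

```python
import random
import statistics


def sample_estimate(values, fraction, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    `fraction` is the share of rows to read (0 < fraction <= 1).
    """
    rng = random.Random(seed)
    k = max(1, int(len(values) * fraction))
    return statistics.fmean(rng.sample(values, k))


def relative_error(estimate, truth):
    """Absolute difference between estimate and truth, relative to truth."""
    return abs(estimate - truth) / abs(truth)


# Synthetic data (an assumption for illustration, not the benchmark dataset).
_rng = random.Random(42)
data = [_rng.gauss(100.0, 15.0) for _ in range(100_000)]
exact = statistics.fmean(data)

for fraction in (0.001, 0.01, 0.1):
    est = sample_estimate(data, fraction)
    err = relative_error(est, exact)
    print(f"fraction={fraction:<5} estimate={est:.3f} rel_error={err:.4%}")
```

The printed errors depend on the seed and the data distribution, so they should not be compared with the benchmark figure quoted above.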
Users can work with Qbeast through both Scala code and SQL syntax, with operations including indexing datasets, dynamic insertion of records, and table optimization to improve query performance. The project maintains compatibility with Apache Spark 3.5.x, Hadoop 3.3.x, and Delta Lake 3.1.x, and includes comprehensive documentation covering quickstart guides, algorithm details, and cloud storage recommendations. A Python index visualizer is also provided for examining the index structure and sampling metrics.
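As a rough sketch of what indexing and sampling might look like from PySpark, the helpers below wrap the standard DataFrame reader and writer calls. The `"qbeast"` format name and the `columnsToIndex` option are assumptions based on the project's documentation and may differ between versions; the function names `write_qbeast` and `read_sample` are hypothetical. The helpers take the DataFrame and session as arguments, so they need a Spark session with the Qbeast extension on the classpath before they can be used.

```python
def write_qbeast(df, path, columns_to_index):
    """Write a Spark DataFrame to `path`, indexed on the given columns.

    Assumption: the data source is registered as "qbeast" and accepts a
    comma-separated "columnsToIndex" option.
    """
    (
        df.write
        .mode("overwrite")
        .format("qbeast")
        .option("columnsToIndex", ",".join(columns_to_index))
        .save(path)
    )


def read_sample(spark, path, fraction):
    """Load the table at `path` and return a sample of roughly `fraction` rows.

    Uses the standard DataFrame.sample call; whether fewer files are read
    depends on the Qbeast extension being active in the session.
    """
    return spark.read.format("qbeast").load(path).sample(fraction=fraction)
```

With an existing DataFrame `df` and session `spark`, a call such as `write_qbeast(df, "/tmp/qbeast_table", ["user_id", "product_id"])` followed by `read_sample(spark, "/tmp/qbeast_table", 0.01)` would index the data and then read about 1% of it; the path and column names here are placeholders.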