The histogram, a simple representation of the distribution of a large data set, is widely used to estimate query result sizes, a task with many applications in various types of query processing. Estimates for histogram buckets that partially overlap the query region are computed under the assumption that the data objects in a bucket are uniformly distributed. However, organizing histogram buckets so that the objects in every bucket are uniformly distributed has been shown to be intractable. In most heuristic histogram methods, skews or clusters of objects often remain within the histogram buckets, which degrades the accuracy of the estimates. In this dissertation, we explore how to construct effective histograms for estimating the result sizes of multi-dimensional range queries. We propose three new histogram methods: the skew-tolerant histogram, the quad tree-based histogram, and the minimal-skew cover histogram.
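To make the uniformity assumption concrete, the following minimal sketch estimates the result size of a two-dimensional range query from a set of rectangular buckets. It is an illustration only, not one of the methods proposed in this dissertation. The names `Bucket`, `overlap_area`, and `estimate_result_size` are hypothetical, and the sketch assumes point objects and buckets with non-zero area: each bucket contributes its object count scaled by the fraction of its area that the query covers.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

# A query region is (x_lo, y_lo, x_hi, y_hi); this layout is an assumption of the sketch.
Rect = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Bucket:
    """Hypothetical histogram bucket: a rectangle plus the number of objects in it."""
    x_lo: float
    y_lo: float
    x_hi: float
    y_hi: float
    count: int

    def area(self) -> float:
        return (self.x_hi - self.x_lo) * (self.y_hi - self.y_lo)


def overlap_area(bucket: Bucket, query: Rect) -> float:
    """Area of the intersection of the bucket and the query (0.0 if disjoint)."""
    qx_lo, qy_lo, qx_hi, qy_hi = query
    width = min(bucket.x_hi, qx_hi) - max(bucket.x_lo, qx_lo)
    height = min(bucket.y_hi, qy_hi) - max(bucket.y_lo, qy_lo)
    if width <= 0 or height <= 0:
        return 0.0
    return width * height


def estimate_result_size(buckets: Iterable[Bucket], query: Rect) -> float:
    """Sum each bucket's count scaled by the covered fraction of its area.

    This is exactly the uniformity assumption: objects are taken to be
    spread evenly inside every bucket.
    """
    total = 0.0
    for bucket in buckets:
        area = bucket.area()
        if area <= 0:
            continue  # degenerate buckets are skipped in this sketch
        total += bucket.count * overlap_area(bucket, query) / area
    return total


if __name__ == "__main__":
    buckets = [Bucket(0, 0, 10, 10, 100), Bucket(10, 0, 20, 10, 40)]
    # The query covers half of each bucket.
    print(estimate_result_size(buckets, (5, 0, 15, 10)))
```

If the 100 objects of the first bucket were in fact clustered in its left half, the true count for that bucket would be close to zero while the estimate would still report 50. This is the kind of error that skew within a bucket introduces.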
In the first part of this dissertation, we propose a new histogram method, called the skew-tolerant histogram, for two- or three-dimensional geographic data objects, which are used in many real-world applications. The proposed method provides significantly enhanced accuracy in a robust manner, even for data sets with highly skewed distributions. When constructing a histogram, our method detects the clusters of objects present in various parts of a data set and uses them directly in organizing buckets. Through extensive performance experiments, we show that the proposed method improves accuracy considerably.
In the second part of this dissertation, we propose a new histogram method, called the quad tree-based histogram, which is based on an existing quad tree for multi-dimensional data sets. The compact representation of the t...