[CI] Paper theme
Alexander Rasin
alexr at cs.brown.edu
Sat Jun 14 20:48:50 EDT 2008
Hideaki and myself have met to go over this list again and here's some
comments/corrections to agree on.
> Thought it would be good down our new pitch and agree on it.
>
> We are claiming:
>
> - A method for detecting "soft dependencies" (what we have been calling
> correlations) between attributes
>
The detection of soft FDs is not novel. Both CORDS and BHUNT do this to
some degree, only BHUNT uses fixed buckets and is less general (algebraic
relations). CORDS applies no bucketing, but conceptually does the same
kind of detection.
> - Given a collection of such dependencies, a physical organization of the
> database that includes:
>
> a) A choice of a primary (clustered) attribute
>
We do not actually propose a choice of a primary attribute - we do some
experiments to show what a good/bad attribute clustering looks like and
how to estimate the benefit (cost model).
> b) A collection of secondary indices
>
> - A technique for bucketing secondary indices to decrease their size without
> giving up performance dramatically
>
yes
> - A cost model that demonstrates we can predict which indices will benefit
> and what the benefit will be
>
We select the indices that will benefit based on correlation with the
primary clustered attribute and selectivity. i.e. an uncorrelated column
or a column that has a high (> 0.5) selectivity is not useful as a CI.
Once good candidate predicates are selected, all CI combinations are tried
(i.e. 3 "good" columns result in 7 feasible column combinations)
We do have a cost model that estimates the benefit of each one of these.
> - A collection of performance results that shows that this works well in many
> real cases
>
Yes.
> Other things?
>
It is worth noting that neither CORDS nor BHUNT do multi-column
correlations (only pairs of columns) which is required to build
multi-column indexes. We do have some results (such as Table7)
demonstrating that multi-column index is better than single-column index
> Here are some things I am not sure about:
>
> - Besides index compression, do we think what we are proposing is novel
> (e.g., are soft dependencies plus our cost model and search techniques new,
> even if just running on unclustered indices)?
>
Multi-column part would be relatively new or at least different.
Our search techniques are not new, except for the search for a good bucket
size.
Neither bucketing search nor our cost model would be of any use in
the abcense of correlation between unclustered attributes that we are
trying to index and the clustered attribute.
> - Are our index compression techniques new?
>
Our index compression techniques can arguably be equated to "coarse
bitmaps" or "reduced-bitmap bloom filters" from the slides you sent.
At least we think it is possible to express bucketing (which, of
course, is not new) via these techniques.
Hideaki & Alex
More information about the CI
mailing list