[CI] BHUNT comparison experiments
Hideaki Kimura
hkimura at cs.brown.edu
Fri Jun 20 11:58:55 EDT 2008
The CI query at the knee point is:
SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND
1100 AND CATID IN (18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,35,37)
and it costs 350ms on average. So, excluding the cost of UNION, BHUNT
and CI costs about same.
Yeah, BHUNT is not suited for correlations with some fraction of errors
and the ebay-category example doesn't favor it. But, then, what's BHUNT
suited for? Any correlation except artificial TPCH has some errors.
BHUNT paper describes fedex/ship/ground as examples, which will have
some (1%? I experienced many!) proportion of exceptionally-delayed or
lost delivery. If BHUNT can't work with it, they shouldn't have titled
it as "fuzzy algebraic constraints" :-)
I think the point is that the query performance of exception table and
algebraic correlation is very similar to ours, but they don't have any
technique to compress it. CI suggests Clustered Attribute bucketing,
so CI keeps the query performance even there isn't built-in CATID.
CI has unclustered attribute bucketing to compress CI and studied
the tradeoff between performance and size. Those things are essential to
make it small and never touched in BHUNT/CORDS paper.
Alexander Rasin wrote:
>
> As I have said - I do not necessarily advocate this approach, I just
> think this is what BHUNT people meant.
>
> Yes, error table came out to be large, but that's because you picked an
> arbitrary 5% error rate and only 1 bump. For this distribution BHUNT
> would probably contain multiple bump segments, as in Figure6 in BHUNT
> paper.
>
> Also, the reason why your previous "IN" clause approach worked is
> because you used CATID that has built-in artificial bucketing. Kinda
> cheating. For unique primary this approach simply wouldn't work, it'd be
> too slow.
>
> Why don't you time a CI-augmented query for same data set? Why do I
> keep thinking it'll be slower....?
>
> I don't think the BHUNT algebraic relation is suited for something
> containing 5% errors. Just like they aren't suited for String
> correlations. I am pretty sure the idea in BHUNT paper was that bumps
> can be designed so that errors are rare. 5% is not rare. And I quote:
>
> "BHUNT tries to choose the sample size so as to control the number of
> exceptions, and hence the size of the exception tables."
--
Hideaki Kimura <hkimura at cs.brown.edu>
More information about the CI
mailing list