[CI] BHUNT comparison experiments
Alexander Rasin
alexr at cs.brown.edu
Fri Jun 20 13:16:39 EDT 2008
> The CI query at the knee point is:
> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100
> AND CATID IN (18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,35,37)
> and it costs 350ms on average. So, excluding the cost of UNION, BHUNT and CI
> costs about same.
>
Are you sure that CI does about the same? That must has to do with
experiment setup. How many CATID are there in total?
Looking at your previous email I see that the BHUNT bump falls into 20-33,
which is the majority of the data. Although 5% of the rows are errors,
they fall into 4 different primary key buckets.
That would be stacking the odds heavily into CI favor - that's the tight
error distribution. It is entirely possible for 5% of errors to be spread
throughout the whole table and not just 4 buckets, and that's when CI will
lose badly.
> Yeah, BHUNT is not suited for correlations with some fraction of errors and
> the ebay-category example doesn't favor it. But, then, what's BHUNT suited
> for? Any correlation except artificial TPCH has some errors. BHUNT paper
> describes fedex/ship/ground as examples, which will have some (1%? I
> experienced many!) proportion of exceptionally-delayed or lost delivery. If
> BHUNT can't work with it, they shouldn't have titled
> it as "fuzzy algebraic constraints" :-)
>
BHUNT is suited for correlations that are pretty much accurate. With a
very small number of errors. Gaussian distribution is not it.
Keep in mind that they can find multiple bumps.
This is not completely different from our approach - we need a correlation
where un-clustered bucket points to few clustered buckets. They need a
correlation where unclustered column algebraically converts into few
clustered ranges.
Of course we are more general, which is the point we are trying to make.
> I think the point is that the query performance of exception table and
> algebraic correlation is very similar to ours, but they don't have any
> technique to compress it. CI suggests Clustered Attribute bucketing,
> so CI keeps the query performance even there isn't built-in CATID.
> CI has unclustered attribute bucketing to compress CI and studied
> the tradeoff between performance and size. Those things are essential to make
> it small and never touched in BHUNT/CORDS paper.
>
All of that is true, but if we want to be really honest, we should also
point out the (rare) case where BHUNT would be much preferable to CI.
> Alexander Rasin wrote:
>>
>> As I have said - I do not necessarily advocate this approach, I just think
>> this is what BHUNT people meant.
>>
>> Yes, error table came out to be large, but that's because you picked an
>> arbitrary 5% error rate and only 1 bump. For this distribution BHUNT would
>> probably contain multiple bump segments, as in Figure6 in BHUNT paper.
>>
>> Also, the reason why your previous "IN" clause approach worked is because
>> you used CATID that has built-in artificial bucketing. Kinda cheating. For
>> unique primary this approach simply wouldn't work, it'd be too slow.
>>
>> Why don't you time a CI-augmented query for same data set? Why do I keep
>> thinking it'll be slower....?
>>
>> I don't think the BHUNT algebraic relation is suited for something
>> containing 5% errors. Just like they aren't suited for String
>> correlations. I am pretty sure the idea in BHUNT paper was that bumps can
>> be designed so that errors are rare. 5% is not rare. And I quote:
>>
>> "BHUNT tries to choose the sample size so as to control the number of
>> exceptions, and hence the size of the exception tables."
> --
> Hideaki Kimura <hkimura at cs.brown.edu>
>
More information about the CI
mailing list