[CI] BHUNT comparison experiments

Alexander Rasin alexr at cs.brown.edu
Fri Jun 20 13:16:39 EDT 2008


> The CI query at the knee point is:
>  SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100 
> AND CATID IN (18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,35,37)
> and it costs 350ms on average. So, excluding the cost of UNION, BHUNT and CI 
> costs about same.
>
Are you sure that CI does about the same?  That must has to do with 
experiment setup. How many CATID are there in total?

Looking at your previous email I see that the BHUNT bump falls into 20-33, 
which is the majority of the data.  Although 5% of the rows are errors, 
they fall into 4 different primary key buckets.
That would be stacking the odds heavily into CI favor - that's the tight 
error distribution. It is entirely possible for 5% of errors to be spread 
throughout the whole table and not just 4 buckets, and that's when CI will 
lose badly.

> Yeah, BHUNT is not suited for correlations with some fraction of errors and 
> the ebay-category example doesn't favor it. But, then, what's BHUNT suited 
> for? Any correlation except artificial TPCH has some errors. BHUNT paper 
> describes fedex/ship/ground as examples, which will have some (1%? I 
> experienced many!) proportion of exceptionally-delayed or lost delivery. If 
> BHUNT can't work with it, they shouldn't have titled
> it as "fuzzy algebraic constraints" :-)
>
BHUNT is suited for correlations that are pretty much accurate. With a 
very small number of errors. Gaussian distribution is not it.
Keep in mind that they can find multiple bumps.
This is not completely different from our approach - we need a correlation 
where un-clustered bucket points to few clustered buckets.  They need a 
correlation where unclustered column algebraically converts into few 
clustered ranges.
Of course we are more general, which is the point we are trying to make.

> I think the point is that the query performance of exception table and 
> algebraic correlation is very similar to ours, but they don't have any 
> technique to compress it. CI suggests Clustered Attribute bucketing,
> so CI keeps the query performance even there isn't built-in CATID.
> CI has unclustered attribute bucketing to compress CI and studied
> the tradeoff between performance and size. Those things are essential to make 
> it small and never touched in BHUNT/CORDS paper.
>
All of that is true, but if we want to be really honest, we should also 
point out the (rare) case where BHUNT would be much preferable to CI.


> Alexander Rasin wrote:
>> 
>> As I have said - I do not necessarily advocate this approach, I just think 
>> this is what BHUNT people meant.
>> 
>> Yes, error table came out to be large, but that's because you picked an 
>> arbitrary 5% error rate and only 1 bump.  For this distribution BHUNT would 
>> probably contain multiple bump segments, as in Figure6 in BHUNT paper.
>> 
>> Also, the reason why your previous "IN" clause approach worked is because 
>> you used CATID that has built-in artificial bucketing. Kinda cheating. For 
>> unique primary this approach simply wouldn't work, it'd be too slow.
>> 
>> Why don't you time a CI-augmented query for same data set?  Why do I keep 
>> thinking it'll be slower....?
>> 
>> I don't think the BHUNT algebraic relation is suited for something 
>> containing 5% errors.  Just like they aren't suited for String 
>> correlations. I am pretty sure the idea in BHUNT paper was that bumps can 
>> be designed so that errors are rare.  5% is not rare.  And I quote:
>> 
>> "BHUNT tries to choose the sample size so as to control the number of 
>> exceptions, and hence the size of the exception tables."
> -- 
> Hideaki Kimura <hkimura at cs.brown.edu>
>


More information about the CI mailing list