[CI] BHUNT comparison experiments
Samuel Madden
madden at csail.mit.edu
Fri Jun 20 08:49:31 EDT 2008
Still not sure I get what you mean by "in the absence of joins" --
seems like Hideaki's 2nd rewrite basically shows that BHUNT boils down
to doing something identical to what we do using a much larger index.
I think your c) might be good to experiment with.
I would also like us to think of a case where there is no algebraic
correlation but where we win still. I guess the state/city examples
are an obvious case of this, but we should list some others. This is
a definite advantage we can claim, in addition to index size.
-Sam
On Jun 20, 2008, at 7:52 AM, Alexander Rasin wrote:
>
> I would say that in the absence of joins, it is not surprising that
> BHUNT implementation is competitive with us, except for the size of
> exceptions table. In fact, I wouldn't be surprised if we lost - what
> exactly was the CI runtime as compared to the numbers you presented
> at the end?
>
> Note that this Guassian chosen distribution is relatively favorable
> to us. I can think of a distribution that is really bad for CI. In
> particular, imagine a case where the central bump remains the same,
> but number of errors is very small. For example, 10,000 error tuples
> uniformly distributed outside of the bump can cause CI to read
> additional 10000 primary key buckets, but BHUNT will only need to
> read 10K exception tuples, which is very cheap (both size and I/O-
> wise)
>
> If we want to have an experiment discussing the difference, perhaps
> we should measure performance for different error distribution.
> To summarize:
>
> a) Reading the BHUNT-bump will be the same between CI and BHUNT index
>
> b) Size of the exception table is known to be determined by # of
> error tuples
>
> c) The difference in performance will depend on outside-of-the-bump
> tuple distribution (good for us when errors are clustered together,
> bad for us when errors are uniformly distributed)
>
> Hence we can simply discuss points (a) and (b) as they are self-
> explanatory and if we want a figure we can do it with (c).
> Do we want a figure?
>
> Alex
>
>> Here's what I'm experimenting now.
>> As BHUNT paper has no experiments about fuzzy relations,
>> I kinda made up BHUNT implementation.
>>
>>
>> I changed the ebay price-category experiments a little bit.
>> To make the price-category correlation algebraic, mean value
>> of price is now determined by CATID.
>> Price = Gauss(mean=CATID*40, stdev=100)
>> this implies,
>> CATID = Gauss(mean=Price/40, stdev=2.5)
>>
>> We want to run the same query as CT:
>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND
>> 1100
>>
>> As the table itself is clustered on CATID, BHUNT will append a
>> predicate on CATID by the algebraic relation.
>>
>> Suppose BHUNT captures the bump as
>> CATID BETWEEN Price/40 -5 AND Price/40 +5
>> , which means 95% (+-stdev*2=95%) of tuples are in this bump and
>> the other 5% will be registered to exception table.
>>
>> BHUNT will rewrite the query as
>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND
>> 1100 AND (
>> CATID BETWEEN 20 AND 33
>> OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE
>> BETWEEN 1000 AND 1100)
>> )
>>
>> The exception table is constructed by following queries:
>> CREATE TABLE EXCEPTIONS (CATID INT NOT NULL, Price REAL NOT NULL);
>> INSERT INTO EXCEPTIONS (CATID, Price)
>> SELECT CATID, Price FROM ITEMS
>> WHERE CATID>PRICE/40 + 5 OR CATID<PRICE/40 - 5;
>> CREATE INDEX I_EXCEPTION ON EXCEPTIONS(PRICE);
>>
>>
>> The result so far is...
>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND
>> 1100 AND (
>> CATID BETWEEN 20 AND 33
>> OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE
>> BETWEEN 1000 AND 1100)
>> )
>> took 2 minutes on average because PostgreSQL chose sequential scan!
>> As far as I know, PostgreSQL doesn't have an index hint option to
>> force query planner to use a particular index.
>>
>> So, I split the query to
>> SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN 1000 AND
>> 1100
>> and
>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND
>> 1100 AND CATID IN (...union-ed CATIDs)
>>
>> The first query took 100ms on average. The second query, which is
>> equivalent to our rewritten query, took 500ms on average. So, the
>> query runtime is almost same as ours.
>>
>> The notable difference is that the exception table is huge. It has
>> 1.9 million tuples (yeah, about 5% of ITEMS table) and amounts to
>> more than 70M bytes for the table and 40M bytes for the index, in
>> total 110M bytes. This would be 22M bytes even the error rate is 1%
>> instead of 5%. CT is less than 0.5M bytes at the knee point.
>>
>>
>> Do you think this experiments conform to our claim? What experiments
>> should I do next?
>> --
>> Hideaki Kimura <hkimura at cs.brown.edu>
>> _______________________________________________
>> CI mailing list
>> CI at list.cs.brown.edu
>> http://list.cs.brown.edu/mailman/listinfo/ci
>>
> _______________________________________________
> CI mailing list
> CI at list.cs.brown.edu
> http://list.cs.brown.edu/mailman/listinfo/ci
More information about the CI
mailing list