[CI] BHUNT comparison experiments

Alexander Rasin alexr at cs.brown.edu
Fri Jun 20 09:06:58 EDT 2008


> Still not sure I get what you mean by "in the absence of joins" -- seems like 
> Hideaki's 2nd rewrite basically shows that BHUNT boils down to doing 
> something identical to what we do using a much larger index.
>
Sorry, I think I have confused myself here.  I was thinking of joins in a 
query.  However, all our indexes are per-table only, therefore ignore this 
part of my email :)

> I think your c) might be good to experiment with.
>
Note that if the way I interpreted the paper, the exception table contains 
the whole exception tuple (rather than just price/catid as Hideaki defined 
it)
This avoids doing the very join that is currently happening (CATID in 
(list of CATIDs)), and instead one can simply execute the query in 
parallel on whatever is read off the bump and the exception table.  In 
that case, I would expect BHUNT to outperform CIs, at least when 
errors tuples are not tightly clustered.

> I would also like us to think of a case where there is no algebraic 
> correlation but where we win still.  I guess the state/city examples are an 
> obvious case of this, but we should list some others.  This is a definite 
> advantage we can claim, in addition to index size.
>
Even in our eBay example, we can work with the correlation CATX <> Price
rather than CATID which is numeric.  That should be enough if we want to 
stick to the same data set.


>> I would say that in the absence of joins, it is not surprising that BHUNT 
>> implementation is competitive with us, except for the size of exceptions 
>> table. In fact, I wouldn't be surprised if we lost - what exactly was the 
>> CI runtime as compared to the numbers you presented at the end?
>> 
>> Note that this Guassian chosen distribution is relatively favorable to us. 
>> I can think of a distribution that is really bad for CI.   In particular, 
>> imagine a case where the central bump remains the same, but number of 
>> errors is very small. For example, 10,000 error tuples uniformly 
>> distributed outside of the bump can cause CI to read additional 10000 
>> primary key buckets, but BHUNT will only need to read 10K exception tuples, 
>> which is very cheap (both size and I/O-wise)
>> 
>> If we want to have an experiment discussing the difference, perhaps we 
>> should measure performance for different error distribution.
>> To summarize:
>> 
>> a) Reading the BHUNT-bump will be the same between CI and BHUNT index
>> 
>> b) Size of the exception table is known to be determined by # of error 
>> tuples
>> 
>> c) The difference in performance will depend on outside-of-the-bump tuple 
>> distribution (good for us when errors are clustered together, bad for us 
>> when errors are uniformly distributed)
>> 
>> Hence we can simply discuss points (a) and (b) as they are self-explanatory 
>> and if we want a figure we can do it with (c).
>> Do we want a figure?
>> 
>> Alex
>> 
>>> Here's what I'm experimenting now.
>>> As BHUNT paper has no experiments about fuzzy relations,
>>> I kinda made up BHUNT implementation.
>>> 
>>> 
>>> I changed the ebay price-category experiments a little bit.
>>> To make the price-category correlation algebraic, mean value
>>> of price is now determined by CATID.
>>> Price = Gauss(mean=CATID*40, stdev=100)
>>> this implies,
>>> CATID = Gauss(mean=Price/40, stdev=2.5)
>>> 
>>> We want to run the same query as CT:
>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100
>>> 
>>> As the table itself is clustered on CATID, BHUNT will append a predicate 
>>> on CATID by the algebraic relation.
>>> 
>>> Suppose BHUNT captures the bump as
>>> CATID BETWEEN Price/40 -5 AND Price/40 +5
>>> , which means 95% (+-stdev*2=95%) of tuples are in this bump and the other 
>>> 5% will be registered to exception table.
>>> 
>>> BHUNT will rewrite the query as
>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100 
>>> AND (
>>>  CATID BETWEEN 20 AND 33
>>>  OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN 
>>> 1000 AND 1100)
>>> )
>>> 
>>> The exception table is constructed by following queries:
>>> CREATE TABLE EXCEPTIONS (CATID INT NOT NULL, Price REAL NOT NULL);
>>> INSERT INTO EXCEPTIONS (CATID, Price)
>>>  SELECT CATID, Price FROM ITEMS
>>>  WHERE CATID>PRICE/40 + 5 OR CATID<PRICE/40 - 5;
>>> CREATE INDEX I_EXCEPTION ON EXCEPTIONS(PRICE);
>>> 
>>> 
>>> The result so far is...
>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100 
>>> AND (
>>>  CATID BETWEEN 20 AND 33
>>>  OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN 
>>> 1000 AND 1100)
>>> )
>>> took 2 minutes on average because PostgreSQL chose sequential scan!
>>> As far as I know, PostgreSQL doesn't have an index hint option to force 
>>> query planner to use a particular index.
>>> 
>>> So, I split the query to
>>> SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN 1000 AND 1100
>>> and
>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100 
>>> AND CATID IN (...union-ed CATIDs)
>>> 
>>> The first query took 100ms on average. The second query, which is 
>>> equivalent to our rewritten query, took 500ms on average. So, the query 
>>> runtime is almost same as ours.
>>> 
>>> The notable difference is that the exception table is huge. It has 1.9 
>>> million tuples (yeah, about 5% of ITEMS table) and amounts to more than 
>>> 70M bytes for the table and 40M bytes for the index, in total 110M bytes. 
>>> This would be 22M bytes even the error rate is 1% instead of 5%. CT is 
>>> less than 0.5M bytes at the knee point.
>>> 
>>> 
>>> Do you think this experiments conform to our claim? What experiments
>>> should I do next?
>>> -- 
>>> Hideaki Kimura <hkimura at cs.brown.edu>
>>> _______________________________________________
>>> CI mailing list
>>> CI at list.cs.brown.edu
>>> http://list.cs.brown.edu/mailman/listinfo/ci
>>> 
>> _______________________________________________
>> CI mailing list
>> CI at list.cs.brown.edu
>> http://list.cs.brown.edu/mailman/listinfo/ci


More information about the CI mailing list