[CI] BHUNT comparison experiments
Samuel Madden
madden at csail.mit.edu
Fri Jun 20 09:20:05 EDT 2008
On Jun 20, 2008, at 9:06 AM, Alexander Rasin wrote:
>
>> Still not sure I get what you mean by "in the absence of joins" --
>> seems like Hideaki's 2nd rewrite basically shows that BHUNT boils
>> down to doing something identical to what we do using a much larger
>> index.
>>
> Sorry, I think I have confused myself here. I was thinking of joins
> in a query. However, all our indexes are per-table only, therefore
> ignore this part of my email :)
>
>> I think your c) might be good to experiment with.
>>
> Note that if the way I interpreted the paper, the exception table
> contains the whole exception tuple (rather than just price/catid as
> Hideaki defined it)
> This avoids doing the very join that is currently happening (CATID
> in (list of CATIDs)), and instead one can simply execute the query
> in parallel on whatever is read off the bump and the exception
> table. In that case, I would expect BHUNT to outperform CIs, at
> least when errors tuples are not tightly clustered.
But that would make the exceptions table even bigger!
>
>
>> I would also like us to think of a case where there is no algebraic
>> correlation but where we win still. I guess the state/city
>> examples are an obvious case of this, but we should list some
>> others. This is a definite advantage we can claim, in addition to
>> index size.
>>
> Even in our eBay example, we can work with the correlation CATX <>
> Price
> rather than CATID which is numeric. That should be enough if we
> want to stick to the same data set.
>
>
>>> I would say that in the absence of joins, it is not surprising
>>> that BHUNT implementation is competitive with us, except for the
>>> size of exceptions table. In fact, I wouldn't be surprised if we
>>> lost - what exactly was the CI runtime as compared to the numbers
>>> you presented at the end?
>>> Note that this Guassian chosen distribution is relatively
>>> favorable to us. I can think of a distribution that is really bad
>>> for CI. In particular, imagine a case where the central bump
>>> remains the same, but number of errors is very small. For example,
>>> 10,000 error tuples uniformly distributed outside of the bump can
>>> cause CI to read additional 10000 primary key buckets, but BHUNT
>>> will only need to read 10K exception tuples, which is very cheap
>>> (both size and I/O-wise)
>>> If we want to have an experiment discussing the difference,
>>> perhaps we should measure performance for different error
>>> distribution.
>>> To summarize:
>>> a) Reading the BHUNT-bump will be the same between CI and BHUNT
>>> index
>>> b) Size of the exception table is known to be determined by # of
>>> error tuples
>>> c) The difference in performance will depend on outside-of-the-
>>> bump tuple distribution (good for us when errors are clustered
>>> together, bad for us when errors are uniformly distributed)
>>> Hence we can simply discuss points (a) and (b) as they are self-
>>> explanatory and if we want a figure we can do it with (c).
>>> Do we want a figure?
>>> Alex
>>>> Here's what I'm experimenting now.
>>>> As BHUNT paper has no experiments about fuzzy relations,
>>>> I kinda made up BHUNT implementation.
>>>> I changed the ebay price-category experiments a little bit.
>>>> To make the price-category correlation algebraic, mean value
>>>> of price is now determined by CATID.
>>>> Price = Gauss(mean=CATID*40, stdev=100)
>>>> this implies,
>>>> CATID = Gauss(mean=Price/40, stdev=2.5)
>>>> We want to run the same query as CT:
>>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000
>>>> AND 1100
>>>> As the table itself is clustered on CATID, BHUNT will append a
>>>> predicate on CATID by the algebraic relation.
>>>> Suppose BHUNT captures the bump as
>>>> CATID BETWEEN Price/40 -5 AND Price/40 +5
>>>> , which means 95% (+-stdev*2=95%) of tuples are in this bump and
>>>> the other 5% will be registered to exception table.
>>>> BHUNT will rewrite the query as
>>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000
>>>> AND 1100 AND (
>>>> CATID BETWEEN 20 AND 33
>>>> OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE
>>>> BETWEEN 1000 AND 1100)
>>>> )
>>>> The exception table is constructed by following queries:
>>>> CREATE TABLE EXCEPTIONS (CATID INT NOT NULL, Price REAL NOT NULL);
>>>> INSERT INTO EXCEPTIONS (CATID, Price)
>>>> SELECT CATID, Price FROM ITEMS
>>>> WHERE CATID>PRICE/40 + 5 OR CATID<PRICE/40 - 5;
>>>> CREATE INDEX I_EXCEPTION ON EXCEPTIONS(PRICE);
>>>> The result so far is...
>>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000
>>>> AND 1100 AND (
>>>> CATID BETWEEN 20 AND 33
>>>> OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE
>>>> BETWEEN 1000 AND 1100)
>>>> )
>>>> took 2 minutes on average because PostgreSQL chose sequential scan!
>>>> As far as I know, PostgreSQL doesn't have an index hint option to
>>>> force query planner to use a particular index.
>>>> So, I split the query to
>>>> SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN 1000
>>>> AND 1100
>>>> and
>>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000
>>>> AND 1100 AND CATID IN (...union-ed CATIDs)
>>>> The first query took 100ms on average. The second query, which is
>>>> equivalent to our rewritten query, took 500ms on average. So, the
>>>> query runtime is almost same as ours.
>>>> The notable difference is that the exception table is huge. It
>>>> has 1.9 million tuples (yeah, about 5% of ITEMS table) and
>>>> amounts to more than 70M bytes for the table and 40M bytes for
>>>> the index, in total 110M bytes. This would be 22M bytes even the
>>>> error rate is 1% instead of 5%. CT is less than 0.5M bytes at the
>>>> knee point.
>>>> Do you think this experiments conform to our claim? What
>>>> experiments
>>>> should I do next?
>>>> --
>>>> Hideaki Kimura <hkimura at cs.brown.edu>
>>>> _______________________________________________
>>>> CI mailing list
>>>> CI at list.cs.brown.edu
>>>> http://list.cs.brown.edu/mailman/listinfo/ci
>>> _______________________________________________
>>> CI mailing list
>>> CI at list.cs.brown.edu
>>> http://list.cs.brown.edu/mailman/listinfo/ci
More information about the CI
mailing list