[CI] BHUNT comparison experiments

Alexander Rasin alexr at cs.brown.edu
Fri Jun 20 10:25:53 EDT 2008


>> implementation is competitive with us, except for the size of exceptions 
>> table. In fact, I wouldn't be surprised if we lost - what exactly was the 
>> CI runtime as compared to the numbers you presented at the end?
> We can't simply compare with our CI Runtime because distribution was changed 
> from the figure on the paper, but it's sure that performance is almost same.
>
I am concerned we might actually be slower.  My impression is that 
gaussian distribution would give scattered errors (depending on the 
parameter) causing c_per_u to be high.

>> Note that this Guassian chosen distribution is relatively favorable to us. 
>> I can think of a distribution that is really bad for CI.   In particular, 
>> imagine a case where the central bump remains the same, but number of 
>> errors is very small. For example, 10,000 error tuples uniformly 
>> distributed outside of the bump can cause CI to read additional 10000 
>> primary key buckets, but BHUNT will only need to read 10K exception tuples, 
>> which is very cheap (both size and I/O-wise)
> I think distribution doesn't matter because in this experiment BHUNT uses 
> CATID to join. CATID can be thought as a bucketing of primary key. So, BHUNT 
> and CI is exactly same in terms of the second query.
>
> I think if we use ITEMID (identifier for each tuple) without bucketing for 
> BHUNT to join, it will screw up because 10,000 error tuples simply result in 
> 10,000 IN clause or 10,000 nested loop join (I think postgresql will choose 
> sequential scan in such a case).
> Should I do that experiment as (c)?
>

As I said earlier, my interpretation of the paper is that the error table 
actually contains the complete "error" rows.
In that case you can execute the same query on the error table and merge 
the results, which is likely to perform very well.

Using CATID which isn't unique and forces similar bucketing is an odd 
choice - regardless of whether it is beneficial or harmful to BHUNT, I 
don't think that's what they do....



>
>
>> 
>> If we want to have an experiment discussing the difference, perhaps we 
>> should measure performance for different error distribution.
>> To summarize:
>> 
>> a) Reading the BHUNT-bump will be the same between CI and BHUNT index
>> 
>> b) Size of the exception table is known to be determined by # of error 
>> tuples
>> 
>> c) The difference in performance will depend on outside-of-the-bump tuple 
>> distribution (good for us when errors are clustered together, bad for us 
>> when errors are uniformly distributed)
>> 
>> Hence we can simply discuss points (a) and (b) as they are self-explanatory 
>> and if we want a figure we can do it with (c).
>> Do we want a figure?
>> 
>> Alex
>> 
>>> Here's what I'm experimenting now.
>>> As BHUNT paper has no experiments about fuzzy relations,
>>> I kinda made up BHUNT implementation.
>>> 
>>> 
>>> I changed the ebay price-category experiments a little bit.
>>> To make the price-category correlation algebraic, mean value
>>> of price is now determined by CATID.
>>>  Price = Gauss(mean=CATID*40, stdev=100)
>>> this implies,
>>>  CATID = Gauss(mean=Price/40, stdev=2.5)
>>> 
>>> We want to run the same query as CT:
>>>  SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100
>>> 
>>> As the table itself is clustered on CATID, BHUNT will append a predicate 
>>> on CATID by the algebraic relation.
>>> 
>>> Suppose BHUNT captures the bump as
>>>  CATID BETWEEN Price/40 -5 AND Price/40 +5
>>> , which means 95% (+-stdev*2=95%) of tuples are in this bump and the other 
>>> 5% will be registered to exception table.
>>> 
>>> BHUNT will rewrite the query as
>>>  SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100 
>>> AND (
>>>    CATID BETWEEN 20 AND 33
>>>    OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN 
>>> 1000 AND 1100)
>>>  )
>>> 
>>> The exception table is constructed by following queries:
>>>  CREATE TABLE EXCEPTIONS (CATID INT NOT NULL, Price REAL NOT NULL);
>>>  INSERT INTO EXCEPTIONS (CATID, Price)
>>>    SELECT CATID, Price FROM ITEMS
>>>    WHERE CATID>PRICE/40 + 5 OR CATID<PRICE/40 - 5;
>>>  CREATE INDEX I_EXCEPTION ON EXCEPTIONS(PRICE);
>>> 
>>> 
>>> The result so far is...
>>>  SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100 
>>> AND (
>>>    CATID BETWEEN 20 AND 33
>>>    OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN 
>>> 1000 AND 1100)
>>>  )
>>> took 2 minutes on average because PostgreSQL chose sequential scan!
>>> As far as I know, PostgreSQL doesn't have an index hint option to force 
>>> query planner to use a particular index.
>>> 
>>> So, I split the query to
>>> SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN 1000 AND 1100
>>>  and
>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100 
>>> AND CATID IN (...union-ed CATIDs)
>>> 
>>> The first query took 100ms on average. The second query, which is 
>>> equivalent to our rewritten query, took 500ms on average. So, the query 
>>> runtime is almost same as ours.
>>> 
>>> The notable difference is that the exception table is huge. It has 1.9 
>>> million tuples (yeah, about 5% of ITEMS table) and amounts to more than 
>>> 70M bytes for the table and 40M bytes for the index, in total 110M bytes. 
>>> This would be 22M bytes even the error rate is 1% instead of 5%. CT is 
>>> less than 0.5M bytes at the knee point.
>>> 
>>> 
>>> Do you think this experiments conform to our claim? What experiments
>>> should I do next?
>>> -- 
>>> Hideaki Kimura <hkimura at cs.brown.edu>
>>> _______________________________________________
>>> CI mailing list
>>> CI at list.cs.brown.edu
>>> http://list.cs.brown.edu/mailman/listinfo/ci
>>> 
>> 
>
> -- 
> Hideaki Kimura <hkimura at cs.brown.edu>
>


More information about the CI mailing list