[CI] BHUNT comparison experiments

Hideaki Kimura hkimura at cs.brown.edu
Thu Jun 19 20:31:06 EDT 2008


Here's what I'm experimenting now.
As BHUNT paper has no experiments about fuzzy relations,
I kinda made up BHUNT implementation.


I changed the ebay price-category experiments a little bit.
To make the price-category correlation algebraic, mean value
of price is now determined by CATID.
   Price = Gauss(mean=CATID*40, stdev=100)
this implies,
   CATID = Gauss(mean=Price/40, stdev=2.5)

We want to run the same query as CT:
   SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100

As the table itself is clustered on CATID, BHUNT will append a predicate 
on CATID by the algebraic relation.

Suppose BHUNT captures the bump as
   CATID BETWEEN Price/40 -5 AND Price/40 +5
, which means 95% (+-stdev*2=95%) of tuples are in this bump and the 
other 5% will be registered to exception table.

BHUNT will rewrite the query as
   SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 
1100 AND (
     CATID BETWEEN 20 AND 33
     OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE 
BETWEEN 1000 AND 1100)
   )

The exception table is constructed by following queries:
   CREATE TABLE EXCEPTIONS (CATID INT NOT NULL, Price REAL NOT NULL);
   INSERT INTO EXCEPTIONS (CATID, Price)
     SELECT CATID, Price FROM ITEMS
     WHERE CATID>PRICE/40 + 5 OR CATID<PRICE/40 - 5;
   CREATE INDEX I_EXCEPTION ON EXCEPTIONS(PRICE);


The result so far is...
   SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 
1100 AND (
     CATID BETWEEN 20 AND 33
     OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE 
BETWEEN 1000 AND 1100)
   )
took 2 minutes on average because PostgreSQL chose sequential scan!
As far as I know, PostgreSQL doesn't have an index hint option to force 
query planner to use a particular index.

So, I split the query to
SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN 1000 AND 1100
   and
SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100 
AND CATID IN (...union-ed CATIDs)

The first query took 100ms on average. The second query, which is 
equivalent to our rewritten query, took 500ms on average. So, the query 
runtime is almost same as ours.

The notable difference is that the exception table is huge. It has 1.9 
million tuples (yeah, about 5% of ITEMS table) and amounts to more than 
70M bytes for the table and 40M bytes for the index, in total 110M 
bytes. This would be 22M bytes even the error rate is 1% instead of 5%. 
CT is less than 0.5M bytes at the knee point.


Do you think this experiments conform to our claim? What experiments
should I do next?
-- 
Hideaki Kimura <hkimura at cs.brown.edu>


More information about the CI mailing list