[CI] BHUNT comparison experiments

George Huo george.huo at gmail.com
Fri Jun 20 07:53:25 EDT 2008


On Thu, Jun 19, 2008 at 5:31 PM, Hideaki Kimura <hkimura at cs.brown.edu> wrote:
> Here's what I'm experimenting now.
> As BHUNT paper has no experiments about fuzzy relations,
> I kinda made up BHUNT implementation.
>
>
> I changed the ebay price-category experiments a little bit.
> To make the price-category correlation algebraic, mean value
> of price is now determined by CATID.
>  Price = Gauss(mean=CATID*40, stdev=100)
> this implies,
>  CATID = Gauss(mean=Price/40, stdev=2.5)
>
> We want to run the same query as CT:
>  SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100
>
> As the table itself is clustered on CATID, BHUNT will append a predicate on
> CATID by the algebraic relation.
>
> Suppose BHUNT captures the bump as
>  CATID BETWEEN Price/40 -5 AND Price/40 +5
> , which means 95% (+-stdev*2=95%) of tuples are in this bump and the other
> 5% will be registered to exception table.
>
> BHUNT will rewrite the query as
>  SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100
> AND (
>    CATID BETWEEN 20 AND 33
>    OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN
> 1000 AND 1100)
>  )
>
> The exception table is constructed by following queries:
>  CREATE TABLE EXCEPTIONS (CATID INT NOT NULL, Price REAL NOT NULL);
>  INSERT INTO EXCEPTIONS (CATID, Price)
>    SELECT CATID, Price FROM ITEMS
>    WHERE CATID>PRICE/40 + 5 OR CATID<PRICE/40 - 5;
>  CREATE INDEX I_EXCEPTION ON EXCEPTIONS(PRICE);
>
>
> The result so far is...
>  SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100
> AND (
>    CATID BETWEEN 20 AND 33
>    OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN
> 1000 AND 1100)
>  )
> took 2 minutes on average because PostgreSQL chose sequential scan!
> As far as I know, PostgreSQL doesn't have an index hint option to force
> query planner to use a particular index.

You can play with enable_seqscan to force index lookups globally.

>
> So, I split the query to
> SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN 1000 AND 1100

How's the performance without the DISTINCT in the inner select?

>  and
> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100 AND
> CATID IN (...union-ed CATIDs)

Hmm, are the union-ed CATIDs just the exception table IDs, or all
relevant IDs? How would BHUNT get from 'CATID BETWEEN 20 AND 33' to
this step?

>
> The first query took 100ms on average. The second query, which is equivalent
> to our rewritten query, took 500ms on average. So, the query runtime is
> almost same as ours.
>
> The notable difference is that the exception table is huge. It has 1.9
> million tuples (yeah, about 5% of ITEMS table) and amounts to more than 70M
> bytes for the table and 40M bytes for the index, in total 110M bytes. This
> would be 22M bytes even the error rate is 1% instead of 5%. CT is less than
> 0.5M bytes at the knee point.
>
>
> Do you think this experiments conform to our claim? What experiments
> should I do next?
> --
> Hideaki Kimura <hkimura at cs.brown.edu>
> _______________________________________________
> CI mailing list
> CI at list.cs.brown.edu
> http://list.cs.brown.edu/mailman/listinfo/ci
>


More information about the CI mailing list