[CI] BHUNT comparison experiments

Alexander Rasin alexr at cs.brown.edu
Fri Jun 20 11:13:26 EDT 2008


As I have said - I do not necessarily advocate this approach, I just think 
this is what BHUNT people meant.

Yes, error table came out to be large, but that's because you picked an 
arbitrary 5% error rate and only 1 bump.  For this distribution BHUNT 
would probably contain multiple bump segments, as in Figure6 in BHUNT 
paper.

Also, the reason why your previous "IN" clause approach worked is because 
you used CATID that has built-in artificial bucketing. Kinda cheating. For 
unique primary this approach simply wouldn't work, it'd be too slow.

Why don't you time a CI-augmented query for same data set?  Why do I keep 
thinking it'll be slower....?

I don't think the BHUNT algebraic relation is suited for something 
containing 5% errors.  Just like they aren't suited for String 
correlations. I am pretty sure the idea in BHUNT paper was that bumps can 
be designed so that errors are rare.  5% is not rare.  And I quote:

"BHUNT tries to choose the sample size so as to control the number of 
exceptions, and hence the size of the exception tables."



> Okay, here's what I got with an error table with all attributes.
>
> CREATE TABLE EXCEPTIONS (CATID INT NOT NULL, ItemID INT NOT NULL,
> CAT1 VARCHAR(40) NOT NULL, CAT2 VARCHAR(40) NOT NULL,
> CAT3 VARCHAR(40) NOT NULL, CAT4 VARCHAR(40) NOT NULL, CAT5 VARCHAR(40) NOT 
> NULL,
> CAT6 VARCHAR(40) NOT NULL, Price REAL NOT NULL);
>
> INSERT INTO EXCEPTIONS (CATID, ItemID ,CAT1, CAT2, CAT3, CAT4, CAT5, CAT6, 
> Price)
> SELECT CATID, ItemID ,CAT1, CAT2, CAT3, CAT4, CAT5, CAT6, Price FROM ITEMS
> WHERE CATID>PRICE/40.467 + 5 OR CATID<PRICE/40.467 - 5;
>
> The exception table is 266MB! If the database has many correlations to 
> exploit, such a large exception table doesn't work.
>
>
> The rewritten query is now:
> SELECT COUNT(DISTINCT CAT3) FROM (SELECT CAT3 FROM ITEMS WHERE PRICE BETWEEN 
> 1000 AND 1100 AND (CATID BETWEEN 20 AND 33) UNION SELECT CAT3 FROM EXCEPTIONS 
> WHERE PRICE BETWEEN 1000 AND 1100) AS V
>
> which on average took 800ms (detail below).
>
> Aggregate  (cost=21.54..21.55 rows=1 width=98) (actual time=782.046..782.046 
> rows=1 loops=1)
>   ->  Unique  (cost=21.50..21.51 rows=2 width=15) (actual 
> time=779.880..781.975 rows=18 loops=1)
>         ->  Sort  (cost=21.50..21.51 rows=2 width=15) (actual 
> time=779.877..780.750 rows=3811 loops=1)
>               Sort Key: items.cat3
>               Sort Method:  quicksort  Memory: 285kB
>               ->  Append  (cost=0.00..21.49 rows=2 width=15) (actual 
> time=99.936..762.980 rows=3811 loops=1)
>                     ->  Index Scan using pk_items on items (cost=0.00..12.99 
> rows=1 width=15) (actual time=99.934..564.299 rows=3604 loops=1)
>                           Index Cond: ((catid >= 20) AND (catid <= 33))
>                           Filter: ((price >= 1000::double precision) AND 
> (price <= 1100::double precision))
>                     ->  Index Scan using i_exception on exceptions 
> (cost=0.00..8.48 rows=1 width=15) (actual time=132.524..197.014 rows=207 
> loops=1)
>                           Index Cond: ((price >= 1000::double precision) AND 
> (price <= 1100::double precision))
> Total runtime: 782.723 ms
> (12 rows)
>
> The runtime is almost same as previous, a little bit (200ms) slower for 
> Append(union).
>
>
> Alexander Rasin wrote:
>> 
>>>> implementation is competitive with us, except for the size of exceptions 
>>>> table. In fact, I wouldn't be surprised if we lost - what exactly was the 
>>>> CI runtime as compared to the numbers you presented at the end?
>>> We can't simply compare with our CI Runtime because distribution was 
>>> changed from the figure on the paper, but it's sure that performance is 
>>> almost same.
>>> 
>> I am concerned we might actually be slower.  My impression is that gaussian 
>> distribution would give scattered errors (depending on the parameter) 
>> causing c_per_u to be high.
>> 
>>>> Note that this Guassian chosen distribution is relatively favorable to 
>>>> us. I can think of a distribution that is really bad for CI.   In 
>>>> particular, imagine a case where the central bump remains the same, but 
>>>> number of errors is very small. For example, 10,000 error tuples 
>>>> uniformly distributed outside of the bump can cause CI to read additional 
>>>> 10000 primary key buckets, but BHUNT will only need to read 10K exception 
>>>> tuples, which is very cheap (both size and I/O-wise)
>>> I think distribution doesn't matter because in this experiment BHUNT uses 
>>> CATID to join. CATID can be thought as a bucketing of primary key. So, 
>>> BHUNT and CI is exactly same in terms of the second query.
>>> 
>>> I think if we use ITEMID (identifier for each tuple) without bucketing for 
>>> BHUNT to join, it will screw up because 10,000 error tuples simply result 
>>> in 10,000 IN clause or 10,000 nested loop join (I think postgresql will 
>>> choose sequential scan in such a case).
>>> Should I do that experiment as (c)?
>>> 
>> 
>> As I said earlier, my interpretation of the paper is that the error table 
>> actually contains the complete "error" rows.
>> In that case you can execute the same query on the error table and merge 
>> the results, which is likely to perform very well.
>> 
>> Using CATID which isn't unique and forces similar bucketing is an odd 
>> choice - regardless of whether it is beneficial or harmful to BHUNT, I 
>> don't think that's what they do....
>> 
>> 
>> 
>>> 
>>> 
>>>> 
>>>> If we want to have an experiment discussing the difference, perhaps we 
>>>> should measure performance for different error distribution.
>>>> To summarize:
>>>> 
>>>> a) Reading the BHUNT-bump will be the same between CI and BHUNT index
>>>> 
>>>> b) Size of the exception table is known to be determined by # of error 
>>>> tuples
>>>> 
>>>> c) The difference in performance will depend on outside-of-the-bump tuple 
>>>> distribution (good for us when errors are clustered together, bad for us 
>>>> when errors are uniformly distributed)
>>>> 
>>>> Hence we can simply discuss points (a) and (b) as they are 
>>>> self-explanatory and if we want a figure we can do it with (c).
>>>> Do we want a figure?
>>>> 
>>>> Alex
>>>> 
>>>>> Here's what I'm experimenting now.
>>>>> As BHUNT paper has no experiments about fuzzy relations,
>>>>> I kinda made up BHUNT implementation.
>>>>> 
>>>>> 
>>>>> I changed the ebay price-category experiments a little bit.
>>>>> To make the price-category correlation algebraic, mean value
>>>>> of price is now determined by CATID.
>>>>>  Price = Gauss(mean=CATID*40, stdev=100)
>>>>> this implies,
>>>>>  CATID = Gauss(mean=Price/40, stdev=2.5)
>>>>> 
>>>>> We want to run the same query as CT:
>>>>>  SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 
>>>>> 1100
>>>>> 
>>>>> As the table itself is clustered on CATID, BHUNT will append a predicate 
>>>>> on CATID by the algebraic relation.
>>>>> 
>>>>> Suppose BHUNT captures the bump as
>>>>>  CATID BETWEEN Price/40 -5 AND Price/40 +5
>>>>> , which means 95% (+-stdev*2=95%) of tuples are in this bump and the 
>>>>> other 5% will be registered to exception table.
>>>>> 
>>>>> BHUNT will rewrite the query as
>>>>>  SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 
>>>>> 1100 AND (
>>>>>    CATID BETWEEN 20 AND 33
>>>>>    OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE 
>>>>> BETWEEN 1000 AND 1100)
>>>>>  )
>>>>> 
>>>>> The exception table is constructed by following queries:
>>>>>  CREATE TABLE EXCEPTIONS (CATID INT NOT NULL, Price REAL NOT NULL);
>>>>>  INSERT INTO EXCEPTIONS (CATID, Price)
>>>>>    SELECT CATID, Price FROM ITEMS
>>>>>    WHERE CATID>PRICE/40 + 5 OR CATID<PRICE/40 - 5;
>>>>>  CREATE INDEX I_EXCEPTION ON EXCEPTIONS(PRICE);
>>>>> 
>>>>> 
>>>>> The result so far is...
>>>>>  SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 
>>>>> 1100 AND (
>>>>>    CATID BETWEEN 20 AND 33
>>>>>    OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE 
>>>>> BETWEEN 1000 AND 1100)
>>>>>  )
>>>>> took 2 minutes on average because PostgreSQL chose sequential scan!
>>>>> As far as I know, PostgreSQL doesn't have an index hint option to force 
>>>>> query planner to use a particular index.
>>>>> 
>>>>> So, I split the query to
>>>>> SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN 1000 AND 1100
>>>>>  and
>>>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND 1100 
>>>>> AND CATID IN (...union-ed CATIDs)
>>>>> 
>>>>> The first query took 100ms on average. The second query, which is 
>>>>> equivalent to our rewritten query, took 500ms on average. So, the query 
>>>>> runtime is almost same as ours.
>>>>> 
>>>>> The notable difference is that the exception table is huge. It has 1.9 
>>>>> million tuples (yeah, about 5% of ITEMS table) and amounts to more than 
>>>>> 70M bytes for the table and 40M bytes for the index, in total 110M 
>>>>> bytes. This would be 22M bytes even the error rate is 1% instead of 5%. 
>>>>> CT is less than 0.5M bytes at the knee point.
>>>>> 
>>>>> 
>>>>> Do you think this experiments conform to our claim? What experiments
>>>>> should I do next?
>>>>> -- 
>>>>> Hideaki Kimura <hkimura at cs.brown.edu>
>>>>> _______________________________________________
>>>>> CI mailing list
>>>>> CI at list.cs.brown.edu
>>>>> http://list.cs.brown.edu/mailman/listinfo/ci
>>>>> 
>>>> 
>>> 
>>> -- 
>>> Hideaki Kimura <hkimura at cs.brown.edu>
>>> 
>> 
>
>
> -- 
> Hideaki Kimura <hkimura at cs.brown.edu>
>


More information about the CI mailing list