[CI] BHUNT comparison experiments
Hideaki Kimura
hkimura at cs.brown.edu
Fri Jun 20 10:55:21 EDT 2008
Okay, here's what I got with an error table with all attributes.
CREATE TABLE EXCEPTIONS (CATID INT NOT NULL, ItemID INT NOT NULL,
CAT1 VARCHAR(40) NOT NULL, CAT2 VARCHAR(40) NOT NULL,
CAT3 VARCHAR(40) NOT NULL, CAT4 VARCHAR(40) NOT NULL, CAT5 VARCHAR(40)
NOT NULL,
CAT6 VARCHAR(40) NOT NULL, Price REAL NOT NULL);
INSERT INTO EXCEPTIONS (CATID, ItemID ,CAT1, CAT2, CAT3, CAT4, CAT5,
CAT6, Price)
SELECT CATID, ItemID ,CAT1, CAT2, CAT3, CAT4, CAT5, CAT6, Price FROM ITEMS
WHERE CATID>PRICE/40.467 + 5 OR CATID<PRICE/40.467 - 5;
The exception table is 266MB! If the database has many correlations to
exploit, such a large exception table doesn't work.
The rewritten query is now:
SELECT COUNT(DISTINCT CAT3) FROM (SELECT CAT3 FROM ITEMS WHERE PRICE
BETWEEN 1000 AND 1100 AND (CATID BETWEEN 20 AND 33) UNION SELECT CAT3
FROM EXCEPTIONS WHERE PRICE BETWEEN 1000 AND 1100) AS V
which on average took 800ms (detail below).
Aggregate (cost=21.54..21.55 rows=1 width=98) (actual
time=782.046..782.046 rows=1 loops=1)
-> Unique (cost=21.50..21.51 rows=2 width=15) (actual
time=779.880..781.975 rows=18 loops=1)
-> Sort (cost=21.50..21.51 rows=2 width=15) (actual
time=779.877..780.750 rows=3811 loops=1)
Sort Key: items.cat3
Sort Method: quicksort Memory: 285kB
-> Append (cost=0.00..21.49 rows=2 width=15) (actual
time=99.936..762.980 rows=3811 loops=1)
-> Index Scan using pk_items on items
(cost=0.00..12.99 rows=1 width=15) (actual time=99.934..564.299
rows=3604 loops=1)
Index Cond: ((catid >= 20) AND (catid <= 33))
Filter: ((price >= 1000::double precision)
AND (price <= 1100::double precision))
-> Index Scan using i_exception on exceptions
(cost=0.00..8.48 rows=1 width=15) (actual time=132.524..197.014 rows=207
loops=1)
Index Cond: ((price >= 1000::double
precision) AND (price <= 1100::double precision))
Total runtime: 782.723 ms
(12 rows)
The runtime is almost same as previous, a little bit (200ms) slower for
Append(union).
Alexander Rasin wrote:
>
>>> implementation is competitive with us, except for the size of
>>> exceptions table. In fact, I wouldn't be surprised if we lost - what
>>> exactly was the CI runtime as compared to the numbers you presented
>>> at the end?
>> We can't simply compare with our CI Runtime because distribution was
>> changed from the figure on the paper, but it's sure that performance
>> is almost same.
>>
> I am concerned we might actually be slower. My impression is that
> gaussian distribution would give scattered errors (depending on the
> parameter) causing c_per_u to be high.
>
>>> Note that this Guassian chosen distribution is relatively favorable
>>> to us. I can think of a distribution that is really bad for CI. In
>>> particular, imagine a case where the central bump remains the same,
>>> but number of errors is very small. For example, 10,000 error tuples
>>> uniformly distributed outside of the bump can cause CI to read
>>> additional 10000 primary key buckets, but BHUNT will only need to
>>> read 10K exception tuples, which is very cheap (both size and I/O-wise)
>> I think distribution doesn't matter because in this experiment BHUNT
>> uses CATID to join. CATID can be thought as a bucketing of primary
>> key. So, BHUNT and CI is exactly same in terms of the second query.
>>
>> I think if we use ITEMID (identifier for each tuple) without bucketing
>> for BHUNT to join, it will screw up because 10,000 error tuples simply
>> result in 10,000 IN clause or 10,000 nested loop join (I think
>> postgresql will choose sequential scan in such a case).
>> Should I do that experiment as (c)?
>>
>
> As I said earlier, my interpretation of the paper is that the error
> table actually contains the complete "error" rows.
> In that case you can execute the same query on the error table and merge
> the results, which is likely to perform very well.
>
> Using CATID which isn't unique and forces similar bucketing is an odd
> choice - regardless of whether it is beneficial or harmful to BHUNT, I
> don't think that's what they do....
>
>
>
>>
>>
>>>
>>> If we want to have an experiment discussing the difference, perhaps
>>> we should measure performance for different error distribution.
>>> To summarize:
>>>
>>> a) Reading the BHUNT-bump will be the same between CI and BHUNT index
>>>
>>> b) Size of the exception table is known to be determined by # of
>>> error tuples
>>>
>>> c) The difference in performance will depend on outside-of-the-bump
>>> tuple distribution (good for us when errors are clustered together,
>>> bad for us when errors are uniformly distributed)
>>>
>>> Hence we can simply discuss points (a) and (b) as they are
>>> self-explanatory and if we want a figure we can do it with (c).
>>> Do we want a figure?
>>>
>>> Alex
>>>
>>>> Here's what I'm experimenting now.
>>>> As BHUNT paper has no experiments about fuzzy relations,
>>>> I kinda made up BHUNT implementation.
>>>>
>>>>
>>>> I changed the ebay price-category experiments a little bit.
>>>> To make the price-category correlation algebraic, mean value
>>>> of price is now determined by CATID.
>>>> Price = Gauss(mean=CATID*40, stdev=100)
>>>> this implies,
>>>> CATID = Gauss(mean=Price/40, stdev=2.5)
>>>>
>>>> We want to run the same query as CT:
>>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND
>>>> 1100
>>>>
>>>> As the table itself is clustered on CATID, BHUNT will append a
>>>> predicate on CATID by the algebraic relation.
>>>>
>>>> Suppose BHUNT captures the bump as
>>>> CATID BETWEEN Price/40 -5 AND Price/40 +5
>>>> , which means 95% (+-stdev*2=95%) of tuples are in this bump and the
>>>> other 5% will be registered to exception table.
>>>>
>>>> BHUNT will rewrite the query as
>>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND
>>>> 1100 AND (
>>>> CATID BETWEEN 20 AND 33
>>>> OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE
>>>> BETWEEN 1000 AND 1100)
>>>> )
>>>>
>>>> The exception table is constructed by following queries:
>>>> CREATE TABLE EXCEPTIONS (CATID INT NOT NULL, Price REAL NOT NULL);
>>>> INSERT INTO EXCEPTIONS (CATID, Price)
>>>> SELECT CATID, Price FROM ITEMS
>>>> WHERE CATID>PRICE/40 + 5 OR CATID<PRICE/40 - 5;
>>>> CREATE INDEX I_EXCEPTION ON EXCEPTIONS(PRICE);
>>>>
>>>>
>>>> The result so far is...
>>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND
>>>> 1100 AND (
>>>> CATID BETWEEN 20 AND 33
>>>> OR CATID IN (SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE
>>>> BETWEEN 1000 AND 1100)
>>>> )
>>>> took 2 minutes on average because PostgreSQL chose sequential scan!
>>>> As far as I know, PostgreSQL doesn't have an index hint option to
>>>> force query planner to use a particular index.
>>>>
>>>> So, I split the query to
>>>> SELECT DISTINCT CATID FROM EXCEPTIONS WHERE PRICE BETWEEN 1000 AND 1100
>>>> and
>>>> SELECT COUNT(DISTINCT CAT3) FROM ITEMS WHERE PRICE BETWEEN 1000 AND
>>>> 1100 AND CATID IN (...union-ed CATIDs)
>>>>
>>>> The first query took 100ms on average. The second query, which is
>>>> equivalent to our rewritten query, took 500ms on average. So, the
>>>> query runtime is almost same as ours.
>>>>
>>>> The notable difference is that the exception table is huge. It has
>>>> 1.9 million tuples (yeah, about 5% of ITEMS table) and amounts to
>>>> more than 70M bytes for the table and 40M bytes for the index, in
>>>> total 110M bytes. This would be 22M bytes even the error rate is 1%
>>>> instead of 5%. CT is less than 0.5M bytes at the knee point.
>>>>
>>>>
>>>> Do you think this experiments conform to our claim? What experiments
>>>> should I do next?
>>>> --
>>>> Hideaki Kimura <hkimura at cs.brown.edu>
>>>> _______________________________________________
>>>> CI mailing list
>>>> CI at list.cs.brown.edu
>>>> http://list.cs.brown.edu/mailman/listinfo/ci
>>>>
>>>
>>
>> --
>> Hideaki Kimura <hkimura at cs.brown.edu>
>>
>
--
Hideaki Kimura <hkimura at cs.brown.edu>
More information about the CI
mailing list