[CI] Refactoring Section 3 and 4

Hideaki Kimura hkimura at cs.brown.edu
Tue Jun 24 17:41:15 EDT 2008


Alex and I had reviewed Section 3 (MODEL) and 4 (ESTIMATING CT COST). 
Following is the suggestion from us to be discussed at the upcoming 
meeting at 6p.

3. MODEL
3.1-3.2 (Preliminaries, Seq Scan)
No change.

3.3 Cost of CT
This subsection will remain without much change except that we'll say 
that c_per_u depends on bucketing and therefore too wide bucketing 
affects query performance.

About Sam's comment on this section:
We think the cost model is not similar to CORDS (actually, CORDS didn't 
provide cost model), so we don't have to refer CORDS/BHUNT here. Is it 
right?

3.4-3.5 Cost of Unclustered B+Tree, Discussion
We want to do a substantial respinning here.

First, the previous 3.4 and 3.5 assumes that the B+Tree implementation 
doesn't have Bitmap Index Scan access, which isn't true and might make 
reviewers think that we overlooked the access method. We should just 
assume B+Tree with Bitmap Index Scan and clearly say so in this section.

Second, however, building a cost model for B+Tree with Bitmap Index Scan 
is not easy; we need c_per_u-like parameter to estimate the cost. 
Selectivity information is not enough to estimate the cost as Bitmap 
Index Scan is affected by how much the selected tuples are scattered 
over the disk. We suggest to have a small section to say that:

a) Existing DBMS like PostgreSQL didn't have a decent cost model for 
Bitmap Index Scan because it needs c_per_u to estimate the cost.
b) Our CT cost model is applicable to B+Tree with Bitmap Index Scan and 
more accurate because it's aware of page clustering.
c) The only difference b/w Bitmap Index Scan and CT is that CT has 
bucketing and c_per_u is variable while B+Tree c_per_u is fixed.

It's a bit challenging to claim, but it will clarify the difference b/w 
CT and Bitmap Index Scan so that reviewers will know we are comparing 
apple to apple.


4. Estimating CT COST
We think this section should be a subsection of Section 3 right after 
the cost model because CT cost model should be followed by how to 
estimate c_per_u in the cost model. Instead, Section 5 should be renamed 
to "CT Advisor" or something.

We should say in this subsection that c_per_u is same as FD strength 
defined in CORDS paper but we used the value for many other purposes; CT 
size estimation and query cost estimation, which requires more accuracy 
on c_per_u estimation and that's why we used Distinct Sampling unlike CORDS.
-- 
Hideaki Kimura <hkimura at cs.brown.edu>


More information about the CI mailing list