[CI] Re: Correlations in SDSS

Samuel Madden madden at csail.mit.edu
Thu Jun 26 07:36:29 EDT 2008


I think this is great!  We should just say what the query is that we  
made up and why we did that (b/c we wanted to simulate a wide variety  
of possible ad-hoc queries.)

On Jun 25, 2008, at 11:12 PM, Hideaki Kimura wrote:

> Here's the figure to be put in the paper.
> It shows that more than 1/4 of queries in SDSS can be 2x- sped up
> by these clustering.
>
> Any thoughts on this?
>
> I had a discussion with Alex about whether putting the column name
> on the figure or not. It might be more convincing if we put the
> real column name in SDSS, but might be risky because our workload
> is what we made up; not the true SDSS workload.
>
> Hideaki Kimura wrote:
>> The excel file shows simulated query performance for each clustering.
>> The first column is the name of clustering attribute.
>> The 2nd and 3rd columns are sum of query runtime on each clustering.
>> The 4th and 5th columns are the number of queries that were sped up  
>> more than 20% from sequential scan, which means the number of well  
>> correlated columns. 20% is a heuristic threshold without solid  
>> reasoning.
>> 1. Data
>>  A(narrow). 200K tuples, 39 columns(181 bytes/tuple), 4450 pages  
>> (36MB)
>>  B(wide). 200K tuples, 400 columns(1980 bytes/tuple), 50070 pages  
>> (410MB)
>> The only difference is the width of a tuple used in the simulation.
>> 2. Workload
>> The simulated workload consists of 39*2 queries.
>>  A. 1%-selectivity query
>>  SELECT * FROM table WHERE u BETWEEN "u's 30th percentile value"  
>> AND "u's 31st percentile value"
>>  B. 5%-selectivity query
>>  SELECT * FROM table WHERE u BETWEEN "u's 30th percentile value"  
>> AND "u's 35th percentile value"
>> for every (39) column.
>> 3. Cost model
>> I used simulation-based cost estimation same as Table 3 (Clustered  
>> column bucketing granularity and IO cost) to correctly emulate the  
>> behavior of Bitmap Index Scan. I didn't use the cost model  
>> developed in Section 3 because it's affected by bucketing. Note  
>> that the cost model can't correctly estimate the cost of ranging  
>> query unless we know the knee point.
>> Note that this simulates Bitmap Index Scan; CT might be slower than  
>> this because of bucketing.
>> 4. Conclusion
>>  A. For many clustering in SDSS, B+Tree and CT can exploit  
>> correlation for 1/3 to 2/3 columns.
>>  B. More selective query is more benefited by correlation *when  
>> compared with seqscan*.
>>  C. Larger table exhibits more benefit of data correlation.
>> Is this result enough convincing to say that even a real dataset  
>> has plentiful beneficial data correlations?
>> What kind of figure should I make (what is x-axis, y-axis, how many  
>> figures)?
>> ------------------------------------------------------------------------
>> _______________________________________________
>> CI mailing list
>> CI at list.cs.brown.edu
>> http://list.cs.brown.edu/mailman/listinfo/ci
>
>
> -- 
> Hideaki Kimura <hkimura at cs.brown.edu>
> <clustering_figure.xls>_______________________________________________
> CI mailing list
> CI at list.cs.brown.edu
> http://list.cs.brown.edu/mailman/listinfo/ci



More information about the CI mailing list