[CI] Re: Correlations in SDSS
Samuel Madden
madden at csail.mit.edu
Thu Jun 26 07:36:29 EDT 2008
I think this is great! We should just say what the query is that we
made up and why we did that (b/c we wanted to simulate a wide variety
of possible ad-hoc queries.)
On Jun 25, 2008, at 11:12 PM, Hideaki Kimura wrote:
> Here's the figure to be put in the paper.
> It shows that more than 1/4 of queries in SDSS can be 2x- sped up
> by these clustering.
>
> Any thoughts on this?
>
> I had a discussion with Alex about whether putting the column name
> on the figure or not. It might be more convincing if we put the
> real column name in SDSS, but might be risky because our workload
> is what we made up; not the true SDSS workload.
>
> Hideaki Kimura wrote:
>> The excel file shows simulated query performance for each clustering.
>> The first column is the name of clustering attribute.
>> The 2nd and 3rd columns are sum of query runtime on each clustering.
>> The 4th and 5th columns are the number of queries that were sped up
>> more than 20% from sequential scan, which means the number of well
>> correlated columns. 20% is a heuristic threshold without solid
>> reasoning.
>> 1. Data
>> A(narrow). 200K tuples, 39 columns(181 bytes/tuple), 4450 pages
>> (36MB)
>> B(wide). 200K tuples, 400 columns(1980 bytes/tuple), 50070 pages
>> (410MB)
>> The only difference is the width of a tuple used in the simulation.
>> 2. Workload
>> The simulated workload consists of 39*2 queries.
>> A. 1%-selectivity query
>> SELECT * FROM table WHERE u BETWEEN "u's 30th percentile value"
>> AND "u's 31st percentile value"
>> B. 5%-selectivity query
>> SELECT * FROM table WHERE u BETWEEN "u's 30th percentile value"
>> AND "u's 35th percentile value"
>> for every (39) column.
>> 3. Cost model
>> I used simulation-based cost estimation same as Table 3 (Clustered
>> column bucketing granularity and IO cost) to correctly emulate the
>> behavior of Bitmap Index Scan. I didn't use the cost model
>> developed in Section 3 because it's affected by bucketing. Note
>> that the cost model can't correctly estimate the cost of ranging
>> query unless we know the knee point.
>> Note that this simulates Bitmap Index Scan; CT might be slower than
>> this because of bucketing.
>> 4. Conclusion
>> A. For many clustering in SDSS, B+Tree and CT can exploit
>> correlation for 1/3 to 2/3 columns.
>> B. More selective query is more benefited by correlation *when
>> compared with seqscan*.
>> C. Larger table exhibits more benefit of data correlation.
>> Is this result enough convincing to say that even a real dataset
>> has plentiful beneficial data correlations?
>> What kind of figure should I make (what is x-axis, y-axis, how many
>> figures)?
>> ------------------------------------------------------------------------
>> _______________________________________________
>> CI mailing list
>> CI at list.cs.brown.edu
>> http://list.cs.brown.edu/mailman/listinfo/ci
>
>
> --
> Hideaki Kimura <hkimura at cs.brown.edu>
> <clustering_figure.xls>_______________________________________________
> CI mailing list
> CI at list.cs.brown.edu
> http://list.cs.brown.edu/mailman/listinfo/ci
More information about the CI
mailing list