[CI] Re: Correlations in SDSS
Hideaki Kimura
hkimura at cs.brown.edu
Wed Jun 25 23:12:20 EDT 2008
Here's the figure to be put in the paper.
It shows that more than 1/4 of the queries in SDSS can be sped up 2x
by these clusterings.
Any thoughts on this?
I had a discussion with Alex about whether to put the column names
on the figure. It might be more convincing if we show the real
SDSS column names, but it might be risky because our workload
is one we made up, not the true SDSS workload.
Hideaki Kimura wrote:
> The Excel file shows simulated query performance for each clustering.
>
> The first column is the name of the clustering attribute.
> The 2nd and 3rd columns are the sum of query runtimes under each clustering.
> The 4th and 5th columns are the number of queries that were sped up by more
> than 20% over a sequential scan, i.e. the number of well-correlated
> columns. The 20% is a heuristic threshold without solid reasoning.
>
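A rough sketch of how such a count could be computed from per-query runtimes. The function name, the input lists, and the reading of "sped up by more than 20%" as "runtime below 80% of the seqscan runtime" are assumptions for illustration, not the actual spreadsheet formula.

```python
def count_sped_up(seqscan_runtimes, clustered_runtimes, threshold=0.20):
    """Count queries whose runtime dropped by more than `threshold`
    relative to the sequential-scan runtime of the same query.

    Assumption: "sped up more than 20%" means
    clustered_runtime < (1 - 0.20) * seqscan_runtime.
    """
    if len(seqscan_runtimes) != len(clustered_runtimes):
        raise ValueError("both lists must have one runtime per query")
    count = 0
    for seq, clustered in zip(seqscan_runtimes, clustered_runtimes):
        if clustered < seq * (1.0 - threshold):
            count += 1
    return count


# Made-up runtimes for three queries, only to show the call.
example_count = count_sped_up([10.0, 10.0, 10.0], [7.0, 9.0, 5.0])
```

Under the same reading, calling it with threshold=0.5 would count the queries that run more than 2x faster than the sequential scan.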
>
> 1. Data
> A (narrow). 200K tuples, 39 columns (181 bytes/tuple), 4450 pages (36MB)
> B (wide). 200K tuples, 400 columns (1980 bytes/tuple), 50070 pages (410MB)
>
> The only difference is the width of a tuple used in the simulation.
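A quick sanity check on these page counts, assuming 8 KB pages (the page size is not stated above, so this is an assumption) and ignoring page and tuple header overhead. The estimates land slightly below the reported 4450 and 50070 pages, presumably because of that ignored overhead.

```python
PAGE_SIZE = 8192  # assumed 8 KB page; not stated in the message


def estimate_pages(num_tuples, tuple_bytes, page_size=PAGE_SIZE):
    """Estimate the page count when whole tuples are packed into
    fixed-size pages, with no header overhead (a simplification)."""
    tuples_per_page = page_size // tuple_bytes
    if tuples_per_page == 0:
        raise ValueError("tuple does not fit in one page")
    # Integer ceiling division: a partially filled last page still counts.
    return -(-num_tuples // tuples_per_page)


narrow_pages = estimate_pages(200_000, 181)   # 45 tuples fit per page
wide_pages = estimate_pages(200_000, 1980)    # 4 tuples fit per page
```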
>
> 2. Workload
> The simulated workload consists of 39*2 queries.
> A. 1%-selectivity query
> SELECT * FROM table WHERE u BETWEEN "u's 30th percentile value" AND
> "u's 31st percentile value"
>
> B. 5%-selectivity query
> SELECT * FROM table WHERE u BETWEEN "u's 30th percentile value" AND
> "u's 35th percentile value"
>
> for each of the 39 columns.
>
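A sketch of how the 39*2 queries could be generated. The `columns` dictionary, the `percentile` helper (a simple rank-based lookup on sorted values), and the default table name are hypothetical stand-ins, not the script actually used for the simulation.

```python
def percentile(sorted_values, p):
    """Rank-based percentile on an already sorted list (p in 0..100)."""
    if not sorted_values:
        raise ValueError("need at least one value")
    idx = min(len(sorted_values) - 1, int(len(sorted_values) * p / 100))
    return sorted_values[idx]


def build_workload(columns, table="table", low=30, widths=(1, 5)):
    """Build one range query per column and per selectivity.

    columns: dict mapping column name -> list of numeric values.
    widths:  percentile widths, (1, 5) for the 1% and 5% queries.
    """
    queries = []
    for name, values in columns.items():
        ordered = sorted(values)
        lo = percentile(ordered, low)
        for width in widths:
            hi = percentile(ordered, low + width)
            queries.append(
                f"SELECT * FROM {table} WHERE {name} BETWEEN {lo} AND {hi}"
            )
    return queries


# Two made-up columns; the real workload covers all 39 columns.
sample_columns = {"u": list(range(100)), "g": list(range(100, 200))}
sample_queries = build_workload(sample_columns)
```

Since BETWEEN is inclusive on both ends, the resulting selectivities are only approximately 1% and 5%.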
> 3. Cost model
> I used the same simulation-based cost estimation as in Table 3 (Clustered
> column bucketing granularity and IO cost) to correctly emulate the
> behavior of Bitmap Index Scan. I didn't use the cost model developed in
> Section 3 because it's affected by bucketing. Note that the cost model
> can't correctly estimate the cost of a range query unless we know the
> knee point.
>
> Note that this simulates Bitmap Index Scan; CT might be slower than this
> because of bucketing.
>
> 4. Conclusion
> A. For many clusterings in SDSS, B+Tree and CT can exploit correlation
> for 1/3 to 2/3 of the columns.
> B. More selective queries benefit more from correlation *when
> compared with seqscan*.
> C. The larger table benefits more from data correlation.
>
>
> Is this result convincing enough to say that even a real dataset has
> plenty of beneficial data correlations?
> What kind of figure should I make (what are the x-axis and y-axis, and
> how many figures)?
>
>
--
Hideaki Kimura <hkimura at cs.brown.edu>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clustering_figure.xls
Type: application/vnd.ms-excel
Size: 22528 bytes
Desc: not available
Url : http://list.cs.brown.edu/pipermail/ci/attachments/20080625/b102d9d6/clustering_figure-0001.xls