[CI] Correlations in SDSS
Hideaki Kimura
hkimura at cs.brown.edu
Tue Jun 24 13:06:39 EDT 2008
Here are the FD counts for the Desktop SDSS fact table (PhotoObjAll, 200K
tuples, 400 columns).
1. Columns picked
I picked the 39 columns of the fact table that appear in predicates of the
SDSS benchmark workload queries.
2. Bucketing
I bucketed every column so that the cardinality of each column is at most
128. Because this is wide bucketing, trivial FDs caused by many-valued
columns are wiped out, so it should pick up beneficial FDs only.
However, even wider bucketing might be better for Pairwise-FDs, since the
total tuple count is just 200K while 128^3 is about 2M. This is why more
Pairwise-FDs were detected than Single-FDs.
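To make the sparsity argument concrete, here is a rough sketch. The bucketing method is not stated above, so bucketize() below is a hypothetical equal-width scheme of my own, not necessarily the one actually used; only the 200K tuple count and the 128-bucket cap come from the numbers above.

```python
TUPLES = 200_000   # tuple count of PhotoObjAll quoted above
BUCKETS = 128      # maximum cardinality per column after bucketing


def bucketize(values, buckets=BUCKETS):
    """Map numeric values to bucket ids in [0, buckets).

    ASSUMPTION: equal-width buckets. The actual bucketing scheme is
    not described in this post.
    """
    values = list(values)
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / buckets
    # Clamp so that the maximum value lands in the last bucket.
    return [min(int((v - lo) / width), buckets - 1) for v in values]


# Possible bucket combinations for A->B and for AB->C.
single_cells = BUCKETS ** 2      # 16,384
pairwise_cells = BUCKETS ** 3    # 2,097,152 (about 2M)

# Average number of tuples available per combination.
tuples_per_single_cell = TUPLES / single_cells       # about 12.2
tuples_per_pairwise_cell = TUPLES / pairwise_cells   # about 0.095
```

With fewer than one tuple per (A, B, C) bucket combination on average, many AB->C pairs can look like strong FDs simply because there is too little data to contradict them.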
3. FD Strength threshold
I calculated the strength of all 1482 Single-FDs (A->B) and 27417
Pairwise-FDs (AB->C). The figure shows the cumulative percentage of FDs
as a function of the c_per_u (FD strength) threshold.
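The two candidate counts follow from the 39 picked columns, assuming ordered pairs for A->B and an unordered pair {A, B} plus a distinct third column for AB->C (my reading, which reproduces both numbers):

```python
from math import comb

n_columns = 39  # columns picked in step 1

# A->B: any ordered pair of distinct columns.
single_fds = n_columns * (n_columns - 1)

# AB->C: an unordered pair {A, B}, then any remaining column as C.
pairwise_fds = comb(n_columns, 2) * (n_columns - 2)

print(single_fds, pairwise_fds)  # 1482 27417
```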
Note that c_per_u is the value measured *after bucketing* the
clustered/unclustered attribute values. For example,
orderdate->shipdate has c_per_u=30 before bucketing and c_per_u=2 after
bucketing with 128 buckets.
I think a c_per_u that is still beneficial after wide bucketing has to be
very low, like 2, 3 or at most 10. c_per_u=10 with 128 buckets means about
8% selectivity!
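A small sketch of the arithmetic behind the 8% figure. It assumes c_per_u is the average number of distinct clustered-attribute buckets seen per unclustered-attribute bucket; that definition is my reading of "clustered/unclustered" above, and the function names are made up for illustration.

```python
from collections import defaultdict


def c_per_u(pairs):
    """Average number of distinct clustered buckets per unclustered bucket.

    pairs: iterable of (unclustered_bucket, clustered_bucket) tuples.
    ASSUMPTION: this is what c_per_u measures; the post does not spell
    the definition out.
    """
    seen = defaultdict(set)
    for u, c in pairs:
        seen[u].add(c)
    return sum(len(cs) for cs in seen.values()) / len(seen)


def selectivity(c_per_u_value, buckets=128):
    """Fraction of the clustered buckets touched by one unclustered bucket."""
    return c_per_u_value / buckets


print(selectivity(10))  # 0.078125, i.e. about 8%
print(selectivity(2))   # 0.015625, i.e. about 1.6%
```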
--
Hideaki Kimura <hkimura at cs.brown.edu>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SDSS_FD20080625.xls
Type: application/vnd.ms-excel
Size: 19968 bytes
Desc: not available
Url : http://list.cs.brown.edu/pipermail/ci/attachments/20080624/da112e48/SDSS_FD20080625-0001.xls