[CI] Correlations in SDSS

Hideaki Kimura hkimura at cs.brown.edu
Tue Jun 24 13:06:39 EDT 2008


Here are the FD counts for the Desktop SDSS fact table (PhotoObjAll,
200K tuples, 400 columns).

1. Columns picked
I picked the 39 columns of the fact table that appear in predicates of
the SDSS benchmark workload queries.

2. Bucketing
I bucketed every column so that its cardinality is at most 128. Because
this bucketing is wide, trivial FDs caused by many-valued columns are
wiped out, so only beneficial FDs should be picked up.
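To make the bucketing step concrete, here is a minimal sketch. It assumes
equi-width buckets over numeric values; the message does not say which
bucketing scheme was actually used, and the name bucketize is
hypothetical.

```python
def bucketize(values, max_buckets=128):
    """Map each numeric value to a bucket id in [0, max_buckets).

    Assumption: equi-width buckets. The original post does not state
    the bucketing scheme, so this is only one plausible choice.
    """
    distinct = sorted(set(values))
    if len(distinct) <= max_buckets:
        # Column is already narrow enough: give each value its own bucket.
        ids = {v: i for i, v in enumerate(distinct)}
        return [ids[v] for v in values]
    lo, hi = distinct[0], distinct[-1]
    width = (hi - lo) / max_buckets
    # Clamp so that the maximum value lands in the last bucket.
    return [min(int((v - lo) / width), max_buckets - 1) for v in values]


column = [i * 0.5 for i in range(1000)]   # 1000 distinct values
bucketed = bucketize(column)
print(len(set(column)), "->", len(set(bucketed)), "distinct values")
```

An equi-depth scheme (the same number of tuples per bucket) would be the
obvious alternative for skewed columns.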

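However, even wider bucketing might be better for Pairwise-FDs, as the
total tuple count is just 200K while 128^3 = 2M. With fewer tuples than
possible (A, B, C) bucket combinations, many AB combinations presumably
hold only a few tuples and so appear to determine C by chance. This is
why more Pairwise-FDs than Single-FDs were detected.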
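The arithmetic behind this can be checked quickly. The sketch below only
compares the tuple count with the number of possible bucket combinations;
reading this ratio as the cause of the extra Pairwise-FDs is an
interpretation of the remark above, not something measured here.

```python
buckets = 128
tuples = 200_000

single_cells = buckets ** 2     # possible (A, B) combinations for A->B
pairwise_cells = buckets ** 3   # possible (A, B, C) combinations for AB->C

print(f"A->B : {single_cells} cells, {tuples / single_cells:.1f} tuples per cell")
print(f"AB->C: {pairwise_cells} cells, {tuples / pairwise_cells:.2f} tuples per cell")
```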

3. FD Strength threshold
I calculated the strength of all 1482 Single-FDs (A->B) and 27417
Pairwise-FDs (AB->C). The figure shows the cumulative percentage of FDs
as a function of the c_per_u (FD strength) threshold.
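The candidate counts follow from the 39 picked columns, and the strength
computation can be sketched as below. This assumes that c_per_u for
A->B is the average number of distinct B values seen with each distinct
A value; the message does not define the metric, so treat that reading,
and the helper name c_per_u, as assumptions.

```python
from math import comb

n = 39
single_fds = n * (n - 1)             # ordered pairs (A, B) with A != B
pairwise_fds = comb(n, 2) * (n - 2)  # unordered {A, B}, then any other C
print(single_fds, pairwise_fds)      # 1482 27417


def c_per_u(rows, determinant, dependent):
    """Average number of distinct dependent values per determinant value.

    rows is a list of dicts of (already bucketed) column values.
    Assumed definition; 1.0 would mean a perfect FD.
    """
    pairs = {(tuple(r[c] for c in determinant), r[dependent]) for r in rows}
    keys = {key for key, _ in pairs}
    return len(pairs) / len(keys)


rows = [
    {"a": 1, "b": 1, "c": 7},
    {"a": 1, "b": 2, "c": 7},
    {"a": 2, "b": 3, "c": 8},
]
print(c_per_u(rows, ["a"], "b"))       # a=1 maps to two b values
print(c_per_u(rows, ["a", "b"], "c"))  # every (a, b) maps to one c
```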

Note that c_per_u is the value *after bucketing* the
clustered/unclustered attribute values. For example,
orderdate->shipdate has c_per_u=30 before bucketing and c_per_u=2 after
bucketing with 128 buckets.

I think a beneficial c_per_u after wide bucketing is very low, like
2, 3, or at most 10. c_per_u=10 with 128 buckets means about 8%
selectivity!
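The 8% figure is simply c_per_u divided by the bucket count. A small
sketch, assuming all 128 buckets hold roughly the same number of tuples
(the helper name bucket_selectivity is hypothetical):

```python
def bucket_selectivity(c_per_u, buckets=128):
    """Fraction of buckets touched per lookup value.

    Assumes evenly filled buckets, so bucket fraction ~ tuple fraction.
    """
    return c_per_u / buckets


for c in (2, 3, 10):
    print(f"c_per_u={c}: {bucket_selectivity(c):.1%}")
# c_per_u=2: 1.6%
# c_per_u=3: 2.3%
# c_per_u=10: 7.8%
```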
-- 
Hideaki Kimura <hkimura at cs.brown.edu>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SDSS_FD20080625.xls
Type: application/vnd.ms-excel
Size: 19968 bytes
Desc: not available
Url : http://list.cs.brown.edu/pipermail/ci/attachments/20080624/da112e48/SDSS_FD20080625-0001.xls

