Data skew in a column can make it difficult for the optimizer to accurately estimate the cardinality of an operation. Without a histogram it will assume an even distribution amongst the distinct values for the column. Take the case of a Yes/No flag for active records in a table. There may be 1 million rows in the table with only 100 being marked as active. The optimizer would assume half are marked as “Yes” and half as “No”, which would be a really bad assumption in this case. Histograms describe the distribution of data in a column, which helps identify data skew and allows the optimizer to make a better decision.
Two types of histograms were supported, “frequency” and “height-balanced” histograms. Oracle database 12c introduces two new types of histogram, a “top frequency” and “hybrid” histogram. This article presents an example of each type of histogram.
Frequency Histograms
A frequency histogram is created when the number of distinct values (NDV) for the column is less than or equal to the total number of histogram buckets, which defaults to 254. This means each distinct value can be represented by its own bucket. The following code creates and populates a table. Half of the rows have a RECORD_TYPE value of 0, with the other half being randomly assigned a value from 1-9, giving a total of 10 distinct values. Once populated, the data is queried using a filter on the skewed column, then statistics are gathered.
DROP TABLE tab1 PURGE;
CREATE TABLE tab1 AS
SELECT level AS id,
CASE
WHEN MOD(level, 2) = 0 THEN 0
ELSE TRUNC(DBMS_RANDOM.value(1,10))
END AS record_type,
'Description for ' || level AS description
FROM dual
CONNECT BY level <= 10000;
COMMIT;
ALTER TABLE tab1 ADD CONSTRAINT tab1_pk PRIMARY KEY (id);
CREATE INDEX tab1_record_type_idx ON tab1(record_type);
-- Query table referencing skewed column to trigger automatic
-- histogram creation the next time statistics are gathered.
SELECT COUNT(*) FROM tab1 WHERE record_type = 0;
COUNT(*)
----------
5000
SQL>
-- Default statistics gathered.
EXEC DBMS_STATS.gather_table_stats(USER, 'TAB1');
We can see the presence of the frequency histogram using the USER_TAB_COLUMNS view.
COLUMN column_name FORMAT A20
SELECT column_id,
column_name,
histogram
FROM user_tab_columns
WHERE table_name = 'TAB1'
ORDER BY column_id;
COLUMN_ID COLUMN_NAME HISTOGRAM
---------- -------------------- ---------------
1 ID NONE
2 RECORD_TYPE FREQUENCY
3 DESCRIPTION NONE
SQL>
The USER_TAB_HISTOGRAMS view is used to display information about the contents of histograms. The ENDPOINT_VALUE column represents the highest distinct column value in the bucket and the ENDPOINT_NUMBER is a cumulative frequency of rows in the bucket. If we want to get the actual frequency of the specific column value, we need to subtract the previous cumulative value. The following query does that using the LAG analytic function.
SELECT endpoint_value,
endpoint_number,
endpoint_number - LAG(endpoint_number, 1, 0) OVER (ORDER BY endpoint_value, endpoint_number) AS frequency
FROM user_tab_histograms
WHERE table_name = 'TAB1'
AND column_name = 'RECORD_TYPE'
ORDER BY endpoint_value;
ENDPOINT_VALUE ENDPOINT_NUMBER FREQUENCY
-------------- --------------- ----------
0 5000 5000
1 5535 535
2 6104 569
3 6679 575
4 7243 564
5 7773 530
6 8329 556
7 8889 560
8 9469 580
9 10000 531
SQL>
Height-Balanced Histograms
In previous releases, when there were more distinct column values than the number of histogram buckets, a height-balanced histogram was created. In this case the histogram buckets are compressed so each bucket represents a range of values, which should contain approximately the same frequency of rows.
In Oracle database 12c, height-balanced histograms are only created if sampling is explicitly used during statistics collection. If there is no explicit sampling, Oracle will perform a full table scan and build a hybrid histogram, or possibly a top frequency histogram depending on the circumstances.
The following table has half the rows with a RECORD_TYPE of 0 and the other half each having a unique value. The total number of distinct values is 5001, which is far in excess of the default number of histogram buckets of 254. Once populated, the data is queried using a filter on the skewed column, then statistics are gathered using an explicit sample size.
DROP TABLE tab1 PURGE;
CREATE TABLE tab1 AS
SELECT level AS id,
CASE
WHEN MOD(level,2) = 0 THEN 0
ELSE level
END AS record_type,
'Description for ' || level AS description
FROM dual
CONNECT BY level <= 10000;
COMMIT;
ALTER TABLE tab1 ADD CONSTRAINT tab1_pk PRIMARY KEY (id);
CREATE INDEX tab1_record_type_idx ON tab1(record_type);
-- Query table referencing skewed column to trigger automatic
-- histogram creation the next time statistics are gathered.
SELECT COUNT(*) FROM tab1 WHERE record_type = 0;
COUNT(*)
----------
5000
SQL>
-- Statistics gathered. Use explicit sampling to force height-balanced histogram.
EXEC DBMS_STATS.gather_table_stats(USER, 'TAB1', estimate_percent=>10);
We can see the presence of the height-balanced histogram using the USER_TAB_COLUMNS view.
COLUMN column_name FORMAT A20
SELECT column_id,
column_name,
histogram
FROM user_tab_columns
WHERE table_name = 'TAB1'
ORDER BY column_id;
COLUMN_ID COLUMN_NAME HISTOGRAM
---------- -------------------- ---------------
1 ID NONE
2 RECORD_TYPE HEIGHT BALANCED
3 DESCRIPTION NONE
SQL>
Since there are not enough buckets to represent all the distinct values, the buckets are compressed to represent ranges of values. The ENDPOINT_VALUE represents the highest value in the bucket and the ENDPOINT_NUMBER represents the cumulative frequency. The following query displays the range, number of values represented and frequency associated with each bucket. The range is displayed as “<=” because popular values can span buckets in high-balanced histograms. Remember, this is sample data.
COLUMN range FORMAT A20
SELECT '<=' || endpoint_value AS range,
endpoint_value - (LAG(endpoint_value, 1, -1) OVER (ORDER BY endpoint_value, endpoint_number)+1) + 1 AS vals_in_range,
endpoint_number - LAG(endpoint_number, 1, 0) OVER (ORDER BY endpoint_value, endpoint_number) AS frequency
FROM user_tab_histograms
WHERE table_name = 'TAB1'
AND column_name = 'RECORD_TYPE'
ORDER BY endpoint_value;
RANGE VALS_IN_RANGE FREQUENCY
-------------------- ------------- ----------
<=0 1 125
<=57 57 1
<=143 86 1
<=209 66 1
<=281 72 1
<=359 78 1
<=443 84 1
<=515 72 1
<=579 64 1
<=679 100 1
<=749 70 1
RANGE VALS_IN_RANGE FREQUENCY
-------------------- ------------- ----------
<=831 82 1
<=887 56 1
<=963 76 1
<=1051 88 1
<=1119 68 1
<=1201 82 1
<=1265 64 1
<=1343 78 1
<=1413 70 1
<=1481 68 1
<=1571 90 1
RANGE VALS_IN_RANGE FREQUENCY
-------------------- ------------- ----------
<=1647 76 1
<=1735 88 1
<=1813 78 1
<=1875 62 1
<=1943 68 1
<=2029 86 1
<=2113 84 1
<=2181 68 1
<=2263 82 1
<=2335 72 1
<=2415 80 1
RANGE VALS_IN_RANGE FREQUENCY
-------------------- ------------- ----------
<=2523 108 1
<=2615 92 1
<=2675 60 1
<=2753 78 1
<=2821 68 1
<=2899 78 1
<=2971 72 1
<=3051 80 1
<=3119 68 1
<=3189 70 1
<=3273 84 1
RANGE VALS_IN_RANGE FREQUENCY
-------------------- ------------- ----------
<=3359 86 1
<=3445 86 1
<=3503 58 1
<=3587 84 1
<=3679 92 1
<=3755 76 1
<=3835 80 1
<=3911 76 1
<=3977 66 1
<=4051 74 1
<=4159 108 1
RANGE VALS_IN_RANGE FREQUENCY
-------------------- ------------- ----------
<=4253 94 1
<=4355 102 1
<=4467 112 1
<=4567 100 1
<=4633 66 1
<=4709 76 1
<=4779 70 1
<=4869 90 1
<=4947 78 1
<=5025 78 1
<=5101 76 1
RANGE VALS_IN_RANGE FREQUENCY
-------------------- ------------- ----------
<=5193 92 1
<=5281 88 1
<=5363 82 1
<=5443 80 1
<=5529 86 1
<=5621 92 1
<=5711 90 1
<=5767 56 1
<=5841 74 1
<=5909 68 1
<=5991 82 1
RANGE VALS_IN_RANGE FREQUENCY
-------------------- ------------- ----------
<=6057 66 1
<=6115 58 1
<=6201 86 1
<=6273 72 1
<=6363 90 1
<=6439 76 1
<=6521 82 1
<=6605 84 1
<=6667 62 1
<=6733 66 1
<=6811 78 1
RANGE VALS_IN_RANGE FREQUENCY
-------------------- ------------- ----------
<=6895 84 1
<=6961 66 1
<=7041 80 1
<=7113 72 1
<=7191 78 1
<=7263 72 1
<=7333 70 1
<=7429 96 1
<=7511 82 1
<=7567 56 1
<=7651 84 1
RANGE VALS_IN_RANGE FREQUENCY
-------------------- ------------- ----------
<=7719 68 1
<=7805 86 1
<=7877 72 1
<=7977 100 1
<=8057 80 1
<=8115 58 1
<=8177 62 1
<=8259 82 1
<=8345 86 1
<=8433 88 1
<=8499 66 1
RANGE VALS_IN_RANGE FREQUENCY
-------------------- ------------- ----------
<=8571 72 1
<=8645 74 1
<=8707 62 1
<=8793 86 1
<=8875 82 1
<=8949 74 1
<=9023 74 1
<=9085 62 1
<=9147 62 1
<=9227 80 1
<=9309 82 1
RANGE VALS_IN_RANGE FREQUENCY
-------------------- ------------- ----------
<=9385 76 1
<=9479 94 1
<=9553 74 1
<=9641 88 1
<=9709 68 1
<=9781 72 1
<=9849 68 1
<=9933 84 1
<=9999 66 1
SQL>