Column Histograms in Oracle 11g

Data skew in a column can make it difficult for the optimizer to accurately estimate the cardinality of an operation. Without a histogram it will assume an even distribution amongst the distinct values for the column. Take the case of a Yes/No flag for active records in a table. There may be 1 million rows in the table with only 100 being marked as active. The optimizer would assume half are marked as “Yes” and half as “No”, which would be a really bad assumption in this case. Histograms describe the distribution of data in a column, which helps identify data skew and allows the optimizer to make a better decision.

Two types of histograms were supported, “frequency” and “height-balanced” histograms. Oracle database 12c introduces two new types of histogram, a “top frequency” and “hybrid” histogram. This article presents an example of each type of histogram.

Frequency Histograms

A frequency histogram is created when the number of distinct values (NDV) for the column is less than or equal to the total number of histogram buckets, which defaults to 254. This means each distinct value can be represented by its own bucket. The following code creates and populates a table. Half of the rows have a RECORD_TYPE value of 0, with the other half being randomly assigned a value from 1-9, giving a total of 10 distinct values. Once populated, the data is queried using a filter on the skewed column, then statistics are gathered.

DROP TABLE tab1 PURGE;

CREATE TABLE tab1 AS
SELECT level AS id,
       CASE
         WHEN MOD(level, 2) = 0 THEN 0
         ELSE TRUNC(DBMS_RANDOM.value(1,10))
       END AS record_type,
       'Description for ' || level AS description
FROM   dual
CONNECT BY level <= 10000;
COMMIT;

ALTER TABLE tab1 ADD CONSTRAINT tab1_pk PRIMARY KEY (id);
CREATE INDEX tab1_record_type_idx ON tab1(record_type);

-- Query table referencing skewed column to trigger automatic
-- histogram creation the next time statistics are gathered.
SELECT COUNT(*) FROM tab1 WHERE record_type = 0;

  COUNT(*)
----------
      5000

SQL>

-- Default statistics gathered.
EXEC DBMS_STATS.gather_table_stats(USER, 'TAB1');

We can see the presence of the frequency histogram using the USER_TAB_COLUMNS view.

COLUMN column_name FORMAT A20

SELECT column_id,
       column_name,
       histogram
FROM   user_tab_columns
WHERE  table_name = 'TAB1'
ORDER BY column_id;

 COLUMN_ID COLUMN_NAME          HISTOGRAM
---------- -------------------- ---------------
         1 ID                   NONE
         2 RECORD_TYPE          FREQUENCY
         3 DESCRIPTION          NONE

SQL>

The USER_TAB_HISTOGRAMS view is used to display information about the contents of histograms. The ENDPOINT_VALUE column represents the highest distinct column value in the bucket and the ENDPOINT_NUMBER is a cumulative frequency of rows in the bucket. If we want to get the actual frequency of the specific column value, we need to subtract the previous cumulative value. The following query does that using the LAG analytic function.

SELECT endpoint_value,
       endpoint_number,
       endpoint_number - LAG(endpoint_number, 1, 0) OVER (ORDER BY endpoint_value, endpoint_number) AS frequency
FROM   user_tab_histograms
WHERE  table_name  = 'TAB1'
AND    column_name = 'RECORD_TYPE'
ORDER BY endpoint_value;

ENDPOINT_VALUE ENDPOINT_NUMBER  FREQUENCY
-------------- --------------- ----------
             0            5000       5000
             1            5535        535
             2            6104        569
             3            6679        575
             4            7243        564
             5            7773        530
             6            8329        556
             7            8889        560
             8            9469        580
             9           10000        531

SQL>

Height-Balanced Histograms

In previous releases, when there were more distinct column values than the number of histogram buckets, a height-balanced histogram was created. In this case the histogram buckets are compressed so each bucket represents a range of values, which should contain approximately the same frequency of rows.

In Oracle database 12c, height-balanced histograms are only created if sampling is explicitly used during statistics collection. If there is no explicit sampling, Oracle will perform a full table scan and build a hybrid histogram, or possibly a top frequency histogram depending on the circumstances.

The following table has half the rows with a RECORD_TYPE of 0 and the other half each having a unique value. The total number of distinct values is 5001, which is far in excess of the default number of histogram buckets of 254. Once populated, the data is queried using a filter on the skewed column, then statistics are gathered using an explicit sample size.

DROP TABLE tab1 PURGE;

CREATE TABLE tab1 AS
SELECT level AS id,
       CASE
         WHEN MOD(level,2) = 0 THEN 0
         ELSE level
       END AS record_type,
       'Description for ' || level AS description
FROM   dual
CONNECT BY level <= 10000;
COMMIT;

ALTER TABLE tab1 ADD CONSTRAINT tab1_pk PRIMARY KEY (id);
CREATE INDEX tab1_record_type_idx ON tab1(record_type);

-- Query table referencing skewed column to trigger automatic
-- histogram creation the next time statistics are gathered.
SELECT COUNT(*) FROM tab1 WHERE record_type = 0;

  COUNT(*)
----------
      5000

SQL>

-- Statistics gathered. Use explicit sampling to force height-balanced histogram.
EXEC DBMS_STATS.gather_table_stats(USER, 'TAB1', estimate_percent=>10);

We can see the presence of the height-balanced histogram using the USER_TAB_COLUMNS view.

COLUMN column_name FORMAT A20

SELECT column_id,
       column_name,
       histogram
FROM   user_tab_columns
WHERE  table_name = 'TAB1'
ORDER BY column_id;

 COLUMN_ID COLUMN_NAME          HISTOGRAM
---------- -------------------- ---------------
         1 ID                   NONE
         2 RECORD_TYPE          HEIGHT BALANCED
         3 DESCRIPTION          NONE

SQL>

Since there are not enough buckets to represent all the distinct values, the buckets are compressed to represent ranges of values. The ENDPOINT_VALUE represents the highest value in the bucket and the ENDPOINT_NUMBER represents the cumulative frequency. The following query displays the range, number of values represented and frequency associated with each bucket. The range is displayed as “<=” because popular values can span buckets in high-balanced histograms. Remember, this is sample data.

COLUMN range FORMAT A20

SELECT '<=' || endpoint_value AS range,
       endpoint_value - (LAG(endpoint_value, 1, -1) OVER (ORDER BY endpoint_value, endpoint_number)+1) + 1 AS vals_in_range,
       endpoint_number - LAG(endpoint_number, 1, 0) OVER (ORDER BY endpoint_value, endpoint_number) AS frequency
FROM   user_tab_histograms
WHERE  table_name  = 'TAB1'
AND    column_name = 'RECORD_TYPE'
ORDER BY endpoint_value;

RANGE                VALS_IN_RANGE  FREQUENCY
-------------------- ------------- ----------
<=0                              1        125
<=57                            57          1
<=143                           86          1
<=209                           66          1
<=281                           72          1
<=359                           78          1
<=443                           84          1
<=515                           72          1
<=579                           64          1
<=679                          100          1
<=749                           70          1

RANGE                VALS_IN_RANGE  FREQUENCY
-------------------- ------------- ----------
<=831                           82          1
<=887                           56          1
<=963                           76          1
<=1051                          88          1
<=1119                          68          1
<=1201                          82          1
<=1265                          64          1
<=1343                          78          1
<=1413                          70          1
<=1481                          68          1
<=1571                          90          1

RANGE                VALS_IN_RANGE  FREQUENCY
-------------------- ------------- ----------
<=1647                          76          1
<=1735                          88          1
<=1813                          78          1
<=1875                          62          1
<=1943                          68          1
<=2029                          86          1
<=2113                          84          1
<=2181                          68          1
<=2263                          82          1
<=2335                          72          1
<=2415                          80          1

RANGE                VALS_IN_RANGE  FREQUENCY
-------------------- ------------- ----------
<=2523                         108          1
<=2615                          92          1
<=2675                          60          1
<=2753                          78          1
<=2821                          68          1
<=2899                          78          1
<=2971                          72          1
<=3051                          80          1
<=3119                          68          1
<=3189                          70          1
<=3273                          84          1

RANGE                VALS_IN_RANGE  FREQUENCY
-------------------- ------------- ----------
<=3359                          86          1
<=3445                          86          1
<=3503                          58          1
<=3587                          84          1
<=3679                          92          1
<=3755                          76          1
<=3835                          80          1
<=3911                          76          1
<=3977                          66          1
<=4051                          74          1
<=4159                         108          1

RANGE                VALS_IN_RANGE  FREQUENCY
-------------------- ------------- ----------
<=4253                          94          1
<=4355                         102          1
<=4467                         112          1
<=4567                         100          1
<=4633                          66          1
<=4709                          76          1
<=4779                          70          1
<=4869                          90          1
<=4947                          78          1
<=5025                          78          1
<=5101                          76          1

RANGE                VALS_IN_RANGE  FREQUENCY
-------------------- ------------- ----------
<=5193                          92          1
<=5281                          88          1
<=5363                          82          1
<=5443                          80          1
<=5529                          86          1
<=5621                          92          1
<=5711                          90          1
<=5767                          56          1
<=5841                          74          1
<=5909                          68          1
<=5991                          82          1

RANGE                VALS_IN_RANGE  FREQUENCY
-------------------- ------------- ----------
<=6057                          66          1
<=6115                          58          1
<=6201                          86          1
<=6273                          72          1
<=6363                          90          1
<=6439                          76          1
<=6521                          82          1
<=6605                          84          1
<=6667                          62          1
<=6733                          66          1
<=6811                          78          1

RANGE                VALS_IN_RANGE  FREQUENCY
-------------------- ------------- ----------
<=6895                          84          1
<=6961                          66          1
<=7041                          80          1
<=7113                          72          1
<=7191                          78          1
<=7263                          72          1
<=7333                          70          1
<=7429                          96          1
<=7511                          82          1
<=7567                          56          1
<=7651                          84          1

RANGE                VALS_IN_RANGE  FREQUENCY
-------------------- ------------- ----------
<=7719                          68          1
<=7805                          86          1
<=7877                          72          1
<=7977                         100          1
<=8057                          80          1
<=8115                          58          1
<=8177                          62          1
<=8259                          82          1
<=8345                          86          1
<=8433                          88          1
<=8499                          66          1

RANGE                VALS_IN_RANGE  FREQUENCY
-------------------- ------------- ----------
<=8571                          72          1
<=8645                          74          1
<=8707                          62          1
<=8793                          86          1
<=8875                          82          1
<=8949                          74          1
<=9023                          74          1
<=9085                          62          1
<=9147                          62          1
<=9227                          80          1
<=9309                          82          1

RANGE                VALS_IN_RANGE  FREQUENCY
-------------------- ------------- ----------
<=9385                          76          1
<=9479                          94          1
<=9553                          74          1
<=9641                          88          1
<=9709                          68          1
<=9781                          72          1
<=9849                          68          1
<=9933                          84          1
<=9999                          66          1

SQL>

Ref: https://oracle-base.com/