Statistics used by the optimizer when generating the execution plan are stored in the data dictionary. Statistics about physical storage characteristics and data distribution in the various schema objects are generated by using the DBMS_STATS package.
To get the right execution plans, statistics should be maintained not just when the database is created and after the initial load of data but on a regular basis based on the frequency in which the data volume changes. If no statistics are available or the statistics are stale, the execution plans generated by the optimizer may not be efficient enough. Under these conditions, the optimizer performs dynamic sampling depending on the setting of the OPTIMIZER_DYNAMIC_SAMPLE parameter. This sampling may cause slower parse times; so for best performance, the optimizer should have representative optimizer statistics.
Statistics at the table level include number of rows, number of blocks, and row length. Column statistics such as number of distinct values, number of nulls in columns, data distribution, or histograms. Index statistics include number of leaf blocks, levels, and clustering factors. System statistics include I/O performance and CPU performance. All of these statistics are important and change with the volume of data changes or if the structure of the objects changes. The statistics collected are then maintained in the data dictionary tables of the database.
Starting with Oracle database 10g, Oracle automatically starts gathering statistics once every day. Although Oracle recommends using the automated process, in certain cases, it may be required to use homegrown procedures to collect statistics, either because such a frequent collection interval may not be required or the automated collection process maybe scheduled with conflicts with other processes.
The following procedure could be used to verify if the automatic gathering of statistics has been enabled:
SELECT program_action,
number_of_arguments NOA,
enabled
FROM dba_scheduler_programs
WHERE owner = 'SYS'
AND program_name LIKE 'GATHER%';
PROGRAM_ACTION NOA ENABL
--------------------------------------------- ---------- -----
dbms_stats.gather_database_stats_job_proc 0 TRUE
DBA_AUTOTASK_TASK displays all jobs that have been scheduled for automatic execution. As seen from the query output following, the GATHER_STATS_PROG is enabled for automatic statistics collection:
SQL> SELECT TASK_NAME, STATUS FROM DBA_AUTOTASK_TASK;
TASK_NAME STATUS
---------------------------------------------- --------
AUTO_SQL_TUNING_PROG ENABLED
auto_space_advisor_prog ENABLED
gather_stats_prog ENABLED
The GATHER_DATABASE_STATS_JOB_PROC prioritizes database objects that require statistics so that objects that most need updated statistics are processed first, before the maintenance window closes.
Statistics can also be gathered manually using the DBMS_STATS package. It is used to gather statistics on tables and indexes and individual columns and partitions of tables. However, it does not gather cluster statistics. This is accomplished by collecting statistics on the individual tables instead of the whole cluster. DBMS_STATS collects statistics at the following levels:
GATHER_INDEX_STATS
GATHER_TABLE_STATS
GATHER_SCHEMA_STATS
GATHER_DICTIONARY_STATS
GATHER_DATABASE_STATS
When gathering statistics on a table, DBMS_STATS gathers information about the data distribution of the columns within the table. The most basic information about the data distribution, such as the maximum and minimum values for the column, may not be sufficient for the optimizer’s needs if the data within the column is skewed. For skewed data distributions, histograms will have to be created as part of the column statistics to describe the data distribution of a given column. This is particularly helpful for data warehouse implementations where skewed data is normally present, and histograms help the optimizer to generate efficient execution plans. Histograms provide improved selectivity estimates in the presence of data skew, resulting in optimal execution plans with nonuniform data distributions.