Oracle Text Index

With Oracle Text indexes (a kind of domain index), we can index text documents and search them by content, using text patterns with specialized text query operators.

An Oracle Text index is different from a traditional B-Tree or Bitmap index. It consists of several components that communicate internally.

In a query application, the table must contain the text or pointers to where the text is stored. The text is usually a collection of documents but can also be short text fragments.

In an Oracle Text index, the text data is not indexed directly. Instead, the text is split into a set of tokens, the tokens are stored in internal database tables, and those tokens are indexed.
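As a minimal sketch of the idea, the statements below create a small table, build a CONTEXT index on it, and search it with the CONTAINS operator. The table name docs, the column body, and the index name docs_text_idx are hypothetical names chosen for illustration, and the session is assumed to have the privileges needed to create a Text index.

```sql
-- Hypothetical table holding short documents
CREATE TABLE docs (
  id   NUMBER PRIMARY KEY,
  body VARCHAR2(4000)
);

INSERT INTO docs VALUES (1, 'Oracle Text indexes documents for content search');
INSERT INTO docs VALUES (2, 'B-Tree indexes support equality and range lookups');
COMMIT;

-- Build a CONTEXT index (a domain index) on the text column
CREATE INDEX docs_text_idx ON docs (body)
  INDEXTYPE IS CTXSYS.CONTEXT;

-- Search by content; SCORE(1) is the relevance of the CONTAINS call labeled 1
SELECT SCORE(1) AS relevance, id, body
FROM   docs
WHERE  CONTAINS(body, 'documents AND search', 1) > 0
ORDER  BY relevance DESC;
```

The rows are inserted before the index is created, so they are tokenized as part of the initial index build.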

Oracle Text Index objects

An Oracle Text index has four internal tables: $I, $K, $N and $R.

The $I table contains the data being indexed: all the tokens (words) generated from the text documents are stored in this table. The tokens in this table are indexed by a B-Tree index with the name format DR${index_name}$X.

The $K table maps external ROWID values to internal DOCID values (fetching a DOCID when we know the ROWID value).

The $R table maps internal DOCID values to external ROWID values (fetching a ROWID when we know the DOCID value). The entries in this table are indexed by a B-Tree index with the name format DRC${index_name}$R.

The $N table contains a list of deleted DOCID values, which are cleaned up by the index optimization process.
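The internal tables can be inspected directly. The sketch below assumes a hypothetical CONTEXT index named DOCS_TEXT_IDX owned by the current user. Depending on the Oracle version and index options, more internal tables than the four described above may be listed.

```sql
-- List the internal tables that back the hypothetical index DOCS_TEXT_IDX
SELECT table_name
FROM   user_tables
WHERE  table_name LIKE 'DR$DOCS_TEXT_IDX$%'
ORDER  BY table_name;

-- Look at the tokens stored in the $I table
SELECT token_text, token_count
FROM   dr$docs_text_idx$i
ORDER  BY token_text;

-- Count the deleted DOCIDs waiting for cleanup in the $N table
SELECT COUNT(*) AS deleted_docids
FROM   dr$docs_text_idx$n;

-- A full optimization removes the deleted entries (EXEC is a SQL*Plus command)
EXEC CTX_DDL.OPTIMIZE_INDEX('docs_text_idx', 'FULL');
```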

Oracle Text Health Check

Oracle Text Status and Version:

A: Status of all CTXSYS objects (invalid objects only):

SELECT * FROM dba_objects
WHERE status != 'VALID' AND owner = 'CTXSYS'
ORDER BY object_type, object_name;

B: Health check of the index:

idx_docid_count is the number of documents indexed. It should be the same as, or close to, the number of rows in the base table. domidx_status is the domain index status.

SELECT c.idx_owner, c.idx_name, c.idx_text_name, c.idx_type,
       c.idx_docid_count, i.status, i.domidx_status
FROM ctxsys.ctx_indexes c, dba_indexes i
WHERE c.idx_owner = 'OWNER'
AND c.idx_name = 'INDEX_NAME'
AND c.idx_owner = i.owner
AND c.idx_name = i.index_name
ORDER BY 2, 3;

C: Compilation errors of invalid Text-related objects:

SELECT owner, name, type, line, position, text
FROM dba_errors
WHERE owner = 'CTXSYS'
   OR (owner = 'SYS' AND (name LIKE 'CTX_%' OR name LIKE 'DRI%'))
ORDER BY owner, name, sequence;

Errors logged while creating or synchronizing Text indexes are recorded in CTX_INDEX_ERRORS:

SELECT * FROM ctxsys.ctx_index_errors
ORDER BY err_timestamp DESC, err_index_owner, err_index_name;

D: Extract the DDL of the existing index:

SELECT CTX_REPORT.CREATE_INDEX_SCRIPT('SCHEMA.INDEX_NAME') FROM DUAL;

Validating the Index Integrity:

A: Validate $K against the base table (no rows should be selected for a valid index):

SELECT *
FROM dr$INDEX_NAME$k k
WHERE NOT EXISTS (SELECT 1
                  FROM TABLE_NAME t
                  WHERE k.textkey = t.rowid);

The keys in the $K table should match the ROWIDs of the base table.

B: Validate $R against $K (no rows should be selected for a valid index):

select  *
from table(ctx_diag.decode_r('dr$INDEX_NAME$R')) r
where not exists (select 1
from dr$INDEX_NAME$k k
where r.textkey = k.textkey);

C: Validate $R by finding duplicates (no rows should be selected for a valid index):

column docids for a40

select  textkey, listagg(docid, ', ') within group (order by docid) docids
from table(ctx_diag.decode_r('dr$INDEX_NAME$R'))
group by textkey
having count(*) > 1;

If none of the validation queries above returns rows, the Text index is consistent.

Types of Oracle Text Indexes

CONTEXT

Use this index to build a text retrieval application when your text consists of large coherent documents.

You can index documents of different formats such as MS Word, HTML or plain text.

You can customize the index in a variety of ways.

This index type requires CTX_DDL.SYNC_INDEX after DML on the base table, unless an automatic sync method is configured.

Note: Transactional CONTEXT indexes. The TRANSACTIONAL parameter of CREATE INDEX and ALTER INDEX makes changes to the base table immediately queryable.
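To see why a sync is needed, it helps to look at the pending queue. The sketch below assumes a hypothetical CONTEXT index named docs_text_idx owned by the current user and uses the CTX_USER_PENDING view, which lists rows that have changed but are not yet reflected in the index.

```sql
-- Rows changed since the last sync are queued for indexing
SELECT pnd_index_name, pnd_rowid, pnd_timestamp
FROM   ctx_user_pending
ORDER  BY pnd_timestamp;

-- Process the queue for one index (EXEC is a SQL*Plus command)
EXEC CTX_DDL.SYNC_INDEX('docs_text_idx');
```

After the sync, the first query should no longer list rows for that index, and the new or changed documents can be found with CONTAINS.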

CTXCAT

Use this index type for better mixed query performance.

Typically, with this index type, you index small documents or text fragments.
Other columns in the base table, such as item names, prices, and descriptions can be included in the index to improve mixed query performance.

This index is larger and takes longer to build than a CONTEXT index.

The size of a CTXCAT index is related to the total amount of text to be indexed, the number of indexes in the index set, and the number of columns indexed.
Consider your queries and your resources before adding indexes to the index set.

This index type is transactional: it updates itself automatically after DML on the base table.

No CTX_DDL.SYNC_INDEX is necessary.
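A minimal sketch of a CTXCAT index with an index set follows. The table auction, the index set auction_iset, and the index auction_title_idx are hypothetical names, and the session is assumed to have execute privilege on CTX_DDL. The price column is added to the index set so that a mixed query can filter and sort on it.

```sql
-- Hypothetical table of short text fragments plus a structured column
CREATE TABLE auction (
  item_id NUMBER PRIMARY KEY,
  title   VARCHAR2(100),
  price   NUMBER
);

-- The index set names the structured columns to include in the CTXCAT index
BEGIN
  CTX_DDL.CREATE_INDEX_SET('auction_iset');
  CTX_DDL.ADD_INDEX('auction_iset', 'price');
END;
/

CREATE INDEX auction_title_idx ON auction (title)
  INDEXTYPE IS CTXSYS.CTXCAT
  PARAMETERS ('INDEX SET auction_iset');

-- Mixed query: text condition on title, structured condition on price
SELECT item_id, title, price
FROM   auction
WHERE  CATSEARCH(title, 'camera', 'price < 500 ORDER BY price') > 0;
```

Each ADD_INDEX call adds to the size and build time of the index, which is why the index set is worth keeping small.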

CTXRULE

Use a CTXRULE index to build a document classification or routing application.

This index is created on a table of queries, where the queries define the classification or routing criteria. 
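The following sketch shows the idea with hypothetical names (doc_rules, query_string, doc_rules_idx). Each row holds a query that defines a category, and the MATCHES operator returns the rows whose query matches a given document text.

```sql
-- Hypothetical table of classification rules; each rule is a text query
CREATE TABLE doc_rules (
  rule_id      NUMBER PRIMARY KEY,
  category     VARCHAR2(30),
  query_string VARCHAR2(2000)
);

INSERT INTO doc_rules VALUES (1, 'Databases', 'oracle OR index');
INSERT INTO doc_rules VALUES (2, 'Sports',    'football OR tennis');
COMMIT;

-- The CTXRULE index is built on the column that holds the queries
CREATE INDEX doc_rules_idx ON doc_rules (query_string)
  INDEXTYPE IS CTXSYS.CTXRULE;

-- Classify an incoming document: find every rule whose query matches it
SELECT category
FROM   doc_rules
WHERE  MATCHES(query_string, 'Rebuilding an Oracle index after a bulk load') > 0;
```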

ALTER INDEX Sync Methods

MANUAL:

No automatic synchronization. This is the default. You must manually synchronize the index with CTX_DDL.SYNC_INDEX.

EVERY:

Automatically synchronize the index at a regular interval specified by the value of interval-string.

ON COMMIT:

Synchronize the index immediately after a commit.   
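The sync method is set through the SYNC parameter. The statements below are a sketch using a hypothetical table docs with a column body and an index docs_text_idx. The five-minute interval is an arbitrary example, and SYNC (EVERY ...) is assumed to need the privilege to create scheduler jobs.

```sql
-- Set the sync method when the index is created (ON COMMIT here)
CREATE INDEX docs_text_idx ON docs (body)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('SYNC (ON COMMIT)');

-- Switch an existing index to a timed sync, here every 5 minutes
ALTER INDEX docs_text_idx REBUILD
  PARAMETERS ('REPLACE METADATA SYNC (EVERY "SYSDATE+5/1440")');

-- Return to manual synchronization
ALTER INDEX docs_text_idx REBUILD
  PARAMETERS ('REPLACE METADATA SYNC (MANUAL)');
```

REPLACE METADATA changes only the index metadata, so the index data itself is not rebuilt.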

TRANSACTIONAL:

Specify that documents can be searched immediately after they are inserted or updated.

If a text index is created with TRANSACTIONAL enabled, then, in addition to processing the synchronized rowids already in the index, the CONTAINS operator will process unsynchronized rowids as well.

To turn on the TRANSACTIONAL index property:

ALTER INDEX myidx REBUILD PARAMETERS('replace metadata transactional');

To turn off the TRANSACTIONAL index property:

ALTER INDEX myidx REBUILD PARAMETERS('replace metadata nontransactional');

Oracle Text Index preferences

DATASTORE:

  1. DIRECT_DATASTORE: Indicates that the data is stored internally in a text column of a database table. Each row is indexed as a single document.
  2. MULTI_COLUMN_DATASTORE: Indicates that the data is stored in a text table in more than one column. The columns are concatenated (joined) to create a virtual document, and each concatenated row is indexed as a single document.
  3. DETAIL_DATASTORE: Indicates that the data is stored internally in a text column of a detail table. A document consists of one or more detail rows, with the header information stored in a master table.
  4. NESTED_DATASTORE: Indicates that the data is stored in nested tables.
  5. FILE_DATASTORE: Indicates that the data is stored in operating system files. This type of data source is supported only for CONTEXT indexes.
  6. URL_DATASTORE: Indicates that the data is stored on the Internet or an intranet and is accessed through URLs.
  7. USER_DATASTORE: Indicates that the documents are synthesized at index time by a user-defined stored procedure.
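As an illustration of a datastore preference, the sketch below defines a MULTI_COLUMN_DATASTORE and uses it in a CONTEXT index. The table articles, its columns, the preference name articles_multi_ds, and the index name are hypothetical, and the session is assumed to have execute privilege on CTX_DDL.

```sql
-- Hypothetical table whose text is spread over several columns
CREATE TABLE articles (
  article_id NUMBER PRIMARY KEY,
  title      VARCHAR2(200),
  summary    VARCHAR2(1000),
  body       VARCHAR2(4000)
);

-- The preference lists the columns to concatenate into one virtual document
BEGIN
  CTX_DDL.CREATE_PREFERENCE('articles_multi_ds', 'MULTI_COLUMN_DATASTORE');
  CTX_DDL.SET_ATTRIBUTE('articles_multi_ds', 'COLUMNS', 'title, summary, body');
END;
/

-- The index is created on one column, but indexes the concatenated document
CREATE INDEX articles_text_idx ON articles (body)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE articles_multi_ds');
```

A query such as CONTAINS(body, 'oracle') > 0 then matches rows where the word appears in any of the three columns.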

FILTER:

In this phase, the text stream is converted to a format that the Oracle Text processing engine recognizes.

SECTIONER:

The task of the sectioner is to divide the incoming text stream into multiple sections based on the internal document structure (HTML or XML).

LEXER:

This preference specifies the language of the incoming document and controls how the text is broken into tokens.
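The filter, section group, and lexer are all named in the PARAMETERS clause of CREATE INDEX. The sketch below assumes a hypothetical table pages with an html column that holds HTML text. The names pages_html_sg, pages_lexer, and pages_html_idx are made up for illustration.

```sql
-- Hypothetical table of HTML pages
CREATE TABLE pages (
  page_id NUMBER PRIMARY KEY,
  html    CLOB
);

BEGIN
  -- Section group: treat the content of the <TITLE> tag as a zone section named title
  CTX_DDL.CREATE_SECTION_GROUP('pages_html_sg', 'HTML_SECTION_GROUP');
  CTX_DDL.ADD_ZONE_SECTION('pages_html_sg', 'title', 'TITLE');

  -- Lexer: basic whitespace-delimited tokenization, folding accented letters to base letters
  CTX_DDL.CREATE_PREFERENCE('pages_lexer', 'BASIC_LEXER');
  CTX_DDL.SET_ATTRIBUTE('pages_lexer', 'BASE_LETTER', 'YES');
END;
/

-- NULL_FILTER: the column already holds text, so no format conversion is needed
CREATE INDEX pages_html_idx ON pages (html)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('FILTER CTXSYS.NULL_FILTER SECTION GROUP pages_html_sg LEXER pages_lexer');

-- Search only inside the title section
SELECT page_id
FROM   pages
WHERE  CONTAINS(html, 'oracle WITHIN title') > 0;
```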

How to SYNC_INDEX after DML on base table

The following script synchronizes a CONTEXT index with its base table:

# Point the environment at the target instance
export ORACLE_SID=SID_NAME
export LD_LIBRARY_PATH=$ORACLE_HOME/lib:$ORACLE_HOME/ctx/lib:$ORACLE_HOME/lib32:/usr/lib
# Connect as SYSDBA and synchronize the index
sqlplus "/ as sysdba" << EOF
exec ctx_ddl.sync_index(idx_name => 'SCHEMA.INDEX_NAME');
exit;
EOF


ref: https://www.dbconcepts.com/oracle-text-index/