docs-en

GridDB SQL Tuning Guide

table of contents


Introduction

Purpose and organization of this document

This document describes GridDB SQL tuning.

This document is intended for developers who develop systems using SQL with GridDB.

The contents of this manual are as follows.

For optimization rules and tuning, the distinctive features of GridDB are mainly descried.

Tuning steps

SQL tuning is performed by following the following steps, as in the case of general database systems.

STEPs 1 to 3 will be explained in more detail in the following sections.

Check a slow query

Identify which of the queries running on the system are slow queries that are taking a long time.

A slow query that takes a long execution time can be set to output information about the query and the execution time to the event log. This helps identify which of the queries executed from an application are causing a bottleneck.

The procedure to check the slow query is as follows.

  1. Set the parameters to output slow query settings
  1. Execute a query.
  1. Check slow queries

Obtain the plan

Execute the slow query identified in STEP 1 and get the query plan (global plan).

Use the operation tool gs_sh to get a plan. Execute the query with EXPLAIN ANALYZE syntax and get the plan with the sub-command getplantxt.

(1) Execute the query in the “EXPLAIN ANALYZE” syntax.

gs[public]> sql EXPLAIN ANALYZE select * from table1, table2 where table1.value=1 and table1.id=table2.id ;
A search was executed. (19 ms)

(2) Get the plan.

gs[public]> getplantxt
Id Type       Input Rows Lead time Actual time Node                 And more..
--------------------------------------------------------------------------------------------------------------------
 0 SCAN       -     -           30          30 192.168.15.161:10001 table: {table1}
 1   SCAN     0     3000        21          21 192.168.15.161:10001 table: {table1, table2} INDEX SCAN JOIN_EQ_HASH
 2     RESULT 1     14           0           0 192.168.15.161:20001

GridDB processes SQL as follows: A node parses queries and generates a plan. The plan consists of plans for each task, which is a unit of execution, and each node executes assigned tasks.

SQL plan and the flow of processing
SQL plan and the flow of processing

The sub-command getplantxt for showing a plan displays a per-task plan, line by line. An output of a task becomes the input of the next task.

See the execution example of “(2) Get the plan” above and check the following.

Tune the query

In the scan processing under a search condition in a WHERE clause and in the scan processing for joining a table, the performance may greatly vary depending on whether the index is used or not. In the latter case, the performance also depends on the join order. Note these points when tuning a query.

(1) Analyze a plan

Analyze a plan to check whether the index is used and the queries are executed as intended. See the SQL plan for the details of a plan.

(2) Tuning

Tune using the following methods depending on the problem identified from the analysis of a plan.

Example: The following example illustrates how to tune a query for joining tables.

Get a plan for a query which joins table1 and table2 with the value of the column value.

gs[public]> EXPLAIN ANALYZE select * from table1, table2 where table1.value=0 and table1.value=table2.value;
A search was executed. (13 ms)
gs[public]> getplantxt
Id Type       Input Rows  Lead time Actual time Node                 And more..
-------------------------------------------------------------------------------------
 0 SCAN       -     -            20          20 192.168.15.161:10001 table: {table1}
 1 SCAN       -     -             9           9 192.168.15.161:10001 table: {table2}
 2   JOIN     0,1   10000,3000  891         891 192.168.15.161:20001 JOIN_EQ_HASH
 3     RESULT 2     32            2           2 192.168.15.161:20001

In this plan, both table1 and table2 are scanned, and join is processed without using an index.

The index information of the table shows the column value is not indexed; therefore, create an index.

Execute the same query to get a plan.

gs[public]> EXPLAIN ANALYZE select * from table1, table2 where table1.value=0 and table1.value=table2.value;
A search was executed. (7 ms)
gs[public]> getplantxt
Id Type       Input Rows Lead time Actual time Node                 And more..
--------------------------------------------------------------------------------------------------------------------
 0 SCAN       -     -           20          20 192.168.15.161:10001 table: {table1}
 1   SCAN     0     10000       80          80 192.168.15.161:10001 table: {table1, table2} INDEX SCAN JOIN_EQ_HASH
 2     RESULT 1     32           3           3 192.168.15.161:20001

“INDEX SCAN” is displayed for plan ID2, which shows the plan has changed to use an index in join processing.

Optimize SQL

Scan using an index

In a scan processing to find part of data from the entire table, the “index scan” using the table index is often faster than the “full scan” which accesses all the rows of the table. The index scan can be used in the following processing.

The index scan is more effective when the search condition of a WHERE clause can narrow down the data by reducing the hit ratio to the number of rows of a table. This is especially true of the first case above.

Index selection rules (processing the search condition)

This section explains rules to select an index for processing the search condition in the WHERE clause.

Rules vary depending on the operator or expression specified.

AND

Basically, when indexes are set on the columns used as search conditions, the first index is used.

Example:

a>1 AND b=2 (Both "a" and "b" are indexed)

There may be exceptional cases where the first index is not used, as in the following cases:

If conditions for the same column overlap, merge the conditions before using the index.

Example:

a>1 AND a<=4 AND a<=3 ("a" is indexed)

OR

For OR, the index is used only when all the columns specified as search conditions are indexed.

Example:

a>1 OR b=2 (Both "a" and "b" are indexed)

Example:

a>1 OR b=2 ("b" is indexed)

Expressions for a comparison operator

The index is used only when the expressions for a comparison operator consist of a single column value and a constant expression.

Example:

a> 10*1000-1 ("a" is indexed)

Example:

a+1>10*1000-1 ("a" is indexed)

Example:

a>b (Both "a" and "b" are indexed)

IN, BETWEEN

For IN and BETWEEN interpreted as expressions combining AND/OR and comparison operators, the rules mentioned above are applied.

Example:

a IN (1,2)        → a=1 OR a=2     ("a" is indexed)
a BETWEEN 1 AND 2 → a>=1 AND a<=2   ("a" is indexed)

[Memo]

Index selection rules (processing to find the maximum and minimum values)

The following explains index selection rules used in finding the maximum and minimum values using the aggregate function.

To calculate the MIN and MAX functions that aggregate indexed columns in the column list for the SELECT clause, use indexes as a general rule. Indexes are also used when the column list for the SELECT clause only contain an expression that combines these two aggregate functions and constant expressions.

[Memo]

Example:

SELECT MIN(a)         FROM table1            ("a" is indexed)
SELECT MAX(a)         FROM table1 WHERE a<10 ("a" is indexed)
SELECT MAX(a)-MIN(a)  FROM table1            ("a" is indexed)

Example:

SELECT MIN(a), MAX(b) FROM table1            (Both "a" and "b" are indexed)

Example:

SELECT MAX(a)         FROM table1 WHERE a=10 ("a" is indexed)
SELECT MAX(a)         FROM table1 WHERE b<10 ("a" is indexed; no partitioning setting)
SELECT MIN(a), SUM(a) FROM table1            ("a" is indexed)
SELECT MIN(a), MAX(b) FROM table1 WHERE a<10 (Both "a" and "b" are indexed)

Example:

SELECT MAX(a)         FROM table1 WHERE b>=TIMESTAMP('2024-01-01T00:00:00Z')
    ("a" is indexed; b denotes a partitioning key for an one-day interval)

Composite index selection rules

This section explains rules to select a composite index for scan processing when optimizing GridDB SQL.

The range that uses an index differs depending on the column or the operator specified in the search condition. The part from the beginning of a column that comprises a composite index to the end of consecutive column conditions joined by the AND condition is used as the composite index.

[Memo]

Example:

where col1 = 1 and col2 = 1 and col3 = 2 (with a composite index consisting of col1, col2, and col3)

Join

This section describes how to determine the method of performing the operation of joining multiple tables for SQL optimization in GridDB.

For join processing without using indexes, the join order and the join operation method greatly influence the performance.

How to determine the join order

In executing an SQL statement that contains three or more table joins, it is required to determine which table should be joined in what order. Two approaches are available for determining the order: cost-based and rule-based. Whichever approach can be selected according to the settings described below.

Setting the join order method

The join order method can be switched between cost-based and rule-based; the method can be set as follows:

[Memo]

Cost-based method

The cost-based method for determining the join order is a method for determining the order with the lowest estimated cost for executing join operations. The cost here refers to the rough estimate of the number of rows in the tables to be joined and the join results (intermediate result). In joining multiple tables, there can be many choices for join operations, as to which of the tables are joined and in what order they are joined. For each one of the choices for join operations, a rough estimate of the number of rows that are input and output for join operations is calculated, and the one which results in the smallest estimated number of rows that are joined upon the execution of SQL statements. Because the choice which results in the smaller number of rows to be operated performs operations faster, the one that is ultimately chosen will be the optimal join order.

The estimated number of rows that are join output is found by an evaluation made by GridDB by following the two criteria from the two different perspectives given below:

To determine the degree of filtering, the following evaluation criteria are used:

[Memo]

Because the cost-based and rule-based methods employ different criteria in determining the join order, the two methods may result in different join order. For details about the rule-based method, see Rule-based method.

The following simple example illustrates how the order of joining three tables (A, B, and C) is determined.

Example

  SELECT * FROM A,B,C
  WHERE A.x = B.x AND B.y = C.y

In the cost-based method, the given SQL statement and the estimated number of rows in each table are referenced to analyze the cost. In this example, the estimated number of rows in Table A is much greater than the number in the rest of the tables. In this case, the cost of executing the join operation for Table A is judged as high. As a result, a table join plan shown in the figure below is generated where the join operations for Tables B and C are executed first and the join operation for Table A is executed at the end.

Join order (as determined by the cost-based method)
Join order (as determined by the cost-based method)

In the rule-based method, a plan is generated by analyzing only the given SQL statement. In the example SQL statement, a join order plan as shown in the figure below is generated according to the order of the join descriptions. While the join operation for Table A with many rows would likely result in a greater number of outputs, a plan is generated in such a way that the operation result of Table A is joined with Table C, according to the description of the SQL statement.

Join order (as determined by the rule-based method)
Join order (as determined by the rule-based method)

[Memo]

Rule-based method

The rule-based method for determining the join order is a method for determining the order according to the degree of joins of the SQL statement and the degree of filtering.

[Memo]

Selecting the driving table

In join processing, the table accessed first is called the driving table, and the table accessed next and joined is called the internal table.

The order of joins (driving table and inner table) is determined by the following rules.

When joining two tables, a table which has a conditional expression equal to a constant becomes a drive table.

Example

t1.a=t2.x AND t2.y=1

Method for determining join application

This section explains the rules for using indexes in join processing.

An index is used if all of the following five rules are true:

[Memo]

Method for determining join selection

When the join syntax applies to the method for determining join application, an index is used to join. This section explains the rules to choose the index used to join.

Basically, the rules for using an index are the same as the rules for choosing an index to scan. All indexes set for columns are used in the order described.

Note that in some cases, not all the indexes are used, as below.

When indexes are set on the columns used as search conditions, the first index is used.

Example:

  t1.a=t2.x AND t1.b>t2.y AND t2.z=1 (a and b are indexed, the driving table is t2, and the internal table is t1)

In the OR condition (A OR B), when B is a false constant, the index for A is not used.

Example:

  t1.a=t2.x AND (t1.b=t2.y OR false) AND t2.z=1 (a and b are indexed, the driving table is t2, and the internal table is t1)

Method for determining how join operations are executed

The following three methods for join operation are available: hash join, sort merge, and nested loop join.

Join operation method Description
Hash join Create a temporary table in the memory by hashing the join key of the driving table and then compare the table to the hash value of the internal table.
Sort merge join Sort the two tables to be joined by the join key, compare them in order and join them.
Nested loop join For the values of the join key of the driving table, find and join the data of the inner table that matched the join conditions.

Hash join is the quickest, followed by sort merge join and nested loop join.

Choose one of these operation methods based on the type of the first join condition.

First condition Method to choose
Equivalent condition Hash join or sort merge join
(Even if hash join is chosen, some of the operations may switch to sort merge join due to limits on memory resources.)
Comparison condition Sort merge join
- Nested loop join

The first join condition of the join is selected in the following priority.

Hint phrase

An execution plan can be controlled by specifying a hint in a query to modify the execution plan, without changing the SQL statement.

[Points to note]

Terminology

The following table provides the definition of terms for the hint function.

Term Description
Hint phrase Information for controlling the execution plan
Hints A list of hint phrases. Specified in the query to control an execution plan.

Specifying hints

Write a hint in the block comment of the query to control the execution plan. The block comment for the hint can only be written immediately before or after the first SELECT (INSERT/UPDATE/DELETE) statement in SQL. To distinguish a hint comment from regular comments, the block comment for the hint begins with “/*+”.

The target to give a hint is specified by the object name or alias in parentheses. The targets are separated by either space, tab, or newline.

In the following example, the Leading hint clause specifies the table join order.

/*+
Leading(t3 t2 t1)
 */
SELECT *
  FROM t1, t2, t3
    ON t1.x = t2.y and t2.y = t3.z
  ORDER BY t1.x
  LIMIT 10;

[Memo]

List of hint phrases

The following table gives available hint phrases.

Category Operation Description
Parallelism MaxDegreeOfTaskInput(upper_limit) Maximum number of inputs for one task.
  MaxDegreeOfExpansion(upper_limit) Maximum number of expansions for plan nodes.
Scanning method IndexScan(table) Index scan is used if possible.
  NoIndexScan(table) No index scan is used.
Join order CostBasedJoin() Determine the join order based on cost.
  NoCostBasedJoin() Do not determine the join order based on cost; determine the join order based on rules.
  TableRowCount(number of table rows) Give the estimate number of rows to determine the join order based on cost. (applicable only when the join order is determined based on cost)
Joining method IndexJoin(table table) Using index join if possible.
  NoIndexJoin(table table) No index join is used.
Table joining order Leading(table table [table …]) Join specified tables in the specified order.
  Leading(( table set table set )) Join the first specified table set as the outer table and
the second table set as the inner table indicates single table or table set)
Number of intermediate rows generated MaxGeneratedRows (upper limit)           Maximum number of rows generated in a task for interpolation operations that use the GROUP BY RANGE clause.

Table set = { table or ( table set table set )

Details of hint phrases

This section details each category of hint phrases.

Parallelism

Control parallelization processing.

Scanning method

Specify the scanning method.

Join order

Specify how to determine the join order when given three or more tables.

Joining method

Specify which joining method to select for a table combination.

Table joining order

Specify in what order the tables are joined.

(1) Specify only the join order: Leading(table table [table …])

Specify the table names or aliases in order from the first table to be joined. In this method, only Left-deep join orders are used.

[Example 1]

/*+ Leading(S R Q P) */
SELECT * FROM P,Q,R,S WHERE P.x = Q.x AND ...
Table joining order (Example 1)
Table joining order (Example 1)

(2) Specify the join order and direction: Leading(( table set table set ))

table set = { table or ( table set table set ) }

If only the join order is specified as in (1), the join direction (different for outer table or inner table) may differ from what is expected. To fix the join direction, use the following expression.

/*+ Leading((t1 (t2 t3))) */
SELECT ...

In this expression, parentheses can be nested. It joins the first specified table set as the outer table and the second set as the inner table.

[Example 2-1]

/*+ Leading(((P Q) R)) */
SELECT * FROM P,Q,R WHERE P.x = Q.x AND ...
Table joining order (Example 2-1)
Table joining order (Example 2-1)

[Example 2-2]

/*+ Leading((R (Q P))) */
SELECT * FROM P,Q,R WHERE P.x = Q.x AND ...
Table joining order (Example 2-2)
Table joining order (Example 2-2)

[Example 2-3]

/*+ Leading(((P Q) (R S))) */
SELECT * FROM P,Q,R,S WHERE P.x = Q.x AND ...
Table joining order (Example 2-3)
Table joining order (Example 2-3)

[Memo]

Number of intermediate rows generated

Specify the maximum number of intermediate rows generated by a single task. This number is applied to a task for interpolation operations that use the GROUP BY RANGE clause.

Error handling

In the following cases, a syntax error occurs.

In the following case, a table specification error occurs:

[Memo]

SQL plan (execution plan)

The EXPLAIN ANALYZE statement of SQL indicates what operation and index were chosen by SQL optimization.

In SQL processing, SQL syntax is analyzed, optimized, and divided into processing units called “tasks”, such as join and sort scan, and then a plan (execution plan) is generated.

Tasks are executed on any one of the nodes which constitute a cluster, in parallel execution, exchanging data between these tasks.

[Memo]

Execution of the EXPLAIN ANALYZE statement will output information, including the task plans, execution time, in JSON format for each task in a line.

The main items to be output are as follows.

Item Description
id Plan ID
type Type of processing
inputList List of plan IDs of the plans to be input
profile/plan Cost information estimated during plan generation (when the join order is determined based on cost)
profile/leadtime Processing time
profile/rows Number of inputs
profile/address The address and port number of the node which executed the processing

[Memo]

The types of processing are as follows.

Value of type Description
GROUP Grouping operation
JOIN Join operation
LIMIT Filter operation of the number of lines
SCAN Table scan operation
SELECT Selection operation (condition filter projection)
SORT Sort operation
UNION Combine and set operation
INSERT, UPDATE, DELETE Various operations to change the container
DDL DDL/DCL statement
RESULT Retain the result

A plan can be obtained using the operation tool gs_sh.

Example: To get the plan of a query “select * from table1, table2 where table1.value = 0 and table1.id = table2.id”

gs[public]> EXPLAIN ANALYZE select * from table1, table2 where table1.value=0 and table1.id=table2.id;
A search was executed. (11 ms)
gs[public]> getplantxt
Id Type       Input Rows Lead time Actual time Node                 And more..
--------------------------------------------------------------------------------------------------------------------
 0 SCAN       -     -            0           0 192.168.15.161:10001 table: {table1} INDEX SCAN
 1   SCAN     0     0            2           2 192.168.15.161:10001 table: {table1, table2} INDEX SCAN JOIN_EQ_HASH
 2     RESULT 1     0            0           0 192.168.15.161:20001

Execute the getplanjson subcommand with gs_sh to output the plan in JSON format.

[Notes]

Example of JSON format of a plan

{
  "nodeList" : [ {
    "cmdOptionFlag" : 65,
    "id" : 0,
    "indexInfoList" : [ 3, 3 ],
    "inputList" : [ ],
    "outputList" : [ {
      "columnId" : 0,
      "columnType" : "INTEGER",
          :
          :
}

Copyright (c) 2017 TOSHIBA Digital Solutions Corporation