First and foremost, it must be stated that handling data in the tens of millions is generally not an optimal use case for MySQL.
There are several strategies for optimizing MySQL with tens of millions of records:
- Sharding (splitting tables and databases)
- Creating summary tables
- Modifying queries to use multiple subqueries
In this discussion, we consider a scenario where a single MySQL table holds tens of millions of records, the table design is poor, and business rules do not allow the query to be split into multiple smaller queries. Under these constraints, developers can still try to meet their query objectives by optimizing the SQL itself. This article therefore focuses on SQL optimization strategies for such extreme cases.
Generating Tens of Millions of Records
We will use stored procedures to generate 10 million records.
The table structures are as follows:
CREATE TABLE `orders` (
  `order_id` int NOT NULL AUTO_INCREMENT,
  `user_id` int DEFAULT NULL,
  `order_date` date NOT NULL,
  `total_amount` decimal(10,2) NOT NULL,
  PRIMARY KEY (`order_id`),
  KEY `idx_user_id` (`user_id`) USING BTREE,
  KEY `idx_user_amount` (`user_id`,`total_amount`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;
CREATE TABLE `users` (
  `user_id` int NOT NULL AUTO_INCREMENT,
  `username` varchar(50) COLLATE utf8mb4_general_ci NOT NULL,
  `email` varchar(100) COLLATE utf8mb4_general_ci NOT NULL,
  `created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`user_id`),
  KEY `idx_user_id` (`user_id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;
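Note that `idx_user_id` on `users` duplicates the primary key, so it only adds write overhead; similarly, `idx_user_id` on `orders` is a leftmost prefix of `idx_user_amount`. An optional cleanup sketch (kept commented out here, since the walkthrough below adds and drops indexes step by step):
-- The primary key already indexes users.user_id, so this secondary index is redundant
-- DROP INDEX idx_user_id ON users;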
The stored procedures for generating data are as follows.
User data:
-- Stored procedure to generate 1,000 users
-- (created under a custom DELIMITER so the client does not end the statement at the first ";")
DELIMITER $$
CREATE DEFINER=`root`@`localhost` PROCEDURE `create_users`()
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE total_users INT DEFAULT 1000; -- Adjust the number of users here
  DECLARE rnd_username VARCHAR(50);
  DECLARE rnd_email VARCHAR(100);
  WHILE i < total_users DO
    -- Generate a random username and a matching email
    SET rnd_username = CONCAT('User', FLOOR(1 + RAND() * 10000000)); -- Assumed unique; collisions are possible but rare
    SET rnd_email = CONCAT(rnd_username, '@example.com'); -- Assumed unique email
    -- Insert the row into the users table
    INSERT INTO users (username, email) VALUES (rnd_username, rnd_email);
    SET i = i + 1;
  END WHILE;
END$$
DELIMITER ;
Order data generation stored procedure:
DELIMITER $$
CREATE DEFINER=`root`@`localhost` PROCEDURE `generate_orders`()
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE total_users INT DEFAULT 1000;           -- Number of users
  DECLARE total_orders_per_user INT DEFAULT 1000; -- Number of orders per user
  DECLARE rnd_user_id INT;
  DECLARE rnd_order_date DATE;
  DECLARE rnd_total_amount DECIMAL(10, 2);
  DECLARE j INT DEFAULT 0;
  WHILE i < total_users DO
    -- Fetch the i-th user ID (ORDER BY makes the LIMIT offset deterministic)
    SELECT user_id INTO rnd_user_id FROM users ORDER BY user_id LIMIT i, 1;
    WHILE j < total_orders_per_user DO
      -- Generate a random order date and total amount
      SET rnd_order_date = DATE_ADD('2020-01-01', INTERVAL FLOOR(RAND() * 1096) DAY); -- Random date between 2020-01-01 and 2022-12-31
      SET rnd_total_amount = ROUND(RAND() * 1000, 2); -- Random total amount between 0 and 1000
      -- Insert the row into the orders table
      INSERT INTO orders (user_id, order_date, total_amount) VALUES (rnd_user_id, rnd_order_date, rnd_total_amount);
      SET j = j + 1;
    END WHILE;
    SET j = 0;
    SET i = i + 1;
  END WHILE;
END$$
DELIMITER ;
Separating the generation of `users` and `orders` data allows the `generate_orders` stored procedure to be called from multiple sessions in parallel. First run `CALL create_users();`, then open 15 client windows and run `CALL generate_orders();` in each. This produces 1,000 users and 15 × 1,000 × 1,000 = 15 million order records.
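As a quick sanity check before benchmarking (a minimal sketch, assuming all 15 sessions have finished):
-- Expect about 1,000 users and 15,000,000 orders
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM orders;
-- Each user should have roughly 15,000 orders (15 sessions x 1,000 each)
SELECT user_id, COUNT(*) AS cnt FROM orders GROUP BY user_id LIMIT 10;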
Original SQL
This is a very simple SQL query that calculates the total order amount for each user. With no secondary indexes in place yet (the walkthrough below adds them step by step), this query takes more than 190 seconds to execute.
-- First version
SELECT a.*, sum(b.total_amount) as total
FROM users a
LEFT JOIN orders b ON a.user_id = b.user_id
GROUP BY a.user_id;
The explain analysis is as follows:

As we can see, no indexes are used and the type is “ALL”, meaning a full table scan.
Execution time: 191 seconds.
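To reproduce the measurement yourself, here is a sketch assuming MySQL 8.0.18 or later (which added `EXPLAIN ANALYZE`; on older versions, use plain `EXPLAIN` and the client's reported wall time):
-- Prints the actual execution plan with per-step row counts and timings
EXPLAIN ANALYZE
SELECT a.*, SUM(b.total_amount) AS total
FROM users a
LEFT JOIN orders b ON a.user_id = b.user_id
GROUP BY a.user_id;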
First Optimization: Regular Indexes
Create indexes on all columns the SQL uses, including those referenced in `WHERE` conditions, `JOIN` conditions, and the `SUM()` aggregate.
CREATE INDEX idx_orders_user_id ON orders (user_id);
CREATE INDEX idx_orders_total_amount ON orders (total_amount);
CREATE INDEX idx_users_user_id ON users (user_id);
The SQL query remains the same:
-- First version
SELECT a.*, sum(b.total_amount) as total
FROM users a
LEFT JOIN orders b ON a.user_id = b.user_id
GROUP BY a.user_id;
Let’s check the explain result:

The type is now “index” or “ref”, and every table access uses an index.
However, the result is disappointing: the execution time increased to more than 460 seconds.
The likely cause is MySQL's table lookup mechanism: the single-column index on `user_id` does not contain `total_amount`, so for every matching index entry MySQL must go back through the primary key to fetch the amount, and with 15 million rows those extra lookups cost more than a straight scan.
Second Optimization: Covering Indexes
A covering index is an index that contains all the columns needed to satisfy the query, allowing the query to be executed directly from the index without accessing the actual data rows.
-- Do not delete regular indexes yet
-- drop INDEX idx_orders_user_id ON orders;
-- drop INDEX idx_orders_total_amount ON orders;
CREATE INDEX idx_orders_total_amount_user_id ON orders (total_amount, user_id);
CREATE INDEX idx_orders_user_id_total_amount ON orders (user_id, total_amount);
Creating indexes on 15 million records takes over 300 seconds, so indexing should be done judiciously.
The SQL query remains the same:
-- First version
SELECT a.*, sum(b.total_amount) as total
FROM users a
LEFT JOIN orders b ON a.user_id = b.user_id
GROUP BY a.user_id;
Let’s check the explain result:

We can see that the `orders` table's type has changed from "index" to "ref".
The query time has now decreased from 460+ seconds to 10 seconds.
This result proves that covering indexes can improve query performance.
The key observation here is that only the `idx_orders_user_id_total_amount` index reduced the query time, while `idx_orders_total_amount_user_id` did not. This follows the leftmost-prefix rule: the join looks up rows by `user_id`, so the index is only usable when `user_id` is its leading column. It also lines up with the logical execution order of MySQL's clauses (hypothetical, as no definitive source was found):
MySQL logical execution order:
FROM → ON → JOIN → WHERE → GROUP BY → HAVING → SELECT → DISTINCT → UNION (ALL) → ORDER BY → LIMIT
The join condition on `user_id` is evaluated before the `SELECT` list's `SUM()` function, which matches the column order of `idx_orders_user_id_total_amount` (`user_id` first, `total_amount` second).
drop INDEX idx_orders_user_id ON orders;
drop INDEX idx_orders_total_amount ON orders;
drop INDEX idx_orders_total_amount_user_id ON orders;
Dropping the now-redundant indexes shows that the execution time remains unchanged at about 10 seconds.
Thus, the optimization strategy here is to design a covering index that the query can actually hit.
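To verify that a query is served entirely from the index, check the `Extra` column of `EXPLAIN` for "Using index". A minimal probe against the `orders` table above:
-- With idx_orders_user_id_total_amount in place, Extra should show "Using index",
-- meaning no lookup back to the clustered index is needed
EXPLAIN
SELECT user_id, SUM(total_amount) AS total
FROM orders
GROUP BY user_id;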
Third Optimization: Reducing Data Volume
Reducing the data volume involves removing unnecessary data or making architectural adjustments. One way to achieve this is by splitting tables.
By using this approach, the data volume can be reduced from tens of millions to millions or even hundreds of thousands, significantly improving query speed.
-- Third Optimization: Reducing Data Volume
SELECT a.*, sum(b.total_amount) as total
FROM users a
LEFT JOIN orders b ON a.user_id = b.user_id
WHERE a.user_id > 1033
GROUP BY a.user_id;
The explain result is as follows:

As we can see, the `users` table's type is "range", filtering out a portion of the data.
The query time decreased from 10 seconds to 7 seconds, proving that reducing data volume is effective.
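Beyond filtering in the query, data volume can be reduced permanently by moving cold rows out of the hot table. A minimal archiving sketch (the `orders_archive` name and the cutoff date are hypothetical; on 15 million rows, run the DELETE in batches in production):
-- Move orders older than the cutoff into an archive table
CREATE TABLE orders_archive LIKE orders;
INSERT INTO orders_archive
SELECT * FROM orders WHERE order_date < '2021-01-01';
DELETE FROM orders WHERE order_date < '2021-01-01';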
Fourth Optimization: Small Table Driving Large Table
In MySQL, the optimizer typically selects the most suitable driving table based on query conditions and table size.
The “small table driving large table” strategy involves selecting the smaller table as the driving table in a join query to reduce memory usage and processing time.
Based on the results of the third optimization, let’s try this strategy.
-- Fourth version, small table driving large table, no significant effect
SELECT a.*, sum(b.total_amount) as total
FROM users a
LEFT JOIN (SELECT user_id, total_amount FROM orders c WHERE c.user_id > 1033) b ON a.user_id = b.user_id
WHERE a.user_id > 1033
GROUP BY a.user_id;
This change turns the `LEFT JOIN` table into a subquery, filtering some of the data in advance.
The explain result is as follows:

As we can see, the explain result doesn’t change much, and the actual execution performance remains the same.
While the “small table driving large table” strategy isn’t effective here, it can still be a valid optimization strategy depending on the specific business logic.
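If you want to pin the join order yourself instead of relying on the optimizer, MySQL's `STRAIGHT_JOIN` forces the left table to drive the join. A sketch (note that `STRAIGHT_JOIN` behaves as an inner join, so it is not a drop-in replacement for the `LEFT JOIN` above):
-- Force users (the small table) to drive the join against orders
SELECT a.*, SUM(b.total_amount) AS total
FROM users a
STRAIGHT_JOIN orders b ON a.user_id = b.user_id
WHERE a.user_id > 1033
GROUP BY a.user_id;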
Fifth Optimization: Forcing Index Use
When the `IN` clause is used in queries over tens of millions of records, poorly designed indexes can make index access inefficient and hurt query performance.
Normally the MySQL optimizer selects the best execution plan, including the appropriate indexes; however, for large-volume `IN` queries, MySQL may fail to use indexes effectively, leading to full table scans or inefficient index use.
The following SQL demonstrates this: forcing the index and not forcing it yield similar results here, because the `IN` list values are not sparse:
-- Fifth version, forcing index
SELECT a.*, sum(b.total_amount) as total
FROM users a
LEFT JOIN orders b FORCE INDEX (idx_orders_user_id_total_amount) ON a.user_id = b.user_id
WHERE b.user_id IN (1033, 1034, 1035, 1036, 1037, 1038)
GROUP BY a.user_id;
-- Fifth version, not forcing index
SELECT a.*, sum(b.total_amount) as total
FROM users a
LEFT JOIN orders b ON a.user_id = b.user_id
WHERE b.user_id IN (1033, 1034, 1035, 1036, 1037, 1038)
GROUP BY a.user_id;
In both cases, the query time is under a second.
In real business scenarios, however, a query can hit an index according to the `EXPLAIN` output and still be slow; in such cases, forcing the index is worth trying.
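The inverse hint also exists: `IGNORE INDEX` tells the optimizer not to consider an index, which can help when a forced or automatically chosen plan turns out worse. A sketch:
-- Ask the optimizer to plan as if the covering index did not exist
SELECT a.*, SUM(b.total_amount) AS total
FROM users a
LEFT JOIN orders b IGNORE INDEX (idx_orders_user_id_total_amount) ON a.user_id = b.user_id
WHERE b.user_id IN (1033, 1034, 1035)
GROUP BY a.user_id;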
Optimization Strategies
- Hit indexes early; use the small table to drive the large table.
- Handle `IN` clause inefficiency with forced indexes.
- Use covering indexes to avoid table lookups.
How to Optimize SQL Next Time
- For data volumes approaching tens of millions, use table partitioning, e.g., based on the user ID modulo (see the sketch after this section).
- Replace subqueries with summary tables that can hit indexes, such as daily or monthly rollup tables.
- Denormalize: copy frequently joined fields into one table so queries can stay single-table.
- Hit indexes by trading space for time, which is the core scenario analyzed in this article.
The core point of hitting indexes is covering indexes. For data volumes in the tens of millions, forced indexes may be necessary.
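As a sketch of the partitioning idea (hypothetical DDL; MySQL requires the partitioning column to be part of every unique key, which is why the primary key here includes `user_id`):
-- Hash-partition orders by user_id (modulo) into 16 partitions
CREATE TABLE orders_partitioned (
  order_id int NOT NULL AUTO_INCREMENT,
  user_id int NOT NULL,
  order_date date NOT NULL,
  total_amount decimal(10,2) NOT NULL,
  PRIMARY KEY (order_id, user_id) -- partitioning column must appear in every unique key
) ENGINE=InnoDB
PARTITION BY HASH (user_id)
PARTITIONS 16;
Queries filtered on `user_id` then touch only one partition instead of the full 15 million rows.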
Tips
Meaning of `type` in EXPLAIN Results
In MySQL's `EXPLAIN` output, the `type` field indicates the access type used by the query, i.e., how rows are read during execution. Depending on the access type, MySQL's query optimizer will choose different execution plans. Below are the possible values of the `type` field and their meanings:
- system: The best case, indicating the query returns a single-row result, usually through direct access to the PRIMARY KEY or a unique index.
- const: MySQL found a constant value in the query, resolved against the first table of the join. Because the condition is constant, MySQL reads at most one row from the table, e.g., fetching one row by primary key.
- eq_ref: Similar to const, but in a join: the table is accessed via a unique index, and for each index key value at most one row matches. Common in joins on primary keys or unique indexes.
- ref: The query used a non-unique index to find values, returning all rows matching a single value. Generally appears in joins using non-unique indexes or index prefixes.
- range: The query used an index for a range scan, common with range conditions such as BETWEEN, IN(), >, and <.
- index: MySQL scans the entire index to find the needed rows, usually when no more selective index is available.
- all: MySQL performs a full table scan to find the needed rows; the worst case.
Typically, the `type` values rank from best to worst as: system, const, eq_ref, ref, range, index, all. The actual ranking depends on the specific query, table structure, and index usage, but better `type` values usually correspond to better query performance.
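A few illustrative probes against the tables above (a sketch; the exact `type` reported can vary with optimizer statistics and data distribution):
-- const: single-row lookup by primary key
EXPLAIN SELECT * FROM users WHERE user_id = 1;
-- ref: non-unique secondary index lookup
EXPLAIN SELECT * FROM orders WHERE user_id = 1033;
-- range: index range scan (may degrade to ALL if the range is very wide)
EXPLAIN SELECT * FROM orders WHERE user_id BETWEEN 1033 AND 1040;
-- all: full table scan; the expression on total_amount defeats any index
EXPLAIN SELECT * FROM orders WHERE total_amount * 2 > 100;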
MySQL’s Table Lookup Mechanism
In MySQL, a table lookup (also known as a "bookmark lookup") refers to the process where MySQL first uses a secondary index to find the positions of rows meeting the conditions and then fetches the actual data rows from the clustered index (the "main table").
This typically occurs when no covering index satisfies all of the query's requirements.
When a query cannot be answered from the index alone, MySQL must go back to the main table for the missing information. This usually happens in the following situations:
- Non-covering index queries: When the query needs to return additional columns not included in the index, MySQL needs to look up the main table for these extra columns.
- Using index range conditions: When range conditions (BETWEEN, >, <, etc.) are used and the index can only locate the start of the range, MySQL needs to check the main table for the complete rows meeting the range conditions.
- Required columns missing from the chosen index: If the query needs columns that the secondary index in use does not carry, MySQL must fetch them from the clustered index.
When MySQL performs a table lookup, additional disk access occurs as it needs to read data from the main table, which may degrade performance, especially in large tables or high concurrency environments.
To minimize table lookups, consider the following:
- Create covering indexes: Ensure all columns needed for the query are included in the index to avoid table lookups.
- Optimize query statements: Avoid using range conditions where possible, or ensure all filtering conditions can be fully matched by the index.
- Consider table design: Include frequently queried columns in the index to reduce the need for table lookups.
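To see the difference directly, compare a covering query with a non-covering one using the `orders` table and its `idx_orders_user_id_total_amount` index from above (a sketch; check the `Extra` column):
-- Covering: both selected columns live in the index; Extra shows "Using index"
EXPLAIN SELECT user_id, total_amount FROM orders WHERE user_id = 1033;
-- Not covering: order_date is missing from the index, so every match
-- triggers a lookup back to the clustered index
EXPLAIN SELECT user_id, total_amount, order_date FROM orders WHERE user_id = 1033;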