Sunday, April 17, 2011

MySQL performance optimization: order by datetime field

I have a table with roughly 100.000 blog postings, linked to a table with 50 feeds via an 1:n relationship. When I query both tables with a select statement, ordered by a datetime field of the postings table, MySQL always uses filesort, resulting in very slow query times (>1 second). Here's the schema of the postings table (simplified):

+---------------------+--------------+------+-----+---------+----------------+
| Field               | Type         | Null | Key | Default | Extra          |
+---------------------+--------------+------+-----+---------+----------------+
| id                  | int(11)      | NO   | PRI | NULL    | auto_increment |
| feed_id             | int(11)      | NO   | MUL | NULL    |                |
| crawl_date          | datetime     | NO   |     | NULL    |                |
| is_active           | tinyint(1)   | NO   | MUL | 0       |                |
| link                | varchar(255) | NO   | MUL | NULL    |                |
| author              | varchar(255) | NO   |     | NULL    |                |
| title               | varchar(255) | NO   |     | NULL    |                |
| excerpt             | text         | NO   |     | NULL    |                |
| long_excerpt        | text         | NO   |     | NULL    |                |
| user_offtopic_count | int(11)      | NO   | MUL | 0       |                |
+---------------------+--------------+------+-----+---------+----------------+

And here's the feed table:

+-------------+--------------+------+-----+---------+----------------+
| Field       | Type         | Null | Key | Default | Extra          |
+-------------+--------------+------+-----+---------+----------------+
| id          | int(11)      | NO   | PRI | NULL    | auto_increment |
| type        | int(11)      | NO   | MUL | 0       |                |
| title       | varchar(255) | NO   |     | NULL    |                |
| website     | varchar(255) | NO   |     | NULL    |                |
| url         | varchar(255) | NO   |     | NULL    |                |
+-------------+--------------+------+-----+---------+----------------+

And here's the query that takes >1 second to execute. Please note that the post_date field has an index, but MySQL isn't using it to sort the postings table:

SELECT 
    `postings`.`id`, 
    UNIX_TIMESTAMP(postings.post_date) as post_date, 
    `postings`.`link`, 
    `postings`.`title`, 
    `postings`.`author`, 
    `postings`.`excerpt`, 
    `postings`.`long_excerpt`, 
    `feeds`.`title` AS feed_title, 
    `feeds`.`website` AS feed_website
FROM 
    (`postings`)
JOIN 
    `feeds` 
ON 
    `feeds`.`id` = `postings`.`feed_id`
WHERE 
    `feeds`.`type` = 1 AND 
    `postings`.`user_offtopic_count` < 10 AND 
    `postings`.`is_active` = 1
ORDER BY 
    `postings`.`post_date` desc
LIMIT 
    15

The result of the explain extended command on this query shows that MySQL is using filesort:

+----+-------------+----------+--------+---------------------------------------+-----------+---------+--------------------------+-------+-----------------------------+
| id | select_type | table    | type   | possible_keys                         | key       | key_len | ref                      | rows  | Extra                       |
+----+-------------+----------+--------+---------------------------------------+-----------+---------+--------------------------+-------+-----------------------------+
|  1 | SIMPLE      | postings | ref    | feed_id,is_active,user_offtopic_count | is_active | 1       | const                    | 30996 | Using where; Using filesort |
|  1 | SIMPLE      | feeds    | eq_ref | PRIMARY,type                          | PRIMARY   | 4       | feedian.postings.feed_id |     1 | Using where                 |
+----+-------------+----------+--------+---------------------------------------+-----------+---------+--------------------------+-------+-----------------------------+

When I remove the order by part, MySQL stops using filesort. Please let me know if you have any ideas on how to optimize this query to get MySQL to sort and select the data by using indexes. I have already tried a few things such as creating a combined index on all where/order by fields, as suggested by a few blog postings, but this didn't work either.

From stackoverflow
  • MySQL has two filesort algorithms: an older filesort that sorts records on disk, and a new version that works in memory.

    If it cannot use an index on the first table in the join to sort the query, it will have to do a filesort. If resultset before sorting converted to fixed-width format is greater than the sort buffer OR if it contains any text fields, it will have to use the slower on-disk filesort algorithm (the second condition is satisfied since your query has a text field).

    MySQL is choosing to use the is_active column, ostensibly because it thinks that column is most selective in eliminating rows before it continues with the other joins and where conditions. The first thing I would suggest would be to try creating composite indexes with post_date, feed_id, and the columns in the where condition, e.g. (is_active, user_offtopic_count, post_date, feed_id).

    Dennis G. : Thanks for the explanation!
  • Create a composite index either on postings (is_active, post_date) (in that order).

    It will be used both for filtering on is_active and ordering by post_date.

    MySQL should show REF access method over this index in EXPLAIN EXTENDED.

    Note that you have a RANGE filtering condition over user_offtopic_count, that's why you cannot use an index over this field both in filtering and in sorting by other field.

    Depending on how selective is your user_offtopic_count (i. e. how many rows satisfy user_offtopic_count < 10), it may be more useful to create an index on user_offtopic_count and let the post_dates be sorted.

    To do this, create a composite index on postings (is_active, user_offtopic_count) and make sure the RANGE access method over this index is used.

    Which index will be faster depends on your data distribuion. Create both indexes, FORCE them and see which is faster:

    CREATE INDEX ix_active_offtopic ON postings (is_active, user_offtopic_count);
    CREATE INDEX ix_active_date ON postings (is_active, post_date);
    
    SELECT 
        `postings`.`id`, 
        UNIX_TIMESTAMP(postings.post_date) as post_date, 
        `postings`.`link`, 
        `postings`.`title`, 
        `postings`.`author`, 
        `postings`.`excerpt`, 
        `postings`.`long_excerpt`, 
        `feeds`.`title` AS feed_title, 
        `feeds`.`website` AS feed_website
    FROM 
        `postings` FORCE INDEX (ix_active_offtopic)
    JOIN 
        `feeds` 
    ON 
        `feeds`.`id` = `postings`.`feed_id`
    WHERE 
        `feeds`.`type` = 1 AND 
        `postings`.`user_offtopic_count` < 10 AND 
        `postings`.`is_active` = 1
    ORDER BY 
        `postings`.`post_date` desc
    LIMIT 
        15
    
    /* This should show RANGE access with few rows and keep the FILESORT */
    
    SELECT 
        `postings`.`id`, 
        UNIX_TIMESTAMP(postings.post_date) as post_date, 
        `postings`.`link`, 
        `postings`.`title`, 
        `postings`.`author`, 
        `postings`.`excerpt`, 
        `postings`.`long_excerpt`, 
        `feeds`.`title` AS feed_title, 
        `feeds`.`website` AS feed_website
    FROM 
        `postings` FORCE INDEX (ix_active_date)
    JOIN 
        `feeds` 
    ON 
        `feeds`.`id` = `postings`.`feed_id`
    WHERE 
        `feeds`.`type` = 1 AND 
        `postings`.`user_offtopic_count` < 10 AND 
        `postings`.`is_active` = 1
    ORDER BY 
        `postings`.`post_date` desc
    LIMIT 
        15
    
    /* This should show REF access with lots of rows and no FILESORT */
    
    Dennis G. : That did the trick for me, thanks a lot! I had to use force index to get the best index used. We are now using multiple combined indexes for the different queries.
  • Also, it's important to remember that MySQL won't use an index if the column you're ordering by has a function applied to it.

    You should also try aliasing postings.post_date as something else. This will tell MySQL to order by the unaltered column, and you'll still select the unix timestamp.

0 comments:

Post a Comment