Sunday, March 20, 2011

Indexing strategy on table

I have an SQL Server 2005 table named 'EventTable' defined as such:

EventID, EventTypeCode, EventStatusCode, EventDate

Currently the table has a clustered index on the primary key 'EventID', there are no other indexes currently

EventTypeCode and EventStatusCode columns are CHAR(3) (examples are 'NEW', 'SEN', 'SAL') and are foreign keys

Common Selects will be...

select * from EventTable Where EventDate = @dateparam;
select * from EventTable Where EventTypeCode = @eventtype;
select * from EventTable Where EventStatusCode = @statustype;

What index strategy would you use to handle Select statements above?

Is it better to have a covering (compound) index on the 3 columns? If so, what order should the compound index be in?

Or a separate index on each of the 3 columns?

The table will grow at the rate of about 300 events per day..

From stackoverflow
  • I would put an index on each of the foreign keys (I generally index most foreign keys), and then probably one on the date field, depending on the frequency it's used in searches.

  • Strategy 1, provide indexes that can be used for filtering. Table lookups will fetch the remaining data. This almost doubles the use of space and quadruples write IO cost.

    on EventTable(EventDate)
    on EventTable(EventTypeCode)
    on EventTable(EventStatusCode)
    

    Strategy 2, provide covering indexes that can be used for filtering. There will be no lookups. This quadruples the use of space and write IO cost.

    on EventTable(EventDate, EventId,
                  EventTypeCode, EventStatusCode)
    on EventTable(EventTypeCode, EventId,
                  EventDate, EventStatusCode)
    on EventTable(EventStatusCode, EventId,
                  EventDate, EventTypeCode)
    


    The reason that the column order matters in a covering index (in general), is that data is ordered by each column in turn. That is to say: column 2 tie-breaks column 1. Column 3 tie-breaks column 1 and 2.

    Since you don't have any queries that filter on multiple columns, there is no significance (in your case) to the column order after the first column.

    If you had a query such as

    where EventDate = @EventDate
      and EventTypeCode = @EventTypeCode
    

    Then this covering index would be useful. EventDate is likely more selective than EventTypeCode, so it goes first.

    on EventTable(EventDate, EventTypeCode,
                  EventId, EventStatusCode)
    


    Edit further: If you have a query such as

    where EventDate between '2008-12-01' and '2008-12-31'
      and EventTypeCode = 'todo'
    

    Then this index will work best:

    on EventTable(EventTypeCode, EventDate,
                  EventId, EventStatusCode)
    

    This will put all the 'todo' events together, ordered by their EventDate as a tie-breaker. SQL Server just has to find the first element and read until it finds an element that doesn't meet the criteria and stop.

    If the EventDate was first in the index, then the data would be ordered by date, and then each date would have the 'todo' events clumped together. SQL Server would find the first todo on 12-01, read til it finds an element that doesn't meet the criteria... then find the first todo on 12-02, read until it's out of todo's... then find... on out for 31 days.

    You want to choose an index that places the items you want contiguous to each other.


    At 300 records per day, your table will get to 5 million records in 50 years. This isn't that big. Either strategy will work. Strategy 1 will probably be fast enough (err on the side of space).

  • How frequently do you run selects against the table? Are the selects generally part of normal processing or more towards reporting and/or maintenance and debugging?

    Is there an initial load of data? If not, the table size is pretty tiny, and likely to remain that way for years to come.

    Although you give some sample selects, do you know how frequent each type of select will be run?

    I'd probably just leave the table as it is and run profiler to see how the table is being accessed in production. If it's going to be a table that is accessed constantly and may become a bottleneck for different functions then I'd make a best guess as to which columns will be part of the WHERE clause most frequently and put a single index on that. For example, if there is a process that looks at all events for the past 24 hours that runs every 10 seconds, then an index on the date column might be in order and I would even cluster on that one rather than the primary key.

  • Please take a look at this good article on SQL Server Indexing:

    http://www.mssqltips.com/tip.asp?tip=1206

  • It will also be common to execute queries such as

    where EventDate between '2008-12-01' and '2008-12-31'
      and EventTypeCode = 'todo'
    

    • the table is more likely to grow at 500-800/records per day rather than 300
    • the queries mentioned in the initial question will be run many times throughout the day, during normal use of the ASP.NET application
    • NHibernate 'HQL' is used to perform such queries
    • there is no initial load of data, the table only sits at about 10K records now because this is a new app
    • ...I'm more or less just trying to avoid the customer having to call us in a couple years to complain about the app becoming 'slow' since this table will be hit so much
    Eric Z Beard : In a couple of years? There is no such thing as getting indexes right once and walking away forever. They take constant analysis and maintenance to stay right as the data changes.

0 comments:

Post a Comment