Friday, December 11, 2009

Best Practices - SQL

General:


· Decide upon a database naming convention, standardize it across your organization and be consistent in following it. It helps make your code more readable and understandable.
· Normalize your tables: There are two common excuses for not normalizing databases: performance and pure laziness. You'll pay for the second one sooner or later; and as for performance, don't optimize what's not slow. More often than not, the denormalized design actually ends up slower. DBMSs were designed to be used with normalized databases and SQL Server is no exception, so design with normalization in mind.
· Make sure you normalize your data at least to third normal form (3NF). At the same time, do not compromise on query performance; a little bit of de-normalization helps queries perform faster.
· Try not to use system tables directly. System table structures may change in a future release. Wherever possible, use the sp_help* stored procedures or INFORMATION_SCHEMA views. There will be situations where you cannot avoid accessing system tables, though!
· Write comments in your stored procedures, triggers and SQL batches generously, whenever something is not very obvious. This helps other programmers understand your code clearly. Don't worry about the length of the comments, as it won't impact the performance, unlike interpreted languages like ASP 2.0.
· Do not use SELECT * in your queries. Always write the required column names after the SELECT statement, like SELECT CustomerID, CustomerFirstName, City. This technique results in less disk I/O and less network traffic, and hence better performance.
· Try to avoid server side cursors as much as possible. Always stick to a 'set based approach' instead of a 'procedural approach' for accessing/manipulating data. Cursors can be easily avoided by SELECT statements in many cases. If a cursor is unavoidable, use a simple WHILE loop instead to loop through the table; a WHILE loop is faster than a cursor most of the time. But for a WHILE loop to replace a cursor you need a column (primary key or unique key) to identify each row uniquely (see the WHILE loop sketch after this list).
· Use the fully qualified name (FQN), e.g. dbo.ProcedureName, when calling a user stored procedure (USP) from within another USP.
· Avoid improper usage of tables in a JOIN or any other query; i.e. if a table with fewer records can serve the query, why use a huge table?
· Limit the use of correlated sub-queries; often they can be replaced with a JOIN.
· Include the SET NOCOUNT ON statement in your stored procedures to stop the message indicating the number of rows affected by a Transact-SQL statement (the procedure sketch after this list shows it in context).
· Try to avoid using SQL Server cursors whenever possible.
· If you need to return a table's total row count, you can use an alternative to the SELECT COUNT(*) statement. For e.g.
o SELECT rows FROM sysindexes WHERE id = OBJECT_ID('table_name') AND indid < 2
· Avoid searching with not-equals operators (<> and NOT) as they result in table and index scans. If you must do heavy text-based searches, consider using the Full-Text search feature of SQL Server for better performance.
· Use 'Derived tables' wherever possible, as they perform better. Consider the following query to find the second highest salary from the Employees table: SELECT MIN(Salary) FROM Employees WHERE EmpID IN (SELECT TOP 2 EmpID FROM Employees ORDER BY Salary Desc). The same query can be re-written using a derived table as shown below, and it performs twice as fast as the above query: SELECT MIN(Salary) FROM (SELECT TOP 2 Salary FROM Employees ORDER BY Salary Desc) AS A. This is just an example; the results might differ in different scenarios depending upon the database design, indexes, volume of data, etc. So, test all the possible ways a query could be written and go with the most efficient one. With some practice and understanding of how the SQL Server optimizer works, you will be able to come up with the best possible queries without this trial and error method.
· While designing your database, design it keeping 'performance' in mind. You can't really tune performance later, when your database is in production, as it involves rebuilding tables/indexes, re-writing queries. Use the graphical execution plan in Query Analyzer or SHOWPLAN_TEXT or SHOWPLAN_ALL commands to analyze your queries. Make sure your queries do 'Index seeks' instead of 'Index scans' or 'Table scans'. A table scan or an index scan is a very bad thing and should be avoided where possible (sometimes when the table is too small or when the whole table needs to be processed, the optimizer will choose a table or index scan).
· Use a 'User Defined Data type' if a particular column repeats in a lot of your tables, so that the data type of that column is consistent across all your tables (a sketch appears after this list).
· Try not to use text and ntext data-types for storing large textual data. The 'text' datatype has some inherent problems associated with it. You cannot directly write or update text data using INSERT, UPDATE statements (you have to use special statements like READTEXT, WRITETEXT and UPDATETEXT). There are a lot of bugs associated with replicating tables containing text columns. So, if you don't have to store more than 8 KB of text, use char(8000) or varchar(8000) data-types.
· If you have a choice, do not store binary files, image files (Binary Large Objects or BLOBs), etc. inside the database. Instead store the path to the binary/image file in the database and use that as a pointer to the actual binary file. Retrieving and manipulating these large binary files is better performed outside the database; after all, a database is not meant for storing files.
· Use the char data type for a column only when the column is non-nullable. If a char column is nullable, it is treated as a fixed length column in SQL Server 7.0+. So, a char(100), when NULL, will eat up 100 bytes, resulting in space wastage. Use varchar(100) in this situation. Of course, variable length columns do have a slight processing overhead over fixed length columns. Carefully choose between char and varchar depending upon the length of the data you are going to store.
· Always access tables in the same order in all your stored procedures/triggers consistently. This helps avoid deadlocks. Other things to keep in mind to avoid deadlocks are: keep your transactions as short as possible. Touch as little data as possible during a transaction. Never, ever wait for user input in the middle of a transaction. Do not use higher level locking hints or restrictive isolation levels unless they are absolutely needed. Make your front-end applications deadlock-intelligent, that is, these applications should be able to resubmit the transaction in case the previous transaction fails with error 1205. In your applications, process all the results returned by SQL Server immediately, so that the locks on the processed rows are released and blocking is avoided.
· Offload tasks like string manipulations, concatenations, row numbering, case conversions, type conversions etc. to the front-end applications, if these operations are going to consume more CPU cycles on the database server (It's okay to do simple string manipulations on the database end though). Also try to do basic validations in the front-end itself during data entry. This saves unnecessary network roundtrips.
· Always check the global variable @@ERROR immediately after executing a data manipulation statement (like INSERT/UPDATE/DELETE), so that you can roll back the transaction in case of an error (@@ERROR will be greater than 0 in case of an error). This is important because, by default, SQL Server will not roll back all the previous changes within a transaction if a particular statement fails. This behavior can be changed by executing SET XACT_ABORT ON. The @@ROWCOUNT variable also plays an important role in determining how many rows were affected by a previous data manipulation (also, retrieval) statement, and based on that you could choose to commit or roll back a particular transaction (the procedure sketch after this list shows this pattern).
· Though we survived the Y2K, always store 4 digit years in dates (especially, when using char or int datatype columns), instead of 2 digit years to avoid any confusion and problems. This is not a problem with datetime columns, as the century is stored even if you specify a 2 digit year. But it's always a good practice to specify 4 digit years even with datetime datatype columns.
· In your queries and other SQL statements, always represent dates in the unambiguous yyyymmdd format (for e.g. '20091211'). This format is interpreted correctly regardless of the default date format or language settings on the SQL Server. This also prevents the following error while working with dates: Server: Msg 242, Level 16, State 3, Line 2. The conversion of a char data type to a datetime data type resulted in an out-of-range datetime value.
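As referenced above, here is a minimal sketch of looping with WHILE instead of a cursor. It assumes a hypothetical dbo.Customers table with a unique CustomerID key; the per-row procedure name is also made up.

DECLARE @CustomerID int

SELECT @CustomerID = MIN(CustomerID) FROM dbo.Customers

WHILE @CustomerID IS NOT NULL
BEGIN
    -- per-row work goes here, e.g. EXEC dbo.ProcessCustomer @CustomerID (hypothetical procedure)

    -- move to the next row; MIN() returns NULL when no rows remain, which ends the loop
    SELECT @CustomerID = MIN(CustomerID) FROM dbo.Customers WHERE CustomerID > @CustomerID
END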
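The procedure sketch below shows the SET NOCOUNT ON and @@ERROR points together. The procedure name and the dbo.Customers table are hypothetical; the pattern itself is the standard pre-TRY/CATCH style described above.

CREATE PROCEDURE dbo.usp_UpdateCustomerCity
    @CustomerID int,
    @City varchar(50)
AS
SET NOCOUNT ON          -- suppress the "n row(s) affected" messages

BEGIN TRANSACTION

UPDATE dbo.Customers SET City = @City WHERE CustomerID = @CustomerID

IF @@ERROR <> 0         -- check immediately after the DML statement
BEGIN
    ROLLBACK TRANSACTION
    RETURN 1            -- failure status
END

COMMIT TRANSACTION
RETURN 0                -- success status
GO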
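For the user defined data type point, a small sketch. CREATE TYPE is the SQL Server 2005+ form (sp_addtype is the older equivalent); the type and table names are made up.

CREATE TYPE dbo.PhoneNumber FROM varchar(20) NULL
GO

CREATE TABLE dbo.Contacts
(
    ContactID int NOT NULL PRIMARY KEY,
    Phone dbo.PhoneNumber       -- same definition wherever a phone number is stored
)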



Where:


· Try to use AND conditions rather than OR in WHERE clauses; AND is a friend of indexes, while OR conditions often force scans.
· Try to use parentheses wherever possible/required in WHERE conditions to isolate/simplify complex/multiple conditions.
· Minimize the use of not equal operations, <> or !=. SQL Server has to scan a table or index to find all values to see if they are not equal to the value given in the expression. Try rephrasing the expression using ranges:
o WHERE KeyColumn < 'TestValue' OR KeyColumn > 'TestValue'
· Try to avoid using functions on columns in WHERE clause conditions, as this may prevent the optimizer from striking the best index for the query. Instead, try to use local variables to store the manipulated values, or create a derived column in the table to store manipulated values if this is a frequent hit; Convert/Cast functions, date functions, etc. are the usual suspects (see the sketch after this list).
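A sketch of the point above about functions in the WHERE clause, assuming a hypothetical dbo.Orders table with an indexed OrderDate column:

-- Wrapping the column in a function usually forces a scan:
-- SELECT OrderID FROM dbo.Orders WHERE CONVERT(varchar(8), OrderDate, 112) = '20091211'

-- Expressing the same filter as a range on the bare column lets the optimizer seek the index:
DECLARE @DayStart datetime, @DayEnd datetime
SET @DayStart = '20091211'
SET @DayEnd   = DATEADD(day, 1, @DayStart)

SELECT OrderID
FROM dbo.Orders
WHERE OrderDate >= @DayStart
  AND OrderDate <  @DayEnd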

Indexes:

· Choose your index strategy wisely. Limit the number of indexes to what is actually necessary.
· Consider creating the clustered index on the most appropriate key rather than on the primary key by default. For e.g. if the [CreatedDate] column undergoes frequent range queries, then a clustered index on the [CreatedDate] column helps a lot and is more meaningful (see the index sketch after this list).
· Analyze the “selectivity” of the columns of a table before creating indexes. Choose the index key wisely, in descending order of selectivity. For e.g. if Col1 has more distinct values than Col2, then the index key order should be Col1, Col2.
· Use the same order in WHERE clause conditions as in the index, i.e. based on the descending order of selectivity of the columns.
· Create indexes on columns that are going to be highly selective. Indexes are vital to efficient data access; however, there is a cost associated with creating and maintaining an index structure. For every insert, update and delete, each index must be updated. In a data warehouse this is acceptable, but in a transactional database you should weigh the cost of maintaining an index on tables that incur heavy changes. The bottom line is to use effective indexes judiciously. On analytical databases, use as many indexes as necessary to read the data quickly and efficiently. A classic example is: do NOT index a column like "Gender". It would have a selectivity of about 50%, so on a table of 10 million records you can rest assured that a lookup through this index may still have to travel half the rows. Maintaining such indexes can slow your performance.
· The 'fill factor' option specifies how full SQL Server will make each index page. When there is no free space to insert a new row on an index page, SQL Server creates a new index page and transfers some rows from the previous page to the new one. This operation is called a page split. You can reduce the number of page splits by setting an appropriate fill factor to reserve free space on each index page. The fill factor is a value from 1 through 100 that specifies the percentage of each index page to be filled. The default value of 0 is treated similarly to a fill factor of 100, the difference being that SQL Server leaves some space within the upper level of the index tree for FILLFACTOR = 0. The fill factor percentage is used only at the time the index is created. If the table contains read-only data (or data that very rarely changes), you can set the fill factor to 100. When the table's data is modified very often, you can decrease the fill factor to 70 percent, for example. Having explained page splits in detail, a word of caution against overdoing this: more free space means that SQL Server has to traverse more pages to get the same amount of data. So try to strike a balance and arrive at an appropriate value (see the fill factor sketch after this list).
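A sketch of the clustered index and key order points above. It assumes a hypothetical dbo.Orders table that has no clustered index yet (e.g. its primary key was declared NONCLUSTERED) and that CustomerID is more selective than OrderStatus.

-- Cluster on the range-queried column instead of defaulting to the primary key
CREATE CLUSTERED INDEX IX_Orders_CreatedDate ON dbo.Orders (CreatedDate)

-- Composite key ordered by descending selectivity: the more selective column first
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_Status ON dbo.Orders (CustomerID, OrderStatus)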
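And a fill factor sketch for the same hypothetical table (the parenthesized WITH syntax is the SQL Server 2005+ form):

CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON dbo.Orders (CustomerID)
WITH (FILLFACTOR = 70)   -- leaf pages built 70% full, leaving 30% free space for future inserts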


Temp Tables:


· Create indexes on temp tables if they undergo JOINs and SELECTs more frequently than UPDATEs/INSERTs.
· Avoid creating indexes on empty temp tables; instead, create the indexes after inserting the data (see the sketch after this list).
· Use Temp Tables in place of complex sub-queries for INSERT/UPDATE/DELETE statements.
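A sketch of the temp table indexing points above, again using a hypothetical dbo.Orders source table:

CREATE TABLE #RecentOrders
(
    OrderID    int NOT NULL,
    CustomerID int NOT NULL
)

INSERT INTO #RecentOrders (OrderID, CustomerID)
SELECT OrderID, CustomerID
FROM dbo.Orders
WHERE CreatedDate >= '20091201'

-- Index created only after the data load, since the temp table is mostly joined/selected from
CREATE NONCLUSTERED INDEX IX_RecentOrders_CustomerID ON #RecentOrders (CustomerID)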


USP/Dynamic SQL:


· Consider returning an integer value using a RETURN statement instead of returning it as part of a record-set. The RETURN statement exits unconditionally from a stored procedure, so the statements following RETURN are not executed. Though the RETURN statement is generally used for error checking, you can use this statement to return an integer value for any other reason. Using the RETURN statement can boost performance because SQL Server will not create a record-set.
· Make sure your stored procedures always return a value indicating the status. Standardize on the return values of stored procedures for success and failures. The RETURN statement is meant for returning the execution status only, not data. If you need to return data, use OUTPUT parameters.
· If your stored procedure always returns a single-row resultset, consider returning the resultset using OUTPUT parameters instead of a SELECT statement, as ADO handles output parameters faster than resultsets returned by SELECT statements (see the sketch after this list).
· Use the sp_executesql stored procedure instead of the EXECUTE statement. The sp_executesql stored procedure supports parameters, so using it instead of the EXECUTE statement improves the readability of your code when many parameters are used. When you use the sp_executesql stored procedure to execute a Transact-SQL statement that will be reused many times, the SQL Server query optimizer will reuse the execution plan it generates for the first execution when the change in parameter values to the statement is the only variation (see the parameterized sketch after this list).
· If you have a very large stored procedure, try to break down this stored procedure into several sub-procedures, and call them from a controlling stored procedure.
· Add the WITH RECOMPILE option to the CREATE PROCEDURE statement if you know that your query will vary each time it is run from the stored procedure (Dynamic SQL). The WITH RECOMPILE option prevents reusing the stored procedure execution plan, so SQL Server does not cache a plan for this procedure and the procedure is recompiled at run time. Using the WITH RECOMPILE option can boost performance if your query will vary each time it is run from the stored procedure because in this case the wrong execution plan will not be used.
· Avoid dynamic SQL statements as much as possible. Dynamic SQL tends to be slower than static SQL, as SQL Server must generate an execution plan every time at runtime. IF and CASE statements come in handy to avoid dynamic SQL. Another major disadvantage of using dynamic SQL is that it requires the users to have direct access permissions on all accessed objects like tables and views. Generally, users are given access to the stored procedures which reference the tables, but not directly on the tables. In this case, dynamic SQL will not work.
· Separate out the dynamic SQL part into a different USP created WITH RECOMPILE. This avoids recompilation of the whole SP, and the execution plan for the dynamic part is not cached, which can boost performance.
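A sketch of the RETURN status / OUTPUT parameter points above, using a hypothetical dbo.usp_GetCustomerCity procedure and dbo.Customers table:

CREATE PROCEDURE dbo.usp_GetCustomerCity
    @CustomerID int,
    @City varchar(50) OUTPUT
AS
SET NOCOUNT ON

SELECT @City = City FROM dbo.Customers WHERE CustomerID = @CustomerID

IF @@ROWCOUNT = 0
    RETURN 1        -- status: customer not found
RETURN 0            -- status: success
GO

-- Caller picks up both the status and the single-row result without a resultset:
DECLARE @Status int, @City varchar(50)
EXEC @Status = dbo.usp_GetCustomerCity @CustomerID = 42, @City = @City OUTPUT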
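And a parameterized sp_executesql sketch (the table and parameter names are made up):

DECLARE @sql nvarchar(500)

SET @sql = N'SELECT OrderID, CreatedDate FROM dbo.Orders WHERE CustomerID = @CustomerID'

-- The parameter definition lets SQL Server reuse the plan across different parameter values
EXEC sp_executesql @sql, N'@CustomerID int', @CustomerID = 42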


XML Input Parameters:


· XQuery/XPath combined with filters can definitely improve performance.
· Using CROSS APPLY with XQuery/XPath in a way that avoids referencing the root node can improve performance drastically (see the sketch after this list).
· Try using the OPENXML variation as well to read XML input. In some cases, this can outperform XQuery/XPath.
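A sketch of the XQuery/XPath and OPENXML approaches above, using a made-up XML shape:

DECLARE @x xml
SET @x = N'<Orders><Order ID="1" Qty="5" /><Order ID="2" Qty="3" /></Orders>'

-- XQuery/XPath shredding with nodes()/value(); for an xml column, the same pattern
-- is applied per row via CROSS APPLY t.XmlCol.nodes('/Orders/Order')
SELECT o.n.value('@ID',  'int') AS OrderID,
       o.n.value('@Qty', 'int') AS Qty
FROM @x.nodes('/Orders/Order') AS o(n)

-- OPENXML alternative
DECLARE @h int
EXEC sp_xml_preparedocument @h OUTPUT, @x
SELECT OrderID, Qty
FROM OPENXML(@h, '/Orders/Order', 1)
     WITH (OrderID int '@ID', Qty int '@Qty')
EXEC sp_xml_removedocument @h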