Tuesday, March 10, 2009

SSIS Performance Framework Design Considerations : SSIS Best Practices , SSIS Performance / Package Review Checklist

I'm reading: SSIS Performance Framework Design Considerations : SSIS Best Practices , SSIS Performance / Package Review ChecklistTweet this !
Having implemented projects with end-to-end MS BI Implementation and after reading tons of best practices document from technical forums and blogs, I feel like sharing my experience about how many of these I have seen getting useful in real practice. Below is the list of some of the SSIS 2005 best practices and SSIS 2005 performance related hints which I have practically seen getting useful in projects that I have worked upon. I would rate it's usefulness on a scale of 1 to 10, and this rating is solely my viewpoint and experience.

Best Practices:

  • To avoid most of the package deployment error from one system to another system, set package protection level to ‘DontSaveSenstive’. [While working in a multi-user environment in a version-controlled system, this is very important. Rating: 6/10]
  • Once you copy-paste a script component and execute the package, it may fail. Just open the script editor of the pasted script component, save the script and execute the package – it will work. [This may or may not happen. I feel it's experience driven, but I have experienced this. Rating: 3/10]
  • In Master packages, expecially extract packages where a single package is extracting data from 20-30 odd tables, Group tasks together with annotations. This would seem very trivial during the start of development, but it would work like an air-conditioner when the development tempreature is at it's peak. [This keeps package manageable, easy to debug and more organized after annotations are added. Rating: 7/10]
  • Avoid unnecessary type casts. For example, the flat file connection manager, by default, uses string [DT_STR] data type for all the columns. Watch precision issues when using the money, float, and decimal types. Also, be aware the money is faster than decimal, and money has fewer precision considerations than float. Check my post for SQL Server to SSIS data type mapping for more info on data types. [Understanding of SSIS data type is very important from a design and performance perspective. Rating: 10/10]
  • For the SQL job that call the SSIS packages, make multiple steps each doing small tasks rather than a single step doing all the tasks. In the first case, the transaction log grows too big and if a rollback to happen, it may take the full processing space of the server. Do check out my post about Improving SSIS Performance of load packages using Chunking method. [Generally in a data warehouse loading scenario, only the very first load that populated data warehouse for the very first time is heavy. For the rest delta is loaded, and that delta is often divided into different dimension and fact loads. I do not rule out possibility of this scenario, but my belief is that the probability of this scenario is 50/50 in a data warehouse load. Rating: 5/10]
  • The control flow of an SSIS package threads together various control tasks. It is possible to set a transaction that can span into multiple tasks using the same connection. To enable this, the “retainsameconnection” property of the connection manager should be set to “True”. [This can save many connection hits to the database and gain at least 15-20% connection making load reduction of production database. Rating: 7/10]

Coming to SSIS performance related points, three of the top points which I have seen affecting performance directly and dramatically are Sorting, batchsize and commitsize properties related to transactions, and selection of appropriate components / tasks.

SSIS 2005 Performance Hints:

  • Sorting in SSIS is a time consuming operation. At the places where we know that data is coming from database tables, it’s better to perform the sorting operation at the database query itself. Sort the data using the ORDER BY clause in the query, and change the IsSorted property of the source adpaters to make the component and SSIS aware that the data is sorted.
  • When you execute SQL statements within SSIS whether to read a source, to perform a look transformation, or to change tables, some standard optimizations significantly help performance:
  1. If Integration Services and SQL Server run on the same server, use the SQL Server destination instead of the OLE DB destination to improve performance.
  2. Commit size 0 is fastest on heap bulk targets, because only one transaction is committed. If you cannot use 0, use the highest possible value of commit size to reduce the overhead of multiple-batch writing. Commit size = 0 is a bad idea if inserting into a Btree – because all incoming rows must be sorted at once into the target Btree—and if your memory is limited, you are likely to spill. Batchsize = 0 is ideal for inserting into a heap. For an indexed destination, I recommend testing between 100,000 and 1,000,000 as batch size.
  3. Use a commit size of less than 5000 to avoid lock escalation when inserting; note that in SQL Server 2008 you can now enable/disable lock escalation at the object level, but use this wisely.
  4. Heap inserts are typically faster than using a clustered index. This means that you may want to drop indexes and rebuild if you are changing a large part of the destination table; you will want to test your inserts both by keeping indexes in place and by dropping/disabling all indexes and rebuilding to validate.
  5. Use partitions and partition SWITCH command; i.e., load a work table that contains a single partition and SWITCH it in to the main table after you build the indexes and put the constraints on.
  6. A key network property is the packet size of your connection. By default this value is set to 4,096 bytes. This means a new network package must be assemble for every 4 KB of data. As noted in SqlConnection.PacketSize Property in the .NET Framework Class Library, increasing the packet size will improve performance because fewer network read and write operations are required to transfer a large data set.
  7. Another network tuning technique is to use network affinity at the operating system level. At high throughputs, you can sometimes improve performance this way. For the network itself, you may want to work with your network specialists to enable jumbo frames to increase the default payload of 1,500 bytes to 9,000 bytes. By enabling jumbo frames, you will further decrease the amount of network operation required to move large data sets.
  • Right Component / Task selection for the right job: As per many discussions and forums, SCD Task performs poor and this starts getting realized as soon as the data flow increase beyond the limit of 50k records. As per the latest conference at SQLBits, I just saw the presentation slides of one of the speaker where it was suggested to use MERGE statement of SQL Server 2008 instead of SCD Task. Also one other poorly performing task is the OLE DB Command which does a row-by-row operation. Instead opt for a BULK Insert Command. These tasks should either not be used or should be used minimally with caution.


Reference: MSDN SSIS Forum - SSIS 15 Best Practices, SSIS Best Practices By SQLCAT Team

No comments:

Related Posts with Thumbnails