Sunday, March 01, 2009

New features in SSIS 2008 that can improve performance over SSIS 2005


An integral part of any BI system is the data warehouse — a central repository of data that is regularly refreshed from the source systems. The new data is transferred at regular intervals (often nightly) by extract, transform, and load (ETL) processes.

Typically the data in the data warehouse is structured as a star schema, a snowflake schema, or a hybrid of the two. No matter which structure is chosen, after the new data has been loaded into the data warehouse, many BI systems copy subsets of the data to function-specific data marts, where the data is typically structured as a multidimensional OLAP cube.


Below are some of the new features of SQL Server 2008 that, in my view, significantly reduce the post-development performance tuning and optimization effort that SSIS 2005 packages typically require. This richer feature set should also help reduce development time and improve performance compared to SSIS / SQL Server 2005.

1) Feasibility Study: Develop a Proof of Concept

Before adopting any new feature introduced in a new product or product version, it is good design practice to develop a POC of that feature in alignment with the technical requirements.

Deciding upon a Proof of Concept project is an excellent way to gain support and influence people. This might involve the creation of an OLAP cube and the use of visualization software for one (or several) business units. The project can be used to show business people what they can expect from the new system and also to train the BI team. Choose a project with a small and well-defined scope that is relatively easy to achieve. Performing a Proof of Concept requires an investment of time, effort, and money but, by limiting the scope, you limit the expenditure and ensure a rapid return on the investment.

2) Analyze the OLTP Source System: Use data profiling to examine the distribution of the data in the source systems

The new Data Profiling task in Integration Services can be used to initially understand the nature of the source data for design purposes. However, the profiles it produces can also be used to apply business rules to data as part of the transformation process. Suppose, for example, the business rule says that the data from a particular source is acceptable only if the number of nulls does not exceed 1%. The profiles produced by a Data Profiling task can be used to apply this rule.

You can use the information gathered by the Data Profiling task to define appropriate data transformation rules, ensuring that your data warehouse contains “clean” data after ETL, which leads to more accurate and trusted analytical results. The Data Profiling task is a control flow task in which you define the profile information you need.

Eight data profiles are available; five of these analyze individual columns:

· Column Null Ratio
· Column Value Distribution
· Column Length Distribution
· Column Statistics
· Column Pattern

Three analyze either multiple columns or relationships between tables and columns:

· Candidate Key
· Functional Dependency
· Value Inclusion

Multiple data profiles for several columns or column combinations can be computed with one Data Profiling task and the output can be directed to an XML file or package variable. The former is the best option for ETL design work.

Note that Data Profiling tasks profile SQL Server tables; data in other locations must be loaded into staging tables before it can be profiled.
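The Data Profiling task itself is configured in the SSIS designer rather than in code, but the 1% null rule described above boils down to a simple ratio check. Here is a minimal T-SQL sketch of that rule, assuming a hypothetical staging table stg.Customer with an Email column (both names are illustrative, not from the referenced material):

-- Hedged sketch: stg.Customer and its Email column are hypothetical names.
-- Fail the load (or raise an alert) if more than 1% of Email values are NULL.
DECLARE @NullRatio decimal(9,4);

SELECT @NullRatio =
       CAST(SUM(CASE WHEN Email IS NULL THEN 1 ELSE 0 END) AS decimal(18,4))
       / NULLIF(COUNT(*), 0)
FROM stg.Customer;

IF @NullRatio > 0.01
    RAISERROR('Column null ratio for stg.Customer.Email exceeds 1%%.', 16, 1);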


3) Use the MERGE Statement for Slowly Changing Dimensions

During the extract phase you often need multiple Data Manipulation Language (DML) queries in order to perform one logical movement of the deltas into the relevant table. This is particularly true when you have to deal with slowly changing dimensions. SQL Server 2008 allows you to combine these multiple queries into one MERGE statement.

The MERGE statement performs insert, update, or delete operations on a target table based on the results of a join with a source table.

The MERGE statement provides three types of WHEN clauses:

· WHEN MATCHED enables you to UPDATE or DELETE the given row in the target table when the source and target rows satisfy the join condition (and any additional search condition you specify).
· WHEN NOT MATCHED [BY TARGET] enables you to INSERT a row into the target when it exists in the source but not in the target.
· WHEN NOT MATCHED BY SOURCE enables you to UPDATE or DELETE the given row in the target table when it exists in the target but not in the source.

You can specify a search condition with each of the WHEN clauses to choose which type of DML operation should be performed on the row.

The OUTPUT clause for the MERGE statement includes a new virtual column called $action that you can use to identify the DML action that was performed on each row.
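As an illustration, below is a minimal MERGE sketch that exercises all three WHEN clauses and the OUTPUT clause, assuming a hypothetical dbo.DimCustomer dimension and stg.Customer staging table (these names are mine, not from the referenced material):

-- Hedged sketch: dbo.DimCustomer and stg.Customer are hypothetical tables.
MERGE dbo.DimCustomer AS tgt
USING stg.Customer    AS src
   ON tgt.CustomerKey = src.CustomerKey
WHEN MATCHED AND tgt.City <> src.City THEN        -- row exists in both; attribute changed
    UPDATE SET tgt.City = src.City
WHEN NOT MATCHED BY TARGET THEN                   -- new row in the source
    INSERT (CustomerKey, City) VALUES (src.CustomerKey, src.City)
WHEN NOT MATCHED BY SOURCE THEN                   -- row no longer present in the source
    DELETE
OUTPUT $action, inserted.CustomerKey, deleted.CustomerKey;  -- $action reports the DML action per row

This pattern amounts to a Type 1 (overwrite) change; a Type 2 dimension would typically expire the current row and insert a new version instead of updating in place.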

4) Use Change Data Capture to Simplify the Extract Process

SQL Server 2008 has a new data tracking feature that is of particular benefit in data warehousing. The Change Data Capture process tracks changes to user tables and collects them into a relational format. A typical use would be to track changes in an operational database for later inclusion in the warehouse.

The capture process collects change data from the database’s transaction log and inserts it into a change table. Metadata about each transaction is also inserted into a metadata table so that changes can be ordered with respect to time. This makes it possible to identify, for instance, the type of change made to each row and which column or columns changed in an updated row. It is also possible to request all rows that changed between two points in time.
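A minimal sketch of the T-SQL involved, assuming a hypothetical dbo.Orders source table with OrderID, OrderDate, and Amount columns (dbo_Orders is the default capture instance name SQL Server would generate for that table):

-- Hedged sketch: dbo.Orders and its columns are hypothetical.
-- Enable CDC for the database, then for the source table.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Orders',
     @role_name     = NULL;     -- no gating role

-- Later, in the extract step: pull every change made in a given window.
DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', '2009-02-28 00:00');
SET @to_lsn   = sys.fn_cdc_map_time_to_lsn('largest less than or equal',     '2009-03-01 00:00');

SELECT __$operation,        -- 1 = delete, 2 = insert, 4 = update (after image)
       __$update_mask,      -- bitmask indicating which columns changed
       OrderID, OrderDate, Amount
FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');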

5) Use the New Lookup Transformation

In SQL Server 2008, the Lookup transformation supports two connection types when connecting to the reference dataset: the Cache connection manager and the OLE DB connection manager. The reference dataset can be a cache file, an existing table or view, a new table, or the result of an SQL query.

Reference data is usually cached for efficiency, and a data flow can now be used to populate the cache. Many potential sources can be used as reference data: Excel, XML, text, Web services, and anything else within reach of an ADO.NET provider. In SQL Server 2005, the cache could only be populated by an SQL query, and a Lookup could only take data from specific OLE DB connections. The new Cache Transform component populates a cache defined by the Cache connection manager.

The cache no longer needs to be reloaded each time it is used: this removes the speed penalty incurred by reloading from a relational source. If a reference dataset is used by two pipelines in a single package, the cache can be saved to permanent file storage as well as to virtual memory so it is available to multiple Lookups within one package. Furthermore the cache file format is optimized for speed and its size is unrestricted.

The miss-cache feature is also new. When running directly against the dataset, a Lookup component can add to the cache any key values from the source where there is no matching value in the reference dataset. So if Lookup has once determined that the reference set does not contain, for example, the value 885, it does not waste time inspecting the reference set for that value if it appears again in the source data. Under certain conditions this feature can produce a performance improvement of 40%.

Finally there is now a ‘Lookup no match output’ to which ‘miss-cache’ rows can be directed instead of going to the error output.

References: Data Warehouse Best Practices (MSDN), SQL Server Books Online (BOL)
