SQL Server Integration Services – An ETL Guide

SQL Server Integration Services has been a tool of choice for me since I first ventured into the SQL Server 2005 world. Since then SQL Server has evolved dramatically, and more features have been added with increased capabilities. With SSIS 2012 you get improved column mapping, parameters whose values can be set and changed at package run time, the Script Task and Script Component for running VB or C# code, project-level connection managers that let you create and share connections across packages, a Merge Join transformation with reduced memory usage, the DQS Cleansing transformation that makes data cleansing relatively easy, and more.

One of the biggest features on the SQL Server roadmap is In-Memory OLTP execution (code-named Hekaton), which actually debuts in SQL Server 2014. The in-memory approach is becoming highly sought after due to the dramatic performance gains as well as the dropping price of physical memory modules. Where SQL Server claims a 5-20 times performance improvement, SAP HANA, with its new architecture, claims to boost performance by a whopping 300 times! As soon as I get my hands on HANA, I will do a comparison test. In-Memory OLTP gets its performance boost from memory-optimized, lock- and latch-free data structures and optimistic multiversion concurrency control, which enables concurrent processing without logical locks on the tables. Stored procedures can also be compiled natively, and the data is held entirely in memory. This is all on top of a major algorithm redesign that supports faster processing.
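
For a flavor of what this looks like on the database side (outside of SSIS), a memory-optimized table and a natively compiled stored procedure are declared roughly as below. This is a minimal sketch, assuming a database that already has a MEMORY_OPTIMIZED_DATA filegroup; the table and procedure names are made up for illustration.

```sql
-- Sketch only: assumes the database already has a MEMORY_OPTIMIZED_DATA filegroup (SQL Server 2014+).
-- Table and procedure names are hypothetical.
CREATE TABLE dbo.OrderCache
(
    OrderId   INT          NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    Status    NVARCHAR(20) NOT NULL,
    UpdatedAt DATETIME2    NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO

-- Natively compiled procedure: compiled to machine code and run against memory-resident data,
-- using optimistic (snapshot-based) concurrency instead of locks.
CREATE PROCEDURE dbo.usp_SetOrderStatus
    @OrderId INT,
    @Status  NVARCHAR(20)
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    UPDATE dbo.OrderCache
    SET    Status = @Status, UpdatedAt = SYSUTCDATETIME()
    WHERE  OrderId = @OrderId;
END;
GO
```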

Anyway, let us get back to the ETL discussion. Where do ETL and other tools and capabilities, like OLAP cubes, the Universe semantic layer, end systems, and source systems, stand in the overall picture of a systems setup?

I created the diagram below to help understand the role of each layer.

Business Intelligence System Overview

From a pure information system’s perspective, it all starts with the source system. What I didn’t show in the diagram above is that the source systems get their data from actual physical events: sales information from a point-of-sale system, payroll data from a time-and-attendance system, customer information from a CRM system, and so on. Let us say a company uses SAP to place orders, create invoices, and store shipment data. These are all tied to physical events that happened in the real world. The data entered in those systems is then stored in the application’s native database or in external databases.

ETL comes into the picture at this point and lets you work with the raw data: performing cleansing tasks, profiling the data, checking referential integrity, catching NULL errors and invalid or corrupted records, and transforming the data for storage in a dimensional or OLAP-compliant schema. That’s just a very high-level description; a lot goes on behind the scenes, and we’ll see some of that in the coming sections. The ETL script or package then stores the data in the respective data warehouse tables with proper cardinality and data integrity.

The end systems where the final data is presented can access it directly from the data warehouse (through applications developed using .NET, Java, HTML5 APIs, etc.) or through intermediate tools like the BO Universe Designer, SQL Server Analysis Services cubes, etc. Each has its own strengths and set of capabilities. For example, if the objective is to create a reporting system where users can do data mining or analysis, SSAS and the BO universe provide that capability by allowing the addition of rules, calculations, etc. that can be used in a friendly GUI even by a non-technical user.
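
To give a flavor of the transform-and-load step, here is a minimal T-SQL sketch of moving cleansed rows from a staging table into a warehouse dimension. The schema, table, and column names (stg.Customer, dw.DimCustomer, and so on) are hypothetical, and a real package would typically do this work inside a Data Flow rather than in a single statement.

```sql
-- Minimal sketch: load cleansed customer rows from staging into a warehouse dimension.
-- Schema, table, and column names are hypothetical.
INSERT INTO dw.DimCustomer (CustomerCode, CustomerName, Country)
SELECT  s.CustomerCode,
        LTRIM(RTRIM(s.CustomerName)),                 -- basic cleansing
        COALESCE(NULLIF(s.Country, ''), 'Unknown')    -- handle NULL / empty values
FROM    stg.Customer AS s
WHERE   s.CustomerCode IS NOT NULL                    -- reject invalid records
  AND NOT EXISTS (SELECT 1                            -- avoid duplicates already in the warehouse
                  FROM dw.DimCustomer AS d
                  WHERE d.CustomerCode = s.CustomerCode);
```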

Let us dive into the SSIS tool. When you open SQL Server Data Tools (previously called BIDS, the Business Intelligence Development Studio), you will be prompted to pick a project type for SSIS, SSAS, or SSRS. Once you create an Integration Services project, you’ll see a package with a .dtsx extension being created. The dtsx package is stored as XML, and you can see the code by right-clicking the package name in Solution Explorer and selecting “View Code”.

SQL Server Data Tools - SSIS

The view you see in the picture above is the default layout for SSIS. On the right side is the Solution Explorer, where you can create new connections, packages, etc.; in the center you design the Control Flow and Data Flow; and in the bottom section you can manage project-level connections and declare variables in the Variables pane.

Control Flow is where you define the various tasks, arranged in a logical order and sequential pattern, to get the desired result. The control flow contains task components like the Data Flow task, FTP task, Script task, and Execute Process task, along with containers like the For Loop and Foreach Loop. Tasks are executed in order by using precedence constraints, which are simply the paths the process will take in case of success, completion, or failure.

Tasks are individual units of work that execute particular code, depending on the type of task and some user input. The great thing about SSIS is that users don’t need to write that functionality as code; tasks can simply be dragged and dropped. Try dragging a task into the Control Flow and double-clicking on it. A task editor window will open that lets you set the parameters, expressions, script, properties, etc.

Tasks are generally executed sequentially, and each task returns a success, failure, or completion value that gets evaluated based on constraints, expressions, or a combination of both using logical OR and logical AND. (We’ll see those shortly in an example.)

Here are a few of the most popular tasks.

  • Analysis Services Processing Task – This task will process a SQL Server Analysis Services cube, dimension, or mining model. It is frequently used to update the analytical cube with the latest data.
  • Bulk Insert Task – This task executes the BULK INSERT T-SQL command to load data from a flat file into a table (see the sketch after this list).
  • Data Flow Task – The data flow is the heart of the ETL package, where the various transformations happen in the data pipeline. This is the most frequently used and most obvious task in any package. The data flow task will let you connect to a data source and output the data based on the transformations performed. Whenever this task is added to the control flow, a corresponding data flow design surface is automatically created, accessible through the tab next to the “Control Flow” tab.
  • Execute Package Task – This task lets you execute one package from within another, making your solution modular.
  • Execute Process Task – This task will execute a program external to the package. The task editor lets you select a program on your operating system and supply its input and output parameters, which can also be set using expressions.
  • Execute SQL Task – This is another popular task, frequently used by database architects for executing a SQL statement. For example, the SQL might insert data from a source (e.g., an Excel sheet) as soon as the sheet has been loaded with data by another task; a sample statement appears in the sketch after this list.
  • FTP Task – Lets you connect to a server using the FTP protocol for sending and receiving files.
  • Send Mail Task – Will let you send an email using the SMTP protocol. Commonly used to send out a custom message on success, failure, or completion of a task.
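
As a concrete illustration of the Execute SQL and Bulk Insert tasks above, the statements they run might look roughly like this. The file path, schema, and table names are hypothetical.

```sql
-- A statement an Execute SQL Task might run before a load: clear the staging table.
-- Schema, table, and file names are hypothetical.
TRUNCATE TABLE stg.Sales;

-- What a Bulk Insert Task effectively issues: load a flat file into the staging table.
BULK INSERT stg.Sales
FROM 'C:\ETL\Incoming\sales_20140101.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);
```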

Precedence constraints are the links that connect the various tasks and direct the flow of task execution in the designed order. Business rules or logic can be applied as part of an expression; for example, execute the next task only when a defined variable has a value greater than 100. The constraint value determines whether and when the next task will be executed in response to the preceding task’s outcome.

There are three constraint values –

  • Success – The linked task executes only when the preceding task completes successfully.
  • Failure – The linked task executes only when the preceding task fails.
  • Completion – The linked task executes when the preceding task completes, regardless of success or failure.

Precedence Constraints

Containers are used for grouping various tasks together logically into units of work. This not only helps visually but also enables defining variables and event handlers on the container itself. There are four types of containers:

  • Task Host Container – This is not a visible item but an abstract concept; it is the default container that encapsulates a single task.
  • Sequence Container – This container allows grouping tasks logically within their own boundaries. It can be collapsed and expanded as needed.
  • For Loop Container – This repeatedly executes the tasks within the container for as long as a specified condition remains true.
  • Foreach Loop Container – This loops over a collection, such as files in a folder or rows in a record set, and executes the tasks once for each item.

Data Flow is the core of any SSIS package and is where most of the action takes place. It enables you to authenticate and connect to data sources, load the relevant data into memory, perform transformations like row sampling, term lookup, column import, fuzzy lookup, etc., and write the data out to destinations such as ODBC, OLE DB, raw file, Excel, or SQL Server. This is where the in-memory processing of data takes place, in the data flow pipeline’s buffers, and it is where SSIS really shows its performance edge. The connections used for sources and destinations are defined in the connection manager, which is globally available to any data flow task.
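
To relate this to something familiar, here is a rough T-SQL equivalent of a very simple data flow: a source query, a lookup that resolves a surrogate key, a derived column, and a destination insert. The object names are hypothetical, and in a real package this logic would run in the data flow pipeline rather than as a single SQL statement.

```sql
-- Rough T-SQL equivalent of a simple data flow: source, lookup, derived column, destination.
-- All object names are hypothetical; in SSIS this logic lives in the data flow pipeline.
INSERT INTO dw.FactSales (CustomerKey, OrderDate, Quantity, LineTotal)
SELECT  c.CustomerKey,                       -- Lookup transformation: resolve the surrogate key
        s.OrderDate,
        s.Quantity,
        s.Quantity * s.UnitPrice             -- Derived Column transformation: computed value
FROM    stg.Sales AS s
LEFT JOIN dw.DimCustomer AS c
       ON c.CustomerCode = s.CustomerCode;   -- unmatched rows get a NULL key (error output path)
```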

Note: More Content and Examples to follow. Stay Tuned.