# Earned Value Management – Formulas, Calculations & Examples

Any project needs constant monitoring of time, material, and cost usage. If you see your team falling behind schedule, you can add more resources to stay on track, but that could increase the project cost (unless your team works the overtime unpaid!). This concept is commonly known as the triple constraint in the project management world: if one or more of the constraints (scope, time, or cost) change in a positive or negative direction, the other constraints are affected as well.

To stay on track, we universally use the tried and tested Earned Value Management (EVM) calculations. Not only are they used for monitoring and controlling a project; they are also a valuable set of tools for reporting performance to stakeholders, sponsors, or your program manager.

Here is an example of a Project Performance Report.

Changes in project baselines can occur due to either internal reasons or external factors; for example, an internal reason could be resource turnover, and an external factor could be a change in requirements by the business.

Let’s assume there is a change in requirements and we invoke change management to have those changes approved and added to the project baselines. That will cause an increase in budget and time, which would need to be recalculated. We might also need to forecast “what-if” scenarios, such as: if the schedule is immovable, how many more resources would I need to add, and what additional cost would be incurred?

These questions and analyses can be easily handled via EV calculations.

Here are the terms and formulas that are most commonly used. I created this diagram depicting the terms in pictorial form for easier understanding.

• Planned Value (PV): This is the estimated value of the work that is planned to be done.
• Earned Value (EV): EV is where you currently are in the project, i.e., the earned value of the work completed.
• Actual Cost (AC): This is the actual cost of the work completed.
• Budget at Completion (BAC): This is the project’s total budget, as determined by the P.O.
• Estimate at Completion (EAC): This is a forecasting term that helps determine the expected total cost at the end. EAC = AC + (BAC – EV)
• Estimate to Complete (ETC): This is the expected remaining cost (from now until the end). ETC = EAC – AC
• Cost Variance (CV): CV = EV – AC. You want your earned value to be higher than the actual cost.
• Schedule Variance (SV): SV = EV – PV. You again want to keep your SV positive.
• Cost Performance Index (CPI): CPI = EV / AC, a measure of cost efficiency (used in the example below).
• Schedule Performance Index (SPI): SPI = EV / PV, a measure of schedule efficiency.
• Variance at Completion (VAC): This is the variance that determines whether the project will finish over or under budget. VAC = BAC – EAC

Here is an example: let’s say we hired a coder at a billing rate of \$50/hour to write 500 lines of code in 5 days (a total of 40 hours). The coder is expected to code 100 lines/day, working 8 hours/day.

Scenario on Day 1:  Coder worked 6 hours and finished 200 lines of code

1. PV = \$400    (per 100 lines of code or 8 hours of work at \$50/hour)
2. EV = 200 lines of code x \$4/line of code = \$800
3. AC = 6 hours x \$50/hour = \$300
4. CV = EV – AC = \$800 – \$300 = \$500
5. SV = EV – PV = \$800 – \$400 = \$400
6. CPI = EV/AC = \$800/\$300 = 2.67
7. SPI = EV/PV = \$800/\$400 = 2
8. BAC = Total Project Budget = \$400/day x 5 days = \$2000
9. EAC = AC + (BAC – EV) = \$300 + (\$2000 – \$800) = \$1500 – Atypical variance: ignores past performance; to be used when all future work will be accomplished at the budgeted rate.
10. EAC = BAC / CPI = \$2000 / 2.67 = \$750 – Typical variance: assumes the same work rate and spending per day will continue.
11. EAC = AC + [(BAC – EV) / (CPI x SPI)] = \$300 + [(\$2000 – \$800) / (2.67 x 2)] = \$525 – Used when dealing with poor cost performance and a firm completion date.
12. ETC = EAC – AC = \$1200
13. VAC = BAC – EAC = \$500
14. TCPI = (BAC – EV) / (BAC – AC) = 70.59%

Perform same calculations for the following daily scenarios:

Scenario on Day 2:  Coder worked 8 hours and finished 50 lines of code

Scenario on Day 3:  Coder worked 6 hours and finished 100 lines of code

Scenario on Day 4:  Coder worked 10 hours and finished 50 lines of code

Scenario on Day 5:  Coder worked 10 hours and finished 100 lines of code
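If you prefer code to a spreadsheet, here is a minimal Python sketch of the same calculations. It assumes the rates from the example above (\$50/hour, \$4/line, \$400 of planned value per day, BAC of \$2000) and that you pass in cumulative hours and lines; the function name and layout are my own, not taken from the spreadsheet.

```python
# A minimal EVM sketch based on the worked example above.
# Assumed rates: $50/hour, $4/line, $400 planned value/day, BAC $2000.

def evm_snapshot(day, hours_worked, lines_done,
                 rate=50.0, value_per_line=4.0,
                 planned_per_day=400.0, bac=2000.0):
    pv = planned_per_day * day            # value of work planned so far
    ev = lines_done * value_per_line      # value of work completed
    ac = hours_worked * rate              # actual cost of that work
    cpi, spi = ev / ac, ev / pv
    eac_atypical = ac + (bac - ev)        # future work at the budgeted rate
    eac_typical = bac / cpi               # current cost performance continues
    eac_firm = ac + (bac - ev) / (cpi * spi)  # firm completion date
    return {
        "PV": pv, "EV": ev, "AC": ac,
        "CV": ev - ac, "SV": ev - pv,
        "CPI": round(cpi, 2), "SPI": round(spi, 2),
        "EAC (atypical)": eac_atypical,
        "EAC (typical)": round(eac_typical, 2),
        "EAC (firm date)": round(eac_firm, 2),
        "ETC": eac_atypical - ac,
        "VAC": bac - eac_atypical,
        "TCPI": round((bac - ev) / (bac - ac), 4),
    }

# Day 1: 6 hours worked, 200 lines finished (matches the list above)
print(evm_snapshot(day=1, hours_worked=6, lines_done=200))
```

For the later scenarios, pass cumulative figures; e.g., Day 2 would be `evm_snapshot(day=2, hours_worked=14, lines_done=250)`.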

I created this spreadsheet that will perform all the calculations for you and also give you a budget status on a daily basis (see the “Over or Under Budget?” column). Just fill in the cells highlighted in yellow.

And here is the attachment for the Excel sheet –> Excel Spreadsheet for EVM Formulas and Calculations

# Determining the Critical Path and Float

In spite of the Critical Path Method being such an important technique for managing the successful and timely delivery of project deliverables (and, as the name implies, being “critical”), it is surprisingly overlooked. How many times have you experienced a setback or delay in completing project tasks, whether due to unexpected issues, dependencies, or identified risks? The critical path helps you mitigate the risk of falling behind the agreed schedule, and it assists in managing those delays in a way that causes no real negative effect on the firm project completion date.

The Critical Path Method is used to calculate early and late start and finish dates without taking any resource constraints into consideration. It highlights the one path, out of the various paths in your schedule network diagram, that results in the latest completion date of the project. Tasks with a calculated float greater than 0 can be delayed, but tasks that fall on the critical path cannot; any delay there will increase resources, schedule, and related cost. Microsoft Project has an option for automatically displaying the critical path, which any project manager should monitor closely. The rationale is that certain tasks can be delayed, but the maximum delay (without affecting the timeline) is only what is permissible by the calculated slack, or float. So when you notice your team going slow on certain tasks and missing completion dates, you will need to restructure the dates for those delayed tasks in such a way that the project completion date is not affected. This is where the critical path and slack come into the picture.

We can easily find the critical path using Microsoft Project and also visually highlight the entire path in the Gantt chart. Here are the sample tasks I created in MS Project; the tasks highlighted in yellow (in the task list) and red (in the Gantt chart) are the tasks on the critical path.

Now let’s assume Task 2 has been delayed to start on 5/29 instead of 5/21 as originally planned. Since Task 2 has a slack of 9 days, we can delay the start date by up to a maximum of 9 days before affecting the project completion date. See below on the Gantt chart how moving Task 2 has no negative effect on the total time frame.

Calculating the slack manually will make this concept clear from the ground up. Let’s say you have 9 WBS tasks in your project plan and the durations have been determined for each task. The next step is to determine the predecessors for each task. This forms a schedule network diagram, which will be used for calculating the critical path. Start by designating the acronyms ES, EF, LS, and LF to the four corners of each task box as shown in the diagram below.

The next step is to perform the forward pass calculation to determine the earliest start and earliest finish of each task. You start by assigning the value 0 to the ES of Task 1. You then add the duration of Task 1 to that ES value to determine the EF. So the EF for Task 1 is 0 + 2 = 2.

You then carry the EF value of 2 to the ES of Task 2 and the ES of Task 3, since both are successors of Task 1. Again, you add ES + duration on Tasks 2 and 3 to get their EF values.

Now, before you proceed, notice that Task 4 has two predecessor tasks. You will need to calculate the EF for both of the predecessor tasks (Tasks 2 and 6) and pick the highest EF value.

After calculating the earliest finish for Task 2 and Task 6, we pick 16 as the ES for Task 4, since that is the higher of 7 and 16.

Keep repeating these steps for each task until all ES and EF are calculated and determined.

A backward pass needs to be performed to calculate the LF and LS; it runs in the opposite direction of the forward pass, i.e., from Task 9 to Task 1.

Start by taking the earliest finish value of the last task and assigning it to the latest finish of that same task. Then subtract the duration from the latest finish to get the latest start.

The LF of Task 9 is 37, so the latest start will be 37 – 3 (its duration) = 34. Carry the LS value back as the LF of the preceding task (working from right to left).

Keep performing the same steps until you reach Task 3. As you can see, the LF value for Task 3 can be carried over from both Task 5 and Task 6. Unlike the forward pass, where you pick the highest value, here you instead pick the lowest. So the correct value carried over to the LF of Task 3 will be 10 (from 10 and 18 of Tasks 6 and 4 respectively).

The final step before we can determine the tasks falling on the critical path is to calculate the float, or slack. Float is the number of days a particular task can be delayed without negatively affecting project completion.

To calculate float, subtract the earliest finish from the latest finish (LF – EF) for each task. After calculating the float for all the tasks, you will note that there is one path where every task has zero float. That means no slack or delay is allowed on any of those tasks. In our example, the critical path runs through Tasks 1, 3, 5, 4, 7, and 9.
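For anyone who wants to verify the passes in code rather than by hand, here is a minimal Python sketch of the forward pass, backward pass, and float calculation. The durations and predecessor lists below are illustrative stand-ins, not the exact values from the diagram above.

```python
# A minimal Critical Path Method sketch. The network below is
# hypothetical; substitute your own durations and predecessors.

tasks = {  # name: (duration, [predecessors]); listed in dependency order
    "T1": (2, []),
    "T2": (5, ["T1"]),
    "T3": (4, ["T1"]),
    "T6": (10, ["T3"]),
    "T5": (3, ["T3"]),
    "T4": (8, ["T2", "T6"]),
    "T7": (10, ["T4", "T5"]),
    "T9": (3, ["T7"]),
}

# Forward pass: ES = max EF of predecessors, EF = ES + duration.
es, ef = {}, {}
for t, (dur, preds) in tasks.items():
    es[t] = max((ef[p] for p in preds), default=0)
    ef[t] = es[t] + dur

# Backward pass: LF = min LS of successors, LS = LF - duration.
finish = max(ef.values())
succs = {t: [s for s, (_, ps) in tasks.items() if t in ps] for t in tasks}
ls, lf = {}, {}
for t in reversed(list(tasks)):
    lf[t] = min((ls[s] for s in succs[t]), default=finish)
    ls[t] = lf[t] - tasks[t][0]

# Float = LF - EF; the zero-float tasks form the critical path.
for t in tasks:
    slack = lf[t] - ef[t]
    print(f"{t}: ES={es[t]:2} EF={ef[t]:2} LS={ls[t]:2} LF={lf[t]:2} "
          f"float={slack}" + ("  <- critical" if slack == 0 else ""))
```

With these made-up numbers, T2 ends up with a float of 9 (similar to the Task 2 slack discussed above), and the zero-float chain T1, T3, T6, T4, T7, T9 falls out as the critical path.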

Here is the video depicting each step.

# A case for using a shared information library

Over time, an information library can be a great source of information for your team. Whether your company uses SharePoint, Microsoft TFS, FileStack, or OneHUB, the idea is to adopt and implement a process and a system for archiving and maintaining code changes for each client, as well as any break-fixes or subsequent enhancements; the examples below use a SharePoint site.

The Objective

• Help track changes for each implementation.
• Check in the code changes per bank and maintain a history of all code revisions.
• Use a break-fixes SharePoint page that contains all the changes made by anyone, archived by date.
• Have a separate section or page that includes all the comments or important information about an implementation.
• Share internally across various groups such as the Product Development Group and the Product Support Group.

The Opportunity

• There are many “break-fixes” or modifications added to an existing code base over time. Tracking those changes, as well as handling job failures in the absence of the primary developer, can be challenging; a backup resource will face numerous challenges without 360-degree knowledge of a particular implementation.
• The site will serve as a search point for any keyword. For example, a keyword search of “Custom code to rollup particular loan types” will return any items that were checked in or added on the SharePoint site. The scope of the search could include all the banks and will return results related to the keyword.
• The repository of stored information may include code snippets, additional logic or requirements implemented in the past, known bugs, discussions, and a log of any relevant information.
• It will minimize the time needed for a new or backup resource to jump into job failures or other issues. Such a site will also serve as a knowledge base and a reference point to check before diving deep into any task.
• Global search on SharePoint will enable you to look for a particular piece of information across all the banks. Such a wealth of knowledge increases clarity and negates the need to reinvent a similar work product that might already be in use at another bank.
• Important communications, as well as details and insight on all clients, can be maintained and retrieved when needed.
• Such a site will help capture more accurate and detailed information, along with the business rules and logic.

Solution

• Create a SharePoint site with the following features:
• Code versioning and comments on all subsequent check-ins of the original code.
• Archival of all the changes, along with a discussion board and comments section.
• A site for maintaining other key or supplemental information that can be beneficial to all as common knowledge.
• A site for archiving a set of common code that can be reused in a different implementation.
• A site for archiving “know-how” and “how-to” knowledge and tips.

Other Benefits

• Reduced work time for any fixes or “job fail” resolutions.
• Easy transition of knowledge between the responsible resource and a new or backup resource.
• Negating the need to reinvent the wheel by making use of the existing code library.

CONCLUSION

With a growing list of  clients, incorporating a shared repository system will surely help meet the challenges of maintaining relevant and critical information for each implementation.

# SSIS – Forcing a sub package to succeed

Use this method when you need to force certain tasks to succeed in spite of failure scenarios. It applies to spreadsheet and flat-file loads that are non-critical to the remaining flow of tasks and should not cause the entire job to fail. The method involves moving the desired load tasks into a new sub-package and calling the sub-package from the parent via an Execute Package Task. A working example is set up on the TEST server for reference.

The setup and configuration are as follows:

In this example, we will move the Flat File load task to a new Sub-package.dtsx, which will in turn be called from an Execute Package Task.

Double-click on the Execute Package Task > Package > ReferenceType and change it to “External Reference”. The reason is that when the parent package is executed from a SQL Agent job, the child package must still be accessible so it can run. Skipping this step will result in the error shown below.

After switching to External Reference, you’ll need to specify a connection pointing to the child package’s location so that the SQL Agent job can load it at run time.

Go back to the parent package, highlight “Execute Package Task”, and open the Properties for that task. Change the property value of “ForceExecutionResult” from “None” to “Success”. Make sure you don’t accidentally set this for the entire package!

Click on the “Event Handlers” tab and select “Execute Package Task” from the drop-down list.

Click on the blue link seen in the screenshot above. This creates an “OnError” event for the “Execute Package Task”, thereby exposing new system variables that are otherwise not accessible. Make sure to click the grid options icon and select “Show system variables”.

Scroll down, locate the “Propagate” variable, and change it from the default value of True to False. This ensures that any error occurring in that component (the Execute Package Task in this case) will not propagate to the other elements of the control flow. Make sure this is done ONLY on the OnError event of the “Execute Package Task”.

Add a “Send Mail Task” connected with an On Failure precedence constraint. This will send an email notifying the intended recipient of the sub-package failure.

# SQL Server Integration Services – An ETL Guide

SQL Server Integration Services has been a tool of choice for me since I first ventured into the SQL Server 2005 world. Since then, SQL Server has evolved dramatically, and more features with increased capabilities have been added. With SSIS 2012 you get, among other things: column mapping; parameters whose properties can be declared and changed during package runtime; the Script Task and Script Component for running VB or C# code; project-level connection managers that let you create and share connections across a project; merge join transformations with reduced memory usage; and the DQS cleansing transformation that helps you perform cleansing transformations with relative ease.

One of the biggest features of SQL Server 2012 is the introduction of in-memory OLTP execution. The in-memory feature is becoming very sought after due to the dramatic performance gains as well as the dropping price of physical memory modules. Where SQL Server 2012 claims 5-20 times performance improvements, SAP HANA, with its new architecture, claims to boost performance by a whopping 300 times! As soon as I get my hands on HANA, I will do a comparison test. In-memory optimization gets this performance boost by taking advantage of hyper-threading technology and enabling concurrent processing that eliminates logical locks on tables. Stored procedures are also compiled natively, and the result set is held in memory. This is all on top of a massive algorithm change that supports faster processing.

Anyway, let us get back to the ETL discussion. Where do ETL and other tools/capabilities like OLAP cubes, the Universe semantic layer, end systems, and source systems stand in the overall picture of a systems setup?

I created this diagram to illustrate the role of each layer.

From a pure information systems perspective, it all starts with the source system. What I didn’t show in the diagram above is that the source systems get their data from actual physical events: sales information from a P.O.S. system, payroll data from a time and attendance system, customer information from a CRM system, and so on. Let us say a company uses SAP to place orders, create invoices, and store shipment data. These are all tied to physical events that happened in the real world. The data entered in those systems is then stored in the application’s native database or in external databases.

ETL comes into the picture at this point and lets you work with the raw data: performing cleansing tasks, data profiling, checking referential integrity, handling NULL errors and invalid/corrupted records, and transforming the data for storage in a dimensional or OLAP-compliant schema. That is just a very high-level description; a lot goes on behind the scenes, and we’ll see some of that in coming sections. The ETL script or package stores the data in its respective tables, with proper cardinality and data integrity, in a data warehouse. The end systems where the final data is presented could access it directly from the data warehouse (through applications developed using .NET, Java, HTML5 APIs, etc.) or through intermediate tools like the BO Universe Designer or SQL Server Analysis Services cubes. Each has its own strengths and set of capabilities. For example, if the objective is to create a reporting system where users can do data mining or analysis, SSAS and the BO universe provide that capability by allowing the addition of rules, calculations, etc. that can be used in a friendly GUI by even a non-technical user.

Let us dive into the SSIS tool. When you open SQL Server Data Tools (previously called BIDS, Business Intelligence Development Studio), you will be prompted to pick a project type for SSIS, SSAS, or SSRS. Once you create an Integration Services project, you’ll see a package with a .dtsx extension being created. The dtsx package is stored as XML, and you can see the code by right-clicking the package name in Solution Explorer and selecting “View Code”.

Click on the picture for larger view.

The view that you see in the picture above is the default layout for SSIS. On the right side is the Solution Explorer, where you can create new connections, packages, etc.; in the center you create Control Flow and Data Flow designs; and in the bottom section you can manage project-level connections and declare variables in the Variables section.

Control Flow is where you define the various tasks, arranged in a logical order and sequential pattern, to get the desired result. A control flow contains task components like the Data Flow Task, FTP Task, Script Task, and Execute Process Task, and containers like For Loop and Foreach Loop. Tasks are executed in order by using precedence constraints: a constraint is nothing but the path the process will take in case of success, completion, or failure.

Tasks are individual units of work that execute particular code based on the type of task and some user input. The great thing about SSIS is that users don’t need to write a function in code form; tasks can simply be dragged and dropped. Try dragging a task into the Control Flow and double-click on it. A task editor window will open, which lets you set the parameters, expressions, script, properties, etc.

Tasks are generally executed sequentially, and each task produces a success, failure, or completion value that gets evaluated based on constraints, expressions, or a combination of both in logical OR and logical AND. (We’ll see those shortly in an example.)

Here are a few of the most popular tasks.

• Analysis Services Processing Task – This task will process a SQL Server Analysis Services Cube, dimension or a mining model. Frequently used to update the analytical cube with latest data.
• Bulk Insert Task – This task executes BULK INSERT SQL command to load data into  a table.
• Data Flow Task – The data flow is the heart of the ETL package, where various transformations happen in the data pipeline. This is the most frequently used task in any package. The data flow task lets you connect to any data source and outputs the data based on the transformations performed. Whenever this task is added to the control flow, a data flow model is automatically created that can be accessed through the tab next to the “Control Flow” tab.
• Execute Package Task – This task lets you execute a package from within another package, making your design modular.
• Execute Process Task – This task executes a program external to the package. The task editor lets you select a program on your operating system to be executed, with input and output parameters or via an expression.
• Execute SQL Task – This is another popular task, frequently used by database architects for executing a SQL statement. For example, the SQL might insert data from a source (e.g., an Excel sheet) as soon as the sheet is loaded with data by another task.
• FTP Task – Lets you connect to a server using the FTP protocol for sending and receiving files.
• Send Mail Task – Will let you send an email using SMTP protocol. Commonly used to send out a custom message on success, failure or completion of a task.

Precedence constraints are the links that connect various tasks and direct the flow of task execution in the designed order. Business rules or logic can be applied as part of an expression; for example, execute the next task only when a defined variable has a value greater than 100 (in SSIS expression syntax, something like `@[User::RowCount] > 100`, where the variable name here is just an illustration). The constraint values determine whether and when the next task will be executed in response to the preceding task’s execution.

There are three constraint values –

• Success – The linked task executes only if the preceding task completed successfully.
• Failure – The linked task executes only if the preceding task failed.
• Completion – The linked task executes on completion of the preceding task, regardless of success or failure.

Containers are used for logically grouping various tasks into units of work. This not only helps with the visual aspect but also enables defining variables and event handlers on the container itself. There are four types of containers:

• Task Host Container – This is not a visible item, just an abstract interface concept.
• Sequence Container – This container allows grouping tasks logically within separate boundaries. It can be collapsed and expanded as needed.
• For Loop Container – This sets up a loop in which all the tasks within the container are executed until a condition is satisfied.
• Foreach Loop Container – This sets up a loop over files or records and executes the tasks for each one.

Data Flow is the core of any SSIS package and where most of the action takes place. It enables you to authenticate and connect to databases, load the relevant data into the database engine’s memory, perform transformations like row sampling, term lookup, importing columns, fuzzy matching, etc., and export the data to destinations such as ODBC, OLE DB, raw files, Excel, or SQL Server. This is where in-memory processing of data takes place and where the performance edge of the SQL Server database engine really shows. The connections used for sources and destinations are defined in the connection manager, which is globally available to any data flow task.

Note: More Content and Examples to follow. Stay Tuned.

# Understanding the Requirements Traceability Matrix

One of the important documents that stands out during project planning is the requirements traceability matrix (RTM). It is the heart of the requirements collection phase and a very effective document for maintaining, tracing, and implementing the requirements as the project progresses.

While working on an SDLC project, there tend to be numerous resources involved across different phases. We can’t expect all of them to speak a common language when referring to an item, a task, or even the process itself, due to their different skill sets and job roles.

One distinction I often see people confuse is between the requirements traceability matrix and the mapping matrix. The key difference is that the former is produced by the analyst throughout the requirements collection phase, while the latter is produced during the BRD or the mapping of physical requirements to their technical components (for example, an item tracked through different schemas and databases and its associated business rules).

The requirements captured and finalized in the requirements traceability matrix will help shape the WBS tasks and activities list, which in turn will be used for developing the project schedule. That is a very high-level description of the process; when you put it all into action, there are many steps and much documentation produced along the way. That’s a topic of discussion for another day!

Going back to the requirements traceability matrix: the project manager and the analyst will work closely with the SMEs and stakeholders to capture their vision and the project objectives more accurately, and also to capture supplemental information, such as who requested the requirement, the business justification, affected components, business rules, the related WBS entry, the source, etc., that comes in handy during the entire project. We’ll look at some of the elements used in the matrix more closely in the attached sample.

So how are these different elements captured and what are the steps involved?

The requirements collection process starts with the project manager meeting with the SMEs and Stakeholders by setting up one or more of the following activities.

• Interviews
• Focus groups
• Cross-functional team meetings
• Brainstorming sessions
• Facilitated workshops
• Prototype (my favorite and most common)
• Surveys and questionnaires

In my experience, acquiring a prototype and following it up with additional questions about the requested components is one of the most effective ways of gathering requirements. Without examples or prototypes, the analyst is left scrambling to understand the business requirements and translate them into technical deliverables, because business users/stakeholders at times tend to be vague about their intrinsic needs when defining a deliverable. I shouldn’t say this, but many users/stakeholders are of wavering mind and will change their requirements as the project timeline advances. In my opinion that is fine, because as a project progresses, the corporate goals and direction of the organization might change with their changing consumer market, and those changes would need to be aligned and incorporated into an ongoing project.

This is where the RTM comes in handy as the master sheet for “tracing” the requirements and the details around each of them. One thing to remember is that there should be a unique identifier for each requirement, and requirements should never be deleted or re-ordered. Requirements are referenced by identifier throughout project communications and in documents like the BRD and RAID, so deleting or re-ordering them would break those associations and cause confusion. Instead, I always use a flag on every entry that states whether the requirement is current or outdated and whether it is still needed.

Here is an example of a Requirements Traceability Matrix document.

# R.A.I.D Document

In my earlier post, “8 Must-Have Documents in a Project Manager’s Briefcase”, I mentioned the R.A.I.D document. Today I will dive a little deeper and show you some examples. R.A.I.D stands for Risks, Assumptions, Issues, and Decisions. You might see different variations of this document; some people use “Dependencies” instead of “Decisions”, while others use “Actions” in place of “Assumptions”. Personally, I feel that dependencies are captured anyway as part of the issues, and action items are already listed for the mitigation of risks and issues. So it makes more sense to use the categories Risks, Assumptions, Issues, and Decisions.

This is what I wrote in my earlier post and I am just pasting it back instead of reinventing the wheel!

R.A.I.D Document – This document has sections for documenting and maintaining the risks, assumptions, issues, and decisions logged throughout the project lifecycle.

• Risk Log – This section contains the details about each risk: affected components, impact on timeline, level of severity, team member(s) responsible for resolving the risk, action needed, cause of the risk, impacted group, impacted work packages, resolution target date, etc. All risks should be numbered uniquely, in order of discovery, and should be referred to by that number throughout the project life cycle.

• Assumptions – This section lists all the assumptions made before the commencement of the project and during project execution as you follow through your activities. Every project has assumptions, such as the availability of required data, the estimated disk space for the database, the availability of required software, and things of that nature.

• Issue Log – This section is similar to the risk log and contains a list of all the issues, numbered uniquely and in the order they were received, along with the impact of each issue, its priority, affected components, team member(s) responsible for resolving it, a status indicator, the actual reported date, the projected completion date, dependencies, etc. Entries in the issue log should never be deleted or moved once they are resolved or when there is a change in status or priority; each should have a unique ID so it can be referenced throughout the project by that ID.

• Decisions – This section documents the decisions taken by any individual authorized in their position to do so, along with the decision details, department/group, date recorded, and the impacted and affected components.

Hope this is helpful to anyone trying to build a template. Or just contact me if you need my excel sheet.

# SQL Tutorial – Data Retrieval (Part 2)

In the first part of this series, we saw the structure of the SELECT statement, how to join multiple tables, operators, and how to build a subquery.

Today we will see the nested table expression, which is very frequently used when writing stored procedures or views. With this expression we create a temporary view from a query defined in the FROM clause. This helps when you are trying to apply aggregations to a set of data and use the result set as a temporary table. This temporary table takes the spot of an actual table name in the FROM clause and acts like a virtual table.

We’ll use the HumanResources schema for all the examples in this post. Note how department information is stored in the Department table, while the history of each employee’s department is stored in the EmployeeDepartmentHistory table. Note the usage of start and end dates, which tell you which department an employee belonged to at any point in history. Employee details are, as you can see, stored in the Employee table.

Before we construct a nested query, let us take a peek at the data for each of the tables we’ll be using in our final query.

This is a partial result set for the Employee table.

`SELECT`
`LoginID`
`,VacationHours`
`,SickLeaveHours`
`,OrganizationLevel`
`,SalariedFlag`
`FROM AdventureWorks2012.HumanResources.Employee`

Partial data for EmployeeDepartmentHistory table.

Data from Department table

In order to construct a nested query, let’s make up a scenario where we want to find the total of the average vacation hours taken by salaried and non-salaried employees across the company.

We’ll also add some rules to our query.

• Results should include records only for those employees who took more than 20 sick leave hours in the selected time frame.
• Results should include employees from the Sales, Purchasing, Marketing, or Engineering departments only.

Here is the final query which we will dissect and understand after the jump.

`SELECT SUM(AVG_VacationHours) AS TotalOfAvgVacHours, AVG(TotalTimeOff) AS AVGTotalTimeOff, SalariedFlag`
`FROM (`
`SELECT`
`AVG(VacationHours) AS AVG_VacationHours`
`,SUM(SickLeaveHours+VacationHours) AS TotalTimeOff`
`,OrganizationLevel`
`,EmpHistory.DepartmentID AS DeptID`
`,Department.Name`
`,SalariedFlag`
`FROM AdventureWorks2012.HumanResources.Employee AS Employee`
`INNER JOIN AdventureWorks2012.HumanResources.EmployeeDepartmentHistory AS EmpHistory`
`ON Employee.BusinessEntityID = EmpHistory.BusinessEntityID`
`INNER JOIN AdventureWorks2012.HumanResources.Department AS Department`
`ON Department.DepartmentID = EmpHistory.DepartmentID`
`WHERE HireDate BETWEEN '01-05-2005' AND GETDATE()`
`AND SickLeaveHours > 20`
`AND Department.Name IN ('Sales','Purchasing','Marketing','Engineering')`
`GROUP BY OrganizationLevel, SalariedFlag, EmpHistory.DepartmentID, Department.Name`
`) AS Temp_Table`
`GROUP BY Temp_Table.SalariedFlag`

Result:

We start by selecting the items we want to form our virtual table, the result of which will be used in the FROM clause of the main query. The SQL in the nested expression is used in place of an actual table and meets the rules I mentioned earlier.

In the nested query, we take the average of vacation hours and sum up sick leave hours plus vacation hours for employees hired between 01-05-2005 and today. We also pick the departments in this query and include only those employees who took more than 20 hours of sick leave.

Here is the result set for the nested query.

Now this is the data of a virtual table that will be used in the FROM clause of main query. In the main query, we will sum up all the averages for vacation hours and group it by Salaried Flag.

The end result tells us the average vacation hours that salaried employees and non-salaried employees (contractors) took. The query also tells us the average time off for both categories.

Here is the result set again.

# SQL Tutorial – Data Retrieval (Part 1)

Structured Query Language is the universal language spoken by all enterprise databases. If you ask database developers what their primary spoken language is, most will say “SQL”. That was a joke! But the point is that when you are working with databases, understanding SQL is of foremost importance, since it is the building block for writing and reading stored procedures, derived views, macros, etc., which are the most commonly used forms of SQL for retrieving data in a desired shape.

Across various databases such as SQL Server, Teradata, Oracle, and Netezza, the statement functions and syntax can differ, but the structure and fundamental concepts of SQL apply to all of them.

We will be using the Production schema from the AdventureWorks 2012 database on SQL Server 2012 for all our examples. Click on the picture for the schema diagram.

SQL statements are categorized by their primary function. Below are the types of SQL statements.

Before we start, let’s understand the structure of the SELECT statement, since it is the most commonly and frequently used data retrieval statement. The SELECT and FROM clauses are required; the remaining clauses are optional.

If the data we are trying to retrieve lies in two separate tables, they need to be joined using the primary and foreign key relationship that exists between the two tables.

There are different types of joins, and they are easier to understand using Venn diagrams from set theory:

• Inner Join – Using this join returns only the rows common to the joined tables.

Select Table_A.*,Table_B.* from Table_A INNER JOIN Table_B on Table_A.Key = Table_B.Key

• Right Outer Join – This join, as the name suggests, will return the matching rows from the table on the left side (A) of the join and the entire dataset from the table on the right side (B).

Select Table_A.*,Table_B.* from Table_A RIGHT OUTER JOIN Table_B on Table_A.Key = Table_B.Key

• Left Outer Join – This join, as the name suggests, will return the matching rows from the table on the right side of the join and the entire dataset from the table on the left side.

Select Table_A.*,Table_B.* from Table_A LEFT OUTER JOIN Table_B on Table_A.Key = Table_B.Key

• Full Outer Join – Will give you all the rows from both tables joined using this type of join. The second variant below returns only the rows that have no match in the other table.

1. Select Table_A.*,Table_B.* from Table_A FULL OUTER JOIN Table_B on Table_A.Key = Table_B.Key

2. Select Table_A.*,Table_B.* from Table_A FULL OUTER JOIN Table_B on Table_A.Key = Table_B.Key where Table_A.Key is NULL OR Table_B.Key is NULL

• Cross Join – This is commonly known as a Cartesian product and should generally be avoided. The join returns every combination of rows from the two tables, i.e., N x M rows for tables of N and M rows.

Select Table_A.*,Table_B.* from Table_A CROSS JOIN Table_B

Let’s start with an example by constructing a basic SELECT statement that has all the clauses shown in the picture above.

`SELECT  Product.Name ,`
`        ProductID ,`
`        ProductLine,`
`        ListPrice,`
`        AVG(ListPrice) AS Avg_ListPrice,`
`        StandardCost`
`FROM  AdventureWorks2012.Production.Product`
`WHERE ListPrice >= 100`
`AND StandardCost <= 300`
`AND ProductLine IS NOT NULL`
`GROUP BY ProductID,`
`         ListPrice,`
`         Product.Name,`
`         ProductLine,`
`         StandardCost`
`HAVING AVG(ListPrice) < 500`
`ORDER BY ProductID ASC,`
`         Product.Name ASC,`
`         ListPrice DESC`

The example above selects data from a single table. When working with an RDBMS using dimensional modeling, or with OLAP cubes, the tables will most likely be in normalized form; since the data then resides in a large number of normalized tables, retrieving it requires constructing a SQL query that joins the tables related by primary keys and foreign keys. I covered normalization up to 3NF in this post.

Each table is created/designed to have at least a unique primary key (PK) for every row, and tables can be related to other tables by using a foreign key (FK), as shown in the example below.

ProductID is the primary key in the Product table, and ProductSubcategoryID is one of its foreign keys. You can see that ProductSubcategory has ProductSubcategoryID as its primary key, so these two tables can be joined through the Primary Key – Foreign Key join.

Let’s join the Product and ProductSubcategory tables. Note how we applied an inner join on the third row. A right outer, left outer, or full outer join can also be applied, depending on what result set you are trying to retrieve.

`SELECT  Product.Name , ProductID , ProductLine , Product.ProductSubcategoryID`
`FROM  AdventureWorks2012.Production.Product AS Product`
`INNER JOIN AdventureWorks2012.Production.ProductSubcategory AS ProductSubCat`
`ON Product.ProductSubcategoryID = ProductSubCat.ProductSubcategoryID`

Let’s understand the subquery concept. Take the tables Product, BillOfMaterials, and UnitMeasure, and select products that are of subcategory 12 and measured in “Each”. There are 38 unit measure codes in the BillOfMaterials table, and the names of those unit measure codes are in the UnitMeasure table. Let’s say you don’t want to manually look up the code for the name you want (“Each” in this case) in the UnitMeasure table. We can write a subquery that retrieves the measure code for the selected measure name.

In short, we need to add another query (a subquery) in the WHERE clause that retrieves the code for “Each” and feeds it into the main query’s WHERE condition.

`SELECT  Product.Name ,`
`        ProductID ,`
`        ProductLine ,`
`        Product.ProductSubcategoryID,`
`        BillofMaterial.UnitMeasureCode,`
`        BillofMaterial.PerAssemblyQty`
`FROM    AdventureWorks2012.Production.Product AS Product`
`        INNER JOIN AdventureWorks2012.Production.BillOfMaterials AS BillofMaterial`
`ON Product.ProductID = BillofMaterial.ComponentID`
`WHERE ProductLine IS NOT NULL`
`AND ProductSubcategoryID = 12`
`AND UnitMeasureCode = (SELECT UnitMeasure.UnitMeasureCode`
`FROM AdventureWorks2012.Production.UnitMeasure`
`WHERE UnitMeasure.Name = 'Each' )`

Let’s learn about a couple of set operators: UNION, UNION ALL, EXCEPT, and INTERSECT.

The UNION operator combines data from two or more queries and removes duplicate rows; UNION ALL retains the duplicates.

`SELECT ProductSubcategoryID`
`      ,Name`
`FROM AdventureWorks2012.Production.ProductSubcategory`
`WHERE ProductSubcategoryID IS NOT NULL`
`UNION`
`SELECT ProductSubcategoryID`
`      ,[Name]`
`FROM AdventureWorks2012.Production.Product`
`WHERE ProductSubcategoryID IS NOT NULL`

The EXCEPT operator returns the rows from the query before the EXCEPT operator that are not returned by the query after it.

Example:

`SELECT ProductSubcategoryID`
`      ,Name`
`FROM AdventureWorks2012.Production.ProductSubcategory`
`WHERE ProductSubcategoryID IS NOT NULL`
`EXCEPT`
`SELECT ProductSubcategoryID`
`      ,[Name]`
`FROM AdventureWorks2012.Production.Product`
`WHERE Name IN ('Forks','Gloves','Handlebars','Headsets','Helmets')`

The INTERSECT operator returns only the rows that appear in the results of both queries.

Example:

`SELECT  Product.Name ,`
`        ProductID ,`
`        ProductLine,`
`        ListPrice,`
`        StandardCost,`
`        FinishedGoodsFlag`
`FROM  AdventureWorks2012.Production.Product`
`WHERE StandardCost > 1000`
`INTERSECT`
`SELECT  Product.Name ,`
`        ProductID ,`
`        ProductLine,`
`        ListPrice,`
`        StandardCost,`
`        FinishedGoodsFlag`
`FROM  AdventureWorks2012.Production.Product`
`WHERE FinishedGoodsFlag = '1'`
`AND ProductLine = 'M'`
`ORDER BY Name, ProductID`

When using these operators, both select lists must have the same number of columns, with compatible data types and lengths. Also note that in the INTERSECT query above, we put the ORDER BY clause at the very end, after both SELECT statements.

In upcoming tutorials, I will give more examples of various analytical functions, nested queries and also examples of Data Definition language.

# Normalizing Data

Depending on the design of a relational database, its tables might contain lots of duplicate data, entries that you can’t delete without losing other data stored on the same row, or data that shouldn’t exist at all.

Normalization is the process of effectively organizing and rearranging the data in a database into a more standard, normal form that prevents these kinds of issues. The need to normalize data stems from data integrity, flexibility, and efficiency. There are several levels of normalization, but you might not need to apply all of them to your database. For RDBMS compliance, you can have your database specialist normalize a flat database to the first three levels at minimum. Here are all the levels:

• First Normal Form (1NF)

• Second Normal Form (2NF)

• Third Normal Form (3NF)

• Boyce‐Codd Normal Form (BCNF)

• Fourth Normal Form (4NF)

• Fifth Normal Form (5NF)

• Domain/Key Normal Form (DKNF)

First Normal Form (1NF) – In this step we eliminate repeating groups by creating a separate table for each set of related attributes and assigning a primary key (PK) to each of those tables.

These are the requirements for 1NF:

1. Each column must have a unique name.

2. The order of the rows and columns doesn’t matter.

3. Each column must have a single data type.

4. No two rows can contain identical values.

5. Each column must contain a single value.

6. Columns cannot contain repeating groups.

Second Normal Form (2NF) – Take the data that is only partly dependent on the primary key and put it in a separate table. This means splitting out the columns involved in many-to-many relationships. After splitting the data, foreign keys should be defined and added so that the tables can be related. After this step, all non-key attributes are fully functionally dependent on the primary key.

The requirements for 2NF are listed below

1. The database tables should be in 1NF form.
2. All of the non-key fields depend on all of the key fields.

Third Normal Form (3NF) – In this form, no non-key attribute is transitively dependent on the primary key. A transitive dependency means that one attribute depends on a second attribute, which in turn depends on a third. If tables contain transitive dependencies, deleting data from one table can cause unwanted loss of other information. An easier way of understanding this step: data that is not directly dependent on the primary key should be separated out and stored in another table.

The requirements for 3NF are listed below

1. The database tables should be in 2NF form
2. The tables should not contain transitive dependencies.
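To make the decomposition concrete, here is a minimal Python sketch of splitting a flat, denormalized order table into 3NF-style relations. All table names, columns, and data here are made up for illustration; they are not from any schema discussed above.

```python
# A tiny, illustrative 3NF decomposition; all names and data are hypothetical.

# Un-normalized: customer and product details repeat on every order row.
orders_flat = [
    # (order_id, customer_id, customer_city, product_id, product_name, qty)
    (1, 100, "Austin", 7, "Helmet", 2),
    (2, 100, "Austin", 8, "Gloves", 1),
    (3, 200, "Denver", 7, "Helmet", 5),
]

# customer_city depends only on customer_id, and product_name only on
# product_id, so each pair moves to its own table keyed by that id.
customers = {(cust_id, city) for (_, cust_id, city, _, _, _) in orders_flat}
products = {(prod_id, name) for (_, _, _, prod_id, name, _) in orders_flat}

# The orders table keeps only keys plus attributes that depend on order_id.
orders = [(order_id, cust_id, prod_id, qty)
          for (order_id, cust_id, _, prod_id, _, qty) in orders_flat]

print(sorted(customers))  # each customer stored exactly once
print(sorted(products))   # each product stored exactly once
print(orders)             # order rows reference customers/products by key
```

After the split, a customer’s city lives in exactly one row, so correcting it no longer risks the inconsistent duplicates described above.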

The final tables after performing the first three normalization forms are now ready for use in a relational database.