Data warehouse/BI

SSIS – Forcing a sub package to succeed

Use this method when you need to force certain tasks to succeed in spite of failure scenarios. It applies to spreadsheet and flat-file loads that are non-critical to the remaining flow of tasks and should not cause the entire job to fail. The method involves moving the desired load tasks into a new sub package and calling that sub package from the parent via an Execute Package Task. A working example is set up on the TEST server for reference.

The setup and configuration are as follows:

1

In this example we will move the Flat File Load task into a new Sub-package.dtsx that will in turn be called from an Execute Package Task.

Double-click the Execute Package Task -> Package -> ReferenceType and change it to “External Reference”. This is necessary so that when the parent package is executed from a SQL Agent job, the child package is accessible and able to run. Skipping this step results in the error shown below.

 

2

 

3

After switching to External Reference, you’ll need to specify the connection that the SQL Agent job will use to locate the child package.

4

After selecting the connection, select your sub package containing the file loads.

Go back to the parent package, highlight “Execute Package Task” and open up Properties for that task. Change the property value of “ForceExecutionResult” from “None” to “Success”. Make sure you don’t accidentally set this for the entire package!

5

 

Click on the “Event Handlers” tab and select “Execute Package Task” from the drop-down list.

6

 

Click on the blue link shown in the screenshot above. This creates an “OnError” event for the “Execute Package Task”, exposing system variables that are otherwise not accessible. Make sure to click the grid options icon and select “Show system variables”.

7

 

Scroll down, locate the “Propagate” variable and change it from its default value of True to False. This ensures that any error occurring in that component (the Execute Package Task in this case) will not propagate to the other elements of the control flow. Make sure this is done ONLY on the OnError event of the “Execute Package Task”.

8

 

9

 

Add a “Send Mail Task” connected with an On Failure precedence constraint. This will send an email notifying the intended recipient of the sub-package failure.

10

SQL Server Integration Services – An ETL Guide

SQL Server Integration Services has been a tool of choice for me since I first ventured into the SQL Server 2005 world. Since then SQL Server has evolved dramatically, and more features have been added with increased capabilities. With SSIS 2012 you can now do column mapping, declare and change parameter properties at package runtime, run VB or C# code through the Script Task and Script Component, create and share connections at the project level through the connection manager, take advantage of merge join transformations with reduced memory usage, and use the DQS cleansing transformation to perform data cleansing with relative ease.

One of the biggest features of the SQL Server platform is the introduction of In-Memory OLTP execution (shipped with SQL Server 2014). The in-memory feature has become highly sought after due to the dramatic performance gains as well as the dropping price of physical memory modules. Where SQL Server claims 5-20 times performance improvement, SAP HANA, with its new architecture, claims to boost performance by a whopping 300 times! As soon as I get my hands on HANA, I will do a comparison test. In-Memory OLTP gets its performance boost from lock- and latch-free data structures with optimistic concurrency control, which enables concurrent processing without logical locks on the tables. Stored procedures can also be compiled natively, and the data is held in memory. This is all on top of a massive algorithm change that supports faster processing.

Anyway, let us get back to the ETL discussion. Where do ETL and the other tools/capabilities such as OLAP cubes, the Universe semantic layer, end systems and source systems stand in the overall picture of the systems setup?

I created this to understand the role of each layer.

Business Intelligence System Overview

 

From a pure information system’s perspective, it all starts with the source system. What I didn’t show in the diagram above is that the source systems get their data from actual physical events: sales information from a P.O.S. system, payroll data from a Time and Attendance system, customer information from a CRM system, and so on. Let us say a company uses SAP to place orders, create invoices and store shipment data. These are all tied to physical events that happened in the real world. The data entered in those systems is then stored in the application’s native database or external databases. ETL comes into the picture at this point and enables working with the raw data: performing cleansing tasks, data profiling, checking referential integrity, catching NULL errors and invalid/corrupted records, and transforming the data for storage in a dimensional or OLAP-compliant schema. That’s just a very high-level description; a lot goes on behind the scenes, and we’ll see some of it in the coming sections. The ETL script or package stores the data into the respective tables of a data warehouse with proper cardinality and data integrity. The end systems where the final data is presented can access the data warehouse directly (through applications developed using .NET, Java, HTML5 APIs, etc.) or through intermediate tools like the BO universe designer, SQL Server Analysis Services cubes, etc. Each has its own strengths and set of capabilities. For example, if the objective is to create a reporting system where users can do data mining or analysis, SSAS and the BO universe provide that capability by allowing the addition of rules, calculations, etc. that can be used in a friendly GUI, even by a non-technical user.

Let us dive into the SSIS tool. When you log in to SQL Server Data Tools (previously called BIDS, or Business Intelligence Development Studio), you will be prompted to pick a project type for SSIS, SSAS or SSRS. Once you create an Integration Services project, you’ll see a package with a dtsx extension being created. The dtsx package is in XML form, and you can see the code by right-clicking on the package name in Solution Explorer and selecting “View Code”.

SQL Server Data Tools - SSIS

Click on the picture for a larger view.

The view in the picture above is the default layout for SSIS. On the right side is the Solution Explorer, where you can create new connections, packages, etc.; in the center you create the Control Flow and Data Flow designs; and in the bottom section you manage project-level connections and declare variables in the Variables section.

Control Flow is where you define the various tasks, arranged in a logical order and sequential pattern, that produce the desired result. The control flow contains task components like the Data Flow task, FTP task, Script task and Execute Process task, as well as containers like For Loop and Foreach Loop. Tasks are executed in order by using precedence constraints: a constraint is nothing but the path the process takes on success, completion or failure.

Tasks are individual work functions that execute particular code based on the type of task and some user input. The great thing about SSIS is that users don’t need to write a function in code form; tasks can simply be dragged and dropped. Try dragging a task into the Control Flow and double-click it. A task editor window will open up that lets you set the parameters, expressions, script, properties, etc.

Tasks are generally executed sequentially, and each task produces a success, failure or completion result that gets evaluated against constraints, expressions or a combination of both using logical OR and logical AND. (We’ll see those shortly in an example.)

Here are a few of the most popular tasks.

  • Analysis Services Processing Task – This task will process a SQL Server Analysis Services Cube, dimension or a mining model. Frequently used to update the analytical cube with latest data.
  • Bulk Insert Task – This task executes the BULK INSERT SQL command to load data into a table (a standalone T-SQL sketch of the underlying command follows this list).
  • Data Flow Task – The data flow is the heart of the ETL package, where the various transformations happen in the data pipeline. This is the most frequently used task in any package. The data flow task lets you connect to any data source and outputs the data based on the transformations performed. Whenever this task is added to the control flow, a corresponding data flow design surface is automatically created that can be accessed through the tab next to the “Control Flow” tab.
  • Execute Package Task – This task lets you execute a package from within another package, making the design modular.
  • Execute Process Task – This task executes a program external to the package. The task editor lets you select a program on your operating system to be executed, with input and output parameters or via an expression.
  • Execute SQL Task – This is another popular task, frequently used by database architects to execute a SQL statement. For example, the SQL might insert data from the source (e.g., an Excel sheet) as soon as the sheet is loaded with data by another task.
  • FTP Task – Lets you connect to a server using the FTP protocol for sending and receiving files.
  • Send Mail Task – Will let you send an email using SMTP protocol. Commonly used to send out a custom message on success, failure or completion of a task.
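As a reference for the Bulk Insert Task above, the task ultimately issues a BULK INSERT statement along the lines of the sketch below. The staging table name, file path and format options here are made-up examples for illustration, not values from the packages discussed in this post.

BULK INSERT dbo.StagingSales
FROM 'C:\Loads\sales_extract.csv'
WITH (
    FIELDTERMINATOR = ',',   -- column delimiter used in the flat file
    ROWTERMINATOR   = '\n',  -- row delimiter
    FIRSTROW        = 2      -- skip the header row
);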

Precedence Constraints are the links that connect the various tasks and direct the flow of task execution in the designed order. Business rules or logic can be applied as part of an expression; for example, execute the next task only when a defined variable has a value greater than 100. The constraint value determines whether and when the next task will be executed in response to the preceding task’s execution.

There are three constraint values –

  • Success – The linked task executes only when the preceding task completes successfully.
  • Failure – The linked task executes only when the preceding task fails.
  • Completion – The linked task executes on completion of the preceding task, regardless of success or failure.

Precedence Constraints

 

 

Containers are used to group various tasks together logically into units of work. This not only helps visually but also enables defining variables and event handlers on the container itself. There are four types of containers:

  • Task Host Container- This is not a visible item but just an abstract interface concept.
  • Sequence Container- This container allows grouping tasks logically with separate boundaries. This container can be collapsed and expanded as needed.
  • For Loop Container- This sets up a loop in which all the tasks within the container are executed repeatedly for as long as a condition is satisfied.
  • Foreach Loop Container- This sets up a loop over a collection such as files or records and executes the contained tasks once for each item.

Data Flow is the core of any SSIS package and is where most of the action takes place. It enables you to authenticate and connect to the database, load the relevant data into the database engine’s memory, perform transformations like row sampling, term lookup, import column, fuzzy logic, etc., and export the data to ODBC, OLE DB, raw file, Excel, SQL Server and other destinations. This is where the in-memory processing of data takes place, and it really shows the performance edge of the SQL Server database engine. The connections that can be used for sources and destinations are defined in the connection manager, which is globally available to any data flow task.

 

Note: More Content and Examples to follow. Stay Tuned.

 

SQL Tutorial – Data Retrieval (Part 2)

In the first part of this series, we saw the structure of SELECT Statement, how to join multiple tables, operators and building a subquery.

Today we will look at the nested table expression, which is frequently used when writing stored procedures or views. With this expression we create a temporary view from a query defined in the FROM clause. This helps when you are trying to apply aggregations to a set of data and use the result set as a temporary table. This temporary table takes the spot of an actual table name in the FROM clause and acts like a virtual table.

We’ll use the Human Resources schema for all the examples in this post. Notice how department information is stored in the Department table and the history of each employee’s department is stored in the “EmployeeDepartmentHistory” table. Note the usage of start and end dates, which tell you which DepartmentID an employee belonged to at any point in history. Employee details are, as you can see, stored in the “Employee” table.

Human Resource Schema

Before we construct a nested query, let us take a peek at the data for each of the tables we’ll be using in our final query.

**This is a partial result set for the Employee table.

SELECT LoginID
      ,VacationHours
      ,SickLeaveHours
      ,OrganizationLevel
      ,SalariedFlag
FROM AdventureWorks2012.HumanResources.Employee

Employee Table

Partial data for EmployeeDepartmentHistory table.

Resultset for EmpDeptHistory

Data from Department table

Department table resultset

 

In order to construct a nested query, let’s make up a scenario where we want to find the total of average vacation hours for salaried and non-salaried employees across the company.

We’ll also add some rules to our query.

  • Results should include records only for those employees who took more than 20 sick leave hours in the selected time frame.
  • Results should include employees from the Sales, Purchasing, Marketing or Engineering departments only.

 

Here is the final query which we will dissect and understand after the jump.

SELECT SUM(AVG_VacationHours) AS TotalOfAvgVacHours, AVG(TotalTimeOff) AS AVGTotalTimeOff, SalariedFlag
FROM (
    SELECT
        AVG(VacationHours) AS AVG_VacationHours
        ,SUM(SickLeaveHours+VacationHours) AS TotalTimeOff
        ,OrganizationLevel
        ,EmpHistory.DepartmentID AS DeptID
        ,Department.Name
        ,SalariedFlag
    FROM AdventureWorks2012.HumanResources.Employee AS Employee
    INNER JOIN AdventureWorks2012.HumanResources.EmployeeDepartmentHistory AS EmpHistory
        ON Employee.BusinessEntityID = EmpHistory.BusinessEntityID
    INNER JOIN AdventureWorks2012.HumanResources.Department AS Department
        ON Department.DepartmentID = EmpHistory.DepartmentID
    WHERE HireDate BETWEEN '01-05-2005' AND GETDATE()
        AND SickLeaveHours > 20
        AND Department.Name IN ('Sales','Purchasing','Marketing','Engineering')
    GROUP BY OrganizationLevel, SalariedFlag, EmpHistory.DepartmentID, Name
) AS Temp_Table
GROUP BY Temp_Table.SalariedFlag

Result:
Final Result for Nested Query

We start by selecting the items we want to form our virtual table, the result of which is used in the FROM clause of the main query. The SQL in the nested expression takes the place of an actual table and satisfies the rules mentioned earlier.

In the nested query, we take the average of vacation hours and sum the sick leave hours with the vacation hours for employees hired between 01-05-2005 and today. We also pick the departments in this query and include only those employees who took more than 20 hours of sick leave.

Here is the result set for the nested query.

Nested Query Result set

This is the data of the virtual table that is used in the FROM clause of the main query. In the main query, we sum up all the average vacation hours and group the result by SalariedFlag.

The end result tells us the average vacation hours that salaried employees and non-salaried employees (contractors) took. The query also gives the average time off for both categories.

Here is the result set again.

Final Result for Nested Query

 

 

SQL Tutorial – Data Retrieval (Part 1)

Structured Query Language is the universal language spoken by all enterprise databases. If you ask database developers what their primary spoken language is, most will say “SQL”. That was a joke! But the point is that when you are working with databases, understanding SQL is of foremost importance, since it is the building block for writing and reading stored procedures, derived views, macros, etc., which are the most commonly used forms of SQL for retrieving data in the desired shape.

Across various databases such as SQL Server, Teradata, Oracle, Netezza, etc., the statement functions and syntax can differ, but the structure and fundamental concepts of SQL apply to all of them.

We will be using the Production schema from the AdventureWorks 2012 database on SQL Server 2012 for all our examples. Click on the picture for the schema diagram.

Production Schema

SQL statements are categorized by their primary function. Below are the types of SQL Statement.

Types of SQL Statements

Before we start, let’s understand the structure of the SELECT statement, since this is the most frequently used data retrieval statement. The SELECT and FROM clauses are required; the remaining clauses are optional.

Select Statement Construct

If the data we are trying to retrieve lies in two separate tables, they need to be joined using the primary key and foreign key relationship that exists between the two tables.

There are different types of joins, and they are easier to understand using Venn diagrams from set theory:

  • Inner Join – This join returns only the rows common to both tables, i.e. the rows that match on the join key.

Select Table_A.*,Table_B.* from Table_A INNER JOIN Table_B on Table_A.Key = Table_B.Key

Inner Join Visual Diagram

  • Right Outer Join - This join, as the name suggests, returns the matching rows from the table on the left side (A) of the join and the entire dataset from the table on the right side (B).

Select Table_A.*,Table_B.* from Table_A RIGHT OUTER JOIN Table_B on Table_A.Key = Table_B.Key

Right Join Visual Diagram

  • Left Outer Join – This join, as the name suggests, returns the matching rows from the table on the right side of the join and the entire dataset from the table on the left side.

Select Table_A.*,Table_B.* from Table_A LEFT OUTER JOIN Table_B on Table_A.Key = Table_B.Key

Left Join Visual Diagram

  • Full Outer Join – Returns all the rows from both tables, matching them where possible and padding the non-matching side with NULLs.

1. Select Table_A.*,Table_B.* from Table_A FULL OUTER JOIN Table_B on Table_A.Key = Table_B.Key

Outer Join Visual Diagram

2. Select Table_A.*,Table_B.* from Table_A FULL OUTER JOIN Table_B on Table_A.Key = Table_B.Key where Table_A.Key is NULL OR Table_B.Key is NULL

Outer Join with NULL Keys

  • Cross Join – This is commonly known as a Cartesian product and should generally be avoided. The join returns every combination of rows from the two tables, i.e. N x M rows for tables of N and M rows.

Select Table_A.*,Table_B.* from Table_A CROSS JOIN Table_B

 

Let’s start with an example by constructing a basic SELECT statement that has all the clauses shown in the picture above.

SELECT  Product.Name ,
        ProductID ,
        ProductLine,
        ListPrice,
        AVG(ListPrice) AS Avg_ListPrice,
        StandardCost
FROM  AdventureWorks2012.Production.Product
WHERE ListPrice >= 100
AND StandardCost <= 300
AND ProductLine IS NOT NULL
GROUP BY ProductID,
         ListPrice,
         Product.Name,
         ProductLine,
         StandardCost
HAVING AVG(ListPrice) < 500
ORDER BY ProductID ASC,
         Product.Name ASC,
         ListPrice DESC

 

Complete Basic Select SQL Statement

The above example selected data from a single table. When working with an RDBMS using dimensional modeling, or with OLAP cubes, the tables will most likely be in normalized form, and since the data resides in a large number of normalized tables, retrieving it requires constructing a SQL query that joins the different tables related by primary keys and foreign keys. I covered normalization up to 3NF in this post.

Each table is created/designed to have at least a unique primary key (PK) for every row, and tables can be related to other tables by using a foreign key (FK), as shown in the example below.

Primary Key and Foreign Key Relation

 

ProductID is the primary key in the Product table, and ProductSubcategoryID is one of its foreign keys. You can see that ProductSubcategory has ProductSubcategoryID as its primary key, so these two tables can be joined through the primary key – foreign key relationship.
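As a simplified sketch of how such a relationship is declared (these are trimmed-down, illustrative column lists, not the full AdventureWorks table definitions):

CREATE TABLE Production.ProductSubcategory (
    ProductSubcategoryID INT          PRIMARY KEY,
    Name                 NVARCHAR(50) NOT NULL
);

CREATE TABLE Production.Product (
    ProductID            INT          PRIMARY KEY,
    Name                 NVARCHAR(50) NOT NULL,
    ProductSubcategoryID INT          NULL,
    -- The foreign key ties each product row to its subcategory row.
    CONSTRAINT FK_Product_ProductSubcategory
        FOREIGN KEY (ProductSubcategoryID)
        REFERENCES Production.ProductSubcategory (ProductSubcategoryID)
);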

Let’s join the Product and ProductSubcategory tables. Note how we applied an inner join on the third line. A right outer join, left outer join or full outer join could also be used, depending on what result set you are trying to retrieve.

SELECT  Product.Name , ProductID , ProductLine , Product.ProductSubcategoryID
FROM  AdventureWorks2012.Production.Product AS Product
INNER JOIN AdventureWorks2012.Production.ProductSubcategory AS ProductSubCat
ON Product.ProductSubcategoryID = ProductSubCat.ProductSubcategoryID

 

Resultset 1 basic join

Let’s understand the subquery concept. Take the tables “Product”, “BillOfMaterials” and “UnitMeasure”, and suppose we want products of subcategory 12 that are measured in “Each”. There are 38 unit measure codes in the “BillOfMaterials” table, and the names of those unit measure codes are in the “UnitMeasure” table. Let’s say you don’t want to manually look up the code in the UnitMeasure table (for “Each” in this case) that you want to filter the result set on. We can write a subquery that returns the measure code for the selected measure name.

In short, we add another query (a subquery) in the WHERE clause that retrieves the code for “Each” and feeds it into the main query’s WHERE clause condition.

SELECT  Product.Name ,
        ProductID ,
        ProductLine ,
        Product.ProductSubcategoryID ,
        BillofMaterial.UnitMeasureCode ,
        BillofMaterial.PerAssemblyQty
FROM    AdventureWorks2012.Production.Product AS Product
        INNER JOIN AdventureWorks2012.Production.BillOfMaterials AS BillofMaterial
        ON Product.ProductID = BillofMaterial.ComponentID
WHERE ProductLine IS NOT NULL
AND ProductSubcategoryID = 12
AND UnitMeasureCode = (SELECT UnitMeasure.UnitMeasureCode
                       FROM AdventureWorks2012.Production.UnitMeasure
                       WHERE UnitMeasure.Name = 'Each' )

Let’s learn about a few set operators: UNION, UNION ALL, EXCEPT and INTERSECT.

The UNION operator combines the results of two or more queries and removes duplicate rows, while UNION ALL retains the duplicate rows.

SELECT ProductSubcategoryID
      ,Name
  FROM AdventureWorks2012.Production.ProductSubcategory
  WHERE ProductSubcategoryID IS NOT NULL
UNION
SELECT ProductSubcategoryID
      ,[Name]
  FROM AdventureWorks2012.Production.Product
  WHERE ProductSubcategoryID IS NOT NULL

Union Query Example

The EXCEPT operator returns the rows from the query before the EXCEPT operator that do not appear in the result of the query after it.

Example:

SELECT ProductSubcategoryID
      ,Name
  FROM AdventureWorks2012.Production.ProductSubcategory
  WHERE ProductSubcategoryID IS NOT NULL
  EXCEPT
  SELECT ProductSubcategoryID
         ,[Name]
   FROM AdventureWorks2012.Production.product
   WHERE Name  IN ('Forks','Gloves','Handlebars','Headsets','Helmets')
Result for EXCEPT Operator

The INTERSECT operator returns only the rows common to both queries.

Example.

SELECT  Product.Name,
        ProductID,
        ProductLine,
        ListPrice,
        StandardCost,
        FinishedGoodsFlag
FROM  AdventureWorks2012.Production.Product
WHERE StandardCost > 1000
INTERSECT
SELECT  Product.Name,
        ProductID,
        ProductLine,
        ListPrice,
        StandardCost,
        FinishedGoodsFlag
FROM  AdventureWorks2012.Production.Product
WHERE FinishedGoodsFlag = '1'
      AND ProductLine = 'M'
ORDER BY Name, ProductID
Intersect SQL Example

When using these operators, the SELECT lists must have the same number of columns with compatible data types and lengths. Also note in the INTERSECT query above that the ORDER BY clause goes at the very end, after both SELECT statements.

In upcoming tutorials, I will give more examples of various analytical functions, nested queries and also examples of Data Definition language.

Normalizing Data

Depending on the design of a relational database, the tables might contain lots of duplicate data, rows that mix up independent facts so that you can’t delete one entry without losing another, or data that shouldn’t exist at all.

Normalization is the process of effectively organizing and rearranging the data in a database into a more standard, normal form that prevents these kinds of issues. The need to normalize data comes from the requirements of data integrity, flexibility and efficiency. There are several levels of normalization, but you might not need to apply all of them to your database. For a typical relational database, your database specialist should normalize a flat database to at least the first three levels. Here are all the levels:

  • First Normal Form (1NF)

  • Second Normal Form (2NF)

  • Third Normal Form (3NF)

  • Boyce‐Codd Normal Form (BCNF)

  • Fourth Normal Form (4NF)

  • Fifth Normal Form (5NF)

  • Domain/Key Normal Form (DKNF)

First Normal Form (1NF) – In this step we eliminate repeating groups by creating a separate table for each set of related attributes and assigning a primary key (PK) to each of those tables.

These are the requirements for 1NF

  1. Each column must have a unique name.

  2. The order of the rows and columns doesn’t matter.

  3. Each column must have a single data type.

  4. No two rows can contain identical values.

  5. Each column must contain a single value.

  6. Columns cannot contain repeating groups.

1st Normal form - Analyticalviews.com

 

 

Second Normal Form (2NF) – Take the data that is only partly dependent on the primary key and put it in a separate table. This means splitting out the columns involved in many-to-many relationships. After splitting the data, foreign keys should be defined and added so that the tables can be related. After this step, all the non-key attributes are fully functionally dependent on the primary key.

The requirements for 2NF are listed below

  1. The database tables should be in 1NF form
  2. All of the non-key fields are dependent on all of the key fields.

Second Normal form 2NF - Analyticalviews.com

 

 

Third Normal Form (3NF) – In this form, no non-key attribute is transitively dependent on the primary key. A transitive dependency means that one attribute depends on a second attribute, which in turn depends on a third. If such dependencies remain, deleting data from a table can cause unwanted loss of other information. An easier way of understanding this step is that data that does not depend directly on the primary key should be separated out and stored in another table.

The requirements for 3NF are listed below

  1. The database tables should be in 2NF form
  2. Should not contain transitive dependencies

Third Normal form 3NF - Analyticalviews.com

 

 

The final tables after performing the first three normalization forms are now ready for use in a relational database.

Normalized tables - Analyticalviews.com
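As a purely hypothetical illustration (the table and column names below are invented for this sketch, not the ones shown in the figures above), a flat list of orders normalized through 3NF might end up as a set of tables like this:

CREATE TABLE Customers (
    CustomerID   INT           PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL,
    City         NVARCHAR(50)  NOT NULL
);

CREATE TABLE Products (
    ProductID   INT           PRIMARY KEY,
    ProductName NVARCHAR(100) NOT NULL,
    UnitPrice   MONEY         NOT NULL
);

CREATE TABLE Orders (
    OrderID    INT  PRIMARY KEY,
    CustomerID INT  NOT NULL REFERENCES Customers (CustomerID),
    OrderDate  DATE NOT NULL
);

-- Line items depend on the whole key (OrderID + ProductID), satisfying 2NF;
-- customer and product descriptions live in their own tables, removing the
-- transitive dependencies that 3NF forbids.
CREATE TABLE OrderLines (
    OrderID   INT NOT NULL REFERENCES Orders (OrderID),
    ProductID INT NOT NULL REFERENCES Products (ProductID),
    Quantity  INT NOT NULL,
    PRIMARY KEY (OrderID, ProductID)
);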

 

BusinessObjects vs Tableau Assessment

Tableau is becoming a very popular and rapidly adopted analytical tool. BusinessObjects has long held the crown as the most popular Business Intelligence and analytical platform, with its diverse and rich set of tools and capabilities.

Let’s look at an assessment of their individual capabilities and how they line up against each other.

BOBJ vs Tableau Tool Comparison

 

I hope this helps organizations make an informed decision based on their needs.

 

Myths surrounding Dimensional Models

I came across an interesting article dispelling myths around dimensional models and thought I would discuss them here.

Myth 1: Dimensional models are only for summary-level data – This myth is the result of ill-designed models, or of planning a design that caters only to certain applications requiring a narrow set of query results. Dimensional models should be queryable from the lowest grain to the highest possible and therefore should contain the full set of data. There is practically no limit on how much data can be stored in dimensional model tables. Summary-level data should complement the granular data, never replace it. The amount of data stored in a dimensional model should be driven by business requirements and processes.

Myth 2: Dimensional models are departmental rather than enterprise-wide – Dimensional models should be designed around business processes, and the data captured and collected can then be used by different departments in the company. Dimensional models should not be designed separately and solely for a single business unit; they should be built for use across different groups within a company. For example, sales data can be stored centrally in a dimensional model and then used in different data marts designed and developed for different groups such as pricing, promotions and sales strategy.

Myth 3: Dimensional models are not scalable – This is completely untrue; dimensional models are extremely scalable. All the major database vendors have accommodated the need to store billions of rows of data. However large the data and relationships of a normalized model, they can be translated into a dimensional model because the logical content is similar.

Myth 4: Dimensional models cannot be integrated – Dimensional attributes are highly reusable and are often used in the ETL process, transformed and replicated into other databases in the desired form. Dimensional models integrate very well with other databases when used as a centralized location for master data, with ETL feeds replicating and sourcing the data for different applications.

Collection of Database Concepts

I will use this section to add bits of assorted information related to databases and with time I will revisit this section to add more mini-topics.

Schemas-

In a big organization, what are the chances that groups or individuals will create views with the same name? It happens very often. For example, say there are separate divisions operating under a parent company, selling products in completely different categories and markets. The business lines will have their own strategies, processes and ways of doing their daily business while making sure their fundamentals are aligned with the parent company’s goals. IT resources from both divisions working on sales data in the parent company’s database will most likely create logical views or macros to support their line of business, and the chances are that, out of hundreds of views, some will be named the same. To avoid this scenario, an RDBMS provides hierarchical cataloging of views/tables/procedures by schema name. The database admin can create a schema for any user or group, and different schemas can contain views with the same name. This helps in identifying the right entity and provides flexibility when multiple groups in a company design and work on multiple applications. Views are then referenced by the schema name followed by the view name, for example schema1.inventory and schema2.inventory.

So, in short, a schema is basically a namespace for different users or groups with separate access privileges, giving them the ability to store, run or edit views and stored procedures.
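A minimal T-SQL sketch of the idea (the schema names, view names and underlying tables here are made up for illustration):

CREATE SCHEMA division1 AUTHORIZATION dbo;
GO
CREATE SCHEMA division2 AUTHORIZATION dbo;
GO
-- Both divisions keep a view named "inventory" without any collision,
-- because each view is cataloged under its own schema.
CREATE VIEW division1.inventory AS
    SELECT ProductID, QuantityOnHand FROM dbo.Division1Stock;
GO
CREATE VIEW division2.inventory AS
    SELECT ProductID, QuantityOnHand FROM dbo.Division2Stock;
GO
-- Each group queries its own view by its schema-qualified name:
SELECT * FROM division1.inventory;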

Joining Fact Tables -

Each business process translates into a fact table containing the measurements of that process, surrounded by dimension tables containing the related contextual data, grouped by similar categories of description. For example, a sales model can have a fact table holding the measures of unit quantities sold, with a group of dimension tables surrounding it. A location table could hold information about the retailers who sold those units: retailer code, name, address, zip code, store-open indicator, active flag and so on. Another table could be a sales calendar containing the company’s calendar with different groupings such as week ending dates, month, quarter and year. Each of these dimension tables has a primary key through which it joins to the fact table in a many-to-one relationship. In its simplest form, a fact table is surrounded by a number of dimension tables joined through keys. This allows the query engine to pick up the data requested by the constraints and scan the fact table for the related entries.
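A minimal sketch of such a star-schema query, assuming hypothetical fact and dimension tables along the lines described above (the names FactSales, DimLocation and DimSalesCalendar and their columns are invented for this example):

SELECT calendar.MonthName,
       location.RetailerName,
       SUM(sales.UnitsSold) AS TotalUnitsSold
FROM dbo.FactSales AS sales
INNER JOIN dbo.DimLocation AS location
    ON sales.LocationKey = location.LocationKey    -- many-to-one join to the location dimension
INNER JOIN dbo.DimSalesCalendar AS calendar
    ON sales.DateKey = calendar.DateKey            -- many-to-one join to the calendar dimension
WHERE location.ActiveFlag = 1
GROUP BY calendar.MonthName, location.RetailerName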

Dimension Tables for Textual Context Usage

Dimension tables are an integral part of any dimensional modeling schema. They contain the textual context associated with the business process measurements stored in fact tables.

Such a table can contain anywhere between 20 and 100 attributes or columns. These tables are smaller in terms of number of rows but are considered wide, with much larger text content. Each dimension table has to have a primary key through which it can be linked or associated to the fact table. The attributes serve as the primary source of groupings, query filters, constraints and labels. Attributes can at times have cryptic or coded names, but it is considered good practice to name them according to the description of the column’s content. This makes it easier for a user or developer to understand and use them rather than memorizing what each coded name means, although many times the operational codes do have some logic to their naming convention, aligned with the applications and source data used. For example, the first four letters could relate to the area and the next three to the district, and so on.

It is said that the analytical power of any DW/BI is directly proportional to the quality and depth of the attributes.  A dimensional table designed to contain multiple levels ranging from the most granular to the highest possible under the business rules and capabilities will give the most flexibility in usage of the contained data in any Business Intelligence application or tools. These levels being in hierarchical grouping allows different tools to produce resultant data in the form required or as needed according to the specification.  The hierarchical data should have description added as a column in spite of the redundancy of data. This allows ease of use and better overall query performance. Dimensional tables should not be normalized to a degree where descriptive information is stored in another supplemental table in form of snow flaking. Since dimensional tables have significantly fewer rows than a fact table, further normalization would rarely make a difference in database size.