Data Analysis with Lee Hawthorn

Process Time Sheet Data

Topics: SQL

When we have to deal with time sheet data it can post particular challenges for the DW if not modelled in the right way.

This applies to any HR/Service delivery and it's because we only tend to have Start Dates and End Dates in the source system.

This makes it very complex and expensive later on to calculate Count of Employees for a specific period. This is because we have to write calculations for a specific time range.

This presents a good example of how we can transform the model to make it less expensive for measures.

In this post I'll show how we can transform data with TSQL, way before it gets to the final model/mart.

Source Data

Here's an example of some source data for an Employee table. In my experience this is very common.

EmployeeData

We can transform this table by exploding out rows for each employee for the time range defined. This will make it much easier to create measures as you can simply sum a column like with a normal transaction table.

SQL

SQL has a join type called CROSS JOIN. It's hardly ever used. CROSS JOIN produces a result set which is the number of rows in the first table multiplied by the number of rows in the second table (with no WHERE clause). This kind of result is called a Cartesian Product.

In order to explode out the Employee data we need to CROSS JOIN to a Date/Time table at the grain with which we want to model with later i.e. Day.

You should have this Date data in a Calendar Table (very important table in the DW)

DimDate

Step 1

SELECT d.[Calendar_Date]
	  ,e.[EmployeeID]
      ,e.[HireDate]
      ,e.[TerminationDate]
      ,e.[RoleID]
      ,e.[Gender]
      ,e.[TerminationReason]
      ,e.[BusinessUnitID]
      ,e.[EthnicGroupID]
      ,e.[FullorPart]
      ,e.[PayTypeID]
  FROM [HR].[dbo].[Employees] as e

  CROSS JOIN [dbo].[Dim_Date] as d

This gives Employees * Dates which is not quite what we want as it gives us all employees for all dates.

We can add a where predicate to remove rows we don't want.

Step 2

 WHERE d.Calendar_Date >=e.HireDate  and d.Calendar_Date < e.TerminationDate

We're getting there. One problem we still have is the Employee with no Termination Date i.e. the active ones will be excluded from the results.

Not a problem as we can run a UNION query too.

Step 3

SELECT d.[Calendar_Date]
	   ,e.[EmployeeID]
      ,e.[HireDate]
      ,e.[TerminationDate]
      ,e.[RoleID]
      ,e.[Gender]
      ,e.[TerminationReason]
      ,e.[BusinessUnitID]
      ,e.[EthnicGroupID]
      ,e.[FullorPart]
      ,e.[PayTypeID]
  FROM [HR].[dbo].[Employees] as e

  CROSS JOIN [dbo].[Dim_Date] as d

  WHERE d.Calendar_Date >=e.HireDate  and e.TerminationDate  = '01/01/1900' and d.Calendar_Date < getdate();
GO

The WHERE clause here needs to be customised for your data. In my my data NULL dates are presented as '01/01/1900'. Note also in this case I don't want to show rows for future dates. If you were going use this data for Planning then you would have to consider this.

The final SQL is shown below.

SELECT d.[Calendar_Date]
	  ,e.[EmployeeID]
      ,e.[HireDate]
      ,e.[TerminationDate]
      ,e.[RoleID]
      ,e.[Gender]
      ,e.[TerminationReason]
      ,e.[BusinessUnitID]
      ,e.[EthnicGroupID]
      ,e.[FullorPart]
      ,e.[PayTypeID]
  FROM [HR].[dbo].[Employees] as e

  CROSS JOIN [dbo].[Dim_Date] as d

  WHERE d.Calendar_Date >=e.HireDate  and d.Calendar_Date < e.TerminationDate

UNION ALL

SELECT d.[Calendar_Date]
	   ,e.[EmployeeID]
      ,e.[HireDate]
      ,e.[TerminationDate]
      ,e.[RoleID]
      ,e.[Gender]
      ,e.[TerminationReason]
      ,e.[BusinessUnitID]
      ,e.[EthnicGroupID]
      ,e.[FullorPart]
      ,e.[PayTypeID]
  FROM [HR].[dbo].[Employees] as e

  CROSS JOIN [dbo].[Dim_Date] as d

  WHERE d.Calendar_Date >=e.HireDate  and e.TerminationDate  = '01/01/1900' and d.Calendar_Date < getdate();
GO

All you need now is a surrogate key. This makes the beginning of an Employee fact table.

Summary

Don't assume the sources tables are ready for modelling - when the end users consume your model with Self-Service tools they will have a nicer time with a fact table that has facts rather than a fact-less fact table.

Previous PostCall SSAS OLAP cubes from R stats
Next PostGetting started with Apache Spark