Data Analysis with Lee Hawthorn

Cleaning Data with SQL

Topics: SQL

When creating BI models and looking into the guts of the ERP system you see data artifacts from user behavior or sometimes poor design which requires you to get out cleaning gear.

Cleaning

It goes without saying the best course of action is to clean data at source – if only this could happen more often.

In this post I’ll show a little example I learnt recently to infer dates with T-SQL.

In the scenario below we have HR data showing employees at various positions over time. The time frame of each position is represented with Valid From and Valid To dates.

The integrity of this data is crucial if the fields are going to be used in a join.

When the validTo date is flaky we can infer the date by looking at the validFrom of the employees next position. Usually this requires a complex query with self-joins and other manners of madness.

T-SQL in SQL Server 2012+ gives us the LEAD function. We can use LEAD() to write a simple query with fewer lines of code. It works by partitioning the data into sets split by employee. Each set is sorted by validFrom so the next row gives us the date needed. I had to use DATEADD to subtract 1 from the next ValidTo date was there was an issue using LEAD with int and dates.

You can set up the data with the code below. I’ve added real ValidTo dates to show the inference is correct.

CREATE DATABASE [HR]
GO
USE HR;
CREATE TABLE [dbo].[employees](
	[id] [smallint] NOT NULL,
	[name] [nvarchar](50) NULL,
	[employeeId] [smallint] NULL,
	[validFrom] [date] NULL,
	[validTo] [date] NULL)


INSERT INTO [dbo].[employees]
	(id, name,employeeId, validFrom, validTo)

values   (1,'Nelson', '1001', '20150101' ,'20150410')
		,(2,'Nelson', '1001', '20150411' ,'')
		,(3,'John', '1002', '20150301' ,'20151002')
		,(4,'John', '1002', '20151003' ,'')
		,(5,'Paul', '1003', '20150101' ,'20150410')
		,(6,'Paul', '1003', '20150411' ,'20160410')
		,(7,'Paul', '1003', '20160411','')
		,(8,'George', '1004', '20150310' ,'20160612')
		,(9,'George', '1004', '20160613' ,'20170410')
		,(10,'George', '1004', '20170411' ,'')
GO

SELECT id, name, employeeId,validFrom,validTo,
	    DATEADD(DAY,-1,LEAD(validFrom,1,'19000102')
			OVER (PARTITION BY employeeId ORDER BY validFrom)) as InferedValidTo
FROM [dbo].[employees]
ORDER BY employeeId;

Data

Previous PostScraping data with R
Next PostData Analytics modelling, why tune by hand?