As part of #TSQL2sday, Mickey Stuewe (b|t) has posed a question about best practices when someone has made some bad design decisions. Generally, when I am facing bad design problems, I am the one who put them there (:P). This post is not so much a “how to fix an issue” as a tip on how to work with dates.
As a BI developer I am always working with dates and numbers. A user will come and ask for a query about what happened last month. What they typically mean is: what happened per day? How much did we sell on each day? How many customers visited our website each day? How many members of my team were absent in October?
When you’re working with HR data you commonly want to see where you have absences over a period of time. This might be a chart per day or per month. So this is what we want to do: for each day in October, we want to see how many people were absent. And here we start to see a common issue. It is easy to overlook the same issue with sales per day: when you have no sales on an edge-case day, that day can go missing from the results, and you really don’t want to miss that!
Here is some sample data to illustrate the point. The sample data is my team and their absences. Bobby and Sammy have both had a bad month and have been ill on multiple occasions (man-flu). They have a few consecutive periods of illness and the odd single day. What I want to see is 31 rows returned (one for every day in October), showing zeros where there were no absences. However, our HR system only writes a row when an absence is logged, and nothing for the days where nothing has been logged. Here is the sample data:
DECLARE @Example TABLE
(
    ID INT IDENTITY,
    StaffMember VARCHAR(50),
    AbsenceDate DATE,
    AbsenceReason VARCHAR(50)
)
INSERT INTO @Example (StaffMember, AbsenceDate, AbsenceReason)
VALUES
     ('Bobby', '20151001', 'Sickness')
    ,('Bobby', '20151002', 'Sickness')
    ,('Bobby', '20151003', 'Sickness')
    ,('Bobby', '20151009', 'Sickness')
    ,('Bobby', '20151022', 'Sickness')
    ,('Bobby', '20151021', 'Sickness')
    ,('Bobby', '20151023', 'Sickness')
    ,('Bobby', '20151024', 'Sickness')
    ,('Sammy', '20151001', 'Sickness')
    ,('Sammy', '20151003', 'Sickness')
    ,('Sammy', '20151022', 'Sickness')
    ,('Sammy', '20151025', 'Sickness')
When you run the query the end user asked for, you only get rows for the dates that have an absence logged:
SELECT AbsenceDate, COUNT(*) AS 'AbsenceCount'
FROM @Example
WHERE AbsenceDate >= '20151001'
  AND AbsenceDate < '20151101'
GROUP BY AbsenceDate
But we are missing most of the days. If this data is to be used for a visualisation in Excel or SSRS then we want to see it per day, yet we will only have data points for the days where something has been logged, and that is an issue. That isn’t the fault of the developers, though. Why would you write out more than you need? It doesn’t make sense. OK, we have an issue, but it doesn’t stop there! Our user now wants to know “what days were there no absences?”. How can we look for data that does not exist?
We obviously need something to help: a tally table. A tally table, also commonly known as a numbers table, is a simple table that holds a sequence of numbers or dates. In a data warehouse you might have a date dimension, which is similar. A tally table can hold the numbers 1 to 100 million, or however many you need; it is up to you. For this example we want one row in a table for every date. The script below will build you a table and insert as many rows as you want. For this example I have used 365 days from 01-01-2015, which gives a row for every day in 2015. Alter the @j variable to extend this, and alter @startDate to move the start date.
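As a minimal sketch of what such a build script could look like (not necessarily the original), the following assumes the table is called dbo.TallyTbl with the TallyDate and TallyDateWeekend columns that the later queries use, and that @j is the number of days to generate. The weekend calculation is my own choice and is labeled in the comments.

```sql
-- Sketch only: table and column names are taken from the queries later in this post;
-- the loop and the weekend calculation are assumptions, not the original script.
DECLARE @startDate DATE = '20150101';  -- first date to generate
DECLARE @j INT = 365;                  -- number of days to generate

IF OBJECT_ID('dbo.TallyTbl', 'U') IS NOT NULL
    DROP TABLE dbo.TallyTbl;

CREATE TABLE dbo.TallyTbl
(
    TallyDate        DATE NOT NULL PRIMARY KEY,
    TallyDateWeekend BIT  NOT NULL   -- 1 = Saturday or Sunday, 0 = weekday
);

DECLARE @i INT = 0;

WHILE @i < @j
BEGIN
    INSERT INTO dbo.TallyTbl (TallyDate, TallyDateWeekend)
    SELECT d.TallyDate,
           -- 1900-01-01 was a Monday, so (days since then) % 7 gives 0 = Monday ... 6 = Sunday.
           -- This avoids depending on the session's DATEFIRST or language settings.
           CASE WHEN DATEDIFF(DAY, '19000101', d.TallyDate) % 7 >= 5 THEN 1 ELSE 0 END
    FROM (SELECT DATEADD(DAY, @i, @startDate) AS TallyDate) AS d;

    SET @i += 1;
END;
```

A row-by-row loop is fine for a few hundred dates; for millions of rows a set-based insert would be the better choice.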
After running this script you will have a full table, and now we have something that we can use to start answering the questions we have been asked. So let's look at our questions again.
Question number one: create a list of all absences in October (including the days where we have no absences).
;WITH TallyDates AS
(
    SELECT *
    FROM dbo.TallyTbl
    WHERE TallyDate >= '20151001'
      AND TallyDate < '20151101'
)
, AbsenceData AS
(
    SELECT AbsenceDate, COUNT(*) AS 'AbsenceCount'
    FROM @Example
    GROUP BY AbsenceDate
)
SELECT TallyDates.TallyDate, COALESCE(AbsenceData.AbsenceCount, 0) AS 'AbsenceCount'
FROM TallyDates
LEFT JOIN AbsenceData ON TallyDates.TallyDate = AbsenceData.AbsenceDate
Now we have a full list of all the days including those where we have had no absences. Ideal!
Question number two: in October, on which dates were there no absences? And to make it a little harder, which of those days were working days? Wondering why there was a TallyDateWeekend column? It was for this.
;WITH TallyDates AS
(
    SELECT TallyDate, TallyDateWeekend
    FROM dbo.TallyTbl
    WHERE TallyDate >= '20151001'
      AND TallyDate < '20151101'
)
SELECT TallyDates.TallyDate AS 'DateWithNoAbsence'
FROM TallyDates
WHERE TallyDates.TallyDate NOT IN (SELECT AbsenceDate FROM @Example)
  AND TallyDates.TallyDateWeekend = 0
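One thing to watch with NOT IN: AbsenceDate is nullable in the sample table, and if the subquery ever returns a NULL then NOT IN returns no rows at all. A hedged alternative is NOT EXISTS, sketched below. It assumes dbo.TallyTbl already exists with the TallyDate and TallyDateWeekend columns, and the row with a NULL date is hypothetical, added only to show the difference.

```sql
-- Assumes dbo.TallyTbl exists with the TallyDate and TallyDateWeekend columns used above.
DECLARE @Example TABLE
(
    ID INT IDENTITY,
    StaffMember VARCHAR(50),
    AbsenceDate DATE NULL,
    AbsenceReason VARCHAR(50)
);

INSERT INTO @Example (StaffMember, AbsenceDate, AbsenceReason)
VALUES ('Bobby', '20151001', 'Sickness'),
       ('Sammy', NULL,       'Sickness');  -- hypothetical row with no date recorded

-- With NOT IN, the NULL above would make the predicate UNKNOWN for every date,
-- so no rows would come back. NOT EXISTS compares row by row and is not affected.
SELECT t.TallyDate AS DateWithNoAbsence
FROM dbo.TallyTbl AS t
WHERE t.TallyDate >= '20151001'
  AND t.TallyDate <  '20151101'
  AND t.TallyDateWeekend = 0
  AND NOT EXISTS (SELECT 1
                  FROM @Example AS e
                  WHERE e.AbsenceDate = t.TallyDate);
```

With the sample data in this post there are no NULL dates, so both forms return the same rows.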
So something that seemed really difficult before is now really simple. This is not a fix for a bad design, but a way to manage dates when you need to look for the ones that are missing.
If you do not want to make a tally table, or you cannot alter the database, then you can do all of this in a different way. Itzik Ben-Gan developed a little script that can generate millions of rows with zero I/O. You can read about this here: http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers
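To show how a virtual numbers table like this might stand in for dbo.TallyTbl in the October example, here is a minimal sketch. The @startDate and @endDate variables and the Nums CTE name are my own choices for illustration, and I have stopped at four levels of cross joins (256 rows) because one month only needs 31.

```sql
-- Sketch only: generates one row per day in October 2015 without reading any table.
DECLARE @startDate DATE = '20151001';
DECLARE @endDate   DATE = '20151101';  -- exclusive upper bound

WITH
  L0   AS (SELECT 1 AS c UNION ALL SELECT 1),                -- 2 rows
  L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),   -- 4 rows
  L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),   -- 16 rows
  L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),   -- 256 rows
  Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM L3)
SELECT DATEADD(DAY, n - 1, @startDate) AS TallyDate
FROM Nums
WHERE n <= DATEDIFF(DAY, @startDate, @endDate)   -- number of days in the range
ORDER BY TallyDate;
```

The final SELECT can be wrapped in a CTE of its own and joined to the absence data in the same way as the TallyDates CTE earlier in the post.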
WITH
  L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
  L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
  L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
  L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
  L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
  L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B)