Exeter’s SQL South West UG – Scripts and slides

Thanks for the massive turnout last night at SQL South West, you guys rule!

I presented a session called TSQL(L): Tricky SQL, Quandaries and Lessons Learnt.

Thanks again for attending. I had a lot of fun with this session. In the session we talked about a few websites:

Ask SQL Server Central: https://ask.sqlservercentral.com/
Gaps & Islands: https://www.simple-talk.com/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/

See you all next month.

You can download the solution file: Download scripts.

Slides:

Tally/Number tables in SQL Server #TSQL2sday

As part of #TSQL2sday, Mickey Stuewe (b|t) has proposed a question looking at best practices when someone has made some bad design decisions. Generally, when I am facing bad design problems, I am the one who put them there (:P). This tip is not necessarily a “how to fix an issue” but more a tip on how to work with dates.

As a BI developer I am always working with dates and numbers. A user will come and ask for a query relating to what happened last month. What they typically mean is: what happened per day? How much did we sell on each day? How many customers visited our website each day? How many members of my team were absent in October?

When you’re working with HR data you commonly want to see where you have absences over a period of time, perhaps charted per day or per month. So this is what we want to do: for each day in October, we want to see how many people were absent – and here we start to see a common issue. The same issue is easy to neglect when you have sales per day: an edge-case day with no sales at all simply disappears from the results, and you really don’t want to miss that!

Here is some sample data to illustrate the point. The sample data is my team and their absences. Bobby and Sammy have both had a bad month and have been ill on multiple occasions (man-flu); they have a few consecutive periods of illness and the odd single day. What I want to see is 31 rows returned (one for every day in October), showing a zero on any day with no absences. However, our HR system only writes a row when an absence is logged – there is no row for days where nothing happened. Here is the data, followed by the query the end-user asked for:

DECLARE @Example TABLE (ID INT IDENTITY, StaffMember VARCHAR(50), AbsenceDate DATE, AbsenceReason VARCHAR(50))
INSERT INTO @Example ( StaffMember, AbsenceDate, AbsenceReason )
VALUES
('Bobby', '20151001', 'Sickness')
,('Bobby', '20151002', 'Sickness')
,('Bobby', '20151003', 'Sickness')
,('Bobby', '20151009', 'Sickness')
,('Bobby', '20151022', 'Sickness')
,('Bobby', '20151021', 'Sickness')
,('Bobby', '20151023', 'Sickness')
,('Bobby', '20151024', 'Sickness')
,('Sammy', '20151001', 'Sickness')
,('Sammy', '20151003', 'Sickness')
,('Sammy', '20151022', 'Sickness')
,('Sammy', '20151025', 'Sickness')

When you run that query:

SELECT AbsenceDate, COUNT(*) AS 'AbsenceCount'
FROM @Example
WHERE AbsenceDate >= '20151001' AND AbsenceDate < '20151101'
GROUP BY AbsenceDate

[Results: only the nine dates with logged absences appear]

But we are missing most of the days. If this data is to be used for a visualisation in Excel or SSRS then we want to see it per day, yet we only have data points for the dates that were logged, and that is an issue. It isn’t the fault of the developers – why would you write out more rows than you need? Ok, we have an issue, but it doesn’t stop there! Our user now wants to know “what days were there no absences?”. How can we look for data that does not exist?

We obviously need something to help: a tally table. A tally table, commonly known as a numbers table, is a simple table which holds numbers or dates – in a data warehouse you might have a date dimension, which is similar. A tally table has a load of numbers in it: 1 to 100 million or however many you need; it is up to you. For this example we want one row in the table for every date. The script below will build the table and insert as many rows as you want. I have used 365 days from 01-01-2015, which gives a row for every day in 2015. Alter the @j variable to extend this and alter @startDate to move the start date.
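A minimal sketch of such a build script, assuming the table and column names used in the queries below (dbo.TallyTbl, TallyDate, TallyDateWeekend) and an English DATENAME setting for the weekend flag:

DECLARE @startDate DATE = '20150101'; -- move this to change the first date
DECLARE @j INT = 365;                 -- number of days to generate
DECLARE @i INT = 0;

IF OBJECT_ID('dbo.TallyTbl', 'U') IS NOT NULL DROP TABLE dbo.TallyTbl;
CREATE TABLE dbo.TallyTbl
(
    TallyDate DATE NOT NULL PRIMARY KEY,
    TallyDateWeekend BIT NOT NULL
);

WHILE @i < @j
BEGIN
    INSERT INTO dbo.TallyTbl (TallyDate, TallyDateWeekend)
    SELECT DATEADD(DAY, @i, @startDate),
           -- flag Saturdays and Sundays so working days can be filtered later
           CASE WHEN DATENAME(WEEKDAY, DATEADD(DAY, @i, @startDate)) IN ('Saturday', 'Sunday')
                THEN 1 ELSE 0 END;
    SET @i += 1;
END;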

After running this script you will have a full table, and something we can use to start answering the questions we have been asked. So let’s look at our questions again.

Question number one: create a list of all absences in October (including days where we have no absences).

; WITH TallyDates AS
(SELECT * FROM dbo.TallyTbl WHERE TallyDate >= '20151001' AND TallyDate < '20151101')
, AbsenceData AS
(SELECT AbsenceDate, COUNT(*) AS 'AbsenceCount' FROM @Example GROUP BY AbsenceDate)
SELECT TallyDates.TallyDate, COALESCE(AbsenceData.AbsenceCount, 0) AS 'AbsenceCount'
FROM TallyDates
LEFT JOIN AbsenceData ON TallyDates.TallyDate = AbsenceData.AbsenceDate

[Results: all 31 October dates, with a zero AbsenceCount where no one was absent]

Now we have a full list of all the days including those where we have had no absences. Ideal!

Question number two: in October, what dates had no absences? And to make it a little harder, which of those were working days? Wondering why there was a TallyDateWeekend column? It was for this.

; WITH TallyDates AS
(SELECT TallyDate, TallyDateWeekend FROM dbo.TallyTbl WHERE TallyDate >= '20151001' AND TallyDate < '20151101')
SELECT TallyDates.TallyDate AS 'DateWithNoAbsence' FROM TallyDates
WHERE TallyDates.TallyDate NOT IN (SELECT AbsenceDate FROM @Example)
AND TallyDates.TallyDateWeekend = 0

[Results: the October working days with no logged absences]

So something that seemed really difficult before is now really simple. This isn’t a fix for a bad design, but it is a way to manage dates when you need to look for ones that are missing.

If you do not want to make a tally table, or you cannot alter the database, you can do all of this a different way. Itzik Ben-Gan developed a little script that can generate millions of rows with zero I/O. You can read about it here: http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers

  WITH
  L0   AS(SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
  L2   AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
  L3   AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
  L4   AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
  L5   AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
  Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
  SELECT TOP (1000000) n FROM Nums ORDER BY n;
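If you want dates rather than numbers, DATEADD turns the generated row numbers straight into days. A sketch trimmed down for the absence example (four CTE levels give 256 rows, plenty for one month):

; WITH
L0   AS (SELECT 1 AS c UNION ALL SELECT 1),              -- 2 rows
L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), -- 4 rows
L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), -- 16 rows
L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), -- 256 rows
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM L3)
SELECT DATEADD(DAY, n - 1, '20151001') AS TallyDate
FROM Nums
WHERE n <= 31; -- one row per day in October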

Thanks

Dynamically importing Excel files with SSIS & C#

Hi everyone.

Last week my team had a request to automate a slow task the business was running on a daily basis. Every day they received an Excel file from one of our providers; this had been happening for almost 5 years and they had amassed over 1500 files. The current process involved a lengthy VBA script inside Excel that would open each file, amalgamate the data and find the most recent transaction. Any idea how long that took? 6 hours! Every day, an entire box was dedicated to this task. This needed to change.

The project seemed quite simple; however, there were a few factors which made the task more complicated.

1. The volume of files – 1500+
2. The sheet names were different on each workbook, with no logic in the naming – This was previously a manual task.
3. Different types of files – Excel versions 97-03 (xls) & 2007 (xlsx)
4. The data did not start at row 1 — It started at row 7

You will find this sort of operation, or at least parts of it, in almost any organisation – it is a very common SSIS task. So I wanted to detail my method for tackling something like this in the hope someone else might be able to use it. The sample files are available to download below.

So let’s plan how we can tackle each of the main issues.

The volume of files.

In the downloads you will find a file called master.xlsm. This workbook contains a little script which will generate as many files as you need, all with different sheet names and sample data. This is not a very quick process (it produces roughly 40 files a minute), so I suggest you keep the number of files small and set it going while you’re making a coffee. I created 500 and duplicated them 3 times for the purpose of the demo.

Whenever I need to work with a load of files I start thinking about how I can loop over each of them in turn. There are many ways to do this: SSIS is one option, but PowerShell or even Log Parser would also work. For this example I am using SSIS, where this task is incredibly easy – just drop a Foreach Loop container on the canvas and configure it.

[Screenshot: a Foreach Loop container on the SSIS canvas]

Before we configure the foreach loop to point at the folder we need to loop over, we first need to create a variable to hold its output (the file path). When creating a process like this I want it to be self-auditing: while we are processing files I want to know which files have been processed, when, and whether or not the transfer was a success. So I am going to create a variable to hold the fully qualified path and then write that out to a log table. You can see an example below. I have created a variable called XL_FilePath, scoped at the package level (available to all child tasks), with a type of String and a default value of one of my files (to make a later step easier).

[Screenshot: the XL_FilePath variable, scoped at package level]

Now we have both the foreach loop and the variable, we can double-click the Foreach Loop container to configure it. There are a few options we want to set here.

Enumerator: Foreach File Enumerator
Expressions: Blank (we are not using any expressions)
Folder: This wants to be set to the location of your files. For my example that is C:\HyperBI\XL_SSIS\XL_Files
Files: This relates to the naming structure of your files. I know all my files are Excel files so I will leave this as “*.*”
Retrieve file name: You have 3 options here. Fully qualified – C:\HyperBi\XL_SSIS\XL_Files\1.xls, Name and extension – 1.xls, and Name only – 1
Traverse subfolders: The final option is whether we want to go over subfolders. All of the sample data is in one folder, so I will leave this unchecked.

[Screenshot: Foreach Loop Editor collection settings]

Now we need to associate the variable with the qualified path. When you select Variable Mappings on the left you will be given the option to associate the output of the loop with a variable. Select the variable we created earlier. SSIS will add an index of 0: it uses a zero-based index, 0 being the first column returned. You can see below that if we pipe the variable into a script task to MessageBox its value, we get the fully qualified path. There is a good blog here which details how to set this up.

[Screenshot: MessageBox showing the file path held in XL_FilePath]

The next step is to load this into SQL Server so we can loop over the files and load our data ready for processing/auditing. This is done by creating a table called SSIS_FILE_LOAD_LOG. This table has 4 columns.

CREATE TABLE SSIS_FILE_LOAD_LOG(
SSIS_FILE_ID INT IDENTITY PRIMARY KEY
, SSIS_FILE_PATH VARCHAR(250)
, SSIS_AUDIT_DATE DATETIME DEFAULT GETDATE()
, SSIS_AUDIT_OUTCOME VARCHAR(50)
);

With this we can now add an Execute SQL Task and connect it to our server to load the results. In the event that something goes wrong in the load I don’t want to have to reload all the previous files, hence the outcome column: when we process, we will check that the outcome is either NULL or ‘FAIL’. Each time this runs it loops over the files and inserts rows for new files only, so we don’t log files already processed or pending processing. This is done by checking whether the file path is already in the table before inserting a new row. Adding the following TSQL to our Execute SQL Task ensures we only capture each file path once.

IF NOT EXISTS (SELECT 1 FROM SSIS_FILE_LOAD_LOG WHERE SSIS_FILE_PATH = ?)
INSERT INTO SSIS_FILE_LOAD_LOG (SSIS_FILE_PATH, SSIS_AUDIT_DATE) VALUES (?, NULL)

The IF NOT EXISTS part checks whether the record is in the table; if it is, nothing happens. The two “?” placeholders relate to variables and are configured on the Parameter Mapping screen. You will notice the parameter names are 0 and 1 – the ordinal positions of the variables.

[Screenshot: Parameter Mapping for the logging Execute SQL Task]

After running this I have a table containing 1500 file paths, ready to be processed.
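As a quick sanity check, a sketch of a query against the log using the NULL-or-‘FAIL’ rule described above:

-- Files logged but not yet successfully processed
SELECT COUNT(*) AS FilesPending
FROM SSIS_FILE_LOAD_LOG
WHERE COALESCE(SSIS_AUDIT_OUTCOME, 'FAIL') <> 'SUCCESS';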

The data doesn’t start on row 1

This is a pain but it is an easy fix, as long as you know the sheet name. In the example data I have included, the data starts at row 5 and is in columns A:D. We want to skip the first four rows and process the data without headers.

[Screenshot: sample workbook layout – data in columns A:D, starting part-way down the sheet]

We add a data flow task, followed by an Excel source – you will also need to add an Excel connection pointing at one of the files. By default the connection will read from A1 and not A5. You can alter this by changing the range SSIS looks at. The easiest way to do this is to create a new variable called XL_Reference with a value of ‘ZDJbagZ5$A5:D’. This is the sheet name followed by the start and end of your data (yours will be different, as the sheet names are random). A5 is the start and D is the end; I have omitted a closing row number (i.e. D200) as I don’t know where the data ends. You can see the configuration below:

[Screenshot: Excel source configured to read the XL_Reference range]

This works, but only if we know the name of the sheet ahead of time. As we do not know the sheet name of our next file, we need to add some C#.

The sheet names were different on each workbook, with no logic in the naming

SSIS is great at many things, and one of those is processing lots of Excel files – provided they all have the same structure. When that changes, processing becomes an issue. The names of the sheets were different in every workbook: in some it was the name of the file, in others the date, the word “extract” or a simple “Sheet 1”. For this example every sheet name is randomly generated. In order to process the data we need to know the sheet name before we process each file. How can we do this? With either VB or C# inside a script task.

Things get a little more advanced now; don’t worry, stick with it and all will be OK, I promise (grab the files and play along). If you have not worked with script tasks before they can be a little daunting. They come in 2 flavours, VB and C#. I am not a big fan of VB and I was a C#.NET developer for a short time, so I prefer C#; you could use VB, however the following solution is all C#. Once you have added the script task you will need to add the XL_FilePath variable to the read-only variables and XL_Reference to the read-write variables. Then edit the script and add the following:

To the namespaces (top of the file)

using Excel = Microsoft.Office.Interop.Excel;

To the main function (replace all existing code)

// Grab the path of the current file from the SSIS variable.
var xl_file_full = Dts.Variables["User::XL_FilePath"].Value;

// Open the workbook read-only and capture the name of the first sheet.
Excel.Application xlApp = new Excel.Application();
Excel.Workbook xlWorkBook = xlApp.Workbooks.Open(xl_file_full.ToString(), 0, true, 5, "", "", true, Excel.XlPlatform.xlWindows, "\t", false, false, 0, true, 1, 0);
Excel.Worksheet xlWorkSheet = (Excel.Worksheet)xlWorkBook.Worksheets.get_Item(1);
string sheetName = xlWorkSheet.Name;

// Close the workbook (without saving) and quit Excel.
xlWorkBook.Close(false);
xlApp.Quit();
xlApp = null;

// Force garbage collection so no Excel session is left behind.
GC.Collect();
GC.WaitForPendingFinalizers();

// Pass the sheet name plus the data range back to SSIS.
Dts.Variables["User::XL_Reference"].Value = sheetName + "$A5:D";

Dts.TaskResult = (int)ScriptResults.Success;

Once done you will also need to add some references (right-click References in the Solution Explorer window). You will need to add a reference to the C# language assembly (Microsoft.CSharp) and also the Excel COM interop (Microsoft.Office.Interop.Excel). Your references should look like this once done.

[Screenshot: project references including Microsoft.Office.Interop.Excel]

Let’s examine what is happening block by block.

using Excel = Microsoft.Office.Interop.Excel;

Here you’re adding an alias so that when the script refers to “Excel” it actually means “Microsoft.Office.Interop.Excel”.

Break down:

Define a variable and pass in the SSIS variable XL_FilePath.

var xl_file_full = Dts.Variables["User::XL_FilePath"].Value;

Create a new instance of Excel, open the workbook passed in via the variable, and capture the name of the first sheet.

Excel.Application xlApp = new Excel.Application();
Excel.Workbook xlWorkBook = xlApp.Workbooks.Open(xl_file_full.ToString(), 0, true, 5, "", "", true, Excel.XlPlatform.xlWindows, "\t", false, false, 0, true, 1, 0);
Excel.Worksheet xlWorkSheet = (Excel.Worksheet)xlWorkBook.Worksheets.get_Item(1);
string sheetName = xlWorkSheet.Name;

Close the Excel workbook (without saving) and clear the variables.

xlWorkBook.Close(false);
xlApp.Quit();
xlApp = null;

Free up memory by initiating the garbage collection.

GC.Collect();
GC.WaitForPendingFinalizers();

Set the SSIS variable XL_Reference to the sheet name plus the reference to the data (A5:D).

Dts.Variables["User::XL_Reference"].Value = sheetName + "$A5:D";
Dts.TaskResult = (int)ScriptResults.Success;

Straightforward once you know what it is doing. The clean-up and garbage collection are really important: if you omit them you will be left with 1500 open sessions to Excel and your server will die!

Different types of files – Excel versions 97-03 (xls) & 2007 (xlsx)

SSIS will read an Excel 97/03 file under the 2007 format; it is backwards compatible by default! Awesome, one less thing to worry about. So now we have all the parts working independently, we need to string them together.

First we need to load the results of the initial file loop into an ADO object so we can loop over them. In order to do this we need to do 2 things:

1. Create a new variable called XL_ADO_Object, with a data type of Object.
2. Create another Execute SQL task.  This time we will be selecting from the log table. Use the following TSQL

SELECT SSIS_FILE_ID, SSIS_FILE_PATH FROM SSIS_FILE_LOAD_LOG WHERE COALESCE(SSIS_AUDIT_OUTCOME, 'X') <> 'SUCCESS'

We then need to set the result set to “Full result set”, which allows SSIS to keep hold of the results of the Execute SQL. We want to load these results into our newly created ADO Object variable. You can assign them under the Result Set navigation option: add a new result set with a name of 0 and a variable name of User::XL_ADO_Object. Once this is complete we have an object loaded with the first 2 columns from our table (the ID and the file path).

[Screenshot: Result Set mapping to User::XL_ADO_Object]

We need another Foreach Loop container to iterate over the results in our ADO object. We want to load the output of each row into variables: the one we created earlier will work fine for the file path, and we need another to hold the primary key of our audit table (create this before the next step – I have created a variable called Audit_ID with a type of Int32). We will use this ID to update whether or not the file was processed. To configure the foreach loop, change the collection enumerator to Foreach ADO Enumerator and set the source to the Object variable you created. Configure the variable mappings as Audit_ID with an index of 0 and XL_FilePath with an index of 1 (0 = column 1, 1 = column 2).

Now we can move the script task and the data flow task inside the second foreach loop and connect them – script task first. We want to use the variable as the source of our Excel file, so we need to configure the Excel Connection Manager to use it before continuing: under the properties of the Excel connection, add an expression and set ExcelFilePath to be equal to the XL_FilePath variable.

[Screenshot: ExcelFilePath property expression bound to XL_FilePath]

We now need to expand the data flow task to move the data to SQL Server. Add an OLE DB destination and connect it to your table. The TSQL for creating that table is below.

CREATE TABLE SSIS_XL_TO_SQL (
    [Invoice_ID] INT IDENTITY PRIMARY KEY,
    [Invoice_Date] datetime,
    [Customer_ID] float,
    [Order_Qty] float,
    [Order_Amount] float,
    [Source_FilePath] nvarchar(250),
    [Processed_Date] datetime
)

You will notice there are also columns for the processed date and the source – if there is an issue with the data we can quickly track it back to the file it came from. To populate these we need to add 2 derived columns. Between the source and destination, add a derived column transformation. Add a new column with an expression of GETDATE() – the date and time of the execution – then add another which references the XL_FilePath variable. Connect the 2 new columns and the 4 from the Excel file to our new table.
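Once a few files have run through, a quick sketch of a query to eyeball what landed and where it came from (column names as in the table above):

SELECT TOP (10) Invoice_Date, Customer_ID, Order_Amount, Source_FilePath, Processed_Date
FROM SSIS_XL_TO_SQL
ORDER BY Processed_Date DESC;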

[Screenshot: derived column transformation adding the processed date and source file path]

The final step is to add another Execute SQL task to update the log table with the outcome of the load. Connect another SQL task and add the following TSQL:

UPDATE SSIS_FILE_LOAD_LOG SET SSIS_AUDIT_OUTCOME = 'SUCCESS', SSIS_AUDIT_DATE = GETDATE() WHERE SSIS_FILE_ID = ?
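For the failure path described below, the statement is the same with the outcome flipped – a sketch, assuming the same Audit_ID parameter mapping:

UPDATE SSIS_FILE_LOAD_LOG SET SSIS_AUDIT_OUTCOME = 'FAIL', SSIS_AUDIT_DATE = GETDATE() WHERE SSIS_FILE_ID = ?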

You will need to configure the Parameter Mapping to use the Audit_ID variable. Add this on the success of the data flow task, and create another to act as the update when the process fails, using the ‘FAIL’ variant above. Connect both the script task and the data flow task to the fail update with logical OR failure constraints. And with that we have a working Excel processor that will work for any number of files.

[Screenshot: the completed package with both foreach loops and the audit updates]

Thanks for reading.

Download all files here.

 

PIVOT! – Manual Pivot and the PIVOT function

[Image: “PIVOT!”]

(Whenever I hear PIVOT this is what I think of, PIVOTTTT!!!)

Right, to start this series of posts I wanted to write about a method of manipulating data I refer to as a “manual PIVOT”.

For those who have used the TSQL PIVOT, you will know it enables you to transpose data like you can in Excel (rows become columns, etc.). For those who don’t already use it: have you ever looked at the syntax in TSQL? It is horrible and almost impossible to remember (see below).

SELECT <non-pivoted column>,
    [first pivoted column] AS <column name>,
    [second pivoted column] AS <column name>,
    ...
    [last pivoted column] AS <column name>
FROM
    (<SELECT query that produces the data>)
    AS <alias for the source query>
PIVOT
(
    <aggregation function>(<column being aggregated>)
FOR
[<column that contains the values that will become column headers>]
    IN ( [first pivoted column], [second pivoted column],
    ... [last pivoted column])
) AS <alias for the pivot table>
<optional ORDER BY clause>;

You can however create something similar using conditional logic.

If you’re using a version of SQL Server before SQL2012 then you can use a CASE statement; on SQL2012 or later you can use either a CASE or the new inline IF (IIF). If you are unfamiliar with the CASE statement, the syntax is as follows:

CASE WHEN Expression THEN Result WHEN Expression THEN Result ELSE Result END as x

Each “Expression” is a boolean predicate; the first one that evaluates to TRUE supplies the result, and the ELSE value is returned when none do.
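A minimal sketch contrasting the two forms (the values here are just for illustration):

-- Equivalent results; IIF (SQL2012+) is shorthand for a two-branch CASE.
SELECT CASE WHEN 2 > 1 THEN 'Bigger' ELSE 'Smaller' END AS CaseResult
     , IIF(2 > 1, 'Bigger', 'Smaller') AS IifResult;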

For all of these examples I will use the AdventureWorks2012 database. If you don’t have this you can grab it from CodePlex

use AdventureWorks2012;
GO
 
SELECT CustomerID, OrderDate, TotalDue FROM Sales.SalesOrderHeader ORDER BY CustomerID

[Results: each customer’s orders and totals, ordered by CustomerID]
The query above produces a simple list of what each customer has spent, ordered by CustomerID. What if we want to know how much each customer spent each year? Well, we could write something like this:

SELECT  CustomerID
		, YEAR(OrderDate) as 'OrderYear'
        , SUM(TotalDue) AS 'OrderTotal'
FROM    Sales.SalesOrderHeader
GROUP BY CustomerID, YEAR(OrderDate)
ORDER BY CustomerID

[Results: one row per customer per year]

This is fine, but what if the requirement was to have the years as columns, to see the spend trend? This is where we can use conditional logic to create a manual pivot. Using CASE and YEAR(OrderDate) we can check which year each sale relates to.

SELECT  CustomerID
        , CASE WHEN YEAR(ORDERDate) = 2005 THEN (TotalDue) END AS '2005_OrderTotal'        
		, CASE WHEN YEAR(ORDERDate) = 2006 THEN (TotalDue) END AS '2006_OrderTotal'
        , CASE WHEN YEAR(ORDERDate) = 2007 THEN (TotalDue) END AS '2007_OrderTotal'
        , CASE WHEN YEAR(ORDERDate) = 2008 THEN (TotalDue) END AS '2008_OrderTotal'
        , CASE WHEN YEAR(ORDERDate) = 2009 THEN (TotalDue) END AS '2009_OrderTotal'
FROM    Sales.SalesOrderHeader
ORDER BY CustomerID

[Results: one row per order, with the total in the matching year column and NULL elsewhere]

The problem with this is that we have multiple rows for each customer: where a customer has spend in 2005, all the other years show NULL. As we are looking for the total amount spent in each year, we can simply SUM these totals and GROUP BY the CustomerID.

SELECT  CustomerID
        , SUM(CASE WHEN YEAR(ORDERDate) = 2005 THEN (TotalDue) END) AS '2005_OrderTotal'        
		, SUM(CASE WHEN YEAR(ORDERDate) = 2006 THEN (TotalDue) END) AS '2006_OrderTotal'
        , SUM(CASE WHEN YEAR(ORDERDate) = 2007 THEN (TotalDue) END) AS '2007_OrderTotal'
        , SUM(CASE WHEN YEAR(ORDERDate) = 2008 THEN (TotalDue) END) AS '2008_OrderTotal'
        , SUM(CASE WHEN YEAR(ORDERDate) = 2009 THEN (TotalDue) END) AS '2009_OrderTotal'
FROM    Sales.SalesOrderHeader
GROUP BY CustomerID
ORDER BY CustomerID

[Results: one row per customer, with yearly totals as columns]

This is now displaying one row for each customer with the amounts spent between 2005 and 2009 as columns.
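For comparison with that horrible syntax from earlier, here is a sketch of the same result written with the built-in PIVOT operator:

SELECT  CustomerID
        , [2005] AS '2005_OrderTotal'
        , [2006] AS '2006_OrderTotal'
        , [2007] AS '2007_OrderTotal'
        , [2008] AS '2008_OrderTotal'
        , [2009] AS '2009_OrderTotal'
FROM    (SELECT CustomerID, YEAR(OrderDate) AS OrderYear, TotalDue
         FROM Sales.SalesOrderHeader) AS src
PIVOT   (SUM(TotalDue) FOR OrderYear IN ([2005], [2006], [2007], [2008], [2009])) AS pvt
ORDER BY CustomerID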

This method works for all sorts of use cases and isn’t restricted to mathematical calculations. You can use it, for example, to show the top 3 products sold to each customer as columns.

; WITH RankedProducts AS 
(
SELECT  CustomerID
		, p.Name
		, SUM(OrderQty) OrderQty
		, ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY SUM(OrderQty) DESC) OrderRank
FROM    Sales.SalesOrderDetail
		INNER JOIN Sales.SalesOrderHeader ON Sales.SalesOrderHeader.SalesOrderID = Sales.SalesOrderDetail.SalesOrderID
        INNER JOIN Production.Product p ON p.ProductID = Sales.SalesOrderDetail.ProductID
GROUP BY 
	CustomerID
		, p.Name
)
SELECT CustomerID
	, MAX(CASE WHEN OrderRank = 1 THEN Name ELSE NULL END) AS 'ProductSales1'
	, MAX(CASE WHEN OrderRank = 2 THEN Name ELSE NULL END) AS 'ProductSales2'
	, MAX(CASE WHEN OrderRank = 3 THEN Name ELSE NULL END) AS 'ProductSales3'
FROM RankedProducts
GROUP BY 
	CustomerID

[Results: one row per customer with their top 3 products as columns]

This works by applying a MAX to the Name based on the OrderRank. OrderRank is calculated using a window function (for more on window functions, see an upcoming post in this series). MAX works because aggregates ignore NULLs: within each rank there is only one non-NULL name per customer, so MAX returns that name rather than the NULL.
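A tiny sketch of that behaviour with made-up values:

-- MAX skips NULLs, so the single non-NULL name in the group survives.
SELECT MAX(ProductName) AS ProductSales1
FROM (VALUES ('Mountain-200'), (NULL), (NULL)) AS t(ProductName);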

You can use this to work out all sorts of scenarios.

Thanks for reading.