Last week my team had a request to automate a slow task the business was running on a daily basis. Every day for almost 5 years they had received an Excel file from one of our providers, and they had amassed over 1,500 files. The existing process was a lengthy VBA script inside Excel that would open each file, amalgamate the data and find the most recent transaction. Any idea how long that took? 6 hours! Every day, an entire box was dedicated to this task. This needed to change.
The project seemed quite simple, however there were a few factors which made the simple task more complicated:
1. The volume of files – 1500+
2. The sheet names were different on each workbook, with no logic in the naming – This was previously a manual task.
3. Different types of files – Excel versions 97-03 (xls) & 2007 (xlsx)
4. The data did not start at row 1 — It started at row 7
You will find this sort of operation in almost any organisation, or at least parts of it – it is a very common SSIS task. So I wanted to detail my method for tackling something like this in the hope someone else might be able to use it. The sample files are available to download below.
So let’s plan how we can tackle each of the main issues we have.
The volume of files.
In the downloads you will find a file called master.xlsm. This workbook contains a little script which will generate as many files as you need, all with different sheet names and sample data. This is not a very quick process, so I suggest you keep the number of files small and set it going when you’re making a coffee – it will produce ~40 files a minute. I created 500 and duplicated them 3 times for the purpose of the demo.
Whenever I need to work with a load of files I start by thinking about how I can loop over each of them in turn. There are many ways to do this: SSIS is one option, but PowerShell or even Log Parser would work too. For this example I am using SSIS, where the task is incredibly easy – just drop a Foreach Loop container on the canvas and configure it.
Before we configure the foreach loop to point at the folder we need to loop over, we first need to create a variable to hold its output (the file path). When creating a process like this I want it to be self-auditing. By this I mean that as we process files, I want to know which files have been processed, when, and whether or not the transfer was a success. So what I am going to do is first create the variable to hold the fully qualified path and then write that out to a log table. You can see an example of this below. I have created a variable called XL_FilePath; it is scoped at the package level (available to all child tasks), with a type of String and a default value of one of my files (to make a later step easier).
Now that we have both the Foreach Loop container and the variable, we can double-click the container to configure it. There are a few options we want to set here.
Enumerator: Foreach File Enumerator
Expressions: Blank (we are not using any expressions)
Folder: This wants to be set to the location of your files. For my example that is – C:\HyperBI\XL_SSIS\XL_Files
Files: this relates to the naming structure of your files. I know all my files are Excel, so I will leave this as “*.*”
Retrieve file name: You have 3 options here. Fully qualified – C:\HyperBi\XL_SSIS\XL_Files\1.xls; Name and extension – 1.xls; and Name only – 1.
Traverse subfolders: The final option is whether we want to go over any subfolders. All of the sample data is in one folder, so I will leave this unchecked.
Now we need to associate the variable with the qualified path. On the left, when you select Variable Mappings you will be given the option to associate the output of the loop with a variable. Select the variable we created earlier. SSIS will add an index of 0 – SSIS uses a 0-based array for its index, with 0 being the first column returned. You can see below that if we pipe the variable into a script task and MessageBox the result of the variable, we get the following. There is a good blog here which details how to set this up.
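If you want to try the MessageBox check yourself, a minimal script task Main (a sketch, assuming XL_FilePath has been added to the task’s read-only variables) would look something like this:

```csharp
// Minimal sketch of a debugging script task body.
// Assumes User::XL_FilePath is listed in the task's ReadOnlyVariables.
// MessageBox comes from System.Windows.Forms, which the SSIS script
// task template references by default.
public void Main()
{
    MessageBox.Show(Dts.Variables["User::XL_FilePath"].Value.ToString());
    Dts.TaskResult = (int)ScriptResults.Success;
}
```

Remember to remove this before running the package for real – one message box per file gets old very quickly with 1,500 files.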
The next step is to load this into SQL Server so we can loop over the files ready for processing/auditing. This is done by creating a table called SSIS_FILE_LOAD_LOG. This table has 4 columns.
CREATE TABLE SSIS_FILE_LOAD_LOG
(
    SSIS_FILE_ID INT IDENTITY PRIMARY KEY,
    SSIS_FILE_PATH VARCHAR(250),
    SSIS_AUDIT_DATE DATETIME DEFAULT GETDATE(),
    SSIS_AUDIT_OUTCOME VARCHAR(50)
);
With this we can now add an Execute SQL Task and connect it to our server/table to load the results. In the event that something goes wrong in the load I don’t want to have to reload all the previously processed files – hence the outcome column; when we process, we will check that the outcome is either NULL or ‘FAIL’. Each time this is executed we loop over the files and load in rows for new files only, so we don’t want to insert files that are already processed or pending processing. This is done by checking whether the file has been captured before inserting the new row. By adding the following TSQL to our Execute SQL Task we can ensure we only capture each file path once.
IF NOT EXISTS (SELECT 1 FROM SSIS_FILE_LOAD_LOG WHERE SSIS_FILE_PATH = ?)
    INSERT INTO SSIS_FILE_LOAD_LOG (SSIS_FILE_PATH, SSIS_AUDIT_DATE)
    VALUES (?, NULL);
The IF NOT EXISTS part checks whether the record is already in the table; if it is, nothing will happen. The 2 “?” placeholders relate to variables and are configured on the Parameter Mapping screen. You will notice the parameter names are 0 and 1 – the ordinal positions of the variables.
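To sanity-check the pattern outside SSIS, you can substitute a literal path for the placeholders (the path below is just an example) – running the batch twice only ever inserts one row:

```sql
-- Hypothetical file path for testing; substitute one of your own.
DECLARE @path VARCHAR(250) = 'C:\HyperBI\XL_SSIS\XL_Files\1.xls';

IF NOT EXISTS (SELECT 1 FROM SSIS_FILE_LOAD_LOG WHERE SSIS_FILE_PATH = @path)
    INSERT INTO SSIS_FILE_LOAD_LOG (SSIS_FILE_PATH, SSIS_AUDIT_DATE)
    VALUES (@path, NULL);
```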
After running this I have a table containing 1500 file paths, ready to be processed.
The data doesn’t start on row 1
This is a pain, but it is an easy fix as long as you know the sheet name. In the sample data I have included, the data starts at row 5 and is in columns A:D. We want to skip the first 4 rows and process the data without headers.
If we add a Data Flow Task, followed by an Excel source – you will also need to add an Excel connection to one of the files – by default the connection will read from A1 and not A5. You can alter this by changing the range SSIS looks at. The easiest way to do this is to create a new variable called XL_Reference with the value ‘ZDJbagZ5$A5:D’. This is the sheet name followed by the start and end of your data (yours will be different, as the files are random). A5 is the start and D is the end; I have omitted a row number (e.g. D200) as I don’t know where the data ends. You can see the configuration of this below:
This works, however only if we know the name of the sheet ahead of time. As we do not know the sheet name of our next file we need to add some C#.
The sheet names were different on each workbook, with no logic in the naming
SSIS is great at many things, and one of those is processing lots of Excel files. The problem is they have to have the same structure; when this changes, processing becomes an issue. The names of the sheets were different in every workbook – in some it would be the name of the file, in others the date, the word extract or a simple Sheet 1. For this example every sheet name is randomly generated. In order to process the data we will need to know what the sheet name is before we process each file. How can we do this? With either VB or C# inside a script task.
Things get a little more advanced now – don’t worry, stick with it and all will be OK, I promise (grab the files and play along). If you have not worked with script tasks before they can be a little daunting. They come in 2 flavours, VB and C#. I am not a big fan of VB, and I was a C#.NET developer for a short time, so I prefer C#. You could use VB, however the following solution is all C#. Once you have added the script task you will need to add the XL_FilePath variable to the read-only variables and XL_Reference to the read-write variables. Then edit the script and add the following:
To the namespaces (top of the file)
using Excel = Microsoft.Office.Interop.Excel;
To the main function (replace all existing code)
var xl_file_full = Dts.Variables["User::XL_FilePath"].Value;

Excel.Application xlApp = new Excel.Application();
Excel.Workbook xlWorkBook = xlApp.Workbooks.Open(xl_file_full.ToString(),
    0, true, 5, "", "", true, Excel.XlPlatform.xlWindows, "\t",
    false, false, 0, true, 1, 0);
Excel.Worksheet xlWorkSheet = (Excel.Worksheet)xlWorkBook.Worksheets.get_Item(1);

xlApp.Quit();
xlApp = null;

GC.Collect();
GC.WaitForPendingFinalizers();

Dts.Variables["User::XL_Reference"].Value = xlWorkSheet.Name + "$A5:D";
Dts.TaskResult = (int)ScriptResults.Success;
Once done you will also need to add some references (right-click References in the Solution Explorer window). You will need to add a reference to the Microsoft.CSharp assembly and also the Excel COM Interop (Microsoft.Office.Interop.Excel). Your references should look like this once done.
Let’s examine what is happening block by block.
using Excel = Microsoft.Office.Interop.Excel;
Here you’re adding a reference to let the script task know that when you refer to “Excel” you actually mean “Microsoft.Office.Interop.Excel”.
Define a variable and pass in the SSIS variable XL_FilePath.
var xl_file_full = Dts.Variables["User::XL_FilePath"].Value;
This creates a new instance of the Excel application, opens the workbook passed in via the variable and grabs the first sheet.
Excel.Application xlApp = new Excel.Application();
Excel.Workbook xlWorkBook = xlApp.Workbooks.Open(xl_file_full.ToString(),
    0, true, 5, "", "", true, Excel.XlPlatform.xlWindows, "\t",
    false, false, 0, true, 1, 0);
Excel.Worksheet xlWorkSheet = (Excel.Worksheet)xlWorkBook.Worksheets.get_Item(1);
Quit Excel (which closes the workbook) and clear the application variable.
xlApp.Quit();
xlApp = null;
Free up memory by initiating the garbage collection.
Set the SSIS variable XL_Reference to the sheet name plus the reference to the data (A5:D).
Dts.Variables["User::XL_Reference"].Value = xlWorkSheet.Name + "$A5:D";
Dts.TaskResult = (int)ScriptResults.Success;
Straightforward once you know what it is doing. The garbage collection is really important: if you omit it you will be left with 1,500 open sessions to Excel and your server will die!
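If you want to be extra defensive about cleanup, a common interop pattern (a sketch, not the only way – it assumes `using System.Runtime.InteropServices;` has been added to the namespaces) is to capture the sheet name first, close the workbook explicitly and release each COM wrapper before forcing the collection:

```csharp
// Sketch: grab the sheet name first, then tear everything down.
string sheetName = xlWorkSheet.Name;

xlWorkBook.Close(false);   // close without saving changes
xlApp.Quit();

// Release the COM wrappers in reverse order of creation.
Marshal.ReleaseComObject(xlWorkSheet);
Marshal.ReleaseComObject(xlWorkBook);
Marshal.ReleaseComObject(xlApp);

GC.Collect();
GC.WaitForPendingFinalizers();

Dts.Variables["User::XL_Reference"].Value = sheetName + "$A5:D";
```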
Different types of files – Excel versions 97-03 (xls) & 2007 (xlsx)
SSIS will read an Excel 97-03 file under the 2007 format – by default it is backwards compatible! Awesome, one less thing to worry about. Now that we have all the parts working independently, we need to string them together.
First we need to load the results of the initial file loop into an ADO object so we can loop over them. In order to do this we need to do 2 things:
1. Create a new variable called XL_ADO_Object, with a data type of Object.
2. Create another Execute SQL Task. This time we will be selecting from the log table, using the following TSQL:
SELECT SSIS_FILE_ID, SSIS_FILE_PATH
FROM SSIS_FILE_LOAD_LOG
WHERE COALESCE(SSIS_AUDIT_OUTCOME, 'X') <> 'SUCCESS';
We then need to set the result set to Full result set, which allows SSIS to keep hold of the results of the Execute SQL Task. We want to load these results into our newly created ADO object variable. You can assign them on the Result Set navigation option: add a new result set with a name of 0 and a variable name of User::XL_ADO_Object. Once this is complete we have an object loaded with the 2 columns from our table (the ID and the file path).
We need another Foreach Loop container to iterate over the results in our ADO object. We will want to load the output of each row into variables: the one we created earlier for the file path will work fine, and we need another to hold the primary key of our audit table – I have created a variable called Audit_ID with a type of Int32 (you will need to create this before the next step). We will use this ID to update whether or not the file was processed. To configure the foreach loop, change the collection enumerator to Foreach ADO Enumerator and set the source to the object variable you created. Configure the variable mappings to Audit_ID with an index of 0 and XL_FilePath with an index of 1 (0 = column 1, 1 = column 2).
Now we can move the script task and the data flow task inside the second foreach loop and connect them – script task first. We will want to use the variable as the source of our Excel file, so we need to configure the Excel Connection Manager to use it before continuing: under the properties of the Excel connection, add an expression and set ExcelFilePath to be equal to the XL_FilePath variable.
We now need to expand the data flow task to move the data to SQL Server. Add an OLE DB destination and connect it to your table. The TSQL for creating that table is below.
CREATE TABLE SSIS_XL_TO_SQL
(
    [Invoice_ID] INT IDENTITY PRIMARY KEY,
    [Invoice_Date] DATETIME,
    [Customer_ID] FLOAT,
    [Order_Qty] FLOAT,
    [Order_Amount] FLOAT,
    [Source_FilePath] NVARCHAR(250),
    [Processed_Date] DATETIME
);
You will notice there are also columns for the processed date and the source – if there is an issue with the data we can quickly track it back to the file it came from. To populate these we will need to add 2 derived columns. Between the source and destination add a Derived Column transformation. Add a new column with the expression GETDATE() – to record the date and time of the execution – then add another which references the XL_FilePath variable. Connect the 2 new columns and the 4 from the Excel file to our new table.
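For reference, the two derived column expressions in SSIS expression syntax are along these lines (the cast on the variable is an assumption on my part, to match the nvarchar(250) column):

```
Processed_Date    : GETDATE()
Source_FilePath   : (DT_WSTR, 250)@[User::XL_FilePath]
```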
The final step we have is to add another Execute SQL task to update the log file with the outcome of the movement. Connect another SQL task and add the following TSQL:
UPDATE SSIS_FILE_LOAD_LOG
SET SSIS_AUDIT_OUTCOME = 'SUCCESS',
    SSIS_AUDIT_DATE = GETDATE()
WHERE SSIS_FILE_ID = ?;
You will need to configure the parameter mapping to use the Audit_ID variable. Add this on the success of the data flow task, and create another to act as the update when the process fails – change ‘SUCCESS’ to ‘FAIL’. Connect both the script task and the data flow task to the fail update using logical OR failure constraints. And with that we have a working Excel processor that will handle any number of files.
Thanks for reading.
Download all files here.