Book Review: Machine Learning Logistics: Model Management in the Real World

I do not typically review books, but the following book had a profound impact on how I see DevOps and Data Science. It is a fantastic book and one I recommend to all. A few years ago I picked up Nathan Marz’s Big Data book, in which Marz introduces the concept of the Lambda Architecture. I have since used a variation of Marz’s design in most analytical projects. Lambda changed the way I approached big data. Rendezvous has had the same effect for Machine Learning.

You can download “Machine Learning Logistics” from MapR’s website: https://mapr.com/ebook/machine-learning-logistics/ The book is about 90 pages. There is also an accompanying three-part video series, delivered by Ted Dunning and Ellen Friedman, which is available on YouTube.

“Machine Learning Logistics: Model Management in the Real World”. There is a lot packed into that short title – I will explain more as we proceed. For now, I want to call attention to “Model Management”.

 

Model Management

The process of managing a machine learning model once it is in production is known as Model Management. Vartak et al. describe model management as “the problem of tracking, storing and indexing large numbers of machine learning models so they may subsequently be shared, queried and analysed” (Vartak et al, 2016). This definition covers some of the key points; however, it falls short in a few key areas.

Sculley notes that there are many reasons why machine learning in production is problematic (Sculley, 2015), which makes Model Management difficult.

  1. Time constraints to build – The amount of time required to get a model from idea into production is significant.
  2. Time constraints once deployed – A model needs to respond within a set number of milliseconds. This will vary from model to model, but a good benchmark is 100ms. If a model sits in front of a customer on your website and it takes longer than 100ms to respond, the user will notice the delay.
  3. Data changes very quickly – A model may fit the business at the time it is trained, but the business may then change significantly and the model no longer reflects reality. This is common and happens in most data-intensive applications. It is known as Model Decay. I will write more about this in the coming weeks.
  4. Production machine learning needs to integrate with other applications – Once a model is published there may be any number of ways it needs to interact with data from the business.
  5. Machine learning needs to be robust – It needs to continue to work in production and require little maintenance.
  6. Live machine learning does not always work as expected – A model may work well locally, then fail to deliver the expected uplift once published.
  7. If one data science team develops in R and the application it is being deployed to is written in Java, how do you get these two teams to work together? This is an extension of the development vs operations problem.

Valentine and Merchan note that a Model Management/DataOps process should support any data science language (Valentine & Merchan, 2017); however, they neglect to discuss the range of languages involved. Data Science covers a broad spectrum of languages, tools and environments.

A data scientist can train models in R, Python, Scala, Java, C++, Lua, H2O or many other languages. Dunning and Friedman note that when they asked “what is the best tool for machine learning”, there was no one answer (Dunning & Friedman, 2017). In their study, the smallest number of tools used was 5 and the largest was 12 (Dunning & Friedman, 2017). How many languages do you work with? I would love to know – leave a comment.

Based on this, a model management tool needs to support a minimum of 5 languages. That is 5 languages which need to be maintained, versioned and updated. If there is a fundamental change to a language, as we have seen with Python 2/3, the physical process of supporting these disparate languages becomes very difficult. This is a problem that is relatively unique to machine learning. With software development there are lots of languages, but rarely is a single application built in Java, C#, JavaScript, Ruby and C++.

Dunning & Friedman note that this is due to the development practices in machine learning: “The tool that is optimal in one situation might not be the best in another” (Dunning & Friedman, 2017). Supporting multiple languages is therefore important, and their interactions should be decoupled to the point that the user is not aware of the language used. As a Data Scientist, I do not want to be told which language I need to use. If I am skilled in Python, don’t tell me I need to use R. A model management system should be language agnostic. The reason I point this out is that there are many tools on the market which offer a model management service, however these are locked to a single language – I will publish a meta-analysis of these later in the series.

Machine Learning development can take place locally or in the cloud across many providers (Azure, Google, AWS, as well as private cloud). Depending on the amount of data required to train the model, a data scientist may need a distributed architecture such as a Data Lake on the Hadoop file system; Spark may also be an option where in-memory processing is required. If deep learning is part of the project then a separate set of tools and hardware is required (GPUs and the software to support them). There are many different skills and languages that can be used for Data Science, so it is difficult to imagine one tool which enables Model Management across all of them.

What should a Model Management tool do?

Based on my research, a model management application should be able to complete the following tasks.

  • Languages – Support development in multiple machine learning languages.
  • Publication – Allow the publication of a model.
  • REST API – Expose the model via an API. A consumer should not be able to tell which language the model was built in.
  • Load balanced – Handle load balancing of the model, including the ability to scale up and down as required.
  • Retraining – Allow the model to be retrained automatically and monitor for model decay.
  • Model variants/multi-model deployments – Support different variants of the same model (decision tree, logistic regression, support vector machine).
  • Automated testing – Perform automated testing.
  • Telemetry – Expose reporting for monitoring of models in production.
  • Source control – Are models sourced and versioned? Which source control application is supported?

Now we know what a Model Management tool should do, let’s look at the Rendezvous architecture and how it tries to solve these problems.

Machine Learning Logistics

My previous research was focused on tools which aim to address machine learning model management. Each of these implements part of the picture, but not the whole. I will publish this research in a separate blog post. When looking at the available tools (Azure Model Management, H2O.ai Steam or ModelDB), it is apparent that there is not currently a single tool which will allow an end-to-end model management system that meets the conditions set out above. As an alternative, Dunning & Friedman’s book, Machine Learning Logistics, looks at a conceptual approach to what a model management system should do. With this option, although tools are suggested, this is not a piece of software you can download; you will need to build it yourself – I have just completed this for a customer and I will be blogging (and hopefully presenting) about it soon.

Dunning & Friedman show an understanding of production machine learning far beyond other similar books. They begin by discussing the “myth of the unitary model”. A lot of machine learning books give the impression that once a model is deployed the role of a machine learning developer is done. This is far from reality. A unitary model implies that you only need to build a model once and it will remain accurate forever. This misconception could be disastrous. The thinking also falls down because it oversimplifies Machine Learning. One model might work for a subset of predictions, however Dunning and Friedman note that many models in production tackling the same business objective (for example, customer churn) is more realistic.

Dunning & Friedman note that a model management system should meet the following conditions:

  • Have the ability to save raw data
  • Support multi-tenancy
  • Expose models everywhere
  • Monitor, compare and evaluate models
  • Deploy models into production
  • Stage models in production without impacting production
  • Seamless replacement of a model in production
  • Automated fall back when a model is not performing.

Dunning and Friedman add some very valid points which support and go beyond the research by Valentine and Merchan. “Ability to save raw data” is not something that had come up in my ongoing research, but it is incredibly important: a model is only as good as the data it is based on, and if a model is being redeployed or reworked and the data is different, this will produce a different model. “Monitor, compare and evaluate models” – Dunning and Friedman note that the unitary model is a false assumption, so we need a way to compare variations of the same model and aggregate the responses. I talk about this problem a lot when applying DevOps to Machine Learning. Having better data will make a better model – sometimes with no code changes at all – and this is a problem unique to machine learning.

Unlike other books and journals, Dunning & Friedman are less concerned with the process of training a model and more interested in what happens once that model is in production. They have expanded my initial list of requirements based on their experience of production machine learning at MapR.

A solution which aims to meet all the conditions listed above is what Dunning and Friedman have coined the “Rendezvous approach”. Dunning & Friedman note that this implementation is a style of DataOps (Dunning & Friedman, 2017, pp12). This is one of the few examples of the term DataOps being used in conjunction with machine learning model management. The Rendezvous approach is designed for any scale of machine learning, although most importantly it supports enterprise machine learning.

Conceptually the design is a disparate series of individual pipelines and models, each doing their own part independently. The Rendezvous approach is proposed to be implemented using microservices. Puppet describe high-performing DevOps teams as ones who implement loose coupling in their development – essentially microservices (Puppet, 2017). Dunning & Friedman agree, stating that “Independence between microservices is key” (Dunning & Friedman, 2017 pp14). Independence suggests very loose coupling between microservices, which is often achieved using message queues. In the Rendezvous approach the message queue is proposed to be implemented using Apache Kafka – as both a messaging system and a postbox. Streaming messages is fundamental to the Rendezvous design.

Dunning & Friedman advocate the need for a persistent streaming service which connects into a global data fabric (Dunning & Friedman, 2017 pp17). They note that a data fabric goes beyond a data lake: it is a global store of all data in an organisation, globally distributed and capable of storing transactional data in tables as well as persisting streaming data. The Rendezvous approach is designed to work in concert with a global data fabric (Dunning and Friedman, 2017, pp 25).

When deploying a model from one environment to another, there will most likely be differences between the environments, and Dunning & Friedman note that this could lead to unexpected results (Dunning & Friedman, 2017 pp19). One solution to this is to use containers. Containers are a lightweight structure for holding the configuration of a machine; they are similar to virtual machines, however aspects such as the operating system are not required (Dunning & Friedman, 2017 pp19). Containers need to be stateless but may hold stateful applications; one way to achieve this is to persist storage externally to the container service. Azure Model Management has implemented this as part of their solution. The benefit you get with containers is repeatability and scalability through an orchestrator such as Kubernetes. I will post a blog about Docker, Docker Compose, Kubernetes and Helm in the coming weeks.

(Rendezvous Architecture)

The Rendezvous approach implements a series of different model variations. As part of the standard architecture there should always be two additional models running alongside the machine learning model itself: a decoy and a canary (Dunning and Friedman, 2017 pp 21). You can see this in the image above.

The decoy model is responsible for simulating a machine learning model, but rather than running a model it captures the parameters which were passed to it. This allows for monitoring. With a complete list of all requests and a copy of the data used to train a model, we can begin to monitor model accuracy drift/model decay. If a model was trained on 100,000 customers and the incoming requests show 50,000 new customers, that is a good indication that the model might no longer reflect reality and has decayed. The canary model runs alongside the main machine learning model and provides a baseline for comparison. The decoy and canary enable a view of how a machine learning model is being used and why it performs as it does.
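
A decoy can be as simple as a service that receives the same requests as the real models but only archives them; the archived inputs can then be compared with the training data to flag possible decay. The sketch below is a minimal Python illustration of that idea, not the book’s implementation – the archive file, the customer_id feature and the 50% threshold are all assumptions.

import json
from datetime import datetime, timezone

class DecoyModel:
    """Looks like a model to the rest of the pipeline, but only archives requests."""

    def __init__(self, archive_path="decoy_requests.jsonl"):
        self.archive_path = archive_path

    def score(self, request):
        # Record exactly what the real models were asked, and when.
        with open(self.archive_path, "a") as f:
            f.write(json.dumps({
                "received_at": datetime.now(timezone.utc).isoformat(),
                "features": request,
            }) + "\n")
        return None  # a decoy never answers

def decay_warning(training_customer_ids, archive_path="decoy_requests.jsonl", threshold=0.5):
    # Hypothetical heuristic: warn when too many requests come from customers
    # the model never saw during training.
    seen = unseen = 0
    with open(archive_path) as f:
        for line in f:
            customer_id = json.loads(line)["features"].get("customer_id")
            if customer_id in training_customer_ids:
                seen += 1
            else:
                unseen += 1
    total = seen + unseen
    return total > 0 and (unseen / total) > threshold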

A simplistic approach to exposing a machine learning model is to wrap a REST API around it. To evaluate the model, you send a GET/POST request with a series of parameters, and the model replies with the answer. Dunning & Friedman note that this simplicity is both a blessing and a hindrance. With this approach, we are unable to pass the incoming parameters off to more than one version of the same model. We could use a load balancer to send the requests to different machines. A load balancer works by rerouting incoming traffic equally over many machines, or in this case models. While a load balancer will point the input to a different model, this does not allow for comparisons. A load balancer would work in a similar way to an A/B test: based on the load on a server, a request would go to the primary server or the secondary. With this approach, there is no ability to run different models and aggregate the results.
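
To make the “wrap a REST API around the model” approach concrete, here is a minimal sketch in Python using Flask and a pickled model; the file name, feature names and port are assumptions, and a real service would add validation, logging and error handling. Its simplicity also shows the limitation described above: every request is answered by exactly one model.

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical model artefact produced by the training pipeline.
with open("fraud_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()
    # Assumed feature order; a real service would validate the payload.
    features = [[payload["amount"], payload["merchant_risk"], payload["hour_of_day"]]]
    prediction = model.predict(features)[0]
    return jsonify({"fraud": bool(prediction)})

if __name__ == "__main__":
    app.run(port=5000)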

Dunning & Friedman note that an alternative to a load balancer is to use a data stream. Data is streamed into an input, and from there that input is distributed to multiple outputs. The distribution of the request to the models is the real strength of the Rendezvous architecture. As an example, suppose we are trying to classify whether a transaction is fraud (binary classification – true or false). We could use one of many different classification algorithms, or we could use all the ones that work for our prediction and select the best option. We might use a simple decision tree which returns a response in <10ms, and we might also have a logistic regression which takes a bit longer. If our architecture pushes the request to multiple models, then our code always needs to know which models are live.

This is problematic. We really do not want to deploy a code change with each new model. Dunning and Friedman propose an alternative design using postboxes. When a request for classification is sent, rather than going directly to the model, or to a service which sends it to multiple models, it goes to a postbox. Each of the models subscribes to that postbox and they all run in parallel. They then pass their results and telemetry data back to a service which decides what to do with the responses.
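
As a hedged sketch of the postbox idea, the snippet below uses Apache Kafka via the kafka-python client: each model variant consumes the same request topic and publishes its answer, tagged with its name and the request id, to a shared scores topic. The topic names, broker address and scoring function are hypothetical.

import json
from kafka import KafkaConsumer, KafkaProducer

MODEL_NAME = "logistic_regression_v3"   # hypothetical variant name

consumer = KafkaConsumer(
    "fraud.requests",                    # the shared postbox
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

def score(features):
    # Stand-in for the real model call.
    return {"fraud": features.get("amount", 0) > 1000, "confidence": 0.72}

for message in consumer:
    req = message.value
    result = score(req["features"])
    # Every model posts its answer to the same results topic; the rendezvous
    # server collects them by request_id.
    producer.send("fraud.scores", {
        "request_id": req["request_id"],
        "model": MODEL_NAME,
        "result": result,
    })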

The Rendezvous approach implements a rendezvous server, which is responsible for comparing scores and passing the acceptable value back to the requestor. The rendezvous server is responsible for picking which model has the desired result: it pulls in the values returned by the models and either picks the best response or aggregates the results. If two of the three models indicate fraud, then a response of fraud would be a reasonable answer. Weights can be added to the models to steer the selection based on the machine learning developer’s knowledge. The concept of the rendezvous server is a powerful design for model management.
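
The selection logic itself can be small. The sketch below shows one hypothetical way the responses collected for a single request could be combined: a weighted vote across whichever models answered before the deadline, with the weights supplied by the developer. This is an illustration of the idea rather than the book’s implementation.

def rendezvous_decision(responses, weights, default=False):
    """Combine the model responses collected for one request.

    responses: list of dicts like {"model": "decision_tree", "fraud": True}
               (only the models that replied before the deadline).
    weights:   developer-supplied weight per model name.
    """
    if not responses:
        return default            # automated fallback when nothing answered
    fraud_score = sum(weights.get(r["model"], 1.0) for r in responses if r["fraud"])
    total_score = sum(weights.get(r["model"], 1.0) for r in responses)
    return fraud_score >= total_score / 2

# Example: two of the three models say fraud, so the answer is fraud.
answers = [
    {"model": "decision_tree", "fraud": True},
    {"model": "logistic_regression", "fraud": True},
    {"model": "svm", "fraud": False},
]
print(rendezvous_decision(answers, weights={"decision_tree": 1.0,
                                            "logistic_regression": 1.0,
                                            "svm": 1.0}))  # True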

Monitoring machine learning models in production is key to understanding whether performance is degrading over time. Although important, this is seldom covered in the literature; Dunning & Friedman, however, note that it is critical to monitor what a model is doing in production (Dunning & Friedman, 2017 pp 49). Monitoring the availability of a model is not enough to understand how it is operating. To monitor effectively, a model management system needs to capture operational metrics: what input was offered, what answer was given and what the level of accuracy was. Dunning & Friedman recommend either passing these metrics with the REST API response or persisting them to a side data store. A model management system should aggregate metrics over time; while this is possible by looking at the message queue, the data sits much better in its own data store. In my implementation I used Cosmos DB as the store for telemetry.
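
As an illustration, the sketch below records one telemetry document per scored request and then aggregates average latency per model per day; the field names are assumptions, and in my implementation the records went to Cosmos DB rather than a local file.

import json
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

def record_telemetry(path, model, request_id, inputs, answer, latency_ms):
    """Append one telemetry record per scored request (hypothetical schema)."""
    with open(path, "a") as f:
        f.write(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model": model,
            "request_id": request_id,
            "inputs": inputs,
            "answer": answer,
            "latency_ms": latency_ms,
        }) + "\n")

def daily_latency(path):
    """Aggregate average latency per model per day from the telemetry store."""
    buckets = defaultdict(list)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            day = rec["timestamp"][:10]
            buckets[(rec["model"], day)].append(rec["latency_ms"])
    return {key: mean(values) for key, values in buckets.items()}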

If a model is replaced, we need to analyse the potential business impact (Dunning & Friedman, 2017 pp 50). Dunning & Friedman recommend a staged transition from one model to another: a sample of requests is sent to the new model alongside the existing one until the developer is confident that the replacement is satisfactory. This approach does need to be evaluated on a case-by-case basis. If a production model is designed to shape a user’s experience of a website and each time they refresh they hit a different model which recommends a different experience, this will be confusing to the user. The logic for handling the selection would be supported by the rendezvous server. This is known as a policy.
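
One simple policy that avoids the “different model on every refresh” problem is to make the routing sticky: hash the user id so that the same user always lands on the same model while the new model’s share of traffic is gradually increased. A minimal sketch, with the share and model names as assumptions:

import hashlib

def choose_model(user_id, new_model_share=0.10,
                 current_model="churn_model_v1", new_model="churn_model_v2"):
    """Sticky staged rollout: the same user always sees the same model.

    new_model_share is the fraction of users routed to the candidate model;
    increase it as confidence in the new model grows.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket 0-99 per user
    return new_model if bucket < new_model_share * 100 else current_model

# The same user id always maps to the same model for a given share.
print(choose_model("customer-42"))
print(choose_model("customer-42"))          # identical result on refresh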

Dunning raises an interesting idea which would further extend Rendezvous so that it does not need a hand-written policy. Dunning suggests that a multi-armed bandit/reinforcement learning algorithm could be responsible for deciding which model’s answer to return to the requestor. This is quite advanced and relies on having a feedback loop, but if you can build one, the result is incredibly powerful.
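
A hedged sketch of what that could look like using a simple epsilon-greedy bandit: most of the time the selector returns the answer from the model with the best observed reward, occasionally it explores another model, and the feedback loop updates the estimates. This is an illustration of the idea, not the book’s design.

import random
from collections import defaultdict

class EpsilonGreedySelector:
    """Pick which model's answer to return, learning from feedback."""

    def __init__(self, model_names, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = defaultdict(int)
        self.rewards = defaultdict(float)
        self.models = list(model_names)

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.models)         # explore
        return max(self.models, key=lambda m: self.rewards[m] /
                   max(self.counts[m], 1))            # exploit best average reward

    def feedback(self, model, reward):
        # Called when the business outcome of a prediction becomes known.
        self.counts[model] += 1
        self.rewards[model] += reward

selector = EpsilonGreedySelector(["decision_tree", "logistic_regression", "svm"])
chosen = selector.choose()
selector.feedback(chosen, reward=1.0)   # e.g. the prediction turned out correct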

Making comparisons of models, however, is not as easy as in traditional software development. In traditional software development there are standard key metrics to monitor and, once decided, we rarely need to monitor additional ones. In a web application, this might be the time a webpage takes to respond or the time taken for the database to respond (Ligus, 2013, pp 21). Machine learning is not this simple: a classification model will return a different set of metrics to a regression model, making a comparison of the two quite difficult. We can, however, work around this.

Dunning and Friedman’s Rendezvous architecture proposes a solution to model management beyond that currently implemented by tools such as ModelDB, Steam or Azure’s Model Management. Rendezvous takes elements of DevOps to achieve this, however it does not detail the full process of how to implement it. There are many different approaches one could take. The following blogs take elements from Dunning and Friedman’s architecture and relate them back to a DevOps approach.

Beyond the book

There appears to be a shift in the machine learning industry towards a recognition that model management is a problem which needs to be solved. The literature does not prepare data scientists for the problems associated with models in production, and the tools that are on the market cover only some of the key areas required to manage models.

In a recent discussion which took place on Twitter, Caitlin Hudson asked “What’s your team’s approach to tracking the quality of models in production?”. This started a lively debate on the various implementations that people have started building. The conversation was joined by the authors of Machine Learning Logistics, Ted Dunning and Ellen Friedman, as well as many other industry experts.

The outcome was that data scientists are noticing a change towards a structured approach to model management. I imagine that over the next two years there will be an increase in model management tools and books. As the popularity of DataOps increases, it is easy to see many data science teams moving towards a mixed-skill team of developers and model managers. In a recent episode of the O’Reilly Data Show podcast, Ben Lorica talked about the rise of the term “Machine Learning Engineer” – someone who is responsible for getting models deployed. I think this can be achieved with DevOps (although someone is required to set it up). I also think there will be a rise in another new role, that of the “Model Manager”: someone who keeps models running, monitors for decay and retrains when required. This might be the same role, it might not.

Schutt & O’Neil note that the productionisation of machine learning is hard, but by using the right tools and the right architecture this problem can become trivial. Through source control, continuous integration, continuous deployment, infrastructure as code and monitoring, the process of deploying a model to production can be fully automated.

Thanks for reading. I hope that you do go and read the book. It is a great read!

An Introduction to DevOps

Welcome to part 2 in a series on applying DevOps to Data Science. You can read part 1, Applying DevOps to Data Science, here. In this blog I want to start defining what DevOps is and to understand why it can help a Data Scientist deploy models faster.

What is DevOps?
The divide between those who develop and those who deploy has been a struggle in traditional software development for a long time. Software developers would typically work in isolation, building their applications. Once an application was built, it would be handed over to the operations department to be deployed/migrated to production. This process could take a long time.

The delay means the development team has to wait longer to deploy changes, code becomes stagnant and the developers are reliant on operations’ availability. When the Agile methodology became popular, a shift in the deployment process began to emerge: developers were working with operations as part of the same team, sharing responsibilities. DevOps emerged.
 
The term DevOps is a portmanteau of dev and ops: dev relating to developers and ops relating to operational staff. In traditional software development there has typically been a separation between the developers who are building an application and operations, who are tasked with deploying and monitoring the application once it is live. The two roles are very different and have different skills, techniques and interests. Developers are interested in whether a feature has been built well; operations are interested in whether the application and infrastructure are performing as required (Kim et al, 2016 pp xxii).
 
Prior to DevOps, the roles of development and operations were distinct and had little crossover. Unfortunately, this meant that the lead time to deployment was long and prone to errors. At the start of 2010, industry experts began talking about applying the principles of Lean production to software development.

Lean practices were responsible for halving the time taken to manufacture vehicles, with 95% of orders being shipped on time (Kim et al, 2016 pp xxii). The term DevOps was coined to encapsulate a series of processes and culture changes aimed at reducing the amount of time required for an application to go into production and for changes to propagate to production.


[DevOps – http://geeksrival.com/devops-now-make/]

Jez Humble defined DevOps as “a cross-disciplinary community of practice dedicated to the study of building, evolving and operating rapidly-changing resilient systems at scale” (Mueller, E, 2010). Mueller goes on to add his own definition: “DevOps is the practice of operations and development engineers participating together in the entire service lifecycle, from design through the development process to production support” (Mueller, E, 2010). In the largest survey of DevOps practitioners, Puppet propose that “DevOps is an understood set of practices and cultural values that has been proven to help organizations of all sizes improve their software release cycles, software quality, security, and ability to get rapid feedback on product development” (Puppet, 2017).

There are many differing definitions of the term DevOps, however they all have a common theme around culture/community, tools, software lifecycle, process and support, with the goal of improving software release cycles.

Culture is one aspect of DevOps. For DevOps to be implemented fully, a series of concepts and associated tools needs to be adopted. DevOps encapsulates the following areas:
· Culture
· Source Control
· Continuous integration
· Continuous deployment
· Infrastructure as code
· Configuration as code
· Automation
· Operational Monitoring

These can be ordered by importance, as seen in Figure 1.


 
Why is DevOps important?
A fundamental part of DevOps is automation. In his article “Knightmare: A DevOps Cautionary Tale”, Doug Seven documents how an American stock trading firm lost over $400,000,000 in 45 minutes due to a failed manual deployment on one of their eight servers (Seven, 2014). Seven notes that this could and should have been avoided through good DevOps principles and automated deployments. DevOps could have saved this business 400 million dollars. Take a moment for that to sink in.

This problem is not limited to software development; it could equally have been a rogue machine learning model. If a model is checking for fraud and starts incorrectly flagging all transactions as fraud, that could have a huge impact on the business in question. But we will come back to this discussion at greater length later in the series.
 
One of the foundations of DevOps is to strive towards shorter release cycles. Shorter release cycles mean code changes are promoted to production faster. This directly correlates with higher throughput and higher quality, more stable code (Puppet, 2017). DevOps is on the rise: in a survey conducted in 2016, 16% of respondents worked on a DevOps team; in 2017 this number increased to 27% (Puppet, 2017). This indicates that DevOps is becoming more popular and organisations are beginning to see the value it offers. It may also indicate a shift in culture to one of collaboration, a goal DevOps attempts to achieve.
 
IT performance can be measured by the throughput of code and the stability of systems (Puppet, 2017 pp20). Throughput is measured by how frequently code is deployed and how fast a team can move from committing code to deploying it. Stability is defined as how quickly the “system can recover from downtime and how many changes succeed” (Puppet, 2017). High performing teams saw 46 times more frequent deployments, 440 times shorter lead time for changes and 96 times faster mean time to recovery, with failure rates 5 times lower than companies who did not implement DevOps (Puppet, 2017 pp21). That is staggering!

So why don’t all development houses apply DevOps? Because it is difficult and requires complete buy-in from all levels of the organisation. The numbers listed in the Puppet survey are an ideal and are not guaranteed. However, if a company sees even a fraction of these benefits, DevOps will provide a return on investment.

Faster deployments, fewer errors, faster changes, these are all important when looking at production machine learning. Traditional software development and machine learning engineering are not that different.

In the next blog, we will explore the key parts of DevOps in more detail. See you in 2 weeks.
 

Applying DevOps to Data Science.

Hi everyone.

Some of you might know that I have spent the last two years studying for a Master’s degree in Data Science at the University of Dundee. This was a two-year part-time course delivered by Andy Cobley and Mark Whitehorn. The course was fantastic and I recommend it – if you want to know more about it, please give me a shout. The course is comprised of multiple modules; the final module is a research project, which you need to start thinking about towards the end of the first year of study. I selected my topic very early on, however, being indecisive, I changed my idea three times (each time having written a good chunk of the project).

Why did I do this? I simply was not passionate about the subject of those projects. They were good ideas, but I just was not researching or building anything new. The outcome of my dissertation might have been a working project, however it would have felt hollow to me. I needed a topic I was passionate about. I have a core ethos that I take to every project I work on: “Never do anything more than once”. It is because of that that I have spent much of my career working either with or developing automation tools to accelerate and simplify my development processes. Having attended a lot of conferences, I became familiar with DevOps and how it accelerated the software industry. DevOps allows software developers to ship code faster. I have been applying the core principles of DevOps to all my recent projects, with great success.

The course covered a lot of the techniques required for data science, however it did not cover how to deploy a model into production. I started researching deployment techniques. I read a lot of books which described the development process, each stopping at deployment; none of the books detailed how you actually deploy a model. I quickly discovered why. It is hard! I had my dissertation topic: “Applying DevOps to production machine learning”.

Having spent 6 months writing about the topic, I wanted to share that with you in blog format. If you want to read my dissertation in full let me know. As there is a lot of content to get through, we will take it one step at a time.

The culmination of this blog series is a design pattern which allows a developer to commit their machine learning models into Git and have a deployed model in production. To do this we will highlight various tools, languages and techniques along the way. All code examples are hosted in my GitHub account. I am trying to talk more and more about this subject with customers and at conferences. If you would like to discuss this in more detail, please get in touch.

Research Blog series contents.
1. Setting the scene (this page)
2. An introduction to DevOps
3. DevOps in detail
4. Data Science process. And what is wrong with most of the current literature?
5. How DevOps is currently being implemented in the ML industry.
6. [Book Review] Machine Learning Logistics (Ted Dunning and Ellen Friedman)
7. How to apply DevOps to Data Science.
8. A design pattern for a one-click deployment of a data science model
9. Research outcomes and what is next
10. References

Technical blog series contents
1. An Introduction to Visual studio Team Services (VSTS)
2. An Introduction to Git
3. An introduction to Docker and Azure Container Registry
4. An introduction to Kubernetes

My aim is to publish a new section every 2 weeks.

All code will be copied in to the relevant blog, however you can obtain a copy on my GitHub.

Thanks for reading.

The Road to MVP

It is with great pleasure that I can share with you all that I am now a Microsoft MVP (Most Valuable Professional). The MVP programme (https://mvp.microsoft.com/) is designed to recognise professionals who give their time to support technical communities and Microsoft. This could be through blogging, speaking at conferences, providing feedback, helping others and contributing code to projects. There are many award categories, ranging from PowerPoint to Xbox, each with its own contribution areas. I was awarded MVP for the Data Platform, which covers SQL Server, Data Lake, HDInsight and many of the topics I am particularly interested in. I am exceptionally pleased to be amongst the 19 Data Platform MVPs in the UK and 1 of the 418 Data Platform MVPs worldwide.

In a recent blog, Kevin Kline discussed what it takes to be an MVP; in his opinion it all boils down to being passionate about what you do and sharing that passion with others (https://blogs.sentryone.com/kevinkline/how-can-i-become-a-microsoft-mvp/). I could not agree more! I talk at events because I want to help people. I still get so much out of attending sessions and I want to make sure that I give back everything I took from the community. Kevin’s blog gives a fantastic overview of what it takes to become an MVP and is well worth a read.

When I first started attending user groups and conferences I began to become aware of the term MVP. At that point there were only a handful of MVPs in the UK (at that point you were a SQL Server MVP). Their sessions at conferences were always held in high regard, and rightly so – these men and women were at the top of their game. I looked up to these professionals and would always learn something from reading their blogs and books and seeing them talk. When I started speaking at user groups and SQL Saturdays I always wanted to become an MVP, but it was never my reason for presenting; it is now a very happy by-product of giving back to a technical community.

MVP is awarded based on your contributions in the last 12 months, so what have I done in the last 12 months?

It has been a real blast over the last 12 months, I could not have spoken at so many events without the support of Adatis – so thanks. I want to also thank everyone who took the time to nominate me and for the help and support of Microsoft.

(Some of the MVPs who nominated me – At a recent MVP event)

What will I be doing over the next 12 months? A lot of the same, although I would like to get a few more blogs written this year. If you want to see me talk, I will be at SQLBits in London in February, delivering a pre-conference training day and a general session:

A Data Engineer’s Guide to Azure SQL Data Warehouse

Enhancing relational models with graph in SQL Server 2017

Beyond that, keep an eye out at your local user group or SQL Saturday. See you soon.

Calculating distance with SSIS and the Google distance matrix API

Recently I set up an interactive map in R & Shiny to show the distance between where a customer lives and one of the sites the customer attends. What I wanted to do was find out how long it took each person to get to the site using public transport, which proved difficult in R, so I settled for an “as the crow flies” approach using Pythagoras. This worked, but since then I have had a burning question in the back of my mind: “what if I could use the Google API to give me that distance?”. This is what I have now done, and this is how you do it.

Google have many APIs that can be used at little to no cost. The Google distance matrix API allows you to enter a starting position and one or more end positions, and the API will return the distance based on a mode of transport of your choice (walking, driving, transit – public transport). All you need to do is register and get an API key – https://developers.google.com/maps/documentation/distance-matrix/

Once you have an API key you will need to keep hold of it. This key allows you to call the API: no key, no result. If you read the documentation here you can see a few examples of how to use the key. If you change the URL below to include your API key where YOUR_API_KEY currently sits, you will retrieve the JSON for that journey. By default the GDM is set to driving distance, but this can be changed. The result is currently JSON; if you would prefer the result as XML, change json? to xml?.

https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=40.6655101,-73.89188969999998&destinations=40.6905615%2C-73.9976592%7C40.6905615%2C-73.9976592%7C40.6905615%2C-73.9976592%7C40.6905615%2C-73.9976592%7C40.6905615%2C-73.9976592%7C40.6905615%2C-73.9976592%7C40.659569%2C-73.933783%7C40.729029%2C-73.851524%7C40.6860072%2C-73.6334271%7C40.598566%2C-73.7527626%7C40.659569%2C-73.933783%7C40.729029%2C-73.851524%7C40.6860072%2C-73.6334271%7C40.598566%2C-73.7527626&key=YOUR_API_KEY

The most basic use is to calculate the distance from A to B. This is what we will look at using SSIS. The API takes an origin and a destination. These can either be a lat & lng combo or an address, which will be translated into lat & lng when executed. To demonstrate the process I have created a database and loaded it with some sample customer and store data. We have 2 customers and 4 stores, and the customers are linked to different stores. You can download the example files at the bottom of the page.

IF DB_ID('WebService_Example') IS NULL
	CREATE DATABASE WebService_Example;
GO
 
-- ###################################################################################
-- Create tables
-- ###################################################################################
USE WebService_Example;
GO
 
DROP TABLE dbo.CustomerAddresses;
CREATE TABLE dbo.CustomerAddresses 
(
CustomerAddressID INT IDENTITY PRIMARY KEY
, CustomerName Varchar(150)
, CustomerAddress VARCHAR(500)
);
GO
 
DROP TABLE dbo.StoreAddresses;
CREATE TABLE dbo.StoreAddresses
(
StoreAddressID INT IDENTITY PRIMARY KEY
, StoreName VARCHAR(150)
, StoreType VARCHAR(20)
, StoreAddress VARCHAR(150)
); 
GO
 
DROP TABLE dbo.CustomerStoreAddress;
CREATE TABLE CustomerStoreAddress
(
CustomerAddressID INT
, StoreAddressID INT
)
 
DROP TABLE dbo.CustomerAddressDistance;
CREATE TABLE dbo.CustomerAddressDistance
(
CustomerName Varchar(150)
, CustomerAddress VARCHAR(500)
, StoreName VARCHAR(150)
, StoreAddress VARCHAR(150)
, Distance VARCHAR(50)
, EstimatedTime VARCHAR(50)
, ModeOfTransport VARCHAR(50)
)
 
 
-- ###################################################################################
-- Load tables
-- ###################################################################################
 
INSERT INTO dbo.CustomerAddresses
        ( CustomerName
        , CustomerAddress )
VALUES
('A Customer', '46 Church Road, Alphington, Exeter, EX2 8SZ')
,('B Customer', 'Flat 9, Landsdown Terrace, Exeter, EX2 4JJ')
 
INSERT INTO dbo.StoreAddresses
        ( StoreName
        , StoreType
        , StoreAddress )
VALUES
('Tesco Extra', 'Vale', 'Exeter Vale Shopping Centre, Russell Way, Exeter EX2 7EZ')
, ('Tesco Express', 'Heavitree', '85 Fore St, Heavitree, Exeter, Devon, EX1 2RN')
, ('Sainsburys Superstore', 'Alphington', 'Alphington Rd, Exeter EX2 8NH')
, ('Sainsburys Local', 'Sidwell', '12 Sidwell St, Exeter EX4 6NN');
 
INSERT INTO dbo.CustomerStoreAddress
        ( CustomerAddressID
        , StoreAddressID )
VALUES
(1,1)
, (1,2)
, (1,3)
, (1,4)
, (2,3)
, (2,4)
 
-- ###################################################################################
-- Query
-- ###################################################################################
 
SELECT 
	CA.CustomerName
	, CA.CustomerAddress AS 'Org'
	, SA.StoreName
	, SA.StoreAddress AS 'Des'
FROM dbo.CustomerAddresses CA
INNER JOIN dbo.CustomerStoreAddress CSA ON CA.CustomerAddressID = CSA.CustomerAddressID
INNER JOIN dbo.StoreAddresses SA ON CSA.StoreAddressID = SA.StoreAddressID

This will build the tables and populate the tables that are used in the SSIS project.

The SSIS project consists of a data flow task (DFT), which contains an OLEDB source, a derived column transformation (to add the columns which are to be loaded), a script component (C#) and an OLEDB destination to load the results. There are 2 variables: the mode of transport (walking, driving, transit) and your API key. Below is the C# required to call the API and shred the XML.

 
#region Namespaces
using System;
using System.Data;
using System.Web;
using System.Windows.Forms;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using Microsoft.SqlServer.Dts.Runtime;
using System.Net;
using System.Xml;
// Make sure you add any missing references
#endregion
 
 
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
 
    public override void PreExecute()
    {
        base.PreExecute();
    }
 
 
    public override void PostExecute()
    {
        base.PostExecute();
    }
 
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
 
        // Create a new web client 
        WebClient wc = new WebClient();
 
        // Select your mode of transport - Edit the var in SSIS
        String TransportMode = Variables.ModeOfTransport;
 
        // Add reference to your API
        String APIKey = Variables.DistanceAPIKey; 
 
        // Build the URL that is to be requested
        String uri = "https://maps.googleapis.com/maps/api/distancematrix/xml?units=imperial&origins=" + Row.Org + "&destinations=" + Row.Des + "&mode=" + TransportMode + "&key=" + APIKey;
 
        // Get the XML
        string geoCodeInfo = wc.DownloadString(uri);
        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(geoCodeInfo);
 
        // Shred the XML
        String Duration = Convert.ToString(xmlDoc.SelectSingleNode("DistanceMatrixResponse/row/element/duration/text").InnerText);
        String Distance = Convert.ToString(xmlDoc.SelectSingleNode("DistanceMatrixResponse/row/element/distance/text").InnerText);
        //String Fare = Convert.ToString(xmlDoc.SelectSingleNode("DistanceMatrixResponse/row/element/fare/text").InnerText);
 
        // Update the outgoing rows. 
        Row.Time = Duration;
        Row.Distance = Distance;
 
    }
 
}

The script component looks at the rows coming in and passes them into a URL which is requested by a web client. The results are then written to the Time and Distance columns of the outgoing rows.

Once the data has been loaded we can analyse the results and see which store is closest by each mode of transport.

-- ###################################################################################
-- Check results
-- ###################################################################################
 
-- transit
-- driving
-- walking
SELECT
    CustomerName
  , CustomerAddress
  , StoreName
  , StoreAddress
  , Distance
  , EstimatedTime
  , ModeOfTransport
  , CASE WHEN EstimatedTime LIKE '%hour%' then CONVERT(TIME,REPLACE(REPLACE(EstimatedTime,' hour ', ':'), ' mins', '')) ELSE '00:' + REPLACE(EstimatedTime, ' Mins', '') END TimeInMinutes
  , ROW_NUMBER() OVER (PARTITION BY CustomerName ORDER BY CASE WHEN EstimatedTime LIKE '%hour%' then CONVERT(TIME,REPLACE(REPLACE(EstimatedTime,' hour ', ':'), ' mins', '')) ELSE '00:' + REPLACE(EstimatedTime, ' Mins', '') END) AS 'Rank'
FROM
    CustomerAddressDistance;


Download scripts and solution: GoogleAPI SSIS.

Sun Modelling for Data Warehouses. Visio template


Sun modelling is a technique for requirements gathering for multidimensional data warehouse modelling. I have used or read about many different DW modelling techniques, however none is quite as simple and elegant as Sun Modelling. Sun Modelling was designed by Prof Mark Whitehorn (University of Dundee). I won’t go into the details of how you execute a sun model today, however I will write about this in the future.

For the moment I have tried to simplify the process of building a sun model in Visio. You can download a Visio stencil which contains the main elements required to make a Sun Model. Also included is a sample Visio document.

Download it here.


Exeter’s SQL South West UG – Scripts and slides


Thanks for the massive turnout last night at SQL South West, you guys rule!

I presented a session called TSQL(L): Tricky SQL, Quandaries and Lessons Learnt.

Thanks again for attending. I had a lot of fun with this session. In the session we talked about a few websites:

Ask SQL Server Central: https://ask.sqlservercentral.com/
Gaps & Islands: https://www.simple-talk.com/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/

See you all next month.

You can download the solution file Download scripts.

Slides:

Tally/Number tables in SQL Server #TSQL2sday


As part of #TSQL2sday, Mickey Stuewe (b|t) has proposed a question looking at best practices when someone has made some bad design decisions – generally when I am facing bad design problems, I am the one who put them there (:P). This tip is not necessarily a “how to fix an issue” but more a tip on how to work with dates.

As a BI developer I am always working with dates and numbers. A user will come and ask for a query relating to what happened last month – what they typically mean is what happened per day. How much did we sell on each day? How many customers visited our website each day? How many members of my team were absent in October?

When you’re working with HR data you commonly want to see where you have absences over a period of time. This might be a chart per day or per month. So this is what we want to do: we want to see, for each day in October, how many people were absent – and here we start to see a common issue. This issue is easy to neglect where you commonly have sales per day; when you have no sales on an edge-case day it might get missed, and you really don’t want to miss that!

Here is some sample data to illustrate the point. The sample data is my team and their absences. Bobby and Sammy have both had a bad month and have been ill on multiple occasions (man flu); they have a few consecutive periods of illness and the odd single day. What I want to see is 31 rows (one for every day in October) returned, showing zeros where there were no absences. However, our HR system only stores a row when an absence is logged, not a row for days where nothing was logged. First, here is the sample data that the end user’s query will run against:

DECLARE @Example TABLE (ID INT IDENTITY, StaffMember VARCHAR(50), AbsenceDate DATE, AbsenceReason VARCHAR(50))
INSERT INTO @Example ( StaffMember, AbsenceDate, AbsenceReason )
VALUES
('Bobby', '20151001', 'Sickness')
,('Bobby', '20151002', 'Sickness')
,('Bobby', '20151003', 'Sickness')
,('Bobby', '20151009', 'Sickness')
,('Bobby', '20151022', 'Sickness')
,('Bobby', '20151021', 'Sickness')
,('Bobby', '20151023', 'Sickness')
,('Bobby', '20151024', 'Sickness')
,('Sammy', '20151001', 'Sickness')
,('Sammy', '20151003', 'Sickness')
,('Sammy', '20151022', 'Sickness')
,('Sammy', '20151025', 'Sickness')

When you run the following:

SELECT AbsenceDate, COUNT(*) AS 'AbsenceCount' FROM @Example WHERE AbsenceDate >= '20151001' AND AbsenceDate < '20151101' GROUP BY AbsenceDate


But we are missing most of the days. If this data is to be used for a visualisation in Excel or SSRS then we want to see a value per day, however we only have rows for the days where data has been logged, and that is an issue. It isn’t the fault of the developers – why would you write out more than you need? It doesn’t make sense. OK, we have an issue, but it doesn’t stop there: our user now wants to know “what days were there no absences?”. How can we look for data that does not exist?

We obviously need something to help: a tally table. A tally table, commonly known as a numbers table, is a simple table which holds numbers or dates. In a data warehouse you might have a date dimension – this is similar. A tally table has a load of numbers in it: 1 to 100 million or however many you need, it is up to you. For this example we want one row in the table for every date. The script below will build the table and insert as many rows as you want. For this example I have used 365 days from 01-01-2015, which gives a row for every day in 2015. Alter the @j variable to extend this and alter the @startDate variable to move the start date.

After running this script you will have a full table and something that we can use to start answering the questions we have been asked. So let’s look at our questions again.

Question number one. Create a list of all absences in October (including where we have no absences)

; WITH TallyDates AS
(SELECT * FROM dbo.TallyTbl WHERE TallyDate >= '20151001' AND TallyDate < '20151101')
, AbsenceData AS 
(SELECT AbsenceDate, COUNT(*) AS 'AbsenceCount' FROM @Example GROUP BY AbsenceDate)
SELECT TallyDates.TallyDate, COALESCE(AbsenceData.AbsenceCount,0) AS 'AbsenceCount' FROM TallyDates
LEFT JOIN AbsenceData ON TallyDates.TallyDate = AbsenceData.AbsenceDate


Now we have a full list of all the days including those where we have had no absences. Ideal!

Question number two: in October, what dates had no absences? And to make it a little harder, which of those were working days? Wondering why there was a TallyDateWeekend column? It was for this.

; WITH TallyDates AS
(SELECT TallyDate, TallyDateWeekend FROM dbo.TallyTbl WHERE TallyDate >= '20151001' AND TallyDate < '20151101')
SELECT TallyDates.TallyDate AS 'DateWithNoAbsence' FROM TallyDates WHERE TallyDates.TallyDate NOT IN (SELECT AbsenceDate FROM @Example)
AND TallyDates.TallyDateWeekend = 0


So something that seemed really difficult before is now really simple. This is not a fix for a bad design decision, but a way to manage dates when you need to look for missing ones.

If you do not want to make a tally table, or you cannot alter the database, then you can do this all in a different way. Itzik Ben-Gan developed a little script that can be used to generate millions of rows with zero I/O. You can read about it here: http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers

  WITH
  L0   AS(SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
  L2   AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
  L3   AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
  L4   AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
  L5   AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
  Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT TOP (365) n FROM Nums ORDER BY n; -- change TOP to the number of rows you need

Thanks

SQL Server string manipulation functions


Hi everyone,

As part of an ongoing training session I am doing for some of my colleagues on the basics of TSQL, I wanted to store this content somewhere. Where better than on my blog.

So today we will start by looking at some basic string manipulation functions, how they work and why you might want to use them. Then we will look at how we can use them together to perform a common name-separation technique. All of these examples use a table which is constructed from the Microsoft AdventureWorks database (this version is 2012). To build this table you will need to download a version of AdventureWorks and attach the database (alternatively you could use the Redgate Azure AdventureWorks – you can find that here).

Before we continue, please run the following TSQL to build the table. There are many intentional errors in the data – I have tried to simulate normal user activity and errors.

USE AdventureWorks2012;
GO
 
BEGIN TRY	
DROP TABLE StringMinipulationExample;
END TRY
BEGIN CATCH
	PRINT 'Table didnt exist'
END CATCH
GO
 
CREATE TABLE StringMinipulationExample
(
FullName Varchar(250)
, Address VARCHAR(1000)
); 
 
INSERT INTO StringMinipulationExample
SELECT 
	' ' + COALESCE(Title, '') + ' ' + FirstName + ' ' + LastName + '   ' AS 'FullName'
	, COALESCE(AddressLine1, '') + ' ' + COALESCE(AddressLine2, '') + ' ' + COALESCE(City, '') + ' ' + COALESCE(PostalCode, '') AS 'Address'
FROM Person.BusinessEntityAddress 
INNER JOIN Person.Person ON	Person.BusinessEntityID = BusinessEntityAddress.BusinessEntityID
INNER JOIN Person.Address ON Address.AddressID = BusinessEntityAddress.AddressID
 
SELECT FullName, Address FROM dbo.StringMinipulationExample

CONCATENATION

When you’re working with strings you will no doubt want to join one string with another; this is known as concatenation. The process is a simple one: take one string and combine it with another.

SELECT 'Hello' + ' ' + 'world!'

We are using a + to concatenate. You will see the ' ' in the middle; this adds a space between the two words. Sounds simple, and it is, however it can quickly become confusing when you start to concatenate with NULLs or different data types. NULLs in SQL Server can cause some grief, but they have a purpose. When you concatenate two strings you get a new string – Hello World! However, when you concatenate a string and a NULL you get NULL. To avoid this we need to handle those NULLs and replace them with something else. Remember '' <> NULL: an empty string will still concatenate because it is not NULL. To handle those NULLs we can use conditional logic (CASE WHEN ELSE END) or we can use ISNULL() or COALESCE(). Concatenation also needs matching data types. For example, when you write 1+1 do you want 2 or 11? 1 + 1 = 2, '1' + '1' = '11'. INT + INT = INT, CHAR + CHAR = CHARCHAR, INT + CHAR = ERROR. To avoid this error we need to first convert the INT to a CHAR using either CAST or CONVERT.

ISNULL & COALESCE

ISNULL ( check_expression , replacement_value )

COALESCE ( expression [ ,...n ] )

Both of these functions are NULL replacement functions. There are differences in how the engine processes them, and ISNULL is TSQL only rather than ANSI standard.

SELECT ISNULL(NULL, '')
SELECT COALESCE(NULL, '')

CAST & CONVERT

CAST ( expression AS data_type [ ( length ) ] )

CONVERT ( data_type [ ( length ) ] , expression [ , style ] )

Both of these functions help to change the data type in your query. A lot comes down to personal preference.

SELECT CAST( 1 AS CHAR(1))
SELECT CONVERT(CHAR(1), 1)

Both will return a char. You cannot convert all datatypes from one type to another. Refer to this image for reference.

LEFT & RIGHT

LEFT ( character_expression , integer_expression )

RIGHT ( character_expression , integer_expression )

When you want to take the left-most (or right-most) characters from a string. character_expression is your string and integer_expression is how many characters you want. This includes blank spaces.
Example:

SELECT FullName, LEFT(FullName, 5) AS 'Left5_FullName' FROM dbo.StringMinipulationExample
SELECT FullName, RIGHT(FullName, 5) AS 'Left5_FullName' FROM dbo.StringMinipulationExample

You can see that the left or right 5 characters have been selected. It might look odd, as there are fewer than 5 characters visible on most rows; this is because there are leading and trailing spaces – keep reading to learn how to handle those.

SUBSTRING

SUBSTRING ( expression ,start , length )

When you want to take x characters from the middle of a string. In the example below I have used the programmer’s favourite, Hello World!. Hello World! has 12 characters including the space and the exclamation mark. If we want to take just the word “World” we need to position our start on the W and take 5 characters. The grey squares in the table below indicate which characters we are selecting. For those of you who are familiar with Excel, this function is similar to the MID function.
Example:

SELECT 'Hello World!', SUBSTRING('Hello World', 7, 5)

 

Position:  1 2 3 4 5 6 7 8 9 10 11 12
Character: H e l l o _ W o r l  d  !  (the _ at position 6 is the space)

 

UPPER & LOWER

UPPER ( character_expression )

LOWER ( character_expression )

When you want to convert a string to either upper case or lower case. These functions are similar to UCase and LCase in Excel and VB.
Example

SELECT FullName, UPPER(FullName) AS 'Upper_FullName', LOWER(FullName) AS 'Lower_FullName' FROM dbo.StringMinipulationExample

CHARINDEX

CHARINDEX ( expressionToFind ,expressionToSearch [ , start_location ] )

When you want to know the position of a character in a string. This is a really helpful function: if you think most names are FirstName SPACE LastName, then to split the first and last name you can locate the space and take the LEFT and the SUBSTRING. The way this function is structured is a little backwards. We specify what we are trying to find, where we are trying to find it, and how many characters should be skipped before we start searching. The last part is there in case there are multiple occurrences; if you know the first 2 characters are always spaces you might want to add ,3 to start from the 3rd character.

SELECT FullName, CHARINDEX(' ', FullName, 3) FROM dbo.StringMinipulationExample

I have included a ,3 at the end of the function as I know I have 2 spaces at the start and I want to skip these. We will use a function to remove them in a moment. This function is similar to FIND() in Excel.

REPLACE

REPLACE ( string_expression , string_pattern , string_replacement )

When you want to replace a character or set of characters with a different character or set of characters. Useful when you need to find a character and replace it.

SELECT FullName, REPLACE(FullName, ' ', '_') FROM dbo.StringMinipulationExample

In this example I am replacing all the spaces with underscores.

REVERSE

REVERSE ( string_expression )

Does what it says on the tin! Reverses a string.
You might be asking, why would I want to do that? Because while it is easy to find the first occurrence of a character in a string, it is not as easy to find the last. How do you make the last the first? REVERSE it.
Example:

REVERSE('Hello World')

REVERSE('Hello World') = 'dlroW olleH'

RTRIM & LTRIM

RTRIM ( character_expression )

LTRIM ( character_expression )

When you want to remove the spaces from the left, the right or both sides of a string.
Example:

SELECT 
	FullName
	, LTRIM(FullName) AS 'LeftTrim'
	, RTRIM(FullName) AS 'RightTrim'
	, RTRIM(LTRIM(FullName)) AS 'BothTrim'
FROM dbo.StringMinipulationExample

Combining RTRIM and LTRIM allows you to remove the spaces from both sides.

Bringing it all together

So we have looked at a lot of the string manipulation functions in SQL Server, but to get the most out of them we need to use them together.
Let’s look at how we can separate the Title, FirstName and LastName.

; WITH CleanedStrings AS 
(
SELECT 
	RTRIM(LTRIM(FullName)) AS 'FullName'
	, FullName AS 'UncleanFullName'
	, Address
 
FROM dbo.StringMinipulationExample
)
SELECT 
	CleanedStrings.UncleanFullName
	, CleanedStrings.FullName 
	, REPLACE(LEFT(FullName, CHARINDEX('.', FullName)), '.', '') AS 'Title'
	, LEFT(LTRIM(REPLACE(CleanedStrings.FullName, LEFT(FullName, CHARINDEX('.', FullName)), '')), CHARINDEX(' ', LTRIM(REPLACE(CleanedStrings.FullName, LEFT(FullName, CHARINDEX('.', FullName)), '')))) AS 'FirstName'
	, REVERSE(LEFT(REVERSE(CleanedStrings.FullName), CHARINDEX(' ', REVERSE(CleanedStrings.FullName)))) AS 'LastName'
FROM CleanedStrings;

First we have a CTE to remove the left and right padding. Then the title can be found, as it is always followed by a ‘.’.
FirstName was a little trickier: I had to remove the title, then find the space and separate. I could have found the first space and the last and used SUBSTRING.

For the LastName I have reversed the string, found where the first space is in the reversed string, taken a LEFT of that and then reversed the result back to how it should be.

Thank you for reading