Thanks for continuing to read my series on applying DevOps to Machine Learning. If you’re new here you might want to start at the beginning or check out some technical content. Both should help make this all easier to digest.
So let’s remind ourselves of what we are trying to achieve. We are looking to apply DevOps to Machine Learning, to remove the need for a dedicated data engineer. We want to use automation where we can: CI/CD and a set of tools all working in harmony to achieve this. We have focused on a lot of the theory for the last few blogs, and this one is no exception. The next blog is the detailed how-to, so keep reading.
Based on my research and a review of existing books and blogs on this topic, I will be implementing the following DevOps techniques to reduce the difficulty of promoting a model into production.
- Source control
- Infrastructure as code / Configuration as code
- Continuous integration
- Continuous deployment
- Microservices & Containers
- Monitoring and lineage
I have limited the scope to services which are available in Azure; however, as Azure supports IaaS servers, I could install any Apache open-source tool. The design principles restrict this, as one indicates a preference for PaaS over IaaS. As Azure is vast, even when working under these restrictions there are still multiple options which could work. The following sections indicate the tools I will use and the decisions I made during the selection process.
In this blog I want to explore how DevOps is being applied in the industry today, before we really dig into applying DevOps to Machine Learning in the next blog. If this is the first blog you’re reading, then you might want to start at the beginning: http://www.hyperbi.co.uk/applying-devops-to-data-science/
In recent years there has been a subtle shift appearing in the industry: it has begun to take the principles of DevOps and apply them to data analytics. In 2015 Andy Palmer coined the term DataOps; he describes DataOps as the intersection of data engineering, data integration, data quality and data security (Palmer, 2015). I first came across the term from Steph Locke’s blog “DataOps, its a thing. Honest” https://itsalocke.com/blog/dataops–its-a-thing-honest/ .
The principles of DevOps have changed the way that software teams work. Data Science and software development share many characteristics; however, the DevOps style of working is seldom seen in data science teams. By taking the principles discussed in previous blogs and applying them to machine learning, we will begin to see results similar to the effect DevOps has had on traditional software development. In this blog I want to discuss the fundamental problem with a lot of books on Machine Learning: the fear of production. That sounds a little extreme, I agree, but there is a problem and many authors dance around the subject. This whole blog series is about tackling that problem.
Why is the productionisation of machine learning so hard?
I am going to make a few generalisations throughout this blog. Some of you will agree and others won’t. Leave a comment if you want to talk more. Here goes…
The productionisation of Machine Learning models is the hardest problem in Data Science.
– Schutt and O’Neill
There are typically two types of data scientist: those who build models and those who deploy them. Model development is typically done by academic data scientists, who have spent many years learning statistics and understand which model works best in any given situation. The other type of data scientist can be better described as a data engineer (or a machine learning engineer; I talk more about this in the Machine Learning Logistics book review). The role of a data engineer is typically to build and maintain the platform a data scientist works with. On some occasions, both roles are performed by the same person. The former makes up most of a data science team; the engineering part is typically taken from IT.
In the last blog, An Introduction to DevOps, we looked at the basics of what DevOps is. We only really skimmed the surface. I want to dig into a bit more detail, which will make the discussion about Data Science and DevOps a little easier. I want to start by recommending two great books. You will see references to pages and quotations throughout this series. All the references are listed here: DevOps for Data Science. The two books I recommend are The DevOps Handbook and The Phoenix Project. Both books are fantastic and approach the subject from different angles.
I do not typically review books, but the following book had a profound impact on how I saw DevOps and Data science. It is a fantastic book and one I recommend to all. A few years ago I picked up Nathan Marz’s Big Data book, in which Marz introduces the concept of the Lambda Architecture. I have since used a variation of Marz’s design in most analytical projects. Lambda changed the way I approached big data. Rendezvous has had the same effect for Machine Learning.
You can download “Machine Learning Logistics” from MapR’s website. https://mapr.com/ebook/machine-learning-logistics/ The book is about 90 pages. There is also an accompanying three part video series, delivered by Ted and Ellen, which is available on YouTube.
Welcome to part 2 in a series on applying DevOps to Data Science. You can read part 1 here: Applying DevOps to Data Science. In this blog I want to begin to define what DevOps is and to understand why DevOps can help a Data Scientist deploy models faster.
What is DevOps?
This divide between those who develop and those who deploy has been a struggle in traditional software development for a long time. Software developers would typically work in isolation, building their applications. Once an application was built, it would be handed over to the operations department for deployment/migration to production. This process could take a long time.
The delay means the development team has to wait longer to deploy changes, code becomes stagnant and the developers are reliant on operations’ availability. When the Agile methodology became popular, a shift in the deployment process began to emerge. Developers were working with operations as part of the same team to enable sharing of responsibilities. DevOps emerged.
The term DevOps is a portmanteau of dev and ops, dev relating to developers and ops relating to operational staff. In traditional software development, there has typically been a separation between the developers who are building an application and operations who are tasked with deploying and monitoring the application once it is live. The two roles are very different and have different skills, techniques and interests. Developers are interested in whether a feature has been built well; operations are interested in whether the application and infrastructure are performing as required (Kim et al, 2016 pp xxii).
Some of you might know that for the last two years I was studying for a Master’s degree in Data Science at the University of Dundee. This was a two-year part-time course delivered by Andy Cobley and Mark Whitehorn. The course was fantastic and I recommend it; if you want to know more about the course, please give me a shout. The course comprises multiple modules. The final module is a research project, which you need to start thinking about towards the end of the first year of study. I selected my topic very early on; however, being indecisive, I changed my idea three times (each time having written a good chunk of the project).
Why did I do this? I simply was not passionate about the subject of those projects. They were good ideas, but I just was not researching or building anything new. The outcome of my dissertation might have been a working project, however it would have felt hollow to me. I needed a topic I was passionate about. I have a core ethos that I take to every project I work on: “Never do anything more than once”. It is because of that that I have spent much of my career either working with or developing automation tools to accelerate and simplify my development processes. Having attended a lot of conferences, I became familiar with DevOps and how it accelerated the software industry. DevOps allows software developers to ship code faster. I have been applying the core principles of DevOps to all my recent projects, with great success.
It is with great pleasure that I can share with you all that I am now a Microsoft MVP (Most Valuable Professional). The MVP programme (https://mvp.microsoft.com/) is designed to recognise those professionals who give their time to support technical communities and Microsoft. This could be through blogging, speaking at conferences, providing feedback, helping others and contributing code to projects. There are many award categories, ranging from PowerPoint to Xbox, each with its own contribution areas. I was awarded MVP for the Data Platform, which covers SQL Server, Data Lake, HDInsight and many of the topics I am particularly interested in. I am exceptionally pleased to be amongst the 19 Data Platform MVPs in the UK and 1 of the 418 Data Platform MVPs worldwide.
In a recent blog, Kevin Kline discussed what it takes to be an MVP; in his opinion it all boils down to being passionate about what you do and sharing that passion with others (https://blogs.sentryone.com/kevinkline/how-can-i-become-a-microsoft-mvp/). I could not agree more! I talk at events because I want to help people. I still get so much out of attending sessions and I want to make sure that I give back everything I took from the community. Kevin’s blog gives a fantastic overview of what it takes to become an MVP and is well worth a read.
For those who have wondered if I have fallen off the grid, I have not.
I have been blogging at http://blogs.adatis.co.uk/terrymccann/
Please continue to read my content there.
Recently I set up an interactive map in R & Shiny to show the distance between where a customer lives and the site that customer attends. What I wanted to do was find out how long it took each person to get to the site using public transport, which proved difficult in R, so I settled for an “as the crow flies” approach using Pythagoras. This worked, but since then I have had a burning question in the back of my mind: “what if I could use the Google API to give me that distance?”. This is what I have now done, and this is how you do it.
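To make the “as the crow flies” approach concrete, here is a minimal sketch of the Pythagoras-on-coordinates idea (an equirectangular approximation: scale longitude by the cosine of the mean latitude, then apply Pythagoras). My original was in R; this sketch is in Python, and the function name and example coordinates are my own illustration, not from the original project.

```python
from math import cos, radians, sqrt

EARTH_RADIUS_KM = 6371.0

def crow_flies_km(lat1, lon1, lat2, lon2):
    """Approximate straight-line distance by treating a small patch
    of the globe as flat: Pythagoras on an equirectangular projection."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    # East-west degrees shrink towards the poles, so scale the
    # longitude difference by cos(mean latitude) before Pythagoras.
    x = d_lambda * cos((phi1 + phi2) / 2)
    y = d_phi
    return EARTH_RADIUS_KM * sqrt(x * x + y * y)

# e.g. London to Dundee, roughly 580 km as the crow flies
distance = crow_flies_km(51.5074, -0.1278, 56.4620, -2.9707)
```

This is fine over the distances customers travel to a site, but it ignores roads and transport links entirely, which is exactly the gap the Google API fills.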
Google has many APIs that can be used at little to no cost. The Google Distance Matrix API allows you to enter a starting position and one or more end positions, and the API will return the distance based on a mode of transport of your choice (walking, driving, or transit, i.e. public transport). All you need to do is register and get an API key: https://developers.google.com/maps/documentation/distance-matrix/
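A rough sketch of what a call looks like: you build a request URL with your origin, destination, mode and key, then read the distance and duration out of the JSON response. The helper names below are mine, the API key is a placeholder, and the sample response is trimmed down to the fields we care about; the real response carries more detail.

```python
import json
from urllib.parse import urlencode

API_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"

def build_request_url(origin, destination, mode="transit", api_key="YOUR_API_KEY"):
    """Build a Distance Matrix request URL for one origin/destination pair."""
    params = {
        "origins": origin,
        "destinations": destination,
        "mode": mode,  # walking, driving or transit (public transport)
        "key": api_key,
    }
    return API_URL + "?" + urlencode(params)

def parse_first_element(response_json):
    """Pull distance (metres) and duration (seconds) from the first row/element."""
    element = response_json["rows"][0]["elements"][0]
    if element["status"] != "OK":
        raise ValueError("No route found: " + element["status"])
    return element["distance"]["value"], element["duration"]["value"]

# A trimmed-down example of the JSON shape the API returns:
sample = {
    "status": "OK",
    "rows": [{"elements": [{
        "status": "OK",
        "distance": {"text": "4.2 km", "value": 4200},
        "duration": {"text": "18 mins", "value": 1080},
    }]}],
}

metres, seconds = parse_first_element(sample)
```

In practice you would fetch `build_request_url(...)` with your HTTP client of choice and feed the decoded JSON into the parser; checking the per-element `status` matters because transit routes are not available between every pair of points.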