Thanks for continuing to read my series on applying DevOps to Machine Learning. If you’re new here you might want to start at the beginning or check out some technical content. Both should help make this all easier to digest.
So let’s remind ourselves of what we are trying to achieve. We are looking to apply DevOps to Machine Learning, removing the need for a dedicated data engineer. We want to use automation where we can: CI/CD and a set of tools all working in harmony to achieve this. We have focused on a lot of the theory over the last few blogs, and this one is no exception. The next blog is the detailed how-to. So keep reading.
Based on my research and a review of existing books and blogs on this topic, I will be implementing the following DevOps techniques to reduce the difficulty of promoting a model into production.
- Source control
- Infrastructure as code / Configuration as code
- Continuous integration
- Continuous deployment
- Microservices & Containers
- Monitoring and lineage
I have limited the scope to services which are in Azure; however, as Azure supports IaaS servers, I could install any Apache open-source tool. The design principles restrict this, though, as one of them indicates a preference for PaaS over IaaS. As Azure is vast, even when working under these restrictions there are still multiple options which could work. The following sections indicate the tools I will use and the decisions I made during the selection process.
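To make the continuous integration technique from the list above a little more concrete before the detailed how-to, here is a minimal sketch of the kind of automated quality gate a CI pipeline could run before promoting a model. Everything here is illustrative: the stand-in "model", the hold-out data, and the 0.8 threshold are all my own assumptions, not part of any specific Azure service.

```python
# Hypothetical CI gate: the pipeline only promotes a model to
# production if it beats a minimum accuracy on a hold-out set.

def predict(x):
    # Stand-in "model": labels a reading as high (1) or low (0).
    return 1 if x >= 0.5 else 0

def accuracy(model, samples):
    # samples is a list of (input, expected_label) pairs.
    correct = sum(1 for x, y in samples if model(x) == y)
    return correct / len(samples)

# Illustrative hold-out data and promotion threshold.
HOLDOUT = [(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0), (0.6, 1)]
THRESHOLD = 0.8

score = accuracy(predict, HOLDOUT)
if score >= THRESHOLD:
    print(f"PASS: accuracy {score:.2f} meets threshold, safe to promote")
else:
    # A non-zero exit code fails the CI build, blocking deployment.
    raise SystemExit(f"FAIL: accuracy {score:.2f} below {THRESHOLD}")
```

In a real pipeline this script would run automatically on every commit, and a failing gate would stop continuous deployment from ever seeing the new model.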
In this blog I want to explore how DevOps is being applied in the industry today before we really dig into applying DevOps to Machine Learning in the next blog. If this is the first blog you’re reading, then you might want to start at the beginning. http://www.hyperbi.co.uk/applying-devops-to-data-science/
In recent years there has been a subtle shift appearing in the industry: it has begun to take the principles of DevOps and apply them to data analytics. In 2015 Andy Palmer coined the term DataOps; he describes DataOps as the intersection of data engineering, data integration, data quality and data security (Palmer, 2015). I first came across the term from Steph Locke’s blog “DataOps, its a thing. Honest” https://itsalocke.com/blog/dataops–its-a-thing-honest/ .
The principles of DevOps have changed the way that software teams work. Data Science and software development share many characteristics; however, the DevOps style of working is seldom seen in data science teams. By taking the principles discussed in previous blogs and applying them to machine learning, we will begin to see results similar to the effect DevOps has had on traditional software development. In this blog I want to discuss the fundamental problem with a lot of books on Machine Learning – the fear of production. That sounds a little extreme, I agree, but there is a problem and many authors dance around the subject. This whole blog series is about tackling that problem.
Why is the productionisation of machine learning so hard?
I am going to make a few generalisations throughout this blog. Some of you will agree and others won’t. Leave a comment if you want to talk more – here goes…
The productionisation of Machine Learning models is the hardest problem in Data Science.
– Schutt and O’Neil
There are typically two types of data scientist: those who build models and those who deploy them. Model development is typically done by academic data scientists, who have spent many years learning statistics and understand which model works best in any given situation. The other type of data scientist can be better described as a data engineer (or a machine learning engineer – I talk more about this in the Machine Learning Logistics book review). The role of a data engineer is typically to build and maintain the platform a data scientist works with. On some occasions, both roles are performed by the same person. The former make up a lot of data science teams; the engineering part is typically drawn from IT.
I do not typically review books, but the following book had a profound impact on how I saw DevOps and Data Science. It is a fantastic book and one I recommend to all. A few years ago I picked up Nathan Marz’s Big Data book, in which Marz introduces the concept of the Lambda Architecture. I have since used a variation of Marz’s design in most analytical projects. Lambda changed the way I approached big data; Rendezvous has had the same effect for Machine Learning.
You can download “Machine Learning Logistics” from MapR’s website. https://mapr.com/ebook/machine-learning-logistics/ The book is about 90 pages. There is also an accompanying three-part video series, delivered by Ted and Ellen, which is available on YouTube.
Welcome to part 2 in a series on applying DevOps to Data Science. You can read part 1 here: Applying DevOps to Data Science. In this blog I want to begin to look at defining what DevOps is, and to understand why DevOps can help a Data Scientist deploy models faster.
What is DevOps?
This divide between those who develop and those who deploy has been a struggle in traditional software development for a long time. Software developers would typically work in isolation building their applications. Once an application was built, it would be handed over to the operations department for deployment/migration to production. This process could take a long time.
The delay meant the development team had to wait longer to deploy changes, code became stagnant and developers were reliant on operations’ availability. When the Agile methodology became popular, a shift in the deployment process began to emerge. Developers started working with operations as part of the same team to enable sharing of responsibilities. DevOps emerged.
The term DevOps is a portmanteau of dev and ops: dev relating to developers and ops relating to operational staff. In traditional software development, there has typically been a separation between the developers who are building an application and operations who are tasked with deploying and monitoring the application once it is live. The two roles are very different and have different skills, techniques and interests. Developers are interested in whether a feature has been built well; operations are interested in whether the application and infrastructure are performing as required (Kim et al., 2016, pp. xxii).