In this blog I want to explore how DevOps is being applied in the industry today, before we really dig into applying DevOps to Machine Learning in the next blog. If this is the first blog you’re reading, then you might want to start at the beginning: http://www.hyperbi.co.uk/applying-devops-to-data-science/
In recent years there has been a subtle shift in the industry: it has begun to take the principles of DevOps and apply them to data analytics. In 2015 Andy Palmer coined the term DataOps. He describes DataOps as the intersection of data engineering, data integration, data quality and data security (Palmer, 2015). I first came across the term from Steph Locke’s blog “DataOps, its a thing. Honest” https://itsalocke.com/blog/dataops–its-a-thing-honest/ .
The principles of DevOps have changed the way that software teams work. Data Science and software development share many characteristics. However, the DevOps style of working is seldom seen in data science teams. By taking the principles discussed in previous blogs and applying them to machine learning, we should begin to see results similar to the effect DevOps has had on traditional software development. In this blog I want to discuss the fundamental problem with a lot of books on Machine Learning – the fear of production. That sounds a little extreme, I would agree, but there is a problem and many authors dance around the subject. This whole blog series is about tackling that problem.
Why is the productionisation of machine learning so hard?
I am going to make a few generalisations throughout this blog. Some of you will agree and others won’t. Leave a comment if you want to talk more – here goes…
The productionisation of Machine Learning models is the hardest problem in Data Science.
– Schutt and O’Neill
There are typically two types of data scientist: those who build models and those who deploy them. Model development is typically done by academic data scientists, who have spent many years learning statistics and understand which model works best in any given situation. The other type of data scientist can be better described as a data engineer (or a machine learning engineer – I talk more about this in the Machine Learning Logistics book review). The role of a data engineer is typically to build and maintain the platform a data scientist works with. On some occasions, both roles are performed by the same person. The former type makes up a lot of data science teams, while the engineering part is typically taken from IT.
In the last blog, An Introduction to DevOps, we looked at the basics of what DevOps is. We only really skimmed the surface. I want to dig into a bit more detail, which will make the discussion about Data Science and DevOps a little easier. I want to start by recommending two great books. You will see references to pages and quotations throughout this series. All the references are listed here: DevOps for Data Science. The two books I recommend are The DevOps Handbook and The Phoenix Project. Both books are fantastic and approach the subject from different angles.
I do not typically review books, but the following book had a profound impact on how I saw DevOps and Data Science. It is a fantastic book and one I recommend to all. A few years ago I picked up Nathan Marz’s Big Data book, in which Marz introduces the concept of the Lambda Architecture. I have since used a variation of Marz’s design in most analytical projects. Lambda changed the way I approached big data. Rendezvous has had the same effect for Machine Learning.
You can download “Machine Learning Logistics” from MapR’s website: https://mapr.com/ebook/machine-learning-logistics/ The book is about 90 pages. There is also an accompanying three-part video series, delivered by Ted and Ellen, which is available on YouTube.
Welcome to part 2 in a series on applying DevOps to Data Science. You can read part 1 here: Applying DevOps to Data Science. In this blog I want to begin to define what DevOps is and to understand why DevOps can help a Data Scientist deploy models faster.
What is DevOps?
This divide between those who develop and those who deploy has been a struggle in traditional software development for a long time. Software developers would typically work in isolation, building their applications. Once an application was built, it would be handed over to the operations department for deployment or migration to production. This process could take a long time.
The delay means the development team have to wait longer to deploy changes, code becomes stagnant and the developers are reliant on the availability of operations. When the Agile Methodology became popular, a shift in the deployment process began to emerge. Developers began working with operations as part of the same team, sharing responsibilities. DevOps emerged.
The term DevOps is a portmanteau of dev and ops: dev relates to developers and ops relates to operational staff. In traditional software development, there has typically been a separation between the developers who are building an application and the operations staff who are tasked with deploying and monitoring the application once it is live. The two roles are very different and have different skills, techniques and interests. Developers are interested in whether a feature has been built well; operations are interested in whether the application and infrastructure are performing as required (Kim et al., 2016, p. xxii).
Some of you might know that for the last two years I have been studying for a Master’s degree in Data Science at the University of Dundee. This was a two-year part-time course delivered by Andy Cobley and Mark Whitehorn. The course was fantastic and I recommend it. If you want to know more about it, please give me a shout. The course comprises multiple modules. The final module is a research project, which you need to start thinking about towards the end of the first year of study. I selected my topic very early on; however, being indecisive, I changed my idea three times (each time having written a good chunk of the project).
Why did I do this? I simply was not passionate about the subject of those projects. They were good ideas, but I just was not researching or building anything new. The outcome of my dissertation might have been a working project, but it would have felt hollow to me. I needed a topic I was passionate about. I have a core ethos that I take to every project I work on: “Never do anything more than once”. Because of that, I have spent much of my career working either with or developing automation tools to accelerate and simplify my development processes. Having attended a lot of conferences, I became familiar with DevOps and how it accelerated the software industry. DevOps allows software developers to ship code faster. I have been applying the core principles of DevOps to all my recent projects, with great success.