References:
Motivation
I have been thinking about how to design ML systems in a principled, reusable, and clean manner. Initially I was a data scientist who pipelined data modelling elements, and I had no problem with that - in fact, my strength lies in data modelling and analysis (particularly in mathematically grounded scenarios).
After more than two years immersed in a data science environment, I thought I could do everything with data. I knew caret, sklearn, RNNs/CNNs, Bayesian methods, optimisation, CV, stats, and algebra. I could collect data from the web (scraping), process almost all common types of data (including alternative ones), arrive at meaningful conclusions by analysis, implement algorithms in different languages on various platforms, monitor their performance, and move on to the next round of the life cycle. All the things that pertain to a full-life-cycle data scientist.
Engineering came to me suddenly. In my first year working as a data scientist, I was fortunate to rotate among different groups and take on tasks such as reading a model structure saved from R as XML and implementing it in JPMML, or reading and writing text/spreadsheet files in Python/Java with some transformations and calculations. I failed a lot, but I also got a taste of Java unit testing, wrote my own encryption code, and designed a GUI. At that time I went through lots of online tutorials on Java and markup languages, and had an initial taste of implementation work that is beyond the ‘normal’ responsibilities of a ‘normal’ data scientist. For example, rather than just using R for data analysis, I also looked into Markdown, Shiny, and S3 classes. Instead of writing one complete script as I would in Matlab, I started to split my Java code into modules. I had Java/Python connect to SQL databases in a multi-threaded manner, and so on.
However, it was not until the end of my second year of work that I realised these are simple tricks in a software engineer’s life. If you think about JUnit, XML, Kafka, Kubernetes, bash, inheritance, composition, and encapsulation, they are just basics in engineering. Take encapsulation, for example: I am sure many, if not most, so-called data scientists don’t even have an idea of wtf it is. Yeah, it is a fundamental concept in OOP, maybe covered in your first-year university CS course, but the fact is that lots of data scientists come from various backgrounds and don’t necessarily have a CS/programming background. That’s sad. I learned C, passed the national exam, and got my qualification some 12 years ago, but I still did not know exactly what agile programming is (ignoring the fact that C is a structured/procedural language). I was more used to structured programming in R, Matlab, or Python (let alone S3 or OOP), and when I processed data I was thinking about procedures rather than objects. I believe most data scientists have similar experiences.
Later, I found the bottleneck - I was able to implement my ideas by assembling packages, but I wasn’t able to write my own packages, which is more engineering work than analysis (by package I mean a collection of classes providing similar functionality towards achieving a task; think of the unified abstract interface provided by sklearn, .fit() and .predict() for example).
A lack of engineering skills causes lots of problems along the journey. It blocked me from accessing more advanced and efficient ways of processing data. For example, many people are familiar with Hadoop and MapReduce, but not so many have heard of Flink, Storm, or Spark. That’s fine if you are dealing with batch data (which is the normal life of a data scientist/ML researcher); however, when it comes to real-world problems, particularly those that require real-time feedback (e.g. online learning), you will need a streaming data processor. Other scenarios include cases where you cannot efficiently read or save large chunks of data, modelling with parallel or distributed resources (edge computing, in contemporary words), HPC, etc.
Or you may simply need to pack all your code into a package and put it behind a user interface for non-technical people to use. What if errors/exceptions occur and you did not design a mechanism to handle them? The system will just fail. At run time, that can cause large losses in a trading environment.
So I decided to return to school to pick up software design. I am an ML researcher working for the university and, at the same time, a research software engineer working for a local software firm. I write mathematical models in LaTeX in the morning and implement them in the afternoon in whatever language is convenient. I care about the complexity of my algorithms, but I am also concerned about the generalisation/specialisation (the bias/variance trade-off, in ML language) of my code. More recently, I read three books on OOP software design, one focusing on design patterns and the other two on programming. They refreshed my mind and gave me lots of new insights. They make me recall the good old days when I was a data scientist, though at that time I was totally blind to engineering. By chance, this morning I read an article on Medium about OOP for data science, and I totally agree with those who have experience in both data science and data engineering (including architecture). So I think it would be good to cite it here.
I would make the assumption that a good ML researcher should be able to generate good ideas, turn math models into actions, and implement them in a principled manner. For qualified data scientists, a good understanding of engineering helps with working efficiently. Beyond algebra, optimisation, and stats, there is much more, and all of these are premises and recipes for making a good ‘data science engineer’.
Intro
When it comes to programming, different levels are involved. Some are familiar with language-specific syntax, some are comfortable with data structures, some can write functions (conditionals, cases, loops), some think deeper about protecting their code (error/exception handling, encapsulation), and you can go further and reuse your code by applying inheritance/composition (association/aggregation) and design patterns for future debugging, testing, and extensibility.
A simple example would be: if you are asked to design a calculator, how would you approach it?
Some simply use the built-in arithmetic operators:
a+b
a-b
a*b
a/b
It looks fine. However, what if a or b is not an integer/double/float or other primitive type, e.g. a String (here the capital S denotes a class)? Some languages can still process it; for example, they treat a+b as concatenation, first casting the number to a String and then concatenating both. However, the result is not what we want. So we need a mechanism to check the inputs and deal with the case where the operation breaks (you can do that in a try/catch block or with conditionals - this is out of scope here).
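As a rough illustration of such a checking mechanism, here is a minimal Python sketch. The helper name safe_add and the decision to reject anything that is not a real number are my own assumptions for illustration, not something from the referenced article. Note that Python itself raises a TypeError for 1 + "2" rather than concatenating; the silent concatenation described above is what a language such as Java or JavaScript does.

```python
from numbers import Real


def safe_add(a, b):
    """Add two real numbers, rejecting any other input type (hypothetical helper)."""
    for name, value in (("a", a), ("b", b)):
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, Real):
            raise TypeError(f"{name} must be a real number, got {type(value).__name__}")
    return a + b


# The conditional check lives inside the function;
# the try/except handles the failure at the call site.
try:
    result = safe_add(1, "2")
except TypeError as err:
    result = None
    message = str(err)
```

The conditional decides what counts as valid input, while the try/except decides what the caller does when the input is invalid. Keeping the two apart means the same function can be reused by callers that want to recover differently.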
To make it structured, some may like to write functions:
def add(a, b):
    return a + b
Remember you can add the ‘init’ constructor if you like (e.g. in default cases). In structured programming, people like functions - I am not saying OOP people don’t like them. In fact, both data and methods are integral parts of an object.
So some people would like to create a class for the calculator to house both attributes/data/fields/properties and behaviours/methods/functions.
class Calculator:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def add(self):
        return self.a + self.b

    def sub(self):
        return self.a - self.b

    def mult(self):
        return self.a * self.b

    def div(self):
        return self.a / self.b
This is in Python syntax, but it’s clear enough to demonstrate the beauty of bringing together data and methods: you pack the variables into the object’s fields and carry out the designated arithmetic on call. (Python does not enforce private fields; a leading underscore, as in self._a, marks a field as private by convention only.) The class can be instantiated whenever you need it, i.e. it’s reusable.
More generally, OOP gives you access control (private/public/protected), lets you separate class definition from implementation, lets you override/extend methods in sub-classes, and even lets you redefine the meaning of ‘+’ using polymorphism and overloading, or design a GUI for graphical operations. All in all, it’s more flexible and you have more control and options.
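To make a few of these options concrete, here is a minimal Python sketch. The class names SafeCalculator and Money are hypothetical, chosen only for illustration, and the Calculator class is repeated in shortened form so the block is self-contained. Python has no enforced private/protected keywords, so the leading underscore below is a convention, not real access control.

```python
class Calculator:
    """Shortened calculator with fields marked private by convention."""

    def __init__(self, a, b):
        self._a = a  # leading underscore: "private" by convention only
        self._b = b

    def add(self):
        return self._a + self._b

    def div(self):
        return self._a / self._b


class SafeCalculator(Calculator):
    """Hypothetical sub-class that overrides div() to guard a zero divisor."""

    def div(self):
        if self._b == 0:
            raise ValueError("cannot divide by zero")
        return super().div()


class Money:
    """Hypothetical value class that redefines '+' through __add__."""

    def __init__(self, amount, currency):
        self.amount = amount
        self.currency = currency

    def __add__(self, other):
        if not isinstance(other, Money):
            return NotImplemented  # let Python raise TypeError for Money + 1
        if other.currency != self.currency:
            raise ValueError("cannot add different currencies")
        return Money(self.amount + other.amount, self.currency)

    def __eq__(self, other):
        return (isinstance(other, Money)
                and (self.amount, self.currency) == (other.amount, other.currency))


# The unchanged add() method works on Money because it only relies on '+'
total = Calculator(Money(5, "GBP"), Money(7, "GBP")).add()
```

Because add() only relies on ‘+’, the same Calculator works for plain numbers and for Money objects alike, which is the polymorphism mentioned above. SafeCalculator changes one behaviour and inherits the rest, so the original class does not need to be touched.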
This is just a simple example, however; good SE practice goes far beyond this, and you can find more examples in the referenced article and books. I might write another post to summarise some OOP design principles and patterns. I do encourage all machine learners and data scientists to grasp some essentials of SE, which could potentially take your skills to the next level.