Hello everybody – we do hope this article finds you well.
As data scientists we aim for a few things:
- Pin down a business pain point – either known up front or uncovered by deep diving into the data;
- Convert the business case into a Machine Learning (ML) one – supervised, unsupervised, reinforcement learning;
- Match the task's requirements with the available data;
- Build the strongest model possible within the available timeline while respecting the accuracy vs. interpretability trade-off of the task at hand;
- Integrate the solution within the current strategy (check out our finance industry article).
In today's world of Industry 4.0, Big Data and AI, we have a broad spectrum of tools to address the above.
We are now at a point where interoperability – between Visual Intelligence and Data Science on the one hand, and among the different tools within each of these on the other – has greatly improved over the last few years (make sure to have a look at our article on the topic):
- Tableau, Qlik, Power BI (obviously) and Shiny (even more obviously) all have great R integrations, so you can use advanced analytics directly in your dashboard or take advantage of the visualization libraries available from R – such as Plotly.
- Execute R code within Python (the rpy2 package), Python code within R (the reticulate package), call R scripts from SAS (run proc options option=RLANG; to verify permissions), SQL in SAS (proc sql – available for a long time now) and in R (the sqldf package), etc.
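The SQL-inside-your-language pattern from the last bullet is not limited to SAS and R. As a minimal, runnable sketch, here is the Python analogue of R's sqldf, using the standard-library sqlite3 module (the table and figures are made up for illustration):

```python
import sqlite3

# In-memory database: run SQL directly against Python-held data,
# much like sqldf lets you query R data frames with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 100.0), ("US", 250.0), ("EU", 50.0)],
)

# Aggregate with plain SQL instead of hand-written loops.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('EU', 150.0), ('US', 250.0)]
```

The same idea scales up: swap the in-memory connection for a real database and the analytics code around it does not change.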
So, what is still missing? The future – and increasingly the present – lies in connecting the advanced-analytics and visualization dots, and that is where we are making advancements.
An example of the bigger picture, and of how data scientists interact with each element, can be found in the diagram below:
We, as data scientists, have so far mostly been involved in the green sections of the diagram above: building advanced analytics solutions and presenting them in a way that lets the business both interpret them and appreciate their added value. The rest has been, and still is, a little out of scope:
- Data ingestion – we take the data as provided. Here we mostly need the help of data engineers to ingest it, a data steward to help with availability and, eventually, a business analyst to help us understand it 😊;
- Model deployment – DevOps or software engineers lend a helping hand here. We all know this is crucial: it is often where the analytical exercise can fail, and even incur losses, if our models are not properly placed within the live/production environment;
- Roles and access – Qlik, for example, allows you to set different access roles and dashboard views per user.
So, what are the major players – Amazon Web Services®, Microsoft Azure®, Oracle®, etc. – doing? They are aiming to connect all the dots. Have they succeeded? Well, our experience shows that the progress is significant. Let's review some of the achievements of AWS and its SageMaker* – a fully managed machine learning service where we have:
- Direct connection to the stored data – your S3 bucket (data ingestion, interoperability, self-service);
- A readily available Jupyter notebook on your EC2 instance with built-in Python, Spark, etc. (tools);
- Besides training your own algorithm – off-the-shelf containers with pre-built ML capabilities that are optimized for the AWS environment (self-service);
- Resource monitoring via Amazon CloudWatch and API-activity auditing via AWS CloudTrail, with AWS Cost Explorer to track spending (costs);
- Connection to version control tools such as GitHub (version control, interoperability);
- The option to install R Server and Shiny Server on your EC2 instance, or simply to add an R kernel within Jupyter (as simple as: "!conda install --yes --name JupyterSystemEnv --channel r r-essentials=1.7") (interoperability, tools);
- Set access control via an IAM role (roles and access level);
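For the first capability above – reading your data straight from S3 – a minimal sketch might look like the following. It assumes boto3 and pandas are available (both come pre-installed on SageMaker notebook instances) and that the notebook's IAM role grants read access; the bucket and key names are placeholders:

```python
def load_training_frame(bucket: str, key: str):
    """Read a CSV object from S3 into a pandas DataFrame.

    Sketch only: assumes the execution role attached to the notebook
    instance is allowed to read the given bucket.
    """
    import io

    import boto3
    import pandas as pd

    # Fetch the raw object bytes, then parse them as CSV.
    obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))


# Hypothetical usage inside a SageMaker notebook:
# df = load_training_frame("my-project-bucket", "raw/train.csv")
```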
Below is a functional example of the extended train, host and predict path for your models within SageMaker, provided by the SageMaker Python SDK. Just a couple of lines of code cover the full cycle for you:
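A sketch of that cycle with the SageMaker Python SDK might look as follows. The entry-point script name, instance types, framework version and sample payload are illustrative assumptions, and actually running it requires an AWS account with a configured IAM role:

```python
def train_host_predict(train_s3_uri: str, role_arn: str):
    """Train, host and predict with the SageMaker Python SDK (sketch).

    All names below (entry point, instance types, framework version)
    are placeholder assumptions, not project specifics.
    """
    from sagemaker.sklearn.estimator import SKLearn

    estimator = SKLearn(
        entry_point="train.py",        # hypothetical training script
        role=role_arn,
        instance_type="ml.m5.large",
        instance_count=1,
        framework_version="1.2-1",
    )
    estimator.fit({"train": train_s3_uri})      # train on data held in S3

    predictor = estimator.deploy(               # host a real-time endpoint
        initial_instance_count=1,
        instance_type="ml.m5.large",
    )
    result = predictor.predict([[0.5, 1.2, 3.4]])  # score a new observation
    predictor.delete_endpoint()                 # avoid idle-endpoint charges
    return result
```

Training, hosting and scoring really do collapse into `fit`, `deploy` and `predict` – the infrastructure plumbing happens behind those three calls.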
The experience with Microsoft Azure is similar – this nice recent article is worth checking out. We, as data scientists, are also currently exploring and tremendously enjoying the capabilities of Dataiku® DSS.
Based on our projects, here are some points to pay attention to:
- Even though marketed as Plug & Play, these services do have a learning curve, and your initial interaction may be a little frustrating – especially in the areas you are not used to taking part in (one such could be the deployment of your model);
- Self-service is a good thing but cannot substitute for an experienced professional – it has limited capabilities and lacks flexibility, and you still have to know what you are doing (e.g. when to use one regression technique vs. another);
- Pricing: not having to fully develop your own environment may reduce costs significantly. Still, please do run the numbers on the service's pricing – some offers look good, but when your data is indeed BIG (e.g. thousands of sensors providing data every second across many manufacturing units), you may be surprised when the bill starts knocking on your accountant's door.
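To make the pricing point concrete, here is a quick back-of-envelope calculation for the sensor scenario above – every number is an illustrative assumption, not vendor pricing:

```python
# Illustrative assumptions for an IoT-style manufacturing workload.
SENSORS_PER_UNIT = 2_000     # sensors in one manufacturing unit
UNITS = 50                   # manufacturing units
BYTES_PER_READING = 16       # timestamp + sensor id + value
SECONDS_PER_MONTH = 60 * 60 * 24 * 30

readings_per_month = SENSORS_PER_UNIT * UNITS * SECONDS_PER_MONTH
raw_tb_per_month = readings_per_month * BYTES_PER_READING / 1e12

print(f"{readings_per_month:,} readings/month")   # 259,200,000,000 readings/month
print(f"{raw_tb_per_month:.1f} TB/month raw")     # 4.1 TB/month raw
```

Even before storage redundancy, indexing or data-transfer fees, that is terabytes of raw data per month – well worth pricing out before committing to a platform.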
So, we do believe we are making great progress in the areas of data storage, machine learning, visualization and interoperability. Some of the solutions above will keep evolving – especially in user experience and documentation – since, as of now, some of them are newborn.
On the other hand, we won't be surprised to see new players, or completely new platforms from the major providers, emerge in the market to meet self-service needs, exponential data growth and the torrent of tools.
Our advice? Stay curious and make sure you educate yourself unceasingly. Just like we do!
*Important note: we are neither rejecting nor advocating any Cloud or other service provider. We leave that choice to the reader, to be made by weighing the pros and cons for their specific needs.