Data Science
24 Topics

Bring Vision to Life with Three Horizons, Data Mesh, Data Lakehouse, and Azure Cloud Scale Analytics
Bring Vision to Life with Three Horizons, Data Mesh, Data Lakehouse, and Azure Cloud Scale Analytics – Plus some bonus concepts! I have not posted in a while, so this post is loaded with ideas and concepts to think about. I hope you enjoy it!

The structure of the post is a chronological perspective of four recent events in my life: 1) camping on the Olympic Peninsula in WA state, 2) installation of new windows and external doors in my residential house, 3) injuring my back (includes a metaphor for how things change over time), and 4) camping at Kayak Point in Stanwood, WA (where I finished writing this). Along with this series of events bookended by camping trips, I also wanted to mention May 1st, which was International Workers' Day (celebrated as Labor Day in September in the US and Canada).

To reach the vision of digital transformation through cloud scale analytics, we need many more workers (Architects, Developers, DBAs, Data Engineers, Data Scientists, Data Analysts, Data Consumers) and the support of many managers and leaders. Leadership is required so analytical systems can become more distributed and properly staffed to scale, versus the centralized, small specialist teams that do not scale. Analytics could be a catalyst for employment with the accelerated building and operating of analytical systems. There is evidence that the structure of the teams working on these analytical systems will need to be more distributed to scale to the level of growth required. When focusing on data management, Data Mesh strives to be more distributed, and Data Lakehouse supports distributed architectures better than the analytical systems of the past. I am optimistic that cloud-based analytical systems supported by these distributed concepts can scale and progress to meet the data management, data engineering, data science, data analysis, and data consumer needs and requirements of many organizations.
A Data Science Process, Documentation, and Project Template You Can Use in Your Solutions

In most of the Data Science and AI articles, blogs, and papers I read, the focus is on a particular algorithm or math angle to solving a puzzle. And that's awesome - we need LOTS of those. However, even if you figure those out, you have to use them somewhere. You have to run that on some sort of cloud or local system, you have to describe what you're doing, you have to distribute an app, import some data, check a security angle here and there, communicate with a team... you know, DevOps. In this article, I'll show you a complete process, procedures, and free resources to manage your Data Science project from beginning to end.
Using Microsoft R in your Solutions

R is a powerful data language with thousands of "packages" allowing you to extend its uses for Data Science, Advanced Analytics, Machine Learning, and much more. Microsoft has enhanced this language with a distribution called "Microsoft R Open". Read on to learn more about this powerful tool you can use stand-alone, or embedded in several Microsoft products.
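The article itself focuses on the Microsoft R Open distribution, but a quick sketch may help illustrate the "embedded" use case: calling R from a Python host process. The rpy2 bridge used here is my own assumption, not something the article prescribes, and it requires an R installation (such as Microsoft R Open) on the machine.

```python
# Minimal sketch of embedding R in a Python solution via rpy2.
# Assumes an R installation and the rpy2 package are present;
# neither is mandated by the article.
import rpy2.robjects as robjects

# Evaluate a small piece of R code and pull the result into Python.
r_mean = robjects.r("mean(c(1, 2, 3, 4, 5))")
print(r_mean[0])  # 3.0

# Look up an R function object and call it directly from Python.
rnorm = robjects.r["rnorm"]
samples = rnorm(5)  # five draws from a standard normal
print(list(samples))
```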
DevOps for Data Science – Part 10 - Automated Testing and Scale

The final level in the DevOps Maturity Model is Load Testing and Auto-Scale. Note that you want to follow this progression – there's no way to do proper load testing if you aren't automatically integrating the Infrastructure as Code, CI, CD, RM, and APM phases. The reason is that the automatic balancing you'll do depends on the automation that precedes it – there's no reason to scale something that you're about to change.
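As a purely illustrative example of the load-testing half of this level, here is a minimal sketch that fires concurrent requests at a scoring endpoint and reports basic latency and error numbers. The endpoint URL, worker count, and request count are hypothetical placeholders; a real load test would use dedicated tooling.

```python
# Minimal load-testing sketch, not the post's own tooling.
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

import requests

ENDPOINT = "https://example.com/api/score"  # hypothetical scoring endpoint
WORKERS = 20
REQUESTS = 200

def timed_call(_):
    """Issue one request and return (latency_seconds, status_code)."""
    start = time.perf_counter()
    resp = requests.get(ENDPOINT, timeout=10)
    return time.perf_counter() - start, resp.status_code

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = list(pool.map(timed_call, range(REQUESTS)))

latencies = [t for t, _ in results]
errors = sum(1 for _, code in results if code != 200)
print(f"avg {mean(latencies):.3f}s  max {max(latencies):.3f}s  errors {errors}/{REQUESTS}")
```

Numbers like these are what an auto-scale rule would key off, which is why the preceding automation has to be in place first.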
DevOps for Data Science – Part 9 - Application Performance Monitoring

In this series on DevOps for Data Science, I've explained the concept of a DevOps "Maturity Model" – a list of things you can do, in order, which will set you on the path for implementing DevOps in Data Science. The first thing you can do in your projects is to implement Infrastructure as Code (IaC), and the second thing to focus on is Continuous Integration (CI). However, to set up CI, you need to have as much automated testing as you can – and in the case of Data Science programs, that's difficult to do. From there, the next step in the DevOps Maturity Model is Continuous Delivery (CD). Once you have that maturity level down, you can focus on Release Management. And now we're off to the next Maturity Level: Application Performance Monitoring (APM).
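To make the APM idea concrete, here is a minimal instrumentation sketch, assuming you forward the resulting log records to whatever monitoring backend you use; the decorator and function names are hypothetical, not taken from the series.

```python
# Minimal application-level instrumentation sketch: record the
# duration and outcome of each call for later analysis by an APM tool.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("apm")

def monitored(fn):
    """Log how long each call took and whether it succeeded."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            log.info("%s ok in %.4fs", fn.__name__, time.perf_counter() - start)
            return result
        except Exception:
            log.exception("%s failed after %.4fs", fn.__name__, time.perf_counter() - start)
            raise
    return wrapper

@monitored
def score_batch(rows):
    return [r * 2 for r in rows]  # stand-in for real model scoring

score_batch([1, 2, 3])
```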
DevOps for Data Science – Part 8 - Release Management

Release Management (RM), as a concept, is essentially what it says – determining a method of releasing new and changed software into an environment in a planned fashion. While this sounds simple, it actually takes quite a bit of forethought and planning, and involves not only the technical teams, but several business teams as well. RM is slightly different based on whether you are selling the software you are writing. But in any case, it comes down to the business function of planning. So how does that affect the Data Science team?
DevOps for Data Science - Part 5 - Infrastructure as Code

Right away I may have put a few Data Scientists off. No, it isn't that they can't set up a server or components, it's just that this isn't normally their job. However, the software, hardware configuration (virtual or otherwise), containers (if you use those, and yes, you should), Python environments, R libraries, and many other parameters affect the experiment. It's essential that you're able to duplicate all of that and store it in a source-control system so that you can re-create it for testing, deployment, and the downstream phases. Read on for more.
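As one narrow, illustrative slice of that idea, the sketch below snapshots the exact Python packages an experiment ran against into a lock file you can commit to source control and replay later. The file name and approach are my own assumptions, not the post's prescription.

```python
# Sketch: capture the experiment's Python environment as a committable
# lock file, one small piece of Infrastructure as Code.
import sys
from importlib.metadata import distributions

LOCK_FILE = "environment.lock"  # arbitrary name for this sketch

with open(LOCK_FILE, "w") as fh:
    fh.write(f"# python {sys.version.split()[0]}\n")
    for dist in sorted(distributions(), key=lambda d: d.metadata["Name"].lower()):
        fh.write(f"{dist.metadata['Name']}=={dist.version}\n")

print(f"wrote {LOCK_FILE}; commit it alongside the experiment code")
```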
Creating a Kubernetes Application for Azure SQL Database

Modern application development has multiple challenges. From selecting a "stack" of technologies, front end through data storage and processing, from several competing standards, to ensuring the highest levels of security and performance, developers must ensure the application scales, performs well, and is supportable on multiple platforms. For that last requirement, bundling the application into Container technologies such as Docker and deploying multiple Containers onto the Kubernetes platform is now de rigueur in application development. In this example, we'll explore using Python, Docker Containers, and Kubernetes - all running on the Microsoft Azure platform. Using Kubernetes also gives you the flexibility of using local environments or even other clouds for a seamless and consistent deployment of your application, and allows for multi-cloud deployments for even higher resiliency. We'll also use Microsoft Azure SQL Database for a service-based, scalable, highly resilient, and secure environment for data storage and processing. In fact, in many cases other applications are already using Microsoft Azure SQL Database, and this sample application can be used to further leverage and enrich that data. This example is fairly comprehensive in scope, but it uses the simplest application, database, and deployment to illustrate the process. You can adapt this sample to be far more robust, even leveraging the latest technologies for the returned data. It's a useful learning tool to create a pattern for other applications.
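As a hedged sketch of the data-access core such an application might contain (not the article's actual code), the snippet below connects to Azure SQL Database with pyodbc, reading connection details from environment variables, which maps naturally onto Kubernetes Secrets. Every name here is a placeholder.

```python
import os

import pyodbc  # requires the Microsoft ODBC Driver for SQL Server in the host/image

# All values below are hypothetical placeholders; in Kubernetes these
# environment variables would typically be populated from a Secret.
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    f"Server=tcp:{os.environ['SQL_SERVER']},1433;"
    f"Database={os.environ['SQL_DB']};"
    f"Uid={os.environ['SQL_USER']};"
    f"Pwd={os.environ['SQL_PASSWORD']};"
    "Encrypt=yes;TrustServerCertificate=no;"
)

conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
cursor.execute("SELECT @@VERSION;")  # trivial round trip to prove connectivity
print(cursor.fetchone()[0])
conn.close()
```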
DevOps for Data Science – Part 7 - Continuous Delivery

To fully embrace DevOps in Data Science, you can start by implementing Infrastructure as Code (IaC), and the second thing to focus on is Continuous Integration (CI) and Testing. The next step in the DevOps Maturity Model is Continuous Delivery (CD). There's some discussion we need to cover here, since the definitions of DevOps and Continuous Delivery are quite similar, and to some, CD doesn't belong "under" DevOps. Both DevOps and CD involve an agile mindset of releasing smaller, faster, and automated bits of code into the process rather than waiting for several changes to integrate at once.
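One small, automated bit of that kind is a post-deployment smoke test a CD pipeline might run before promoting a release. The sketch below is illustrative only, not the series' tooling, and the health-check URL is a hypothetical placeholder; a nonzero exit code signals the pipeline to halt the release.

```python
# Minimal post-deployment smoke test: exit nonzero so the CD pipeline
# stops the release if the newly deployed service isn't healthy.
import sys

import requests

ENDPOINT = "https://example.com/api/health"  # hypothetical health check

try:
    resp = requests.get(ENDPOINT, timeout=5)
except requests.RequestException as exc:
    print(f"smoke test failed: {exc}")
    sys.exit(1)

if resp.status_code != 200:
    print(f"smoke test failed: HTTP {resp.status_code}")
    sys.exit(1)

print("smoke test passed; safe to promote the release")
```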