A Brief History of Microservices

The history of microservices is one of continuing efforts to provide better communication between different platforms, greater simplicity, and more user-friendly systems. Microservices are typically thought of as a software development technique that organizes an application as a group of loosely coupled services, but the term covers any kind of small service that interacts with other services to operate a software application. Microservices architecture uses fine-grained services and lightweight protocols.

Dr. Peter Rodgers used the term “Micro-Web-Services” in 2005 during a presentation on cloud computing. Rodgers argued against conventional thinking and promoted software components supporting micro-web-services. In the presentation, he established a functional model of microservices which eventually became a reality. Explaining how complex service-assemblies work behind simple URI interfaces, he suggested, “Any service, at any granularity, can be exposed.” He went on to describe how a well-organized micro-web-services platform “applies the underlying architectural principles of the Web and REST services together with Unix-like scheduling and pipelines to provide radical flexibility and improved simplicity in service-oriented architectures.”

Enterprise JavaBeans to Service-Oriented Architecture

IBM released Enterprise JavaBeans (EJB) in 1997. It was one of the earliest efforts to provide a “small” service, or microservice, working with web-related software components. EJBs were designed to allow developers to write code in a standardized way while many common issues were handled automatically. They were the first specification to offer an easier way to “encapsulate and re-use” business logic for enterprise Java applications.

However, there were still frustrating issues with Enterprise JavaBeans. They could only be used while working in Java and could not communicate with other systems. They remained complicated, not terribly user-friendly, and extremely difficult to debug. The limitation of working only with Java, without being able to extend communications to other platforms, brought about the solution known as service-oriented architecture (SOA), which became the next evolutionary step toward microservices.

A service, in the SOA context, is a “self-contained” software program performing a specific task. SOA allows these services to communicate with one another across different languages and platforms through loose coupling. (Loose coupling continues to be used for web services.) Loose coupling describes the ability of a client to remain independent of the service it requires: two platforms can communicate even if they are not closely related, which is accomplished through a special interface capable of translating the data. (Some consider microservices to be a subdivision of SOA, but SOA and microservices each have their own strengths and weaknesses and can be categorized as separate systems.) SOA services are built primarily on the Simple Object Access Protocol (SOAP).

Simple Object Access Protocol (SOAP)

SOAP was essential to the development of microservices. The philosophy “Do the Simplest Thing Possible” is at the heart of SOAP, released in 1999 by Microsoft. SOAP can be described as a way to invoke object methods over HTTP (Hypertext Transfer Protocol). It is a way to transfer small amounts of information, or messages, over the Internet. These messages use an XML format and are usually sent using HTTP. The use of HTTP, the protocol employed by web pages, has the advantage of bypassing most network firewalls. (Firewalls typically don’t block HTTP traffic, allowing most SOAP messages to pass through with no problems.) This solution took advantage of two trends in the computing world during the early 2000s:

  • The growing trend for using HTTP in corporate networks.
  • The support provided by automatic logging and debugging mechanisms for text-based Internet communications.

SOAP messages are contained within an “envelope,” which includes a header and a body. The header may contain the message’s ID and the sending date, while the body holds the actual message. Because every SOAP message uses the same format, the messages are compatible with a variety of different protocols and operating systems. For instance, a SOAP message can be sent to a Unix-based web server from a Windows XP computer without concern that the message will be distorted. The Unix server can then open the file or redirect the message to the most appropriate location.
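
As a rough illustration of the envelope structure (the endpoint URL, SOAPAction header, and message body below are hypothetical placeholders, not a real service), a SOAP request is simply an XML document posted over HTTP; in Python it might look like this:

import requests

# A minimal, hypothetical SOAP 1.1 request; only the envelope namespace is standard.
soap_envelope = """<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Header>
    <MessageID>12345</MessageID>
  </soap:Header>
  <soap:Body>
    <GetPrice xmlns="http://example.com/stock">
      <Ticker>ACME</Ticker>
    </GetPrice>
  </soap:Body>
</soap:Envelope>"""

response = requests.post(
    "http://example.com/soap-endpoint",   # hypothetical service URL
    data=soap_envelope.encode("utf-8"),
    headers={
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": "http://example.com/stock/GetPrice",
    },
)
print(response.status_code)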

Though SOAP has its own strengths, it does not scale well, nor does it support error handling, two things microservices handle automatically. Additionally, though SOA and SOAP started off simply enough, efforts to expand their usefulness by adding layers made them clumsy and difficult to work with.

SOA continues to be used as a basic tool, and there are continuing efforts to make SOAP more efficient.

REST

As a whole, the computer tech community began to reject the concepts required by SOAP roughly between 2005 and 2007, shifting toward Representational State Transfer (REST) between 2008 and 2010 as it gained popularity. This architectural style defines constraints used to create web services and supports use of the cloud to develop software applications. Web services conforming to REST’s architectural style are referred to as RESTful web services, and such services support interoperability between computer systems communicating over the Internet.

REST over HTTP is a very common practice in modern microservices development. A RESTful API is an application program interface (API) that uses HTTP requests to POST, GET, PUT, and DELETE data. Such APIs are portable, scalable, easy to use, and easy to integrate. An API is the code that supports communication between two software programs.

Roy Fielding worked on developing the REST architectural style from 1996 to 1999, and he described it in his 2000 PhD dissertation, Architectural Styles and the Design of Network-based Software Architectures. Fielding was not an advocate for rigidness, but for flexibility. A central idea of REST is that for each piece of data the URL stays the same, while the HTTP method used determines the operation performed. For example, a GET request to “http://theirsite.com/posts” might return a list of posts, while a POST request to the same URL would create a new post. Asked about the development of REST, Fielding responded,
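
To make that concrete, here is a minimal sketch using Python’s requests library against the hypothetical URL from the example above; the field names are illustrative only:

import requests

BASE = "http://theirsite.com/posts"   # hypothetical URL from the example above

# GET: retrieve the collection of posts
posts = requests.get(BASE).json()

# POST: create a new post at the same URL
created = requests.post(BASE, json={"title": "Hello", "body": "First post"})
print(created.status_code)            # e.g. 201 if the API follows REST conventions

# PUT and DELETE would update or remove an individual post,
# typically addressed as http://theirsite.com/posts/<id>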

“Throughout the HTTP standardization process, I was called on to defend the design choices of the Web. That is an extremely difficult thing to do within a process that accepts proposals from anyone on a topic that was rapidly becoming the center of an entire industry. I had comments from well over 500 developers, many of whom were distinguished engineers with decades of experience, and I had to explain everything from the most abstract notions of Web interaction to the finest details of HTTP syntax. That process honed my model down to a core set of principles, properties, and constraints that are now called REST.”

REST is often used with microservices.

Microservices Become a Reality

In May 2011, participants at a workshop for software architects near Venice used the word “microservice” to describe an architectural style several of them had recently been exploring. In May 2012, the group decided microservices was the most appropriate name for the work they were doing and formally adopted it. They had been experimenting with building continuously deployed systems while incorporating the DevOps philosophy. This form of architecture quickly gained popularity.

DevOps is a series of software development practices which combine development (Dev) and operations (Ops) in an effort to shorten the development life cycle. Microservices architecture effectively brings both departments together, allowing them to produce applications that are more reliable and have better performance and greater resilience. As a process, it allows developers to provide updates more often, ensuring the software stays aligned with the business’s objectives. Microservices architectures have quickly become the standard for DevOps. These services are especially useful for continuously releasing software.

The benefits of microservices architecture include:

  • Support: Microservices have become popular and are supported by many technology and cloud vendors. With the combination of efficiency and support, many businesses are adopting microservices because of their easy availability.
  • Language Independence: Microservices can work with almost any technology with only minor tweaks.
  • DevOps: The microservices architecture is designed to support DevOps philosophies. In turn, DevOps helps guide the development of applications based on microservices.
  • Resources Are Not Shared: Earlier service architecture designs focused on shared resources. Shared resources tended to make the work unnecessarily complicated. The architectural design of microservices allows each service to be encapsulated. This provides developers with the ability to tweak the microservices without influencing other services.

The Future of Microservices

The Cloud Microservices Market Research Report published in February 2020 predicted that the global microservices architecture market will grow at a compound annual growth rate of 21.37 percent from 2019 to 2026, reaching a value of $3.1 billion by 2026.

Source: https://www.dataversity.net/a-brief-history-of-microservices/

Adventures in MLOps with Github Actions, Iterative.ai, Label Studio and NBDEV

This article documents the authors’ experience building their custom MLOps approach.


By Aaron Soellinger & Will Kunz

When designing the MLOps stack for our project, we needed a solution that allowed for a high degree of customization and the flexibility to evolve as our experimentation dictated. We considered large platforms that encompass many functions, but found them limiting in some key areas. Ultimately we decided on an approach where separate specialized tools were implemented for labeling, data versioning, and continuous integration. This article documents our experience building this custom MLOps approach.




NBDEV

(Taken from https://github.com/fastai/nbdev)

The classic problem with using Jupyter for development was that moving from prototype to production required copying and pasting code from a notebook into a Python module. NBDEV automates the transition between notebook and module, enabling the Jupyter notebook to be an official part of a production pipeline. NBDEV allows the developer to state which module a notebook should create, which notebook cells to push to the module, and which notebook cells are tests. A key capability of NBDEV is its approach to testing within the notebooks, and the NBDEV template even provides a base Github Action to implement testing in the CI/CD framework. The resulting Python module requires no editing by the developer and can easily be integrated into other notebooks or the project at large using Python’s built-in import functionality.
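
As a rough sketch of what those annotations look like (the module name and function below are made up, and the comment-style directives shown are the nbdev v1 syntax that was current when this article was written; newer releases use a #| prefix instead):

# cell 1 -- declare which module this notebook generates
# default_exp preprocessing

# cell 2 -- exported into the generated module
#export
def normalize(values):
    "Scale a list of numbers into the 0-1 range."
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# cell 3 -- no export flag, so it acts as a test that runs in the
# notebook and in the CI test suite
assert normalize([0, 5, 10]) == [0.0, 0.5, 1.0]

Running the nbdev build command (nbdev_build_lib in v1, nbdev_export in later versions) then writes the exported cells into the Python module, while the un-exported test cells are what the provided Github Action executes.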

Iterative.ai: DVC/CML

(Taken from https://iterative.ai/)

The files used in machine learning pipelines are often large archives of binary or compressed files, which existing version control solutions like git either cannot handle or can only handle at prohibitive cost. DVC solves data versioning by representing large datasets as a hash of the file contents, which enables DVC to track changes. It works similarly to git (e.g., dvc add, dvc push). When you run dvc add on your dataset, it gets added to the .gitignore and tracked for changes by DVC. CML is a project that provides functionality for publishing model artifacts from Github Actions workflows into comments attached to Github Issues, pull requests, and so on. That is important because it helps fill in the gaps in pull requests, accounting for training data changes and the resulting model accuracy and effectiveness.
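
A minimal sketch of that flow looks like the following (the dataset path and S3 bucket name are placeholders):

# one-time setup: initialize DVC and point it at remote storage
dvc init
dvc remote add -d storage s3://my-bucket/dvc-cache

# track a large dataset; DVC writes data/raw_labels.dvc and adds the
# actual files to .gitignore so git only versions the small pointer file
dvc add data/raw_labels
git add data/raw_labels.dvc .gitignore
git commit -m "Track raw labels with DVC"

# upload the data to the remote cache so collaborators can dvc pull it
dvc push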

Github Actions

(Taken from https://github.com/features/actions)

We want automated code testing, including building models in the automated testing pipeline. Github Actions competes with CircleCI, Travis, and Jenkins to automate testing around code pushes, commits, pull requests, and so on. Since we’re already using Github to host our repos, using Actions avoids adding another third-party app. In this project we need to use Github self-hosted runners to run jobs on an on-prem GPU cluster.
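
For reference, a pared-down workflow along these lines might look as follows; the file path, job name, and test command are illustrative, and the key detail is runs-on pointing at a self-hosted (GPU-labeled) runner rather than a Github-hosted one:

# .github/workflows/test.yml -- illustrative sketch, not the project's actual workflow
name: tests
on:
  pull_request:
  push:
    branches: [main]
jobs:
  test:
    runs-on: [self-hosted, gpu]   # route the job to the on-prem GPU runner
    steps:
      - uses: actions/checkout@v2
      - name: Run test suite
        run: make test            # placeholder for the real test command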

Label Studio

(Taken from https://labelstud.io/)

We did a deep dive into how we’re using Label Studio, which can be found here. Label Studio is a solution for labeling data. It works well and is flexible enough to run in a variety of environments.

Why use them together?

The setup is designed to deploy models faster. That means more data scientists working harmoniously in parallel, transparency in the repository, and faster onboarding for new people. The goal is to standardize the types of activities data scientists need to do in the project and provide clear instructions for them.

The following is a list of tasks we want to streamline with this system design:

  1. Automate the ingest of data from Label Studio and provide a single point for feeding it into the model training and evaluation activities.
  2. Automate testing of the data pipeline code, that is, unit testing and re-deployment of the containers used by the process.
  3. Automate testing of the model code, that is, unit testing and re-deployment of the containers used by the process.
  4. Enable automated testing to include model re-training and evaluation criteria. When the model code changes, train a model with the new code and compare it to the existing incumbent model.
  5. Trigger model retraining when the training data changes.

Below is a description of the pipeline for each task.

Traditional CI/CD Pipeline

This pipeline implements automated testing feedback for each pull request, including evaluation of syntax, unit, regression, and integration tests. The outcome of this process is a functionally tested docker image pushed to our private repository. This process maximizes the likelihood that the latest best code is in a fully tested image available in the repository for downstream tasks. Here’s how the developer lifecycle works in the context of a new feature:



Here we show how the workflow functions while editing the code. Using NBDEV enables us to work directly from the Jupyter notebooks, including writing the tests directly in the notebook. NBDEV requires that all the cells in the notebooks run without exception (unless the cell is flagged not to run). (Image by Author)

Data Pipeline

Label Studio currently lacks event hooks for triggering updates when the stored label data changes, so we take a cron-triggered approach, updating the dataset every hour. Additionally, as long as the Label Studio training dataset is small enough, the update can also be done as part of the training pipeline. We also have the ability to trigger the data pipeline refresh on demand using the Github Actions interface.
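
The trigger portion of such a workflow is only a few lines of configuration (a sketch; the workflow name and refresh step are placeholders). The schedule block provides the hourly cron trigger, and workflow_dispatch is what enables the on-demand runs from the Actions UI:

# Illustrative trigger block for the hourly data refresh workflow
name: refresh-dataset
on:
  schedule:
    - cron: "0 * * * *"    # hourly, though Github does not guarantee exact timing
  workflow_dispatch:       # allows manual, on-demand runs from the Actions UI
jobs:
  refresh:
    runs-on: [self-hosted]
    steps:
      - uses: actions/checkout@v2
      - name: Refresh dataset
        run: echo "placeholder for the Label Studio export and DVC update steps"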



The data pipeline feeds from Label Studio, and persists every version of the dataset and relevant inputs to the DVC cache stored in AWS S3. (Image by Author)

Model Pipeline

The modeling pipeline integrates model training into the CI/CD pipeline for the repository. This enables each pull request not only to evaluate the syntax, unit, integration, and regression tests configured on the codebase, but also to provide feedback that includes an evaluation of the newly resulting model.



The workflow in this case runs the model training experiment specified in the configuration file (model_params.yaml) and updates the model artifact (best-model.pth). (Image by Author)
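
One way to wire the “compare against the incumbent” check into CI is a small gate script that fails the job when the new model regresses. This is a hypothetical sketch, not the project’s actual code; the metric file paths, metric name, and tolerance are made up:

# compare_models.py -- hypothetical CI gate for the model pipeline
import json
import sys

with open("metrics/incumbent.json") as f:    # metrics of the current best model
    incumbent = json.load(f)
with open("metrics/candidate.json") as f:    # metrics of the freshly trained model
    candidate = json.load(f)

# assume higher accuracy is better; the tolerance guards against run-to-run noise
TOLERANCE = 0.005
if candidate["accuracy"] + TOLERANCE < incumbent["accuracy"]:
    print(f"Candidate regressed: {candidate['accuracy']:.4f} < {incumbent['accuracy']:.4f}")
    sys.exit(1)    # a non-zero exit code fails the Github Actions job
print("Candidate is at least as good as the incumbent.")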

Benchmark Evaluation Pipeline

The benchmarking pipeline forms an “official submission” process to ensure all modeling activities are measured against the metrics of the project.



The newly trained model in best-model.pth is evaluated against the benchmark dataset and the results are tagged with the latest commit hash and persisted in AWS S3. (Image by Author)
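
A sketch of how the tagging-and-persisting step might look is below; it is hypothetical (the bucket name and file paths are placeholders), and in the real pipeline the artifacts flow through the DVC cache rather than a hand-rolled upload:

# Hypothetical sketch: persist benchmark metrics keyed by the current commit hash.
import json
import subprocess
import boto3

commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode().strip()

with open("metrics/benchmark.json") as f:    # placeholder path for the evaluation output
    metrics = json.load(f)

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-mlops-artifacts",              # placeholder bucket
    Key=f"benchmarks/{commit}/metrics.json",  # results tagged with the commit hash
    Body=json.dumps(metrics).encode("utf-8"),
)
print(f"Uploaded benchmark results for commit {commit}")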

Workflow

Here is the DAG definition file that is used by DVC. It captures the workflow steps and their inputs, and allows for reproducibility across users and machines.

stages:
  labelstudio_export_trad:
    cmd: python pipelines/1_labelstudio_export.py --config_fp pipelines/traditional_pipeline.yaml --ls_token *** --proj_root "."
    params:
      - pipelines/traditional_pipeline.yaml:
          - src.host
          - src.out_fp
          - src.proj_id
  dataset_create_trad:
    cmd: python pipelines/2_labelstudio_todataset.py --config_fp pipelines/create_traditional.yaml --proj_root "."
    deps:
      - data/raw_labels/traditional.json
    params:
      - pipelines/create_traditional.yaml:
          - dataset.bmdata_fp
          - dataset.labels_map
          - dataset.out_fp
          - dataset.rawdata_dir
  train_model_trad:
    cmd: python pipelines/3_train_model.py --config_fp pipelines/model_params.yaml --proj_root "."
    deps:
      - data/traditional_labeling
    params:
      - pipelines/model_params.yaml:
          - dataloader.bs
          - dataloader.size
          - dataloader.train_fp
          - dataloader.valid_fp
          - learner.backbone
          - learner.data_dir
          - learner.in_checkpoint
          - learner.metrics
          - learner.n_out
          - learner.wandb_project_name
          - train.cycles
  labelstudio_export_bench:
    cmd: python pipelines/1_labelstudio_export.py --config_fp pipelines/benchmark_pipeline.yaml --ls_token *** --proj_root "."
    params:
      - pipelines/benchmark_pipeline.yaml:
          - src.host
          - src.out_fp
          - src.proj_id
  dataset_create_bench:
    cmd: python pipelines/2_labelstudio_todataset.py --config_fp pipelines/create_benchmark.yaml --proj_root "."
    deps:
      - data/raw_labels/benchmark.json
    params:
      - pipelines/create_benchmark.yaml:
          - dataset.bmdata_fp
          - dataset.labels_map
          - dataset.out_fp
          - dataset.rawdata_dir
  eval_model_trad:
    cmd: python pipelines/4_eval_model.py --config_fp pipelines/bench_eval.yaml --proj_root "."
    deps:
      - data/models/best-model.pth
    params:
      - pipelines/bench_eval.yaml:
          - eval.bench_fp
          - eval.label_config
          - eval.metrics_fp
          - eval.model_conf
          - eval.overlay_dir
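
With this file in place (conventionally named dvc.yaml), the graph can be reproduced, or selectively re-run when dependencies change, from the command line:

dvc repro                    # run all stages whose dependencies or params changed
dvc repro train_model_trad   # re-run a single stage plus its upstream dependencies
dvc dag                      # print the dependency graph of the stages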

Findings

  1. The Github Actions workflow cron trigger is not extremely reliable; it does not guarantee timing.
  2. DVC does not work in a clear manner inside a Github Actions workflow that is triggered on push. It alters the trackers that are source controlled, and when that change is committed it triggers yet another Github Actions run.
  3. Using Github Actions orchestration to run model training requires a self-hosted runner in order to use a GPU. This means connecting to a GPU instance in the cloud or on-prem, and that presents issues with access control. For example, we can’t open source the exact repo without removing the self-hosted runner configuration from it, or else random people would be able to run workloads on our training server by pushing code to the project.
  4. The NBDEV built-in workflow tests the code in the wrong place: it tests the notebook instead of the compiled package. On the one hand, it’s nice to be able to say that the tests can be written right into the notebook. On the other hand, testing the notebooks directly leaves open the possibility that the code package created by NBDEV fails even though the notebook ran. What we need is the ability to test the NBDEV-compiled package directly.
  5. NBDEV doesn’t interoperate with “traditional” Python development in the sense that NBDEV is a one-way street. It only allows the project to be developed in the interactive Jupyter notebook style and makes it impossible to develop the Python modules directly. If at any point the project were converted to “traditional” Python development, testing would need to be accomplished another way.
  6. In the beginning we used Weights & Biases as our experiment tracking dashboard, but there were issues deploying it inside a Github Action. The user experience of implementing wandb hit its first hiccup in the Actions workflow, and removing Weights & Biases resolved the problem straight away. Before that, wandb stood out as the best user experience in MLOps.

Conclusions

Ultimately, it took one week to complete the implementation of these tools for managing our code with Github Actions, Iterative.ai tools (DVC & CML) and NBDEV. This provides us with the following capabilities:

  1. Work from Jupyter notebooks as the system of record for the code. We like Jupyter. The main use case it serves is to let us work directly on any hardware we can SSH into, by hosting a Jupyter server there and forwarding it to a desktop. To be clear, we would be doing this even if we were not using NBDEV, because the alternative is using Vim or some such tool that we don’t like as much. Past experiments connecting to remote servers with VS Code or Pycharm failed, so it’s Jupyter.
  2. Testing the code, and testing the model it creates. Now, as part of the CI/CD pipeline, we can evaluate whether the model resulting from changes to the repo is better, worse, or unchanged. This is all available in the pull request before it is merged into main.
  3. Using the Github Actions server as an orchestrator for training runs begins to allow multiple data scientists to work simultaneously in a clearer manner. Going forward, we will see the limitations of this setup for orchestrating the collaborative data science process.

 
Aaron Soellinger formerly worked as a data scientist and software engineer solving problems in finance, predictive maintenance, and sports. He currently works as a machine learning systems consultant with Hoplabs on a multi-camera computer vision application.

Will Kunz is a back end software developer, bringing a can-do attitude and dogged determination to challenges. It doesn’t matter if it’s tracking down an elusive bug or adapting quickly to a new technology. If there’s a solution, Will wants to find it.

Original. Reposted with permission.


Source: https://www.kdnuggets.com/2021/09/adventures-mlops-github-actions-iterative-ai-label-studio-and-nbdev.html

The Machine & Deep Learning Compendium Open Book

After years in the making, this extensive and comprehensive ebook resource is now available and open for data scientists and ML engineers. Learn from and contribute to this tome of valuable information to support all your work in data science from engineering to strategy to management.


By Ori Cohen, AI/ML/DL Expert, Researcher, Data Scientist.

Partial Topic List From The Machine & Deep Learning Compendium.

Nearly a year ago, I announced the Machine & Deep Learning Compendium, a Google document that I have been writing for the last 4 years. The ML Compendium contains over 500 topics, and it is over 400 pages long.

Today, I’m announcing that the Compendium is fully open. It is now a project on GitBook and GitHub (please star it!). I believe in knowledge sharing, and the Compendium will always be free to everyone.

I see this compendium as a gateway, as a frequently visited resource for people of various proficiency levels, for industry data scientists, and academics. The compendium will save you countless hours googling and sifting through articles that may not give you any value.

The Compendium includes around 500 topics that contain various summaries, links, and articles that I have read on numerous topics that I found interesting or that I had needed to learn. It includes the majority of modern machine learning algorithms, statistics, feature selection and engineering techniques, deep-learning, NLP, audio, deep and classic vision, time series, anomaly detection, graphs, experiment management, and much more. In addition, strategic topics, such as data science management and team building, are highlighted as well as other essential topics, such as product management, product design, and a technology stack from a data science perspective.

Please keep in mind that this is a perpetual work in progress on a variety of topics. If you feel that something should be changed, you can now easily contribute using GitBook, GitHub, or contact me.

GitBook

The ML Compendium is a project on GitBook, which means that you can contribute as a GitBook writer. Writing and editing content using the internal editor is easy and intuitive, especially compared to the more advanced option of contributing via GitHub pull requests.

You can visit the mlcompendium.com website or directly access the compendium “book.” As seen in Figure 1, the main topics appear on the left and the sub-topics within each main topic on the right. The search feature is also far more capable than the old method of using CTRL-F inside the original document.

Figure 1: The Machine & Deep Learning Compendium with the GitBook UI.

The following are two topics that may interest you, the natural language processing (NLP) page, as seen in Figure 2, and the deep neural nets (DNN) page, as seen in Figure 3.

Figure 2: Natural Language Processing.

Figure 3: Deep Neural Nets.

GitHub

Alternatively, you can contribute content using GitHub (Figure 4): place the content within the proper topic, then create a pull request from a new branch. Finally, don’t forget to ‘Star’ the project if you like it.

The following is a simple set of instructions for contributing using GitHub:

  1. git clone https://github.com/orico/www.mlcompendium.com.git
  2. git branch mybranch
  3. git switch mybranch
  4. add your content
  5. git add the-edited-file
  6. git commit -m "my content"
  7. git push
  8. create a PR by visiting this link: https://github.com/orico/stateofmlops/pull/new/mybranch

Figure 4: The mlcompendium.com GitHub project.


Original. Reposted with permission.

Bio: Dr. Ori Cohen has a Ph.D. in Computer Science with a focus on machine learning. He is the author of the ML & DL Compendium and the StateOfMLOps.com. He is a lead data scientist at New Relic TLV, doing machine and deep learning research in the field of AIOps & MLOps.


Source: https://www.kdnuggets.com/2021/09/machine-deep-learning-open-book.html
