
MLA 012 Docker for Machine Learning Workflows

Machine Learning Guide

OCDevel

Artificial Intelligence, Machine Learning, AI, ML, Education, Courses, Technology, Introduction

4.9 • 848 Ratings

🗓️ 9 November 2020

⏱️ 32 minutes

Summary

Docker enables efficient, consistent machine learning environment setup across local development and cloud deployment, avoiding many pitfalls of virtual machines and manual dependency management. It streamlines system reproduction, resource allocation, and GPU access, supporting portability and simplified collaboration for ML projects. Machine learning engineers benefit from using pre-built Docker images tailored for ML, allowing seamless project switching, host OS flexibility, and straightforward deployment to cloud platforms like AWS ECS and Batch, resulting in reproducible and maintainable workflows.

Traditional Environment Setup Challenges

  • Traditional machine learning development often requires configuring operating systems, GPU drivers (CUDA, cuDNN), and specific package versions directly on the host machine.
  • Manual setup can lead to version conflicts, resource allocation issues, and difficulty reproducing environments across different systems or between local and cloud deployments.
  • Tools like Anaconda and pipenv help manage Python and package versions, but they fall short on system-level dependencies such as CUDA and cuDNN (see the sketch after this list).
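
As a minimal sketch of that gap, the hypothetical Pipfile below pins Python-level packages exactly, yet nothing in it controls which CUDA or cuDNN version the host has installed, so two machines with identical Pipfiles can still behave differently:

    # Pipfile (illustrative) - pins libraries, not the system GPU stack
    [packages]
    torch = "==1.7.0"        # the wheel still expects a compatible CUDA runtime on the host
    tensorflow = "==2.3.1"   # needs specific CUDA/cuDNN versions that pipenv cannot install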

Virtual Machines vs Containers

  • Virtual machines (VMs) like VirtualBox or VMware allow multiple operating systems to run on a host, but they allocate resources (RAM, CPU) up front and have limited access to host GPUs, restricting their usefulness for machine learning tasks.
  • Docker uses containerization to package applications and dependencies; containers share host resources dynamically and can access the GPU directly, which is essential for ML workloads (see the example below).
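
A quick way to see the difference, assuming Docker and Nvidia's container toolkit are installed on the host (the image tag is illustrative; pick one that matches your driver):

    # Run nvidia-smi inside a CUDA container - it reports the host's GPU directly,
    # something a default VirtualBox/VMware guest cannot do.
    docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi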

Benefits of Docker for Machine Learning

  • Dockerfiles describe the entire guest operating system and software environment in code, enabling complete automation and repeatability of environment setup (see the sketch after this list).
  • Containers created from Dockerfiles use only the necessary resources at runtime and avoid interfering with the host OS, making it easy to switch projects, share setups, or scale deployments.
  • GPU support in Docker lets machine learning engineers use their hardware on most host operating systems, with the best results on Windows and Linux machines with Nvidia cards.
  • On Windows, enabling GPU support (as of this recording) requires joining the Windows Insider Dev channel and installing specific Nvidia drivers alongside WSL2 and Nvidia-Docker.
  • Macs are less suitable for GPU-accelerated ML because they ship with AMD rather than Nvidia graphics cards, although workarounds like PlaidML exist.
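
A minimal sketch of such a Dockerfile; the base image tag, package versions, and entry point are assumptions for illustration:

    # Inherit a base image with matching CUDA/cuDNN already installed (tag illustrative).
    FROM nvidia/cuda:11.0-cudnn8-runtime-ubuntu20.04
    # System and Python dependencies are declared in code, not set up by hand.
    RUN apt-get update && apt-get install -y python3 python3-pip
    RUN pip3 install torch==1.7.0 transformers    # versions are assumptions
    WORKDIR /app
    COPY . /app
    CMD ["python3", "train.py"]                   # hypothetical entry point

Building with docker build -t my-ml . and running with docker run --gpus all my-ml reproduces the identical environment on any machine with Docker installed.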

Cloud Deployment and Reproducibility

  • Deploying machine learning models traditionally required manually replicating the environment on cloud servers, such as EC2 instances, a process that is time-consuming and error-prone.
  • With Docker, the same Dockerfile can be used locally and in the cloud (AWS ECS, Batch, Fargate, EKS, or SageMaker), ensuring the deployed environment matches local development exactly; the image is pushed to a registry as shown below.
  • AWS ECS suits long-lived container services, while AWS Batch handles one-off or periodic jobs and offers cost-effective spot instances for GPU workloads.
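
A sketch of shipping a locally built image to AWS via ECR, where ECS or Batch can pull it; the account ID, region, and repository name are placeholders:

    # Create a repository and authenticate Docker against it.
    aws ecr create-repository --repository-name my-ml
    aws ecr get-login-password --region us-east-1 \
      | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
    # Tag and push the same image you built and tested locally.
    docker tag my-ml:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-ml:latest
    docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-ml:latest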

Using Pre-Built Docker Images

  • Docker Hub and Nvidia's NGC registry (nvcr.io) provide pre-built images for ML environments, such as CUDA/cuDNN base images and HuggingFace transformers setups, which can be inherited in custom Dockerfiles (see the example after this list).
  • These images ensure compatibility between key ML libraries (PyTorch, TensorFlow, CUDA, cuDNN) and reduce setup friction.
  • Custom kitchen-sink images, like those in the "ml-tools" repository, offer a turnkey solution for getting started with machine learning in Docker.
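
Inheriting one of these images takes a single FROM line; the image below is HuggingFace's GPU image on Docker Hub, and the extra package is a hypothetical project-specific addition:

    # Build on top of a pre-built transformers + PyTorch + CUDA image.
    FROM huggingface/transformers-pytorch-gpu:latest
    RUN pip install scikit-learn    # project-specific dependencies layer on top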

Project Isolation and Maintenance

  • With Docker, each project gets a fully isolated environment, preventing dependency conflicts and simplifying switching between projects (see the commands after this list).
  • Updates or configuration changes are tracked and versioned in the Dockerfile, maintaining a single source of truth for the entire environment.
  • Modifying the Dockerfile to add dependencies or update versions ensures that local and cloud environments remain synchronized.
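
In practice this means one image per project; the names and paths below are illustrative:

    # Each project builds its own isolated environment from its own Dockerfile.
    docker build -t project-a ./project-a
    docker build -t project-b ./project-b
    # Switching projects is just running a different image; the host stays untouched.
    docker run --rm -it --gpus all -v "$PWD/project-a":/app project-a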

Host OS Recommendations for ML Development

  • Windows is recommended for local development with Docker, offering a better desktop experience and driver support than Ubuntu for most users, particularly on laptops.
  • GPU-accelerated ML is not practical on Macs due to hardware limitations, while Ubuntu is suitable for advanced users comfortable with system configuration and driver management.

Transcript

0:00.0

You're listening to Machine Learning Applied.

0:02.3

In this episode, we're going to talk about Docker. Docker.

0:06.6

Now, Docker is a technology that lets you run software on your computer, an operating system on your computer. An entire operating system, packaged up into a little package, runs inside of what we call the host operating system: the host operating system runs a guest operating system. So if you are developing on Windows or Mac, you can use Docker to run Ubuntu Linux or Windows or Mac inside of your host machine.

0:39.8

Now, you may already be familiar with this concept. If you're not familiar with Docker, you might be familiar with the concept of virtual machines, or virtualization. It's basically the same thing, but technologically different.

0:53.1

A virtual machine does the same thing: it allows you to run an operating system inside of your operating system, Linux inside of your Windows. But it is technologically different in a few ways that are important to us as developers. The first way that's important to us is that it has limited access to the host's resources. Namely, it can't access the GPU, which we as machine learning engineers use all the time for machine learning projects.

1:22.3

And another big pain about virtual machines is that you specify what resources they allocate and use up front. So you tell a virtual machine like VirtualBox or VMware, these are two popular virtual machine technologies, that it's going to be using 5 gigs of RAM and 2 CPUs, in advance. And now it has those allocated: it pulls those aside for its own use, and you, your host machine, can no longer access those resources, which sucks.

...
