7 Containerization
7.1 Containerization with Docker
As your data science projects grow more complex, you may encounter the “it works on my machine” problem—where code runs differently in different environments. Containerization solves this by packaging your code and its dependencies into a standardized unit called a container. Building on the environment management concepts we covered in the introductory chapter (conda environments for Python, renv for R), containers take isolation to the next level by packaging the entire runtime environment.
7.1.1 Why Containerization for Data Science?
Containerization offers several advantages for data science:
- Reproducibility: Ensures your analysis runs the same way everywhere
- Portability: Move your environment between computers or cloud platforms
- Dependency Management: Isolates project dependencies to avoid conflicts
- Collaboration: Easier sharing of complex environments with colleagues
- Deployment: Simplifies deploying models to production environments
Think of containers as lightweight, portable virtual machines that package everything your code needs to run. Unlike virtual machines, containers share the host operating system’s kernel, making them more efficient.
Containers have become widely adopted in production environments across the software industry, and their importance continues to grow in data science workflows.
7.1.2 Installing Docker
Docker is the most popular containerization platform. Let’s install it:
7.1.2.1 On Windows
- Download Docker Desktop for Windows
- Run the installer and follow the prompts
- Windows 10 Home users should ensure WSL 2 is installed first
7.1.2.2 On macOS
- Download Docker Desktop for Mac
- Run the installer and follow the prompts
7.1.2.3 On Linux
# For Ubuntu/Debian
sudo apt update
sudo apt install docker.io
sudo systemctl enable --now docker
# Add your user to the docker group to run Docker without sudo
sudo usermod -aG docker $USER
# Log out and back in for this to take effect
7.1.2.4 Verifying Installation
Open a terminal and run:
docker --version
docker run hello-world
If both commands complete successfully, Docker is installed correctly.
7.1.3 Docker Fundamentals
Before creating our first data science container, let’s understand some Docker basics:
- Images: Read-only templates that contain the application code, libraries, dependencies, and tools
- Containers: Running instances of images
- Dockerfile: A text file with instructions to build an image
- Docker Hub: A registry of pre-built Docker images
- Volumes: Persistent storage for containers
The relationship between these components works like this: you create a Dockerfile that defines how to build an image, the image is used to run containers, and volumes allow data to persist beyond the container lifecycle.
7.1.4 Creating Your First Data Science Container
Let’s create a basic data science container using a Dockerfile:
- Create a new directory for your project:
mkdir docker-data-science
cd docker-data-science
- Create a file named Dockerfile with the following content:
# Use a base image with Python installed
FROM python:3.9-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
gcc \
&& rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app
# Copy requirements file
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy the rest of the code
COPY . .
# Command to run when the container starts
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
- Create a requirements.txt file with your Python dependencies:
numpy
pandas
matplotlib
scipy
scikit-learn
jupyter
jupyterlab
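Unpinned requirements drift over time as new releases appear, which undercuts the reproducibility containers are meant to provide. Pinning exact versions keeps rebuilds deterministic; the version numbers below are purely illustrative (run pip freeze in a working environment to capture the versions you actually use):

```
numpy==1.26.4
pandas==2.2.2
matplotlib==3.8.4
scipy==1.13.0
scikit-learn==1.4.2
jupyterlab==4.1.6
```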
- Build the Docker image:
docker build -t data-science-env .
This command tells Docker to build an image based on the instructions in the Dockerfile and tag it with the name “data-science-env”. The . at the end specifies that the build context is the current directory.
- Run a container from the image:
docker run -p 8888:8888 -v $(pwd):/app data-science-env
This command does two important things:
- Maps port 8888 in the container to port 8888 on your host machine, allowing you to access Jupyter Lab in your browser
- Mounts your current directory to /app in the container, so changes to files are saved on your computer
- Open the Jupyter Lab URL shown in the terminal output
You now have a containerized data science environment that can be easily shared with others and deployed to different systems!
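Once Jupyter Lab is running, a quick way to confirm the environment actually contains what requirements.txt asked for is to query installed versions from a notebook cell. A small sketch using only the standard library (the helper name installed_versions is ours, not a library function):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    report = {}
    for pkg in packages:
        try:
            report[pkg] = version(pkg)
        except PackageNotFoundError:
            report[pkg] = None
    return report

# Check the packages listed in requirements.txt
for pkg, ver in installed_versions(["numpy", "pandas", "scikit-learn"]).items():
    print(pkg, ver or "NOT INSTALLED")
```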
7.1.5 Understanding the Dockerfile
Let’s break down the Dockerfile we just created:
# Use a base image with Python installed
FROM python:3.9-slim
The FROM statement specifies the base image to use. We’re starting with a lightweight Python 3.9 image.
# Install system dependencies
RUN apt-get update && apt-get install -y \
gcc \
&& rm -rf /var/lib/apt/lists/*
The RUN statement executes commands during the build process. Here, we’re updating the package list and installing gcc, which is required for building some Python packages.
# Set working directory
WORKDIR /app
The WORKDIR statement sets the working directory within the container.
# Copy requirements file
COPY requirements.txt .
The COPY statement copies files from the host to the container. We copy the requirements file separately to take advantage of Docker’s layer caching: as long as requirements.txt is unchanged, the dependency-installation layer is reused on subsequent builds instead of being rerun.
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
Another RUN statement to install the Python dependencies listed in requirements.txt.
# Copy the rest of the code
COPY . .
Copy all files from the current directory on the host to the working directory in the container.
# Command to run when the container starts
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
The CMD statement specifies the command to run when the container starts. In this case, we’re starting Jupyter Lab.
7.1.6 Using Pre-built Data Science Images
Instead of building your own Docker image, you can use popular pre-built images:
7.1.6.1 Jupyter Docker Stacks
The Jupyter team maintains several ready-to-use Docker images:
# Basic Jupyter Notebook
docker run -p 8888:8888 jupyter/minimal-notebook
# Data science-focused image with pandas, matplotlib, etc.
docker run -p 8888:8888 jupyter/datascience-notebook
# The data science image plus TensorFlow
docker run -p 8888:8888 jupyter/tensorflow-notebook
These pre-built images offer a convenient way to get started without creating your own Dockerfile. The Jupyter Docker Stacks project provides a range of images for different needs, from minimal environments to comprehensive data science setups.
7.1.6.2 RStudio
For R users, there are RStudio Server images:
docker run -p 8787:8787 -e PASSWORD=yourpassword rocker/rstudio
Access RStudio at http://localhost:8787 with username “rstudio” and your chosen password.
7.1.7 Docker Compose for Multiple Containers
For more complex setups with multiple services (e.g., Python, R, and a database), Docker Compose allows you to define and run multi-container applications:
- Create a file named docker-compose.yml:
version: '3'
services:
  jupyter:
    image: jupyter/datascience-notebook
    ports:
      - "8888:8888"
    volumes:
      - ./jupyter_data:/home/jovyan/work
  rstudio:
    image: rocker/rstudio
    ports:
      - "8787:8787"
    environment:
      - PASSWORD=yourpassword
    volumes:
      - ./r_data:/home/rstudio
  postgres:
    image: postgres:13
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_PASSWORD=postgres
    volumes:
      - ./postgres_data:/var/lib/postgresql/data
- Start all services:
docker-compose up
- Access Jupyter at http://localhost:8888 and RStudio at http://localhost:8787
Docker Compose creates a separate container for each service in your configuration while allowing them to communicate with each other. This approach makes it easy to run complex data science environments with multiple tools.
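A practical consequence of this shared network: each service reaches the others by service name, not localhost. Code running in the jupyter container would therefore connect to the database at host postgres. A minimal sketch of building such a connection URL (the credentials match the POSTGRES_PASSWORD above and the postgres image defaults; the helper function is illustrative):

```python
def postgres_url(user: str, password: str, host: str,
                 port: int = 5432, db: str = "postgres") -> str:
    """Build a PostgreSQL connection URL (e.g. for use with SQLAlchemy)."""
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"

# From another Compose service, the hostname is the service name "postgres"
print(postgres_url("postgres", "postgres", "postgres"))
# postgresql://postgres:postgres@postgres:5432/postgres
```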
7.1.8 Docker for Machine Learning Projects
For machine learning projects, containers are particularly valuable for ensuring model reproducibility and simplifying deployment:
- Create a project-specific Dockerfile:
FROM python:3.9-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
gcc \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Copy and install requirements first for better caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy model code and artifacts
COPY models/ models/
COPY src/ src/
COPY app.py .
# Expose port for API
EXPOSE 5000
# Run the API service
CMD ["python", "app.py"]
- Create a simple model serving API (app.py):
from flask import Flask, request, jsonify
import pickle
import numpy as np

app = Flask(__name__)

# Load pre-trained model
with open('models/model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    features = np.array(data['features']).reshape(1, -1)
    prediction = model.predict(features)[0]
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
- Build and run the container:
docker build -t ml-model-api .
docker run -p 5000:5000 ml-model-api
This creates a containerized API service for your machine learning model that can be deployed to any environment that supports Docker.
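The API above assumes a serialized model already exists at models/model.pkl. A minimal sketch of producing one with pickle; in a real project you would dump a trained estimator (e.g. from scikit-learn), whereas the StubModel class here is a placeholder we invented so the example stays self-contained:

```python
import os
import pickle
import numpy as np

class StubModel:
    """Placeholder exposing the predict() interface the API expects.

    Stands in for a trained estimator; it just sums the features of each row.
    """
    def predict(self, X):
        return np.asarray(X).sum(axis=1)

os.makedirs("models", exist_ok=True)
with open("models/model.pkl", "wb") as f:
    pickle.dump(StubModel(), f)

# Sanity-check the round trip the API performs at startup
with open("models/model.pkl", "rb") as f:
    model = pickle.load(f)
print(model.predict([[1.0, 2.0, 3.0]])[0])  # 6.0
```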
7.1.9 Best Practices for Docker in Data Science
To get the most out of Docker for data science, follow these best practices:
- Keep images lean: Use slim or alpine base images when possible
# Better than the full python:3.9 image
FROM python:3.9-slim
- Use multi-stage builds for production: Separate building dependencies from runtime
# Build stage
FROM python:3.9 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Runtime stage
FROM python:3.9-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.9/site-packages /usr/local/lib/python3.9/site-packages
COPY . .
CMD ["python", "app.py"]
- Layer your Dockerfile logically: Order commands from least to most likely to change
# System dependencies change rarely
RUN apt-get update && apt-get install -y gcc
# Requirements change occasionally
COPY requirements.txt .
RUN pip install -r requirements.txt
# Application code changes frequently
COPY . .
- Use volume mounts for data: Keep data outside the container
docker run -v /path/to/local/data:/app/data my-data-science-image
- Implement proper versioning: Tag images meaningfully
docker build -t mymodel:1.0.0 .
- Create a .dockerignore file: Exclude unnecessary files
# .dockerignore
.git
__pycache__/
*.pyc
venv/
data/
- Use environment variables for configuration:
ENV MODEL_PATH=/app/models/model.pkl
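On the Python side, reading such configuration through os.environ keeps the code identical across environments; only the image's ENV (or a docker run -e flag) changes. A minimal sketch:

```python
import os

# Fall back to the path baked into the image when the variable is unset
MODEL_PATH = os.environ.get("MODEL_PATH", "/app/models/model.pkl")
print(MODEL_PATH)
```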
7.1.10 Common Docker Commands for Data Scientists
Here are some useful Docker commands for day-to-day work:
# List running containers
docker ps
# List all containers (including stopped ones)
docker ps -a
# List images
docker images
# Stop a container
docker stop container_id
# Remove a container
docker rm container_id
# Remove an image
docker rmi image_id
# View container logs
docker logs container_id
# Execute a command in a running container
docker exec -it container_id bash
# Clean up unused resources
docker system prune
Understanding these commands will help you manage your Docker workflow efficiently.
7.1.11 Conclusion
Containerization provides a powerful way to create reproducible, portable environments for data science. By packaging your code, dependencies, and configuration into a standardized unit, you can ensure consistent behavior across different systems and simplify collaboration with colleagues.
We’ve covered Docker for containerization, but there are several high-quality alternatives such as Podman and Rancher Desktop. As you grow more comfortable with Docker, you can explore advanced topics like custom image optimization, orchestration with Kubernetes, and CI/CD integration. The investment in learning containerization pays dividends in reproducibility, efficiency, and deployment simplicity throughout your data science career.