7  Containerization

7.1 Containerization with Docker

As your data science projects grow more complex, you may encounter the “it works on my machine” problem—where code runs differently in different environments. Containerization solves this by packaging your code and its dependencies into a standardized unit called a container. Building on the environment management concepts we covered in the introductory chapter (conda environments for Python, renv for R), containers take isolation to the next level by packaging the entire runtime environment.

7.1.1 Why Containerization for Data Science?

Containerization offers several advantages for data science:

  1. Reproducibility: Ensures your analysis runs the same way everywhere
  2. Portability: Move your environment between computers or cloud platforms
  3. Dependency Management: Isolates project dependencies to avoid conflicts
  4. Collaboration: Easier sharing of complex environments with colleagues
  5. Deployment: Simplifies deploying models to production environments

Think of containers as lightweight, portable virtual machines that package everything your code needs to run. Unlike virtual machines, containers share the host operating system’s kernel, making them more efficient.

Containers have become widely adopted in production environments across the software industry, and their importance continues to grow in data science workflows.

7.1.2 Installing Docker

Docker is the most popular containerization platform. Let’s install it:

7.1.2.1 On Windows

  1. Download Docker Desktop for Windows
  2. Run the installer and follow the prompts
  3. Windows 10 Home users should ensure WSL 2 is installed first

7.1.2.2 On macOS

  1. Download Docker Desktop for Mac
  2. Run the installer and follow the prompts

7.1.2.3 On Linux

# For Ubuntu/Debian
sudo apt update
sudo apt install docker.io
sudo systemctl enable --now docker

# Add your user to the docker group to run Docker without sudo
sudo usermod -aG docker $USER
# Log out and back in for this to take effect

7.1.2.4 Verifying Installation

Open a terminal and run:

docker --version
docker run hello-world

If both commands complete successfully, Docker is installed correctly.

7.1.3 Docker Fundamentals

Before creating our first data science container, let’s understand some Docker basics:

  1. Images: Read-only templates that contain the application code, libraries, dependencies, and tools
  2. Containers: Running instances of images
  3. Dockerfile: A text file with instructions to build an image
  4. Docker Hub: A registry of pre-built Docker images
  5. Volumes: Persistent storage for containers

The relationship between these components works like this: you create a Dockerfile that defines how to build an image, the image is used to run containers, and volumes allow data to persist beyond the container lifecycle.

7.1.4 Creating Your First Data Science Container

Let’s create a basic data science container using a Dockerfile:

  1. Create a new directory for your project:
mkdir docker-data-science
cd docker-data-science
  1. Create a file named Dockerfile with the following content:
# Use a base image with Python installed
FROM python:3.9-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the code
COPY . .

# Command to run when the container starts
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
  1. Create a requirements.txt file with your Python dependencies:
numpy
pandas
matplotlib
scipy
scikit-learn
jupyter
jupyterlab
  1. Build the Docker image:
docker build -t data-science-env .

This command tells Docker to build an image based on the instructions in the Dockerfile and tag it with the name “data-science-env”. The . at the end specifies that the build context is the current directory.

  1. Run a container from the image:
docker run -p 8888:8888 -v $(pwd):/app data-science-env

This command does two important things:

  • Maps port 8888 in the container to port 8888 on your host machine, allowing you to access Jupyter Lab in your browser
  • Mounts your current directory to /app in the container, so changes to files are saved on your computer
  1. Open the Jupyter Lab URL shown in the terminal output

You now have a containerized data science environment that can be easily shared with others and deployed to different systems!

7.1.5 Understanding the Dockerfile

Let’s break down the Dockerfile we just created:

# Use a base image with Python installed
FROM python:3.9-slim

The FROM statement specifies the base image to use. We’re starting with a lightweight Python 3.9 image.

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    && rm -rf /var/lib/apt/lists/*

The RUN statement executes commands during the build process. Here, we’re updating the package list and installing gcc, which is required for building some Python packages.

# Set working directory
WORKDIR /app

The WORKDIR statement sets the working directory within the container.

# Copy requirements file
COPY requirements.txt .

The COPY statement copies files from the host to the container. We copy the requirements file separately to take advantage of Docker’s caching mechanism.

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

Another RUN statement to install the Python dependencies listed in requirements.txt.

# Copy the rest of the code
COPY . .

Copy all files from the current directory on the host to the working directory in the container.

# Command to run when the container starts
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]

The CMD statement specifies the command to run when the container starts. In this case, we’re starting Jupyter Lab.

7.1.6 Using Pre-built Data Science Images

Instead of building your own Docker image, you can use popular pre-built images:

7.1.6.1 Jupyter Docker Stacks

The Jupyter team maintains several ready-to-use Docker images:

# Basic Jupyter Notebook
docker run -p 8888:8888 jupyter/minimal-notebook

# Data science-focused image with pandas, matplotlib, etc.
docker run -p 8888:8888 jupyter/datascience-notebook

# All the above plus TensorFlow and PyTorch
docker run -p 8888:8888 jupyter/tensorflow-notebook

These pre-built images offer a convenient way to get started without creating your own Dockerfile. The Jupyter Docker Stacks project provides a range of images for different needs, from minimal environments to comprehensive data science setups.

7.1.6.2 RStudio

For R users, there are RStudio Server images:

docker run -p 8787:8787 -e PASSWORD=yourpassword rocker/rstudio

Access RStudio at http://localhost:8787 with username “rstudio” and your chosen password.

7.1.7 Docker Compose for Multiple Containers

For more complex setups with multiple services (e.g., Python, R, and a database), Docker Compose allows you to define and run multi-container applications:

  1. Create a file named docker-compose.yml:
version: '3'
services:
  jupyter:
    image: jupyter/datascience-notebook
    ports:
      - "8888:8888"
    volumes:
      - ./jupyter_data:/home/jovyan/work
  
  rstudio:
    image: rocker/rstudio
    ports:
      - "8787:8787"
    environment:
      - PASSWORD=yourpassword
    volumes:
      - ./r_data:/home/rstudio
  
  postgres:
    image: postgres:13
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_PASSWORD=postgres
    volumes:
      - ./postgres_data:/var/lib/postgresql/data
  1. Start all services:
docker-compose up
  1. Access Jupyter at http://localhost:8888 and RStudio at http://localhost:8787

Docker Compose creates a separate container for each service in your configuration while allowing them to communicate with each other. This approach makes it easy to run complex data science environments with multiple tools.

7.1.8 Docker for Machine Learning Projects

For machine learning projects, containers are particularly valuable for ensuring model reproducibility and simplifying deployment:

  1. Create a project-specific Dockerfile:
FROM python:3.9-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy and install requirements first for better caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model code and artifacts
COPY models/ models/
COPY src/ src/
COPY app.py .

# Expose port for API
EXPOSE 5000

# Run the API service
CMD ["python", "app.py"]
  1. Create a simple model serving API (app.py):
from flask import Flask, request, jsonify
import pickle
import numpy as np

app = Flask(__name__)

# Load pre-trained model
with open('models/model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    features = np.array(data['features']).reshape(1, -1)
    prediction = model.predict(features)[0]
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
  1. Build and run the container:
docker build -t ml-model-api .
docker run -p 5000:5000 ml-model-api

This creates a containerized API service for your machine learning model that can be deployed to any environment that supports Docker.

7.1.9 Best Practices for Docker in Data Science

To get the most out of Docker for data science, follow these best practices:

  1. Keep images lean: Use slim or alpine base images when possible

    FROM python:3.9-slim  # Better than the full python:3.9 image
  2. Use multi-stage builds for production: Separate building dependencies from runtime

    # Build stage
    FROM python:3.9 AS builder
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Runtime stage
    FROM python:3.9-slim
    WORKDIR /app
    COPY --from=builder /usr/local/lib/python3.9/site-packages /usr/local/lib/python3.9/site-packages
    COPY . .
    CMD ["python", "app.py"]
  3. Layer your Dockerfile logically: Order commands from least to most likely to change

    # System dependencies change rarely
    RUN apt-get update && apt-get install -y gcc
    
    # Requirements change occasionally
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    
    # Application code changes frequently
    COPY . .
  4. Use volume mounts for data: Keep data outside the container

    docker run -v /path/to/local/data:/app/data my-data-science-image
  5. Implement proper versioning: Tag images meaningfully

    docker build -t mymodel:1.0.0 .
  6. Create a .dockerignore file: Exclude unnecessary files

    # .dockerignore
    .git
    __pycache__/
    *.pyc
    venv/
    data/
  7. Use environment variables for configuration:

    ENV MODEL_PATH=/app/models/model.pkl

7.1.10 Common Docker Commands for Data Scientists

Here are some useful Docker commands for day-to-day work:

# List running containers
docker ps

# List all containers (including stopped ones)
docker ps -a

# List images
docker images

# Stop a container
docker stop container_id

# Remove a container
docker rm container_id

# Remove an image
docker rmi image_id

# View container logs
docker logs container_id

# Execute a command in a running container
docker exec -it container_id bash

# Clean up unused resources
docker system prune

Understanding these commands will help you manage your Docker workflow efficiently.

7.1.11 Conclusion

Containerization provides a powerful way to create reproducible, portable environments for data science. By packaging your code, dependencies, and configuration into a standardized unit, you can ensure consistent behavior across different systems and simplify collaboration with colleagues.

We’ve covered Docker for containerization but there are several good quality alternatives such as Podman and Rancher. As you grow more comfortable with Docker, you can explore advanced topics like custom image optimization, orchestration with Kubernetes, and CI/CD integration. The investment in learning containerization pays dividends in reproducibility, efficiency, and deployment simplicity throughout your data science career.