8  Containerisation

8.1 Containerisation with Docker

As your data science projects grow more complex, you may encounter the “it works on my machine” problem—where code runs differently in different environments. Containerisation solves this by packaging your code and its dependencies into a standardised unit called a container. Building on the environment management concepts we covered in the introductory chapter (conda environments for Python, renv for R), containers take isolation to the next level by packaging the entire runtime environment.

8.1.1 Why Containerisation for Data Science?

Containerisation offers several advantages for data science:

  1. Reproducibility: Ensures your analysis runs the same way everywhere
  2. Portability: Move your environment between computers or cloud platforms
  3. Dependency Management: Isolates project dependencies to avoid conflicts
  4. Collaboration: Easier sharing of complex environments with colleagues
  5. Deployment: Simplifies deploying models to production environments

Think of containers as lightweight, portable units that package everything your code needs to run. Unlike virtual machines, containers don’t boot their own operating system; they share the host’s kernel and isolate themselves using OS-level features. That makes them much faster to start and much cheaper in memory than a VM, while still giving you a reproducible, isolated environment.

Containers have become widely adopted in production environments across the software industry, and their importance continues to grow in data science workflows.

8.1.2 Installing Docker

Docker is the most widely used containerisation platform, and its command-line tool (docker) is effectively the lingua franca of containers. Even the alternative tools below aim to be drop-in replacements for it. We’ll use Docker Desktop in this chapter, but note the following licensing caveat before installing:

Note: Docker Desktop licensing

Since 2021, Docker Desktop requires a paid subscription for commercial use at companies with more than 250 employees or over $10 million in annual revenue. It remains free for personal use, education, small businesses, and open-source projects. If your workplace falls above that threshold, the free alternatives below are excellent and all provide the same docker command:

  • Rancher Desktop: open source, works on Windows/macOS/Linux, bundles Kubernetes
  • Podman Desktop: open source, daemonless, Red Hat–maintained, the closest drop-in replacement for the Docker CLI
  • OrbStack (macOS only): very fast, polished; free for personal use, paid for commercial

All the Dockerfiles and docker commands in this chapter work identically on each of these tools.

8.1.2.1 On Windows

  1. Download Docker Desktop for Windows (or one of the alternatives above)
  2. Run the installer and follow the prompts
  3. Ensure WSL 2 is installed first; Docker Desktop uses it as its default backend (on Windows 10 Home it is the only option)

8.1.2.2 On macOS

  1. Download Docker Desktop for Mac (or OrbStack / Rancher Desktop)
  2. Run the installer and follow the prompts

8.1.2.3 On Linux

# For Ubuntu/Debian
sudo apt update
sudo apt install docker.io
sudo systemctl enable --now docker

# Add your user to the docker group to run Docker without sudo
sudo usermod -aG docker $USER
# Log out and back in for this to take effect

8.1.2.4 Verifying Installation

Open a terminal and run:

docker --version
docker run hello-world

If both commands complete successfully, Docker is installed correctly.

8.1.3 Docker Fundamentals

Before creating our first data science container, let’s understand some Docker basics:

  1. Images: Read-only templates that contain the application code, libraries, dependencies, and tools
  2. Containers: Running instances of images
  3. Dockerfile: A text file with instructions to build an image
  4. Docker Hub: A registry of pre-built Docker images
  5. Volumes: Persistent storage for containers

The relationship between these components works like this: you create a Dockerfile that defines how to build an image, the image is used to run containers, and volumes allow data to persist beyond the container lifecycle.

8.1.4 Creating Your First Data Science Container

Let’s create a basic data science container using a Dockerfile:

  1. Create a new directory for your project:
mkdir docker-data-science
cd docker-data-science
  2. Create a file named Dockerfile with the following content:
# Use a base image with Python installed
FROM python:3.12-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Create a non-root user (before copying files so we can set ownership)
RUN useradd --create-home --shell /bin/bash jovyan

# Copy the rest of the code, owned by the non-root user so Jupyter
# can write notebooks back to /app
COPY --chown=jovyan:jovyan . .
RUN chown -R jovyan:jovyan /app

USER jovyan

# Command to run when the container starts
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser"]
  3. Create a requirements.txt file with your Python dependencies (for stricter reproducibility, pin exact versions, e.g. numpy==2.1.1 rather than numpy):
numpy
pandas
matplotlib
scipy
scikit-learn
jupyter
jupyterlab
  4. Build the Docker image:
docker build -t data-science-env .

This command tells Docker to build an image based on the instructions in the Dockerfile and tag it with the name “data-science-env”. The . at the end specifies that the build context is the current directory.

  5. Run a container from the image:
docker run -p 8888:8888 -v $(pwd):/app data-science-env

This command does two important things:

  • Maps port 8888 in the container to port 8888 on your host machine, allowing you to access Jupyter Lab in your browser
  • Mounts your current directory to /app in the container, so changes to files are saved on your computer
  6. Open the Jupyter Lab URL shown in the terminal output

You now have a containerised data science environment that can be easily shared with others and deployed to different systems!

8.1.5 Understanding the Dockerfile

Let’s break down the Dockerfile we just created:

# Use a base image with Python installed
FROM python:3.12-slim

The FROM statement specifies the base image to use. We’re starting with a lightweight Python 3.12 image. Always pick a currently supported Python version; consult the official release schedule before pinning in production, since each version’s support window eventually ends.

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    && rm -rf /var/lib/apt/lists/*

The RUN statement executes commands during the build process. Here, we’re updating the package list and installing gcc, which is required for building some Python packages.

# Set working directory
WORKDIR /app

The WORKDIR statement sets the working directory within the container.

# Copy requirements file
COPY requirements.txt .

The COPY statement copies files from the host to the container. We copy the requirements file separately to take advantage of Docker’s caching mechanism.

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

Another RUN statement to install the Python dependencies listed in requirements.txt.

# Copy the rest of the code, owned by the non-root user
COPY --chown=jovyan:jovyan . .

Copy all files from the current directory on the host to the working directory in the container. The --chown flag gives ownership to the jovyan user created earlier, and the USER jovyan instruction that follows switches to that account so Jupyter doesn’t run as root.

# Command to run when the container starts
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser"]

The CMD statement specifies the command to run when the container starts. In this case, we’re starting Jupyter Lab. We bind to 0.0.0.0 so the server accepts connections from outside the container (otherwise the port mapping from the host wouldn’t reach it).
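One detail worth knowing about CMD: it is only a default, and whatever you pass after the image name in docker run replaces it wholesale (docker run data-science-env bash drops you into a shell instead of starting Jupyter, which is handy for debugging). If you instead want a fixed executable with overridable arguments, combine ENTRYPOINT with CMD. A sketch of that variant for the Dockerfile above:

```dockerfile
# ENTRYPOINT fixes the executable; CMD supplies default arguments
# that extra arguments to `docker run` will replace
ENTRYPOINT ["jupyter", "lab"]
CMD ["--ip=0.0.0.0", "--port=8888", "--no-browser"]
```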

8.1.6 Using Pre-built Data Science Images

Instead of building your own Docker image, you can use popular pre-built images:

8.1.6.1 Jupyter Docker Stacks

The Jupyter team maintains several ready-to-use Docker images:

# Basic Jupyter Notebook. Pin a specific dated tag for reproducibility
docker run -p 8888:8888 quay.io/jupyter/minimal-notebook:2025-10-13

# Data science-focused image with pandas, matplotlib, etc.
docker run -p 8888:8888 quay.io/jupyter/datascience-notebook:2025-10-13

# All the above plus TensorFlow
docker run -p 8888:8888 quay.io/jupyter/tensorflow-notebook:2025-10-13

These pre-built images offer a convenient way to get started without creating your own Dockerfile. The Jupyter Docker Stacks project provides a range of images for different needs, from minimal environments to comprehensive data science setups.

Two small but important habits:

  • Pull from quay.io/jupyter/*: the project moved off Docker Hub, and quay.io is now the canonical location.
  • Pin a specific dated tag (e.g. 2025-10-13) rather than latest. Using latest means the image can silently change under you between runs, which defeats the whole point of containers for reproducibility. Check the jupyter/docker-stacks releases page for current tags.

8.1.6.2 RStudio

For R users, the Rocker project maintains RStudio Server images. Keep the password out of your shell history by reading it from an environment variable or .env file rather than typing it on the command line:

# Set the password once for your shell session (choose a strong one):
export RSTUDIO_PASSWORD='a-strong-password-you-chose'

docker run -p 8787:8787 \
  -e PASSWORD="$RSTUDIO_PASSWORD" \
  rocker/rstudio:4.4

Access RStudio at http://localhost:8787 with username rstudio and the password you set. Pin to a specific Rocker tag (e.g. rocker/rstudio:4.4) rather than using latest, for the same reproducibility reasons as above. Rocker publishes tags on Docker Hub: browse there to pick a specific patch version if you need one.

8.1.7 Docker Compose for Multiple Containers

For more complex setups with multiple services (e.g., Python, R, and a database), Docker Compose allows you to define and run multi-container applications:

  1. Create a .env file alongside your compose file with the secrets, and add .env to your .gitignore so it never reaches a repository:
RSTUDIO_PASSWORD=a-strong-password-you-chose
POSTGRES_PASSWORD=another-strong-password
  2. Create a file named compose.yaml (the newer Compose V2 name; docker-compose.yml still works):
services:
  jupyter:
    image: quay.io/jupyter/datascience-notebook:2025-10-13
    ports:
      - "8888:8888"
    volumes:
      - ./jupyter_data:/home/jovyan/work

  rstudio:
    image: rocker/rstudio:4.4
    ports:
      - "8787:8787"
    environment:
      - PASSWORD=${RSTUDIO_PASSWORD}
    volumes:
      - ./r_data:/home/rstudio

  postgres:
    image: postgres:16
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - ./postgres_data:/var/lib/postgresql/data

The ${RSTUDIO_PASSWORD} and ${POSTGRES_PASSWORD} placeholders are filled in from your .env file at runtime, so the secrets never live in the file you commit.

  3. Start all services (Compose V2 uses docker compose, a space, not a hyphen):
docker compose up
  4. Access Jupyter at http://localhost:8888 and RStudio at http://localhost:8787

Docker Compose creates a separate container for each service in your configuration while allowing them to communicate with each other. This approach makes it easy to run complex data science environments with multiple tools.
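A useful consequence of this setup: Compose attaches all three services to a shared network on which each container can reach the others by service name. From code running in the jupyter container, for example, the database is reachable at host postgres on port 5432. A connection URL would look like this (using the postgres image’s default postgres user and database, with the placeholder standing in for your .env value):

```
postgresql://postgres:<your-postgres-password>@postgres:5432/postgres
```

Your host machine, by contrast, reaches the same database at localhost:5432 thanks to the ports mapping.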

8.1.8 Docker for Machine Learning Projects

For machine learning projects, containers are particularly valuable for ensuring model reproducibility and simplifying deployment:

  1. Create a project-specific Dockerfile:
FROM python:3.12-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy and install requirements first for better caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model code and artifacts
COPY models/ models/
COPY src/ src/
COPY app.py .

# Expose port for API
EXPOSE 5000

# Run the API service
CMD ["python", "app.py"]
  2. Create a simple model serving API (app.py):
from flask import Flask, request, jsonify
import pickle
import numpy as np

app = Flask(__name__)

# Load pre-trained model
with open('models/model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    features = np.array(data['features']).reshape(1, -1)
    prediction = model.predict(features)[0]
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
  3. Build and run the container:
docker build -t ml-model-api .
docker run -p 5000:5000 ml-model-api

This creates a containerised API service for your machine learning model that can be deployed to any environment that supports Docker. One caveat: app.run() starts Flask’s built-in development server, which is fine for testing but not designed for production traffic; for real deployments, run the app under a WSGI server such as gunicorn.
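The pickle round trip that app.py relies on can be demonstrated with the standard library alone. The ThresholdModel class below is a hypothetical stand-in for a trained estimator (anything picklable with a predict method), and an in-memory buffer stands in for models/model.pkl:

```python
import io
import pickle

# A stand-in for a trained model: any picklable object with a predict method.
class ThresholdModel:
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, features):
        # Return 1 for each row whose first feature exceeds the threshold
        return [1 if row[0] > self.threshold else 0 for row in features]

# Serialise the "trained" model, as you would to models/model.pkl
buffer = io.BytesIO()
pickle.dump(ThresholdModel(threshold=0.5), buffer)

# Deserialise it, as app.py does at startup, and use it for a prediction
buffer.seek(0)
model = pickle.load(buffer)
print(model.predict([[0.9], [0.1]]))  # [1, 0]
```

One caveat worth remembering: pickle.load will execute code embedded in the file, so only ever load model files from sources you trust.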

8.1.9 Best Practices for Docker in Data Science

To get the most out of Docker for data science, follow these best practices:

  1. Keep images lean: Smaller images pull faster, deploy faster, and expose less attack surface. Use -slim (or -alpine for pure-Python workloads) base images when you can.

    # ~150 MB, versus ~1 GB for the full python:3.12 image
    FROM python:3.12-slim

    Note that Alpine-based images use musl instead of glibc, which occasionally breaks scientific Python packages that ship precompiled wheels for glibc only. When in doubt, start with -slim.

  2. Use multi-stage builds for production: Separate building dependencies from runtime

    # Build stage
    FROM python:3.12 AS builder
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Runtime stage
    FROM python:3.12-slim
    WORKDIR /app
    COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
    COPY . .
    CMD ["python", "app.py"]

    The benefit: your final image doesn’t carry compilers, header files, or build tools, only the installed Python packages and your application code. For typical data science containers this can cut image size by 30–50% and reduce the attack surface.

  3. Layer your Dockerfile logically: Order commands from least to most likely to change

    # System dependencies change rarely
    RUN apt-get update && apt-get install -y gcc
    
    # Requirements change occasionally
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    
    # Application code changes frequently
    COPY . .
  4. Use volume mounts for data: Keep data outside the container

    docker run -v /path/to/local/data:/app/data my-data-science-image
  5. Implement proper versioning: Tag images meaningfully

    docker build -t mymodel:1.0.0 .
  6. Create a .dockerignore file: Exclude unnecessary files

    # .dockerignore
    .git
    __pycache__/
    *.pyc
    venv/
    data/
  7. Use environment variables for configuration (not secrets):

    ENV MODEL_PATH=/app/models/model.pkl

    ENV values are baked into the image and visible to anyone who can pull it, so they are fine for non-sensitive defaults like file paths or log levels. Pass actual secrets at runtime via docker run -e, --env-file, or your platform’s secret store. Never hard-code them in a Dockerfile.

  8. Add a HEALTHCHECK so the runtime can detect and restart broken containers. The example below assumes curl is installed in the image and that the application serves a /health endpoint on port 8080:

    HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
      CMD curl -fsS http://localhost:8080/health || exit 1
  9. Run as a non-root user in production images. Most official base images provide one (e.g. the Jupyter stacks use jovyan, Rocker uses rstudio). For your own images, add a user near the end of the Dockerfile:

    RUN useradd --create-home appuser
    USER appuser

    Running as root inside the container isn’t catastrophic on its own, but combined with volume mounts or a container escape it makes things much worse than they need to be.
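Practice 7 above has a counterpart in application code: read configuration from the environment at startup, falling back to a default when the variable is unset. A minimal sketch of that pattern (MODEL_PATH and LOG_LEVEL are illustrative names, not anything Docker itself defines):

```python
import os

# ENV in the Dockerfile, or docker run -e / --env-file, supplies the value;
# the second argument is the fallback used when the variable is unset.
model_path = os.environ.get("MODEL_PATH", "/app/models/model.pkl")
log_level = os.environ.get("LOG_LEVEL", "INFO")

print(f"loading model from {model_path} (log level {log_level})")
```

Because the values arrive at runtime, the same image can run in development and production with different configuration, which is exactly the property secret handling needs.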

8.1.10 Common Docker Commands for Data Scientists

Here are some useful Docker commands for day-to-day work:

# List running containers
docker ps

# List all containers (including stopped ones)
docker ps -a

# List images
docker images

# Stop a container
docker stop container_id

# Remove a container
docker rm container_id

# Remove an image
docker rmi image_id

# View container logs
docker logs container_id

# Execute a command in a running container
docker exec -it container_id bash

# Clean up unused resources
docker system prune

Understanding these commands will help you manage your Docker workflow efficiently.

8.1.11 Conclusion

Containerisation provides a powerful way to create reproducible, portable environments for data science. By packaging your code, dependencies, and configuration into a standardised unit, you can ensure consistent behaviour across different systems and simplify collaboration with colleagues.

We’ve covered Docker for containerisation, but there are several high-quality alternatives such as Podman and Rancher Desktop. As you grow more comfortable with Docker, you can explore advanced topics like custom image optimisation, orchestration with Kubernetes, and CI/CD integration. The investment in learning containerisation pays dividends in reproducibility, efficiency, and deployment simplicity throughout your data science career.