As a data scientist, sharing your findings clearly is just as important as the analysis itself. Now that we have our analytics platforms set up, let’s explore tools for creating reports, documentation, and presentations.
Markdown is a lightweight markup language that’s easy to read and write. It forms the basis of many documentation systems.
Markdown’s simplicity and widespread support have made it the de facto standard for documentation in data science projects.
# Heading 1
## Heading 2
### Heading 3
**Bold text**
*Italic text*
[Link text](https://example.com)

- Bullet point 1
- Bullet point 2
1. Numbered item 1
2. Numbered item 2
Table:
| Column 1 | Column 2 |
|----------|----------|
| Cell 1 | Cell 2 |
> This is a blockquote
`Inline code`
```{python}
# Code block
print("Hello, world!")
```
Markdown is designed to be readable even in its raw form. The syntax is intuitive: for example, surrounding text with asterisks makes it italic, and using hash symbols creates headings of different levels.
Many platforms interpret Markdown, including GitHub, Jupyter notebooks, and the documentation tools we’ll discuss next.
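To make the idea of "interpreting Markdown" concrete, here is a minimal sketch in Python that converts a small subset of the syntax shown above (headings, bold, italic, inline code, links, and bullet items) to HTML. The function names `render_inline` and `render_line` are made up for this illustration, and real Markdown processors handle far more cases (nested lists, tables, code fences, escaping rules).

```python
import html
import re


def render_inline(text: str) -> str:
    """Convert inline code, bold, italic, and links within one line of text."""
    text = html.escape(text, quote=False)  # keep literal <, >, & safe in the output
    text = re.sub(r"`([^`]+)`", r"<code>\1</code>", text)
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)  # bold before italic
    text = re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)
    text = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", r'<a href="\2">\1</a>', text)
    return text


def render_line(line: str) -> str:
    """Render a single Markdown line as a heading, list item, or paragraph."""
    heading = re.match(r"(#{1,6}) (.*)", line)
    if heading:
        level = len(heading.group(1))  # number of hash symbols sets the level
        return f"<h{level}>{render_inline(heading.group(2))}</h{level}>"
    if line.startswith("- "):
        # Simplification: a full converter would also wrap items in <ul>
        return f"<li>{render_inline(line[2:])}</li>"
    return f"<p>{render_inline(line)}</p>"


for sample in ["## Heading 2", "**Bold text**", "[Link text](https://example.com)"]:
    print(render_line(sample))
```

The point of the sketch is only that the raw text carries its own structure: a handful of pattern rules is enough to recover it, which is why so many tools can consume the same files.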
R Markdown combines R code, output, and narrative text in a single document that can be rendered to HTML, PDF, Word, and other formats.
The concept of “literate programming” behind R Markdown was first proposed by computer scientist Donald Knuth in 1984, and it has become a cornerstone of reproducible research in data science.
If you’ve installed R and RStudio as described earlier, R Markdown is just a package installation away:
install.packages("rmarkdown")
To create your first R Markdown document, choose File > New File > R Markdown... in RStudio, fill in a title and author, and click OK.
RStudio creates a template document with examples of text, code chunks, and plots. This template is extremely helpful because it shows you the basic structure of an R Markdown document right away—you don’t have to start from scratch.
A typical R Markdown document consists of three components: a YAML header, narrative text written in Markdown, and code chunks.
For example:
---
title: "My First Data Analysis"
author: "Your Name"
date: "2025-04-30"
output: html_document
---
# Introduction
This analysis explores the relationship between variables X (carat) and Y (price).
## Data Import and Cleaning
```{r setup, eval=FALSE}
# Load the diamonds dataset from ggplot2
data(diamonds, package = "ggplot2")
# Create a smaller sample of the diamonds dataset
set.seed(123) # For reproducibility
# The native pipe |> (R >= 4.1) is used because %>% is not available
# unless magrittr or dplyr has been attached with library()
my_data <- diamonds |>
  dplyr::sample_n(1000) |>
  dplyr::select(
    X = carat,
    Y = price,
    cut = cut,
    color = color,
    clarity = clarity
  )
# Display the first few rows
head(my_data)
```
## Data Visualization
```{r visualization, eval=FALSE}
ggplot2::ggplot(my_data, ggplot2::aes(x = X, y = Y)) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth(method = "lm") +
  ggplot2::labs(title = "Relationship between X and Y")
```
Note that we’ve used the namespace convention (package::function()) in the code above rather than loading each package with library(). This is a matter of preference, but benefits include:
- No need to attach whole packages with library()
- No ambiguity when two packages export a function with the same name (e.g., dplyr::filter() vs. stats::filter())
When you click the “Knit” button in RStudio, the R code in the chunks is executed, and the results (including plots and tables) are embedded in the output document. The reason this is so powerful is that it combines your code, results, and narrative explanation in a single, reproducible document. If your data changes, you simply re-knit the document to update all results automatically.
R Markdown has become a standard in reproducible research because it creates a direct connection between your data, analysis, and conclusions. This connection makes your work more transparent and reliable, as anyone can follow your exact steps and see how you reached your conclusions.
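The knitting step can be pictured with a toy sketch in Python: find each code chunk in a document, run it, and place whatever it printed directly after the code. This is not how knitr or RStudio are implemented; the function name `knit` and the `{python}` chunk marker are assumptions chosen for this illustration, and plots, errors, and chunk options are ignored.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

# A chunk is a fence, the {python} marker, the code, and a closing fence
CHUNK = re.compile(FENCE + r"\{python\}\n(.*?)" + FENCE, re.DOTALL)


def knit(source: str) -> str:
    """Run each {python} chunk and insert its printed output after the code."""
    env: dict = {}  # shared by all chunks, so later chunks see earlier variables

    def run(match: re.Match) -> str:
        code = match.group(1)
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, env)  # only run documents you trust
        shown_code = f"{FENCE}python\n{code}{FENCE}"
        shown_output = f"{FENCE}\n{buffer.getvalue()}{FENCE}"
        return f"{shown_code}\n\n{shown_output}"

    return CHUNK.sub(run, source)


doc = (
    "# Report\n\n"
    f"{FENCE}{{python}}\ntotal = sum([1, 2, 3])\nprint(total)\n{FENCE}\n\n"
    f"{FENCE}{{python}}\nprint(total * 2)\n{FENCE}\n"
)
rendered = knit(doc)
print(rendered)
```

Because the output is regenerated from the code on every run, changing the numbers in the first chunk changes both results in the rendered text, which is the property that makes re-knitting after a data update safe.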
We’ve already covered Jupyter notebooks for Python development, but they’re also excellent documentation tools. Like R Markdown, they combine code, output, and narrative text.
Jupyter notebooks can be exported to various formats, including HTML, PDF, and Markdown, directly from the notebook interface.
Alternatively, you can use nbconvert from the command line:
jupyter nbconvert --to html my_notebook.ipynb
The ability to export notebooks is particularly valuable because it allows you to write your analysis once and then distribute it in whatever format your audience needs. For example, you might use the PDF format for a formal report to stakeholders, HTML for sharing on a website, or Markdown for including in a GitHub repository.
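Export is possible because a notebook file is plain JSON: a list of cells, each with a type and its source text. The sketch below, using only the Python standard library, turns such a file into Markdown. It is a simplified stand-in for what nbconvert does, not its implementation; the function name `notebook_to_markdown` and the tiny example notebook are invented for illustration, and cell outputs are ignored.

```python
import json

FENCE = "`" * 3  # built at runtime so this example contains no literal fence


def notebook_to_markdown(notebook_json: str) -> str:
    """Convert the cells of an nbformat-4 style notebook to Markdown text."""
    notebook = json.loads(notebook_json)
    parts = []
    for cell in notebook["cells"]:
        # "source" may be a single string or a list of line strings
        source = "".join(cell["source"])
        if cell["cell_type"] == "markdown":
            parts.append(source)
        elif cell["cell_type"] == "code":
            parts.append(f"{FENCE}python\n{source}\n{FENCE}")
    return "\n\n".join(parts) + "\n"


# A hand-written two-cell notebook, standing in for a real .ipynb file
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Sales analysis\n", "Quarterly totals."]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print(1 + 1)"]},
    ],
})
markdown = notebook_to_markdown(example)
print(markdown)
```

For real work, nbconvert remains the better choice, since it also handles outputs, images, templates, and the other target formats.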
For larger documentation projects, Jupyter Book builds on the notebook format to create complete books:
# Install Jupyter Book
pip install jupyter-book
# Create a new book project
jupyter-book create my-book
# Build the book
jupyter-book build my-book/
Jupyter Book organises multiple notebooks and markdown files into a cohesive book with navigation, search, and cross-references. This is especially useful for comprehensive documentation, tutorials, or course materials. The resulting books have a professional appearance with a table of contents, navigation panel, and consistent styling throughout.
Quarto is a newer system that works with both Python and R, unifying the best aspects of R Markdown and Jupyter notebooks.
# Install Quarto CLI from https://quarto.org/docs/get-started/
# Create a new Quarto project (website, book, manuscript, etc.)
quarto create project default my-project
# Or, for a standalone one-off document, simply create a file
# named document.qmd in your editor (no command needed).
# Render a document to HTML (or PDF, docx, etc.)
quarto render document.qmd
Quarto represents an evolution in documentation tools because it provides a unified system for creating computational documents with multiple programming languages. This is particularly valuable if you work with both Python and R, as you can maintain a consistent documentation approach across all your projects.
The key advantage of Quarto is its language-agnostic design—you can mix Python, R, Julia, and other languages in a single document, which reflects the reality of many data science workflows where different tools are used for different tasks.
Since Quarto 1.4 (released in 2024), Quarto can render a document directly into a dashboard layout with rows, columns, value boxes, and tab sets, with no Shiny, Dash, or Streamlit required for the static case. If all you need is a periodic refresh of a dashboard view over your data, this is by far the lightest way to get there: you write a normal .qmd file with a format: dashboard YAML key, and Quarto handles the layout.
---
title: "Sales Overview"
format: dashboard
---
#| title: "Revenue by Quarter"
ggplot(sales, aes(quarter, revenue)) + geom_col()
For interactive behaviour (filters, user inputs), Quarto dashboards can embed Shiny, Observable JS, or even Python/R running in the browser via webR/Pyodide. For readers whose dashboards are read-only and updated nightly, this alone replaces a lot of what used to require a Shiny or Dash server.
One of the highest-value Quarto (and R Markdown) features for a business analytics audience is parameterised reports: a single template that you can render with different inputs to produce many tailored outputs. For example, a monthly sales report can take a region and a month parameter and be rendered once for every region without copy-pasting the document.
---
title: "Sales Report"
format: html
params:
  region: "South Africa"
  month: "2026-03"
---
Inside the document you refer to params$region in R. With the Jupyter engine for Python, parameters are instead declared as ordinary variables in a cell tagged parameters. Either way, you render with:
quarto render sales.qmd -P region:"EU" -P month:"2026-03"
This single pattern replaces a surprising amount of the ad-hoc “one notebook per client” sprawl that plagues data science teams.
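Rendering one report per region is usually scripted rather than typed by hand. The following Python sketch builds one `quarto render` command per region, using the `-P` flag shown above plus `--output` to give each result its own file name. The region list, the output naming scheme, and the helper name `build_render_command` are assumptions for illustration; the commands are only printed here, not executed.

```python
import shlex


def build_render_command(qmd_file: str, params: dict, output: str) -> list:
    """Return the argument list for rendering one parameterised report."""
    command = ["quarto", "render", qmd_file]
    for name, value in params.items():
        command += ["-P", f"{name}:{value}"]  # one -P flag per parameter
    command += ["--output", output]
    return command


regions = ["EU", "South Africa", "APAC"]  # hypothetical list of regions
commands = []
for region in regions:
    slug = region.lower().replace(" ", "-")  # file-name friendly version
    commands.append(
        build_render_command(
            "sales.qmd",
            {"region": region, "month": "2026-03"},
            f"sales-{slug}.html",
        )
    )

for command in commands:
    print(shlex.join(command))  # shell-quoted form, for inspection only
```

To actually render, each list could be passed to `subprocess.run(command, check=True)`, assuming the Quarto CLI is installed and on the PATH. Passing a list rather than a single string avoids shell-quoting problems with values such as "South Africa".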
When creating data science reports that require a professional appearance, particularly for academic or formal business contexts, LaTeX provides powerful typesetting capabilities. While Markdown is excellent for simple documents, LaTeX excels at complex formatting, mathematical equations, and producing publication-quality PDFs.
LaTeX offers several advantages for data science documentation:
LaTeX documents, particularly those with programmatically generated figures, tend to be more reproducible than those created with proprietary document formats.
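The same reproducibility argument applies to tables: because LaTeX source is plain text, a script can emit it directly from results instead of someone retyping numbers. Here is a minimal Python sketch that writes a table using the booktabs rules (`\toprule`, `\midrule`, `\bottomrule`) that appear later in this chapter. The function name `to_booktabs` and the sample numbers are illustrative only, and the sketch does not escape LaTeX special characters such as `&` or `%` in cell text.

```python
def to_booktabs(header: list, rows: list, caption: str) -> str:
    """Build a LaTeX table: first column left-aligned text, the rest numbers."""
    column_spec = "l" + "r" * (len(header) - 1)
    lines = [
        r"\begin{table}[htbp]",
        r"\centering",
        rf"\caption{{{caption}}}",
        rf"\begin{{tabular}}{{{column_spec}}}",
        r"\toprule",
        " & ".join(header) + r" \\",
        r"\midrule",
    ]
    for name, *values in rows:
        cells = [name] + [f"{value:.2f}" for value in values]  # two decimals
        lines.append(" & ".join(cells) + r" \\")
    lines += [r"\bottomrule", r"\end{tabular}", r"\end{table}"]
    return "\n".join(lines)


table = to_booktabs(
    ["Model", "Accuracy", "Precision", "Recall"],
    [("Random Forest", 0.92, 0.89, 0.94), ("XGBoost", 0.95, 0.92, 0.91)],
    "Model Performance Comparison",
)
print(table)
```

The printed text can be saved to a `.tex` file and pulled into a report, so that re-running the analysis updates the table without manual edits.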
LaTeX works differently from word processors: you write plain text with special commands, then compile it to produce a PDF. For data science, you don’t need to install a full LaTeX distribution, because Quarto and R Markdown drive the compilation for you and only need a minimal distribution to be present.
The easiest way to install LaTeX for use with Quarto or R Markdown is to use TinyTeX, a lightweight LaTeX distribution:
In R:
install.packages("tinytex")
tinytex::install_tinytex()
In the command line with Quarto:
quarto install tinytex
TinyTeX is designed specifically for R Markdown and Quarto users. It installs only the essential LaTeX packages (around 150MB) compared to full distributions (several GB), and it automatically installs additional packages as needed when you render documents.
Let’s explore the essential LaTeX elements you’ll need for data science documentation:
A basic LaTeX document structure looks like this:
\documentclass{article}
\usepackage{graphicx} % For images
\usepackage{amsmath} % For advanced math
\usepackage{booktabs} % For professional tables
\title{Analysis of Customer Purchasing Patterns}
\author{Your Name}
\date{\today}
\begin{document}
\maketitle
\tableofcontents
\section{Introduction}
This report analyses...
\section{Methodology}
\subsection{Data Collection}
We collected data from...
\section{Results}
The results show...
\section{Conclusion}
In conclusion...
\end{document}
When using Quarto or R Markdown, you won’t write this structure directly. Instead, it’s generated based on your YAML header and document content.
LaTeX shines when it comes to mathematical notation. Here are examples of common equation formats:
Inline equations use single dollar signs:
The model accuracy is $\alpha = 0.95$, which exceeds our threshold.
Display equations use double dollar signs:
$$
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
$$
Equation arrays for multi-line equations:
\begin{align}
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \\
&= \beta_0 + \sum_{i=1}^{2} \beta_i X_i + \epsilon
\end{align}
Some common math symbols in data science:
| Description | LaTeX Code | Result |
|---|---|---|
| Summation | `\sum_{i=1}^{n}` | \(\sum_{i=1}^{n}\) |
| Product | `\prod_{i=1}^{n}` | \(\prod_{i=1}^{n}\) |
| Fraction | `\frac{a}{b}` | \(\frac{a}{b}\) |
| Square root | `\sqrt{x}` | \(\sqrt{x}\) |
| Bar (mean) | `\bar{X}` | \(\bar{X}\) |
| Hat (estimate) | `\hat{\beta}` | \(\hat{\beta}\) |
| Greek letters | `\alpha, \beta, \gamma` | \(\alpha, \beta, \gamma\) |
| Infinity | `\infty` | \(\infty\) |
| Approximately equal | `\approx` | \(\approx\) |
| Distribution | `X \sim N(\mu, \sigma^2)` | \(X \sim N(\mu, \sigma^2)\) |
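In a computational document, the numbers inside such notation usually come from code. As a small sketch, the Python below computes the sample mean exactly as the display equation for $\bar{X}$ defines it and formats the result as an inline LaTeX expression. The data values and the names `sample_mean` and `sentence` are made up for illustration.

```python
def sample_mean(values: list) -> float:
    """Compute (1/n) * sum of X_i, matching the formula for the sample mean."""
    n = len(values)
    return sum(values) / n


observations = [2.0, 4.0, 6.0, 8.0]  # made-up data
mean = sample_mean(observations)

# Doubled braces produce literal { } in the f-string; r keeps the backslash
sentence = rf"The sample mean is $\bar{{X}} = {mean:.2f}$."
print(sentence)
```

Generating the sentence from the computed value keeps the reported number and the analysis in sync, which is the same idea behind inline code in R Markdown and Quarto.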
LaTeX can create publication-quality tables. The booktabs package is recommended for professional-looking tables with proper spacing:
\begin{table}[htbp]
\centering
\caption{Model Performance Comparison}
\begin{tabular}{lrrr}
\toprule
Model & Accuracy & Precision & Recall \\
\midrule
Random Forest & 0.92 & 0.89 & 0.94 \\
XGBoost & 0.95 & 0.92 & 0.91 \\
Neural Network & 0.90 & 0.87 & 0.92 \\
\bottomrule
\end{tabular}
\end{table}
To include figures with proper captioning and referencing:
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\textwidth]{histogram.png}
\caption{Distribution of customer spending by category}
\label{fig:spending-dist}
\end{figure}
As shown in Figure \ref{fig:spending-dist}, the distribution is right-skewed.
Quarto makes it easy to incorporate LaTeX features while keeping your document source readable. Here’s how to configure Quarto for PDF output using LaTeX:
In your Quarto YAML header, specify PDF output with LaTeX options:
---
title: "Analysis Report"
author: "Your Name"
format:
  pdf:
    documentclass: article
    geometry:
      - margin=1in
    fontfamily: libertinus
    colorlinks: true
    number-sections: true
    fig-width: 7
    fig-height: 5
    cite-method: biblatex
    biblio-style: apa
---
You can further customise the LaTeX template by:
Including raw LaTeX: Use the raw attribute to include LaTeX commands
```{=latex}
\begin{center}
\large\textbf{Confidential Report}
\end{center}
```
Adding LaTeX packages: Include additional packages in the YAML
format:
  pdf:
    include-in-header:
      text: |
        \usepackage{siunitx}
        \usepackage{algorithm2e}
Using a custom template: Create your own template for full control
format:
  pdf:
    template: custom-template.tex
Quarto supports LaTeX math syntax directly:
The linear regression model can be represented as:
$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
$$
where $\epsilon_i \sim N(0, \sigma^2)$.
For managing citations, create a BibTeX file (e.g., references.bib):
@article{knuth84,
author = {Knuth, Donald E.},
title = {Literate Programming},
year = {1984},
journal = {The Computer Journal},
volume = {27},
number = {2},
pages = {97--111}
}
Then cite in your Quarto document:
Literate programming [@knuth84] combines documentation and code.
And configure in YAML:
bibliography: references.bib
csl: ieee.csl # Citation style
The algorithm2e package helps document computational methods:
\begin{algorithm}[H]
\SetAlgoLined
\KwData{Training data $X$, target values $y$}
\KwResult{Trained model $M$}
Split data into training and validation sets\;
Initialize model $M$ with random weights\;
\For{each epoch}{
\For{each batch}{
Compute predictions $\hat{y}$\;
Calculate loss $L(y, \hat{y})$\;
Update model weights using gradient descent\;
}
Evaluate on validation set\;
\If{early stopping condition met}{
break\;
}
}
\caption{Training Neural Network with Early Stopping}
\end{algorithm}
For reporting analysis results with significance levels:
\begin{table}[htbp]
\centering
\caption{Regression Results}
\begin{tabular}{lrrrr}
\toprule
Variable & Coefficient & Std. Error & t-statistic & p-value \\
\midrule
Intercept & 23.45 & 2.14 & 10.96 & $<0.001^{***}$ \\
Age & -0.32 & 0.05 & -6.4 & $<0.001^{***}$ \\
Income & 0.015 & 0.004 & 3.75 & $0.002^{**}$ \\
Education & 1.86 & 0.72 & 2.58 & $0.018^{*}$ \\
\bottomrule
\multicolumn{5}{l}{\scriptsize{$^{*}p<0.05$; $^{**}p<0.01$; $^{***}p<0.001$}} \\
\end{tabular}
\end{table}
For comparing visualisations side by side:
% The subfigure environment requires \usepackage{subcaption} in the preamble
\begin{figure}[htbp]
\centering
\begin{subfigure}{0.48\textwidth}
\includegraphics[width=\textwidth]{model1_results.png}
\caption{Linear Model Performance}
\label{fig:model1}
\end{subfigure}
\hfill
\begin{subfigure}{0.48\textwidth}
\includegraphics[width=\textwidth]{model2_results.png}
\caption{Neural Network Performance}
\label{fig:model2}
\end{subfigure}
\caption{Performance comparison of predictive models}
\label{fig:models-comparison}
\end{figure}
If you’re using R Markdown instead of Quarto, the approach is similar:
---
title: "Statistical Analysis Report"
author: "Your Name"
output:
  pdf_document:
    toc: true
    number_sections: true
    fig_caption: true
    keep_tex: true # Useful for debugging
    includes:
      in_header: preamble.tex
---
The preamble.tex file can contain additional LaTeX packages and configurations:
% preamble.tex
\usepackage{booktabs}
\usepackage{longtable}
\usepackage{array}
\usepackage{multirow}
\usepackage{wrapfig}
\usepackage{float}
\usepackage{colortbl}
\usepackage{pdflscape}
\usepackage{tabu}
\usepackage{threeparttable}
\usepackage{threeparttablex}
\usepackage[normalem]{ulem}
\usepackage{makecell}
\usepackage{xcolor}
LaTeX can sometimes produce cryptic error messages. Here are solutions to common issues:
If you get an error about a missing package when rendering:
! LaTeX Error: File 'tikz.sty' not found.
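With TinyTeX, you can install the missing package yourself. Note that the file tikz.sty ships in the TeX Live package pgf, so that is the name to install:
tinytex::tlmgr_install("pgf")
TinyTeX also tries to install missing packages automatically when you render. To see more detail about what it is doing, enable verbose output:
options(tinytex.verbose = TRUE)
If figures aren’t appearing where expected: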
\begin{figure}[!htbp] % The ! makes LaTeX try harder to respect placement
For large tables that need to span pages:
\begin{longtable}{lrrr}
\caption{Comprehensive Model Results}\\
\toprule
Model & Accuracy & Precision & Recall \\
\midrule
\endhead
% Table contents...
\bottomrule
\end{longtable}
If compilation seems to hang, it might be waiting for user input due to an error. Try:
# In R
tinytex::pdflatex('document.tex', engine_args = '-interaction=nonstopmode')
LaTeX has been the de facto gold standard for scientific documentation for decades, and for good reason. Many PDF workflows in scientific publishing still use LaTeX under the hood, making it the backbone of academic publishing, technical reports, and mathematical documentation. When you generate a PDF from Quarto or R Markdown with the default settings, you’re ultimately leveraging LaTeX’s sophisticated typesetting engine.
While LaTeX provides unmatched power and precision for creating professional data science documents, especially when mathematical notation is involved, there is undeniably a learning curve. The integration with Quarto and R Markdown has made LaTeX more accessible by handling much of the complexity behind the scenes, allowing you to focus on content rather than typesetting commands.
However, the document preparation landscape is evolving. Newer tools like Typst are emerging as modern alternatives that aim to simplify the traditional LaTeX workflow while maintaining high-quality output. Typst offers several advantages:
Simpler Syntax: Where LaTeX might require complex commands, Typst uses more intuitive markup:
// Typst syntax
= Introduction
== Subsection
$x = (a + b) / c$ // Math notation
#figure(
image("plot.png"),
caption: "Sample Plot"
)
Compare this to equivalent LaTeX:
% LaTeX syntax
\section{Introduction}
\subsection{Subsection}
$x = \frac{a + b}{c}$
\begin{figure}
\includegraphics{plot.png}
\caption{Sample Plot}
\end{figure}
Faster Compilation: Typst compiles documents significantly faster than LaTeX, making it more suitable for iterative document development.
Better Error Messages: When something goes wrong, Typst provides clearer, more actionable error messages compared to LaTeX’s often cryptic feedback.
Modern Design: Built from the ground up with modern document needs in mind, including better handling of digital-first workflows.
For data scientists starting their journey, here’s how to think about these tools:
Choose LaTeX when:
Consider Typst when:
The Quarto Advantage: One of Quarto’s strengths is that it abstracts away many of these decisions. You can often switch between PDF engines, or target Typst with format: typst (available since Quarto 1.4), without changing your content, giving you flexibility as the ecosystem evolves.
As you progress in your data science career, investing time in understanding document preparation will pay dividends when creating reports, papers, or presentations that require precise typesetting and mathematical expressions. Whether you choose the established power of LaTeX or explore newer alternatives like Typst, start with the basics and gradually incorporate more advanced features as your needs grow.
The key is to pick the tool that best fits your current workflow and requirements, knowing that the fundamental principles of good document structure and clear communication remain constant regardless of the underlying technology.
For more complex projects, specialised documentation tools may be needed:
MkDocs creates a documentation website from Markdown files:
# Install MkDocs
pip install mkdocs
# Create a new project
mkdocs new my-documentation
# Serve the documentation locally
cd my-documentation
mkdocs serve
MkDocs is focused on simplicity and readability. It generates a clean, responsive website from your Markdown files, with navigation, search, and themes. This makes it an excellent choice for project documentation that needs to be accessible to users or team members.
Sphinx is a more powerful documentation tool widely used in the Python ecosystem:
# Install Sphinx
pip install sphinx
# Create a new documentation project
sphinx-quickstart docs
# Build the documentation
cd docs
make html
Sphinx offers advanced features like automatic API documentation generation, cross-referencing, and multiple output formats. It’s the system behind the official documentation for Python itself and many major libraries like NumPy, pandas, and scikit-learn.
The reason Sphinx has become the standard for Python documentation is its powerful extension system and its ability to generate API documentation automatically from docstrings in your code. This means you can document your functions and classes directly in your code, and Sphinx will extract and format that information into comprehensive documentation.
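As an illustration of what "documenting in your code" looks like, here is a small Python function with a docstring written in the reStructuredText field style that Sphinx's autodoc extension understands. The function `rmse` is invented for this example. The last two lines use the standard `inspect` module to show the raw material a documentation tool can pull out of the code; they are not Sphinx itself.

```python
import inspect


def rmse(actual: list, predicted: list) -> float:
    """Return the root mean squared error between two sequences.

    :param actual: Observed values.
    :param predicted: Model predictions, same length as ``actual``.
    :returns: The square root of the mean squared difference.
    :raises ValueError: If the sequences are empty or differ in length.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return (sum(squared) / len(squared)) ** 0.5


# What a documentation generator can read without running the function
print(inspect.signature(rmse))
print(inspect.getdoc(rmse))
```

Keeping the description next to the code means a change to the function and a change to its documentation happen in the same place and the same commit.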
When using external data files in Quarto projects, it’s important to understand how to handle file paths properly to ensure reproducibility across different environments.
The error 'my_data.csv' does not exist in current working directory is a common issue when transitioning between different editing environments like VS Code and RStudio. This happens because:
The here package provides an elegant solution by creating paths relative to your project root:
library(tidyverse)
library(here)
# Load data using project-relative path
data <- read_csv(here("data", "my_data.csv"))
head(data)
The here() function automatically detects your project root (usually where your .Rproj file is located) and constructs paths relative to that location. This ensures consistent file access regardless of:
To implement this approach:
- Keep your data files in a data folder in your project root
- Use here("data", "filename.csv") to reference them
For maximum reproducibility, consider using built-in datasets that come with R packages:
# Load a dataset from a package
data(diamonds, package = "ggplot2")
# Display the first few rows
head(diamonds)
Using built-in datasets eliminates file path issues entirely, as these datasets are available to anyone who has the package installed. This is ideal for examples and tutorials where the specific data isn’t crucial.
Another reproducible approach is to generate sample data within your code:
# Create synthetic data
set.seed(491) # For reproducibility
synthetic_data <- tibble(
  id = 1:20,
  value_x = rnorm(20),
  value_y = value_x * 2 + rnorm(20, sd = 0.5),
  category = sample(LETTERS[1:4], 20, replace = TRUE)
)
# Display the data
head(synthetic_data)
This approach works well for illustrative examples and ensures anyone can run your code without any external files.
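If your document uses Python rather than R, the same pattern can be sketched with the standard library alone. The function name `make_synthetic_data` is hypothetical, and the generated numbers will not match the R output above because the two languages use different random number generators; what carries over is that a fixed seed gives the same data on every run.

```python
import random


def make_synthetic_data(n: int = 20, seed: int = 491) -> list:
    """Generate n rows of seeded synthetic data as a list of dictionaries."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    rows = []
    for i in range(1, n + 1):
        value_x = rng.gauss(0, 1)
        value_y = value_x * 2 + rng.gauss(0, 0.5)  # linear in x plus noise
        rows.append({
            "id": i,
            "value_x": value_x,
            "value_y": value_y,
            "category": rng.choice("ABCD"),
        })
    return rows


synthetic_data = make_synthetic_data()
for row in synthetic_data[:3]:  # display the first few rows
    print(row)
```

Using a dedicated `random.Random` instance rather than the module-level functions means other code that draws random numbers cannot silently change the data your report is built on.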
For real-world datasets that are too large to include in packages, you can fetch them from reliable URLs:
# URL to a stable dataset (ggplot2's default branch is 'main', not 'master')
url <- "https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/diamonds.csv"
# Download and read the data
remote_data <- readr::read_csv(url)
# Display the data
head(remote_data)
Adding the cache: true option to this chunk tells Quarto to save the results and only re-execute the chunk when its code changes, which prevents unnecessary downloads.
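Outside Quarto, for example in a plain script, a similar effect can be had by caching the downloaded file on disk. The Python sketch below is a different technique from Quarto's chunk cache (it keys on the URL rather than on the code), and the names `cached_download` and `fetch_bytes` are invented for illustration. The demonstration swaps in a fake fetch function so that it runs without network access.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path


def fetch_bytes(url: str) -> bytes:
    """Download a URL and return its raw content."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_download(url: str, cache_dir: Path, fetch=fetch_bytes) -> Path:
    """Return a local copy of url, downloading only if it is not cached yet."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Name the cache file after a hash of the URL so distinct URLs never collide
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".csv"
    target = cache_dir / name
    if not target.exists():
        target.write_bytes(fetch(url))
    return target


# Demonstration with a stand-in fetcher instead of a real download
calls = []


def fake_fetch(url: str) -> bytes:
    calls.append(url)
    return b"carat,price\n0.23,326\n"


with tempfile.TemporaryDirectory() as tmp:
    url = "https://example.com/diamonds.csv"  # placeholder URL
    first = cached_download(url, Path(tmp), fetch=fake_fetch)
    second = cached_download(url, Path(tmp), fetch=fake_fetch)
    same_file = first == second
    contents = first.read_text()

print(f"fetch was called {len(calls)} time(s)")
```

One trade-off to keep in mind: a cache like this never notices when the remote file changes, so delete the cached copy (or include a date in the file name) when you need fresh data.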
Effective documentation follows certain principles:
Projects with comprehensive documentation tend to have fewer defects and require less maintenance effort. Well-documented data science projects are also more likely to be reproducible and reusable by others.
The practice of documenting your work isn’t just about helping others understand what you’ve done—it also helps you think more clearly about your own process. By explaining your choices and methods in writing, you often gain new insights and identify potential improvements in your approach.
This chapter has walked through the document side of data science output: Markdown and Quarto for literate programming, LaTeX (and newer alternatives like Typst) for publication-quality typesetting, parameterised reports for templating, and reproducible data-loading patterns so your reports can be rendered from any machine.
What we haven’t covered yet are the charts that live inside these documents and the interactive applications that extend beyond them:
Together these three chapters form a progression: from static documents that communicate findings, to visualisations that make those findings immediate, to interactive applications that invite exploration.