As a data scientist, sharing your findings clearly is just as important as the analysis itself. Now that we have our analytics platforms set up, let’s explore tools for creating reports, documentation, and presentations.
Markdown is a lightweight markup language that’s easy to read and write. It forms the basis of many documentation systems.
Markdown’s simplicity and widespread support have made it the de facto standard for documentation in data science projects.
# Heading 1
## Heading 2
### Heading 3
**Bold text**
*Italic text*
[Link text](https://example.com)

- Bullet point 1
- Bullet point 2
1. Numbered item 1
2. Numbered item 2
Table:
| Column 1 | Column 2 |
|----------|----------|
| Cell 1 | Cell 2 |
> This is a blockquote
`Inline code`
```{python}
# Code block
print("Hello, world!")
```
Markdown is designed to be readable even in its raw form. The syntax is intuitive: surrounding text with single asterisks makes it italic, double asterisks make it bold, and hash symbols create headings of different levels.
Many platforms interpret Markdown, including GitHub, Jupyter notebooks, and the documentation tools we’ll discuss next.
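Because the mapping from Markdown to HTML is so mechanical, it is easy to sketch in code. The toy converter below handles only a tiny subset of the syntax (the function name and subset are ours for illustration; real renderers, such as CommonMark implementations, handle many more edge cases):

```python
import re

def mini_markdown(line: str) -> str:
    """Convert one line of a tiny Markdown subset to HTML (toy example)."""
    # Headings: one to three leading # characters
    m = re.match(r"(#{1,3}) (.*)", line)
    if m:
        level = len(m.group(1))
        return f"<h{level}>{m.group(2)}</h{level}>"
    # Bold before italic, so ** is not consumed as two italic markers
    line = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", line)
    line = re.sub(r"\*(.+?)\*", r"<em>\1</em>", line)
    # Links: [text](url)
    line = re.sub(r"\[(.+?)\]\((.+?)\)", r'<a href="\2">\1</a>', line)
    return line
```

For instance, `mini_markdown("## Heading 2")` yields `<h2>Heading 2</h2>`.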
R Markdown combines R code, output, and narrative text in a single document that can be rendered to HTML, PDF, Word, and other formats.
The concept of “literate programming” behind R Markdown was first proposed by computer scientist Donald Knuth in 1984, and it has become a cornerstone of reproducible research in data science.
If you’ve installed R and RStudio as described earlier, R Markdown is just a package installation away:
install.packages("rmarkdown")

To create your first R Markdown document, select File > New File > R Markdown... in RStudio, enter a title and author, choose an output format, and click OK.
RStudio creates a template document with examples of text, code chunks, and plots. This template is extremely helpful because it shows you the basic structure of an R Markdown document right away—you don’t have to start from scratch.
A typical R Markdown document consists of three components: a YAML header with metadata, code chunks, and narrative text written in Markdown.
For example:
---
title: "My First Data Analysis"
author: "Your Name"
date: "2025-04-30"
output: html_document
---
# Introduction
This analysis explores the relationship between variables X (carat) and Y (price).
## Data Import and Cleaning
```{r setup, eval=FALSE}
# Load the diamonds dataset from ggplot2
data(diamonds, package = "ggplot2")

# Create a smaller sample of the diamonds dataset
set.seed(123) # For reproducibility
# The native |> pipe avoids needing library(magrittr) for %>%
my_data <- diamonds |>
  dplyr::sample_n(1000) |>
  dplyr::select(
    X = carat,
    Y = price,
    cut,
    color,
    clarity
  )

# Display the first few rows
head(my_data)
```
## Data Visualization
```{r visualization, eval=FALSE}
ggplot2::ggplot(my_data, ggplot2::aes(x = X, y = Y)) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth(method = "lm") +
  ggplot2::labs(title = "Relationship between X and Y")
```

Note that we’ve used the namespace convention (`package::function()`) in the code above, rather than loading each package with `library()` first. This is not strictly necessary and is a matter of preference, but benefits of this convention include:

- It makes explicit which package each function comes from
- It avoids name clashes when two packages export functions with the same name
When you click the “Knit” button in RStudio, the R code in the chunks is executed, and the results (including plots and tables) are embedded in the output document. The reason this is so powerful is that it combines your code, results, and narrative explanation in a single, reproducible document. If your data changes, you simply re-knit the document to update all results automatically.
R Markdown has become a standard in reproducible research because it creates a direct connection between your data, analysis, and conclusions. This connection makes your work more transparent and reliable, as anyone can follow your exact steps and see how you reached your conclusions.
We’ve already covered Jupyter notebooks for Python development, but they’re also excellent documentation tools. Like R Markdown, they combine code, output, and narrative text.
Jupyter notebooks can be exported to various formats, including HTML, PDF, Markdown, and reveal.js slides, directly from the notebook interface.
Alternatively, you can use nbconvert from the command line:
jupyter nbconvert --to html my_notebook.ipynb

The ability to export notebooks is particularly valuable because it allows you to write your analysis once and then distribute it in whatever format your audience needs. For example, you might use PDF for a formal report to stakeholders, HTML for sharing on a website, or Markdown for including in a GitHub repository.
For larger documentation projects, Jupyter Book builds on the notebook format to create complete books:
# Install Jupyter Book
pip install jupyter-book
# Create a new book project
jupyter-book create my-book
# Build the book
jupyter-book build my-book/

Jupyter Book organizes multiple notebooks and Markdown files into a cohesive book with navigation, search, and cross-references. This is especially useful for comprehensive documentation, tutorials, or course materials. The resulting books have a professional appearance with a table of contents, navigation panel, and consistent styling throughout.
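The book’s structure is declared in a `_toc.yml` file at the project root. A minimal sketch (the chapter filenames here are hypothetical):

```yaml
# _toc.yml: table of contents for a Jupyter Book
format: jb-book
root: intro            # landing page (intro.md or intro.ipynb)
chapters:
  - file: data-cleaning
  - file: modeling
```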
Quarto is a newer system that works with both Python and R, unifying the best aspects of R Markdown and Jupyter notebooks.
# Install Quarto CLI from https://quarto.org/docs/get-started/
# Create a new Quarto document
quarto create document
# Render a document
quarto render document.qmd

Quarto represents an evolution in documentation tools because it provides a unified system for creating computational documents with multiple programming languages. This is particularly valuable if you work with both Python and R, as you can maintain a consistent documentation approach across all your projects.
The key advantage of Quarto is its language-agnostic design—you can mix Python, R, Julia, and other languages in a single document, which reflects the reality of many data science workflows where different tools are used for different tasks.
When creating data science reports that require a professional appearance, particularly for academic or formal business contexts, LaTeX provides powerful typesetting capabilities. While Markdown is excellent for simple documents, LaTeX excels at complex formatting, mathematical equations, and producing publication-quality PDFs.
LaTeX offers several advantages for data science documentation:
LaTeX documents, particularly those with programmatically generated figures, tend to be more reproducible than those created with proprietary document formats.
LaTeX works differently from word processors—you write plain text with special commands, then compile it to produce a PDF. For data science, you don’t need to install a full LaTeX distribution, as Quarto and R Markdown can handle the compilation process.
The easiest way to install LaTeX for use with Quarto or R Markdown is to use TinyTeX, a lightweight LaTeX distribution:
In R:
install.packages("tinytex")
tinytex::install_tinytex()

In the command line with Quarto:
quarto install tinytex

TinyTeX is designed specifically for R Markdown and Quarto users. It installs only the essential LaTeX packages (around 150 MB, compared to several GB for a full distribution), and it automatically installs additional packages as needed when you render documents.
Let’s explore the essential LaTeX elements you’ll need for data science documentation:
A basic LaTeX document structure looks like this:
\documentclass{article}
\usepackage{graphicx} % For images
\usepackage{amsmath} % For advanced math
\usepackage{booktabs} % For professional tables
\title{Analysis of Customer Purchasing Patterns}
\author{Your Name}
\date{\today}
\begin{document}
\maketitle
\tableofcontents
\section{Introduction}
This report analyzes...
\section{Methodology}
\subsection{Data Collection}
We collected data from...
\section{Results}
The results show...
\section{Conclusion}
In conclusion...
\end{document}

When using Quarto or R Markdown, you won’t write this structure directly. Instead, it’s generated based on your YAML header and document content.
LaTeX shines when it comes to mathematical notation. Here are examples of common equation formats:
Inline equations use single dollar signs:
The model accuracy is $\alpha = 0.95$, which exceeds our threshold.

Display equations use double dollar signs:
$$
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
$$

Equation arrays for multi-line equations:
\begin{align}
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \\
&= \beta_0 + \sum_{i=1}^{2} \beta_i X_i + \epsilon
\end{align}

Some common math symbols in data science:
| Description | LaTeX Code | Result |
|---|---|---|
| Summation | `\sum_{i=1}^{n}` | \(\sum_{i=1}^{n}\) |
| Product | `\prod_{i=1}^{n}` | \(\prod_{i=1}^{n}\) |
| Fraction | `\frac{a}{b}` | \(\frac{a}{b}\) |
| Square root | `\sqrt{x}` | \(\sqrt{x}\) |
| Bar (mean) | `\bar{X}` | \(\bar{X}\) |
| Hat (estimate) | `\hat{\beta}` | \(\hat{\beta}\) |
| Greek letters | `\alpha, \beta, \gamma` | \(\alpha, \beta, \gamma\) |
| Infinity | `\infty` | \(\infty\) |
| Approximately equal | `\approx` | \(\approx\) |
| Distribution | `X \sim N(\mu, \sigma^2)` | \(X \sim N(\mu, \sigma^2)\) |
LaTeX can create publication-quality tables. The booktabs package is recommended for professional-looking tables with proper spacing:
\begin{table}[htbp]
\centering
\caption{Model Performance Comparison}
\begin{tabular}{lrrr}
\toprule
Model & Accuracy & Precision & Recall \\
\midrule
Random Forest & 0.92 & 0.89 & 0.94 \\
XGBoost & 0.95 & 0.92 & 0.91 \\
Neural Network & 0.90 & 0.87 & 0.92 \\
\bottomrule
\end{tabular}
\end{table}

To include figures with proper captioning and referencing:
\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\textwidth]{histogram.png}
\caption{Distribution of customer spending by category}
\label{fig:spending-dist}
\end{figure}
As shown in Figure \ref{fig:spending-dist}, the distribution is right-skewed.

Quarto makes it easy to incorporate LaTeX features while keeping your document source readable. Here’s how to configure Quarto for PDF output using LaTeX:
In your Quarto YAML header, specify PDF output with LaTeX options:
---
title: "Analysis Report"
author: "Your Name"
format:
  pdf:
    documentclass: article
    geometry:
      - margin=1in
    fontfamily: libertinus
    colorlinks: true
    number-sections: true
    fig-width: 7
    fig-height: 5
    cite-method: biblatex
    biblio-style: apa
---

You can further customize the LaTeX template by:
Including raw LaTeX: Use the raw attribute to include LaTeX commands
```{=latex}
\begin{center}
\large\textbf{Confidential Report}
\end{center}
```

Adding LaTeX packages: Include additional packages in the YAML
format:
  pdf:
    include-in-header:
      text: |
        \usepackage{siunitx}
        \usepackage{algorithm2e}

Using a custom template: Create your own template for full control
format:
  pdf:
    template: custom-template.tex

Quarto supports LaTeX math syntax directly:
The linear regression model can be represented as:
$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
$$
where $\epsilon_i \sim N(0, \sigma^2)$.

For managing citations, create a BibTeX file (e.g., references.bib):
@article{knuth84,
author = {Knuth, Donald E.},
title = {Literate Programming},
year = {1984},
journal = {The Computer Journal},
volume = {27},
number = {2},
pages = {97--111}
}

Then cite in your Quarto document:
Literate programming [@knuth84] combines documentation and code.

And configure in YAML:
bibliography: references.bib
csl: ieee.csl # Citation style

The algorithm2e package helps document computational methods:
\begin{algorithm}[H]
\SetAlgoLined
\KwData{Training data $X$, target values $y$}
\KwResult{Trained model $M$}
Split data into training and validation sets\;
Initialize model $M$ with random weights\;
\For{each epoch}{
\For{each batch}{
Compute predictions $\hat{y}$\;
Calculate loss $L(y, \hat{y})$\;
Update model weights using gradient descent\;
}
Evaluate on validation set\;
\If{early stopping condition met}{
break\;
}
}
\caption{Training Neural Network with Early Stopping}
\end{algorithm}

For reporting analysis results with significance levels:
\begin{table}[htbp]
\centering
\caption{Regression Results}
\begin{tabular}{lrrrr}
\toprule
Variable & Coefficient & Std. Error & t-statistic & p-value \\
\midrule
Intercept & 23.45 & 2.14 & 10.96 & $<0.001^{***}$ \\
Age & -0.32 & 0.05 & -6.4 & $<0.001^{***}$ \\
Income & 0.015 & 0.004 & 3.75 & $0.002^{**}$ \\
Education & 1.86 & 0.72 & 2.58 & $0.018^{*}$ \\
\bottomrule
\multicolumn{5}{l}{\scriptsize{$^{*}p<0.05$; $^{**}p<0.01$; $^{***}p<0.001$}} \\
\end{tabular}
\end{table}

For comparing visualizations side by side:
\begin{figure}[htbp] % subfigure requires \usepackage{subcaption}
\centering
\begin{subfigure}{0.48\textwidth}
\includegraphics[width=\textwidth]{model1_results.png}
\caption{Linear Model Performance}
\label{fig:model1}
\end{subfigure}
\hfill
\begin{subfigure}{0.48\textwidth}
\includegraphics[width=\textwidth]{model2_results.png}
\caption{Neural Network Performance}
\label{fig:model2}
\end{subfigure}
\caption{Performance comparison of predictive models}
\label{fig:models-comparison}
\end{figure}If you’re using R Markdown instead of Quarto, the approach is similar:
---
title: "Statistical Analysis Report"
author: "Your Name"
output:
  pdf_document:
    toc: true
    number_sections: true
    fig_caption: true
    keep_tex: true # Useful for debugging
    includes:
      in_header: preamble.tex
---

The preamble.tex file can contain additional LaTeX packages and configurations:
% preamble.tex
\usepackage{booktabs}
\usepackage{longtable}
\usepackage{array}
\usepackage{multirow}
\usepackage{wrapfig}
\usepackage{float}
\usepackage{colortbl}
\usepackage{pdflscape}
\usepackage{tabu}
\usepackage{threeparttable}
\usepackage{threeparttablex}
\usepackage[normalem]{ulem}
\usepackage{makecell}
\usepackage{xcolor}

LaTeX can sometimes produce cryptic error messages. Here are solutions to common issues:
If you get an error about a missing package when rendering:
! LaTeX Error: File 'tikz.sty' not found.
With TinyTeX, you can install the missing package:
tinytex::tlmgr_install("pgf") # tikz.sty ships in the pgf package

Or let TinyTeX handle it automatically:
options(tinytex.verbose = TRUE)

If figures aren’t appearing where expected:
\begin{figure}[!htbp] % The ! makes LaTeX try harder to respect placementFor large tables that need to span pages:
\begin{longtable}{lrrr}
\caption{Comprehensive Model Results}\\
\toprule
Model & Accuracy & Precision & Recall \\
\midrule
\endhead
% Table contents...
\bottomrule
\end{longtable}If compilation seems to hang, it might be waiting for user input due to an error. Try:
# In R
tinytex::pdflatex('document.tex', pdflatex_args = c('-interaction=nonstopmode'))

LaTeX has been the de facto gold standard for scientific documentation for decades, and for good reason. It remains the backbone of academic publishing, technical reports, and mathematical documentation, and when you generate a PDF from Quarto or R Markdown, you’re ultimately leveraging LaTeX’s sophisticated typesetting engine.
While LaTeX provides unmatched power and precision for creating professional data science documents, especially when mathematical notation is involved, there is undeniably a learning curve. The integration with Quarto and R Markdown has made LaTeX more accessible by handling much of the complexity behind the scenes, allowing you to focus on content rather than typesetting commands.
However, the document preparation landscape is evolving. Newer tools like Typst are emerging as modern alternatives that aim to simplify the traditional LaTeX workflow while maintaining high-quality output. Typst offers several advantages:
Simpler Syntax: Where LaTeX might require complex commands, Typst uses more intuitive markup:
// Typst syntax
= Introduction
== Subsection
$x = (a + b) / c$ // Math notation
#figure(
image("plot.png"),
caption: "Sample Plot"
)

Compare this to equivalent LaTeX:
% LaTeX syntax
\section{Introduction}
\subsection{Subsection}
$x = \frac{a + b}{c}$
\begin{figure}
\includegraphics{plot.png}
\caption{Sample Plot}
\end{figure}

Faster Compilation: Typst compiles documents significantly faster than LaTeX, making it more suitable for iterative document development.
Better Error Messages: When something goes wrong, Typst provides clearer, more actionable error messages compared to LaTeX’s often cryptic feedback.
Modern Design: Built from the ground up with modern document needs in mind, including better handling of digital-first workflows.
For data scientists starting their journey, here’s how to think about these tools:
Choose LaTeX when:

- You’re submitting to journals or publishers that require LaTeX templates
- You depend on LaTeX’s mature package ecosystem (e.g., TikZ, beamer, journal classes)
- You’re collaborating with people already invested in LaTeX workflows
Consider Typst when:

- You’re starting a new project with no legacy LaTeX requirements
- You want fast compilation and readable error messages while iterating
- Your documents don’t depend on specialized LaTeX packages
The Quarto Advantage: One of Quarto’s strengths is that it abstracts away many of these decisions. You can often switch between PDF engines (including future Typst support) without changing your content, giving you flexibility as the ecosystem evolves.
As you progress in your data science career, investing time in understanding document preparation will pay dividends when creating reports, papers, or presentations that require precise typesetting and mathematical expressions. Whether you choose the established power of LaTeX or explore newer alternatives like Typst, start with the basics and gradually incorporate more advanced features as your needs grow.
The key is to pick the tool that best fits your current workflow and requirements, knowing that the fundamental principles of good document structure and clear communication remain constant regardless of the underlying technology.
For more complex projects, specialized documentation tools may be needed:
MkDocs creates a documentation website from Markdown files:
# Install MkDocs
pip install mkdocs
# Create a new project
mkdocs new my-documentation
# Serve the documentation locally
cd my-documentation
mkdocs serve

MkDocs is focused on simplicity and readability. It generates a clean, responsive website from your Markdown files, with navigation, search, and themes. This makes it an excellent choice for project documentation that needs to be accessible to users or team members.
Sphinx is a more powerful documentation tool widely used in the Python ecosystem:
# Install Sphinx
pip install sphinx
# Create a new documentation project
sphinx-quickstart docs
# Build the documentation
cd docs
make html

Sphinx offers advanced features like automatic API documentation generation, cross-referencing, and multiple output formats. It’s the system behind the official documentation for Python itself and many major libraries like NumPy, pandas, and scikit-learn.
The reason Sphinx has become the standard for Python documentation is its powerful extension system and its ability to generate API documentation automatically from docstrings in your code. This means you can document your functions and classes directly in your code, and Sphinx will extract and format that information into comprehensive documentation.
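For example, a function documented with a reST-style docstring (the style Sphinx’s autodoc parses; the function itself is just an illustration) might look like this:

```python
def normalize(values):
    """Scale a list of numbers to the range [0, 1].

    :param values: numeric values, with at least two distinct entries
    :returns: list of floats where the minimum maps to 0.0 and the maximum to 1.0
    :raises ValueError: if all values are identical
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot normalize constant values")
    return [(v - lo) / (hi - lo) for v in values]
```

Sphinx would extract this docstring and render the parameter, return, and exception descriptions as formatted API documentation.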
When using external data files in Quarto projects, it’s important to understand how to handle file paths properly to ensure reproducibility across different environments.
The error `'my_data.csv' does not exist in current working directory` is a common issue when transitioning between different editing environments like VS Code and RStudio. This happens because:

- Each tool can set a different default working directory (the project root, the document’s folder, or wherever the session was launched)
- Relative paths are resolved against that working directory, so the same path can point to different locations in different environments
The `here` package provides an elegant solution by creating paths relative to your project root:
library(tidyverse)
library(here)
# Load data using project-relative path
data <- read_csv(here("data", "my_data.csv"))
head(data)

The here() function automatically detects your project root (usually where your .Rproj file is located) and constructs paths relative to that location. This ensures consistent file access regardless of which editor you use, where you launch the session from, or how deeply the document is nested in the project.
To implement this approach:

1. Create a `data` folder in your project root
2. Place your data files inside it
3. Use `here("data", "filename.csv")` to reference them
# Load a dataset from a package
data(diamonds, package = "ggplot2")
# Display the first few rows
head(diamonds)

Using built-in datasets eliminates file path issues entirely, as these datasets are available to anyone who has the package installed. This is ideal for examples and tutorials where the specific data isn’t crucial.
Another reproducible approach is to generate sample data within your code:
# Create synthetic data
set.seed(0491) # For reproducibility
synthetic_data <- tibble(
id = 1:20,
value_x = rnorm(20),
value_y = value_x * 2 + rnorm(20, sd = 0.5),
category = sample(LETTERS[1:4], 20, replace = TRUE)
)
# Display the data
head(synthetic_data)

This approach works well for illustrative examples and ensures anyone can run your code without any external files.
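The same idea carries over to Python with nothing but the standard library (a sketch mirroring the R code above: rnorm becomes random.gauss, and the tibble becomes a list of dicts):

```python
import random

random.seed(491)  # for reproducibility

synthetic_data = []
for i in range(1, 21):
    x = random.gauss(0, 1)
    synthetic_data.append({
        "id": i,
        "value_x": x,
        "value_y": x * 2 + random.gauss(0, 0.5),  # linear in x plus noise
        "category": random.choice("ABCD"),
    })

# Display the first few rows
print(synthetic_data[:3])
```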
For real-world datasets that are too large to include in packages, you can fetch them from reliable URLs:
# URL to a stable dataset
url <- "https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv"
# Download and read the data
remote_data <- readr::read_csv(url)
# Display the data
head(remote_data)

Adding the chunk option `#| cache: true` to the download chunk tells Quarto to save the results and re-execute the chunk only when the code changes, which prevents unnecessary downloads.
Effective documentation follows certain principles: know your audience, structure content from overview to detail, keep documentation close to the code it describes, and update it as the project evolves.
Projects with comprehensive documentation tend to have fewer defects and require less maintenance effort. Well-documented data science projects are also more likely to be reproducible and reusable by others.
The practice of documenting your work isn’t just about helping others understand what you’ve done—it also helps you think more clearly about your own process. By explaining your choices and methods in writing, you often gain new insights and identify potential improvements in your approach.
Effective visualization is crucial for data science as it helps communicate findings and enables pattern discovery. Let’s explore essential visualization tools and techniques.
Data visualization serves multiple purposes in the data science workflow: exploring data to discover patterns and anomalies, checking data quality, evaluating models, and communicating findings to technical and non-technical audiences.
The power of visualization comes from leveraging human visual processing capabilities. Our brains can process visual information much faster than text or numbers. A well-designed chart can instantly convey relationships that would take paragraphs to explain in words.
Python offers several powerful libraries for data visualization, each with different strengths and use cases.
Matplotlib is the original Python visualization library and serves as the foundation for many others. It provides precise control over every element of a plot.
import matplotlib.pyplot as plt
import numpy as np
# Generate data
x = np.linspace(0, 10, 100)
y = np.sin(x)
# Create a figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
# Plot data
ax.plot(x, y, 'b-', linewidth=2, label='sin(x)')
# Add labels and title
ax.set_xlabel('X-axis', fontsize=14)
ax.set_ylabel('Y-axis', fontsize=14)
ax.set_title('Sine Wave', fontsize=16)
# Add grid and legend
ax.grid(True, linestyle='--', alpha=0.7)
ax.legend(fontsize=12)
# Save and show the figure
plt.savefig('sine_wave.png', dpi=300, bbox_inches='tight')
plt.show()

Matplotlib provides a blank-canvas approach where you explicitly define every element. This gives you complete control but requires more code for complex visualizations.
Seaborn builds on Matplotlib to provide high-level functions for common statistical visualizations.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Set the theme
sns.set_theme(style="whitegrid")
# Load example data
tips = sns.load_dataset("tips")
# Create a visualization
plt.figure(figsize=(12, 6))
sns.boxplot(x="day", y="total_bill", hue="smoker", data=tips, palette="Set3")
plt.title("Total Bill by Day and Smoker Status", fontsize=16)
plt.xlabel("Day", fontsize=14)
plt.ylabel("Total Bill ($)", fontsize=14)
plt.tight_layout()
plt.show()

Seaborn simplifies the creation of statistical visualizations like box plots, violin plots, and regression plots. It also comes with built-in themes that improve the default appearance of plots.
Plotly creates interactive visualizations that can be embedded in web applications or Jupyter notebooks.
import plotly.express as px
import pandas as pd
# Load example data
df = px.data.gapminder().query("year == 2007")
# Create an interactive scatter plot
fig = px.scatter(
df, x="gdpPercap", y="lifeExp", size="pop", color="continent",
log_x=True, size_max=60,
title="GDP per Capita vs Life Expectancy (2007)",
labels={"gdpPercap": "GDP per Capita", "lifeExp": "Life Expectancy (years)"}
)
# Update layout
fig.update_layout(
width=900, height=600,
legend_title="Continent",
font=dict(family="Arial", size=14)
)
# Show the figure
fig.show()

Plotly’s interactive features include zooming, panning, hovering for details, and the ability to export plots as images. These features make exploration more intuitive and presentations more engaging. The example above uses Python, but Plotly can just as easily be used in R.
R also provides powerful tools for data visualization, with ggplot2 being the most widely used library.
ggplot2 is the gold standard for data visualization in R, based on the Grammar of Graphics concept.
library(ggplot2)
library(dplyr)
# Load dataset
data(diamonds, package = "ggplot2")
# Create a sample of the data
set.seed(42)
diamonds_sample <- diamonds %>%
sample_n(1000)
# Create basic plot
p <- ggplot(diamonds_sample, aes(x = carat, y = price, color = cut)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE) +
scale_color_brewer(palette = "Set1") +
labs(
title = "Diamond Price vs. Carat by Cut Quality",
subtitle = "Sample of 1,000 diamonds",
x = "Carat (weight)",
y = "Price (USD)",
color = "Cut Quality"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 12, color = "gray50"),
axis.title = element_text(size = 12),
legend.position = "bottom"
)
# Display the plot
print(p)
# Save the plot
ggsave("diamond_price_carat.png", p, width = 10, height = 6, dpi = 300)

ggplot2’s layered approach allows for the creation of complex visualizations by combining simple elements. This makes it both powerful and conceptually elegant.
The philosophy behind ggplot2 is that you build a visualization layer by layer, which corresponds to how we think about visualizations conceptually. First, you define your data and aesthetic mappings (which variables map to which visual properties), then add geometric objects (points, lines, bars), then statistical transformations, scales, coordinate systems, and finally visual themes. This layered approach makes it possible to create complex visualizations by combining simple, understandable components.
R also offers interactive visualization libraries:
library(plotly)
library(dplyr)
# Load and prepare data
data(gapminder, package = "gapminder")
data_2007 <- gapminder %>%
filter(year == 2007)
# Create interactive plot
p <- plot_ly(
data = data_2007,
x = ~gdpPercap,
y = ~lifeExp,
size = ~pop,
color = ~continent,
type = "scatter",
mode = "markers",
sizes = c(5, 70),
marker = list(opacity = 0.7, sizemode = "diameter"),
hoverinfo = "text",
text = ~paste(
"Country:", country, "<br>",
"Population:", format(pop, big.mark = ","), "<br>",
"Life Expectancy:", round(lifeExp, 1), "years<br>",
"GDP per Capita:", format(round(gdpPercap), big.mark = ","), "USD"
)
) %>%
layout(
title = "GDP per Capita vs. Life Expectancy (2007)",
xaxis = list(
title = "GDP per Capita (USD)",
type = "log",
gridcolor = "#EEEEEE"
),
yaxis = list(
title = "Life Expectancy (years)",
gridcolor = "#EEEEEE"
),
legend = list(title = list(text = "Continent"))
)
# Display the plot
p

The R version of plotly can convert ggplot2 visualizations to interactive versions with a single function call:
# Convert a ggplot object (e.g., the diamonds plot created earlier) to plotly
ggplotly(p)

This capability to transform static ggplot2 charts into interactive visualizations with a single function call is extremely convenient. It allows you to develop visualizations using the familiar ggplot2 syntax, then add interactivity with minimal effort. This is very powerful when you need to create reports in both PDF and HTML formats: use ggplot2 for static PDFs and Plotly for dynamic HTML.
Diagrams are essential for data science documentation, helping to explain workflows, architectures, and relationships. Rather than creating images with external tools, you can use code-based diagramming directly in your Quarto documents with Mermaid.
Using code-based diagramming with Mermaid offers several advantages:
For data scientists, this means your entire workflow—code, analysis, explanations, and diagrams—can all be maintained in the same reproducible environment.
Quarto has built-in support for Mermaid diagrams. To create a diagram, use a code block with the mermaid engine:
flowchart LR
A[Raw Data] --> B[Data Cleaning]
B --> C[Exploratory Analysis]
C --> D[Feature Engineering]
D --> E[Model Training]
E --> F[Evaluation]
F --> G[Deployment]
The syntax starts with the diagram type (flowchart), followed by the direction (LR for left-to-right), and then the definition of nodes and connections.
Mermaid supports several diagram types that are particularly useful for data science:
Flowcharts are perfect for documenting data pipelines and analysis workflows:
flowchart TD
A[Raw Data] --> B{Missing Values?}
B -->|Yes| C[Imputation]
B -->|No| D[Feature Engineering]
C --> D
D --> E[Train Test Split]
E --> F[Model Training]
F --> G[Evaluation]
G --> H{Performance<br>Acceptable?}
H -->|Yes| I[Deploy Model]
H -->|No| J[Tune Parameters]
J --> F
This top-down (TD) flowchart illustrates a complete machine learning workflow with decision points. Notice how you can use different node shapes (rectangles, diamonds) and add text to connections.
Class diagrams help explain data structures and relationships:
classDiagram
class Dataset {
+DataFrame data
+load_from_csv(filename)
+split_train_test(test_size)
+normalize()
}
class Model {
+train(X, y)
+predict(X)
+evaluate(X, y)
+save(filename)
}
class Pipeline {
+steps
+add_step(transformer)
+fit_transform(data)
}
Dataset --> Model: provides data to
Pipeline --> Dataset: processes
Pipeline --> Model: feeds into
This diagram shows the relationships between key classes in a machine learning system. It’s useful for documenting the architecture of your data science projects.
Sequence diagrams show interactions between components over time:
sequenceDiagram
participant U as User
participant API as REST API
participant ML as ML Model
participant DB as Database
U->>API: Request prediction
API->>DB: Fetch features
DB-->>API: Return features
API->>ML: Send features for prediction
ML-->>API: Return prediction
API->>DB: Log prediction
API-->>U: Return results
This diagram illustrates the sequence of interactions in a model deployment scenario, showing how data flows between the user, API, model, and database.
Gantt charts are useful for project planning and timelines:
```{mermaid}
gantt
    title Data Science Project Timeline
    dateFormat YYYY-MM-DD
    section Data Preparation
    Collect raw data :a1, 2025-01-01, 10d
    Clean and validate :a2, after a1, 5d
    Exploratory analysis :a3, after a2, 7d
    Feature engineering :a4, after a3, 8d
    section Modeling
    Split train/test :b1, after a4, 1d
    Train baseline models :b2, after b1, 5d
    Hyperparameter tuning :b3, after b2, 7d
    Model evaluation :b4, after b3, 4d
    section Deployment
    Create API :c1, after b4, 6d
    Documentation :c2, after b4, 8d
    Testing :c3, after c1, 5d
    Production release :milestone, after c2 c3, 0d
```
This Gantt chart shows the timeline of a data science project, with tasks grouped into sections and dependencies between them clearly indicated.
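The `after` dependencies Mermaid renders are just a scheduling computation. This illustrative Python sketch (not part of Mermaid) derives start and end days from durations and dependencies, using a few tasks from the chart above:

```python
def schedule(tasks):
    """Compute (start, end) day for each task from durations and dependencies.

    tasks maps a task name to (duration_days, [names it runs after]);
    a task starts when all of its dependencies have finished (day 0 origin).
    """
    times = {}

    def resolve(name):
        if name not in times:
            duration, deps = tasks[name]
            start = max((resolve(dep)[1] for dep in deps), default=0)
            times[name] = (start, start + duration)
        return times[name]

    for name in tasks:
        resolve(name)
    return times

plan = schedule({
    "Collect raw data":     (10, []),
    "Clean and validate":   (5, ["Collect raw data"]),
    "Exploratory analysis": (7, ["Clean and validate"]),
})
```

Running the scheduler reproduces the day offsets you can read off the rendered chart.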
ER diagrams are valuable for database schema design:
```{mermaid}
erDiagram
    CUSTOMER ||--o{ ORDER : places
    ORDER ||--|{ ORDER_ITEM : contains
    PRODUCT ||--o{ ORDER_ITEM : "ordered in"
    CUSTOMER {
        int customer_id PK
        string name
        string email
        date join_date
    }
    ORDER {
        int order_id PK
        int customer_id FK
        date order_date
        float total_amount
    }
    ORDER_ITEM {
        int order_id PK,FK
        int product_id PK,FK
        int quantity
        float price
    }
    PRODUCT {
        int product_id PK
        string name
        string category
        float unit_price
    }
```
This diagram shows a typical e-commerce database schema with relationships between tables and their attributes.
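The same schema can be written directly as SQL DDL. Here it is sketched against an in-memory SQLite database, with types simplified to SQLite's storage classes; note that `order` must be quoted because it is an SQL reserved word:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    email TEXT,
    join_date TEXT
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category TEXT,
    unit_price REAL
);
CREATE TABLE "order" (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    order_date TEXT,
    total_amount REAL
);
CREATE TABLE order_item (
    order_id INTEGER REFERENCES "order"(order_id),
    product_id INTEGER REFERENCES product(product_id),
    quantity INTEGER,
    price REAL,
    PRIMARY KEY (order_id, product_id)
);
""")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

The composite primary key on `order_item` corresponds to the `PK,FK` annotations in the diagram.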
You can customize the appearance of your diagrams:
```{mermaid}
flowchart LR
    A[Data Collection] --> B[Data Cleaning]
    B --> C[Analysis]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#33f,stroke-width:2px
    style C fill:#bfb,stroke:#3f3,stroke-width:2px
```
This diagram uses custom colors and border styles for each node to highlight different stages of the process.
For complex or dynamic diagrams, you can generate Mermaid code programmatically:
```{r}
# Define the steps in a data pipeline
steps <- c("Import Data", "Clean Data", "Feature Engineering",
           "Split Dataset", "Train Model", "Evaluate", "Deploy")

# Start the Mermaid flowchart code
mermaid_code <- c(
  "```{mermaid}",
  "flowchart LR"
)

# Add connections between consecutive steps
for (i in seq_len(length(steps) - 1)) {
  mermaid_code <- c(
    mermaid_code,
    sprintf("  %s[\"%s\"] --> %s[\"%s\"]",
            LETTERS[i], steps[i],
            LETTERS[i + 1], steps[i + 1])
  )
}
mermaid_code <- c(mermaid_code, "```")

# Output the Mermaid code
cat(paste(mermaid_code, collapse = "\n"))
```

This R code generates a Mermaid flowchart from a vector of step names. The approach is particularly useful when you want to create diagrams from data or configuration rather than writing them by hand.
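If you work in Python rather than R, the same generation idea is a few lines of string formatting. This sketch mirrors the R version above:

```python
import string

# Define the steps in a data pipeline
steps = ["Import Data", "Clean Data", "Feature Engineering",
         "Split Dataset", "Train Model", "Evaluate", "Deploy"]

# Build the Mermaid flowchart line by line
lines = ["flowchart LR"]
for i in range(len(steps) - 1):
    a, b = string.ascii_uppercase[i], string.ascii_uppercase[i + 1]
    lines.append(f'    {a}["{steps[i]}"] --> {b}["{steps[i + 1]}"]')

mermaid_code = "\n".join(lines)
print(mermaid_code)
```

Pasting the printed output into any Mermaid-aware renderer produces the seven-step pipeline diagram.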
Diagrams should clarify your explanations, not complicate them. A well-designed diagram can make complex processes or relationships immediately understandable.
Moving beyond static visualizations, interactive dashboards allow users to explore data dynamically. These tools are essential for deploying data science results to stakeholders who need to interact with the findings.
Shiny allows you to build interactive web applications entirely in R, without requiring knowledge of HTML, CSS, or JavaScript:
```{r}
#| eval: false
# Install Shiny if needed
install.packages("shiny")
```

A simple Shiny app consists of two components:

- a user interface (UI) definition, which controls the layout and appearance, and
- a server function, which contains the instructions for building outputs from inputs.
Here’s a basic example:
```{r}
#| eval: false
library(shiny)
library(ggplot2)
library(dplyr)

# Define UI
ui <- fluidPage(
  titlePanel("Diamond Explorer"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("carat_range",
                  "Carat Range:",
                  min = 0.2,
                  max = 5.0,
                  value = c(0.5, 3.0)),
      selectInput("cut",
                  "Cut Quality:",
                  choices = c("All", unique(as.character(diamonds$cut))),
                  selected = "All")
    ),
    mainPanel(
      plotOutput("scatterplot"),
      tableOutput("summary_table")
    )
  )
)

# Define server logic
server <- function(input, output) {
  # Filter data based on inputs
  filtered_data <- reactive({
    data <- diamonds
    # Filter by carat
    data <- data %>%
      filter(carat >= input$carat_range[1] & carat <= input$carat_range[2])
    # Filter by cut if not "All"
    if (input$cut != "All") {
      data <- data %>% filter(cut == input$cut)
    }
    data
  })
  # Create scatter plot
  output$scatterplot <- renderPlot({
    ggplot(filtered_data(), aes(x = carat, y = price, color = cut)) +
      geom_point(alpha = 0.5) +
      theme_minimal() +
      labs(title = "Diamond Price vs. Carat",
           x = "Carat",
           y = "Price (USD)")
  })
  # Create summary table
  output$summary_table <- renderTable({
    filtered_data() %>%
      group_by(cut) %>%
      summarize(
        Count = n(),
        `Avg Price` = round(mean(price), 2),
        `Avg Carat` = round(mean(carat), 2)
      )
  })
}

# Run the application
shinyApp(ui = ui, server = server)
```

What makes Shiny powerful is its reactivity system, which automatically updates outputs when inputs change. This means you can create interactive data exploration tools without manually coding how to respond to every possible user interaction.
The reactive programming model used by Shiny lets you declare relationships between inputs and outputs, and the system takes care of updating the appropriate components when inputs change. This is similar to how a spreadsheet works: when you change a cell’s value, any formulas that depend on that cell automatically recalculate.
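The underlying idea can be illustrated with a tiny observer pattern in Python. This is not how Shiny is implemented, just a sketch of the concept that setting an input re-runs whatever depends on it:

```python
class ReactiveValue:
    """A value that re-runs registered observers whenever it is set."""
    def __init__(self, value):
        self._value = value
        self._observers = []

    def observe(self, fn):
        # Register a dependent computation and run it once to initialize
        self._observers.append(fn)
        fn(self._value)

    def set(self, value):
        # Changing the input "invalidates" every dependent output
        self._value = value
        for fn in self._observers:
            fn(value)

# An "input" and a dependent "output"
carat = ReactiveValue(0.5)
outputs = []
carat.observe(lambda v: outputs.append(f"Filtering diamonds above {v} carat"))
carat.set(1.2)  # the dependent output updates automatically
```

In real Shiny the dependency graph is discovered automatically (reactive expressions record which inputs they read), so you never register observers by hand.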
Dash is Python’s equivalent to Shiny, created by the makers of Plotly:
```{bash}
# Install Dash
pip install dash dash-bootstrap-components
```

A simple Dash app follows a similar structure to Shiny:
```{python}
#| eval: false
import dash
from dash import dcc, html, dash_table
from dash.dependencies import Input, Output
import plotly.express as px
import pandas as pd

# Load data - using a built-in dataset for reproducibility
df = px.data.iris()

# Initialize app
app = dash.Dash(__name__)

# Define layout
app.layout = html.Div([
    html.H1("Iris Dataset Explorer"),
    html.Div([
        html.Div([
            html.Label("Select Species:"),
            dcc.Dropdown(
                id='species-dropdown',
                options=[{'label': 'All', 'value': 'all'}] +
                        [{'label': i, 'value': i} for i in df['species'].unique()],
                value='all'
            ),
            html.Label("Select Y-axis:"),
            dcc.RadioItems(
                id='y-axis',
                options=[
                    {'label': 'Sepal Width', 'value': 'sepal_width'},
                    {'label': 'Petal Length', 'value': 'petal_length'},
                    {'label': 'Petal Width', 'value': 'petal_width'}
                ],
                value='sepal_width'
            )
        ], style={'width': '25%', 'padding': '20px'}),
        html.Div([
            dcc.Graph(id='scatter-plot')
        ], style={'width': '75%'})
    ], style={'display': 'flex'}),
    html.Div([
        html.H3("Data Summary"),
        dash_table.DataTable(
            id='summary-table',
            style_cell={'textAlign': 'left'},
            style_header={
                'backgroundColor': 'lightgrey',
                'fontWeight': 'bold'
            }
        )
    ])
])

# Define callbacks
@app.callback(
    [Output('scatter-plot', 'figure'),
     Output('summary-table', 'data'),
     Output('summary-table', 'columns')],
    [Input('species-dropdown', 'value'),
     Input('y-axis', 'value')]
)
def update_graph_and_table(selected_species, y_axis):
    # Filter data
    if selected_species == 'all':
        filtered_df = df
    else:
        filtered_df = df[df['species'] == selected_species]
    # Create figure
    fig = px.scatter(
        filtered_df,
        x='sepal_length',
        y=y_axis,
        color='species',
        title=f'Sepal Length vs {y_axis.replace("_", " ").title()}'
    )
    # Create summary table
    summary_df = filtered_df.groupby('species').agg({
        'sepal_length': ['mean', 'std'],
        'sepal_width': ['mean', 'std'],
        'petal_length': ['mean', 'std'],
        'petal_width': ['mean', 'std']
    }).reset_index()
    # Flatten the multi-index column names
    summary_df.columns = ['_'.join(col).strip('_') for col in summary_df.columns.values]
    # Format table
    table_data = summary_df.to_dict('records')
    columns = [{"name": col.replace('_', ' ').title(), "id": col} for col in summary_df.columns]
    return fig, table_data, columns

# Run app (in Dash >= 2.7 this is app.run; older versions used app.run_server)
if __name__ == '__main__':
    app.run(debug=True)
```

Dash leverages Plotly for visualizations and React.js for the user interface, resulting in modern, responsive applications without requiring front-end web development experience.
Unlike Shiny’s reactive programming model, Dash uses a callback-based approach. You explicitly define functions that take specific inputs and produce specific outputs, with the Dash framework handling the connections between them. This approach may feel more familiar to Python programmers who are used to callback-based frameworks.
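The callback wiring can be pictured as a mapping from input IDs to functions. This hypothetical `MiniApp` sketch mimics, in a greatly simplified form, the dispatch a framework like Dash performs for you:

```python
class MiniApp:
    """A toy callback dispatcher: each input ID triggers one registered function."""
    def __init__(self):
        self.callbacks = {}  # input id -> (output id, function)
        self.state = {}      # component id -> current value

    def callback(self, output_id, input_id):
        def register(fn):
            self.callbacks[input_id] = (output_id, fn)
            return fn
        return register

    def set_input(self, input_id, value):
        # The framework, not your code, decides when the callback fires
        self.state[input_id] = value
        output_id, fn = self.callbacks[input_id]
        self.state[output_id] = fn(value)

app = MiniApp()

@app.callback("title", "species-dropdown")
def update_title(species):
    return f"Showing: {species}"

app.set_input("species-dropdown", "setosa")
```

The decorated function stays a plain, testable Python function; the framework only supplies the plumbing between components.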
Streamlit simplifies interactive app creation even further with a minimal, straightforward API. Here’s a simple Streamlit app:
```{python}
#| eval: false
import streamlit as st
import pandas as pd
import plotly.express as px
import seaborn as sns

# Set page title
st.set_page_config(page_title="Data Explorer", page_icon="📊")

# Add a title
st.title("Interactive Data Explorer")

# Add sidebar with dataset options
st.sidebar.header("Settings")
dataset_name = st.sidebar.selectbox(
    "Select Dataset",
    options=["Iris", "Diamonds", "Gapminder"]
)

# Load data based on selection - using built-in datasets for reproducibility
@st.cache_data
def load_data(dataset):
    if dataset == "Iris":
        return sns.load_dataset("iris")
    elif dataset == "Diamonds":
        return sns.load_dataset("diamonds").sample(1000, random_state=42)
    else:  # Gapminder
        return px.data.gapminder()

df = load_data(dataset_name)

# Display basic dataset information
st.header(f"{dataset_name} Dataset")
tab1, tab2, tab3 = st.tabs(["📋 Data", "📈 Visualization", "📊 Summary"])

with tab1:
    st.subheader("Raw Data")
    st.dataframe(df.head(100))
    st.subheader("Data Types")
    types_df = pd.DataFrame(df.dtypes, columns=["Data Type"])
    types_df.index.name = "Column"
    st.dataframe(types_df)

with tab2:
    st.subheader("Data Visualization")
    if dataset_name == "Iris":
        # For Iris dataset
        x_var = st.selectbox("X variable", options=df.select_dtypes("number").columns)
        y_var = st.selectbox("Y variable", options=df.select_dtypes("number").columns, index=1)
        fig = px.scatter(
            df, x=x_var, y=y_var, color="species",
            title=f"{x_var} vs {y_var} by Species"
        )
        st.plotly_chart(fig, use_container_width=True)
    elif dataset_name == "Diamonds":
        # For Diamonds dataset
        chart_type = st.radio("Chart Type", ["Scatter", "Histogram", "Box"])
        if chart_type == "Scatter":
            fig = px.scatter(
                df, x="carat", y="price", color="cut",
                title="Diamond Price vs Carat by Cut Quality"
            )
        elif chart_type == "Histogram":
            fig = px.histogram(
                df, x="price", color="cut", nbins=50,
                title="Distribution of Diamond Prices by Cut"
            )
        else:  # Box plot
            fig = px.box(
                df, x="cut", y="price",
                title="Diamond Price Distribution by Cut"
            )
        st.plotly_chart(fig, use_container_width=True)
    else:  # Gapminder
        year = st.slider("Select Year", min_value=1952, max_value=2007, step=5, value=2007)
        filtered_df = df[df["year"] == year]
        fig = px.scatter(
            filtered_df, x="gdpPercap", y="lifeExp", size="pop", color="continent",
            log_x=True, size_max=60, hover_name="country",
            title=f"GDP per Capita vs Life Expectancy ({year})"
        )
        st.plotly_chart(fig, use_container_width=True)

with tab3:
    st.subheader("Statistical Summary")
    if df.select_dtypes("number").shape[1] > 0:
        st.dataframe(df.describe())
    # Show counts for categorical variables
    categorical_cols = df.select_dtypes(include=["object", "category"]).columns
    if len(categorical_cols) > 0:
        cat_col = st.selectbox("Select Categorical Variable", options=categorical_cols)
        cat_counts = df[cat_col].value_counts().reset_index()
        cat_counts.columns = [cat_col, "Count"]
        fig = px.bar(
            cat_counts, x=cat_col, y="Count",
            title=f"Counts of {cat_col}"
        )
        st.plotly_chart(fig, use_container_width=True)
```

Streamlit’s appeal lies in its simplicity. Instead of defining callbacks between inputs and outputs (as in Dash and Shiny), the entire script runs from top to bottom when any input changes. This makes it exceptionally easy to prototype applications quickly.
The Streamlit approach is radically different from both Shiny and Dash. Rather than defining a layout and then wiring up callbacks or reactive expressions, you write a straightforward Python script that builds the UI from top to bottom. When any input changes, Streamlit simply reruns your script. This procedural approach is very intuitive for beginners and allows for rapid prototyping, though it can become less efficient for complex applications.
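Because every interaction reruns the whole script, caching expensive steps matters. The idea behind `@st.cache_data` is essentially memoization, sketched here with only the standard library (no Streamlit required):

```python
import functools

load_count = {"n": 0}

@functools.lru_cache(maxsize=None)
def load_data(dataset):
    # Pretend this is an expensive load we only want to perform once,
    # the way @st.cache_data avoids reloading on every script rerun
    load_count["n"] += 1
    return f"{dataset} rows"

# Simulate three top-to-bottom reruns of a Streamlit script
for _ in range(3):
    df = load_data("Iris")
```

The decorated function runs once; subsequent reruns with the same argument return the cached result, which is why the Streamlit app above stays responsive even though the script re-executes constantly.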
The tools and approaches covered in this chapter work best when integrated into a cohesive workflow: author your analyses in notebooks or R Markdown documents, document architectures and processes with Mermaid diagrams, manage file paths with the `here` package, and share results through interactive dashboards. This integrated approach ensures your work is reproducible, well-documented, and accessible to others.
Let’s consider how these tools might be used together in a real data science project:
By leveraging all these tools appropriately, you create a project that is not only technically sound but also well-documented and accessible to both technical and non-technical audiences.
In this chapter, we explored advanced tools for data science that enhance documentation, visualization, and interactivity. We’ve seen how Markdown and R Markdown combine narrative and code, how Mermaid diagrams document architectures and workflows, how Shiny, Dash, and Streamlit turn analyses into interactive applications, and how tools like the `here` package ensure reproducibility across environments.

As you continue your data science journey, integrating these tools into your workflow will help you create more professional, reproducible, and impactful projects. The key is to select the right tool for each specific task, while maintaining a cohesive overall approach that prioritizes reproducibility and clear communication. In the Deployment chapter, we’ll explore how to share these reports and dashboards with stakeholders through various hosting platforms.
Remember that the ultimate goal of these tools is not just to make your work easier, but to make your insights more accessible and actionable for others. By investing time in proper documentation, visualization, and interactivity, you amplify the impact of your data science work. At this point, I’d like to interject with a note on AI: if you don’t know these tools and how they work, you can’t hope to ask AI what to produce for you. While building a Shiny app from scratch is no longer strictly necessary, you need to know what Shiny is capable of and how it’s best applied. You also need the correct environment setup so that you can run your app. Please continue to bear in mind that your understanding of data science tools and processes is becoming more important than being able to write code from scratch.