2  Setting Up for Success: Infrastructure for the Modern Data Scientist

2.1 Introduction

If you’re coming from economics, statistics, engineering, or another technical field, you already have many of the analytical skills needed to make productive use of data. However, since you’re reading this, you’d like some help setting up the technical infrastructure that supports modern data science work. For those without a computer science background, all of this may seem overwhelming at first, but soon you’ll have the tools to make your workflows even more productive.

This guide focuses on getting you set up with the tools you need to practice data science, rather than teaching you how to code. Think of it as preparing your workshop before you begin crafting. We’ll cover installing and configuring the essential software, platforms, and tools that data scientists use regularly.

By the end of this guide, you’ll have:

  • A fully configured development environment for Python, R, and SQL
  • Experience with version control through Git and GitHub
  • The ability to create interactive reports and visualizations
  • Knowledge of how to deploy your work for others to see and use
  • A foundation in the command line and other developer tools

While this guide is written to provide a natural progression from fundamental concepts to more involved material that builds on prior knowledge, each chapter is designed to be a standalone reference—you don’t need to read “Understanding the Command Line” if all you need is help with app deployment.

The resources presented in this guide are largely freely available up to some tier (except for some of the cloud platforms, which are free to set up but incur usage costs), so you can get started without needing to make decisions based on costs.

2.2 Understanding the Command Line

Before starting with specific data science tools, we need to understand one of the most fundamental interfaces in computing: the command line. Many data science tools are best installed, configured, and sometimes even used through this text-based interface. Further, when we later discuss Integrated Development Environments (IDEs) such as Visual Studio Code, RStudio, and many others, you’ll find that they provide dedicated functionality to allow you to interact directly with the command line, so understanding its purpose is globally useful across workflows.

2.2.1 What is the Command Line?

The command line (also called terminal, shell, or console) is a text-based interface where you type commands for the computer to execute. While graphical user interfaces (GUIs) let you point and click, the command line gives you more precise control through text commands.

Why use the command line when we have modern GUIs?

  1. Many data science tools are designed to be used this way: Tools like Git, Docker, and many Python and R package management utilities primarily use command-line interfaces.

  2. It allows for reproducibility through scripts: Command-line operations can be saved in script files and run again later, ensuring that the exact same steps are followed each time. This reproducibility is essential for reliable data analysis.

  3. It often provides more flexibility and power: Command-line tools typically offer more options and configurations than their graphical counterparts. For example, when installing Python packages, the command-line tool pip offers dozens of options to handle dependencies, versions, and installation locations that aren’t available in most graphical installers.

  4. It’s faster for many operations once you learn the commands: After becoming familiar with the commands, many operations can be performed more quickly than navigating through multiple screens in a GUI. For instance, you can install multiple Python packages with a single command line rather than clicking through installation wizards for each one.

2.2.2 Getting Started with the Command Line

2.2.2.1 On Windows

Windows offers several options for command line interfaces:

  1. Command Prompt: Built into Windows, but limited in functionality
  2. PowerShell: A more powerful alternative built into Windows
  3. Windows Subsystem for Linux (WSL): Provides a Linux environment within Windows (recommended)

To install WSL, open PowerShell as administrator and run:

wsl --install

This installs Ubuntu Linux by default. After installation, restart your computer and follow the setup prompts.

2.2.2.2 On macOS

The Terminal application comes pre-installed:

  1. Press Cmd+Space to open Spotlight search
  2. Type “Terminal” and press Enter

2.2.2.3 On Linux

Most Linux distributions come with a terminal emulator. Look for “Terminal” in your applications menu.

2.2.3 Essential Command Line Operations

Let’s practice some basic commands. Open your terminal and try these:

2.2.3.2 Creating and Editing Files

While you can create files through the command line, it’s often easier to use a text editor. However, it’s good to know these commands:

# Create an empty file
touch newfile.txt

# Display file contents
cat filename.txt

# Simple editor (press i to insert, Esc then :wq to save and quit)
vim filename.txt

Think of these commands as ways to create and look at the contents of notes or documents on your computer, all without opening a word processor or text editor application.

2.2.4 Package Managers

Most command line environments include package managers, which help install and update software. Think of package managers as app stores for your command line. Common ones include:

  • apt (Ubuntu/Debian Linux)
  • brew (macOS)
  • winget (Windows)

For example, on Ubuntu you might install Python using:

sudo apt update
sudo apt install python3

On macOS with Homebrew:

# Install Homebrew first if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Then install Python
brew install python

The term “sudo” gives you temporary administrator-level privileges, similar to when Windows asks “Do you want to allow this app to make changes to your device?”

Understanding these basics will help tremendously as we set up our data science tools. The command line might seem intimidating at first, but it becomes an invaluable ally as you grow more comfortable with it.

2.3 Setting Up Python

Python has become a cornerstone language in data science due to its readability, extensive libraries, and versatile applications. Let’s set up a proper Python environment.

2.3.1 Why Python for Data Science?

Python offers several advantages for data science:

  1. Rich ecosystem of specialized libraries (NumPy, pandas, scikit-learn, etc.)
  2. Readable syntax that makes complex analyses more accessible
  3. Strong community support and documentation
  4. Integration with various data sources and visualization tools

Python consistently ranks among the top programming languages for data science and is widely used across the industry.

2.3.2 Installing Python

We’ll install Python using a distribution called Anaconda, which includes Python itself plus many data science packages. Anaconda provides a package manager called conda that creates isolated environments, helping you manage different projects with different dependencies.

2.3.2.1 Installing Anaconda

  1. Visit the Anaconda download page
  2. Download the appropriate installer for your operating system
  3. Run the installer and follow the prompts

During installation on Windows, you may be asked whether to add Anaconda to your PATH environment variable. While checking this box can make commands available from any terminal, it might interfere with other Python installations. The safer choice is to leave it unchecked and use the Anaconda Prompt specifically.

The “PATH” is like an address book that tells your computer where to find programs when you type their names. Adding Anaconda to your PATH means you can use Python from any command prompt, but it could cause conflicts with other versions of Python on your system.

2.3.2.2 Verifying Installation

Open a new terminal (or Anaconda Prompt on Windows) and type:

python --version

You should see the Python version number. Also, check that conda is installed:

conda --version

2.3.3 Creating a Python Environment

Environments let you isolate projects with specific dependencies. Think of environments as separate workspaces for different projects—like having different toolboxes for different types of jobs. Here’s how to create one:

# Create an environment named 'datasci' with Python 3.11
conda create -n datasci python=3.11

# Activate the environment
conda activate datasci

# Install common data science packages
conda install numpy pandas matplotlib scikit-learn jupyter

Whenever you work on your data science projects, activate this environment first.

2.3.4 Using Jupyter Notebooks

Jupyter notebooks provide an interactive environment for Python development, popular in data science for combining code, visualizations, and narrative text. They’re like digital lab notebooks where you can document your analysis process along with the code and results.

# Make sure your environment is activated
conda activate datasci

# Launch Jupyter Notebook
jupyter notebook

This opens a web browser where you can create and work with notebooks. Let’s create a simple notebook to verify everything works:

  1. Click “New” → “Python 3”
  2. In the first cell, type:
Show code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create some sample data
data = pd.DataFrame({
    'x': range(1, 11),
    'y': np.random.randn(10)
})

# Create a simple plot
plt.figure(figsize=(8, 4))
plt.plot(data['x'], data['y'], marker='o')
plt.title('Sample Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.grid(True)
plt.show()

print("Python environment is working correctly!")
  1. Press Shift+Enter to run the cell

If you see a plot and the success message, your Python setup is complete!

2.3.5 Installing Additional Packages

As your data science journey progresses, you’ll need additional packages. Use either:

# Using conda (preferred when available)
conda install package_name

# Using pip (when packages aren't available in conda)
pip install package_name

Conda is often preferred for data science packages because it handles complex dependencies better, especially for packages with C/C++ components. This is particularly important for libraries that have parts written in lower-level programming languages to make them run faster.

2.4 Setting Up R

R is a powerful language and environment specifically designed for statistical computing and graphics. Many statisticians and data scientists prefer R for statistical analysis and visualization.

2.4.1 Why R for Data Science?

R offers several advantages:

  1. Built specifically for statistical analysis
  2. Excellent for data visualization with ggplot2
  3. A rich ecosystem of packages for specialized statistical methods
  4. Strong in reproducible research through R Markdown

R has thousands of packages available on CRAN for various statistical and data analysis tasks, with active development from the statistics and research communities.

2.4.2 Installing R

Let’s install both R itself and RStudio, a popular integrated development environment for R.

2.4.2.1 Installing Base R

  1. Visit the Comprehensive R Archive Network (CRAN)
  2. Click on the link for your operating system
  3. Follow the installation instructions

2.4.2.2 Installing RStudio Desktop

RStudio provides a user-friendly interface for working with R.

  1. Visit the RStudio download page
  2. Download the free RStudio Desktop version for your operating system
  3. Run the installer and follow the prompts

Think of R as the engine and RStudio as the dashboard that makes it easier to control that engine. You could use R without RStudio, but RStudio makes many tasks more convenient.

2.4.2.3 Verifying Installation

Open RStudio and enter this command in the console (lower-left pane):

Show code
R.version.string

You should see the R version information displayed. You can verify this as the version is the first printed output you will see in the console at the start of a new session. It should look something like this:

R version 4.5.0 (2025-04-11 ucrt) -- "How About a Twenty-Six" Copyright (C) 2025 The R Foundation for Statistical Computing Platform: x86_64-w64-mingw32/x64

2.4.3 Essential R Packages for Data Science

Let’s install some core packages that you’ll likely need:

Show code
# Install essential packages
install.packages(c("tidyverse", "rmarkdown", "shiny", "knitr", "plotly"))

This installs:

  • tidyverse: A collection of packages for data manipulation and visualization
  • rmarkdown: For creating documents that mix code and text
  • shiny: For building interactive web applications
  • knitr: For dynamic report generation
  • plotly: For interactive visualizations

These packages are like specialized toolkits that expand what you can do with R. The tidyverse, for example, makes data manipulation much more intuitive than it would be using just base R.

2.4.4 Creating Your First R Script

Let’s verify our setup with a simple R script:

  1. In RStudio, go to File → New File → R Script
  2. Enter the following code:
Show code
# Load libraries
library(tidyverse)

# Create sample data
data <- tibble(
  x = 1:10,
  y = rnorm(10)
)

# Create a plot with ggplot2
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_line() +
  labs(title = "Sample Plot in R",
       x = "X-axis",
       y = "Y-axis") +
  theme_minimal()

print("R environment is working correctly!")
  1. Click the “Run” button or press Ctrl+Enter (Cmd+Enter on Mac) to execute the code

If you see a plot in the lower-right pane and the success message in the console, your R setup is complete!

2.4.5 Understanding R Packages

Unlike Python, where conda or pip manage packages, R has its own built-in package management system accessed through functions like install.packages() and library().

There are thousands of R packages available on CRAN, with more on Bioconductor (for bioinformatics) and GitHub. To install a package from GitHub, you first need the devtools package:

Show code
install.packages("devtools")
devtools::install_github("username/package")

Think of CRAN as the official app store for R packages, while GitHub is like getting apps directly from developers. Both are useful, but packages on CRAN have gone through more quality checks.

2.5 SQL Fundamentals and Setup

SQL (Structured Query Language) is essential for data scientists to interact with databases. We’ll set up a lightweight database system so you can practice SQL queries locally.

2.5.1 Why SQL for Data Science?

SQL is crucial for data science because:

  1. Most organizational data resides in databases
  2. It provides a standard way to query and manipulate data
  3. It’s often more efficient than Python or R for large data operations
  4. Data transformation often happens in databases before analysis

SQL is one of the most important skills for data scientists, as most organizational data resides in relational databases.

2.5.2 Installing SQLite

SQLite is a lightweight, file-based database that requires no server setup, making it perfect for learning.

Think of SQLite as a simple filing cabinet for your data that you can easily carry around, unlike larger database systems that require dedicated servers.

2.5.2.1 On Windows

  1. Download the SQLite command-line tools from the SQLite download page
  2. Extract the files to a folder (e.g., C:\sqlite)
  3. Add this folder to your PATH environment variable

2.5.2.2 On macOS

SQLite comes pre-installed, but you can install a newer version with Homebrew:

# Install SQLite
brew install sqlite

2.5.2.3 On Linux

sudo apt update
sudo apt install sqlite3

2.5.2.4 Verifying Installation

Open a terminal or command prompt and type:

sqlite3 --version

You should see the version information displayed.

2.5.3 Creating Your First Database

Let’s create a simple database to verify our setup:

# Create a new database file
sqlite3 sample.db

# In the SQLite prompt, create a table
CREATE TABLE people (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    age INTEGER,
    city TEXT
);

# Insert some data
INSERT INTO people (name, age, city) VALUES ('Alice', 28, 'New York');
INSERT INTO people (name, age, city) VALUES ('Bob', 35, 'Chicago');
INSERT INTO people (name, age, city) VALUES ('Charlie', 42, 'San Francisco');

# Query the data
SELECT * FROM people;

# Exit SQLite
.exit

Think of this process as creating a spreadsheet (the table) within a file (the database), then adding some rows of data, and finally viewing all the data.

2.5.4 SQL GUIs for Easier Database Management

While the command line is powerful, graphical interfaces can make working with databases more intuitive:

2.5.4.1 DB Browser for SQLite

This free, open-source tool provides a user-friendly interface for SQLite databases.

  1. Visit the DB Browser for SQLite download page
  2. Download the appropriate version for your operating system
  3. Install and open it
  4. Open the sample.db file you created earlier

DB Browser for SQLite acts like a spreadsheet program for your database, making it easier to view and edit data without typing SQL commands.

2.5.4.2 Using SQL from Python and R

You can also interact with SQLite databases from Python and R:

2.5.4.2.1 Python
Show code
import sqlite3
import pandas as pd

# Connect to the database
conn = sqlite3.connect('sample.db')

# Query data into a pandas DataFrame
df = pd.read_sql_query("SELECT * FROM people", conn)

# Display the data
print(df)

# Close the connection
conn.close()
2.5.4.2.2 R
Show code
library(RSQLite)
library(DBI)

# Connect to the database
conn <- dbConnect(SQLite(), "sample.db")

# Query data into a data frame
df <- dbGetQuery(conn, "SELECT * FROM people")

# Display the data
print(df)

# Close the connection
dbDisconnect(conn)

This interoperability between SQL, Python, and R is a fundamental skill for data scientists, allowing you to leverage the strengths of each tool. You can store data in a database, query it with SQL, then analyze it with Python or R—all within the same workflow.

2.6 Integrated Development Environments (IDEs)

An Integrated Development Environment (IDE) combines the tools needed for software development into a single application. A good IDE dramatically improves productivity by providing code editing, debugging, execution, and project management in one place.

2.6.1 Why IDEs Matter for Data Science

IDEs help data scientists by:

  1. Providing syntax highlighting and code completion
  2. Catching errors before execution
  3. Offering integrated documentation
  4. Simplifying project organization and version control

Most professional developers use a specialized IDE rather than a basic text editor, as the additional features significantly improve productivity.

Think of an IDE as a fully equipped workshop rather than just having a single tool. It has everything arranged conveniently in one place.

We’ve already installed RStudio for R development. Now let’s look at options for Python and SQL.

2.6.2 VS Code: A Universal IDE

Visual Studio Code (VS Code) is a free, open-source editor that supports multiple languages through extensions. Its flexibility makes it an excellent choice for data scientists.

2.6.2.1 Installing VS Code

  1. Visit the VS Code download page
  2. Download the appropriate version for your operating system
  3. Run the installer and follow the prompts

2.6.2.2 Essential VS Code Extensions for Data Science

After installing VS Code, add these extensions by clicking on the Extensions icon in the sidebar (or pressing Ctrl+Shift+X):

  • Python by Microsoft: Python language support
  • Jupyter: Support for Jupyter notebooks
  • Rainbow CSV: Makes CSV files easier to read
  • SQLite: SQLite database support
  • R: R language support (if you plan to use R in VS Code)
  • GitLens: Enhanced Git capabilities

Extensions in VS Code are like add-ons or plugins that enhance its functionality for specific tasks or languages, similar to how you might install apps on your phone to give it new capabilities.

2.6.2.3 Configuring VS Code for Python

  1. Open VS Code
  2. Press Ctrl+Shift+P (Cmd+Shift+P on Mac) to open the command palette
  3. Type “Python: Select Interpreter” and select it
  4. Choose your conda environment (e.g., datasci)

This step tells VS Code which Python installation to use when running your code. It’s like telling a multilingual person which language to speak when communicating with you.

2.6.3 PyCharm Community Edition

PyCharm is an IDE specifically designed for Python development, with excellent data science support.

2.6.3.1 Installing PyCharm Community Edition

  1. Visit the PyCharm download page
  2. Download the free Community Edition
  3. Run the installer and follow the prompts

2.6.3.2 Configuring PyCharm for Your Conda Environment

  1. Open PyCharm
  2. Create a new project
  3. Click on “Previously configured interpreter”
  4. Click on the gear icon and select “Add…”
  5. Choose “Conda Environment” → “Existing environment”
  6. Browse to your conda environment’s Python executable
    • On Windows: Usually in C:\Users\<username>\anaconda3\envs\datasci\python.exe
    • On macOS/Linux: Usually in /home/<username>/anaconda3/envs/datasci/bin/python
Note

Note: In file paths, forward slashes (/) are primarily used in Unix-like systems like Linux and macOS, while backslashes (\) are commonly used in Windows.

2.6.4 Working with Jupyter Notebooks

While we already mentioned Jupyter notebooks in the Python section, they deserve more attention as a popular IDE-like interface for data science.

2.6.4.1 JupyterLab: The Next Generation of Jupyter

JupyterLab is a web-based interactive development environment that extends the notebook interface with a file browser, consoles, terminals, and more.

# Install JupyterLab
conda activate datasci
conda install -c conda-forge jupyterlab

# Launch JupyterLab
jupyter lab

JupyterLab provides a more IDE-like experience than classic Jupyter notebooks, with the ability to open multiple notebooks, view data frames, and edit other file types in a single interface. It’s like upgrading from having separate tools to having a comprehensive workbench.

2.6.5 Choosing the Right IDE

Each IDE has strengths and weaknesses:

  • VS Code: Versatile, lightweight, supports multiple languages
  • PyCharm: Robust Python-specific features, excellent for large projects
  • RStudio: Optimized for R development
  • JupyterLab: Excellent for exploratory data analysis and sharing results

Many data scientists use multiple IDEs depending on the task. For example, you might use:

  • JupyterLab for exploration and visualization
  • VS Code for script development and Git integration
  • RStudio for statistical analysis and report generation

Choose the tools that best fit your workflow and preferences. It’s perfectly fine to start with one and add others as you grow more comfortable.

2.7 Version Control with Git and GitHub

Version control is a system that records changes to files over time, allowing you to recall specific versions later. Git is the most widely used version control system, and GitHub is a popular platform for hosting Git repositories.

2.7.1 Why Version Control for Data Science?

Version control is essential for data science because it:

  1. Tracks changes to code and documentation
  2. Facilitates collaboration with others
  3. Provides a backup of your work
  4. Documents the evolution of your analysis
  5. Enables reproducibility by capturing the state of code at specific points

Proper version control is essential for reproducibility and collaboration in data science work.

Think of Git as a time machine for your code. It allows you to save snapshots of your project at different points in time and revisit or restore those snapshots if needed.

2.7.2 Installing Git

2.7.2.1 On Windows

  1. Download the installer from Git for Windows
  2. Run the installer, accepting the default options (though you may want to choose VS Code as your default editor if you installed it)

2.7.2.2 On macOS

Git may already be installed. Check by typing git --version in the terminal. If not:

# Install Git using Homebrew
brew install git

2.7.2.3 On Linux

sudo apt update
sudo apt install git

2.7.2.4 Configuring Git

After installation, open a terminal and configure your identity:

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

This is like putting your name and address on a letter. When you make changes to a project, Git will know who made them.

2.7.3 Creating a GitHub Account

GitHub provides free hosting for Git repositories, making it easy to share code and collaborate.

  1. Visit GitHub
  2. Click “Sign up” and follow the instructions
  3. Verify your email address

GitHub is to Git what social media is to your photos—a place to share your work with others and collaborate on projects.

2.7.4 Setting Up SSH Authentication for GitHub

Using SSH keys makes it more secure and convenient to interact with GitHub:

2.7.4.1 Generating SSH Keys

# Generate a new SSH key
ssh-keygen -t ed25519 -C "your.email@example.com"

# Start the SSH agent
eval "$(ssh-agent -s)"

# Add your key to the agent
ssh-add ~/.ssh/id_ed25519

SSH keys are like a special lock and key system. Instead of typing your password every time you interact with GitHub, your computer uses these keys to prove it’s really you.

2.7.4.2 Adding Your SSH Key to GitHub

  1. Copy your public key to the clipboard:
    • On Windows (in Git Bash): cat ~/.ssh/id_ed25519.pub | clip
    • On macOS: pbcopy < ~/.ssh/id_ed25519.pub
    • On Linux: cat ~/.ssh/id_ed25519.pub | xclip -selection clipboard
  2. Go to GitHub → Settings → SSH and GPG keys → New SSH key
  3. Paste your key and save

2.7.5 Basic Git Workflow

Let’s create a repository and learn the essential Git commands:

# Create a new directory
mkdir my_first_repo
cd my_first_repo

# Initialize a Git repository
git init

# Create a README file
echo "# My First Repository" > README.md

# Add the file to the staging area
git add README.md

# Commit the changes
git commit -m "Initial commit"

Think of this process as:

  1. Creating a new folder for your project
  2. Telling Git to start tracking changes in this folder
  3. Creating a simple text file
  4. Telling Git you want to include this file in your next snapshot
  5. Taking the snapshot with a brief description

2.7.6 Connecting to GitHub

Now let’s push this local repository to GitHub:

  1. On GitHub, click “+” in the top-right corner and select “New repository”
  2. Name it “my_first_repo”
  3. Leave it as a public repository
  4. Don’t initialize with a README (we already created one)
  5. Click “Create repository”
  6. Follow the instructions for “push an existing repository from the command line”:
git remote add origin git@github.com:yourusername/my_first_repo.git
git branch -M main
git push -u origin main

This process connects your local repository to GitHub (like linking your local folder to a cloud storage service) and uploads your code.

2.7.7 Basic Git Commands for Daily Use

These commands form the core of day-to-day Git usage:

# Check status of your repository
git status

# View commit history
git log

# Create and switch to a new branch
git checkout -b new-feature

# Switch between existing branches
git checkout main

# Pull latest changes from remote repository
git pull

# Add all changed files to staging
git add .

# Commit staged changes
git commit -m "Description of changes"

# Push commits to remote repository
git push

Think of branches as parallel versions of your project. The main branch is like the trunk of a tree, and other branches are like branches growing out from it. You can work on different features in different branches without affecting the main branch, then combine them when they’re ready.

2.7.8 Using Git in IDEs

Most modern IDEs integrate with Git, making version control easier:

2.7.8.1 VS Code

  • Click the Source Control icon in the sidebar
  • Use the interface to stage, commit, and push changes

2.7.8.2 PyCharm

  • Go to VCS → Git in the menu
  • Use the interface for Git operations

2.7.8.3 RStudio

  • Click the Git tab in the upper-right panel
  • Use the interface for Git operations

These integrations mean you don’t have to use the command line for every Git operation—you can manage version control without leaving your coding environment.

2.7.9 Collaborating with Others on GitHub

GitHub facilitates collaboration through pull requests:

  1. Fork someone’s repository by clicking the “Fork” button on GitHub

  2. Clone your fork locally:

    git clone git@github.com:yourusername/their-repo.git
  3. Create a branch for your changes:

    git checkout -b my-feature
  4. Make changes, commit them, and push to your fork:

    git push origin my-feature
  5. On GitHub, navigate to your fork and click “New pull request”

Pull requests allow project maintainers to review your changes before incorporating them. It’s like submitting a draft for review before it gets published.

The “fork and pull request” workflow is used by nearly all open-source projects, from small libraries to major platforms like TensorFlow and pandas. It’s considered a best practice for collaborative development.