2 Setting Up for Success: Infrastructure for the Modern Data Scientist
2.1 Introduction
If you’re coming from economics, statistics, engineering, or another technical field, you already have many of the analytical skills needed to make productive use of data. What you may be missing is the technical infrastructure that supports modern data science work. If you don’t have a computer science background, setting it all up can seem overwhelming at first, but you’ll soon have the tools to make your workflows more productive.
This guide focuses on getting you set up with the tools you need to practice data science, rather than teaching you how to code. Think of it as preparing your workshop before you begin crafting. We’ll cover installing and configuring the essential software, platforms, and tools that data scientists use regularly.
By the end of this guide, you’ll have:
A fully configured development environment for Python, R, and SQL
Experience with version control through Git and GitHub
The ability to create interactive reports and visualizations
Knowledge of how to deploy your work for others to see and use
A foundation in the command line and other developer tools
While this guide is written to provide a natural progression from fundamental concepts to more involved material that builds on prior knowledge, each chapter is designed to be a standalone reference—you don’t need to read “Understanding the Command Line” if all you need is help with app deployment.
Most of the resources presented in this guide are free, at least up to some usage tier. The exception is some of the cloud platforms, which are free to set up but incur usage costs. You can therefore get started without letting cost drive your decisions.
2.2 Understanding the Command Line
Before starting with specific data science tools, we need to understand one of the most fundamental interfaces in computing: the command line. Many data science tools are best installed, configured, and sometimes even used through this text-based interface. Further, when we later discuss Integrated Development Environments (IDEs) such as Visual Studio Code, RStudio, and many others, you’ll find that they provide dedicated functionality to allow you to interact directly with the command line, so understanding its purpose is globally useful across workflows.
2.2.1 What is the Command Line?
The command line (also called terminal, shell, or console) is a text-based interface where you type commands for the computer to execute. While graphical user interfaces (GUIs) let you point and click, the command line gives you more precise control through text commands.
Why use the command line when we have modern GUIs?
Many data science tools are designed to be used this way: Tools like Git, Docker, and many Python and R package management utilities primarily use command-line interfaces.
It allows for reproducibility through scripts: Command-line operations can be saved in script files and run again later, ensuring that the exact same steps are followed each time. This reproducibility is essential for reliable data analysis.
It often provides more flexibility and power: Command-line tools typically offer more options and configurations than their graphical counterparts. For example, when installing Python packages, the command-line tool pip offers dozens of options to handle dependencies, versions, and installation locations that aren’t available in most graphical installers.
It’s faster for many operations once you learn the commands: After becoming familiar with the commands, many operations can be performed more quickly than navigating through multiple screens in a GUI. For instance, you can install multiple Python packages with a single command rather than clicking through an installation wizard for each one.
2.2.2 Getting Started with the Command Line
2.2.2.1 On Windows
Windows offers several options for command line interfaces:
Command Prompt: Built into Windows, but limited in functionality
PowerShell: A more powerful alternative built into Windows
Windows Subsystem for Linux (WSL): Provides a Linux environment within Windows (recommended)
To install WSL, open PowerShell as administrator and run:
wsl --install
This installs Ubuntu Linux by default. After installation, restart your computer and follow the setup prompts.
2.2.2.2 On macOS
The Terminal application comes pre-installed:
Press Cmd+Space to open Spotlight search
Type “Terminal” and press Enter
2.2.2.3 On Linux
Most Linux distributions come with a terminal emulator. Look for “Terminal” in your applications menu.
2.2.3 Essential Command Line Operations
Let’s practice some basic commands. Open your terminal and try these:
2.2.3.1 Navigating the File System
# Print working directory (shows where you are)
pwd
# List files and directories
ls
# Change directory [to Documents]
cd Documents
# Go up one directory level (like clicking the back button in your browser)
cd ..
# Create a new directory
mkdir data_science_projects
# Remove a file (be careful!)
rm filename.txt
# Remove a directory
rmdir directory_name
These commands form the foundation of file navigation and manipulation. As you work with data science tools, you’ll find yourself using them frequently.
The commands above are like giving directions to your computer. Just as you might tell someone “Go down this street, then turn left at the second intersection,” these commands tell your computer “Show me where I am,” “Show me what’s here,” “Go into this folder,” and so on.
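The same directions can also be given from a script. The sketch below is an illustration rather than part of the setup: it mirrors the commands above using Python’s standard pathlib module, and it works inside a temporary directory (an assumption made here so that nothing on your machine is touched).

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory so no real files are affected.
with tempfile.TemporaryDirectory() as scratch:
    here = Path(scratch)                       # stand-in for what `pwd` would print

    project = here / "data_science_projects"
    project.mkdir()                            # like: mkdir data_science_projects

    notes = project / "filename.txt"
    notes.write_text("temporary notes\n")      # create a file so there is something to list

    listing = sorted(p.name for p in project.iterdir())   # like: ls
    parent_name = project.parent.name          # the directory `cd ..` would move to

    notes.unlink()                             # like: rm filename.txt
    project.rmdir()                            # like: rmdir (only works on empty directories)
    still_there = project.exists()

print(listing)
print(still_there)
```

Note that rmdir, in the shell and in pathlib alike, refuses to delete a directory that still contains files, which is why the file is removed first.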
2.2.3.2 Creating and Editing Files
While you can create files through the command line, it’s often easier to use a text editor. However, it’s good to know these commands:
# Create an empty file
touch newfile.txt
# Display file contents
cat filename.txt
# Simple editor (press i to insert, Esc then :wq to save and quit)
vim filename.txt
Think of these commands as ways to create and look at the contents of notes or documents on your computer, all without opening a word processor or text editor application.
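For comparison, here is a rough Python counterpart to touch and cat, again sketched with pathlib in a temporary directory. The line of text written to the file is made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as scratch:
    new_file = Path(scratch) / "newfile.txt"

    new_file.touch()                         # like: touch newfile.txt
    empty_contents = new_file.read_text()    # like: cat newfile.txt (nothing in it yet)

    new_file.write_text("first line\n")      # roughly what saving from an editor does
    contents = new_file.read_text()          # like: cat newfile.txt

print(repr(empty_contents))
print(repr(contents))
```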
2.2.4 Package Managers
Most command line environments include package managers, which help install and update software. Think of package managers as app stores for your command line. Common ones include:
apt (Ubuntu/Debian Linux)
brew (macOS)
winget (Windows)
For example, on Ubuntu you might install Python using:
sudo apt update
sudo apt install python3
On macOS with Homebrew:
# Install Homebrew first if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Then install Python
brew install python
The sudo prefix runs a single command with temporary administrator-level privileges, similar to when Windows asks “Do you want to allow this app to make changes to your device?”
Understanding these basics will help tremendously as we set up our data science tools. The command line might seem intimidating at first, but it becomes an invaluable ally as you grow more comfortable with it.
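If you’re unsure which of these package managers your machine already has, a small script can check. This is a minimal sketch, assuming the command names listed above (apt, brew, winget). It relies on shutil.which from the standard library, which looks up a program name the same way the shell does.

```python
import shutil

# Command names taken from the list above.
PACKAGE_MANAGERS = ["apt", "brew", "winget"]

def find_package_managers(candidates=PACKAGE_MANAGERS):
    """Return a dict mapping each available package manager to its location."""
    found = {}
    for name in candidates:
        location = shutil.which(name)   # full path if found, otherwise None
        if location is not None:
            found[name] = location
    return found

available = find_package_managers()
print(available or "No known package manager found on PATH")
```

An empty result does not necessarily mean something is wrong: a fresh macOS installation has no brew until you install Homebrew.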
2.3 Setting Up Python
Python has become a cornerstone language in data science due to its readability, extensive libraries, and versatile applications. Let’s set up a proper Python environment.
2.3.1 Why Python for Data Science?
Python offers several advantages for data science:
Rich ecosystem of specialized libraries (NumPy, pandas, scikit-learn, etc.)
Readable syntax that makes complex analyses more accessible
Strong community support and documentation
Integration with various data sources and visualization tools
Python consistently ranks among the top programming languages for data science and is widely used across the industry.
2.3.2 Installing Python
We’ll install Python using a distribution called Anaconda, which includes Python itself plus many data science packages. Anaconda provides a package manager called conda that creates isolated environments, helping you manage different projects with different dependencies.
Download the appropriate installer for your operating system
Run the installer and follow the prompts
During installation on Windows, you may be asked whether to add Anaconda to your PATH environment variable. While checking this box can make commands available from any terminal, it might interfere with other Python installations. The safer choice is to leave it unchecked and use the Anaconda Prompt specifically.
The “PATH” is like an address book that tells your computer where to find programs when you type their names. Adding Anaconda to your PATH means you can use Python from any command prompt, but it could cause conflicts with other versions of Python on your system.
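To make the address-book idea concrete, the sketch below prints the first few PATH entries and then asks which python (if any) the system would find. It uses only the standard library, and the output will differ from machine to machine.

```python
import os
import shutil

# PATH is one long string of directories separated by os.pathsep
# (":" on macOS/Linux, ";" on Windows).
path_entries = os.environ.get("PATH", "").split(os.pathsep)

# Directories are searched in order, so earlier entries win.
for directory in path_entries[:5]:
    print(directory)

# shutil.which performs the same lookup the shell does when you type a name.
python_location = shutil.which("python")   # None if no `python` is on PATH
print(python_location)
```

If two Python installations are both on PATH, the one whose directory appears first is the one that runs, which is the kind of conflict the installer warning is about.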
2.3.2.2 Verifying Installation
Open a new terminal (or Anaconda Prompt on Windows) and type:
python --version
You should see the Python version number. Also, check that conda is installed:
conda --version
2.3.3 Creating a Python Environment
Environments let you isolate projects with specific dependencies. Think of environments as separate workspaces for different projects—like having different toolboxes for different types of jobs. Here’s how to create one:
# Create an environment named 'datasci' with Python 3.11
conda create -n datasci python=3.11
# Activate the environment
conda activate datasci
# Install common data science packages
conda install numpy pandas matplotlib scikit-learn jupyter
Whenever you work on your data science projects, activate this environment first.
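A quick way to confirm that the right environment is active is to ask Python itself. The sketch below is one possible check, not an official conda tool. It reads CONDA_DEFAULT_ENV, an environment variable that conda sets when an environment is activated. The helper name describe_environment is made up for this example.

```python
import os
import sys

def describe_environment():
    """Collect a few facts about the interpreter that is currently running."""
    return {
        "python_version": sys.version.split()[0],
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Set by conda on activation; None if no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }

info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

With the datasci environment active, you would expect conda_env to read datasci and the executable path to sit inside that environment’s folder.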
2.3.4 Using Jupyter Notebooks
Jupyter notebooks provide an interactive environment for Python development, popular in data science for combining code, visualizations, and narrative text. They’re like digital lab notebooks where you can document your analysis process along with the code and results.
# Make sure your environment is activated
conda activate datasci
# Launch Jupyter Notebook
jupyter notebook
This opens a web browser where you can create and work with notebooks. Let’s create a simple notebook to verify everything works:
Click “New” → “Python 3”
In the first cell, type:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create some sample data
data = pd.DataFrame({
    'x': range(1, 11),
    'y': np.random.randn(10)
})

# Create a simple plot
plt.figure(figsize=(8, 4))
plt.plot(data['x'], data['y'], marker='o')
plt.title('Sample Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.grid(True)
plt.show()

print("Python environment is working correctly!")
Press Shift+Enter to run the cell
If you see a plot and the success message, your Python setup is complete!
2.3.5 Installing Additional Packages
As your data science journey progresses, you’ll need additional packages. Use either:
# Using conda (preferred when available)
conda install package_name
# Using pip (when packages aren't available in conda)
pip install package_name
Conda is often preferred for data science packages because it handles complex dependencies better, especially for packages with C/C++ components. This is particularly important for libraries that have parts written in lower-level programming languages to make them run faster.
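Before installing anything new, it can help to see what the active environment already contains. The sketch below uses importlib.metadata from the standard library to look up installed versions. The function name installed_versions is illustrative, and the package list simply repeats the ones installed earlier in this chapter.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Package names from the conda install command earlier in this chapter.
report = installed_versions(["numpy", "pandas", "matplotlib", "scikit-learn", "jupyter"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as not installed can then be added with conda install or pip install as shown above.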
2.4 Setting Up R
R is a powerful language and environment specifically designed for statistical computing and graphics. Many statisticians and data scientists prefer R for statistical analysis and visualization.
2.4.1 Why R for Data Science?
R offers several advantages:
Built specifically for statistical analysis
Excellent for data visualization with ggplot2
A rich ecosystem of packages for specialized statistical methods
Strong in reproducible research through R Markdown
R has thousands of packages available on CRAN for various statistical and data analysis tasks, with active development from the statistics and research communities.
2.4.2 Installing R
Let’s install both R itself and RStudio, a popular integrated development environment for R.
Download the free RStudio Desktop version for your operating system
Run the installer and follow the prompts
Think of R as the engine and RStudio as the dashboard that makes it easier to control that engine. You could use R without RStudio, but RStudio makes many tasks more convenient.
2.4.2.3 Verifying Installation
Open RStudio and enter this command in the console (lower-left pane):
R.version.string
You should see the R version information displayed. The same version string also appears at the top of the banner that R prints in the console at the start of every new session. It should look something like this:
R version 4.5.0 (2025-04-11 ucrt) -- "How About a Twenty-Six"
Copyright (C) 2025 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64
2.4.3 Essential R Packages for Data Science
Let’s install some core packages that you’ll likely need:
tidyverse: A collection of packages for data manipulation and visualization
rmarkdown: For creating documents that mix code and text
shiny: For building interactive web applications
knitr: For dynamic report generation
plotly: For interactive visualizations
These packages are like specialized toolkits that expand what you can do with R. The tidyverse, for example, makes data manipulation much more intuitive than it would be using just base R.
2.4.4 Creating Your First R Script
Let’s verify our setup with a simple R script:
In RStudio, go to File → New File → R Script
Enter the following code:
# Load libraries
library(tidyverse)

# Create sample data
data <- tibble(
  x = 1:10,
  y = rnorm(10)
)

# Create a plot with ggplot2
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_line() +
  labs(title = "Sample Plot in R",
       x = "X-axis",
       y = "Y-axis") +
  theme_minimal()

print("R environment is working correctly!")
Click the “Run” button or press Ctrl+Enter (Cmd+Enter on Mac) to execute the code
If you see a plot in the lower-right pane and the success message in the console, your R setup is complete!
2.4.5 Understanding R Packages
Unlike Python, where conda or pip manage packages, R has its own built-in package management system accessed through functions like install.packages() and library().
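There are thousands of R packages available on CRAN, with more on Bioconductor (for bioinformatics) and GitHub. To install a package from GitHub, you first need the devtools package, which you can install with install.packages("devtools") and which provides the install_github() function.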
Think of CRAN as the official app store for R packages, while GitHub is like getting apps directly from developers. Both are useful, but packages on CRAN have gone through more quality checks.
2.5 SQL Fundamentals and Setup
SQL (Structured Query Language) is essential for data scientists to interact with databases. We’ll set up a lightweight database system so you can practice SQL queries locally.
2.5.1 Why SQL for Data Science?
SQL is crucial for data science because:
Most organizational data resides in databases
It provides a standard way to query and manipulate data
It’s often more efficient than Python or R for large data operations
Data transformation often happens in databases before analysis
For these reasons, SQL remains one of the most important skills for a data scientist to have.
2.5.2 Installing SQLite
SQLite is a lightweight, file-based database that requires no server setup, making it perfect for learning.
Think of SQLite as a simple filing cabinet for your data that you can easily carry around, unlike larger database systems that require dedicated servers.
On macOS, SQLite comes pre-installed, but you can install a newer version with Homebrew:
# Install SQLite
brew install sqlite
2.5.2.3 On Linux
sudo apt update
sudo apt install sqlite3
2.5.2.4 Verifying Installation
Open a terminal or command prompt and type:
sqlite3 --version
You should see the version information displayed.
2.5.3 Creating Your First Database
Let’s create a simple database to verify our setup:
# Create a new database file (run this in your terminal)
sqlite3 sample.db

-- In the SQLite prompt, create a table
CREATE TABLE people (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  age INTEGER,
  city TEXT
);

-- Insert some data
INSERT INTO people (name, age, city) VALUES ('Alice', 28, 'New York');
INSERT INTO people (name, age, city) VALUES ('Bob', 35, 'Chicago');
INSERT INTO people (name, age, city) VALUES ('Charlie', 42, 'San Francisco');

-- Query the data
SELECT * FROM people;

-- Exit SQLite
.exit
Think of this process as creating a spreadsheet (the table) within a file (the database), then adding some rows of data, and finally viewing all the data.
2.5.4 SQL GUIs for Easier Database Management
While the command line is powerful, graphical interfaces can make working with databases more intuitive:
2.5.4.1 DB Browser for SQLite
This free, open-source tool provides a user-friendly interface for SQLite databases.
Download the appropriate version for your operating system
Install and open it
Open the sample.db file you created earlier
DB Browser for SQLite acts like a spreadsheet program for your database, making it easier to view and edit data without typing SQL commands.
2.5.4.2 Using SQL from Python and R
You can also interact with SQLite databases from Python and R:
2.5.4.2.1 Python
import sqlite3
import pandas as pd

# Connect to the database
conn = sqlite3.connect('sample.db')

# Query data into a pandas DataFrame
df = pd.read_sql_query("SELECT * FROM people", conn)

# Display the data
print(df)

# Close the connection
conn.close()
2.5.4.2.2 R
library(RSQLite)
library(DBI)

# Connect to the database
conn <- dbConnect(SQLite(), "sample.db")

# Query data into a data frame
df <- dbGetQuery(conn, "SELECT * FROM people")

# Display the data
print(df)

# Close the connection
dbDisconnect(conn)
This interoperability between SQL, Python, and R is a fundamental skill for data scientists, allowing you to leverage the strengths of each tool. You can store data in a database, query it with SQL, then analyze it with Python or R—all within the same workflow.
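The store, query, analyze loop can be sketched end to end with nothing but Python’s standard library. The example below rebuilds the people table in an in-memory database (an assumption made so that sample.db is left untouched), lets SQL do the filtering, and hands the result to Python for a simple summary. The age threshold of 30 is arbitrary.

```python
import sqlite3
import statistics

# ":memory:" creates a throwaway database that disappears when closed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER, city TEXT)"
)
conn.executemany(
    "INSERT INTO people (name, age, city) VALUES (?, ?, ?)",
    [("Alice", 28, "New York"), ("Bob", 35, "Chicago"), ("Charlie", 42, "San Francisco")],
)

# SQL does the filtering; the ? placeholder keeps the value separate from the query text.
min_age = 30
rows = conn.execute(
    "SELECT name, age FROM people WHERE age >= ? ORDER BY age", (min_age,)
).fetchall()
conn.close()

# Python takes over for the analysis step.
ages = [age for _, age in rows]
mean_age = statistics.mean(ages)
print(rows)
print(mean_age)
```

Passing values through placeholders rather than pasting them into the SQL string is a habit worth forming early, since it avoids quoting mistakes and SQL injection.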
2.6 Integrated Development Environments (IDEs)
An Integrated Development Environment (IDE) combines the tools needed for software development into a single application. A good IDE dramatically improves productivity by providing code editing, debugging, execution, and project management in one place.
2.6.1 Why IDEs Matter for Data Science
IDEs help data scientists by:
Providing syntax highlighting and code completion
Catching errors before execution
Offering integrated documentation
Simplifying project organization and version control
Many professional developers work in an IDE or a heavily extended editor rather than a basic text editor, because the additional features noticeably improve productivity.
Think of an IDE as a fully equipped workshop rather than just having a single tool. It has everything arranged conveniently in one place.
We’ve already installed RStudio for R development. Now let’s look at options for Python and SQL.
2.6.2 VS Code: A Universal IDE
Visual Studio Code (VS Code) is a free editor, built on an open-source codebase, that supports multiple languages through extensions. Its flexibility makes it an excellent choice for data scientists.
Download the appropriate version for your operating system
Run the installer and follow the prompts
2.6.2.2 Essential VS Code Extensions for Data Science
After installing VS Code, add these extensions by clicking on the Extensions icon in the sidebar (or pressing Ctrl+Shift+X):
Python by Microsoft: Python language support
Jupyter: Support for Jupyter notebooks
Rainbow CSV: Makes CSV files easier to read
SQLite: SQLite database support
R: R language support (if you plan to use R in VS Code)
GitLens: Enhanced Git capabilities
Extensions in VS Code are like add-ons or plugins that enhance its functionality for specific tasks or languages, similar to how you might install apps on your phone to give it new capabilities.
2.6.2.3 Configuring VS Code for Python
Open VS Code
Press Ctrl+Shift+P (Cmd+Shift+P on Mac) to open the command palette
Type “Python: Select Interpreter” and select it
Choose your conda environment (e.g., datasci)
This step tells VS Code which Python installation to use when running your code. It’s like telling a multilingual person which language to speak when communicating with you.
2.6.3 PyCharm Community Edition
PyCharm is an IDE specifically designed for Python development, with excellent data science support.
Browse to your conda environment’s Python executable
On Windows: Usually in C:\Users\<username>\anaconda3\envs\datasci\python.exe
On macOS/Linux: Usually in ~/anaconda3/envs/datasci/bin/python (that is, /Users/<username>/anaconda3/... on macOS and /home/<username>/anaconda3/... on Linux)
Note: In file paths, forward slashes (/) are primarily used in Unix-like systems like Linux and macOS, while backslashes (\) are commonly used in Windows.
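The difference between the two separators is easy to see with Python’s pathlib, which can model both conventions on any operating system. In this sketch the username is a placeholder and the folder layout simply follows the typical Anaconda locations mentioned above.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Placeholder username; substitute your own.
username = "your_username"

windows_python = (
    PureWindowsPath("C:/Users") / username / "anaconda3" / "envs" / "datasci" / "python.exe"
)
linux_python = (
    PurePosixPath("/home") / username / "anaconda3" / "envs" / "datasci" / "bin" / "python"
)

print(windows_python)             # rendered with backslashes
print(linux_python)               # rendered with forward slashes
print(windows_python.as_posix())  # the same Windows path written with forward slashes
```

Building paths with the / operator instead of typing separators by hand is one way to keep scripts portable between Windows and Unix-like systems.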
2.6.4 Working with Jupyter Notebooks
While we already mentioned Jupyter notebooks in the Python section, they deserve more attention as a popular IDE-like interface for data science.
2.6.4.1 JupyterLab: The Next Generation of Jupyter
JupyterLab is a web-based interactive development environment that extends the notebook interface with a file browser, consoles, terminals, and more.
JupyterLab provides a more IDE-like experience than classic Jupyter notebooks, with the ability to open multiple notebooks, view data frames, and edit other file types in a single interface. It’s like upgrading from having separate tools to having a comprehensive workbench.
2.6.5 Choosing the Right IDE
Each IDE has strengths and weaknesses:
VS Code: Versatile, lightweight, supports multiple languages
PyCharm: Robust Python-specific features, excellent for large projects
RStudio: Optimized for R development
JupyterLab: Excellent for exploratory data analysis and sharing results
Many data scientists use multiple IDEs depending on the task. For example, you might use:
JupyterLab for exploration and visualization
VS Code for script development and Git integration
RStudio for statistical analysis and report generation
Choose the tools that best fit your workflow and preferences. It’s perfectly fine to start with one and add others as you grow more comfortable.
2.7 Version Control with Git and GitHub
Version control is a system that records changes to files over time, allowing you to recall specific versions later. Git is the most widely used version control system, and GitHub is a popular platform for hosting Git repositories.
2.7.1 Why Version Control for Data Science?
Version control is essential for data science because it:
Tracks changes to code and documentation
Facilitates collaboration with others
Provides a backup of your work
Documents the evolution of your analysis
Enables reproducibility by capturing the state of code at specific points
Proper version control is essential for reproducibility and collaboration in data science work.
Think of Git as a time machine for your code. It allows you to save snapshots of your project at different points in time and revisit or restore those snapshots if needed.
GitHub is to Git what social media is to your photos—a place to share your work with others and collaborate on projects.
2.7.4 Setting Up SSH Authentication for GitHub
Using SSH keys makes it more secure and convenient to interact with GitHub:
2.7.4.1 Generating SSH Keys
# Generate a new SSH key
ssh-keygen -t ed25519 -C "your.email@example.com"
# Start the SSH agent
eval "$(ssh-agent -s)"
# Add your key to the agent
ssh-add ~/.ssh/id_ed25519
SSH keys are like a special lock and key system. Instead of typing your password every time you interact with GitHub, your computer uses these keys to prove it’s really you.
2.7.4.2 Adding Your SSH Key to GitHub
Copy your public key to the clipboard:
On Windows (in Git Bash): cat ~/.ssh/id_ed25519.pub | clip
On macOS: pbcopy < ~/.ssh/id_ed25519.pub
On Linux: cat ~/.ssh/id_ed25519.pub | xclip -selection clipboard
Go to GitHub → Settings → SSH and GPG keys → New SSH key
Paste your key and save
2.7.5 Basic Git Workflow
Let’s create a repository and learn the essential Git commands:
# Create a new directory
mkdir my_first_repo
cd my_first_repo
# Initialize a Git repository
git init
# Create a README file
echo "# My First Repository" > README.md
# Add the file to the staging area
git add README.md
# Commit the changes
git commit -m "Initial commit"
Think of this process as:
Creating a new folder for your project
Telling Git to start tracking changes in this folder
Creating a simple text file
Telling Git you want to include this file in your next snapshot
Taking the snapshot with a brief description
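Because these are ordinary commands, the five steps can also be driven from a script. The sketch below is one way to do that with Python’s subprocess module, run in a temporary directory. The function name make_first_commit is made up for this example, and the name and email passed with -c are placeholders, supplied only so the example runs on a machine where Git has no global configuration. If Git is not installed, the function returns None instead of failing.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def make_first_commit(repo_dir):
    """Run init/add/commit in repo_dir; return the one-line log, or None without git."""
    if shutil.which("git") is None:
        return None

    def git(*args):
        # Placeholder identity, passed per command so no global config is needed.
        cmd = [
            "git",
            "-c", "user.name=Example User",
            "-c", "user.email=your.email@example.com",
            "-c", "commit.gpgsign=false",
            *args,
        ]
        result = subprocess.run(cmd, cwd=repo_dir, check=True, capture_output=True, text=True)
        return result.stdout

    git("init")                                                          # start tracking
    (Path(repo_dir) / "README.md").write_text("# My First Repository\n") # create a file
    git("add", "README.md")                                              # stage it
    git("commit", "-m", "Initial commit")                                # take the snapshot
    return git("log", "--oneline")

with tempfile.TemporaryDirectory() as scratch:
    log = make_first_commit(scratch)
print(log)
```

In day-to-day work you would normally type these commands yourself. Scripting them is mainly useful when you need to set up many repositories the same way.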
2.7.6 Connecting to GitHub
Now let’s push this local repository to GitHub:
On GitHub, click “+” in the top-right corner and select “New repository”
Name it “my_first_repo”
Leave it as a public repository
Don’t initialize with a README (we already created one)
Click “Create repository”
Follow the instructions that GitHub displays on the new repository page under “push an existing repository from the command line”.
This process connects your local repository to GitHub (like linking your local folder to a cloud storage service) and uploads your code.
2.7.7 Basic Git Commands for Daily Use
These commands form the core of day-to-day Git usage:
# Check status of your repository
git status
# View commit history
git log
# Create and switch to a new branch
git checkout -b new-feature
# Switch between existing branches
git checkout main
# Pull latest changes from remote repository
git pull
# Add all changed files to staging
git add .
# Commit staged changes
git commit -m "Description of changes"
# Push commits to remote repository
git push
Think of branches as parallel versions of your project. The main branch is like the trunk of a tree, and other branches are like branches growing out from it. You can work on different features in different branches without affecting the main branch, then combine them when they’re ready.
2.7.8 Using Git in IDEs
Most modern IDEs integrate with Git, making version control easier:
2.7.8.1 VS Code
Click the Source Control icon in the sidebar
Use the interface to stage, commit, and push changes
2.7.8.2 PyCharm
Go to VCS → Git in the menu
Use the interface for Git operations
2.7.8.3 RStudio
Click the Git tab in the upper-right panel
Use the interface for Git operations
These integrations mean you don’t have to use the command line for every Git operation—you can manage version control without leaving your coding environment.
2.7.9 Collaborating with Others on GitHub
GitHub facilitates collaboration through pull requests:
Fork someone’s repository by clicking the “Fork” button on GitHub
On GitHub, navigate to your fork and click “New pull request”
Pull requests allow project maintainers to review your changes before incorporating them. It’s like submitting a draft for review before it gets published.
The “fork and pull request” workflow is widely used across open-source projects, from small libraries to major ones like TensorFlow and pandas. It’s considered a best practice for collaborative development.