2 Setting Up for Success: Infrastructure for the Modern Data Scientist

2.1 Introduction

If you’re coming from economics, statistics, engineering, or another technical field, you already have many of the analytical skills needed to make productive use of data. What you may be missing is the technical infrastructure that supports modern data science work, and this guide is here to help you set it up. If you don’t have a computer science background, the tooling may seem overwhelming at first, but by the end you’ll have what you need to make your workflows more productive.

This guide focuses on getting you set up with the tools you need to practise data science, rather than teaching you how to code. Think of it as preparing your workshop before you begin crafting. We’ll cover installing and configuring the essential software, platforms, and tools that data scientists use regularly.

By the end of this guide, you’ll have:

A fully configured development environment for Python, R, and SQL
Experience with version control through Git and GitHub
The ability to create interactive reports and visualisations
Knowledge of how to deploy your work for others to see and use
A foundation in the command line and other developer tools

While this guide is written to provide a natural progression from fundamental concepts to more involved material that builds on prior knowledge, each chapter is designed to be a standalone reference—you don’t need to read “Understanding the Command Line” if all you need is help with app deployment.

Most of the tools in this guide are free or have a free tier. The exceptions are some of the cloud platforms, which are free to set up but incur usage costs. You can therefore get started without cost driving your choices.

2.2 Understanding the Command Line

Before starting with specific data science tools, we need to understand one of the most fundamental interfaces in computing: the command line. Many data science tools are best installed, configured, and sometimes even used through this text-based interface. Integrated Development Environments (IDEs) such as Visual Studio Code and RStudio, which we discuss later, also include built-in terminals, so what you learn here carries over to almost any workflow.

2.2.1 What is the Command Line?

The command line (also called terminal, shell, or console) is a text-based interface where you type commands for the computer to execute. While graphical user interfaces (GUIs) let you point and click, the command line gives you more precise control through text commands.

Why use the command line when we have modern GUIs?

Many data science tools are designed to be used this way: Tools like Git, Docker, and many Python and R package management utilities primarily use command-line interfaces.
It allows for reproducibility through scripts: Command-line operations can be saved in script files and run again later, ensuring that the exact same steps are followed each time. This reproducibility is essential for reliable data analysis.
It often provides more flexibility and power: Command-line tools typically offer more options and configurations than their graphical counterparts. For example, when installing Python packages, the command-line tool pip offers dozens of options to handle dependencies, versions, and installation locations that aren’t available in most graphical installers.
It’s faster for many operations once you learn the commands: After becoming familiar with the commands, many operations can be performed more quickly than navigating through multiple screens in a GUI. For instance, you can install multiple Python packages with a single command rather than clicking through installation wizards for each one.

2.2.2 Getting Started with the Command Line

2.2.2.1 On Windows

Windows offers several options for command line interfaces:

Command Prompt: Built into Windows, but limited in functionality
PowerShell: A more powerful alternative built into Windows
Windows Subsystem for Linux (WSL): Provides a Linux environment within Windows (recommended)

To install WSL, open PowerShell as administrator and run:

wsl --install

This installs Ubuntu Linux by default. After installation, restart your computer and follow the setup prompts.

2.2.2.2 On macOS

The Terminal application comes pre-installed:

Press Cmd+Space to open Spotlight search
Type “Terminal” and press Enter

2.2.2.3 On Linux

Most Linux distributions come with a terminal emulator. Look for “Terminal” in your applications menu.

2.2.3 Essential Command Line Operations

Let’s practise some basic commands. Open your terminal and try these:

2.2.3.1 Navigating the File System

# Print working directory (shows where you are)
pwd

# List files and directories
ls

# Change directory (here, into Documents)
cd Documents

# Go up one directory level (to the parent folder)
cd ..

# Create a new directory
mkdir data_science_projects

# Remove a file (be careful: this is permanent, there is no recycle bin)
rm filename.txt

# Remove an empty directory
rmdir directory_name

These commands form the foundation of file navigation and manipulation. As you work with data science tools, you’ll find yourself using them frequently.

The commands above are like giving directions to your computer. Just as you might tell someone “Go down this street, then turn left at the second intersection,” these commands tell your computer “Show me where I am,” “Show me what’s here,” “Go into this folder,” and so on.

2.2.3.2 Creating and Editing Files

While you can create files through the command line, it’s often easier to use a text editor. However, it’s good to know these commands:

# Create an empty file
touch newfile.txt

# Display file contents
cat filename.txt

# Open the file in the vim editor (press i to insert, Esc then :wq to save and quit)
vim filename.txt

Think of these commands as ways to create and look at the contents of notes or documents on your computer, all without leaving the terminal.
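The navigation and file commands above have close counterparts in Python’s standard pathlib module, which becomes useful once you start scripting these steps. The following is a minimal sketch rather than a recommended workflow: the function name tour is made up for illustration, and everything happens inside a temporary directory so none of your real folders are touched.

```python
import tempfile
from pathlib import Path


def tour(base: Path) -> list[str]:
    """Mirror pwd, mkdir, touch, ls, rm and rmdir inside `base`."""
    print("pwd ->", base)                         # like `pwd`

    project = base / "data_science_projects"
    project.mkdir()                               # like `mkdir data_science_projects`

    notes = project / "filename.txt"
    notes.touch()                                 # like `touch filename.txt`

    listing = sorted(p.name for p in project.iterdir())   # like `ls`
    print("ls  ->", listing)

    notes.unlink()                                # like `rm filename.txt`
    project.rmdir()                               # like `rmdir` (empty directories only)
    return listing


# Work in a throwaway directory that is deleted automatically afterwards.
with tempfile.TemporaryDirectory() as tmp:
    result = tour(Path(tmp))
```

Because the sketch cleans up after itself, you can run it repeatedly without leaving files behind.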

2.2.4 Package Managers

Most command line environments include package managers, which help install and update software. Think of package managers as app stores for your command line. Common ones include:

apt (Ubuntu/Debian Linux)
brew (macOS)
winget (Windows)

For example, on Ubuntu you might install Python using:

sudo apt update
sudo apt install python3

On macOS with Homebrew:

# Install Homebrew first if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Then install Python
brew install python

The sudo prefix runs a single command with temporary administrator-level privileges, similar to when Windows asks “Do you want to allow this app to make changes to your device?”

Understanding these basics will help tremendously as we set up our data science tools. The command line might seem intimidating at first, but it becomes an invaluable ally as you grow more comfortable with it.

2.3 Setting Up Python

Python has become a cornerstone language in data science due to its readability, extensive libraries, and versatile applications. Let’s set up a proper Python environment.

2.3.1 Why Python for Data Science?

Python offers several advantages for data science:

Rich ecosystem of specialised libraries (NumPy, pandas, scikit-learn, etc.)
Readable syntax that makes complex analyses more accessible
Strong community support and documentation
Integration with various data sources and visualisation tools

Python consistently ranks among the top programming languages for data science and is widely used across the industry.

2.3.2 Installing Python

You have a few reasonable ways to install Python for data science. Each has trade-offs:

Miniforge (recommended for most readers): a minimal, community-maintained conda distribution that defaults to the free conda-forge channel. Gives you conda for environment management without the licensing concerns that now apply to the full Anaconda distribution for commercial users in larger organisations.
Anaconda: the “batteries-included” distribution. Convenient but note that since 2024 its Terms of Service restrict free commercial use for organisations with more than ~200 employees. Fine for learning and personal projects.
uv (the modern, fast option): a single binary that installs Python versions and manages virtual environments and packages. Much faster than conda or pip and rapidly becoming the default for pure-Python projects. A great choice once you’re comfortable, especially if you don’t need conda’s scientific package ecosystem.
System Python (python.org installer, brew install python, or apt install python3): simple, but you’ll still want to create isolated environments per project with venv or uv venv.

For this book we’ll use conda-based instructions since they work uniformly across Windows, macOS, and Linux and handle tricky scientific packages well. Everything shown below works with either Miniforge or Anaconda.

2.3.2.1 Installing Miniforge (or Anaconda)

Visit the Miniforge releases page (or the Anaconda download page if you prefer)
Download the appropriate installer for your operating system
Run the installer and follow the prompts

During installation on Windows, you may be asked whether to add the installation to your PATH environment variable. Leave this unchecked and use the Miniforge Prompt (or Anaconda Prompt) from the Start menu, which avoids conflicts with other Python installations on your system.

The “PATH” is like an address book that tells your computer where to find programs when you type their names. Adding your conda install to PATH means you can use Python from any command prompt, but it could cause conflicts with other versions of Python on your system.
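To make the address-book idea concrete, the sketch below prints the first few directories on your PATH and shows which python the system would pick. It uses only the standard library; the helper names path_entries and first_match are invented for this example.

```python
import os
import shutil


def path_entries() -> list[str]:
    """Split the PATH variable into its directories, in search order."""
    return [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]


def first_match(program: str):
    """Return the full path of the first `program` found on PATH, or None."""
    return shutil.which(program)


# Show the first five directories the system searches.
for position, directory in enumerate(path_entries()[:5], start=1):
    print(position, directory)

# The first directory on PATH that contains a program wins.
print("python resolves to:", first_match("python"))
```

Directories are searched in order, which is why two Python installations on the same PATH can conflict: whichever one appears earlier is the one that runs.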

2.3.2.2 Verifying Installation

Open a new terminal (or Miniforge/Anaconda Prompt on Windows) and type:

python --version

You should see the Python version number. Also, check that conda is installed:

conda --version

2.3.3 Creating a Python Environment

Environments let you isolate projects with specific dependencies. Think of environments as separate workspaces for different projects—like having different toolboxes for different types of jobs. Here’s how to create one:

# Create an environment named 'datasci' with Python 3.12
conda create -n datasci python=3.12

# Activate the environment
conda activate datasci

# Install common data science packages
conda install numpy pandas matplotlib scikit-learn jupyter

Whenever you work on your data science projects, activate this environment first.
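If you are ever unsure whether the right environment is active, you can ask Python itself. This is a small sketch using only the standard library; describe_interpreter is a hypothetical helper name. It relies on conda setting the CONDA_DEFAULT_ENV variable when an environment is activated, and on sys.prefix differing from sys.base_prefix inside a venv.

```python
import os
import sys


def describe_interpreter() -> dict:
    """Collect a few facts that identify which Python is currently running."""
    return {
        # Full path to the running interpreter
        "executable": sys.executable,
        "version": ".".join(map(str, sys.version_info[:3])),
        # Set by conda when an environment is activated, otherwise None
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Inside a venv, sys.prefix points at the venv rather than the base install
        "in_venv": sys.prefix != sys.base_prefix,
    }


for key, value in describe_interpreter().items():
    print(f"{key:>10}: {value}")
```

With the datasci environment active, you would expect conda_env to read datasci and the executable path to sit inside that environment’s folder.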

Lighter alternative: venv or uv

If you don’t need conda’s broader scientific ecosystem, Python’s built-in venv module is a lighter alternative:

python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
pip install numpy pandas matplotlib scikit-learn jupyter

Or with uv (much faster, installs Python for you if needed):

uv venv --python 3.12
source .venv/bin/activate   # Windows: .venv\Scripts\activate
uv pip install numpy pandas matplotlib scikit-learn jupyter

A good rule of thumb: use conda environments when you need packages with heavy C/C++/Fortran components (GDAL, PyTorch with CUDA, bioinformatics tools). Use venv/uv for everything else.

2.3.4 Recording Your Environment

Whichever tool you pick, write down what’s in your environment so someone else (including future you) can reproduce it:

# conda
conda env export --from-history > environment.yml

# pip / venv
pip freeze > requirements.txt

# uv
uv pip freeze > requirements.txt

Commit this file alongside your code. We’ll come back to reproducibility when we discuss containers and workflows.
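To see what kind of information these files contain, the sketch below builds a similar list of name==version pins from inside Python using the standard importlib.metadata module. It is an approximation for illustration only, not how pip or uv implement freeze, and installed_packages is a made-up helper name.

```python
from importlib.metadata import distributions


def installed_packages() -> list[str]:
    """Return 'name==version' lines for every installed distribution, sorted by name."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


# Print the first ten pins found in the active environment.
for line in installed_packages()[:10]:
    print(line)
```

For a real project, prefer the commands above: they also handle details such as editable installs that this sketch ignores.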

2.3.5 Using Jupyter Notebooks

Jupyter notebooks provide an interactive environment for Python development, popular in data science for combining code, visualisations, and narrative text. They’re like digital lab notebooks where you can document your analysis process along with the code and results.

# Make sure your environment is activated
conda activate datasci

# Launch Jupyter Notebook
jupyter notebook

This opens a web browser where you can create and work with notebooks. Let’s create a simple notebook to verify everything works:

Click “New” → “Python 3”
In the first cell, type:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create some sample data
data = pd.DataFrame({
    'x': range(1, 11),
    'y': np.random.randn(10)
})

# Create a simple plot
plt.figure(figsize=(8, 4))
plt.plot(data['x'], data['y'], marker='o')
plt.title('Sample Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.grid(True)
plt.show()

print("Python environment is working correctly!")

Press Shift+Enter to run the cell

If you see a plot and the success message, your Python setup is complete!

2.3.6 Installing Additional Packages

As your data science journey progresses, you’ll need additional packages. Use either:

# Using conda (preferred when available)
conda install package_name

# Using pip (when packages aren't available in conda)
pip install package_name

Conda is often preferred for data science packages because it handles complex dependencies better, especially for packages with C/C++ components. This is particularly important for libraries that have parts written in lower-level programming languages to make them run faster.
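Before installing anything, it can help to check what is already present in the active environment. The sketch below does that with the standard library’s importlib.util.find_spec, which looks a package up without importing it. The REQUIRED list and the missing_packages name are assumptions for this example; adjust them to your project.

```python
from importlib.util import find_spec

# Import names, which occasionally differ from install names
# (scikit-learn is imported as `sklearn`).
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn"]


def missing_packages(import_names=REQUIRED) -> list[str]:
    """Return the import names that cannot be found in the active environment."""
    return [name for name in import_names if find_spec(name) is None]


missing = missing_packages()
print("All set!" if not missing else f"Missing: {missing}")
```

Anything reported as missing can then be installed with conda install or pip install as shown above.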

2.4 Setting Up R

R is a powerful language and environment specifically designed for statistical computing and graphics. Many statisticians and data scientists prefer R for statistical analysis and visualisation.

2.4.1 Why R for Data Science?

R offers several advantages:

Built specifically for statistical analysis
Excellent for data visualisation with ggplot2
A rich ecosystem of packages for specialised statistical methods
Strong in reproducible research through R Markdown

R has thousands of packages available on CRAN for various statistical and data analysis tasks, with active development from the statistics and research communities.

2.4.2 Installing R

Let’s install both R itself and RStudio, a popular integrated development environment for R.

2.4.2.1 Installing Base R

Visit the Comprehensive R Archive Network (CRAN)
Click on the link for your operating system
Follow the installation instructions

2.4.2.2 Installing RStudio Desktop

RStudio provides a user-friendly interface for working with R.

Visit the Posit RStudio download page (the company behind RStudio renamed itself Posit in 2022)
Download the free RStudio Desktop version for your operating system
Run the installer and follow the prompts

Think of R as the engine and RStudio as the dashboard that makes it easier to control that engine. You could use R without RStudio, but RStudio makes many tasks more convenient.

2.4.2.3 Verifying Installation

Open RStudio and enter this command in the console (lower-left pane):

R.version.string

You should see the R version information displayed as a single line beginning with “R version”. The same information appears at the top of the console whenever you start a new session.

2.4.3 Essential R Packages for Data Science

Let’s install some core packages that you’ll likely need:

# Install essential packages
install.packages(c("tidyverse", "rmarkdown", "shiny", "knitr", "plotly"))

This installs:

tidyverse: A collection of packages for data manipulation and visualisation
rmarkdown: For creating documents that mix code and text
shiny: For building interactive web applications
knitr: For dynamic report generation
plotly: For interactive visualisations

These packages are like specialised toolkits that expand what you can do with R. The tidyverse, for example, makes data manipulation much more intuitive than it would be using just base R.

2.4.4 Creating Your First R Script

Let’s verify our setup with a simple R script:

In RStudio, go to File → New File → R Script
Enter the following code:

# Load libraries
library(tidyverse)

# Create sample data
data <- tibble(
  x = 1:10,
  y = rnorm(10)
)

# Create a plot with ggplot2
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_line() +
  labs(title = "Sample Plot in R",
       x = "X-axis",
       y = "Y-axis") +
  theme_minimal()

print("R environment is working correctly!")

Click the “Run” button or press Ctrl+Enter (Cmd+Enter on Mac) to execute the code

If you see a plot in the lower-right pane and the success message in the console, your R setup is complete!

2.4.5 Understanding R Packages

Unlike Python, where conda or pip manage packages, R has its own built-in package management system accessed through functions like install.packages() and library().

There are thousands of R packages available on CRAN, with more on Bioconductor (for bioinformatics) and GitHub. To install a package from GitHub, you first need the devtools package:

install.packages("devtools")
devtools::install_github("username/package")

Think of CRAN as the official app store for R packages, while GitHub is like getting apps directly from developers. Both are useful, but packages on CRAN have gone through more quality checks.

2.5 SQL Fundamentals and Setup

SQL (Structured Query Language) is essential for data scientists to interact with databases. We’ll set up a lightweight database system so you can practise SQL queries locally.

2.5.1 Why SQL for Data Science?

SQL is crucial for data science because:

Most organisational data resides in databases
It provides a standard way to query and manipulate data
It’s often more efficient than Python or R for large data operations
Data transformation often happens in databases before analysis

For these reasons, SQL is one of the most important skills for a data scientist to learn.

2.5.2 Installing SQLite

SQLite is a lightweight, file-based database that requires no server setup, making it perfect for learning.

Think of SQLite as a simple filing cabinet for your data that you can easily carry around, unlike larger database systems that require dedicated servers.

2.5.2.1 On Windows

Download the SQLite command-line tools (look for the “Precompiled Binaries for Windows” section) from the SQLite download page
Extract the ZIP to a folder (e.g., C:\sqlite)
Add this folder to your PATH environment variable so you can run sqlite3 from any terminal:
- Press the Windows key, type “environment variables”, and open “Edit the system environment variables”
- Click “Environment Variables…”
- Under User variables, select Path and click “Edit…”
- Click “New” and paste C:\sqlite (or wherever you extracted the files)
- Click OK on every dialogue, then open a new terminal window for the change to take effect

2.5.2.2 On macOS

SQLite comes pre-installed, but you can install a newer version with Homebrew:

# Install SQLite
brew install sqlite

2.5.2.3 On Linux

sudo apt update
sudo apt install sqlite3

2.5.2.4 Verifying Installation

Open a terminal or command prompt and type:

sqlite3 --version

You should see the version information displayed.

2.5.3 Creating Your First Database

Let’s create a simple database to verify our setup:

# Create a new database file (this opens the SQLite prompt)
sqlite3 sample.db

-- At the sqlite> prompt, create a table
CREATE TABLE people (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    age INTEGER,
    city TEXT
);

-- Insert some data
INSERT INTO people (name, age, city) VALUES ('Alice', 28, 'New York');
INSERT INTO people (name, age, city) VALUES ('Bob', 35, 'Chicago');
INSERT INTO people (name, age, city) VALUES ('Charlie', 42, 'San Francisco');

-- Query the data
SELECT * FROM people;

-- Exit SQLite
.exit

Think of this process as creating a spreadsheet (the table) within a file (the database), then adding some rows of data, and finally viewing all the data.
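The same three steps (create a table, add rows, view them) can be scripted with Python’s built-in sqlite3 module, which needs no separate installation. This is a minimal sketch: it uses an in-memory database so it does not touch sample.db, and build_people is a helper name invented for the example.

```python
import sqlite3

ROWS = [
    ("Alice", 28, "New York"),
    ("Bob", 35, "Chicago"),
    ("Charlie", 42, "San Francisco"),
]


def build_people(conn: sqlite3.Connection) -> None:
    """Create the people table and insert the three sample rows."""
    conn.execute(
        "CREATE TABLE people ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER, city TEXT)"
    )
    # `?` placeholders let sqlite3 handle quoting of the values
    conn.executemany("INSERT INTO people (name, age, city) VALUES (?, ?, ?)", ROWS)
    conn.commit()


conn = sqlite3.connect(":memory:")   # in-memory database; nothing is written to disk
build_people(conn)
for row in conn.execute("SELECT * FROM people"):
    print(row)
conn.close()
```

Replacing ":memory:" with a file name would store the table in a database file on disk instead.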

2.5.4 SQL GUIs for Easier Database Management

While the command line is powerful, graphical interfaces can make working with databases more intuitive:

2.5.4.1 DB Browser for SQLite

This free, open-source tool provides a user-friendly interface for SQLite databases.

Visit the DB Browser for SQLite download page
Download the appropriate version for your operating system
Install and open it
Open the sample.db file you created earlier

DB Browser for SQLite acts like a spreadsheet program for your database, making it easier to view and edit data without typing SQL commands.

2.5.4.2 Using SQL from Python and R

You can also interact with SQLite databases from Python and R:

2.5.4.2.1 Python

import sqlite3
import pandas as pd

# Connect to the database
conn = sqlite3.connect('sample.db')

# Query data into a pandas DataFrame
df = pd.read_sql_query("SELECT * FROM people", conn)

# Display the data
print(df)

# Close the connection
conn.close()

2.5.4.2.2 R

library(RSQLite)
library(DBI)

# Connect to the database
conn <- dbConnect(SQLite(), "sample.db")

# Query data into a data frame
df <- dbGetQuery(conn, "SELECT * FROM people")

# Display the data
print(df)

# Close the connection
dbDisconnect(conn)

This interoperability between SQL, Python, and R is a fundamental skill for data scientists, allowing you to leverage the strengths of each tool. You can store data in a database, query it with SQL, then analyse it with Python or R—all within the same workflow.
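As a small illustration of that workflow, the sketch below lets the database do the filtering and then summarises the result in Python. It assumes a people table like the one created earlier but rebuilds it in memory so the example is self-contained; ages_over is a hypothetical helper name.

```python
import sqlite3
from statistics import mean


def ages_over(conn: sqlite3.Connection, min_age: int) -> list[int]:
    """Let the database filter the rows, then return the matching ages."""
    rows = conn.execute(
        "SELECT age FROM people WHERE age > ? ORDER BY age", (min_age,)
    )
    return [age for (age,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Alice", 28, "New York"), ("Bob", 35, "Chicago"), ("Charlie", 42, "San Francisco")],
)

ages = ages_over(conn, 30)      # the filter runs inside SQLite
print(ages, mean(ages))         # the summary runs in Python
conn.close()
```

With three rows the split makes no practical difference, but on large tables filtering in the database means far less data has to be loaded into Python.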

2.6 Integrated Development Environments (IDEs)

An Integrated Development Environment (IDE) combines the tools needed for software development into a single application. A good IDE dramatically improves productivity by providing code editing, debugging, execution, and project management in one place.

2.6.1 Why IDEs Matter for Data Science

IDEs help data scientists by:

Providing syntax highlighting and code completion
Catching errors before execution
Offering integrated documentation
Simplifying project organisation and version control

Most professional developers use a specialised IDE rather than a basic text editor, as the additional features significantly improve productivity.

Think of an IDE as a fully equipped workshop rather than just having a single tool. It has everything arranged conveniently in one place.

We’ve already installed RStudio for R development. Now let’s look at options for Python and SQL.

2.6.2 VS Code: A Universal IDE

Visual Studio Code (VS Code) is a free editor from Microsoft, built on an open-source core, that supports multiple languages through extensions. Its flexibility makes it an excellent choice for data scientists.

2.6.2.1 Installing VS Code

Visit the VS Code download page
Download the appropriate version for your operating system
Run the installer and follow the prompts

2.6.2.2 Essential VS Code Extensions for Data Science

After installing VS Code, add these extensions by clicking on the Extensions icon in the sidebar (or pressing Ctrl+Shift+X, Cmd+Shift+X on Mac):

Python by Microsoft: Python language support
Jupyter: Support for Jupyter notebooks
Rainbow CSV: Makes CSV files easier to read
SQLite: SQLite database support
R: R language support (if you plan to use R in VS Code)
GitLens: Enhanced Git capabilities
GitHub Copilot or Claude for VS Code: LLM-assisted code completion and chat. These have become nearly standard for professional development since 2024 and are particularly useful for readers without a CS background: they’re excellent at scaffolding boilerplate, explaining unfamiliar code, and suggesting fixes to error messages. Both offer free tiers for individuals. Treat them as a fast pair-programmer, not an oracle: always read the suggestions before accepting them.

Extensions in VS Code are like add-ons or plugins that enhance its functionality for specific tasks or languages, similar to how you might install apps on your phone to give it new capabilities.

2.6.2.3 Configuring VS Code for Python

Open VS Code
Press Ctrl+Shift+P (Cmd+Shift+P on Mac) to open the command palette
Type “Python: Select Interpreter” and select it
Choose your conda environment (e.g., datasci)

This step tells VS Code which Python installation to use when running your code. It’s like telling a multilingual person which language to speak when communicating with you.

2.6.3 PyCharm Community Edition

PyCharm is an IDE specifically designed for Python development, with excellent data science support.

2.6.3.1 Installing PyCharm Community Edition

Visit the PyCharm download page
Download the free Community Edition
Run the installer and follow the prompts

2.6.3.2 Configuring PyCharm for Your Conda Environment

Open PyCharm
Create a new project
Click on “Previously configured interpreter”
Click on the gear icon and select “Add…”
Choose “Conda Environment” → “Existing environment”
Browse to your conda environment’s Python executable
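- On Windows: usually C:\Users\<username>\miniforge3\envs\datasci\python.exe
- On macOS: usually /Users/<username>/miniforge3/envs/datasci/bin/python
- On Linux: usually /home/<username>/miniforge3/envs/datasci/bin/python
- If you installed Anaconda instead, the folder is typically anaconda3 rather than miniforge3; running conda env list prints the exact location of each environment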

Note: In file paths, forward slashes (/) are primarily used in Unix-like systems like Linux and macOS, while backslashes (\) are commonly used in Windows.
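When you write paths in Python code, the standard pathlib module can take care of the separator for you. The sketch below builds the same kind of interpreter path in both styles; the user name me and the miniforge3 folder are placeholders for illustration, not paths that necessarily exist on your machine.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Placeholder locations: substitute your own user name and install folder.
windows_python = PureWindowsPath("C:/Users/me/miniforge3") / "envs" / "datasci" / "python.exe"
posix_python = PurePosixPath("/home/me/miniforge3") / "envs" / "datasci" / "bin" / "python"

print(windows_python)   # C:\Users\me\miniforge3\envs\datasci\python.exe
print(posix_python)     # /home/me/miniforge3/envs/datasci/bin/python
```

The "pure" path classes only manipulate text, so this runs on any operating system. In everyday code you would normally use pathlib.Path, which picks the right style for the machine it runs on.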

2.6.4 Working with Jupyter Notebooks

While we already mentioned Jupyter notebooks in the Python section, they deserve more attention as a popular IDE-like interface for data science.

2.6.4.1 JupyterLab: The Next Generation of Jupyter

JupyterLab is a web-based interactive development environment that extends the notebook interface with a file browser, consoles, terminals, and more.

# Install JupyterLab
conda activate datasci
conda install -c conda-forge jupyterlab

# Launch JupyterLab
jupyter lab

JupyterLab provides a more IDE-like experience than classic Jupyter notebooks, with the ability to open multiple notebooks, view data frames, and edit other file types in a single interface. It’s like upgrading from having separate tools to having a comprehensive workbench.

2.6.5 Choosing the Right IDE

Each IDE has strengths and weaknesses:

VS Code: Versatile, lightweight, supports multiple languages
PyCharm: Robust Python-specific features, excellent for large projects
RStudio: Optimised for R development
JupyterLab: Excellent for exploratory data analysis and sharing results

Many data scientists use multiple IDEs depending on the task. For example, you might use:

JupyterLab for exploration and visualisation
VS Code for script development and Git integration
RStudio for statistical analysis and report generation

Choose the tools that best fit your workflow and preferences. It’s perfectly fine to start with one and add others as you grow more comfortable.

2.7 Version Control with Git and GitHub

Version control is a system that records changes to files over time, allowing you to recall specific versions later. Git is the most widely used version control system, and GitHub is a popular platform for hosting Git repositories.

2.7.1 Why Version Control for Data Science?

Version control is essential for data science because it:

Tracks changes to code and documentation
Facilitates collaboration with others
Provides a backup of your work
Documents the evolution of your analysis
Enables reproducibility by capturing the state of code at specific points

Proper version control is essential for reproducibility and collaboration in data science work.

Think of Git as a time machine for your code. It allows you to save snapshots of your project at different points in time and revisit or restore those snapshots if needed.

2.7.2 Installing Git

2.7.2.1 On Windows

Download the installer from Git for Windows
Run the installer, accepting the default options (though you may want to choose VS Code as your default editor if you installed it)

2.7.2.2 On macOS

Git may already be installed. Check by typing git --version in the terminal. If not:

# Install Git using Homebrew
brew install git

2.7.2.3 On Linux

sudo apt update
sudo apt install git

2.7.2.4 Configuring Git

After installation, open a terminal and configure your identity:

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

This is like putting your name and address on a letter. When you make changes to a project, Git will know who made them.

2.7.3 Creating a GitHub Account

GitHub provides free hosting for Git repositories, making it easy to share code and collaborate.

Visit GitHub
Click “Sign up” and follow the instructions
Verify your email address

GitHub is to Git what social media is to your photos—a place to share your work with others and collaborate on projects.

2.7.4 Setting Up SSH Authentication for GitHub

Using SSH keys makes it more secure and convenient to interact with GitHub:

2.7.4.1 Generating SSH Keys

# Generate a new SSH key
ssh-keygen -t ed25519 -C "your.email@example.com"

# Start the SSH agent
eval "$(ssh-agent -s)"

# Add your key to the agent
ssh-add ~/.ssh/id_ed25519

When ssh-keygen prompts you for a passphrase, set one. It’s tempting to press Enter for a blank passphrase, but an unprotected private key on a compromised laptop gives an attacker your full GitHub access. The SSH agent will remember the passphrase for your session so you only type it once after login.

SSH keys are like a special lock and key system. Instead of typing your password every time you interact with GitHub, your computer uses these keys to prove it’s really you.

2.7.4.2 Adding Your SSH Key to GitHub

Copy your public key to the clipboard:
- On Windows (in Git Bash): cat ~/.ssh/id_ed25519.pub | clip
- On macOS: pbcopy < ~/.ssh/id_ed25519.pub
- On Linux: cat ~/.ssh/id_ed25519.pub | xclip -selection clipboard
Go to GitHub → Settings → SSH and GPG keys → New SSH key
Paste your key and save
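A common slip at this step is pasting the private key instead of the public one. The sketch below is a rough sanity check, not a validator: it only tests that the text has the general shape of an ed25519 public key line (key type, encoded key, optional comment). The function name looks_like_public_key is invented for this example, and the file location is the default used by the ssh-keygen command above.

```python
from pathlib import Path


def looks_like_public_key(text: str) -> bool:
    """True if `text` is shaped like an ed25519 public key line."""
    fields = text.strip().split()
    return (
        len(fields) >= 2
        and fields[0] == "ssh-ed25519"       # public key lines start with the key type
        and "PRIVATE KEY" not in text        # private key files contain this marker
    )


key_file = Path.home() / ".ssh" / "id_ed25519.pub"   # default location from ssh-keygen
if key_file.exists():
    ok = looks_like_public_key(key_file.read_text())
    print("OK to paste" if ok else "This does not look like a public key!")
else:
    print(f"No key found at {key_file}")
```

Only the .pub file should ever leave your computer; the file without that extension is the private key and stays where it is.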

2.7.5 Basic Git Workflow

Let’s create a repository and learn the essential Git commands:

# Create a new directory
mkdir my_first_repo
cd my_first_repo

# Initialize a Git repository
git init

# Create a README file
echo "# My First Repository" > README.md

# Add the file to the staging area
git add README.md

# Commit the changes
git commit -m "Initial commit"

Think of this process as:

Creating a new folder for your project
Telling Git to start tracking changes in this folder
Creating a simple text file
Telling Git you want to include this file in your next snapshot
Taking the snapshot with a brief description
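If you later want to automate these steps, the same commands can be driven from Python with the standard subprocess module. The sketch below is one way to do it under a few assumptions: Git is installed and on your PATH, the helper names git and first_commit are made up for the example, and a throwaway identity is passed with -c so it runs even before git config has been set up. It works in a temporary directory that is deleted afterwards.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command in `cwd` and return its standard output."""
    done = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return done.stdout.strip()


def first_commit(repo: Path) -> str:
    """Repeat the five steps above and return the one-line commit log."""
    repo.mkdir(parents=True, exist_ok=True)                  # 1. new folder
    git("init", cwd=repo)                                    # 2. start tracking
    (repo / "README.md").write_text("# My First Repository\n")   # 3. create a file
    git("add", "README.md", cwd=repo)                        # 4. stage it
    git(                                                     # 5. take the snapshot
        "-c", "user.name=Example", "-c", "user.email=example@example.com",
        "-c", "commit.gpgsign=false",
        "commit", "-m", "Initial commit", cwd=repo,
    )
    return git("log", "--oneline", cwd=repo)


if shutil.which("git"):            # skip quietly when Git is not installed
    with tempfile.TemporaryDirectory() as tmp:
        print(first_commit(Path(tmp) / "my_first_repo"))
```

The printed line is a short commit identifier followed by the commit message, which is the snapshot description from the final step.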

2.7.6 Telling Git What to Ignore: .gitignore

Before you commit anything else, create a .gitignore file in the root of your repository. This tells Git which files and folders it should never track: things like local virtual environments, cached files, large data dumps, and, critically, secrets like API keys. Getting this right from day one prevents the single most common beginner mistake: accidentally pushing credentials to a public repository.

Create a file named .gitignore (note the leading dot) with content like this as a starting point for a data science project:

# Secrets: never commit these
.env
.env.local
*.pem
credentials.json

# Python
__pycache__/
*.pyc
.venv/
venv/
.ipynb_checkpoints/

# R
.Rhistory
.RData
.Rproj.user/
renv/library/

# Data: keep raw data out of Git, use DVC or cloud storage
data/raw/
data/processed/
*.parquet
*.db

# OS and editor clutter
.DS_Store
Thumbs.db
.vscode/
.idea/

Adjust the patterns to fit your project. If you do need to store small reference datasets in Git, remove the relevant data/ lines. If you find yourself wanting to commit something you’ve already ignored, you can force it with git add -f <file>.
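To build intuition for how such patterns behave, here is a deliberately simplified Python approximation. It is not Git’s actual matching algorithm: it handles only the three pattern shapes used in the starter file (plain names and wildcards, folder names ending in a slash, and folder paths anchored at the repository root), and is_ignored is a hypothetical helper name.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A handful of patterns from the starter file above.
PATTERNS = [".env", "*.pem", "*.pyc", "__pycache__/", ".venv/", "data/raw/", "*.db"]


def is_ignored(path: str, patterns=PATTERNS) -> bool:
    """Simplified check of whether a file path falls under any pattern."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        wanted = PurePosixPath(pattern).parts
        if len(wanted) > 1:
            # A pattern with a slash in the middle (data/raw/) is anchored at the root.
            if parts[:len(wanted)] == wanted and len(parts) > len(wanted):
                return True
        elif pattern.endswith("/"):
            # Folder-only pattern: any parent folder of the file may match.
            if any(fnmatchcase(part, wanted[0]) for part in parts[:-1]):
                return True
        elif any(fnmatchcase(part, pattern) for part in parts):
            # Plain name or wildcard: may match the file or any folder above it.
            return True
    return False


for candidate in [".env", "src/__pycache__/model.pyc", "data/raw/survey.csv", "analysis.py"]:
    print(f"{candidate:30} ignored={is_ignored(candidate)}")
```

For real repositories, let Git answer the question: git check-ignore -v <path> reports which pattern, if any, causes a file to be ignored.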

If you ever accidentally commit a secret, rotating the credential is faster and safer than trying to scrub it from Git history. Assume anything that reached a public repository is compromised and generate a new key.

2.7.7 Connecting to GitHub

Now let’s push this local repository to GitHub:

On GitHub, click “+” in the top-right corner and select “New repository”
Name it “my_first_repo”
Leave it as a public repository
Don’t initialise with a README (we already created one)
Click “Create repository”
Follow the instructions for “push an existing repository from the command line”:

git remote add origin git@github.com:yourusername/my_first_repo.git
git branch -M main
git push -u origin main

This process connects your local repository to GitHub (like linking your local folder to a cloud storage service) and uploads your code.
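It can help to see what the remote address is made of. An SSH-style GitHub remote has the form git@<host>:<owner>/<repository>.git, and the same repository is also reachable at an HTTPS address. The sketch below pulls those pieces apart; the helper names parse_ssh_remote and ssh_to_https are hypothetical, and the regular expression only covers this one common URL shape.

```python
import re

# Matches remotes shaped like git@github.com:owner/repo.git (".git" optional).
_SSH_REMOTE = re.compile(
    r"^git@(?P<host>[^:]+):(?P<owner>[^/]+)/(?P<repo>.+?)(?:\.git)?$"
)


def parse_ssh_remote(url):
    """Split an SSH-style remote into (host, owner, repo)."""
    match = _SSH_REMOTE.match(url)
    if match is None:
        raise ValueError(f"Not an SSH-style remote: {url!r}")
    return match.group("host"), match.group("owner"), match.group("repo")


def ssh_to_https(url):
    """Build the equivalent HTTPS remote for the same repository."""
    host, owner, repo = parse_ssh_remote(url)
    return f"https://{host}/{owner}/{repo}.git"


if __name__ == "__main__":
    remote = "git@github.com:yourusername/my_first_repo.git"
    print(parse_ssh_remote(remote))
    print(ssh_to_https(remote))
```

To see which address your own repository is actually using, run git remote -v in the project folder.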

2.7.8 Basic Git Commands for Daily Use

These commands form the core of day-to-day Git usage:

# Check status of your repository
git status

# View commit history
git log

# Create and switch to a new branch
git switch -c new-feature      # older syntax: git checkout -b new-feature

# Switch between existing branches
git switch main                # older syntax: git checkout main

# Pull latest changes from remote repository
git pull

# Stage all new and changed files in the current directory
git add .

# Commit staged changes
git commit -m "Description of changes"

# Push commits to remote repository
git push

Think of branches as parallel versions of your project. The main branch is like the trunk of a tree, and every other branch grows out from it at the commit where you created it. You can work on different features in different branches without affecting the main branch, then merge them back when they’re ready.
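If the tree analogy feels abstract, a toy model may help. The sketch below treats each branch as a name attached to its own list of commit messages. It is a deliberately simplified illustration, with a made-up ToyRepo class, and not how Git stores data internally: real Git records snapshots linked to their parents, and real merges can produce conflicts.

```python
class ToyRepo:
    """A toy model: each branch name maps to a list of commit messages."""

    def __init__(self):
        self.branches = {"main": []}
        self.current = "main"

    def commit(self, message):
        # Build a new list so branches created earlier keep their own history.
        self.branches[self.current] = self.branches[self.current] + [message]

    def switch(self, name, create=False):
        if create:
            # A new branch starts from the current branch's history.
            self.branches[name] = list(self.branches[self.current])
        elif name not in self.branches:
            raise KeyError(f"No branch named {name!r}")
        self.current = name

    def merge(self, name):
        # Simplified merge: add commits from `name` that this branch lacks.
        for message in self.branches[name]:
            if message not in self.branches[self.current]:
                self.commit(message)


if __name__ == "__main__":
    repo = ToyRepo()
    repo.commit("Add README")
    repo.switch("new-feature", create=True)  # like: git switch -c new-feature
    repo.commit("Add plot")
    repo.switch("main")                      # like: git switch main
    print(repo.branches["main"])             # the feature commit is not here yet
    repo.merge("new-feature")
    print(repo.branches["main"])             # now it is
```

The point to take away is that work committed on new-feature stays invisible to main until you explicitly merge it.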

2.7.9 Using Git in IDEs

Most modern IDEs integrate with Git, making version control easier:

2.7.9.1 VS Code

Click the Source Control icon in the sidebar
Use the interface to stage, commit, and push changes

2.7.9.2 PyCharm

Open the Git menu in the menu bar (it is labelled VCS until Git is enabled for the project)
Use the interface for Git operations

2.7.9.3 RStudio

Click the Git tab in the upper-right panel (it appears once your RStudio project is a Git repository)
Use the interface for Git operations

These integrations mean you don’t have to use the command line for every Git operation—you can manage version control without leaving your coding environment.

2.7.10 Collaborating with Others on GitHub

GitHub facilitates collaboration through pull requests:

Fork someone’s repository by clicking the “Fork” button on GitHub

Clone your fork locally:

git clone git@github.com:yourusername/their-repo.git

Create a branch for your changes:
```
git switch -c my-feature
```
Make changes, commit them, and push to your fork:
```
git push origin my-feature
```
On GitHub, navigate to your fork and click “New pull request”

Pull requests allow project maintainers to review your changes before incorporating them. It’s like submitting a draft for review before it gets published.

The “fork and pull request” workflow is used by a great many open-source projects, from small libraries to major ones like TensorFlow and pandas. Because every change is reviewed before it is merged, it is widely regarded as good practice for collaborative development.
