10 Utility Tools for Data Scientists
10.1 Utility Tools for Data Scientists
While programming languages, libraries, and frameworks form the core of your data science toolkit, a collection of utility tools can significantly enhance your productivity and effectiveness. This chapter covers specialized tools that address specific needs in the data science workflow.
10.1.1 Text Editors and IDE Enhancements
Text editors offer lightweight alternatives to full IDEs for quick edits and specialized text processing tasks.
10.1.1.1 Notepad++
Notepad++ is a free, open-source text editor for Windows that’s more powerful than the default Notepad application.
Key features for data scientists:
- Syntax highlighting: Supports many languages including Python, R, SQL, JSON, and more
- Column editing: Edit multiple lines simultaneously (useful for cleaning data)
- Regex search and replace: Powerful pattern matching for text manipulation
- Macro recording: Automate repetitive text edits
- Plugins: Extend functionality with additional tools, such as JSON, HTML, and Markdown viewers, that speed up exploration and development
Installation:
- Download from notepad-plus-plus.org
- Run the installer and follow the prompts
Useful shortcuts:
Ctrl+H: Find and replace
Alt+Shift+Arrow: Column selection mode
Ctrl+D: Duplicate current line
Ctrl+Shift+Up/Down: Move current line up/down
Notepad++ is particularly useful for quickly viewing and editing large text files, CSV data, or configuration files without launching a full IDE. Its ability to handle multi-gigabyte files makes it valuable for inspecting large datasets.
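When a GUI editor isn't at hand, the same kind of quick inspection of a large file can be done in Python by streaming it instead of loading it all. A minimal sketch, using an in-memory string as a stand-in for a real multi-gigabyte CSV:

```python
import csv
import io

# Hypothetical sample standing in for a large CSV file on disk.
sample = "id,name,price\n1,apple,0.5\n2,banana,0.25\n3,cherry,2.0\n"

def preview(f, n=2):
    """Read only the header and the first n data rows, never the whole file."""
    reader = csv.reader(f)
    header = next(reader)
    rows = [row for _, row in zip(range(n), reader)]
    return header, rows

header, rows = preview(io.StringIO(sample), n=2)
print(header)  # ['id', 'name', 'price']
print(rows)    # first two data rows only
```

For a real file, replace the `io.StringIO` wrapper with `open("data.csv")`; only the requested rows are ever read into memory.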
Notepad++ is not available on macOS (as of writing), but the following alternative is.
10.1.1.2 Sublime Text
Sublime Text is a sophisticated cross-platform text editor with powerful features for code editing.
Key features for data scientists:
- Multiple selections: Edit many places at once
- Command palette: Quickly access commands without menus
- Distraction-free mode: Focus on your text without UI elements
- Splits and grids: View multiple files or parts of files simultaneously
- Customizable key bindings: Create shortcuts tailored to your workflow
Installation:
- Download from sublimetext.com
- Install and activate (free evaluation with occasional purchase reminder)
Sublime Text’s speed and versatility make it excellent for manipulating text data, writing scripts, or making quick edits to code without launching a heavier IDE.
10.1.2 API Development and Testing Tools
APIs (Application Programming Interfaces) are crucial for accessing web services and databases. These tools help you test, debug, and document APIs.
10.1.2.1 Postman
Postman is the industry standard for API development and testing.
Key features for data scientists:
- Request building: Create and save HTTP requests
- Collections: Organize and share API requests
- Environment variables: Manage different settings (dev/prod)
- Automated testing: Create test scripts to validate responses
- Mock servers: Simulate API responses without a backend
Installation:
- Download from postman.com
- Create a free account to sync across devices
Example workflow:
- Create a new request to a data API:
GET https://api.example.com/data?limit=100
- Add authentication (if required):
Authorization: Bearer your_token_here
- Send the request and analyze the JSON response
- Save the request to a collection for future use
Postman is invaluable when working with data APIs, whether you’re fetching data from public sources like financial markets, weather services, or social media platforms, or interacting with internal company APIs.
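For comparison, the same request can be sketched in Python with the standard library's urllib. The URL and token are the placeholders from the walkthrough above, and the request is only built, not sent:

```python
from urllib.request import Request

# Placeholders from the Postman walkthrough, not a real endpoint or token.
url = "https://api.example.com/data?limit=100"
req = Request(url, headers={"Authorization": "Bearer your_token_here"})

print(req.get_method())                  # GET (default when no body is set)
print(req.get_full_url())                # the full target URL with query string
print(req.get_header("Authorization"))   # the bearer token header
```

Calling `urllib.request.urlopen(req)` would send it; Postman's value is doing this interactively, with history, collections, and environments on top.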
10.1.2.2 Insomnia
Insomnia is a lightweight alternative to Postman with an intuitive interface.
Key features:
- Clean, focused UI: Less complex than Postman
- GraphQL support: Built-in tools for GraphQL queries
- Request chaining: Use data from one request in another
- Environment management: Switch between configurations easily
- Open source: Free core version available
Installation:
- Download from insomnia.rest
- Run the installer
For data scientists who occasionally work with APIs but don’t need Postman’s full feature set, Insomnia offers a streamlined alternative.
10.1.3 Database Management Tools
These tools provide graphical interfaces for working with databases, making it easier to explore and manipulate data.
10.1.3.1 DBeaver
DBeaver is a universal database tool that works with almost any database system.
Key features for data scientists:
- Multi-database support: Works with PostgreSQL, MySQL, SQLite, Oracle, and more
- Visual query builder: Create SQL queries without writing code
- Data export/import: Move data between different formats and databases
- ER diagrams: Visualize database structure
- SQL editor: Write and execute queries with syntax highlighting
Installation:
- Download from dbeaver.io
- Run the installer
Example workflow:
Connect to a database with connection parameters
Browse tables and view structure
Use the SQL editor to write a query:
SELECT product_category, COUNT(*) as count, AVG(price) as avg_price FROM products GROUP BY product_category ORDER BY count DESC;Export results to CSV for analysis in Python or R
DBeaver streamlines database interactions, allowing you to explore data structures, write queries, and export results without writing code to establish database connections.
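The query from the workflow above can be tried against a throwaway SQLite table using Python's built-in sqlite3 module; the products data here is invented purely for illustration:

```python
import sqlite3

# Toy in-memory table standing in for the products example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("books", 10.0), ("books", 20.0), ("toys", 5.0)],
)

# Same aggregate query as in the DBeaver walkthrough.
rows = conn.execute(
    """SELECT product_category, COUNT(*) AS count, AVG(price) AS avg_price
       FROM products
       GROUP BY product_category
       ORDER BY count DESC"""
).fetchall()
print(rows)  # [('books', 2, 15.0), ('toys', 1, 5.0)]
conn.close()
```

DBeaver gives you the same result interactively, with the connection handling, result grid, and CSV export taken care of.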
10.1.3.2 pgAdmin
pgAdmin is a specialized tool for PostgreSQL databases.
Key features:
- PostgreSQL-specific features: Optimized for PostgreSQL
- Server monitoring: View database performance
- Backup and restore: Manage database backups
- User management: Control access to databases
- Procedural language debugging: Test stored procedures
Installation:
- Download from pgadmin.org
- Run the installer
For data scientists working specifically with PostgreSQL databases, pgAdmin provides specialized features that generic tools may lack.
10.1.4 File Comparison and Merging Tools
These tools help identify differences between files and directories, which is useful for comparing datasets or code versions.
10.1.4.1 Beyond Compare
Beyond Compare is a powerful file and directory comparison tool.
Key features for data scientists:
- Text comparison: View differences between text files line by line
- Table comparison: Compare CSV and Excel files with data-aware features
- Directory sync: Compare and synchronize folders
- 3-way merge: Resolve conflicts between different versions
- Byte-level comparison: Analyze binary files
Installation:
- Download from scootersoftware.com
- Install (trial version available)
Example data science use case: Comparing two versions of a dataset to identify changes:
- Open two CSV files in Table Compare mode
- Automatically align columns by name
- Identify added, removed, or modified rows
- Export the differences to a new file
Beyond Compare is particularly valuable when dealing with evolving datasets, where you need to understand what changed between versions.
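The keyed row comparison described above can be approximated in Python when a GUI tool isn't available. A minimal sketch, using invented two-column CSVs with `id` as the key column:

```python
import csv
import io

# Hypothetical "old" and "new" versions of a dataset.
old_csv = "id,value\n1,a\n2,b\n3,c\n"
new_csv = "id,value\n1,a\n2,B\n4,d\n"

def load(text):
    """Index rows by their key column so versions can be compared."""
    return {row["id"]: row for row in csv.DictReader(io.StringIO(text))}

old, new = load(old_csv), load(new_csv)
added    = sorted(new.keys() - old.keys())
removed  = sorted(old.keys() - new.keys())
modified = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
print(added, removed, modified)  # ['4'] ['3'] ['2']
```

Beyond Compare's Table Compare mode does the same alignment visually and also handles column reordering and type-aware comparison.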
10.1.4.2 WinMerge
WinMerge is a free, open-source alternative for file and folder comparison.
Key features:
- Visual text comparison: Side-by-side differences with highlighting
- Folder comparison: Compare directory structures
- Image comparison: Visual diff for images
- Plugins: Extend functionality for additional file types
- Integration: Works with source control systems
Installation:
- Download from winmerge.org
- Run the installer
WinMerge is an excellent free option for basic comparison needs, though it lacks some of the advanced features of commercial alternatives.
10.1.5 Terminal Enhancements
Improving your command line experience can significantly boost productivity when working with data and code.
10.1.5.1 Oh My Zsh
Oh My Zsh is a framework for managing your Zsh configuration, providing themes and plugins for the Z shell.
Key features for data scientists:
- Tab completion: Intelligent completion for commands and paths
- Git integration: Visual indicators of repository status
- Syntax highlighting: Color-coded command syntax
- Command history: Improved search through previous commands
- Customizable themes: Visual enhancements for the terminal
Installation (macOS or Linux):
# Install Zsh first if needed
# Ubuntu/Debian:
# sudo apt install zsh
# macOS (usually pre-installed)
# Set Zsh as default shell
chsh -s $(which zsh)
# Install Oh My Zsh
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
Useful plugins for data scientists:
# Edit ~/.zshrc to activate plugins
plugins=(git python pip conda docker jupyter)
Oh My Zsh makes the command line more user-friendly and efficient, which is valuable when working with data processing tools, running scripts, or managing environments.
10.1.6 Data Wrangling Tools
These specialized tools help with specific data manipulation tasks that complement programming languages.
10.1.6.1 CSVKit
CSVKit is a suite of command-line tools for working with CSV files.
Key features for data scientists:
- csvstat: Generate descriptive statistics on CSV files
- csvcut: Extract specific columns
- csvgrep: Filter rows based on patterns
- csvsort: Sort CSV files
- csvjoin: SQL-like join operations between CSV files
Installation:
pip install csvkit
Example commands:
# View basic statistics of a CSV file
csvstat data.csv
# Extract specific columns
csvcut -c 1,3,5 data.csv > extracted.csv
# Filter rows containing a pattern
csvgrep -c 2 -m "Pattern" data.csv > filtered.csv
# Sort by a column
csvsort -c 3 data.csv > sorted.csv
CSVKit is extremely useful for quick data exploration and manipulation directly from the command line, without the need to write Python or R code for simple operations.
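To see what csvcut and csvgrep are doing under the hood, here is a rough pure-Python sketch of the two operations; the column names and data are made up for the example:

```python
import csv
import io

# Invented sample standing in for data.csv.
data = "id,name,region,price\n1,apple,EU,0.5\n2,banana,US,0.25\n"
rows = list(csv.reader(io.StringIO(data)))

# Rough equivalent of `csvcut -c 1,4 data.csv`: keep columns 1 and 4.
cut = [[row[0], row[3]] for row in rows]

# Rough equivalent of `csvgrep -c 3 -m "EU" data.csv`: filter on column 3.
header, body = rows[0], rows[1:]
filtered = [header] + [r for r in body if r[2] == "EU"]
print(cut)       # [['id', 'price'], ['1', '0.5'], ['2', '0.25']]
print(filtered)  # header plus the EU row only
```

The point of CSVKit is that these one-liners compose with shell pipes, so you never have to open an editor for quick jobs.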
10.1.6.2 jq
jq is a lightweight command-line JSON processor that helps manipulate JSON data.
Key features:
- Filtering: Extract specific data from complex JSON
- Transformation: Reshape JSON structures
- Combination: Merge multiple JSON sources
- Computation: Perform calculations on numeric values
- Formatting: Pretty-print and compact JSON
Installation:
# macOS
brew install jq
# Ubuntu/Debian
sudo apt install jq
# Windows (with Chocolatey)
choco install jq
Example commands:
# Pretty-print JSON
cat data.json | jq '.'
# Extract specific fields
cat data.json | jq '.results[] | {name, value}'
# Filter based on a condition
cat data.json | jq '.results[] | select(.value > 100)'
# Calculate statistics
cat data.json | jq '[.results[].value] | {count: length, sum: add, average: add/length}'
jq is invaluable when working with APIs that return JSON data or when preparing JSON data for visualization or further analysis.
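The statistics filter above maps directly onto Python's json module, which is useful for checking what a jq expression should produce. A minimal sketch with invented sample data:

```python
import json

# Invented sample matching the .results[] shape used in the jq examples.
doc = json.loads(
    '{"results": [{"name": "a", "value": 120}, {"name": "b", "value": 80}]}'
)

# Equivalent of:
# jq '[.results[].value] | {count: length, sum: add, average: add/length}'
values = [r["value"] for r in doc["results"]]
stats = {
    "count": len(values),
    "sum": sum(values),
    "average": sum(values) / len(values),
}
print(stats)  # {'count': 2, 'sum': 200, 'average': 100.0}
```

jq's advantage is doing this inline in a shell pipeline, without a script file.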
10.1.7 Diagramming and Visualization Tools
While code-based visualization is powerful, sometimes you need standalone tools for creating diagrams and flowcharts.
10.1.7.1 diagrams.net (formerly draw.io)
diagrams.net is a free online diagramming tool that works with various diagram types.
Key features for data scientists:
- Flowcharts: Document data pipelines and workflows
- ER diagrams: Model database relationships
- Network diagrams: Visualize system architecture
- Multiple export formats: PNG, SVG, PDF, etc.
- Integration: Works with Google Drive, Dropbox, etc.
Access:
- Go to diagrams.net in your browser
- Choose where to save your diagrams (local, Google Drive, etc.)
Example data science use case: Creating a data flow diagram to document an ETL process:
- Select the flowchart template
- Add data sources, transformation steps, and outputs
- Connect components with arrows showing data flow
- Add annotations explaining transformations
- Export as PNG for inclusion in documentation
Clear diagrams are essential for communicating complex data processing workflows to stakeholders or documenting them for future reference.
10.1.7.2 Graphviz
Graphviz is a command-line tool for creating structured diagrams from text descriptions.
Key features:
- Programmatic diagrams: Generate diagrams from code
- Automatic layout: Optimal arrangement of elements
- Various diagram types: Directed graphs, hierarchies, networks
- Integration: Works with Python, R, and other languages
- Scriptable: Automate diagram generation
Installation:
# macOS
brew install graphviz
# Ubuntu/Debian
sudo apt install graphviz
# Windows (with Chocolatey)
choco install graphviz
Example DOT file (graph.dot):
digraph DataPipeline {
rankdir=LR;
raw_data [label="Raw Data"];
cleaning [label="Data Cleaning"];
features [label="Feature Engineering"];
modeling [label="Model Training"];
evaluation [label="Evaluation"];
deployment [label="Deployment"];
raw_data -> cleaning;
cleaning -> features;
features -> modeling;
modeling -> evaluation;
evaluation -> deployment;
evaluation -> features [label="Iterate", style="dashed"];
}
Generate the diagram:
dot -Tpng graph.dot -o pipeline.png
Graphviz is particularly useful for generating diagrams programmatically as part of automated documentation processes or for visualizing complex relationships that would be tedious to draw manually.
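Because DOT is plain text, the pipeline graph above can also be emitted programmatically, as one might in an automated documentation script. A minimal Python sketch that builds the DOT source as a string (no Graphviz bindings required, only the `dot` tool to render it afterwards):

```python
# Edges of the example pipeline; labels omitted for brevity.
edges = [
    ("raw_data", "cleaning"),
    ("cleaning", "features"),
    ("features", "modeling"),
    ("modeling", "evaluation"),
    ("evaluation", "deployment"),
]

# Assemble DOT source line by line.
lines = ["digraph DataPipeline {", "  rankdir=LR;"]
lines += [f"  {a} -> {b};" for a, b in edges]
lines.append("}")
dot_source = "\n".join(lines)
print(dot_source)
```

Writing `dot_source` to `graph.dot` and running `dot -Tpng graph.dot -o pipeline.png` produces the same image as the hand-written file.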
10.1.8 Screenshot and Recording Tools
These tools help create visual documentation and tutorials.
10.1.8.1 Greenshot
Greenshot is a lightweight screenshot tool with annotation features.
Key features for data scientists:
- Region selection: Capture specific areas of the screen
- Window capture: Automatically capture a window
- Annotation: Add text, highlights, and arrows
- Auto-save: Configure automatic saving patterns
- Integration: Send to image editor, clipboard, or file
Installation:
- Download from getgreenshot.org
- Run the installer
Default shortcuts:
Print Screen: Capture region
Alt+Print Screen: Capture active window
Ctrl+Print Screen: Capture full screen
Greenshot is useful for capturing visualizations, error messages, or UI elements for documentation or troubleshooting.
10.1.8.2 OBS Studio
OBS (Open Broadcaster Software) Studio is a powerful tool for screen recording and streaming.
Key features:
- High-quality recording: Capture screen activity with audio
- Multiple sources: Record specific windows or regions
- Scene composition: Create layouts combining different sources
- Flexible output: Record to file or stream online
- Cross-platform: Available for Windows, macOS, and Linux
Installation:
- Download from obsproject.com
- Run the installer
OBS is excellent for creating tutorial videos, recording presentations, or documenting complex data analysis processes for training purposes.
10.1.9 Productivity and Note-Taking Tools
These tools help organize your thinking, document your work, and manage your projects.
10.1.9.1 Obsidian
Obsidian is a knowledge base and note-taking application that works on Markdown files.
Key features for data scientists:
- Markdown format: Write notes with the same syntax used in Jupyter notebooks
- Bidirectional linking: Connect related notes
- Graph view: Visualize relationships between notes
- Local storage: Files stored on your computer, not in the cloud
- Extensible: Plugins for additional functionality
Installation:
- Download from obsidian.md
- Run the installer
Example data science use case: Creating a personal knowledge base for your data science projects:
- Create notes for each project with objectives and findings
- Link to related techniques and concepts
- Embed code snippets and results
- Use tags to categorize by domain or technology
- Visualize the connections in your knowledge with the graph view
Obsidian helps capture the thought process behind your data science work, creating a valuable reference for future projects.
10.1.9.2 Notion
Notion is an all-in-one workspace that combines notes, tasks, databases, and more.
Key features:
- Rich content: Mix text, code, embeds, and databases
- Templates: Pre-built layouts for different use cases
- Collaboration: Share and work together with others
- Web-based: Access from any device
- Integration: Connect with other tools and services
Installation:
- Sign up at notion.so
- Download desktop and mobile apps if desired
Notion is particularly useful for team-based data science projects, where you need to coordinate tasks, share documentation, and track progress in one place.
10.1.10 File Management Tools
Managing, finding, and organizing files is an essential but often overlooked part of data science work.
10.1.10.1 Total Commander
Total Commander is a comprehensive file manager with advanced features.
Key features for data scientists:
- Dual-pane interface: Compare and move files efficiently
- Built-in viewers: View text, images, and other files without opening separate programs
- Advanced search: Find files by content, name, size, or date
- Batch rename: Rename multiple files with patterns
- FTP/SFTP client: Transfer files to and from servers
Installation:
- Download from ghisler.com
- Run the installer (shareware with unlimited trial)
Total Commander streamlines file operations that are common in data science work, such as organizing datasets, managing project files, or transferring data to and from remote servers.
10.1.10.2 Agent Ransack
Agent Ransack is a powerful file search tool that can find text within files.
Key features:
- Content search: Find files containing specific text
- Regular expressions: Use patterns for advanced searching
- Search filters: Limit by file type, size, or date
- Result preview: See matching text without opening files
- Boolean operators: Combine multiple search terms
Installation:
- Download from mythicsoft.com
- Run the installer
Agent Ransack is invaluable when you need to find specific data or code across multiple projects or locate where certain variables or functions are used in a large codebase.
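A stripped-down version of this kind of content search can be scripted in Python with pathlib and re. The directory and files below are temporary stand-ins created just so the example is self-contained:

```python
import re
import tempfile
from pathlib import Path

# Throwaway project directory with two hypothetical source files.
root = Path(tempfile.mkdtemp())
(root / "a.py").write_text("def load_data():\n    pass\n")
(root / "b.py").write_text("print('hello')\n")

# Find every .py file whose content matches the pattern.
pattern = re.compile(r"load_data")
matches = sorted(
    p.name for p in root.rglob("*.py") if pattern.search(p.read_text())
)
print(matches)  # ['a.py']
```

Agent Ransack adds the interactive layer on top of this idea: result previews, date and size filters, and Boolean combinations of terms.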
10.1.11 Conclusion: Building Your Utility Toolkit
While the core programming languages and frameworks form the foundation of data science work, these utility tools provide specialized capabilities that can significantly enhance your productivity. As you progress in your data science journey, you’ll likely discover which tools best complement your workflow.
Start by incorporating a few tools that address your immediate needs—perhaps a better text editor, an API testing tool, or a database management interface. Over time, expand your toolkit as you encounter new challenges. Remember that the goal is not to use every tool available, but to find the combination that helps you work most effectively.
Many of these tools have free versions or trials, so you can experiment without financial commitment. You’ll soon discover your favourites and find which tools save you time or reduce friction in your workflow.
By thoughtfully building your utility toolkit alongside your core data science skills, you’ll be better equipped to handle the varied challenges of real-world data science projects.