10 Utility Tools for Data Scientists
10.1 Utility Tools for Data Scientists
While programming languages, libraries, and frameworks form the core of your data science toolkit, a collection of utility tools can significantly enhance your productivity and effectiveness. This chapter covers specialized tools that address specific needs in the data science workflow.
10.1.1 Text Editors and IDE Enhancements
Text editors offer lightweight alternatives to full IDEs for quick edits and specialized text processing tasks.
10.1.1.1 Notepad++
Notepad++ is a free, open-source text editor for Windows that’s more powerful than the default Notepad application.
Key features for data scientists:
- Syntax highlighting: Supports many languages including Python, R, SQL, JSON, and more
- Column editing: Edit multiple lines simultaneously (useful for cleaning data)
- Regex search and replace: Powerful pattern matching for text manipulation
- Macro recording: Automate repetitive text edits
- Plugins: Extend functionality with additional tools, such as JSON, HTML, and Markdown viewers, that speed up exploration and development
Installation:
- Download from notepad-plus-plus.org
- Run the installer and follow the prompts
Useful shortcuts:
Ctrl+H: Find and replace
Alt+Shift+Arrow: Column selection mode
Ctrl+D: Duplicate current line
Ctrl+Shift+Up/Down: Move current line up/down
Notepad++ is particularly useful for quickly viewing and editing large text files, CSV data, or configuration files without launching a full IDE. Its ability to handle multi-gigabyte files makes it valuable for inspecting large datasets.
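When a GUI editor isn't at hand, the same kind of quick inspection of a large file can be done in Python by streaming it instead of loading it all. A minimal sketch, using an in-memory string as a stand-in for a real multi-gigabyte CSV:

```python
import csv
import io

# Hypothetical sample standing in for a large CSV file on disk.
sample = "id,name,price\n1,apple,0.5\n2,banana,0.25\n3,cherry,2.0\n"

def preview(f, n=2):
    """Read only the header and the first n data rows, never the whole file."""
    reader = csv.reader(f)
    header = next(reader)
    rows = [row for _, row in zip(range(n), reader)]
    return header, rows

header, rows = preview(io.StringIO(sample), n=2)
print(header)  # ['id', 'name', 'price']
print(rows)    # first two data rows only
```

For a real file, replace the `io.StringIO` wrapper with `open("data.csv")`; only the requested rows are ever read into memory.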
Notepad++ is not available on macOS (as of writing), but the following alternative is.
10.1.1.2 Sublime Text
Sublime Text is a sophisticated cross-platform text editor with powerful features for code editing.
Key features for data scientists:
- Multiple selections: Edit many places at once
- Command palette: Quickly access commands without menus
- Distraction-free mode: Focus on your text without UI elements
- Splits and grids: View multiple files or parts of files simultaneously
- Customizable key bindings: Create shortcuts tailored to your workflow
Installation:
- Download from sublimetext.com
- Install and activate (free evaluation with occasional purchase reminder)
Sublime Text’s speed and versatility make it excellent for manipulating text data, writing scripts, or making quick edits to code without launching a heavier IDE.
10.1.2 API Development and Testing Tools
APIs (Application Programming Interfaces) are crucial for accessing web services and databases. These tools help you test, debug, and document APIs.
10.1.2.1 Postman
Postman is the industry standard for API development and testing.
Key features for data scientists:
- Request building: Create and save HTTP requests
- Collections: Organize and share API requests
- Environment variables: Manage different settings (dev/prod)
- Automated testing: Create test scripts to validate responses
- Mock servers: Simulate API responses without a backend
Installation:
- Download from postman.com
- Create a free account to sync across devices
Example workflow:
- Create a new request to a data API:
GET https://api.example.com/data?limit=100
- Add authentication (if required):
Authorization: Bearer your_token_here
- Send the request and analyze the JSON response
- Save the request to a collection for future use
Postman is invaluable when working with data APIs, whether you’re fetching data from public sources like financial markets, weather services, or social media platforms, or interacting with internal company APIs.
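For comparison, the same request can be sketched in Python with the standard library's urllib. The URL and token are the placeholders from the walkthrough above, and the request is only built, not sent:

```python
from urllib.request import Request

# Placeholders from the Postman walkthrough, not a real endpoint or token.
url = "https://api.example.com/data?limit=100"
req = Request(url, headers={"Authorization": "Bearer your_token_here"})

print(req.get_method())                  # GET (default when no body is set)
print(req.get_full_url())                # the full target URL with query string
print(req.get_header("Authorization"))   # the bearer token header
```

Calling `urllib.request.urlopen(req)` would send it; Postman's value is doing this interactively, with history, collections, and environments on top.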
10.1.2.2 Insomnia
Insomnia is a lightweight alternative to Postman with an intuitive interface.
Key features:
- Clean, focused UI: Less complex than Postman
- GraphQL support: Built-in tools for GraphQL queries
- Request chaining: Use data from one request in another
- Environment management: Switch between configurations easily
- Open source: Free core version available
Installation:
- Download from insomnia.rest
- Run the installer
For data scientists who occasionally work with APIs but don’t need Postman’s full feature set, Insomnia offers a streamlined alternative.
10.1.3 Database Management Tools
These tools provide graphical interfaces for working with databases, making it easier to explore and manipulate data.
10.1.3.1 DBeaver
DBeaver is a universal database tool that works with almost any database system.
Key features for data scientists:
- Multi-database support: Works with PostgreSQL, MySQL, SQLite, Oracle, and more
- Visual query builder: Create SQL queries without writing code
- Data export/import: Move data between different formats and databases
- ER diagrams: Visualize database structure
- SQL editor: Write and execute queries with syntax highlighting
Installation:
- Download from dbeaver.io
- Run the installer
Example workflow:
Connect to a database with connection parameters
Browse tables and view structure
Use the SQL editor to write a query:
SELECT product_category, COUNT(*) as count, AVG(price) as avg_price FROM products GROUP BY product_category ORDER BY count DESC;Export results to CSV for analysis in Python or R
DBeaver streamlines database interactions, allowing you to explore data structures, write queries, and export results without writing code to establish database connections.
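The query from the workflow above can be tried against a throwaway SQLite table using Python's built-in sqlite3 module; the products data here is invented purely for illustration:

```python
import sqlite3

# Toy in-memory table standing in for the products example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("books", 10.0), ("books", 20.0), ("toys", 5.0)],
)

# Same aggregate query as in the DBeaver walkthrough.
rows = conn.execute(
    """SELECT product_category, COUNT(*) AS count, AVG(price) AS avg_price
       FROM products
       GROUP BY product_category
       ORDER BY count DESC"""
).fetchall()
print(rows)  # [('books', 2, 15.0), ('toys', 1, 5.0)]
conn.close()
```

DBeaver gives you the same result interactively, with the connection handling, result grid, and CSV export taken care of.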
10.1.3.2 pgAdmin
pgAdmin is a specialized tool for PostgreSQL databases.
Key features:
- PostgreSQL-specific features: Optimized for PostgreSQL
- Server monitoring: View database performance
- Backup and restore: Manage database backups
- User management: Control access to databases
- Procedural language debugging: Test stored procedures
Installation:
- Download from pgadmin.org
- Run the installer
For data scientists working specifically with PostgreSQL databases, pgAdmin provides specialized features that generic tools may lack.
10.1.4 File Comparison and Merging Tools
These tools help identify differences between files and directories, which is useful for comparing datasets or code versions.
10.1.4.1 Beyond Compare
Beyond Compare is a powerful file and directory comparison tool.
Key features for data scientists:
- Text comparison: View differences between text files line by line
- Table comparison: Compare CSV and Excel files with data-aware features
- Directory sync: Compare and synchronize folders
- 3-way merge: Resolve conflicts between different versions
- Byte-level comparison: Analyze binary files
Installation:
- Download from scootersoftware.com
- Install (trial version available)
Example data science use case: Comparing two versions of a dataset to identify changes:
- Open two CSV files in Table Compare mode
- Automatically align columns by name
- Identify added, removed, or modified rows
- Export the differences to a new file
Beyond Compare is particularly valuable when dealing with evolving datasets, where you need to understand what changed between versions.
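The keyed row comparison described above can be approximated in Python when a GUI tool isn't available. A minimal sketch, using invented two-column CSVs with `id` as the key column:

```python
import csv
import io

# Hypothetical "old" and "new" versions of a dataset.
old_csv = "id,value\n1,a\n2,b\n3,c\n"
new_csv = "id,value\n1,a\n2,B\n4,d\n"

def load(text):
    """Index rows by their key column so versions can be compared."""
    return {row["id"]: row for row in csv.DictReader(io.StringIO(text))}

old, new = load(old_csv), load(new_csv)
added    = sorted(new.keys() - old.keys())
removed  = sorted(old.keys() - new.keys())
modified = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
print(added, removed, modified)  # ['4'] ['3'] ['2']
```

Beyond Compare's Table Compare mode does the same alignment visually and also handles column reordering and type-aware comparison.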
10.1.4.2 WinMerge
WinMerge is a free, open-source alternative for file and folder comparison.
Key features:
- Visual text comparison: Side-by-side differences with highlighting
- Folder comparison: Compare directory structures
- Image comparison: Visual diff for images
- Plugins: Extend functionality for additional file types
- Integration: Works with source control systems
Installation:
- Download from winmerge.org
- Run the installer
WinMerge is an excellent free option for basic comparison needs, though it lacks some of the advanced features of commercial alternatives.
10.1.5 Terminal Enhancements
Improving your command line experience can significantly boost productivity when working with data and code.
10.1.5.1 Oh My Zsh
Oh My Zsh is a framework for managing your Zsh configuration, providing themes and plugins for the Z shell.
Key features for data scientists:
- Tab completion: Intelligent completion for commands and paths
- Git integration: Visual indicators of repository status
- Syntax highlighting: Color-coded command syntax
- Command history: Improved search through previous commands
- Customizable themes: Visual enhancements for the terminal
Installation (macOS or Linux):
# Install Zsh first if needed
# Ubuntu/Debian:
# sudo apt install zsh
# macOS (usually pre-installed)
# Set Zsh as default shell
chsh -s $(which zsh)
# Install Oh My Zsh
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
Useful plugins for data scientists:
# Edit ~/.zshrc to activate plugins
plugins=(git python pip conda docker jupyter)
Oh My Zsh makes the command line more user-friendly and efficient, which is valuable when working with data processing tools, running scripts, or managing environments.
10.1.6 Data Wrangling Tools
These specialized tools help with specific data manipulation tasks that complement programming languages.
10.1.6.1 CSVKit
CSVKit is a suite of command-line tools for working with CSV files.
Key features for data scientists:
- csvstat: Generate descriptive statistics on CSV files
- csvcut: Extract specific columns
- csvgrep: Filter rows based on patterns
- csvsort: Sort CSV files
- csvjoin: SQL-like join operations between CSV files
Installation:
pip install csvkit
Example commands:
# View basic statistics of a CSV file
csvstat data.csv
# Extract specific columns
csvcut -c 1,3,5 data.csv > extracted.csv
# Filter rows containing a pattern
csvgrep -c 2 -m "Pattern" data.csv > filtered.csv
# Sort by a column
csvsort -c 3 data.csv > sorted.csv
CSVKit is extremely useful for quick data exploration and manipulation directly from the command line, without the need to write Python or R code for simple operations.
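To see what csvcut and csvgrep are doing under the hood, here is a rough pure-Python sketch of the two operations; the column names and data are made up for the example:

```python
import csv
import io

# Invented sample standing in for data.csv.
data = "id,name,region,price\n1,apple,EU,0.5\n2,banana,US,0.25\n"
rows = list(csv.reader(io.StringIO(data)))

# Rough equivalent of `csvcut -c 1,4 data.csv`: keep columns 1 and 4.
cut = [[row[0], row[3]] for row in rows]

# Rough equivalent of `csvgrep -c 3 -m "EU" data.csv`: filter on column 3.
header, body = rows[0], rows[1:]
filtered = [header] + [r for r in body if r[2] == "EU"]
print(cut)       # [['id', 'price'], ['1', '0.5'], ['2', '0.25']]
print(filtered)  # header plus the EU row only
```

The point of CSVKit is that these one-liners compose with shell pipes, so you never have to open an editor for quick jobs.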
10.1.6.2 jq
jq is a lightweight command-line JSON processor that helps manipulate JSON data.
Key features:
- Filtering: Extract specific data from complex JSON
- Transformation: Reshape JSON structures
- Combination: Merge multiple JSON sources
- Computation: Perform calculations on numeric values
- Formatting: Pretty-print and compact JSON
Installation:
# macOS
brew install jq
# Ubuntu/Debian
sudo apt install jq
# Windows (with Chocolatey)
choco install jq
Example commands:
# Pretty-print JSON
cat data.json | jq '.'
# Extract specific fields
cat data.json | jq '.results[] | {name, value}'
# Filter based on a condition
cat data.json | jq '.results[] | select(.value > 100)'
# Calculate statistics
cat data.json | jq '[.results[].value] | {count: length, sum: add, average: add/length}'
jq is invaluable when working with APIs that return JSON data or when preparing JSON data for visualization or further analysis.
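The statistics filter above maps directly onto Python's json module, which is useful for checking what a jq expression should produce. A minimal sketch with invented sample data:

```python
import json

# Invented sample matching the .results[] shape used in the jq examples.
doc = json.loads(
    '{"results": [{"name": "a", "value": 120}, {"name": "b", "value": 80}]}'
)

# Equivalent of:
# jq '[.results[].value] | {count: length, sum: add, average: add/length}'
values = [r["value"] for r in doc["results"]]
stats = {
    "count": len(values),
    "sum": sum(values),
    "average": sum(values) / len(values),
}
print(stats)  # {'count': 2, 'sum': 200, 'average': 100.0}
```

jq's advantage is doing this inline in a shell pipeline, without a script file.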
10.1.7 Diagramming and Visualization Tools
While code-based visualization is powerful, sometimes you need standalone tools for creating diagrams and flowcharts.
10.1.7.1 diagrams.net (formerly draw.io)
diagrams.net is a free online diagramming tool that works with various diagram types.
Key features for data scientists:
- Flowcharts: Document data pipelines and workflows
- ER diagrams: Model database relationships
- Network diagrams: Visualize system architecture
- Multiple export formats: PNG, SVG, PDF, etc.
- Integration: Works with Google Drive, Dropbox, etc.
Access:
- Go to diagrams.net in your browser
- Choose where to save your diagrams (local, Google Drive, etc.)
Example data science use case: Creating a data flow diagram to document an ETL process:
- Select the flowchart template
- Add data sources, transformation steps, and outputs
- Connect components with arrows showing data flow
- Add annotations explaining transformations
- Export as PNG for inclusion in documentation
Clear diagrams are essential for communicating complex data processing workflows to stakeholders or documenting them for future reference.
10.1.7.2 Graphviz
Graphviz is a command-line tool for creating structured diagrams from text descriptions.
Key features:
- Programmatic diagrams: Generate diagrams from code
- Automatic layout: Optimal arrangement of elements
- Various diagram types: Directed graphs, hierarchies, networks
- Integration: Works with Python, R, and other languages
- Scriptable: Automate diagram generation
Installation:
# macOS
brew install graphviz
# Ubuntu/Debian
sudo apt install graphviz
# Windows (with Chocolatey)
choco install graphviz
Example DOT file (graph.dot):
digraph DataPipeline {
rankdir=LR;
raw_data [label="Raw Data"];
cleaning [label="Data Cleaning"];
features [label="Feature Engineering"];
modeling [label="Model Training"];
evaluation [label="Evaluation"];
deployment [label="Deployment"];
raw_data -> cleaning;
cleaning -> features;
features -> modeling;
modeling -> evaluation;
evaluation -> deployment;
evaluation -> features [label="Iterate", style="dashed"];
}
Generate the diagram:
dot -Tpng graph.dot -o pipeline.png
Graphviz is particularly useful for generating diagrams programmatically as part of automated documentation processes or for visualizing complex relationships that would be tedious to draw manually.
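Because DOT is plain text, the pipeline graph above can also be emitted programmatically, as one might in an automated documentation script. A minimal Python sketch that builds the DOT source as a string (no Graphviz bindings required, only the `dot` tool to render it afterwards):

```python
# Edges of the example pipeline; labels omitted for brevity.
edges = [
    ("raw_data", "cleaning"),
    ("cleaning", "features"),
    ("features", "modeling"),
    ("modeling", "evaluation"),
    ("evaluation", "deployment"),
]

# Assemble DOT source line by line.
lines = ["digraph DataPipeline {", "  rankdir=LR;"]
lines += [f"  {a} -> {b};" for a, b in edges]
lines.append("}")
dot_source = "\n".join(lines)
print(dot_source)
```

Writing `dot_source` to `graph.dot` and running `dot -Tpng graph.dot -o pipeline.png` produces the same image as the hand-written file.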
10.1.8 Screenshot and Recording Tools
These tools help create visual documentation and tutorials.
10.1.8.1 Greenshot
Greenshot is a lightweight screenshot tool with annotation features.
Key features for data scientists:
- Region selection: Capture specific areas of the screen
- Window capture: Automatically capture a window
- Annotation: Add text, highlights, and arrows
- Auto-save: Configure automatic saving patterns
- Integration: Send to image editor, clipboard, or file
Installation:
- Download from getgreenshot.org
- Run the installer
Default shortcuts:
Print Screen: Capture region
Alt+Print Screen: Capture active window
Ctrl+Print Screen: Capture full screen
Greenshot is useful for capturing visualizations, error messages, or UI elements for documentation or troubleshooting.
10.1.8.2 OBS Studio
OBS (Open Broadcaster Software) Studio is a powerful tool for screen recording and streaming.
Key features:
- High-quality recording: Capture screen activity with audio
- Multiple sources: Record specific windows or regions
- Scene composition: Create layouts combining different sources
- Flexible output: Record to file or stream online
- Cross-platform: Available for Windows, macOS, and Linux
Installation:
- Download from obsproject.com
- Run the installer
OBS is excellent for creating tutorial videos, recording presentations, or documenting complex data analysis processes for training purposes.
10.1.9 Productivity and Note-Taking Tools
These tools help organize your thinking, document your work, and manage your projects.
10.1.9.1 Obsidian
Obsidian is a knowledge base and note-taking application that works on Markdown files.
Key features for data scientists:
- Markdown format: Write notes with the same syntax used in Jupyter notebooks
- Bidirectional linking: Connect related notes
- Graph view: Visualize relationships between notes
- Local storage: Files stored on your computer, not in the cloud
- Extensible: Plugins for additional functionality
Installation:
- Download from obsidian.md
- Run the installer
Example data science use case: Creating a personal knowledge base for your data science projects:
- Create notes for each project with objectives and findings
- Link to related techniques and concepts
- Embed code snippets and results
- Use tags to categorize by domain or technology
- Visualize the connections in your knowledge with the graph view
Obsidian helps capture the thought process behind your data science work, creating a valuable reference for future projects.
10.1.9.2 Notion
Notion is an all-in-one workspace that combines notes, tasks, databases, and more.
Key features:
- Rich content: Mix text, code, embeds, and databases
- Templates: Pre-built layouts for different use cases
- Collaboration: Share and work together with others
- Web-based: Access from any device
- Integration: Connect with other tools and services
Installation:
- Sign up at notion.so
- Download desktop and mobile apps if desired
Notion is particularly useful for team-based data science projects, where you need to coordinate tasks, share documentation, and track progress in one place.
10.1.10 File Management Tools
Managing, finding, and organizing files is an essential but often overlooked part of data science work.
10.1.10.1 Total Commander
Total Commander is a comprehensive file manager with advanced features.
Key features for data scientists:
- Dual-pane interface: Compare and move files efficiently
- Built-in viewers: View text, images, and other files without opening separate programs
- Advanced search: Find files by content, name, size, or date
- Batch rename: Rename multiple files with patterns
- FTP/SFTP client: Transfer files to and from servers
Installation:
- Download from ghisler.com
- Run the installer (shareware with unlimited trial)
Total Commander streamlines file operations that are common in data science work, such as organizing datasets, managing project files, or transferring data to and from remote servers.
10.1.10.2 Agent Ransack
Agent Ransack is a powerful file search tool that can find text within files.
Key features:
- Content search: Find files containing specific text
- Regular expressions: Use patterns for advanced searching
- Search filters: Limit by file type, size, or date
- Result preview: See matching text without opening files
- Boolean operators: Combine multiple search terms
Installation:
- Download from mythicsoft.com
- Run the installer
Agent Ransack is invaluable when you need to find specific data or code across multiple projects or locate where certain variables or functions are used in a large codebase.
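A stripped-down version of this kind of content search can be scripted in Python with pathlib and re. The directory and files below are temporary stand-ins created just so the example is self-contained:

```python
import re
import tempfile
from pathlib import Path

# Throwaway project directory with two hypothetical source files.
root = Path(tempfile.mkdtemp())
(root / "a.py").write_text("def load_data():\n    pass\n")
(root / "b.py").write_text("print('hello')\n")

# Find every .py file whose content matches the pattern.
pattern = re.compile(r"load_data")
matches = sorted(
    p.name for p in root.rglob("*.py") if pattern.search(p.read_text())
)
print(matches)  # ['a.py']
```

Agent Ransack adds the interactive layer on top of this idea: result previews, date and size filters, and Boolean combinations of terms.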
10.1.11 Conclusion: Building Your Utility Toolkit
While the core programming languages and frameworks form the foundation of data science work, these utility tools provide specialized capabilities that can significantly enhance your productivity. As you progress in your data science journey, you’ll likely discover which tools best complement your workflow.
Start by incorporating a few tools that address your immediate needs—perhaps a better text editor, an API testing tool, or a database management interface. Over time, expand your toolkit as you encounter new challenges. Remember that the goal is not to use every tool available, but to find the combination that helps you work most effectively.
Many of these tools have free versions or trials, so you can experiment without financial commitment. You’ll soon discover your favourites and find which tools save you time or reduce friction in your workflow.
By thoughtfully building your utility toolkit alongside your core data science skills, you’ll be better equipped to handle the varied challenges of real-world data science projects.