As your projects grow in size and complexity, you may need more computing power than your local machine can provide. You may require secure, centralised storage solutions that scale seamlessly. Cloud platforms offer scalable resources and specialised tools for data science.
5.1.1 Why Use Cloud Platforms?
Cloud platforms offer several advantages for data science:
Scalability: Access to more storage and computing power when needed
Collaboration: Easier sharing of resources and results with team members
Specialised Hardware: Access to GPUs and TPUs for deep learning
Managed Services: Pre-configured tools and infrastructure
Cost Efficiency: Pay only for what you use
The ability to scale compute resources is particularly valuable for data scientists working with large datasets or computationally intensive models. Rather than investing in expensive hardware that might sit idle most of the time, cloud platforms allow you to rent powerful machines when you need them and shut them down when you don’t.
5.1.2 Getting Started with Google Colab
Google Colab provides free access to Python notebooks with GPU and TPU acceleration. It’s an excellent way to get started with cloud-based data science without any financial commitment.
Google Colab is essentially Jupyter notebooks running on Google’s servers, with a few additional features. You can run Python code, create visualisations, and even access GPU and TPU accelerators for free (with usage limits).
The key advantages of Colab include:
No setup required - just open your browser and start coding
Free access to GPUs and TPUs for accelerated machine learning
Easy sharing and collaboration through Google Drive
Pre-installed data science libraries
Integration with GitHub for loading and saving notebooks
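As a quick sanity check of that last point about pre-installed libraries, you can list what a fresh runtime already has using only the standard library. This is a sketch; the package names checked here are just common examples:

```python
import importlib.metadata as md

def installed_versions(packages):
    """Return a mapping of package name -> installed version (None if absent)."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = None
    return versions

# In a fresh Colab runtime these typically come pre-installed
print(installed_versions(["pandas", "numpy", "matplotlib", "scikit-learn"]))
```

Running this in a new notebook is a fast way to confirm you can skip `pip install` for the basics.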
5.1.3 Basic Cloud Storage Options
Cloud storage services provide an easy way to store and share data:
Google Drive: 15GB free storage, integrates well with Colab
Microsoft OneDrive: 5GB free storage, integrates with Office tools
Dropbox: 2GB free storage, good for file sharing
GitHub: Free storage for code and small datasets (files under 100MB)
These services can be used to store datasets, notebooks, and results. They also facilitate collaboration, as you can easily share files with colleagues.
For larger datasets or specialised needs, you’ll want to look at dedicated cloud storage solutions like Amazon S3, Google Cloud Storage, or Azure Blob Storage. These services are designed for scalability and can handle terabytes or even petabytes of data.
5.1.4 Comprehensive Cloud Platforms
For more advanced needs, consider these major cloud platforms:
5.1.4.1 Amazon Web Services (AWS)
AWS offers a comprehensive suite of data science tools:
SageMaker: Managed Jupyter notebooks with integrated ML tools
EC2: Virtual machines for customised environments
S3: Scalable storage for datasets
Redshift: Data warehousing
Lambda: Serverless computing for data processing
AWS offers a free tier that includes limited access to many of these services, allowing you to experiment before committing financially.
5.1.4.2 Google Cloud Platform (GCP)
GCP provides similar capabilities:
Vertex AI: End-to-end machine learning platform
Compute Engine: Virtual machines
BigQuery: Serverless data warehousing
Cloud Storage: Object storage
Dataproc: Managed Spark and Hadoop
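To give a feel for BigQuery's serverless model, here is a hedged sketch using the google-cloud-bigquery client against one of Google's public datasets. The query-building helper is our own illustration, and actually running the query requires an authenticated GCP project:

```python
def name_count_query(table, limit=5):
    # Hypothetical helper: build a simple aggregation over a names table
    return (
        f"SELECT name, SUM(number) AS total "
        f"FROM `{table}` GROUP BY name ORDER BY total DESC LIMIT {limit}"
    )

def top_baby_names():
    # Requires `pip install google-cloud-bigquery` and GCP credentials
    from google.cloud import bigquery
    client = bigquery.Client()  # picks up the default project and credentials
    job = client.query(name_count_query("bigquery-public-data.usa_names.usa_1910_2013"))
    return job.to_dataframe()  # the aggregation runs server-side, not on your machine
```

Note that no cluster is provisioned anywhere in this code: the scan and aggregation happen entirely inside BigQuery, which is what "serverless data warehousing" means in practice.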
5.1.4.3 Microsoft Azure
Azure is particularly well-integrated with Microsoft’s other tools:
Azure Machine Learning: End-to-end ML platform
Azure Databricks: Spark-based analytics
Azure Storage: Various storage options
Azure SQL Database: Managed SQL
Power BI: Business intelligence and visualisation
Each platform has its strengths, and many organisations use multiple clouds for different purposes. AWS has the broadest range of services, GCP excels in machine learning tools, and Azure integrates well with Microsoft’s enterprise ecosystem.
5.1.5 Choosing the Right Cloud Services
When selecting cloud services for data science, consider these factors:
Project requirements: Match services to your specific needs
Budget constraints: Compare pricing models across providers
Technical expertise: Some platforms have steeper learning curves
Integration needs: Consider existing tools in your workflow
Security requirements: Review compliance certifications and features
Data residency and regulation: Where your data physically lives matters legally. Personal data about EU residents falls under GDPR; South African personal data falls under POPIA; similar laws exist in most jurisdictions. Pick a region in the right country or economic area for your data subjects, and check whether your employer or client has contractual restrictions on where data can be stored or processed. All three major clouds (AWS, GCP, Azure) have regions in both South Africa and Europe.
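Region choice is usually a one-line parameter when you create a resource. As a sketch with boto3 (the bucket name is a placeholder, and note the S3 quirk that us-east-1 must not be given an explicit LocationConstraint):

```python
def bucket_location_constraint(region):
    """S3 rejects an explicit LocationConstraint for us-east-1."""
    return None if region == "us-east-1" else region

def create_regional_bucket(name, region="af-south-1"):  # af-south-1 = Cape Town
    import boto3  # imported here so the sketch loads without the AWS SDK
    s3 = boto3.client("s3", region_name=region)
    constraint = bucket_location_constraint(region)
    if constraint:
        s3.create_bucket(
            Bucket=name,
            CreateBucketConfiguration={"LocationConstraint": constraint},
        )
    else:
        s3.create_bucket(Bucket=name)
```

Once the bucket exists in af-south-1, the data stays in that region unless you explicitly copy it elsewhere, which is the property the residency rules care about.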
A strategic approach is to start with a small project on your chosen platform. This allows you to gain familiarity with the environment before committing to larger workloads.
5.1.6 Getting Started with a Cloud Platform
Let’s create a basic starter project on AWS as an example:
Navigate to SageMaker AI → Studio in the AWS console
Create a Studio domain (for personal use, pick “Quick setup”)
Open Studio and launch a JupyterLab space, choosing a small instance type such as ml.t3.medium (covered by the SageMaker free tier for new accounts)
Start the space, open JupyterLab, create a notebook, and start working
This gives you a fully configured Jupyter environment with access to more computational resources than your local machine likely has. SageMaker comes pre-installed with popular data science libraries and integrates with other AWS services like S3 for storage.
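A useful first cell in any new cloud notebook is a quick resource check, so you can confirm you actually got the instance type you asked for. This stdlib-only sketch works the same in SageMaker, Colab, or locally:

```python
import os
import platform
import shutil

def environment_summary():
    """Report basic facts about the machine the notebook is running on."""
    _, _, free = shutil.disk_usage("/")
    return {
        "python": platform.python_version(),
        "cpus": os.cpu_count(),
        "disk_free_gb": round(free / 1e9, 1),
    }

print(environment_summary())
```

On an ml.t3.medium you would expect to see 2 CPUs; if the numbers look wrong, stop the space and pick a different instance type before doing any work.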
Warning: Shut it down when you're done
Studio spaces and notebook instances bill by the minute they’re running, not by how much you use them. A notebook you forgot to stop over the weekend is the single most common surprise bill for beginners. Stop the space from the Studio UI when you finish a session.
5.1.7 Managing Cloud Costs
One of the most important aspects of using cloud platforms is managing costs effectively. A few traps catch almost every beginner:
Idle notebook and VM instances: Cloud notebooks bill by wall-clock time, not usage. An instance left running overnight can cost more than a whole week of active work.
Data egress: Moving data out of a cloud provider (to your laptop, to another region, to another cloud) is almost always charged. Moving it in is usually free. Pulling a 500 GB dataset to your machine “just to look at it” is a classic expensive mistake.
NAT gateways and load balancers: On AWS especially, these run 24/7 and bill by the hour even if nothing is using them. Delete them when the project ends.
Storage classes: Object storage has cheap long-term tiers (S3 Glacier, GCS Coldline, Azure Archive) for data you rarely read. Use them for raw archives.
Free tier expiry: Most “free tier” offers last 12 months from signup, after which the same resources begin charging at normal rates.
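Moving a raw archive to a cheaper tier is a one-argument change at upload time. A sketch with boto3 (the bucket and key are placeholders, and the mapping helper is our own convention, not an AWS API):

```python
def storage_class_for(access_pattern):
    """Hypothetical mapping from how often you read data to an S3 storage class."""
    classes = {
        "hot": "STANDARD",         # read constantly
        "monthly": "STANDARD_IA",  # read occasionally
        "archive": "GLACIER",      # read almost never
    }
    return classes[access_pattern]

def archive_raw_data(path, bucket, key):
    import boto3  # imported here so the sketch loads without the AWS SDK
    s3 = boto3.client("s3")
    s3.upload_file(
        path, bucket, key,
        ExtraArgs={"StorageClass": storage_class_for("archive")},
    )
```

The trade-off is retrieval: Glacier-class objects are cheap to hold but slow (and charged) to read back, so reserve them for data you genuinely will not touch.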
Practical habits that help:
Set up billing alerts: Configure notifications when spending reaches certain thresholds
Use spot / preemptible instances: Steep discounts for interruptible workloads such as batch training jobs
Right-size resources: Choose appropriate instance types for your workloads
Schedule shutdowns: Automatically stop instances when not in use
Clean up resources: Delete unused storage, instances, load balancers, and services when the project ends
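The last two habits can be partly automated. As a hedged sketch, this uses boto3 to find classic SageMaker notebook instances that are still running and stop them (Studio spaces are managed differently, through apps, so this covers only the notebook-instance case):

```python
def names_to_stop(notebook_instances):
    """Pure helper: pick out instances that are currently running."""
    return [
        n["NotebookInstanceName"]
        for n in notebook_instances
        if n["NotebookInstanceStatus"] == "InService"
    ]

def stop_all_running_notebooks():
    import boto3  # imported here so the sketch loads without the AWS SDK
    sm = boto3.client("sagemaker")
    instances = sm.list_notebook_instances()["NotebookInstances"]
    for name in names_to_stop(instances):
        print(f"Stopping {name}")
        sm.stop_notebook_instance(NotebookInstanceName=name)
```

A script like this run from a scheduler at the end of the working day is cheap insurance against the forgotten-over-the-weekend bill described earlier.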
For example, in AWS you can create a budget with alerts:
Navigate to AWS Billing Dashboard
Select “Budgets” from the left navigation
Create a budget with monthly limits
Set up email alerts at 50%, 80%, and 100% of your budget
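The same budget can be created programmatically, which is handy if you set up accounts often. A sketch with boto3's budgets client (the budget name, amount, and email address are placeholders):

```python
def threshold_notifications(email, thresholds=(50, 80, 100)):
    """Build email alerts at the given percentages of the budget."""
    return [
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": float(t),
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }
        for t in thresholds
    ]

def create_monthly_budget(limit_usd, email):
    import boto3  # imported here so the sketch loads without the AWS SDK
    account_id = boto3.client("sts").get_caller_identity()["Account"]
    boto3.client("budgets").create_budget(
        AccountId=account_id,
        Budget={
            "BudgetName": "data-science-monthly",
            "BudgetLimit": {"Amount": str(limit_usd), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=threshold_notifications(email),
    )
```

Whether you click through the console or call the API, the result is the same: an email lands in your inbox well before a runaway instance becomes an expensive surprise.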
Above all, shut down resources when you're not using them. Most platforms also provide cost management tools to help you monitor and control your spending.
5.1.8 Security Best Practices in the Cloud
Data security is critical when working in cloud environments:
Follow the principle of least privilege: Grant only the permissions necessary
Encrypt sensitive data: Use encryption for data at rest and in transit
Implement multi-factor authentication: Add an extra layer of security
Use private networks: Isolate your resources when possible
Regular security audits: Review permissions and access regularly
In practice, the cleanest approach for most data science work is to enable default encryption on the S3 bucket itself (AWS console → Bucket → Properties → Default encryption → SSE-S3 or SSE-KMS). Once that’s set, every object written to the bucket is encrypted at rest automatically, and your notebook code doesn’t need to pass keys at all:
```python
import boto3
from botocore.exceptions import ClientError

def get_data(bucket, key):
    # Credentials come from the instance IAM role, not from code.
    # Encryption is handled by the bucket's default policy.
    s3 = boto3.client('s3')
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        return response['Body'].read()
    except ClientError as e:
        print(f"Error accessing data: {e}")
        return None
```
A few principles to notice:
No access keys in code. When running on an EC2 instance, SageMaker notebook, or Lambda, attach an IAM role with the minimum required permissions; boto3 picks up credentials automatically. For local development, use aws configure to store credentials in ~/.aws/credentials (which is already outside your git repository).
Let the bucket policy enforce encryption, not the client. This avoids the trap of one script forgetting to encrypt.
Avoid SSE-C (customer-provided keys) unless you have a specific compliance reason. With SSE-C, you are responsible for the encryption key, and losing it means losing the data. SSE-KMS is almost always the better choice.
Remember that security is a shared responsibility between you and the cloud provider. The provider secures the infrastructure, but you’re responsible for securing your data and applications.
5.1.9 Hands-On Exercise: Your First Cloud Analysis with Google Colab
Let’s walk through a complete example of using Google Colab for a data science task. This exercise demonstrates the practical workflow of cloud-based analysis.
5.1.9.1 Step 1: Create a Notebook
Go to colab.research.google.com, sign in with a Google account, and click "New notebook"
Rename it by clicking on "Untitled0.ipynb" at the top
5.1.9.2 Step 2: Load and Explore Data
In the first cell, load a dataset directly from a URL:
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load a sample dataset directly from the web
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(url)

# Display basic information
print(f"Dataset shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print("\nFirst few rows:")
df.head()
```
5.1.9.3 Step 3: Perform Analysis
In subsequent cells, perform your analysis:
```python
# Summary statistics
df.describe()
```
```python
# Create a visualisation
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='bill_length_mm', y='bill_depth_mm',
                hue='species', style='island', s=100)
plt.title('Penguin Bill Dimensions by Species and Island')
plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
```
5.1.9.4 Step 4: Enable GPU Acceleration
For machine learning tasks, you can enable GPU acceleration:
Go to Runtime → Change runtime type
Select “T4 GPU” from the Hardware accelerator dropdown
Click Save
Then verify GPU availability:
```python
import torch

if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("No GPU available - using CPU")
```
5.1.9.5 Step 5: Save Your Work
Colab notebooks are automatically saved to your Google Drive. You can also:
Download the notebook: File → Download → Download .ipynb
Save to GitHub: File → Save a copy in GitHub
Share with collaborators: Click the Share button in the top right
This exercise demonstrates the core workflow of cloud-based data science: loading data, performing analysis, creating visualisations, and optionally leveraging specialised hardware—all without installing anything on your local machine.
5.1.10 Connecting Cloud Storage to Your Analysis
When working with larger datasets, you’ll want to connect cloud storage to your notebooks. Here’s how to mount Google Drive in Colab:
```python
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Now you can access files in your Drive
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/data/my_dataset.csv')
```
For AWS S3 from a Colab notebook, use the boto3 library with credentials supplied through Colab’s secret manager rather than typing them into cells:
```python
# In Colab: click the 🔑 key icon in the left sidebar and add two secrets
# named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Then:
import os
from google.colab import userdata
import boto3

os.environ["AWS_ACCESS_KEY_ID"] = userdata.get("AWS_ACCESS_KEY_ID")
os.environ["AWS_SECRET_ACCESS_KEY"] = userdata.get("AWS_SECRET_ACCESS_KEY")

s3 = boto3.client("s3", region_name="us-east-1")
s3.download_file("your-bucket", "data.csv", "data.csv")
```
Never paste access keys directly into notebook cells or commit them to a repository. Anything written in a cell gets saved to the .ipynb file and is trivially recoverable. Use Colab secrets, environment variables, or (better) IAM roles for service-to-service access.
If a key does leak, rotate it in the AWS console immediately. Don’t try to scrub the notebook history.
These patterns allow you to work with data stored in various cloud locations while leveraging the computational resources of your cloud notebook environment.
5.2 Conclusion
Cloud platforms provide powerful resources for data science, allowing you to scale beyond the limitations of your local machine. Whether you’re using free services like Google Colab or comprehensive platforms like AWS, GCP, or Azure, the cloud offers flexibility, scalability, and specialised tools that can significantly enhance your data science capabilities.
As you grow more comfortable with cloud services, you can explore more advanced features like automated machine learning pipelines, distributed computing, and real-time data processing. The cloud is continuously evolving, with new services and features being added regularly to support data science workflows.
In the upcoming chapters, we’ll explore how to deploy your data science projects to make them accessible to others (Deployment chapter) and how to use containerisation with Docker to ensure your environments are reproducible across local and cloud platforms (Containerisation chapter).