Data Engineering from Scratch Part 1: Essentials & Tools

So you’ve heard the term data engineering floating around, but what does it really mean? And how do you go from “I know some Python and SQL” to confidently building your own data pipeline? This beginner-friendly series will walk you through that journey – step by step. No cloud, no enterprise tools to start – just your computer, Python, and a curiosity to learn. The data we will use comes from TV Maze: https://www.tvmaze.com/api

I am writing this series to help people who have either recently become interested in data engineering, or who have tried getting into it and felt completely overwhelmed by the sheer number of tools out there. Data engineering can be a complex field to get started in. In addition to having to know languages such as Python and SQL, there are many different tools and frameworks: Pandas, PyArrow, Airflow, Prefect, Spark, Kafka, dbt, Docker, Kubernetes, Terraform… and the list goes on. And then there are databases, data lakes, data warehouses and cloud platforms!

Although there are some data engineering tutorials out there, I feel most of them focus on intermediate-level data engineering and throw a bunch of tools at you within the first few minutes of reading. I want to avoid this! Let’s start at the absolute basics: simple SQL and Python. The other tools will come when we need them.

Before moving on, let’s discuss what a real-life data engineer actually does.



What Does a Data Engineer Actually Do?

So what do data engineers actually do? Well, it is important to remember that the role of a data engineer, and the tools the engineer uses, vary a lot between companies! But if we try to summarize it, we could say that data engineering is the practice of designing, building, and maintaining systems that collect, transform, and store data – so that others (analysts, data scientists, apps) can use it.

Task examples

Here are a few of the things you might work on during an average day (see the sketch after this section for what these tasks look like in code):

  • Ingest data from files, APIs, or databases
  • Clean and transform raw data
  • Store it in a database or warehouse
  • Automate and schedule data flows
  • Make sure data is secure, fresh, and trustworthy

Data engineers are the builders who ensure the right data is available, in the right format, at the right time. Data engineers work very much in the backend, so don’t expect much attention from your company’s leadership. If you are not in the spotlight, you are probably doing a great job!
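To make this concrete, here is a minimal sketch of that ingest–clean–store loop in plain Python. The file name and column names are made up for illustration; we will build a real version of this later in the series.

import csv
import sqlite3

# Ingest: read raw rows from a CSV file (the file name is hypothetical)
with open("shows_raw.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Clean/transform: strip stray whitespace and drop rows without a name
cleaned = [
    {"name": row["name"].strip(), "rating": float(row["rating"] or 0)}
    for row in rows
    if row.get("name")
]

# Store: load the cleaned rows into a local SQLite database
conn = sqlite3.connect("shows.db")
conn.execute("CREATE TABLE IF NOT EXISTS shows (name TEXT, rating REAL)")
conn.executemany("INSERT INTO shows (name, rating) VALUES (:name, :rating)", cleaned)
conn.commit()
conn.close()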

Common Misconceptions

Let’s clear up what data engineers aren’t:
– They aren’t data analysts writing reports
– They aren’t data scientists building machine learning models
– They don’t only write SQL or only manage infrastructure
– They probably don’t make any Power BI reports either

Instead, data engineers empower everyone else by making data flow. Great data engineers make the work of data analysts, data scientists and others easier!

Core responsibilities:

  • Building and maintaining pipelines
  • Data cleaning and transformation
  • Managing databases and storage
  • Working with tools like Python, SQL, dbt, Airflow, Docker and cloud platforms

Data Engineer vs. Data Analyst vs. ML Engineer

There is some confusion online about the similarities and differences between some of the major data roles at large companies. Let’s look at this briefly before moving on:

| Role | Focus | Tools | Output |
| --- | --- | --- | --- |
| Data Engineer | Build pipelines & infrastructure | Python, SQL, Airflow, dbt | Clean, reliable data |
| Data Analyst | Explore and visualize data | SQL, Excel, BI tools | Dashboards, reports |
| ML Engineer | Train & deploy ML models | Python, TensorFlow, PyTorch | Predictions, APIs |

So yes, data engineering might not be a very flashy job, but it is so important for a large part of the organisation – and if you do your job well you will be very well compensated!


What This Series Will Cover

Enough with this data engineering introduction – you want to code! I completely agree. But let’s briefly cover the project I have in mind (which will probably change along the way) for this series of articles.

This series is for Python and SQL beginners who want a practical path into data engineering. You’ll learn by building one small project that evolves over time. I will divide this series into several parts, each with its own main subject. The API/dataset I will use comes from TV Maze, found at: https://www.tvmaze.com/api.

We’ll start locally with basic Python scripts, and then slowly introduce the tools used in real-world jobs.

You’ll learn:

  • Python scripting for data cleaning
  • Pandas for data transformation
  • PostgreSQL for storage
  • How to schedule jobs
  • How to use Docker, Airflow, and dbt
  • How to model our data using Kimball’s dimensional modeling
  • How to deploy pipelines to the cloud

You don’t need to know any of these tools right now. We’ll cover them one by one.


Tools You’ll Need

We are going to start slow with data engineering! The only tools you will need for now are a PC (or VM) running Python 3 and a code editor of some sort. I highly recommend Visual Studio Code. Having Git is probably also a good idea if you want to put your code in source control. It will also make it easier to fetch my code.

In addition, you shouldn’t be a complete stranger to basic programming and terminal commands. But even if you feel in doubt on this point, I will try to explain the basics as I go! The main thing you need is a natural curiosity, coupled with a certain level of persistence. Don’t give up while following this project, but don’t push yourself too hard either.

If you don’t have any of these tools installed, don’t worry – that is the main reason this first part of the series focuses on setup: to get everyone on the same level, ready to go.


Hands-On Setup – Your First Project

Let’s set up our local data engineering working environment. We will start by installing Python, as it allows us to build some simple data pipelines all by itself.

Step 1: Install Python

To get started with Python development, you’ll first need to install Python itself. The exact steps depend on your operating system, but one of the simplest and most reliable methods—especially on Windows—is to download it directly from the official website:

https://www.python.org/downloads/

When running the installer on Windows, make sure you check the box that says “Add Python to PATH” before clicking “Install Now.” This step is crucial—it allows you to run Python from any terminal or command prompt without needing to type the full path to the Python executable. Forgetting to do this is one of the most common beginner mistakes.

Once installed, you can verify that Python is working correctly by opening a terminal (Command Prompt, PowerShell, or Terminal on macOS/Linux) and typing:

python --version

or, in some environments:

python3 --version

You should see the version number printed out, something like Python 3.12.1. That means Python is installed and ready to go!

Tip: On macOS and many Linux distributions, Python may already be pre-installed. However, it’s often an older version. Installing the latest version from python.org (or using tools like Homebrew on macOS) ensures you’re using a modern, up-to-date version with all the newest features.


Step 2: Install a Code Editor

While there are plenty of viable code editors available, such as Sublime Text and PyCharm, I highly recommend using Visual Studio Code (VS Code). It’s a lightweight, open-source editor developed by Microsoft that has become the go-to choice for many developers across different languages, especially Python. This is why I definitely recommend it for data engineering projects.

VS Code is easy to install and runs smoothly even on lower-end machines. What makes it stand out is its extensive ecosystem of extensions, which you can use to tailor the editor exactly to your needs. For Python development, it offers excellent support through the Python extension, which adds features like:

  • Intelligent code completion (IntelliSense)
  • Code linting and formatting
  • Integrated debugging
  • Virtual environment support
  • Jupyter Notebook integration
  • Interactive Python terminal

You can also customize the interface with themes, keyboard shortcuts, and workspace settings. Whether you’re writing simple scripts or building complex applications, VS Code strikes the right balance between power and simplicity.

To install VS Code, go to https://code.visualstudio.com, download the installer for your operating system, and follow the setup instructions. After installation, be sure to install the Python extension from the Extensions Marketplace to unlock all the Python-specific features.
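If you prefer the terminal, the same extension can also be installed with VS Code’s code command-line tool (available if VS Code added itself to PATH during setup):

code --install-extension ms-python.python   # ID of the official Python extension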


Step 3: Install Git

Git is an essential tool for modern software development, including data engineering. It’s a version control system that allows you to track changes in your code, collaborate with others, and manage different versions of your project efficiently. Even if you’re working solo, using Git is a great habit—it acts like a powerful undo button for your code.

To install Git, follow the instructions for your platform:

  • Windows:
    Download the Git installer from https://git-scm.com/download/win.
    During installation, you can accept the default settings, but make sure that:
    “Git from the command line and also from 3rd-party software” is selected.
  • macOS:
    You can install Git using Homebrew (if you have it installed) by running:
    brew install git
  • Linux:
    Git is typically available via your package manager. For example:
    sudo apt install git

Once installed, you can check the version to confirm it’s working by running:

git --version

You should see something like git version 2.43.0.

💡 Tip: After installing Git, it’s a good idea to configure your identity, especially if you plan to use Git with platforms like GitHub:

git config --global user.name "Your Name"
git config --global user.email "you@example.com"

This information will be attached to your commits, making collaboration and version tracking clearer.
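If you want to try Git out right away, a minimal first-commit workflow looks like this (run inside any project folder):

git init                         # turn the current folder into a Git repository
git add .                        # stage every file in the folder
git commit -m "Initial commit"   # record a snapshot you can roll back to later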


Step 4: Create Your First Project Folder

I am going to assume you know the absolute basics of terminals. If not, you can find some great resources out there. In this article you only need two commands: mkdir (make directory) and cd (change directory).

Open your terminal (Command Prompt, PowerShell, or Terminal) and run:

mkdir data-engineering-fundamentals
cd data-engineering-fundamentals

Step 5: Create a Virtual Environment

Before we start coding, we need to talk about virtual environments.

Imagine this: you’re working on a personal project that uses the requests package. You install it using pip install requests (we’ll explain this command later). That project specifically needs version 2.26 of the package.

Later, you start another project that requires requests version 2.32.4.

If you don’t use a virtual environment, pip installs packages globally—meaning the version installed affects every Python project on your system. So if you upgrade requests to 2.32.4 for the new project, your old project, which depends on 2.26, might break.

This problem isn’t limited to requests. Say one project needs Flask 2.0, and another needs Flask 1.1—having everything installed globally can quickly lead to version conflicts, broken dependencies, and a messy system.

That’s where Python virtual environments come in. They let you create isolated environments for each project, so you can manage dependencies cleanly and safely. A virtual environment is like a self-contained bubble for your Python project. It has its own Python interpreter and its own place to install packages — completely separate from the global Python installation on your computer.


How to Create and Activate a Virtual Environment

That’s nice and all. But how do we create a virtual environment?

Here’s the typical workflow:

1. Create a virtual environment folder

Open your terminal and run:

python -m venv venv

This creates a directory called venv containing a fresh Python setup.

2. Activate the virtual environment
Before you install packages or run Python scripts, activate your virtual environment like so on Linux/macOS:

source venv/bin/activate 

On Windows (PowerShell or Command Prompt) you use the following:

.\venv\Scripts\activate

When activated, your shell prompt usually changes to show (venv), indicating you’re working inside the environment.
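If you want to double-check that activation worked, you can ask your shell which interpreter it now resolves:

which python    # Linux/macOS: should print a path ending in venv/bin/python
where python    # Windows: the venv\Scripts\python.exe entry should be listed first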


Why Activate? What Does It Do?

Once you activate the virtual environment, all your Python commands and package installations happen inside this isolated bubble. That means:

  • Running python will use the Python interpreter from inside venv.
  • Running pip install <package> will install packages only inside venv — not globally.
  • You don’t need to add any special options or flags to pip to install packages in the right place.

Like I said earlier, this is powerful because it keeps your project’s dependencies cleanly separated from others.
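As a quick illustration of that isolation, here is how the two-projects scenario from earlier plays out in practice. The folder names are made up for the example:

mkdir project-a && cd project-a
python -m venv venv
source venv/bin/activate          # .\venv\Scripts\activate on Windows
pip install "requests==2.26.0"    # installed only inside project-a's venv
deactivate                        # leave the environment again
cd ..

mkdir project-b && cd project-b
python -m venv venv
source venv/bin/activate
pip install "requests==2.32.4"    # a different version, with no conflict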


Step 6: Install Packages

One of the reasons Python is so powerful is its vast ecosystem of packages created by others. This means you often don’t need to write everything from scratch – for example, if you want to make web requests, there’s already a package for that. Many of these packages are directly useful for data engineering.

With the virtual environment active, you simply run:

pip install requests

Now the requests package goes straight into your project’s venv. No confusion, no conflicts.

When run from within an activated virtual environment, common installation tools such as pip will install packages into that environment without needing to be told to do so explicitly.
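Since requests is exactly the package we will use to pull data from TV Maze later in this series, here is a small taste of what’s coming. This is just a sketch using the public show-search endpoint documented at https://www.tvmaze.com/api:

import requests

# Search the public TV Maze API for shows matching a query
response = requests.get(
    "https://api.tvmaze.com/search/shows",
    params={"q": "office"},
    timeout=10,
)
response.raise_for_status()

# The API returns a JSON list of {score, show} objects
for result in response.json()[:3]:
    show = result["show"]
    print(show["name"], "-", show.get("premiered"))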


What You’ll Build in the Coming Articles

In the next part of my data engineering series, we’ll set up a clean folder structure for your first data pipeline and write the first lines of code to pull in some real data.

Here’s a sneak peek of what’s ahead:

  • Clean a CSV file using Python
  • Use Pandas to transform real data
  • Load that data into PostgreSQL
  • Replace the CSV with an API
  • Automate your pipeline with a scheduler
  • Use Docker to run everything consistently
  • Introduce Airflow to manage workflows
  • Use dbt to test and structure your data
  • Deploy it all to the cloud

Coming Next: Build Your First Python-to-PostgreSQL ETL

In the next part of this data engineering series, we will create a basic directory structure, read a CSV, clean it, and load it into a local database – all from scratch.

👉 Up next: Build Your First Python-to-PostgreSQL ETL
