Data Engineering from Scratch Part 1: Essentials & Tools

So you’ve heard the term data engineering floating around, but what does it really mean? And how do you go from “I know some Python and SQL” to confidently building your own data pipeline? This beginner-friendly series will walk you through that journey – step by step. No cloud, no enterprise tools to start – just your computer, Python, and a curiosity to learn. The data we will use comes from TV Maze: https://www.tvmaze.com/api

I am writing this series to help people who have either recently become interested in data engineering, or who have tried getting into it and felt completely overwhelmed by the sheer number of tools out there. Data engineering can be a complex field to get started in. In addition to having to know languages such as Python and SQL, there are many different tools and frameworks: Pandas, PyArrow, Airflow, Prefect, Spark, Kafka, dbt, Docker, Kubernetes, Terraform… and the list goes on. And then there are databases, data lakes, data warehouses and cloud platforms!

Although there are some data engineering tutorials out there, I feel most of them focus on intermediate-level data engineering and throw a bunch of tools at you within the first few minutes of reading. I want to avoid this! Let’s start at the absolute basics: simple SQL and Python. The other tools will come when we need them.

Before moving on, let’s discuss what a real-life data engineer actually does.



What Does a Data Engineer Actually Do?

So what do data engineers actually do? Well, it is important to remember that the role of a data engineer, and the tools the engineer uses, vary a lot between companies! But if we try to summarize it, we could say that data engineering is the practice of designing, building, and maintaining systems that collect, transform, and store data – so that others (analysts, data scientists, apps) can use it.

Task examples

Here are a few of the things you might work on during an average day (see the sketch after this section for what these tasks look like in code):

  • Ingest data from files, APIs, or databases
  • Clean and transform raw data
  • Store it in a database or warehouse
  • Automate and schedule data flows
  • Make sure data is secure, fresh, and trustworthy

Data engineers are the builders who ensure the right data is available, in the right format, at the right time. Data engineers work very much in the backend, so don’t expect much attention from your company’s leadership. If you are not in the spotlight, you are probably doing a great job!
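To make this concrete, here is a minimal sketch of that ingest–clean–store loop in plain Python. The file name and column names are made up for illustration; we will build a real version of this later in the series.

import csv
import sqlite3

# Ingest: read raw rows from a CSV file (the file name is hypothetical)
with open("shows_raw.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Clean/transform: strip stray whitespace and drop rows without a name
cleaned = [
    {"name": row["name"].strip(), "rating": float(row["rating"] or 0)}
    for row in rows
    if row.get("name")
]

# Store: load the cleaned rows into a local SQLite database
conn = sqlite3.connect("shows.db")
conn.execute("CREATE TABLE IF NOT EXISTS shows (name TEXT, rating REAL)")
conn.executemany("INSERT INTO shows (name, rating) VALUES (:name, :rating)", cleaned)
conn.commit()
conn.close()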

Common Misconceptions

Let’s clear up what data engineers aren’t:
– They aren’t data analysts writing reports
– They aren’t data scientists building machine learning models
– They don’t only write SQL or only manage infrastructure
– They probably don’t make any Power BI reports either

Instead, data engineers empower everyone else by making data flow. Great data engineers make the work of data analysts, data scientists and others easier!

Core responsibilities:

  • Building and maintaining pipelines
  • Data cleaning and transformation
  • Managing databases and storage
  • Working with tools like Python, SQL, dbt, Airflow, Docker and cloud platforms

Data Engineer vs. Data Analyst vs. ML Engineer

There is some confusion online about the similarities and differences between some of the major data roles at large companies. Let’s look at this briefly before moving on:

| Role | Focus | Tools | Output |
| --- | --- | --- | --- |
| Data Engineer | Build pipelines & infrastructure | Python, SQL, Airflow, dbt | Clean, reliable data |
| Data Analyst | Explore and visualize data | SQL, Excel, BI tools | Dashboards, reports |
| ML Engineer | Train & deploy ML models | Python, TensorFlow, PyTorch | Predictions, APIs |

So yes, data engineering might not be a very flashy job, but it is so important for a large part of the organisation – and if you do your job well you will be very well compensated!


What This Series Will Cover

Enough with this data engineering introduction – you want to code! I completely agree. But let’s briefly cover the project I have in mind (which will probably change along the way) for this series of articles.

This series is for Python and SQL beginners who want a practical path into data engineering. You’ll learn by building one small project that evolves over time. I will divide this series into several parts, each with its own main subject. The API/dataset I will use comes from TV Maze, found at: https://www.tvmaze.com/api.

We’ll start locally with basic Python scripts, and then slowly introduce the tools used in real-world jobs.

You’ll learn:

  • Python scripting for data cleaning
  • Pandas for data transformation
  • PostgreSQL for storage
  • How to schedule jobs
  • How to use Docker, Airflow, and dbt
  • How to model our data using Kimball’s dimensional modeling
  • How to deploy pipelines to the cloud

You don’t need to know any of these tools right now. We’ll cover them one by one.


Tools You’ll Need

We are going to start slow with data engineering! The only tools you will need for now are a PC (or VM) running Python 3 and a code editor of some sort. I highly recommend Visual Studio Code. Having Git is probably also a good idea if you want to put your code in source control. It will also make it easier to fetch my code.

In addition, you shouldn’t be a complete stranger to basic programming and terminal commands. But even if you feel in doubt on this point, I will try to explain the basics as I go! The main thing you need is a natural curiosity, coupled with a certain level of persistence. Don’t give up while following this project, but don’t push yourself too hard either.

If you don’t have any of these tools installed, don’t worry – that is the main reason this first part of the series focuses on setup: to get everyone on the same level, ready to go.


Hands-On Setup – Your First Project

Let’s set up our local data engineering working environment. We will start by installing Python, as it allows us to build some simple data pipelines all by itself.

Step 1: Install Python

To get started with Python development, you’ll first need to install Python itself. The exact steps depend on your operating system, but one of the simplest and most reliable methods—especially on Windows—is to download it directly from the official website:

https://www.python.org/downloads/

When running the installer on Windows, make sure you check the box that says “Add Python to PATH” before clicking “Install Now.” This step is crucial—it allows you to run Python from any terminal or command prompt without needing to type the full path to the Python executable. Forgetting to do this is one of the most common beginner mistakes.

Once installed, you can verify that Python is working correctly by opening a terminal (Command Prompt, PowerShell, or Terminal on macOS/Linux) and typing:

python --version

or, in some environments:

python3 --version

You should see the version number printed out, something like Python 3.12.1. That means Python is installed and ready to go!

Tip: On macOS and many Linux distributions, Python may already be pre-installed. However, it’s often an older version. Installing the latest version from python.org (or using tools like Homebrew on macOS) ensures you’re using a modern, up-to-date version with all the newest features.


Step 2: Install a Code Editor

While there are plenty of viable code editors available, such as Sublime Text and PyCharm, I highly recommend using Visual Studio Code (VS Code). It’s a lightweight, open-source editor developed by Microsoft that has become the go-to choice for many developers across different languages, especially Python. This is why I definitely recommend it for data engineering projects.

VS Code is easy to install and runs smoothly even on lower-end machines. What makes it stand out is its extensive ecosystem of extensions, which you can use to tailor the editor exactly to your needs. For Python development, it offers excellent support through the Python extension, which adds features like:

  • Intelligent code completion (IntelliSense)
  • Code linting and formatting
  • Integrated debugging
  • Virtual environment support
  • Jupyter Notebook integration
  • Interactive Python terminal

You can also customize the interface with themes, keyboard shortcuts, and workspace settings. Whether you’re writing simple scripts or building complex applications, VS Code strikes the right balance between power and simplicity.

To install VS Code, go to https://code.visualstudio.com, download the installer for your operating system, and follow the setup instructions. After installation, be sure to install the Python extension from the Extensions Marketplace to unlock all the Python-specific features.
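If you prefer the terminal, the same extension can also be installed with VS Code’s code command-line tool (available if VS Code added itself to PATH during setup):

code --install-extension ms-python.python   # ID of the official Python extension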


Step 3: Install Git

Git is an essential tool for modern software development, including data engineering. It’s a version control system that allows you to track changes in your code, collaborate with others, and manage different versions of your project efficiently. Even if you’re working solo, using Git is a great habit—it acts like a powerful undo button for your code.

To install Git, follow the instructions for your platform:

  • Windows:
    Download the Git installer from https://git-scm.com/download/win.
    During installation, you can accept the default settings, but make sure that:
    “Git from the command line and also from 3rd-party software” is selected.
  • macOS:
    You can install Git using Homebrew (if you have it installed) by running:
    brew install git
  • Linux:
    Git is typically available via your package manager. For example:
    sudo apt install git

Once installed, you can check the version to confirm it’s working by running:

git --version

You should see something like git version 2.43.0.

💡 Tip: After installing Git, it’s a good idea to configure your identity, especially if you plan to use Git with platforms like GitHub:

git config --global user.name "Your Name"
git config --global user.email "you@example.com"

This information will be attached to your commits, making collaboration and version tracking clearer.
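If you want to try Git out right away, a minimal first-commit workflow looks like this (run inside any project folder):

git init                         # turn the current folder into a Git repository
git add .                        # stage every file in the folder
git commit -m "Initial commit"   # record a snapshot you can roll back to later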


Step 4: Create Your First Project Folder

I am going to assume you know the absolute basics of terminals. If not, you can find some great resources out there. In this article you only need two commands: mkdir (make directory) and cd (change directory).

Open your terminal (Command Prompt, PowerShell, or Terminal) and run:

mkdir data-engineering-fundamentals
cd data-engineering-fundamentals

Step 5: Create a Virtual Environment

Before we start coding, we need to talk about virtual environments.

Imagine this: you’re working on a personal project that uses the requests package. You install it using pip install requests (we’ll explain this command later). That project specifically needs version 2.26 of the package.

Later, you start another project that requires requests version 2.32.4.

If you don’t use a virtual environment, pip installs packages globally—meaning the version installed affects every Python project on your system. So if you upgrade requests to 2.32.4 for the new project, your old project, which depends on 2.26, might break.

This problem isn’t limited to requests. Say one project needs Flask 2.0, and another needs Flask 1.1—having everything installed globally can quickly lead to version conflicts, broken dependencies, and a messy system.

That’s where Python virtual environments come in. They let you create isolated environments for each project, so you can manage dependencies cleanly and safely. A virtual environment is like a self-contained bubble for your Python project. It has its own Python interpreter and its own place to install packages — completely separate from the global Python installation on your computer.


How to Create and Activate a Virtual Environment

That’s nice and all. But how do we create a virtual environment?

Here’s the typical workflow:

1. Create a virtual environment folder

Open your terminal and run:

python -m venv venv

This creates a directory called venv containing a fresh Python setup.

2. Activate the virtual environment
Before you install packages or run Python scripts, activate your virtual environment like so on Linux/macOS:

source venv/bin/activate 

On Windows (PowerShell or Command Prompt) you use the following:

.\venv\Scripts\activate

When activated, your shell prompt usually changes to show (venv), indicating you’re working inside the environment.
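If you want to double-check that activation worked, you can ask your shell which interpreter it now resolves:

which python    # Linux/macOS: should print a path ending in venv/bin/python
where python    # Windows: the venv\Scripts\python.exe entry should be listed first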


Why Activate? What Does It Do?

Once you activate the virtual environment, all your Python commands and package installations happen inside this isolated bubble. That means:

  • Running python will use the Python interpreter from inside venv.
  • Running pip install <package> will install packages only inside venv — not globally.
  • You don’t need to add any special options or flags to pip to install packages in the right place.

Like I said earlier, this is powerful because it keeps your project’s dependencies cleanly separated from others.
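As a quick illustration of that isolation, here is how the two-projects scenario from earlier plays out in practice. The folder names are made up for the example:

mkdir project-a && cd project-a
python -m venv venv
source venv/bin/activate          # .\venv\Scripts\activate on Windows
pip install "requests==2.26.0"    # installed only inside project-a's venv
deactivate                        # leave the environment again
cd ..

mkdir project-b && cd project-b
python -m venv venv
source venv/bin/activate
pip install "requests==2.32.4"    # a different version, with no conflict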


Step 6: Install Packages

One of the reasons Python is so powerful is its vast ecosystem of packages created by others. This means you often don’t need to write everything from scratch – for example, if you want to make web requests, there’s already a package for that. Many of these packages are directly useful for data engineering.

With the virtual environment active, you simply run:

pip install requests

Now the requests package goes straight into your project’s venv. No confusion, no conflicts.

When run from within an activated virtual environment, common installation tools such as pip will install packages into that environment without needing to be told to do so explicitly.
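Since requests is exactly the package we will use to pull data from TV Maze later in this series, here is a small taste of what’s coming. This is just a sketch using the public show-search endpoint documented at https://www.tvmaze.com/api:

import requests

# Search the public TV Maze API for shows matching a query
response = requests.get(
    "https://api.tvmaze.com/search/shows",
    params={"q": "office"},
    timeout=10,
)
response.raise_for_status()

# The API returns a JSON list of {score, show} objects
for result in response.json()[:3]:
    show = result["show"]
    print(show["name"], "-", show.get("premiered"))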


What You’ll Build in the Coming Articles

In the next part of my data engineering series, we’ll set up a clean folder structure for your first data pipeline and write the first lines of code to pull in some real data.

Here’s a sneak peek of what’s ahead:

  • Clean a CSV file using Python
  • Use Pandas to transform real data
  • Load that data into PostgreSQL
  • Replace the CSV with an API
  • Automate your pipeline with a scheduler
  • Use Docker to run everything consistently
  • Introduce Airflow to manage workflows
  • Use dbt to test and structure your data
  • Deploy it all to the cloud

Coming Next: Build Your First Python-to-PostgreSQL ETL

In the next part of this data engineering series, we will create a basic directory structure, read a CSV, clean it, and load it into a local database – all from scratch.

👉 Up next: Build Your First Python-to-PostgreSQL ETL
