Jasper Alblas
Mastering Data & Cybersec
When I first got into data, I was overwhelmed by all the terms, definitions and tasks: data engineering, analytics, BI, data science, modeling, OLTP, OLAP… It felt like everyone else already knew what it all meant, and I was just nodding along. But what helped me a lot was taking a step back and reading up on data fundamentals.
Whether you dream of building dashboards, automating pipelines, or designing full-scale data warehouses, every data journey starts with understanding how information flows: how does data provide value to a business, and how do we get it into the right shape to deliver that value?
This series — Data Fundamentals — is your first step in a larger, full-stack learning path. Over the coming weeks, you’ll move from concepts like roles and data types, through hands-on modeling in Python and SQL, all the way to production-style pipelines that load and serve clean data to BI tools.
In this first part of the Data Fundamentals series, we’ll explore how data roles fit together and walk through the data lifecycle — from raw information to meaningful insights. By the end, you’ll understand how the data journey really works, and where you might fit in.
Data drives almost every business decision today, but not everyone in the data space has the same responsibilities. So what's the difference between a data engineer, a data analyst, and a data scientist? And where does a BI developer fit in?
Here’s how I’ve seen roles typically play out in real organizations:

At a high level:
| Role | Focus | Typical Output |
|---|---|---|
| Data Engineer | Build pipelines, clean and integrate data | Reliable, structured data |
| Data Analyst / BI Developer | Explore and visualize data | Dashboards, reports |
| Data Scientist | Apply statistical models or machine learning | Forecasts, predictions |
| Data Architect | Design overall data systems | Architecture diagrams, standards |
| Data Ops / Platform Engineer | Automate and monitor data systems | ETL pipelines, CI/CD, orchestration |
These roles overlap. Sometimes one person covers all of them, sometimes they're split across whole teams. I have had weeks, especially during summer holidays, where I did basic analyses, edited Power BI reports, fixed a pipeline, and even tweaked a machine learning pipeline.
Engineers often act as the bridge between raw systems and analytical tools. Analysts and scientists use the data that engineers have prepared to extract meaningful insights. Knowing where the overlap lies helps avoid confusion and improves collaboration.
Here’s a quick side-by-side to highlight the differences:
| Role | Tools |
|---|---|
| Data Engineer | Python, SQL, Airflow, dbt |
| Data Analyst | SQL, Excel, BI tools |
| Data Scientist | SQL, Python, TensorFlow, PyTorch |
Notice SQL is used by every role? YES, you can’t go wrong with learning SQL. We will get there.
I’ve noticed that beginners often skip understanding where data comes from. Yet, grasping the lifecycle of data is so important to understanding proper data fundamentals. When I started working with production systems, I imagined data just appearing in dashboards. In reality, it moves through several messy but structured steps — and understanding these is what separates a good analyst from a great data engineer.
Let's look at the lifecycle as it really happens in most organizations:
Extract (Source Systems)
Everything begins in operational systems — ERPs, CRMs, IoT devices, web apps. These are OLTP systems optimized for fast inserts and updates.
💡 Example: At my job, we pull daily transactions from a banking core system through an API feed — it’s the “raw truth.”
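To make that concrete, here is a minimal Python sketch of such an extract step. The endpoint, token, and response shape are placeholders, not the actual system I work with.

```python
import datetime as dt

import requests

# Hypothetical endpoint and token -- replace with your source system's API.
API_URL = "https://example.com/api/transactions"
API_TOKEN = "..."

def extract_transactions(day: dt.date) -> list[dict]:
    """Pull one day's worth of raw transactions from the source system."""
    response = requests.get(
        API_URL,
        params={"date": day.isoformat()},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()  # fail loudly if the source is unavailable
    return response.json()       # raw records, untouched

if __name__ == "__main__":
    records = extract_transactions(dt.date.today())
    print(f"Extracted {len(records)} raw records")
```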
Stage (Landing Area)
Data lands in a staging layer — usually raw tables or flat files that mirror the source. Nothing is cleaned yet. This is your “safety copy.”
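A staging step can be as simple as landing the raw payload, untouched, in a date-partitioned folder or raw table. A minimal sketch, assuming the records come from an extract like the one above:

```python
import datetime as dt
import json
from pathlib import Path

STAGING_DIR = Path("staging/transactions")  # could also be cloud storage or a raw table

def stage_raw(records: list[dict], day: dt.date) -> Path:
    """Land the raw records as-is, partitioned by load date."""
    target = STAGING_DIR / day.isoformat()
    target.mkdir(parents=True, exist_ok=True)
    path = target / "transactions.json"
    path.write_text(json.dumps(records, indent=2))  # no cleaning yet: this is the safety copy
    return path
```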
Clean & Transform (ETL / ELT)
This is where engineers fix formats, join related tables, and calculate metrics. The messy data becomes usable.
Example: removing duplicates, converting currencies, or enriching records with master data.
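Here is a small pandas sketch of that kind of transformation; the column names and the fixed exchange rate are purely illustrative:

```python
import pandas as pd

# Hypothetical source data -- adjust column names to your own feed.
orders = pd.DataFrame({
    "order_id": [1001, 1001, 1002],
    "customer_id": [12, 12, 7],
    "amount": [59.99, 59.99, 120.00],
    "currency": ["EUR", "EUR", "USD"],
})
customers = pd.DataFrame({"customer_id": [12, 7], "segment": ["Consumer", "Corporate"]})

# 1. Remove exact duplicates coming from the source feed.
orders = orders.drop_duplicates()

# 2. Convert everything to a single currency (illustrative fixed rate).
usd_to_eur = 0.92
orders["amount_eur"] = orders["amount"].where(
    orders["currency"] == "EUR", orders["amount"] * usd_to_eur
)

# 3. Enrich with master data.
clean = orders.merge(customers, on="customer_id", how="left")
print(clean)
```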
Model (Warehouse or Lakehouse)
Clean data is reshaped into fact and dimension tables — the backbone for analytics.
Facts capture measurable events (e.g., sales, logins).
Dimensions describe context (e.g., customer, product, region).
This layer lives in your data warehouse (like Snowflake, BigQuery, Synapse).
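As a rough illustration, here is how a cleaned but still flat sales table could be split into a fact table and a customer dimension with pandas (the column names are made up for the example):

```python
import pandas as pd

# A cleaned, but still flat, sales table.
sales = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "order_date": ["2023-02-01", "2023-02-01", "2023-02-02"],
    "customer_id": [12, 7, 12],
    "customer_name": ["Anna", "Bo", "Anna"],
    "region": ["North", "South", "North"],
    "amount_eur": [59.99, 110.40, 25.00],
})

# Dimension: one row per customer, descriptive attributes only.
dim_customer = (
    sales[["customer_id", "customer_name", "region"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Fact: one row per measurable event, keys plus measures.
fact_sales = sales[["order_id", "order_date", "customer_id", "amount_eur"]]

print(dim_customer)
print(fact_sales)
```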
Serve (Semantic Layer / BI)
Analysts and BI tools like Power BI or Looker connect here. The data is aggregated, filtered, and visualized.
💡 From experience: This is where naming conventions and business logic consistency really matter.
Archive & Monitor (Governance)
Historical data is archived for traceability, and data quality is continuously checked. Monitoring ensures pipelines don’t silently fail.
Here’s a theoretical pipeline:
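As a rough sketch, the same flow can be expressed as a handful of functions chained together; all names and the toy data below are illustrative:

```python
import datetime as dt
from pathlib import Path

import pandas as pd

def extract(day: dt.date) -> pd.DataFrame:
    """Pull raw records from the source system (API, database export, ...)."""
    return pd.DataFrame({"order_id": [1001, 1001], "amount": [59.99, 59.99]})

def stage(raw: pd.DataFrame, day: dt.date) -> pd.DataFrame:
    """Land an untouched safety copy, partitioned by load date."""
    Path("staging").mkdir(exist_ok=True)
    raw.to_csv(f"staging/orders_{day.isoformat()}.csv", index=False)
    return raw

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean and conform: here, just drop duplicate orders."""
    return raw.drop_duplicates(subset="order_id")

def model(clean: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Reshape into fact (and dimension) tables for the warehouse."""
    return {"fact_orders": clean}

def run_pipeline(day: dt.date) -> None:
    tables = model(transform(stage(extract(day), day)))
    # Serve: load `tables` into the warehouse / semantic layer where BI tools connect.
    print({name: len(df) for name, df in tables.items()})

run_pipeline(dt.date.today())
```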
💡 Personal insight: When I first saw this pipeline in action, I underestimated how much thought goes into staging and validating data before it hits dashboards. This includes talking with business users so that we prepare the data in the format the business actually needs. Remember: if you provide value to the company, you get to keep your job.
In real-world data work, you’re rarely starting with a clean, curated dataset. Instead, you’re pulling from messy, diverse sources — each with its own quirks. Here are the main categories:
These are the systems businesses use to run their day-to-day operations: the ERPs, CRMs, and web apps we met earlier. They generate transactional data, things like:
“Order #1234, placed on 2023-02-01, by Customer 12, amount €59.99.”
This data is usually stored in relational databases (SQL), and is highly structured.
Modern applications also generate tons of event data: clicks, page views, logins, and application logs. This data is often semi-structured (JSON) and stored in data lakes or NoSQL systems. It's great for behavioral analysis, but can be noisy and hard to model.
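For example, a single event often arrives as nested JSON that has to be flattened before it fits into tables. A small sketch with a made-up event:

```python
import pandas as pd

# A hypothetical semi-structured event, the kind a web app might emit.
event = {
    "event_type": "page_view",
    "timestamp": "2023-02-01T10:15:00Z",
    "user": {"id": 12, "country": "DK"},
    "context": {"device": "mobile", "campaign": None},
}

# Flatten the nested structure into columns for analysis.
flat = pd.json_normalize(event)
print(flat.columns.tolist())
# ['event_type', 'timestamp', 'user.id', 'user.country', 'context.device', 'context.campaign']
```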
Spreadsheets and flat files (Excel, CSV) are still very common in many organizations. These are often the starting point for ad-hoc analysis or prototyping.
Sometimes you enrich your internal data with external sources, such as public or open datasets and third-party APIs. These can add valuable context, like population data to support sales analysis.
💡 Lesson from my work: Using extract and archive tables to snapshot raw data daily is key. It allows you to rebuild history or trace errors, especially when upstream systems change unexpectedly.
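A minimal sketch of that idea, writing a date-stamped snapshot of the raw extract so history can be rebuilt later (the location and column names are illustrative):

```python
import datetime as dt
from pathlib import Path

import pandas as pd

ARCHIVE_DIR = Path("archive/orders")  # often a table in the warehouse instead of files

def archive_snapshot(raw: pd.DataFrame, day: dt.date) -> Path:
    """Keep an immutable daily copy of the raw extract, stamped with its load date."""
    snapshot = raw.copy()
    snapshot["load_date"] = day.isoformat()
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    path = ARCHIVE_DIR / f"orders_{day.isoformat()}.csv"
    snapshot.to_csv(path, index=False)
    return path
```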
It's tempting to throw raw CSVs into Power BI and start building visuals. And yes, you'll get something on the screen. But without a proper model, you'll quickly hit walls.
Modeling is what turns spaghetti into structure. It’s the process of organizing your data so it’s consistent, scalable, and easy to analyse.
The magic of data work happens when technical systems and business needs align. Understanding this bridge makes collaboration easier and prevents you from wasting time wondering "who should do this?".
Download the Superstore dataset. If you are new to Kaggle, you will have to create an account. Trust me, it's worth it. Once you have downloaded the file, go ahead and open it in Excel (or Google Sheets).
You should see something like this:

Take a few minutes to scroll through the rows and columns. Ask yourself: which columns record transactions (things that happened), and which columns describe things like customers, products, or regions?
This exercise will help you a lot when we learn about modeling: it gets you thinking about whether data fields (columns) belong to transactions (fact tables) or to descriptions (dimension tables).
Don't worry if you're unsure — we'll revisit this when we start modeling in a later part. The main point to note is that the transactional and descriptive columns are mixed together in one big table! While we can work with this in Power BI (and we will do so soon), Power BI gets far more effective when the data is split into dimension tables and fact tables.
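If you would rather poke at the file in Python than in Excel, a few pandas one-liners give a quick feel for which columns look like measures and which look like descriptions. The file name and encoding depend on the version you downloaded:

```python
import pandas as pd

# Adjust the file name; some versions of the dataset need encoding="latin-1".
df = pd.read_csv("superstore.csv")

print(df.shape)                     # how many rows and columns?
print(df.dtypes)                    # numeric columns are often measures (fact candidates)
print(df.nunique().sort_values())   # low-cardinality text columns are dimension candidates
```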
This was the warm-up. You now know the landscape: roles, data flows, and key concepts.
In Part 2, we'll explore the core concepts that power these pipelines.
🛠️ Mini-challenge: Pick one dataset you encounter daily. Ask yourself: Is this transactional or dimensional data? Understanding this distinction is the first step toward thinking like a data engineer.
You are welcome to comment on this post, or share it with friends. I would be even more grateful if you support me by buying me a cup of coffee:
