Data Fundamentals Part 1: Understanding Data Roles and the Data Lifecycle

When I first got into data, I was overwhelmed by all the terms, definitions, and tasks: data engineering, analytics, BI, data science, modeling, OLTP, OLAP… It felt like everyone else already knew what it all meant, and I was just nodding along. What helped me most was taking a step back and reading up on data fundamentals.

Whether you dream of building dashboards, automating pipelines, or designing full-scale data warehouses, every data journey starts with understanding how information flows. How does data provide value to a business, and how do we get it into the right form to deliver that value?

This series — Data Fundamentals — is your first step in a larger, full-stack learning path. Over the coming weeks, you’ll move from concepts like roles and data types, through hands-on modeling in Python and SQL, all the way to production-style pipelines that load and serve clean data to BI tools.

In this first part of the Data Fundamentals series, we’ll explore how data roles fit together and walk through the data lifecycle — from raw information to meaningful insights. By the end, you’ll understand how the data journey really works, and where you might fit in.


1. Data Fundamentals: Roles Explained

Data drives almost every business decision today, but not everyone in the data space has the same responsibilities. So what’s the difference between a data engineer, a data analyst, and a data scientist? And where does a BI developer fit in?

Here’s how I’ve seen roles typically play out in real organizations:


At a high level:

| Role | Focus | Typical Output |
| --- | --- | --- |
| Data Engineer | Build pipelines, clean and integrate data | Reliable, structured data |
| Data Analyst / BI Developer | Explore and visualize data | Dashboards, reports |
| Data Scientist | Apply statistical models or machine learning | Forecasts, predictions |
| Data Architect | Design overall data systems | Architecture diagrams, standards |
| Data Ops / Platform Engineer | Automate and monitor data systems | ETL pipelines, CI/CD, orchestration |

These roles overlap. Sometimes one person does all of them, sometimes they’re split across whole teams. I have had weeks, especially during summer holidays, where I did basic analyses, edited Power BI reports, fixed a pipeline, and even edited a machine learning pipeline.

Engineers often act as the bridge between raw systems and analytical tools. Analysts and scientists use the data that engineers prepared to extract meaningful insights. Knowing the overlap helps avoid confusion and improves collaboration.

Comparing the Tools

Here’s a quick side-by-side to highlight the differences:

| Role | Tools |
| --- | --- |
| Data Engineer | Python, SQL, Airflow, dbt |
| Data Analyst | SQL, Excel, BI tools |
| Data Scientist | SQL, Python, TensorFlow, PyTorch |

Notice how SQL shows up in every role? Yes, you can’t go wrong with learning SQL. We will get there.


2. The Data Lifecycle: From Raw to Insight

I’ve noticed that beginners often skip understanding where data comes from. Yet grasping the data lifecycle is essential to building solid data fundamentals. When I started working with production systems, I imagined data just appearing in dashboards. In reality, it moves through several messy but structured steps — and understanding these is what separates a good analyst from a great data engineer.

Let’s look at the lifecycle as it really happens in most organizations:

Extract (Source Systems)
Everything begins in operational systems — ERPs, CRMs, IoT devices, web apps. These are OLTP systems optimized for fast inserts and updates.
💡 Example: At my job, we pull daily transactions from a banking core system through an API feed — it’s the “raw truth.”
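As a rough sketch of what such an extract step can look like in Python: the endpoint, token, and parameters below are placeholders for illustration, not the real banking API.

```python
import requests

# Hypothetical endpoint and token -- placeholders, not a real source system
API_URL = "https://example.com/api/v1/transactions"
API_TOKEN = "your-api-token"

def extract_daily_transactions(business_date: str) -> list[dict]:
    """Pull one day of raw transactions from the source system."""
    response = requests.get(
        API_URL,
        params={"date": business_date},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()  # fail loudly instead of silently losing data
    return response.json()

raw_rows = extract_daily_transactions("2023-02-01")
print(f"Extracted {len(raw_rows)} raw transactions")
```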

Stage (Landing Area)
Data lands in a staging layer — usually raw tables or flat files that mirror the source. Nothing is cleaned yet. This is your “safety copy.”
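A minimal sketch of a staging step, assuming we simply land the raw rows untouched in a date-partitioned folder; the paths and file names are made up.

```python
import json
from datetime import date
from pathlib import Path

def stage_raw(rows: list[dict], load_date: date, landing_dir: str = "landing/transactions") -> Path:
    """Write the raw extract as-is to a date-partitioned file -- the 'safety copy'."""
    target_dir = Path(landing_dir) / load_date.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / "raw.json"
    target_file.write_text(json.dumps(rows, indent=2))
    return target_file

# Invented sample rows standing in for the extract output
sample_rows = [{"order_id": 1234, "order_date": "2023-02-01", "amount": 59.99}]
print(stage_raw(sample_rows, date(2023, 2, 1)))
```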

Clean & Transform (ETL / ELT)
This is where engineers fix formats, join related tables, and calculate metrics. The messy data becomes usable.
Example: removing duplicates, converting currencies, or enriching records with master data.
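Here is a minimal pandas sketch of exactly those three operations; the column names, exchange rate, and data are all invented for illustration.

```python
import pandas as pd

# Invented raw data with a duplicate row
raw = pd.DataFrame({
    "order_id": [1234, 1234, 1235],
    "amount_usd": [65.30, 65.30, 12.00],
    "customer_id": [12, 12, 99],
})
# Invented master data used for enrichment
customers = pd.DataFrame({"customer_id": [12, 99], "segment": ["Retail", "Corporate"]})

USD_TO_EUR = 0.92  # assumed static rate, for the example only

clean = (
    raw.drop_duplicates(subset="order_id")                              # remove duplicates
       .assign(amount_eur=lambda df: df["amount_usd"] * USD_TO_EUR)     # convert currency
       .merge(customers, on="customer_id", how="left")                  # enrich with master data
)
print(clean)
```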

Model (Warehouse or Lakehouse)
Clean data is reshaped into fact and dimension tables — the backbone for analytics.

Facts capture measurable events (e.g., sales, logins).

Dimensions describe context (e.g., customer, product, region).
This layer lives in your data warehouse (like Snowflake, BigQuery, Synapse).
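As a small sketch of that reshaping, here is a pandas example that splits an invented flat table into one fact table and one dimension table; in a real warehouse this would typically be done with SQL or dbt models instead.

```python
import pandas as pd

# One flat table mixing measurable events and descriptive context (invented data)
flat = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_date": ["2023-02-01", "2023-02-01", "2023-02-02"],
    "amount": [59.99, 120.00, 15.50],
    "customer_id": [12, 12, 99],
    "customer_name": ["Acme AS", "Acme AS", "Nordic Ltd"],
    "region": ["Oslo", "Oslo", "Bergen"],
})

# Dimension: one row per customer, describing context
dim_customer = (
    flat[["customer_id", "customer_name", "region"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Fact: one row per order, keeping only keys and measures
fact_orders = flat[["order_id", "order_date", "customer_id", "amount"]]

print(dim_customer)
print(fact_orders)
```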

Serve (Semantic Layer / BI)
Analysts and BI tools like Power BI or Looker connect here. The data is aggregated, filtered, and visualized.
💡 From experience: This is where naming conventions and business logic consistency really matter.

Archive & Monitor (Governance)
Historical data is archived for traceability, and data quality is continuously checked. Monitoring ensures pipelines don’t silently fail.
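Monitoring can be as simple as a row-count and freshness check after every load. A minimal sketch, where the thresholds, column names, and data are made up:

```python
import pandas as pd
from datetime import date

def check_load(fact: pd.DataFrame, expected_min_rows: int = 1) -> None:
    """Tiny data-quality gate: fail loudly instead of failing silently."""
    if len(fact) < expected_min_rows:
        raise ValueError(
            f"Load too small: {len(fact)} rows (expected at least {expected_min_rows})"
        )
    latest = fact["order_date"].max()
    if latest < date.today().isoformat():
        print(f"Warning: newest row is from {latest}; the upstream feed may be delayed")

# Invented fact table for illustration
fact_orders = pd.DataFrame({"order_date": ["2023-02-01", "2023-02-02"], "amount": [59.99, 15.50]})
check_load(fact_orders)
```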

3. How Data Moves in Real Life

Here’s a theoretical pipeline (with a minimal code sketch after the list):

  1. Operational Systems (OLTP) – CRM, ERP, HR, and transactional databases.
  2. Integration Layer (Staging) – Extracted data lands here, validated and cleaned before moving to the warehouse.
  3. Data Warehouse (OLAP) – Fact and dimension tables reside here, optimized for analytics.
  4. Semantic / Model Layer – Tools like Power BI or Tableau define relationships and business logic.
  5. Analytics & Reporting – Analysts create dashboards, share insights, and answer business questions.
  6. Feedback & Automation – Insights feed back into operational processes.
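To make steps 1 through 3 concrete, here is a deliberately over-simplified orchestration sketch. In practice a scheduler like Airflow would run these steps on a schedule, and every function body below is a placeholder.

```python
from datetime import date

def extract(run_date: date) -> list[dict]:
    """Step 1: pull raw rows from the operational system (placeholder data)."""
    return [{"order_id": 1234, "order_date": run_date.isoformat(), "amount": 59.99}]

def stage(rows: list[dict]) -> list[dict]:
    """Step 2: land the raw rows untouched (placeholder for writing to storage)."""
    return rows

def transform(rows: list[dict]) -> list[dict]:
    """Step 2: validate and clean before loading to the warehouse."""
    return [r for r in rows if r["amount"] is not None]

def load_to_warehouse(rows: list[dict]) -> None:
    """Step 3: write fact/dimension tables (placeholder print)."""
    print(f"Loaded {len(rows)} rows into the warehouse")

def run_pipeline(run_date: date) -> None:
    load_to_warehouse(transform(stage(extract(run_date))))

run_pipeline(date(2023, 2, 1))
```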

💡 Personal insight: When I first saw this pipeline in action, I underestimated how much thought goes into staging and validating data before it hits dashboards. This includes talking with business users so that we can prepare the data in the format the business actually needs. Remember: as long as you provide value to the company, you get to keep your job.


4. Where Data Comes From

In real-world data work, you’re rarely starting with a clean, curated dataset. Instead, you’re pulling from messy, diverse sources — each with its own quirks. Here are the main categories:

Operational Systems

These are the systems businesses use to run their day-to-day operations:

  • ERP systems (e.g. SAP, Dynamics): manage inventory, finance, HR.
  • CRM systems (e.g. Salesforce, HubSpot): track customer interactions.
  • E-commerce platforms (e.g. Shopify, Magento): handle orders, payments.

These systems generate transactional data — things like:

“Order #1234, placed on 2023-02-01, by Customer 12, amount €59.99.”

This data is usually stored in relational databases (SQL), and is highly structured.
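Because this data is relational and structured, reading it into Python is usually a one-liner. A sketch using an in-memory SQLite database as a stand-in for a real ERP or CRM backend:

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the real operational database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, customer_id INTEGER, amount REAL)"
)
conn.execute("INSERT INTO orders VALUES (1234, '2023-02-01', 12, 59.99)")

# Structured, tabular data maps straight into a DataFrame
orders = pd.read_sql("SELECT * FROM orders", conn)
print(orders)
```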

Logs & Events

Modern applications generate tons of event data:

  • Clickstreams: what users click on, how they navigate.
  • System logs: server activity, errors, performance metrics.
  • Web APIs: endpoints that expose data from apps or services.

This data is often semi-structured (JSON) and stored in data lakes or NoSQL systems. It’s great for behavioral analysis, but can be noisy and hard to model.
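A small sketch of flattening such semi-structured events with pandas; the event shape below is invented.

```python
import pandas as pd

# Invented clickstream events -- nested JSON, as they often arrive from web apps
events = [
    {"user_id": 12, "event": "click", "page": {"url": "/pricing", "section": "header"}},
    {"user_id": 99, "event": "click", "page": {"url": "/docs", "section": "sidebar"}},
]

# json_normalize flattens the nested 'page' object into page.url / page.section columns
flat_events = pd.json_normalize(events)
print(flat_events)
```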

Files & Flat Data

Still very common in many organizations:

  • Excel spreadsheets: budgets, reports, exports from other systems.
  • CSV files: simple tabular data, often used for sharing.
  • JSON & XML: structured but nested, often from APIs.
  • Parquet: columnar format used in big data environments.

These are often the starting point for ad-hoc analysis or prototyping.
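pandas can read each of these formats directly. A sketch that assumes local files with these made-up names exist (Excel and Parquet support need the optional openpyxl and pyarrow packages):

```python
import pandas as pd

# Each call assumes a local file with that (made-up) name exists
budget = pd.read_excel("budget_2023.xlsx")           # Excel export
sales = pd.read_csv("sales_export.csv")              # flat CSV
api_dump = pd.read_json("customers.json")            # JSON, e.g. from an API
big_table = pd.read_parquet("transactions.parquet")  # columnar big-data format

print(sales.head())
```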

APIs & External Feeds

Sometimes you enrich your internal data with external sources:

  • Open Data portals: government stats, weather, demographics.
  • Kaggle datasets: great for learning and experimentation.
  • Third-party providers: financial data, market research, etc.

These can add valuable context — like population data to support sales analysis.

Manual / Reference Data

  • Master tables, lookup lists, or configuration files.
  • Small, but critical for consistency.

💡 Lesson from my work: Using extract and archive tables to snapshot raw data daily is key. It allows you to rebuild history or trace errors — especially when upstream systems change unexpectedly.
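A minimal sketch of that daily snapshot idea: copy each raw extract into a date-stamped archive file so history can be rebuilt later. The paths, table name, and file format here are just examples.

```python
from datetime import date
from pathlib import Path
import pandas as pd

def archive_snapshot(df: pd.DataFrame, name: str, archive_dir: str = "archive") -> Path:
    """Save today's raw extract as a date-stamped file for traceability."""
    target = Path(archive_dir) / name / f"{date.today().isoformat()}.parquet"
    target.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(target, index=False)
    return target

# Invented raw extract
raw = pd.DataFrame({"order_id": [1234], "amount": [59.99]})
print(archive_snapshot(raw, "orders"))
```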


5. Why Modeling Matters

It’s tempting to throw raw CSVs into Power BI and start building visuals. And yes, you’ll get something on the screen. But without a proper model, you’ll quickly hit walls:

  • Totals that don’t add up.
  • Filters that behave strangely.
  • Reports that break when new data arrives.

Modeling is what turns spaghetti into structure. It’s the process of organizing your data so it’s consistent, scalable, and easy to analyse.


6. Why Understanding Roles and Flow Matters

The magic of data work comes when technical systems and business needs align:

  • Engineers prepare the data.
  • Analysts interpret it.
  • Leaders make decisions.

Knowing this bridge makes collaboration easier and prevents you from wasting time wondering “who should do this?”.

7. Exercise: Explore Your First Dataset


Download the Superstore dataset. If you are new to Kaggle, you will have to create an account. Trust me, it’s worth it. Once you have downloaded the file, go ahead and open it in Excel (or Google Sheets).


Take a few minutes to scroll through the rows and columns. Ask yourself:

  • Which columns look like transactions (things that happen: orders, dates, amounts)?
  • Which columns look like descriptions (things about customers or products)?

This exercise will help you a lot when we learn about modeling. It gets you thinking about whether data fields (columns) are transactions (a fact table) or descriptions (dimension tables).

Don’t worry if you’re unsure — we’ll revisit this when we start modeling in a later part. The main point to note is that the transactional and descriptive columns are mixed together in one big table! While we can work with this in Power BI (and we will do so soon), Power BI becomes far more effective when the data is split into fact tables and dimension tables.
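If you prefer Python over Excel for this kind of poking around, here is a small sketch of the same exercise. The file name below is a common name for the Kaggle download, so adjust the path and encoding to match your file.

```python
import pandas as pd

# Adjust the file name/path and encoding to match your Kaggle download
superstore = pd.read_csv("Sample - Superstore.csv", encoding="latin-1")

print(superstore.shape)   # how many rows and columns?
print(superstore.dtypes)  # numeric and date columns often hint at facts
print(superstore.head())  # text columns often describe customers/products (dimensions)
```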


What’s next?

This was the warm-up. You now know the landscape: roles, data flows, and key concepts.

In Part 2, we’ll explore the core concepts that power these pipelines:

  • Structured vs unstructured data
  • OLTP vs OLAP
  • Fact and dimension tables
  • Keys, granularity, and data modeling basics

🛠️ Mini-challenge: Pick one dataset you encounter daily. Ask yourself: Is this transactional or dimensional data? Understanding this distinction is the first step toward thinking like a data engineer.


Like my articles?

You are welcome to comment on this post, or share it with friends. I would be even more grateful if you support me by buying me a cup of coffee:

Buy me a coffee
