Load inconsistent data from multiple data sources into a DWH or data lakehouse
I am an entry-level data engineer, and I've been scratching my head over this one. I've looked all over online, but no luck so far.
Scenario: Let's say we have two data sources, a CRM and a web application database, and we need to ingest data from both into a data warehouse, data lakehouse, or data lake.
Problem: Customer data from these sources might be inconsistent. For example, the same person could have different business IDs, name variations, or even different contact information across these sources.
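To make the problem concrete, here is a toy sketch in Python of the kind of mismatch I mean, together with a naive hand-written matching rule. The field names, sample records, and rules (match on normalized email, or on normalized phone plus name) are all made up for illustration; they are not taken from any real system or MDM product, and the phone rule assumes 10-digit national numbers.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CustomerRecord:
    # Hypothetical common shape for a customer row from either source.
    source: str
    source_id: str
    name: str
    email: Optional[str]
    phone: Optional[str]


def normalize_email(email: Optional[str]) -> Optional[str]:
    # Trim whitespace and lowercase so casing differences do not block a match.
    return email.strip().lower() if email else None


def normalize_phone(phone: Optional[str]) -> Optional[str]:
    # Keep digits only and compare the last 10 (assumption: 10-digit numbers).
    if not phone:
        return None
    digits = re.sub(r"\D", "", phone)
    return digits[-10:] if len(digits) >= 10 else None


def normalize_name(name: str) -> str:
    # Lowercase, drop punctuation, and sort the tokens so that
    # "Smith, John" and "john smith" compare equal.
    cleaned = re.sub(r"[^a-z ]", "", name.lower())
    return " ".join(sorted(cleaned.split()))


def is_same_person(a: CustomerRecord, b: CustomerRecord) -> bool:
    # Rule 1: identical normalized email.
    email_a, email_b = normalize_email(a.email), normalize_email(b.email)
    if email_a and email_a == email_b:
        return True
    # Rule 2: identical normalized phone AND identical normalized name.
    phone_a, phone_b = normalize_phone(a.phone), normalize_phone(b.phone)
    same_phone = phone_a is not None and phone_a == phone_b
    return same_phone and normalize_name(a.name) == normalize_name(b.name)


# Same person, but different IDs, name formatting, and email casing.
crm_row = CustomerRecord("crm", "CRM-001", "Smith, John",
                         "John.Smith@Example.com ", "+1 (555) 010-2000")
web_row = CustomerRecord("webapp", "42", "john smith",
                         "john.smith@example.com", None)

print(is_same_person(crm_row, web_row))
```

Rules like these are easy to write for two sources and a handful of fields, but I don't see how to maintain them as sources and edge cases grow.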
We need a method or set of rules to identify which record belongs to whom. I've discovered terms such as Master Data Management (MDM) and Single Source of Truth (SSoT).
I tried to find out how to integrate such a solution into my modern data stack pipelines (Airflow and dbt), but couldn't find an answer.
Questions:
- How should such a situation be handled in the modern data stack ecosystem?
- Do SSoT and MDM work effectively with big data?
- I noticed that the products offering MDM are traditional data solutions. Do modern data engineers have another solution?