Breaking Down Data Silos With a Unified Data Warehouse: An Apache Doris-Based CDP

This post was originally published on DZone (IoT)

The data silo problem is like arthritis for online businesses: almost every one of them develops it with age. Businesses interact with customers via websites, mobile apps, H5 pages, and end devices, and for one reason or another it is tricky to integrate the data from all these sources. The data stays where it was collected and cannot be correlated for further analysis. That is how data silos form. The bigger your business grows, the more diverse your customer data sources become, and the more likely you are to be trapped by data silos.

This is exactly what happened to the insurance company I’m going to talk about in this post. By 2023, they had already served over 500 million customers and signed 57 billion insurance contracts. When they started to build a customer data platform (CDP) to accommodate data at that scale, they used multiple components.

Data Silos in CDP

Like most data platforms, their CDP 1.0 had a batch processing pipeline and a real-time streaming pipeline. Spark jobs loaded offline data into Impala, where it was tagged and divided into groups. Meanwhile, Spark also sent the data to NebulaGraph for OneID computation (elaborated later in this post). On the other hand,

Read the rest of this post, which was originally published on DZone (IoT).
