November 21, 2024

Notion Builds In-House Data Lake to Power Analytics and Machine Learning

Facing massive data growth, productivity app Notion largely replaced its traditional data pipeline with a custom-built data lake that uses Apache Spark and Hudi on AWS. The new architecture slashed data ingestion time and unlocked millions in savings, paving the way for advanced features like AI.
Notion, the popular all-in-one workspace app, has seen explosive user growth, doubling its user data every 6-12 months. This surge presented a challenge for the company’s data infrastructure. Their existing system, which relied on Postgres for both online traffic (user requests) and offline traffic (data analytics and machine learning jobs), began to strain.
To address this, Notion built a new data pipeline in 2021 using Fivetran and Snowflake. While effective for initial data warehousing, this approach came with limitations. Maintaining hundreds of Fivetran connectors proved burdensome, and the data warehouse wasn’t optimized for Notion’s update-heavy workloads, leading to slow and expensive data ingestion. Additionally, the standard SQL interface of off-the-shelf data warehouses made implementing complex data transformations difficult.
To overcome these hurdles, Notion opted for a custom data lake solution. The new architecture uses Debezium to capture changes from Postgres, Apache Spark for data processing and transformation, AWS S3 for data storage, and Apache Hudi to manage updates to the data in S3. The shift offers several advantages. First, it cuts data ingestion time from days to hours or even minutes. Second, it delivers substantial cost savings, which Notion can reinvest in product development. Finally, the new data lake gives Notion the foundation to integrate advanced features like AI into the application.
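To make the update-handling step concrete, here is a minimal pure-Python sketch of what it means to fold a stream of change events into a table keyed by primary key. It is not Notion's code and does not use the Spark or Hudi APIs. The event fields `op`, `before`, `after`, and `ts_ms` follow Debezium's change event envelope, while the function name `apply_change_events`, the `id` key, and the sample rows are hypothetical. In the real pipeline, Spark and Hudi would do this bookkeeping at scale against files in S3.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_change_events(
    snapshot: Dict[Any, Row],
    events: Iterable[Dict[str, Any]],
    key: str = "id",  # hypothetical primary-key column
) -> Dict[Any, Row]:
    """Fold Debezium-style change events into a keyed table snapshot."""
    result = dict(snapshot)
    # Newest event timestamp applied per key, so a late-arriving
    # older event cannot overwrite fresher data.
    latest_ts: Dict[Any, int] = {}
    for event in events:
        op = event["op"]  # "c" create, "u" update, "d" delete, "r" snapshot read
        ts = event["ts_ms"]
        # Deletes carry the row in "before"; every other op carries it in "after".
        row = event["before"] if op == "d" else event["after"]
        pk = row[key]
        if ts < latest_ts.get(pk, -1):
            continue  # stale event, skip it
        latest_ts[pk] = ts
        if op == "d":
            result.pop(pk, None)
        else:
            result[pk] = row  # insert or overwrite: an upsert
    return result


events = [
    {"op": "c", "ts_ms": 1, "before": None, "after": {"id": 1, "title": "Draft"}},
    {"op": "u", "ts_ms": 2, "before": {"id": 1, "title": "Draft"},
     "after": {"id": 1, "title": "Final"}},
    {"op": "c", "ts_ms": 3, "before": None, "after": {"id": 2, "title": "Temp"}},
    {"op": "d", "ts_ms": 4, "before": {"id": 2, "title": "Temp"}, "after": None},
]

table = apply_change_events({}, events)
print(table)  # {1: {'id': 1, 'title': 'Final'}}
```

The sketch shows why an update-heavy workload differs from an insert-heavy one: most events rewrite an existing row rather than append a new one, so the storage layer has to locate and replace records instead of simply adding them.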
Notably, Notion hasn’t entirely abandoned Snowflake and Fivetran. They acknowledge Snowflake’s value for insert-heavy workloads and Fivetran’s effectiveness for specific use cases. However, their custom data lake built with open-source technologies provides the flexibility and scalability needed to support Notion’s ever-growing data needs.