Phil Mizrahi is the Product Manager for the Data Map & Discovery team at Lyft, where he helps a company whose datasets and headcount are both growing fast make sense of its data. Previously, Phil worked as a product manager for a fintech startup in Berlin and as an M&A analyst for an investment bank in Paris, and served as an officer in the French Air Force.
Disrupting Data Discovery at Lyft with Amundsen
Before any analysis can begin, a data scientist needs to discover the right data sources to analyze, understand them, and determine whether they can trust them. Unfortunately, data discovery is very inefficient today. Countless hours are lost trying to find the right data to use.
Gaining trust in data requires running a series of queries (max timestamp, counts per day, count distincts, etc.) that waste time and lead to errors. There’s no clear way to find the people who can answer questions about a table. Worst of all, analyses are often redone and models rebuilt because previous work isn’t discoverable. This approach does not scale as the number of employees and data resources grows, which is the case at Lyft.
Lyft has solved this problem and reduced the time it takes to discover data by 10x by building its own data portal, Amundsen. Amundsen is built on three key pillars: an augmented data graph, an intuitive user experience, and centralized metadata.
1. An augmented data graph
Amundsen uses a graph database under the hood to store relationships between various data assets (tables, dashboards, protobuf events, etc.). What’s unique to Amundsen is that it treats people as first-class data assets: in other words, there’s a graph node for each person in the organization that connects to other nodes (like tables and dashboards).
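To make the idea of people as graph nodes concrete, here is a minimal in-memory sketch. It is not Amundsen's schema or API: the `DataGraph` class, the node kinds, the relation labels, and the sample names are all hypothetical, and a real deployment would use a graph database rather than Python dictionaries.

```python
from collections import defaultdict


class DataGraph:
    """Toy graph where people are nodes alongside tables and dashboards (illustrative only)."""

    def __init__(self):
        self.nodes = {}                # node key -> kind ("person", "table", "dashboard")
        self.edges = defaultdict(set)  # node key -> {(relation, neighbor key), ...}

    def add_node(self, key, kind):
        self.nodes[key] = kind

    def add_edge(self, src, relation, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        # Store the edge in both directions so it can be traversed from either side.
        self.edges[src].add((relation, dst))
        self.edges[dst].add((relation, src))

    def neighbors(self, key, kind=None):
        """Return sorted neighbor keys, optionally filtered by node kind."""
        return sorted({dst for _, dst in self.edges[key]
                       if kind is None or self.nodes[dst] == kind})


# Hypothetical sample data: two people, one table, one dashboard.
graph = DataGraph()
graph.add_node("alice", "person")
graph.add_node("bob", "person")
graph.add_node("rides", "table")
graph.add_node("weekly_rides", "dashboard")
graph.add_edge("alice", "owns", "rides")
graph.add_edge("bob", "frequently_uses", "rides")
graph.add_edge("weekly_rides", "reads_from", "rides")

people_to_ask = graph.neighbors("rides", kind="person")
```

With people in the graph, the question of whom to ask about a table becomes a one-hop lookup: `people_to_ask` holds the people connected to the `rides` table.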
2. An intuitive user experience
Amundsen runs PageRank using data from access logs to power search ranking, similar to how Google ranks web pages on the internet.
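As a rough illustration of ranking from access logs, the sketch below runs a plain power-iteration PageRank over a small graph. The way the graph is built is an assumption: each hypothetical `(user, table)` log entry becomes a weighted edge from the user to the table. The abstract does not say how Amundsen constructs its graph, so this shows the general technique rather than the actual pipeline.

```python
from collections import Counter, defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank. `edges` maps (src, dst) -> weight."""
    nodes = sorted({node for pair in edges for node in pair})
    n = len(nodes)

    # Total outgoing weight per node; nodes with none are "dangling".
    out_weight = defaultdict(float)
    for (src, _), weight in edges.items():
        out_weight[src] += weight

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        new_rank = {node: (1 - damping) / n + damping * dangling / n
                    for node in nodes}
        for (src, dst), weight in edges.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


# Hypothetical access log: (user, table) pairs, one per query.
access_log = [
    ("alice", "rides"),
    ("bob", "rides"),
    ("carol", "rides"),
    ("alice", "drivers"),
    ("bob", "rides"),
]

scores = pagerank(Counter(access_log))
tables = {table for _, table in access_log}
ranked_tables = sorted(tables, key=scores.get, reverse=True)
```

Here `rides` is queried by more users than `drivers`, so it comes first in `ranked_tables`; a search service could use such scores to order matching results.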
3. Centralized metadata
Amundsen gathers metadata from various sources (Hive, Presto, Airflow, etc.) and exposes it in one central place. Where best to store all this metadata is still a work in progress.
Phil shares ongoing efforts in this space, including RISELab’s Ground and WeWork’s Marquez projects. He gives a demo of Amundsen and its goals, leads a deep dive into Amundsen’s architecture, and explains how it achieves the three design pillars. He explains how this map of Lyft’s data is also leveraged for GDPR and CCPA compliance purposes.
Phil closes with the project’s future roadmap, the problems that remain unsolved, and how we can work together as an open source community to solve them.