diff --git a/website/www/site/assets/scss/_case_study.scss b/website/www/site/assets/scss/_case_study.scss
index 72c7ce3d94a7c..b1f42e5a35c6c 100644
--- a/website/www/site/assets/scss/_case_study.scss
+++ b/website/www/site/assets/scss/_case_study.scss
@@ -125,6 +125,7 @@
   .case-study-card-img img {
     height: 50px;
+    object-fit: scale-down;
     @media (min-width: $mobile) and (max-width: $tablet) {
       object-fit: contain;
     }
@@ -362,6 +363,9 @@ h2.case-study-h2 {
       }
     }
   }
+  .pb-0 {
+    padding-bottom: 30px;
+  }
 }

 .case-study-post {
@@ -386,6 +390,15 @@ h2.case-study-h2 {
     }
   }

+  .post-scheme--centered {
+    margin-left: auto;
+    margin-right: auto;
+
+    img {
+      width: 70%;
+    }
+  }
+
   @media screen and (max-width: $mobile) {
     .case-study-content {
       flex-direction: column;
diff --git a/website/www/site/content/en/case-studies/creditKarma.md b/website/www/site/content/en/case-studies/creditKarma.md
new file mode 100644
index 0000000000000..d6fb4ddc1cb5e
--- /dev/null
+++ b/website/www/site/content/en/case-studies/creditKarma.md
@@ -0,0 +1,256 @@
---
title: "Self-service Machine Learning Workflows and Scaling MLOps with Apache Beam"
name: "Credit Karma"
icon: "/images/logos/powered-by/credit-karma.png"
category: "study"
cardTitle: "Self-service Machine Learning Workflows and Scaling MLOps with Apache Beam"
cardDescription: "Apache Beam has future-proofed Credit Karma’s data and ML platform for scalability and efficiency, enabling MLOps with unified pipelines, processing 5-10 TB daily at 5K events per second, and managing 20K+ ML features."
authorName: "Avneesh Pratap"
coauthorName: "Raj Katakam"
authorPosition: "Senior Data Engineer II @ Credit Karma"
coauthorPosition: "Senior ML Engineer II @ Credit Karma"
authorImg: /images/case-study/credit_karma/avneesh_pratap.jpeg
coauthorImg: /images/case-study/credit_karma/raj_katakam.jpeg
publishDate: 2022-12-01T00:12:00+00:00
---

> “Apache Beam has been the ideal solution for us. Scaling, backfilling historical data, experimenting with new ML models and new use cases… it is all very easy to do with Beam.”

> “Apache Beam enabled self-service ML for our data scientists. They can plug in pieces of code, and those transformations will be automatically attached to models without any engineering involvement. Within seconds, our data science team can move from experimentation to production.”

With the [Apache Beam Dataflow runner](/documentation/runners/capability-matrix/), Credit Karma benefits from the [Google Cloud Dataflow](https://cloud.google.com/dataflow) managed service for increased scalability and efficiency. The Apache Beam [built-in I/O connectors](/documentation/io/built-in/) provide native support for a variety of sinks and sources, which allowed Credit Karma to seamlessly integrate Beam with the Google Cloud tools and services in their ecosystem, including [Pub/Sub](https://cloud.google.com/pubsub/docs/overview), [BigQuery](https://cloud.google.com/bigquery), and [Cloud Storage](https://cloud.google.com/storage).

Credit Karma leveraged an Apache Beam kernel and [Jupyter Notebook](https://jupyter.org/) to create an exploratory environment in Vega, enabling their data scientists to create new experimental data pipelines without engineering involvement.

The data scientists at Credit Karma mostly use [SQL](https://en.wikipedia.org/wiki/SQL) and [Python](https://www.python.org/) to create new pipelines. Apache Beam provides powerful [user-defined functions](/documentation/dsls/sql/extensions/user-defined-functions/) with multi-language capabilities that allow authoring scalar or aggregate functions in Java or Scala and invoking them in SQL queries. To democratize Scala transformations for their data science team, Credit Karma’s engineers abstracted the UDFs, [TensorFlow Transform](https://www.tensorflow.org/) logic, and other complex transformations into reusable and shareable “building blocks” that make up Credit Karma’s data and ML platform. Apache Beam and these custom abstractions allow data scientists to combine the building blocks into experimental pipelines and transformations that can be easily reproduced in staging and production environments. Credit Karma’s data science team commits their code changes to a common GitHub repository; the pipelines are then merged into a staging environment and combined into a production application.

The Apache Beam abstraction layer also plays a crucial part in operationalizing hypotheses and experiments into production pipelines when working with financial and otherwise sensitive information. Apache Beam enables masking and filtering data right inside the data pipelines, before it is written to the data warehouse. Credit Karma uses [Apache Thrift](https://thrift.apache.org/) annotations to label column metadata, and the Apache Beam pipelines filter specific elements based on those annotations before the data reaches the warehouse. Credit Karma’s data science team can use the available abstractions, or write data transformations on top of them, to calculate new metrics and validate ML models without seeing the actual data.

> When we started exploring Apache Beam, we found this programming model very promising. At first, we migrated just one partner [to an Apache Beam pipeline]. We were very impressed with the results and migrated other partner pipelines right away.

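As a rough illustration of the annotation-driven masking and filtering described above, the sketch below shows how sensitive fields could be dropped or masked inside a Beam pipeline before the data is written to BigQuery. This is a minimal Python sketch, not Credit Karma’s production code (their pipelines are written in Scio): the Pub/Sub topic, BigQuery table, field names, and the `SENSITIVE_FIELDS` mapping that stands in for Thrift annotation metadata are all hypothetical.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Stand-in for metadata that would be derived from Apache Thrift field
# annotations (e.g. a field labeled as PII or as non-disclosable financials).
SENSITIVE_FIELDS = {"ssn": "drop", "account_balance": "mask"}


def sanitize(record: dict) -> dict:
    """Drop or mask annotated fields before the record reaches the warehouse."""
    clean = {}
    for field, value in record.items():
        action = SENSITIVE_FIELDS.get(field)
        if action == "drop":
            continue                 # filter the element out entirely
        elif action == "mask":
            clean[field] = "***"     # keep the column, hide the value
        else:
            clean[field] = value
    return clean


def run(argv=None):
    options = PipelineOptions(argv, streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                topic="projects/example/topics/partner-events")
            | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "MaskAndFilter" >> beam.Map(sanitize)
            | "WriteToWarehouse" >> beam.io.WriteToBigQuery(
                "example-project:analytics.partner_offers",
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```

Downstream consumers of the warehouse table never see the dropped or masked values, which is what lets teams work with costs and financials without access to the underlying sensitive data.
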
Currently, about 20 Apache Beam pipelines are running in production, and over 100 experimental pipelines are on the way. Many of the upcoming experimental pipelines leverage Apache Beam stateful processing to compute user aggregates right inside the streaming pipelines instead of computing them in the data warehouse. Credit Karma’s data science team is also planning to leverage [Beam SQL](/documentation/dsls/sql/overview/) to use SQL syntax directly within stream processing pipelines and easily create aggregations. Apache Beam’s abstraction of the execution engine and its variety of runners allow Credit Karma to test data pipeline performance with different engines on mock data, create benchmarks, and compare the results across data ecosystems to optimize performance for specific use cases.

## Unified Stream & Batch Data Ingestion

Apache Beam enabled Credit Karma to revamp one of their most significant use cases: the data ingestion pipeline. Numerous Credit Karma partners send data about their financial products and offerings via gateways to Pub/Sub for downstream processing. The streaming Apache Beam pipeline, written in [Scio](https://spotify.github.io/scio/), consumes Pub/Sub topics in real time, works with deeply nested JSON data, and flattens it into the database row format. The pipeline also structures and partitions the data, then writes the outcome into the BigQuery data warehouse for ML model training.

The unified Apache Beam programming model executes the same business logic for batch and streaming use cases, which allowed Credit Karma to develop one uniform pipeline. The data ingestion pipeline handles both real-time ingestion and batched ingestion to backfill historical data from partners into the data warehouse. Some of Credit Karma’s partners send historical data through object stores like GCS or S3, while others use Pub/Sub. Apache Beam unifies batch and stream processing by creating bounded and [unbounded PCollections in the same pipeline](/documentation/basics/), depending on the use case: reading from an object store creates a bounded PCollection, while reading from a continuously updating Pub/Sub topic creates an unbounded one. When backfilling only new features for past dates, Credit Karma’s data engineering team configures the same Apache Beam pipeline to process the chunks of historical data sent by partners in a batch fashion: a job of finite length reads the entire data set once and joins historical data elements with the data for a particular date.

> Apache Beam helped us to ‘black-box’ the financial aspects and non-disclosable information so that teams can work with costs and financials without actually having access to all the data.

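The bounded/unbounded pattern behind that design can be sketched as follows. This is a simplified Python illustration only (Credit Karma’s ingestion pipeline is written in Scio), and the backfill path, Pub/Sub subscription, BigQuery table, and JSON fields are hypothetical. The flattening and write logic is shared; only the source switches between a finite object-store read and a continuous Pub/Sub read.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def flatten_event(raw: str) -> dict:
    """Flatten a deeply nested partner JSON payload into a flat warehouse row."""
    event = json.loads(raw)
    return {
        "partner_id": event["partner"]["id"],          # hypothetical nested fields
        "offer_id": event["offer"]["id"],
        "event_time": event["metadata"]["timestamp"],
    }


def run(backfill_path=None, argv=None):
    # One pipeline, two modes: continuous streaming ingestion or a finite backfill job.
    options = PipelineOptions(argv, streaming=backfill_path is None)
    with beam.Pipeline(options=options) as pipeline:
        if backfill_path:
            # Bounded PCollection: historical files partners dropped in an object store.
            raw = pipeline | "ReadBackfill" >> beam.io.ReadFromText(backfill_path)
        else:
            # Unbounded PCollection: continuously updating Pub/Sub subscription.
            raw = (
                pipeline
                | "ReadPubSub" >> beam.io.ReadFromPubSub(
                    subscription="projects/example/subscriptions/partner-events")
                | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
            )

        (
            raw
            | "FlattenJson" >> beam.Map(flatten_event)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "example-project:ingestion.partner_events",
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

Because both modes share the same transforms, a backfill run is just the same pipeline pointed at a bounded source, which is what makes backfilling historical data a configuration change rather than a separate codebase.
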
Currently, the data ingestion pipeline processes and transforms more than 100 million messages, along with regular backfills, which amounts to around 5-10 TB of data daily.

## Self-Service Machine Learning

At Credit Karma, the data scientists handle modeling and analyzing the data, so it was crucial for the company to give them the power and flexibility to easily create, test, and deploy new models. Apache Beam provided an abstraction that enables the data scientists to write their own transformations on the raw feature space for efficient ML engineering, while keeping the model serving layer independent of any custom code.

Apache Beam helped automate Credit Karma’s machine learning workflows: chaining and scoring models and preparing data for ML model training. The [Beam DataFrame API](/documentation/dsls/dataframes/overview/) helps identify and implement the required [preprocessing](/documentation/ml/data-processing/) steps and iterate toward production faster. Apache Beam’s built-in I/O transforms allow reading and writing [TensorFlow TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) files natively, and Credit Karma leverages this connectivity to preprocess data, score models, and use the model scores to recommend financial offers and content.

Apache Beam enables Credit Karma to process large volumes of data for [preprocessing and model validation](/documentation/ml/overview/) and to experiment with data during preprocessing. They use [TensorFlow Transform](https://www.tensorflow.org/tfx/tutorials/transform/simple) to apply the same transformations to data in batch and to real-time model inferences. The output of TensorFlow Transform is exported as a TensorFlow graph and attached to the models, making prediction services independent of any transformations. Credit Karma was able to offload ad hoc changes from prediction services by performing on-the-fly transformations on raw data rather than on aggregated data, which used to require the involvement of their data engineering team. Their data scientists can now write any type of transformation on the raw data in SQL and deploy new models without any changes to the infrastructure.

> With Apache Beam, you can easily add complex processing logic, for example, you can add configurable triggers on processing time. At the same time, Dataflow runner will manage execution for you, it uploads your executable code and dependencies automatically. And you have Dataflow auto-scaling working out of the box. You don’t have to worry about scaling horizontally.

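To give a sense of how a data scientist’s transformations end up attached to a model, the following is a simplified TensorFlow Transform on Beam sketch, not Credit Karma’s actual code: the feature names, in-memory example data, and output path are hypothetical. The `preprocessing_fn` is analyzed and applied by a Beam pipeline, and the resulting transform graph is written out so it can be attached to the served model, keeping the prediction service free of custom preprocessing code.

```python
import tempfile

import apache_beam as beam
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

# Hypothetical raw feature spec; the real feature space is much larger.
RAW_FEATURE_SPEC = {
    "credit_utilization": tf.io.FixedLenFeature([], tf.float32),
    "state": tf.io.FixedLenFeature([], tf.string),
}
RAW_METADATA = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec(RAW_FEATURE_SPEC))


def preprocessing_fn(inputs):
    """Transformations a data scientist plugs in; exported as part of the TF graph."""
    return {
        "credit_utilization_scaled": tft.scale_to_z_score(inputs["credit_utilization"]),
        "state_id": tft.compute_and_apply_vocabulary(inputs["state"]),
    }


def run():
    with beam.Pipeline() as pipeline, tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        raw_data = pipeline | beam.Create([
            {"credit_utilization": 0.42, "state": "CA"},
            {"credit_utilization": 0.77, "state": "NY"},
        ])
        # Analyze the full dataset in Beam and apply the transformations.
        (transformed_data, _), transform_fn = (
            (raw_data, RAW_METADATA)
            | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
        # The transform graph is saved and later attached to the served model,
        # so the prediction service stays independent of the preprocessing code.
        transform_fn | tft_beam.WriteTransformFn("gs://example-bucket/models/transform_fn")


if __name__ == "__main__":
    run()
```

In this setup, deploying a new set of transformations means committing a new `preprocessing_fn` and redeploying the model artifact, with no change to the serving infrastructure.
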
Apache Beam-powered ML pipelines have proven to be incredibly reliable, processing more than 100 million events and updating ML models with fresh data daily.

## Enabling Real-Time Data Availability

Credit Karma leverages machine learning to analyze user behavior and recommend the most relevant offers and content. Before using Apache Beam, collecting user actions (logs) across multiple systems required numerous manual steps and multiple tools, which hurt processing performance and caused back-and-forth between teams whenever changes were needed. Apache Beam helped automate this logging pipeline. The cross-system user session logs are recorded in Kafka topics and stored in Google Cloud Storage. A batch Apache Beam pipeline written in Scio parses the user actions for a particular tracking ID, transforms and cleans the data, and writes it to BigQuery.

> Apache Beam enabled self-service ML for our data scientists. They can plug in pieces of code, and those transformations will be automatically attached to models without any engineering involvement. Within seconds, our data science team can move DAGs from experimentation to production by just changing the deploy path.

> Now that we have migrated the logging pipeline to Apache Beam, we are very happy with its speed and performance, and we are planning to transform this batch pipeline into a streaming one.

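The log-parsing job described above follows a simple batch read-filter-write shape. The sketch below is a rough Python equivalent (the production pipeline is written in Scio), with hypothetical bucket, table, and log field names.

```python
import json

import apache_beam as beam


def parse_log(line: str) -> dict:
    """Parse one exported session log line (hypothetical structure) into a clean row."""
    event = json.loads(line)
    return {
        "tracking_id": event["tracking_id"],
        "action": event["action"],
        "event_time": event["timestamp"],
    }


def run(tracking_id: str, argv=None):
    with beam.Pipeline(argv=argv) as pipeline:
        (
            pipeline
            # Kafka-sourced session logs previously exported to Cloud Storage.
            | "ReadLogs" >> beam.io.ReadFromText("gs://example-bucket/user-session-logs/*.json")
            | "Parse" >> beam.Map(parse_log)
            | "FilterTrackingId" >> beam.Filter(lambda row: row["tracking_id"] == tracking_id)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "example-project:analytics.user_sessions",
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

Because Beam's model is the same for bounded and unbounded sources, moving this job from batch to streaming, as planned, would largely mean swapping the bounded GCS read for an unbounded source such as Pub/Sub while keeping the parsing and write steps intact.
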