
How to Simplify CDC With Delta Lake’s Change Data Feed


Try this notebook in Databricks
 
Change data capture (CDC) is a use case that we see many customers implement in Databricks – you can check out our previous deep dive on the topic here. Typically, we see CDC used in an ingestion-to-analytics architecture called the medallion architecture, which takes raw data landed from source systems and refines it through bronze, silver and gold tables. CDC and the medallion architecture provide multiple benefits to users, since only changed or added data needs to be processed. In addition, the different tables in the architecture allow different personas, such as data scientists and BI analysts, to use the correct, up-to-date data for their needs. We are happy to announce the exciting new Change Data Feed (CDF) feature in Delta Lake, which builds on Delta Lake’s MERGE operation and log versioning to make this architecture even simpler to implement!

Typical Architecture Pattern where Change Data Feed applies

Why is the CDF feature needed?

Many customers use Databricks to perform CDC, as it is simpler to implement with Delta Lake compared to other Big Data technologies. However, even with the right tools, CDC can still be challenging to execute. We designed CDF to make coding even simpler and address the biggest pain points around CDC, including:

  • Quality control – Row-level changes between table versions are hard to obtain.
  • Inefficiency – Because Delta tracks changes at the file level rather than the row level, it can be inefficient to account for rows that have not changed.

Here is how Change Data Feed (CDF) implementation helps resolve the above issues:

  • Simplicity and convenience – Uses a common, easy-to-use pattern for identifying changes, making your code simple, convenient and easy to understand.
  • Efficiency – The ability to read only the rows that changed between versions makes downstream consumption of MERGE, UPDATE and DELETE operations extremely efficient.

CDF captures changes only from a Delta table and is only forward-looking once enabled.

Change Data Feed in Action!

Let’s dive into an example of CDF for a common use case: financial predictions. The notebook referenced at the top of this blog ingests financial data. Estimated Earnings Per Share (EPS) is financial data from analysts predicting a company’s quarterly earnings per share. The raw data can come from many different sources and from multiple analysts for multiple stocks.

With the CDF feature, the data is simply inserted into the bronze table (raw ingestion), then filtered, cleaned and augmented in the silver table and, finally, aggregate values are computed in the gold table based on the changed data in the silver table.

While these transformations can get complex, the row-based CDF feature thankfully keeps them simple and efficient. But how do you use it? Let’s dig in!

NOTE: The example here focuses on the SQL version of CDF and on one specific way to use the operations; to evaluate variations, please see the documentation here.

Enabling CDF on a Delta Lake Table

To have the CDF feature available on a table, you must first enable it on that table. Below is an example of enabling CDF for the bronze table at table creation. You can also enable CDF on an existing table as a table property update, or enable it on a cluster for all tables created by that cluster. For these variations, please see the documentation here.
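Here is a minimal sketch of these options from a PySpark notebook cell, using the notebook's built-in spark session; the table names and columns are illustrative assumptions, not the actual schema from the referenced notebook:

# Option 1: enable CDF when the table is created (illustrative schema)
spark.sql("""
    CREATE TABLE IF NOT EXISTS eps_bronze (
        stock_symbol STRING,
        analyst STRING,
        eps_estimate DOUBLE,
        est_date DATE
    )
    USING DELTA
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Option 2: enable CDF on an existing table by updating its table properties
spark.sql("ALTER TABLE eps_silver SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Option 3: enable CDF by default for all new tables created in this cluster/session
spark.conf.set("spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true")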


Change Data Feed is a forward-looking feature: it captures changes only after the table property is set, not earlier.

Querying the change data

To query the change data, use the table_changes operation. The example below includes inserted rows and two rows that represent the pre- and post-image of an updated row, so that we can evaluate the differences in the changes if needed. There is also a delete Change Type that is returned for deleted rows.

How Change Data Feed rows are created

This example accesses the changed records based on the starting version, but you can also cap the versions based on the ending version, as well as starting and ending timestamps if needed. This example focuses on SQL, but there are also ways to access this data in Python, Scala, Java and R. For these variations, please see the documentation here.

Example Change Data Feed where change records are accessed using the starting version.
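As a hedged sketch of both access paths, SQL and the Python DataFrame reader mentioned above, with an illustrative table name and starting version:

# SQL: all changes to the table since version 2
changes_sql = spark.sql("SELECT * FROM table_changes('eps_silver', 2)")

# Python DataFrame reader: same result, with optional ending version or timestamps
changes_df = (
    spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 2)
        # .option("endingVersion", 5)
        # .option("startingTimestamp", "2021-06-01 00:00:00")
        .table("eps_silver")
)

# Each row carries _change_type ('insert', 'update_preimage', 'update_postimage'
# or 'delete') plus _commit_version and _commit_timestamp metadata columns
changes_df.select("_change_type", "_commit_version").show()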

Using CDF row data in a MERGE statement

Aggregate MERGE statements, like the merge into the gold table, can be complex by nature, but the CDF feature makes the coding of these statements simpler and more efficient.

Diagram of how Change Data Feed rows are used in a MERGE statement

As seen in the above diagram, CDF makes it simple to derive which rows have changed, as it performs the needed aggregation only on the data that has changed or is new, using the table_changes operation. Below, you can see how to use the changed data to determine which dates and stock symbols have changed.

Example Change Data Feed where the change data is used to determine which rows have changed.

As shown below, you can use the changed data from the silver table to aggregate only the rows that need to be updated or inserted into the gold table. To do this, use an INNER JOIN on table_changes('table_name', version).

Example Change Data Feed where the changed data from a silver table is used to aggregate the data to only those rows that need to be updated.
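One possible shape of that gold-table MERGE, sketched with illustrative table and column names (the notebook's actual schema differs):

spark.sql("""
    MERGE INTO eps_gold g
    USING (
        -- Re-aggregate only the (date, symbol) pairs that changed in the silver table
        SELECT s.est_date,
               s.stock_symbol,
               AVG(s.eps_estimate) AS avg_eps_estimate
        FROM eps_silver s
        INNER JOIN (
            SELECT DISTINCT est_date, stock_symbol
            FROM table_changes('eps_silver', 2)
            WHERE _change_type != 'update_preimage'
        ) c
            ON s.est_date = c.est_date AND s.stock_symbol = c.stock_symbol
        GROUP BY s.est_date, s.stock_symbol
    ) src
    ON g.est_date = src.est_date AND g.stock_symbol = src.stock_symbol
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")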

The end result is a clear and concise version of a gold table that can incrementally change over time!

Example gold table updated using the Change Data Feed feature.

Typical use cases

Here are some common use cases and benefits of the new CDF feature:

Silver & gold tables

Improve Delta performance by processing only the changes following the initial MERGE comparison, accelerating and simplifying ETL/ELT operations.

Materialized views

Create up-to-date, aggregated views of information for use in BI and analytics without having to reprocess the full underlying tables, instead updating only where changes have come through.

Transmit changes

Send the Change Data Feed to downstream systems, such as Kafka or an RDBMS, that can use it to process later stages of data pipelines incrementally.
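As a sketch of what this could look like with Structured Streaming, where the broker, topic, table and checkpoint names are placeholders:

# Stream the change feed of a CDF-enabled table into a Kafka topic
(spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .table("eps_silver")
    .selectExpr("to_json(struct(*)) AS value")   # the Kafka sink expects a 'value' column
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "eps_silver_changes")
    .option("checkpointLocation", "/tmp/checkpoints/eps_silver_to_kafka")
    .start())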

Audit trail table

Capturing Change Data Feed outputs as a Delta table provides perpetual storage and efficient query capability to see all changes over time, including when deletes occur and what updates were made.
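A minimal sketch of such an audit table, again with placeholder table and checkpoint names, continuously appending every change image to its own Delta table:

(spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("eps_silver")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/eps_silver_audit")
    .outputMode("append")
    .toTable("eps_silver_audit"))   # retains every insert, update image and delete over time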

When to use Change Data Feed

Best practices for when and when not to use the Delta Lake Change Data Feed feature.

Conclusion

At Databricks, we strive to make the impossible possible and the hard simple. CDC, Log versioning and MERGE implementation were virtually impossible at scale until Delta Lake was created. Now we are making it simpler and more efficient with the exciting Change Data Feed (CDF) feature!

Try this notebook in Databricks



How to Build a Scalable Wide and Deep Product Recommender


Download the notebooks referenced throughout this article.

I have a favorite coffee shop I’ve been visiting for years. When I walk in, the barista knows me by name and asks if I’d like my usual drink. Most of the time, the answer is “yes”, but every now and then, I see they have some seasonal items and ask for a recommendation. Knowing I usually order a lightly sweetened latte with an extra shot of espresso, the barista might recommend the dark chocolate mocha — not the fruity concoction piled high with whipped cream and sprinkles. The barista’s knowledge of my explicit preferences and their ability to generalize based on my past choices provides me with a highly-personalized experience. And because I know the barista knows and understands me, I trust their recommendations.


Much like the barista at my favorite coffee shop, wide-and-deep learning for recommender systems has the ability to both memorize and generalize product recommendations based on user behavior and customer interactions. First introduced by Google for use in its Google Play app store, the wide-and-deep machine learning (ML) model has become popular in a variety of online scenarios for its ability to personalize user engagements, even in ‘cold start problem’ scenarios with sparse data inputs.

The goal with wide-and-deep recommenders is to provide the same level of customer intimacy that, for example, our favorite barista does. This model uses explicit and implicit feedback to expand the consideration set for customers. Wide-and-deep recommenders go beyond the simple weighted averaging of customer feedback found in some collaborative filters to balance what is understood about the individual with what is known about similar customers. If done properly, the recommendations make the customer feel understood, and this should translate into greater value for both the customer and the business.

Understanding the model design

To understand the concept of wide-and-deep recommendations, it’s best to think of it as two separate, but collaborating, engines. The wide model, often referred to in the literature as the linear model, memorizes users and their past product choices. Its inputs may consist simply of a user identifier and a product identifier, though other attributes relevant to the pattern (such as time of day) may also be incorporated.


Figure 1. A conceptual interpretation of the wide and deep model

The deep portion of the model, so named as it is a deep neural network, examines the generalizable attributes of a user and their product choices. From these, the model learns the broader characteristics that tend to favor users’ product selections.

Together, the wide and deep submodels are trained on historical product selections by individual users to predict future product selections. The end result is a single model capable of calculating the probability with which a user will purchase a given item, given both memorized past choices and generalizations about a user’s preferences. These probabilities form the basis for user-specific product rankings, which can be used for making recommendations.

Building the model

The intuitive logic of the wide-and-deep recommender belies the complexity of its actual construction. Inputs must be defined separately for the wide and deep portions of the model, and the two must be trained in a coordinated manner to arrive at a single output, yet tuned using optimizers specific to the nature of each submodel. Thankfully, the TensorFlow DNNLinearCombinedClassifier estimator provides a pre-packaged architecture, greatly simplifying the assembly of the overall model.
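A minimal sketch of how the estimator might be assembled; the feature names, bucket sizes, embedding dimensions and hidden-unit sizes are illustrative assumptions rather than the settings used in the referenced notebooks:

import tensorflow as tf

# Illustrative identifier columns for users and products
user_id = tf.feature_column.categorical_column_with_hash_bucket("user_id", hash_bucket_size=10000)
product_id = tf.feature_column.categorical_column_with_hash_bucket("product_id", hash_bucket_size=5000)

# Wide (linear) side: memorize exact user x product co-occurrences via a crossed column
wide_columns = [tf.feature_column.crossed_column([user_id, product_id], hash_bucket_size=100000)]

# Deep side: generalize from dense embeddings of the same identifiers
deep_columns = [
    tf.feature_column.embedding_column(user_id, dimension=32),
    tf.feature_column.embedding_column(product_id, dimension=32),
]

# One model, two submodels, each with an optimizer suited to its nature
estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    linear_optimizer="Ftrl",         # sparse, wide submodel
    dnn_feature_columns=deep_columns,
    dnn_optimizer="Adagrad",         # dense, deep submodel
    dnn_hidden_units=[128, 64, 32],
)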

Training

The challenge for most organizations is then training the recommender on the large number of user-product combinations found within their data. Using Petastorm, an open-source library for serving large datasets assembled in Apache Spark™ to TensorFlow (and other ML libraries), we can cache the data on high-speed, temporary storage and then read that data into the model in manageable increments during training. In doing so, we limit the memory overhead associated with the training exercise while preserving performance.
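A compressed sketch of that Petastorm pattern; the source table, cache path, batch size and label column name are placeholder assumptions:

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Placeholder source: a Spark DataFrame of historical user-product selections
training_df = spark.table("user_product_selections")

# Cache Parquet shards of the DataFrame on fast, temporary storage (placeholder path)
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")
converter = make_spark_converter(training_df)

# Read the cached data back as a tf.data.Dataset in manageable batches, so training
# never needs the full dataset in memory at once
with converter.make_tf_dataset(batch_size=1024, num_epochs=1) as dataset:
    dataset = dataset.map(lambda batch: (batch._asdict(), batch.label))
    # ... feed dataset into the wide-and-deep model's training loop here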

Tuning

Tuning the model becomes the next challenge. Various model parameters control its ability to arrive at an optimal solution. The most efficient way to work through the potential parameter combinations is simply to iterate through some number of training cycles, comparing the models’ evaluation metrics on each run to identify the ideal parameter combinations. By leveraging Hyperopt with SparkTrials, we can parallelize this work across many compute nodes, allowing the optimizations to be performed in a timely manner.
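A small sketch of that pattern; the search space, the placeholder objective function and the parallelism setting are illustrative assumptions:

from hyperopt import fmin, tpe, hp, SparkTrials

# Illustrative search space over two model parameters
search_space = {
    "dnn_dropout": hp.uniform("dnn_dropout", 0.0, 0.5),
    "learning_rate": hp.loguniform("learning_rate", -6, -2),
}

def train_and_evaluate(params):
    # Placeholder objective: the real function would train the wide-and-deep model
    # with these parameters and return its validation loss
    return params["dnn_dropout"] ** 2 + params["learning_rate"]

# SparkTrials farms each trial out to a Spark worker so evaluations run in parallel
best_params = fmin(
    fn=train_and_evaluate,
    space=search_space,
    algo=tpe.suggest,
    max_evals=32,
    trials=SparkTrials(parallelism=8),
)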

Deploying

Finally, we need to deploy the model for integration with various retail applications. Leveraging MLflow allows us to both persist our model and package it for deployment across a wide variety of microservices layers, including Azure Machine Learning, AWS SageMaker, Kubernetes and Databricks Model Serving.
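As a hedged sketch of that last step, logging the trained model with MLflow and registering it for downstream serving; the export path, metric value and model names are placeholders:

import mlflow
import mlflow.pyfunc
import mlflow.tensorflow

with mlflow.start_run() as run:
    mlflow.log_metric("validation_auc", 0.81)  # placeholder metric value
    # Persist the exported SavedModel (e.g., from estimator.export_saved_model)
    mlflow.tensorflow.log_model(
        tf_saved_model_dir="/dbfs/tmp/wide_and_deep_export",  # placeholder export path
        tf_meta_graph_tags=["serve"],
        tf_signature_def_key="serving_default",
        artifact_path="model",
    )

# Register the logged model so it can be promoted through stages and served
mlflow.register_model(f"runs:/{run.info.run_id}/model", "wide_and_deep_recommender")

# Later, load it anywhere (a notebook, a job or a serving endpoint) as a generic pyfunc
recommender = mlflow.pyfunc.load_model("models:/wide_and_deep_recommender/1")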

While this seems like a large number of technologies to bring together just to build a single model, Databricks integrates all of these technologies within a single platform, providing data scientists, data engineers & MLOps Engineers a unified experience. The pre-integration of these technologies means various personas can work faster and leverage additional capabilities, such as the automated tracking of models, to enhance the transparency of the organization’s model building efforts.

To see an end-to-end example of how a wide and deep recommender model may be built on Databricks, please check out the following notebooks:


Solution Accelerator: Toxicity Detection in Gaming


Across massively multiplayer online video games (MMOs), multiplayer online battle arena games (MOBAs) and other forms of online gaming, players continuously interact in real time to either coordinate or compete as they move toward a common goal — winning.  This interactivity is integral to game play dynamics, but at the same time, it’s a prime opening for toxic behavior — an issue pervasive throughout the online video gaming sphere.


Toxic behavior manifests in many forms, such as the varying degrees of griefing, cyberbullying and sexual harassment that are illustrated in the matrix below from Behaviour Interactive, which lists the types of interactions seen within the multiplayer game, Dead by Daylight.


Fig. 1: Matrix of toxic interactions that players experience

In addition to the personal toll that toxic behavior can have on gamers and the community — an issue that cannot be overstated — it is also damaging to the bottom line of many game studios. For example, a study from Michigan State University revealed that 80% of players recently experienced toxicity, and of those, 20% reported leaving the game due to these interactions. Similarly, a study from Tilburg University showed that having a disruptive or toxic encounter in the first session of the game led to players being over three times more likely to leave the game without returning. Given that player retention is a top priority for many studios, particularly as game delivery transitions from physical media releases to long-lived services, it’s clear that toxicity must be curbed.

Compounding this issue related to churn, some companies face challenges related to toxicity early in development, even before launch. For example, Amazon’s Crucible was released into testing without text or voice chat due in part to not having a system in place to monitor or manage toxic gamers and interactions. This illustrates that the scale of the gaming space has far surpassed most teams’ ability to manage such behavior through reports or by intervening in disruptive interactions. Given this, it’s essential for studios to integrate analytics into games early in the development lifecycle and then design for the ongoing management of toxic interactions.

Toxicity in gaming is clearly a multifaceted issue that has become a part of video game culture and cannot be addressed universally in a single way.  That said, addressing toxicity within in-game chat can have a huge impact given the frequency of toxic behavior and the ability to automate detection of it using natural language processing (NLP).

Introducing the Toxicity Detection in Gaming Solution Accelerator from Databricks

Using toxic comment data from Jigsaw and Dota 2 game match data, this solution accelerator walks through the steps required to detect toxic comments in real time using NLP and your existing lakehouse. For NLP, this solution accelerator uses Spark NLP from John Snow Labs, an open-source, enterprise-grade solution built natively on Apache Spark ™.

The steps you will take in this solution accelerator are:

  • Load the Jigsaw and Dota 2 data into tables using Delta Lake
  • Classify toxic comments using multi-label classification (Spark NLP)
  • Track experiments and register models using MLflow
  • Apply inference on batch and streaming data
  • Examine the impact of toxicity on game match data

Detecting toxicity within in-game chat in production

With this solution accelerator, you can now more easily integrate toxicity detection into your own games. For example, the reference architecture below shows how to take chat and game data from a variety of sources, such as streams, files, voice or operational databases, and leverage Databricks to ingest, store and curate data into feature tables for machine learning (ML) pipelines, in-game ML, BI tables for analysis and even direct interaction with tools used for community moderation.


Fig. 2: Toxicity detection reference architecture

Having a real-time, scalable architecture to detect toxicity in the community makes it possible to simplify workflows for community relationship managers and to filter millions of interactions into manageable workloads. Similarly, the ability to alert on severely toxic events in real time, or even to automate a response such as muting players or quickly notifying a CRM of the incident, can have a direct impact on player retention. Likewise, a platform capable of processing large datasets from disparate sources can be used to monitor brand perception through reports and dashboards.
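To make that concrete, here is a minimal sketch of the real-time scoring leg, assuming a toxicity model already registered in MLflow; the model name and stage, Kafka broker and topic, checkpoint path and output table are all placeholder assumptions:

import mlflow.pyfunc
from pyspark.sql import functions as F

# Wrap the registered toxicity model as a Spark UDF for streaming inference
toxicity_udf = mlflow.pyfunc.spark_udf(
    spark, model_uri="models:/toxicity_multilabel/Production", result_type="string"
)

# Read in-game chat from Kafka, score each message, and land the results in a Delta
# table that moderation tools and dashboards can query
chat_stream = (
    spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "game-chat")
        .load()
        .select(F.col("value").cast("string").alias("message"))
        .withColumn("toxicity_labels", toxicity_udf(F.col("message")))
)

(chat_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/toxicity")
    .outputMode("append")
    .toTable("toxic_chat_events"))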

Getting started

The goal of this solution accelerator is to help support the ongoing management of toxic interactions in online gaming by enabling real-time detection of toxic comments within in-game chat. Get started today by importing this solution accelerator directly into your Databricks workspace.

Once imported, you will have notebooks with two pipelines ready to move to production.

  1. An ML pipeline using multi-label classification, trained on real-world English datasets from Google Jigsaw. The model classifies and labels the forms of toxicity in text (a minimal sketch follows this list).
  2. Real-time streaming inference pipeline leveraging the toxicity model. The pipeline source can be easily modified to ingest chat data from all the common data sources.
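Here is a hedged sketch of what the training pipeline's core stages could look like with Spark NLP; the source table, column names, pretrained encoder choice and hyperparameters are assumptions, not the accelerator's exact configuration:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, MultiClassifierDLApproach
from pyspark.ml import Pipeline

# Placeholder source: Jigsaw comments with a 'comment_text' column and a 'labels'
# column holding an array of toxicity labels per comment (the multi-label target)
train_df = spark.table("jigsaw_train")

document = (DocumentAssembler()
    .setInputCol("comment_text")
    .setOutputCol("document"))

embeddings = (UniversalSentenceEncoder.pretrained()
    .setInputCols(["document"])
    .setOutputCol("sentence_embeddings"))

classifier = (MultiClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("toxicity")
    .setLabelColumn("labels")
    .setMaxEpochs(10))

toxicity_pipeline = Pipeline(stages=[document, embeddings, classifier])
toxicity_model = toxicity_pipeline.fit(train_df)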

With both of these pipelines, you can begin understanding and analyzing toxicity with minimal effort. This solution accelerator also provides a foundation to build, customize and improve the model with data relevant to your game mechanics and communities.

DOWNLOAD THE NOTEBOOKS!


Announcing Photon Public Preview: The Next Generation Query Engine on the Databricks Lakehouse Platform


Today, we’re excited to announce the availability of Photon in public preview. Photon is a native vectorized engine developed in C++ to dramatically improve query performance. All you have to do to benefit from Photon is turn it on. Photon will seamlessly coordinate work and resources and transparently accelerate portions of your SQL and Spark queries. No tuning or user intervention required.

While the new engine is designed to ultimately accelerate all workloads, during preview, Photon is focused on running SQL workloads faster while reducing your total cost per workload. There are two ways you can benefit from Photon: through Databricks SQL endpoints, where it is on by default, and by selecting a Photon runtime when provisioning Databricks clusters.

In this blog, we’ll discuss the motivation behind building Photon, explain how Photon works under the hood and how to monitor query execution in Photon from both Databricks SQL and traditional clusters on Databricks Data Science & Data Engineering as well.

Faster with Photon

One might be wondering, why build a new query engine? They say a bar chart is worth a thousand words, so let’s allow the data to tell the story.


Image 1: Relative Speedup of Databricks Runtime compared to version 2.1 using TPC-DS 1TB

As you can see from this chart of Databricks Runtime performance using the Power Test from the TPC-DS benchmark (scale factor 1TB), performance steadily increased over the years. However, with the introduction of Photon, we see a huge leap forward in query performance — Photon is up to 2x faster than Databricks Runtime 8.0. That’s why we’re very excited about Photon’s potential, and we’re just getting started — the Photon roadmap contains plans for greater coverage and more optimizations.

Early private preview customers have observed 2-4x average speedups using Photon on SQL workloads such as:

  • SQL-based jobs – Accelerate large-scale production jobs on SQL and Spark DataFrames.
  • IoT use cases – Faster time-series analysis using Photon compared to Spark and traditional Databricks Runtime.
  • Data privacy and compliance – Query petabyte-scale datasets to identify and delete records without duplicating data, using Delta Lake, production jobs and Photon.
  • Loading data into Delta and Parquet – Photon’s vectorized I/O speeds up data loads for Delta and Parquet tables, lowering overall runtime and costs of Data Engineering jobs.

How Photon works

While Photon is written in C++, it integrates directly in and with Databricks Runtime and Spark. This means that no code changes are required to use Photon. Let me walk you through a quick “lifecycle of a query” to help you understand where Photon plugs in.


Image 2: Lifecycle of a Photon query

When a client submits a given query or command to the Spark driver, it is parsed, and the Catalyst optimizer does the analysis, planning and optimization just as it would if there were no Photon involved. The one difference is that with Photon the runtime engine makes a pass over the physical plan and determines which parts can run in Photon. Minor modifications may be made to the plan for Photon, for example, changing a sort merge join to hash join, but the overall structure of the plan, including join order, will remain the same. Since Photon does not yet support all features that Spark does, a single query can run partially in Photon and partially in Spark. This hybrid execution model is completely transparent to the user.

The query plan is then broken up into atomic units of distributed execution called tasks that are run in threads on worker nodes, which operate on a specific partition of the data. It’s at this level that the Photon engine does its work. You can think of it as replacing Spark’s whole stage codegen with a native engine implementation. The Photon library is loaded into the JVM, and Spark and Photon communicate via JNI, passing data pointers to off-heap memory. Photon also integrates with Spark’s memory manager for coordinated spilling in mixed plans. Both Spark and Photon are configured to use off-heap memory and coordinate under memory pressure.

With the public preview release, Photon supports many – but not all – data types, operators and expressions. Refer to the Photon overview in the documentation for details.

Photon execution analysis

Given that not all workloads and operators are supported today, you might be wondering how to choose workloads that can benefit from Photon and how to detect the presence of Photon in the execution plan. In short, Photon execution is bottom up — it begins at the table scan operator and continues up the DAG (directed acyclic graph) until it hits an operation that is unsupported. At that point, the execution leaves Photon, and the rest of the operations will run without Photon.

If you are using Photon on Databricks SQL, it’s easy to see how much of a query ran using Photon:
  1. Click the Query History icon on the sidebar.
  2. Click the line containing the query you’d like to analyze.
  3. On the Query Details pop-up, click Execution Details.
  4. Look at the Task Time in Photon metric at the bottom.

In general, the larger the percentage of Task Time in Photon, the larger the performance benefit from Photon.


Image 3: Databricks SQL Query History Execution Details

If you are using Photon on Databricks clusters, you can view Photon in action in the Spark UI. The following screenshot shows the query details DAG, with two indications of Photon. First, Photon operators start with the prefix Photon, such as PhotonGroupingAgg. Second, Photon operators and stages are colored peach in the DAG, whereas the non-Photon ones are blue.


Image 4: Spark UI Query Details DAG

Getting started with a working example on NYC taxi data

As discussed above, there are two ways you can use Photon:

  1. Photon is on by default for all Databricks SQL endpoints. Just provision a SQL endpoint, run your queries, and use the method presented above to determine how much Photon impacts performance.
  2. To run Photon on Databricks clusters (AWS only during public preview), select a Photon runtime when provisioning a new cluster. The new Photon instance type consumes DBUs at a different rate than the same instance type running the non-Photon runtime. For more details on the specifics of Photon instances and DBU consumption, refer to the Databricks pricing page for AWS.

Once you’ve created a Photon-enabled SQL endpoint or cluster, you can try running a few queries against the NYC Taxi dataset from Databricks SQL editor or a notebook. We have pre-loaded an excerpt and made it accessible as part of our Databricks datasets.

First, create a new table pointing to the existing data with the following SQL snippet:

 
CREATE DATABASE IF NOT EXISTS photon_demo;
CREATE TABLE photon_demo.nyctaxi_yellowcab_table
USING DELTA
OPTIONS (
  path "/databricks-datasets/nyctaxi/tables/nyctaxi_yellow/" 
);

Try this query and enjoy the speed of Photon!

 
SELECT vendor_id,
  SUM(trip_distance) as SumTripDistance,
  AVG(trip_distance) as AvgTripDistance
FROM photon_demo.nyctaxi_yellowcab_table
WHERE passenger_count IN (1, 2, 4)
GROUP BY vendor_id
ORDER BY vendor_id;

We measured the response time of the above query with Photon and a conventional Databricks Runtime on a warmed-up AWS cluster with 2 i3.2xlarge executors and an i3.2xlarge driver. Here are the results.


Image 5: Photon vs. Databricks Runtime on NYC taxi example query

If you’d like to learn more about Photon, you can also watch our Data and AI Summit session: Radical Speed for SQL Queries: Photon Under the Hood. Thank you for reading; we look forward to your feedback!


The Modern Chief Data Officer: Transitioning From Defense to Offense


The Chief Data Officer (CDO) is not a new position – Capital One reportedly had a CDO all the way back in 2002. But only recently has it become a mainstream, business-critical role for enterprises. In a recent study by NewVantage Partners, 65% of surveyed companies have a CDO position within their organization. And as data and AI continue to shape nearly every industry, the role of the CDO is shifting with it.

In this blog, we dive into the evolution of the CDO and, as data-driven leadership becomes the key to business success, the prescriptive steps required for building data and AI-centric organizations.

Evolution of responsibilities of the CDO

At a high level, the CDO is responsible for envisioning and executing the data strategy across all business functions. Over time, the CDO role has evolved with global economic and regulatory changes and continuous advancements in technology. According to Gartner, with the uptick in data and analytics (D&A) use cases, we have entered into the fourth stage of the CDO evolution, defined as:

  • Stage 1: Primarily focus on data management strategies and best practices. This can be either centralized or decentralized under distinct business units.
  • Stage 2: Utilize analytics along with data management, which helps with “the offense” since the CDO can now design holistic data management strategies that best enable the use of data analytics.
  • Stage 3: Drive digital transformation or business transformation.
  • Stage 4: Manage profit and loss (P&L) using D&A instead of merely focusing on creating the D&A products.

As CDOs across organizations transition to “stage four,” their approach and metrics of success ultimately come down to one question: are they optimizing only for a defensive data strategy, or for an offensive one built on sound data management practices? Let’s walk through what each of these scenarios means.

In a defensive approach, one focuses on ETL, data management, infrastructure, regulatory and compliance issues. These are the foundations of a good data platform since they are concerned with protecting the data and controlling access to the data.

An offensive strategy, on the other hand, focuses on the products and uses D&A to drive decision-making for P&L. CDOs are integral to decision-making and work with the business units. These are revenue-generating strategies or strategies concerned with increasing customer satisfaction. In this scenario, it’s important to note that data management is still critical;  the difference is that CDOs are now looking beyond pure data management in a defensive strategy to build on those practices and create value out of data.

According to Harvard Business Review, offensive strategies tend to be more real-time since they focus on sales and marketing, while defensive strategies focus on compliance and the legal aspects of data management. Ideally, one would want to balance the competing demands of control (defensive) and flexibility (offensive) to utilize the data effectively. While it is easy to lock up the data in a silo, this is counterproductive since no value can then be derived from the data to bring in new customers and increase revenue. It is important to point out that a good defensive strategy is essential before one can transition to offense (i.e., one needs good data management and governance policies in place before one can start to put the data to use with machine learning models and analytics).

Technology enabling the transition from defense to offense

According to a recent MIT Tech Review study, “Building a High-Performance Data and AI Organization,” which sought to understand the characteristics of high-performing data-driven organizations, most high-performing organizations are focused on implementing ML solutions with their data, whereas lower-performing organizations still struggle with implementing data management strategies. High-performing organizations prioritized the following in their quest to become a more data-driven organization:

  • Reduced data duplication
  • Fast and easy access to data
  • Improved data quality
  • Fewer hurdles to cross-functional collaboration
  • The ability to perform analytics ‘in place’

Data management – the single source of truth – can evolve to support derived versions of the data, but the provenance has to be clear. First of all, data governance has to be strictly enforced to make sure this is done consistently. Data management can be either centralized or decentralized under each individual business unit. The advantage of a centralized data management policy is stronger control over your data; a decentralized approach, however, results in more flexible access to data, making it easier to apply analytics and generate insights. Stronger controls and data governance can result in reduced data duplication and better data quality, while flexibility helps with faster and easier access to data.

A new data architecture is emerging: the lakehouse architecture, which brings the best of data warehouses and data lakes into a single unified platform for all data, analytics and AI. A lakehouse architecture helps CDOs both effectively check the box on data management and build an offensive data and AI strategy. Modern lakehouse architectures build on existing open data lakes to seamlessly add comprehensive data management, supporting analytics and ML across all business units using all enterprise data. With lakehouse architectures, CDOs can reduce risk by enabling fine-grained access controls for data governance, functionality typically not possible with data lakes. Data can quickly and accurately be updated in the data lake to comply with regulations like GDPR and maintain better data governance through audit logging – even in multi-cloud environments. CDOs can now shift focus to the exciting value-add initiatives of creating compelling business insights and turning data into products.

The pace of investment in D&A products is increasing.

There is greater adoption of ML and deep learning solutions. However, two primary issues plague data science teams. One is avoiding the inevitable staleness associated with moving data from the source to the model-building platform; ideally, a framework is needed that allows models to be trained on data ‘in place’, where the data resides. The second issue relates to handing off ML models from data scientists to engineering teams (i.e., there has to be a seamless way to productionize ML models). To minimize these issues, there must either be close collaboration between the data scientists and the engineering team, or a seamless process for the model-building team to deploy models to production without involving the engineering team.

Through this shift, it’s important to note that the importance of governance never goes away but rather the shift is additive, with a solid offensive strategy executed on a strong foundation of governance.

Challenges for the modern CDO

As Peter Drucker famously put it, “Culture eats strategy for breakfast.” According to the NewVantage Partners executive survey on Big Data and AI, the same can be said for almost 92% of organizations when it comes to becoming data-driven. Organizations, while aspiring to be data-driven, can be hesitant to make the changes required to become one. CDOs need to be mindful of this while trying to work across functional units. As reported in the NewVantage survey, more executives believe that the CDO role should belong to someone who is an ‘insider’ as opposed to an external agent of change. This possibly signals the desire for someone who is in sync with the company culture, thereby minimizing the hurdles associated with bringing about change in the data culture.

To achieve this goal, a CDO needs their products to be adopted across the entire company, not just by individual teams. This requires close cooperation with, and endorsement from, data-driven leaders in the C-suite; otherwise, these initiatives are bound to fail. According to data leader veteran Sol Rashidi of Estee Lauder, the best approach for increasing adoption is to start with a “prototype” and get buy-in from the business leaders before proceeding. This creates tangible alignment with business interests as opposed to abstract goals that are difficult to align on. The key here, according to Rashidi, is to de-emphasize the details and focus on business outcomes and the value derived from D&A products rather than just the technical capabilities.

The path to success

The most critical step for CDOs to enable data and AI at scale is to develop a comprehensive strategy with buy-in from stakeholders across the organization. This strategy focuses on achieving maximum success by leveraging people, processes, data and technology to ultimately drive measurable business results against your corporate priorities. The strategy serves as a set of principles that every member of your organization can reference when making business decisions. It should cover the roles and responsibilities of teams within the organization for capturing, storing, curating and processing data — including the resources (labor and budget).

To help guide CDOs and other data leaders looking to transition to an offensive, business-focused data strategy, we’ve compiled a list of 10 key considerations. You can see the full list and details in the guide “Enable Data and AI to Transform your Organization”:

  1. What are the overall goals, timeline and appetite for the initiative?
  2. How do you identify, evaluate and prioritize use cases that actually provide a significant ROI?
  3. How do you create high-performing teams and empower your business analyst, data scientist, machine learning and data engineering talent?
  4. How can you future-proof your technology investment with a modern cloud-based data architecture?
  5. How can you satisfy the GDPR, the CCPA, and other emerging data compliance and governance regulations?
  6. How do you guarantee data quality and enable secure data access and sharing of all your data across the organization?
  7. How do you streamline the user experience (UX), improve collaboration, and simplify the complexity of your tooling?
  8. How do you make informed build vs. buy decisions and ensure you are focusing your limited resources on the most important problems?
  9. How do you establish the initial budgets, allocate and optimize costs based on SLAs and usage patterns?
  10. What are the best practices for moving into production, and how do you measure progress, rate of adoption, and user satisfaction?

A strategy should clearly answer these 10 questions and more, and it should be captured in a living document, owned and governed by the CDO, and made available for everyone in the organization to review and provide feedback. The strategy will evolve based on the changing business and/or technology landscape — but it should serve as the North Star for how you will navigate the many decisions and trade-offs that you will need to make over the course of the transformation.

Next steps

To learn more about how CDOs and data leaders are evolving their roles, and their views on the future of their data strategies, check out the Champions of Data + AI series and the new report from MIT Tech Review Insights: Building a High-Performance Data and AI Organization.


How Databricks Supports Digital Native Companies in Their Hyper-growth Journey


In a recent panel discussion, Richard Zananiri, Director of EMEA Mid-market at Databricks, was joined by four globally operating, high-growth cloud-native companies. Each organization is at the forefront of utilizing the power of data, ML and AI to solve business-critical issues within their respective sectors. The company representatives were asked to explain how they are harnessing the power of data to overcome their biggest challenges, and how they are implementing Databricks technologies to reach scale and drive hyper-growth within their organizations.

Across all four companies, it was clear that by introducing the Databricks Lakehouse platform, data teams are able to manage and process data in a faster, more efficient and scalable way. This in turn allows them to achieve more with fewer resources.

Dodo Brands
Dodo Brands, a Russian tech-driven food service company that has experienced 40% growth over the last year, implemented Databricks to enable artificial intelligence (AI) and machine learning (ML) for predictive supply chain analytics and forecasting, which is instrumental to sustaining their competitive advantage.  Their first project involved building a new infrastructure and system for supply chain planning, starting with short-term forecasting. This was then scaled up across the whole supply chain, ensuring that data was accurate, reliable and accessible for all their partners and employees. With around 600 pizzerias and 10,000+ employees, data can now be harnessed to provide all stakeholders with relevant information on how to increase revenues and manage their business effectively. The company’s next step will be to look beyond forecasting to future advanced analytics use cases, such as optimizing prices with ML algorithms. The aim is to complete projects up to 20 times faster, aligned with the rapid pace of the Dodo Brands business.

Fraudio
Fraudio is another fast-growing, disruptive business benefiting from the Databricks platform. The financial services company offers payment and merchant fraud detection and anti-money laundering solutions using its patented centralized AI technologies, and works with some of the biggest financial institutions in the world. Databricks’ solutions enable Fraudio to carry out data engineering, data science and model training and to manage business intelligence (BI) efficiently. Fraudio processes large volumes of transactions every day from different customers and third parties. Customer data schemas are translated to its internal data schema in real time, and all schemas are brought together in one centralized place for the AI to be leveraged. The AI continuously learns from all the data received, producing a network effect involving billions of transactions, enabling Fraudio to contextualize the data and produce extremely accurate scores that help customers of all sizes meet their real-world needs more rapidly.

FollowAnalytics
With growth of more than 600% over the last two years, FollowAnalytics allows customers to build a mobile app, with the aim of growing mobile revenue streams and increasing marketing ROI. As its customer base has grown, FollowAnalytics has transitioned to Databricks to manage each customer experience individually with separate flows. Over the next 18 months, the company is moving entirely to the Databricks platform to increase flexibility and business impact for customers, as well as managing internal resources more efficiently. With the large volume of data from some retail and e-commerce customers, FollowAnalytics has developed specific analytics solutions customized to each customer, and is starting to implement AI and ML models that allow automatic segmentation. Modeling is completed on a per client basis, so that customer data is not mixed – ensuring compliance with data protection and privacy requirements. Another key advantage of Databricks for FollowAnalytics is stability. Different customers have subtle differences in their applications, so with A/B testing, they can measure which solution serves the customer best, increases revenue and maximizes ROI. With several million people using the applications daily, the FollowAnalytics team confirms that this would not be possible without Databricks.

S4M
S4M is a rapidly-growing business that delivers advertising designed to drive customers to stores, dealerships and restaurants. Its goal is to increase the value of campaign spend, optimizing and reporting on business KPIs such as store visits and sales. More than 1,000 brands use the S4M Fusio platform to drive customers to physical locations. The company moved to the open source Delta Lake on Databricks, which has increased performance and provided full integration with open source services such as MLflow. Its data teams have transitioned from batch jobs to streaming, which allows them to segregate data and create a layered architecture. This means they now operate with more agility and flexibility when feeding ML jobs. Thanks to Databricks, S4M can solve an array of critical problems, such as qualifying bids, determining how much should be paid for performing campaigns and addressing geolocation fraud issues.

The S4M data team uses MLflow daily, on production jobs and in notebooks. MLflow is employed to track every model, run and record. Data can be versioned with just a few more lines of code, making operations much simpler and iteration much faster. Everything happens ‘under the hood’, saving time and effort: runs, parameters and metrics are tracked automatically, and models can be automatically exported to S3. Furthermore, Databricks makes it easy for S4M teams to change cluster configurations, to bootstrap jobs and to push to production every couple of hours. It has also helped the teams embrace a layered architecture, which can be leveraged much more easily using Delta Lake atop their data lake. In summary, Databricks has made S4M’s data operations more cost-effective and efficient, allowing teams to accomplish more with fewer resources.

In conclusion, all the organizations on our panel – although operating in vastly different industries and customer sectors – share similar goals in terms of data management: driving efficiencies across their teams and moving forward with agility and speed.

We would like to thank our guests and moderator for their participation and insights. Watch the full discussion through the link below.

Watch now

Panelists:

Jose Carlos Joaquim – CTO, FollowAnalytics

Joao Moura – CEO, Fraudio

Michael Colson – VP Platform & Data, S4M

Clement Carreau – Data Engineer, S4M

Vladislav Mandryka – Global Supply Chain Director, Dodo Brands

Andrey Filipev – Chief Data Officer, Dodo Brands


Get Your Free Copy of Delta Lake: The Definitive Guide (Early Release)


At the Data + AI Summit, we were thrilled to announce the early release of Delta Lake: The Definitive Guide, published by O’Reilly. The guide teaches how to build a modern lakehouse architecture that combines the performance, reliability and data integrity of a warehouse with the flexibility, scale and support for unstructured data available in a data lake. It also shows how to use Delta Lake as a key enabler of the lakehouse, providing ACID transactions, time travel, schema constraints and more on top of the open Parquet format. Delta Lake enhances Apache Spark and makes it easy to store and manage massive amounts of complex data by supporting data integrity, data quality, and performance.

What can you expect from reading this guide? Learn about all the buzz around bringing transactionality and reliability to data lakes using Delta Lake. You will gain an understanding of the evolution of the big data technology landscape – from data warehousing to the data lakehouse.


Source: Evolution to the Data Lakehouse


There is no shortage of challenges associated with building data pipelines, and this guide walks through how to tackle them and make data pipelines robust and reliable, so that downstream users both realize significant value and can rely on their data to make critical data-driven decisions.

While many organizations have standardized on Apache Spark™ as the big data processing engine, we need to add transactionality to our data lakes to ensure a high-quality end-to-end data pipeline. This is where Delta Lake comes in. Delta Lake enhances Apache Spark and makes it easy to store and manage massive amounts of complex data by supporting data integrity, data quality and performance. As announced by Michael Armbrust and Matei Zaharia, Databricks recently released Delta Lake 1.0 on Apache Spark 3.1, with added experimental support for Google Cloud Storage, Oracle Cloud Storage and IBM Cloud Object Storage. Alongside this release, we also introduced Delta Sharing, an open protocol for the secure real-time exchange of large datasets, which enables organizations to share data in real time regardless of which computing platforms they use. We will cover step-by-step guidance on all of these releases in a future release of the book.

This guide is designed to walk data engineers, data scientists and data practitioners through how to build reliable data lakes and data pipelines at scale using Delta Lake. Additionally, you will:

  • Understand key data reliability challenges and how to tackle them
  • Learn how to use Delta Lake to realize data reliability improvements
  • Learn how to concurrently run streaming and batch jobs against a data lake
  • Explore how to execute update, delete and merge commands against a data lake
  • Dive into using time travel to roll back and examine previous versions of your data

Reviewing the transaction log structure

  • Learn best practices to build effective, high-quality end-to-end data pipelines for real-world use cases
  • Integrate with other data technologies like Presto, Athena, Redshift and other BI tools and programming languages
  • Learn about different use cases in which the transaction log can be an absolute lifesaver, such as data governance (GDPR/CCPA):

Simplified governance use case with time travel

Book reader personas

This guide doesn’t require any prior knowledge of the modern lakehouse architecture, however, some knowledge of big data, data formats, cloud architectures and Apache Spark is helpful. While we invite anyone with an interest in data architectures and machine learning to check our guide, it’s especially useful for:

  • Data engineers with Apache Spark or big data backgrounds
  • Machine learning engineers who are involved in day-to-day data engineering
  • Data scientists who are interested in learning behind-the-scenes data engineering for the curated data
  • DBAs (or other operational folks) who know SQL and DB concepts and want to apply their knowledge to the new world of data lakes
  • University students who are learning all things possible in CS, Data and AI

The early release of the digital book is available now from Databricks and O’Reilly. You get to read the ebook in its earliest form – the authors’ raw and unedited content as they write – so you can take advantage of these technologies long before the official release. The final digital copy is expected to be released at the end of 2021, and printed copies will be available in April 2022. Thanks to Gary O’Brien, Jess Haberman and Chris Faucher from O’Reilly, who have been helping us with the book publication.


Early Release of Delta Lake: The Definitive Guide


To give you a sneak peek, here is an excerpt from Chapter 2 describing what Delta Lake is.

What is Delta Lake?

As previously noted, over time, there have been different storage solutions built to solve this problem of data quality – from databases to data lakes. The transition from databases to data lakes allows for the decoupling of business logic from storage as well as the ability to independently scale compute and storage. But lost in this transition was ensuring data reliability. Providing data reliability to data lakes led to the development of Delta Lake.
Built by the original creators of Apache Spark, Delta Lake was designed to combine the best of both worlds for online analytical workloads (i.e., OLAP style): the transactional reliability of databases with the horizontal scalability of data lakes.

Delta Lake is a file-based, open-source storage format that provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lakes and is compatible with Apache Spark and other processing engines. Specifically, it provides the following features:

  • ACID guarantees: Delta Lake ensures that all data changes written to storage are committed for durability and made visible to readers atomically. In other words, no more partial or corrupted files! We will discuss more on the ACID guarantees as part of the transaction log later in this chapter.
  • Scalable data and metadata handling: Since Delta Lake is built on data lakes, all reads and writes using Spark or other distributed processing engines are inherently scalable to petabyte-scale. However, unlike most other storage formats and query engines, Delta Lake leverages Spark to scale out all the metadata processing, thus efficiently handling metadata of billions of files for petabyte-scale tables. We will discuss more on the transaction log later in this chapter.
  • Audit History and Time travel: The Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes. These data snapshots enable developers to access and revert to earlier versions of data for audits, rollbacks, or to reproduce experiments. We will dive further into this topic in Chapter 3: Time Travel with Delta.
  • Schema enforcement and schema evolution: Delta Lake automatically prevents the insertion of data with an incorrect schema, i.e. not matching the table schema. And when needed, it allows the table schema to be explicitly and safely evolved to accommodate ever-changing data. We will dive further into this topic in Chapter 4 focusing on schema enforcement and evolution.
  • Support for deletes, updates, and merge: Most distributed processing frameworks do not support atomic data modification operations on data lakes. Delta Lake supports merge, update, and delete operations to enable complex use cases including but not limited to change-data-capture (CDC), slowly-changing-dimension (SCD) operations, and streaming upserts. We will dive further into this topic in Chapter 5: Data modifications in Delta.
  • Streaming and batch unification: A Delta Lake table has the ability to work both in batch and as a streaming source and sink. The ability to work across a wide variety of latencies ranging from streaming data ingestion to batch historic backfill to interactive queries all just work out of the box. We will dive further into this topic in Chapter 6: Streaming Applications with Delta.

(a) Pipeline using separate storage systems and (b) Using Delta Lake for both stream and table storage.

The figure above (referenced from the VLDB 2020 paper) shows a data pipeline implemented using three storage systems (a message queue, object store and data warehouse), or using Delta Lake for both stream and table storage. The Delta Lake version removes the need to manage multiple copies of the data and uses only low-cost object storage. For more information, refer to the VLDB 2020 paper: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores.
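To make a couple of the features described in this excerpt concrete, here is a minimal PySpark sketch of time travel, audit history and a row-level MERGE; the table names and version number are placeholder assumptions:

# Time travel: query the table as it existed at an earlier version (or timestamp)
previous = spark.sql("SELECT * FROM customer_orders VERSION AS OF 3")

# Audit history: every change recorded in the Delta transaction log
spark.sql("DESCRIBE HISTORY customer_orders").show(truncate=False)

# Row-level modifications: an upsert expressed as a MERGE against the table
spark.sql("""
    MERGE INTO customer_orders t
    USING customer_orders_updates u
    ON t.order_id = u.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")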

 
Additionally, we are planning to cover the following topics in the final release of the book.

  • A critical part of building your data pipelines is building the right platform and architecture, so we will focus on how to build the Delta Lake Medallion architecture (Chapter 7) and Lakehouse architectures (Chapter 8).
  • As data reliability is crucial for all data engineering and data science systems, it is important that this capability is accessible to all systems. Thus in Integrations with Delta Lake (Chapter 9), we will focus on how Delta Lake integrates with other open-source and proprietary systems including but not limited to Presto, Athena, and more!
  • With Delta Lake in production for many years with more than 1 exabyte of data/day processed, there are a plethora of design tips and best practices that will be discussed in Design Patterns using Delta Lake (Chapter 10).
  • Just as important for production environments is the ability to build security and governance for your lake; this will be covered in Security and Governance (Chapter 11).
  • To round up this book, we will also cover important topics including Performance and Tuning (Chapter 12), Migration to Delta Lake (Chapter 13), and Delta Lake Case Studies (Chapter 14).

 

Please be sure to check out related content from the Data + AI Summit 2021 platform – keynotes from visionaries and thought leaders including Bill Inmon, the father of data warehousing; Malala Yousafzai, Nobel Peace Prize winner and education advocate; Dr. Moogega Cooper and Adam Steltzner, trailblazing engineers of the famed Mars rover ‘Perseverance’ mission at NASA-JPL; Sol Rashidi, CAO at Estee Lauder; DJ Patil, who coined the title “data scientist” at LinkedIn; Michael Armbrust, distinguished software engineer at Databricks; Matei Zaharia, Databricks co-founder and Chief Technologist and original creator of Apache Spark and MLflow; and Ali Ghodsi, Databricks CEO and co-founder, among other featured speakers. Level up your knowledge with highly technical content presented by leading experts.

--

Try Databricks for free. Get started today.

The post Get Your Free Copy of Delta Lake: The Definitive Guide (Early Release) appeared first on Databricks.

Need for Data-centric ML Platforms

This blog is the first in a series on MLOps and Model Governance. The next blog will be by Joseph Bradley and will discuss how to choose the right technologies for data science and machine learning based on his experience working with customers.

Introduction

Recently, I learned that the failure rate for machine learning projects is still astonishingly high. Studies suggest that between 85-96% of projects never make it to production. These numbers are even more remarkable given the growth of machine learning (ML) and data science in the past five years. What accounts for this failure rate?

For businesses to be successful with ML initiatives, they need a comprehensive understanding of the risks and how to address them. In this post, we attempt to shed light on how to achieve this by moving away from a model-centric view of ML systems towards a data-centric view. We’ll also dive into MLOps and model governance and the importance of leveraging data-centric ML platforms such as Databricks.

The data of ML applications

Of course, everyone knows that data is the most important component of ML. Nearly every data scientist has heard: “garbage in, garbage out” and “80% of a data scientist’s time is spent cleaning data”. These aphorisms remain as true today as they did five years ago, but both refer to data purely in the context of successful model training. If the input training data is garbage, then the model output will be garbage, so we spend 80% of our time ensuring that our data is clean and our model makes useful predictions. Yet model training is only one component of a production ML system.

In Rules of Machine Learning, research scientist Martin Zinkevich emphasizes implementing reliable data pipelines and infrastructure for all business metrics and telemetry before training your first model. He also advocates testing pipelines on a simple model or heuristic to ensure that data is flowing as expected prior to any production deployment. According to Zinkevich, successful ML application design considers the broader requirements of the system first, and does not overly focus on training and inference data.

Zinkevich isn’t the only one who sees the world this way. The Tensorflow Extended (TFX) team at Google has cited Zinkevich and echoes that building real world ML applications “necessitates some mental model shifts (or perhaps augmentations).”

Prominent AI researcher Andrew Ng has also recently spoken about the need to embrace a data-centric approach to machine learning systems, as opposed to the historically predominant model-centric approach. Ng talked about this in the context of improving models through better training data, but I think he is touching upon something deeper. The message from both of these leaders is that deploying successful ML applications requires a shift in focus. Instead of asking, “What data do I need to train a useful model?”, the question should be, “What data do I need to measure and maintain the success of my ML application?”

To confidently measure and maintain success, a variety of data must be collected to satisfy business and engineering requirements. For example, how do we know if we’re hitting business KPIs for this project? Or, where is our model and its data documented? Who is accountable for the model, and how do we trace its lineage? Looking at the flow of data in a ML application can shed some light on where these data points are found.

The diagram below illustrates one possible flow of data in a fictional web app that uses ML to recommend plants to shoppers and the personas that own each stage.

Flow of data for an ML application project built on the Databricks Lakehouse.

 

In this diagram, source data flows from the web app to intermediate storage, and then to derived tables. These are used for monitoring, reporting, feature engineering and model training. Additional metadata about the model is extracted, and logs from testing and serving are collected for auditing and compliance. A project that neglects or is incapable of managing this data is at risk of underperforming or failing entirely, regardless of how well the ML model performs on its specific task.

ML engineering, MLOps & model governance

Much like DevOps and data governance have lowered risk and become disciplines in their own right, ML engineering has emerged as a discipline to handle the operations (aka MLOps) and governance of ML applications. There are two kinds of risk that need to be managed in this context: risk inherent to the ML application system and risk of non-compliance with external systems. If data pipeline infrastructure, KPIs, model monitoring and documentation are lacking, then the risk of your system becoming destabilized or ineffective increases. On the other hand, a well-designed app that fails to comply with corporate, regulatory and ethical requirements runs the risk of losing funding, receiving fines or suffering reputational damage.

How can organizations manage this risk? MLOps and model governance are still in their early stages, and there are no official standards or definitions for them. Therefore, based on our experience working with customers, we propose working definitions to help you think about them.

MLOps (machine learning operations) is the active management of a productionized model and its task, including its stability and effectiveness. In other words, MLOps is primarily concerned with maintaining the function of the ML application through better data, model and developer operations. Simply put, MLOps = ModelOps + DataOps + DevOps.

Model governance, on the other hand, is the control and regulation of a model, its task and its effect on surrounding systems. It is primarily concerned with the broader consequences of how an ML application functions in the real world.

To illustrate this distinction, imagine an extreme case in which someone builds a highly-functional ML application that is used to secretly mine Bitcoin on your devices. Such an application might be very effective, but its lack of governance has negative consequences for society. At the same time, you could write 400-page compliance and auditing reports for a credit risk model to satisfy federal regulations, but if the application isn’t stable or effective, then it is lacking in the operational dimension.

So, to build a system that is functional and respects human values, we need both. At a minimum, operations are responsible for maintaining uptime and stability, and each organization assumes legal and financial responsibility for the ML applications they create. Today, this responsibility is relatively limited because the regulatory environment for AI is in its infancy. However, leading corporations and academic institutions in the space are working to shape its future. Much like GDPR caused major waves in the data management space, it seems that similar regulation is an inevitability for ML.

Essential Capabilities

Having distinguished between operations and governance, we are now in a position to ask: What specific capabilities are required to support them?  The answers fall into roughly six categories:

Data processing and management

Since the bulk of innovation in ML happens in open source, support for structured and unstructured data types with open formats and APIs is a prerequisite. The system must also process and manage pipelines for KPIs, model training/inference, target drift, testing and logging. Note that not all pipelines process data in the same way or with the same SLA. Depending on the use case, a training pipeline may require GPUs, a monitoring pipeline may require streaming and an inference pipeline may require low latency online serving. Features must be kept consistent between training (offline) and serving (online) environments, leading many to look to feature stores as a solution. How easy is it for engineers to manage features, retry failed jobs, understand data lineage, and comply with regulatory mandates like GDPR? The choices made to deliver these capabilities can result in significant swings in ROI.

Secure collaboration

Real world ML engineering is a cross-functional effort – thorough project management and ongoing collaboration between the data team and business stakeholders are critical to success. Access controls play a large role here, allowing the right groups to work together in the same place on data, code and models while limiting the risk of human error or misconduct. This notion extends to separation of dev and prod environments too.

Testing

To ensure the system meets expectations for quality, tests should be run on code, data and models. This includes unit tests for pipeline code covering feature engineering, training, serving and metrics, as well as end-to-end integration testing. Models should be tested for baseline accuracy across demographic and geographic segments, feature importance, bias, input schema conflicts and computational efficiency. Data should be tested for the presence of sensitive PII or HIPAA data and training/serving skew, as well as validation thresholds for feature and target drift. Ideally automated, tests reduce the likelihood of human error and aid in compliance.
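As a hedged illustration of the data tests described above, the sketch below shows what a minimal, automated check might look like in a pytest-style Python suite; the column names and thresholds are hypothetical and would be replaced by whatever your application actually requires.

import pandas as pd

def validate_features(df: pd.DataFrame) -> None:
    """Minimal data tests: schema, null and range checks (hypothetical columns)."""
    expected_columns = {"age", "annual_income", "num_purchases"}
    assert expected_columns.issubset(df.columns), "missing feature columns"
    assert df["age"].between(0, 120).all(), "age out of valid range"
    assert df["annual_income"].notnull().all(), "nulls found in annual_income"

def test_validate_features():
    # A tiny in-memory sample standing in for a real validation dataset
    sample = pd.DataFrame({
        "age": [34, 51],
        "annual_income": [52000.0, 87000.0],
        "num_purchases": [3, 11],
    })
    validate_features(sample)  # should pass without raising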

Monitoring

Regular surveillance over the system helps identify and respond to events that pose a risk to its stability and effectiveness. How soon can it be discovered when a key pipeline fails, a model becomes stale or a new release causes a memory leak in production?  When was the last time all input feature tables were refreshed or someone tried to access restricted data? The answers to these questions may require a mix of live (streaming), periodic (batch) and event driven updates.

Reproducibility

This refers to the ability to validate the output of a model by recreating its definition (code), inputs (data) and system environment (dependencies). If a new model shows unexpectedly poor performance or contains bias towards a segment of the population, organizations need to be able to audit the code and data used for feature engineering and training, reproduce an alternate version, and re-deploy. Also, if a model in production is behaving strangely, how will we be able to debug it without reproducing it?

Documentation

Documenting an ML application scales operational knowledge, lowers the risk of technical debt and acts as a bulwark against compliance violations. This includes an accounting and visualization of the system architecture; the schemas, parameters and dependencies of features, models and metrics; and reports of every model in production and accompanying governance requirements.

The need for a data-centric machine learning platform

In a recent webinar, Matei Zaharia listed ease of adoption by data teams, alongside integration with data infrastructure and collaboration functions, as desirable features in an ML platform.

In this regard, data science tools that emerged from a model-centric approach are fundamentally limited. They offer advanced model management features in software that is separated from critical data pipelines and production environments. This disjointed architecture relies on other services to handle the most critical component of the infrastructure – data.

As a result, access control, testing and documentation for the entire flow of data are spread across multiple platforms. To separate these at this point seems arbitrary and, as has been established, unnecessarily increases the complexity and risk of failure for any ML application.

A data-centric ML platform brings models and features alongside data for business metrics, monitoring and compliance. It unifies them, and in doing so, is fundamentally simpler. Enter lakehouse architecture.

Lakehouse platform architecture

Lakehouses are by definition data-centric and combine the flexibility and scalability of data lakes with the performance and data management of a data warehouse. Their open source nature makes it easy to integrate ML with where the data lives. There’s no need to export data out of a proprietary system in order to use ML frameworks like Tensorflow, PyTorch or scikit-learn. This also makes them considerably easier to adopt.

Databricks Machine Learning is built upon a lakehouse architecture and supports critical MLOps and governance needs including secure collaboration, model management, testing and documentation.

Data processing and management

To manage and process the variety and volume of data sources required by a ML application, Databricks uses a high performance combination of Apache Spark and Delta Lake. These unify batch and streaming workloads, operate at petabyte scale and are used for monitoring, metrics, logging and training/inference pipelines that are built with or without GPUs. Delta Lake’s data management capabilities make it easy to maintain compliance with regulations. The Feature Store is tightly integrated with Delta, Spark and MLflow to make feature discovery and serving simple for training and inference jobs. Multi-step pipelines can be executed as scheduled jobs or invoked via API, with retries and email notifications. For low latency online serving, Databricks offers hosted MLflow model serving for testing, publishing features to an online store, and integrating with Kubernetes environments or managed cloud services like Azure ML and Sagemaker for production.
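As a rough sketch of how a small metrics or monitoring pipeline might be assembled on these components, assuming a Spark environment with Delta Lake available (the table paths and column names below are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read raw events from a (hypothetical) bronze Delta table
events = spark.read.format("delta").load("/mnt/lake/bronze/app_events")

# Derive a small daily metrics table for monitoring dashboards
daily_metrics = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date")
    .agg(
        F.count("*").alias("event_count"),
        F.approx_count_distinct("user_id").alias("active_users"),
    )
)

# Write the result back to Delta for downstream BI and drift checks
(daily_metrics.write
    .format("delta")
    .mode("overwrite")
    .save("/mnt/lake/gold/daily_app_metrics"))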

Manage data with the Feature Store

Secure collaboration

In addition to defining data access privileges at the table, cloud resource or user identity level, Databricks also supports access control of models, code, compute and credentials. These enable users to co-edit and co-view notebooks in the workspace in compliance with security policies. The administrative features that limit access to production environments and sensitive data are used by customers in financial services, health care, and government around the world.

Testing

Databricks Repos allow users to integrate their project with version control systems and automated build and test servers like Jenkins or Azure DevOps. These can be used for unit and integration tests whenever code is committed. Databricks also offers MLflow webhooks that can be triggered at key stages of a model’s lifecycle – for example promotion to staging or production. These events can force an evaluation of the model for baseline accuracy, feature importance, bias, and computational efficiency, rejecting candidates that fail to pass or inviting a code review and tagging models accordingly. The signature or input schema of a MLflow model can also be provided at logging time and tested for compatibility with the data contract of the production environment.
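For example, a model signature can be captured at logging time, which gives later tests a concrete schema to compare against the production data contract. The sketch below uses standard MLflow and scikit-learn APIs; the feature names and model choice are illustrative only.

import pandas as pd
import mlflow
from mlflow.models.signature import infer_signature
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative training data and model; feature names are hypothetical
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
model = RandomForestClassifier(random_state=42).fit(X, y)

with mlflow.start_run():
    # The signature records the expected input/output schema, so a later test
    # can compare it against the production environment's data contract
    signature = infer_signature(X, model.predict(X))
    mlflow.sklearn.log_model(model, "model", signature=signature)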

Test results can trigger lifecycle events in the MLflow Model Registry

Monitoring

For ongoing surveillance, Structured Streaming and Delta Lake can be used in conjunction with Databricks SQL to visualize system telemetry, KPIs and feature distributions for stakeholders in real-time dashboards. Periodic, scheduled batch jobs keep static historical and audit log tables fresh for analysis. To stay abreast of important events, teams can receive email or Slack notifications for job failures. To maintain the validity of input features, routine statistical testing of feature distributions should be performed and logged with MLflow. Comparing runs makes it easy to tell if the shape of feature and target distributions is changing. If a distribution or application latency metric breaches a threshold value, an alert from SQL Analytics can trigger a training job using webhooks to automatically redeploy a new version. Changes to the state of a model in the MLflow Model Registry can be monitored via the same webhooks mentioned for testing. These alerts are critical to maintaining the efficacy of a model in production.
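One possible way to implement the routine statistical testing mentioned above is to compare training and serving feature distributions with a two-sample KS test and log the results to MLflow so runs can be compared over time. This is a minimal sketch with synthetic stand-in data, not a full monitoring pipeline; the feature name and threshold are hypothetical.

import mlflow
import numpy as np
from scipy import stats

def log_feature_drift(train_values, serving_values, feature_name, threshold=0.05):
    """Compare training and serving distributions for one feature with a
    two-sample KS test and log the results to MLflow."""
    statistic, p_value = stats.ks_2samp(train_values, serving_values)
    mlflow.log_metric(f"{feature_name}_ks_stat", statistic)
    mlflow.log_metric(f"{feature_name}_ks_pvalue", p_value)
    return p_value < threshold  # True signals potential drift

with mlflow.start_run(run_name="feature_drift_check"):
    train = np.random.normal(0, 1, 1000)      # stand-in for historical feature values
    serving = np.random.normal(0.3, 1, 1000)  # stand-in for recent feature values
    drifted = log_feature_drift(train, serving, "avg_session_length")
    print("Drift detected:", drifted)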

Reproducibility

MLflow is a general framework to track and manage models from experimentation through deployment. The code, data source, library dependencies, infrastructure and model can be logged (or auto-logged) at training time alongside other arbitrary artifacts like SHAP explainers or pandas-profiling. This allows for reproducing a training run at the click of a button. This data is preserved when models are promoted to the centralized Model Registry, serving as an audit trail of their design, data lineage and authorship. Maintaining model versions in the registry makes it easy to quickly roll back breaking changes while engineers trace a model artifact back to its source for debugging and investigation.

MLflow Reproducibility

Documentation

Following the notion that documentation should be easy to find, Databricks Notebooks are a natural fit for documenting pipelines that run on the platform and system architecture. In addition to notebooks, models can also be elucidated by conveniently logging relevant artifacts alongside them to the MLflow tracking server, as described above. The tracking server and registry also support annotation of a model and a description of its lifecycle stage transitions via the UI and API. These are important features that bring human judgement and feedback to an AI system.

Models annotated in the registry

Putting it all together

To illustrate what the experience of developing an ML application on a data-centric ML platform like Databricks looks like, consider the following scenario:

A team of three practitioners (a data engineer, a data scientist and a machine learning engineer) is tasked with building a recommender to improve sales for their online store – plantly.shop.

At first, the team meets with business stakeholders to identify KPI and metric requirements for the model, application, and corresponding data pipelines, identifying any data access and regulatory issues up front.  The data engineer starts a project in version control, syncs their code to a Databricks Repo, then gets to work using Apache Spark to ingest sales and application log data into Delta Lake from an OLTP database and Apache Kafka. All pipelines are built with Spark Structured Streaming and TriggerOnce to provide turnkey streaming in the future. Data expectations are defined on the tables to ensure quality, and unit and integration tests are written with Spark in local mode in their IDE. Table definitions are documented with markdown in shared notebooks on Databricks and copied into an internal wiki. 

The data scientist is granted access to the tables using SQL, and they use Databricks AutoML, koalas and notebooks to develop a simple baseline model predicting if a user will purchase plants shown to them. The system environment, code, model binary, data lineage and feature importance of this baseline are automatically logged to the MLflow tracking server, making auditing and reproducibility simple. 

Eager to test in a production pipeline, the data scientist promotes the model to the MLflow Model Registry. This  triggers a webhook, which in turn kicks off a series of validation tests written by the ML engineer. After passing checks for prediction accuracy, compatibility with the production environment, computational performance, and any compliance concerns with the training data or predictions (can’t recommend invasive species, can we!), the ML engineer approves the transition to production. MLflow model serving is used to expose the model to the application via REST API. 

In the next release, the model is tested by sending a subset of production traffic to the API endpoint and the monitoring system comes to life!  Logs are streaming into Delta Lake, parsed and served in SQL Analytics dashboards that visualize conversion rates, compute utilization, rolling prediction distributions and any outliers. These give the business stakeholders direct visibility into how their project is performing. 

In the meantime the data scientist is busy working on version 2 of the model, a recommender using deep learning. They spin up a single node, GPU enabled instance with the ML Runtime and develop a solution with PyTorch that is automatically tracked by MLflow. This model performs far better than the baseline model, but uses features that are completely different. They save these to Delta Lake, documenting each feature, its source tables and the code used to generate it. After passing all tests, the model is registered as version 2 of the plant recommender. 

The pandemic has certainly caused plant sales to spike, and to cope with the higher-than-expected traffic, the team uses mlflow.pyfunc.spark_udf to generate predictions with the new model in near real time with Spark Structured Streaming. In the next release, everyone is recommended a variegated ficus elastica, which immediately sells out. No surprise there!  The team celebrates their success, but in a quiet moment, the data scientist can be heard muttering something about ‘overfitting’…
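A minimal sketch of that scoring pattern is shown below, assuming a registered model and a Delta table of events; the model name, storage paths and column handling are hypothetical.

import mlflow.pyfunc
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct

spark = SparkSession.builder.getOrCreate()

# Load the registered model as a Spark UDF (model name and stage are hypothetical)
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/plant_recommender/Production")

# Score a stream of user events from a (hypothetical) silver Delta table
events = spark.readStream.format("delta").load("/mnt/lake/silver/user_events")
feature_cols = [c for c in events.columns if c not in ("user_id", "event_ts")]

scored = events.withColumn("recommendation_score", predict_udf(struct(*feature_cols)))

query = (scored.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/lake/checkpoints/recommendations")
    .start("/mnt/lake/gold/recommendations"))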

The typical data flow for an ML application project.

This simplified example of a real-life workflow helps animate MLOps and governance alongside traditional work on a data-centric ML platform.

Conclusion

In this blog, we endeavoured to understand why ML initiatives continue to fail. We discovered that a model-centric approach to ML applications can unintentionally be a tremendous source of risk. Switching to a data-centric approach clarifies the nature of that risk as belonging to the application function itself, or to compliance with external systems. MLOps and governance are emerging disciplines that seek to establish confidence in and derisk ML initiatives, which they accomplish through a set of essential capabilities. The Databricks Lakehouse is one proven data-centric ML platform that delivers these capabilities while remaining open and easy to adopt.

We may still be early in the days of machine learning, but it doesn’t feel like that will be the case for much longer. AI will continue to change every sector of the economy and our lives. Organizations that adopt a data-centric ML platform with strong MLOps and governance practices will play a role in that transformation.

Next steps

To see a live demonstration of many of these concepts, see the DAIS 2021 session Learn to Use Databricks for the Full ML Lifecycle.

In future posts, we hope to dive deeper into how Databricks realizes these capabilities for its customers. In the meantime, here are some resources to learn more:


--

Try Databricks for free. Get started today.

The post Need for Data-centric ML Platforms appeared first on Databricks.


Three Principles for Selecting Machine Learning Platforms


This blog post is the second in a series on ML platforms, operations, and governance. For the first post, see Rafi Kurlansik’s post on the “Need for Data-centric ML Platforms.”

 
I recently spoke with a  Sr. Director of Data Platforms at a cybersecurity company, who commented, “I don’t understand how you can be future-proof for machine learning since there’s such a mess of constantly changing tools out there.”  This is a common sentiment. Machine learning (ML) has progressed more rapidly than almost any other recent technology; libraries are often fresh from the research lab, and there are countless vendors advertising tools and platforms (Databricks included). Yet, as we talked, the platform director came to understand they were in a perfect position to future-proof the company’s data science (DS) and ML initiatives. Their company needed a platform that could support ever-changing technology on top.

In my years at Databricks, I’ve seen many organizations build data platforms to support DS & ML teams for the long term. The initial challenges commonly faced by these organizations can be grouped into a few areas: separation between their data platforms and ML tools, poor communication and collaboration between engineering and DS & ML teams, and past tech choices inhibiting change and growth. In this blog post, I have collected my high-level recommendations which guided these organizations as they selected new technologies and improved their DS & ML platforms. These common mistakes — and their solutions — are organized into three principles.

Principle 1: Simplify data access for ML

DS and ML require easy access to data. Common barriers include proprietary data formats, data bandwidth constraints and governance misalignment.

One company I’ve worked with provides a representative example. This company had a data warehouse with clean data, maintained by data engineering. There were also data scientists working with business units, using modern tools like XGBoost and TensorFlow, but they could not easily get data from the warehouse into their DS & ML tools, delaying many projects. Moreover, the platform infrastructure team worried that data scientists had to copy data onto their laptops or workstations, opening up security risks. To address these frictions caused by their data warehouse-centric approach to ML, we broke down the challenges into three parts.

Open data formats for Python and R

In this example, the first problem was the use of a proprietary data store. Data warehouses use proprietary formats and require an expensive data egress process to extract data for DS & ML. On the other side, DS & ML tools are commonly based on Python and R — not SQL — and expect open formats: Parquet, JSON, CSV, etc. on disk and Pandas or Apache Spark DataFrames in memory.  This challenge is exacerbated for unstructured data like images and audio, which do not fit naturally in data warehouses and require specialized libraries for processing.

Re-architecting data management around Data Lake storage (Azure ADLS, AWS S3, GCP GCS) allowed this company to consolidate data management for both data engineering and DS & ML, making it much easier for data scientists to access data. Data scientists could now use Python and R, loading data directly from primary storage to a DataFrame — allowing faster model development and iteration. They could also work with specialized formats like image and audio — unblocking new ML-powered product directions.
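For instance, with data stored as Parquet on object storage, a data scientist might pull a sample directly into pandas for prototyping. This is only a sketch; the bucket path and column name are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load a Parquet dataset directly from object storage into a Spark DataFrame
df = spark.read.parquet("s3://company-data-lake/claims/curated/")

# Pull a filtered sample into pandas for local model prototyping
sample_pdf = df.filter(df["claim_year"] == 2020).limit(100_000).toPandas()
print(sample_pdf.shape)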

Data bandwidth and scale

Beyond DS & ML-friendly formats, this company faced data bandwidth and scale challenges. Feeding an ML algorithm with data from a data warehouse can work for small data. But application logs, images, text, IoT telemetry and other modern data sources can easily max out data warehouses, becoming very expensive to store and impossibly slow to extract for DS & ML algorithms.

By making data lake storage their primary data layer, this company was able to work with datasets 10x the size, while reducing costs for data storage and movement. More historical data boosted their models’ accuracies, especially in handling rare outlier events.

Unified data security and governance

Of the challenges this company faced from its previous data management system, the most complex and risky was in data security and governance. The teams managing data access were Database Admins, familiar with table-based access. But the data scientists needed to export datasets from these governed tables to get data into modern ML tools. The security concerns and ambiguity from this disconnect resulted in months of delays whenever data scientists needed access to new data sources.

These pain points led them towards selecting a more unified platform that allowed DS & ML tools to access data under the same governance model used by data engineers and database admins. Data scientists were able to load large datasets into Pandas and PySpark dataframes easily, and database admins could restrict data access based on user identity and prevent data exfiltration.

Success in simplifying data access

This customer made two key technical changes to simplify data access for DS & ML: (1) using data lake storage as their primary data store and (2) implementing a shared governance model over tables and files backed by data lake storage. These choices led them towards a lakehouse architecture, which took advantage of Delta Lake to provide data engineering with data pipeline reliability, data science with the open data formats they needed for ML and admins with the governance model they needed for security. With this modernized data architecture, the data scientists were able to show value on new use cases in less than half the time.

Previous architecture combining data lake and data warehouse
 
Lakehouse architecture enabling all personas with Delta Lake

A few of my favorite customer success stories on simplifying data access include:

  • At Outreach, ML engineers used to waste time setting up pipelines to access data, but moving to a managed platform supporting both ETL and ML reduced this friction.
  • At Edmunds, data silos used to hamper data scientists’ productivity. Now, as Greg Rokita (Executive Director), said, “Databricks democratizes data, data engineering and machine learning, and allows us to instill data-driven principles within the organization.”
  • At Shell, Databricks democratized access to data and allowed advanced analytics on much larger data, including inventory simulations across all parts and facilities and recommendations for 1.5+ million customers.

Principle 2: Facilitate collaboration between data engineering and data science

A data platform must simplify collaboration between data engineering and DS & ML teams, beyond the mechanics of data access discussed in the previous section. Common barriers are caused by these two groups using disconnected platforms for compute and deployment, data processing and governance.

A second customer of mine had a mature data science team but recognized that they were too disconnected from their data engineering counterparts. Data science had a DS-centric platform they liked, complete with notebooks, on-demand (cloud) workstations and support for their ML libraries. They were able to build new, valuable models, and data engineering had a process for hooking the models into Apache Spark-based production systems for batch inference. Yet this process was painful. While the data science team was familiar with using Python and R from their workstations, they were unfamiliar with the Java environment and cluster computing used by data engineering. These gaps led to an awkward handoff process: rewriting Python and R models in Java, checking to ensure identical behavior, rewriting featurization logic and manually sharing models as files tracked in spreadsheets. These practices caused months of delays, introduced errors in production and did not allow management oversight.

Cross-team environment management

In the above example, the first challenge was environment management. ML models are not isolated objects; their behavior depends upon their environment, and model predictions can change across library versions. This customer’s teams were bending over backwards to replicate ML development environments in the data engineering production systems. The modern ML world requires Python (and sometimes R), so they needed tools for environment replication like virtualenv, conda and Docker containers.

Recognizing this requirement, they turned to MLflow, which uses these tools under the hood but shields data scientists from the complexity of environment management. With MLflow, their data scientists shaved over a month off of productionization delays and worried less about upgrading to the latest ML libraries.
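As a sketch of how an environment can travel with a model, the snippet below logs a scikit-learn model together with an explicit conda environment using standard MLflow APIs; the library versions are illustrative, and MLflow can also infer the environment automatically.

import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Explicitly capture the training environment so the production system can
# recreate it; the pinned versions here are placeholders
conda_env = {
    "name": "model_env",
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.8",
        "pip",
        {"pip": ["mlflow", "scikit-learn==0.24.2"]},
    ],
}

with mlflow.start_run():
    mlflow.sklearn.log_model(model, "model", conda_env=conda_env)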

Machine Learning workflow involving data scientists, data engineers, and deployment engineers

Data preparation to featurization

For DS & ML, good data is everything, and the line between ETL/ELT (often owned by data engineers) and featurization (often owned by data scientists) is arbitrary. For this customer, when data scientists needed new or improved features in production, they would request data engineers to update pipelines. Long delays sometimes caused wasted work when business priorities changed during the wait.

When selecting a new platform, they looked for tools to support the handoff of data processing logic. In the end, they selected Databricks Jobs as the hand-off point: data scientists could wrap Python and R code into units (Jobs), and data engineering could deploy them using their existing orchestrator (Apache Airflow) and CI/CD system (Jenkins). The new process of updating featurization logic was almost fully automated.
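A hedged sketch of what such a hand-off could look like on the data engineering side, using the Airflow Databricks provider to submit a notebook-based job; the notebook path, cluster spec and connection ID are hypothetical placeholders.

from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="featurization_handoff",
    start_date=datetime(2021, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # The data scientist owns the featurization notebook; data engineering owns
    # this deployment definition. All values below are illustrative.
    run_featurization = DatabricksSubmitRunOperator(
        task_id="run_featurization_job",
        databricks_conn_id="databricks_default",
        new_cluster={
            "spark_version": "9.1.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
        notebook_task={"notebook_path": "/Repos/ds-team/featurization/build_features"},
    )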

Sharing machine learning models

ML models are essentially vast amounts of data and business goals distilled into concise business logic. As I worked with this customer, it felt ironic and frightening to me that such valuable assets were being stored and shared without proper governance. Operationally, the lack of governance led to laborious, manual processes for production (files and spreadsheets), as well as less oversight from team leads and directors.

It was game-changing for them to move to a managed MLflow service, which provided mechanisms for sharing ML models and moving to production, all secured under access controls in a single Model Registry. Software enforced and automated previously manual processes, and management could oversee models as they moved towards production.
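For reference, registering a model and moving it through lifecycle stages uses standard MLflow Model Registry APIs, along the lines of the sketch below; the run ID and model name are hypothetical.

import mlflow
from mlflow.tracking import MlflowClient

# Register a logged model under a shared, access-controlled name
# (the run ID and model name are hypothetical)
result = mlflow.register_model("runs:/abc123def456/model", "plant_recommender")

# Promote a validated version with an auditable stage transition
client = MlflowClient()
client.transition_model_version_stage(
    name="plant_recommender",
    version=result.version,
    stage="Staging",
)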

Success in facilitating collaboration

This customer’s key technology choices for facilitating collaboration were around a unified platform that supports both data engineering and data science needs with shared governance and security models. With Databricks, some of the key technologies that enabled their use cases were the Databricks Runtime and cluster management for their compute and environment needs, jobs for defining units of work (AWS/Azure/GCP docs), open APIs for orchestration (AWS/Azure/GCP docs) and CI/CD integration (AWS/Azure/GCP docs), and managed MLflow for MLOps and governance.

Customer success stories specific to collaboration between data engineering and data science include:

  • Condé Nast benefited from breaking down walls between teams managing data pipelines and teams managing advanced analytics. As Paul Fryzel (Principal Engineer of AI Infrastructure) said, “Databricks has been an incredibly powerful end-to-end solution for us. It’s allowed a variety of different team members from different backgrounds to quickly get in and utilize large volumes of data to make actionable business decisions.”
  • At Iterable, disconnects between data engineering and data science teams prevented training and deploying ML models in a repeatable manner. By moving to a platform shared across teams that streamlined the ML lifecycle, their data teams simplified reproducibility for models and processes.
  • At Showtime, ML development and deployment were manual and error-prone until migrating to a managed MLflow-based platform. Databricks removed operational overhead from their workflows, reducing time-to-market for new models and features.

Principle 3: Plan for change

Organizations and technology will change. Data sizes will grow; team skill sets and goals will evolve; and technologies will develop and be replaced over time. An obvious, but common, strategic error is not planning for scale. Another common but more subtle error is selecting non-portable technologies for data, logic and models.

I’ll share a third customer story to illustrate this last principle. I worked with an early stage customer who hoped to create ML models for content classification. They chose Databricks but relied heavily on our professional services due to lack of expertise. A year later, having shown some initial value for their business, they were able to hire more expert data scientists and had meanwhile collected almost 50x more data. They needed to scale, to switch to distributed ML libraries, and to integrate more closely with other data teams.

Planning for scaling

As this customer found, data, models, and organizations will scale over time. Their data could originally have fit within a data warehouse, but it would have required migration to a different architecture as the data size and analytics needs grew. Their DS & ML teams could have worked on laptops initially, but a year later, they needed more powerful clusters. By planning ahead with a Lakehouse architecture and a platform supporting both single-machine and distributed ML, this organization prepared a smooth path for rapid growth.

Portability and the “build vs. buy” decision

Portability is a more subtle challenge. Tech strategy is sometimes oversimplified into a “build vs. buy” decision, such as “building an in-house platform using open source technologies can allow customization and avoid lock-in, whereas buying a ready-made, proprietary toolset can allow faster setup and progress.” This argument presents an unhappy choice: either make a huge up-front investment in a custom platform or get locked in to a proprietary technology.

However, that argument is misleading, for it does not distinguish between data platform and infrastructure, on the one hand, and project-level data technology, on the other. Data storage layers, orchestration tools and metadata services are common platform-level technology choices; data formats, languages and ML libraries are common project-level technology choices. These two types of choices should be handled differently when planning for change. It helps to think of the data platform and infrastructure as the generic containers and pipelines for a company’s specialized data, logic and models.

Planning for project-level technology changes

Project-level technologies should be simple to swap in and out. New data- and ML-powered products may have different requirements, requiring new data sources, ML libraries or service integrations. Flexibility in changing these project-level technology choices allows a business to adapt and be competitive.

The platform must allow this flexibility and, ideally, encourage teams to avoid proprietary tools and formats for data and models. For my customer, though they began with scikit-learn, they were able to switch to Spark ML and distributed TensorFlow without changing their platform or MLOps tools.

Planning for platform changes

Platforms should allow portability. For a platform to serve a company long-term, the platform must avoid lock-in: moving data, logic and models to and from the platform must be simple and inexpensive. When data platforms are not a company’s core mission and strength, it makes sense for the organization to buy a platform to move faster — as long as that platform allows the company to stay nimble and move its valuable assets elsewhere when needed.

For my customer, selecting a platform that allowed them to use open tools and APIs like scikit-learn, Spark ML and MLflow helped in two ways. First, it simplified the platform decision by giving them confidence that the decision was reversible. Second, they were able to integrate with other data teams by moving code and models to and from other platforms.

Types of change, the platform needs they imply, and project-level technology examples:

  • Scaling: Process both small and big data efficiently; provide single-node and distributed compute. Project-level examples: scale pandas → Apache Spark or Koalas; scale scikit-learn → Spark ML; scale Keras → Horovod.
  • New data types and application domains: Support arbitrary data types and open data formats; support both batch and streaming; integrate easily with other systems. Project-level examples: use and combine Delta, Parquet, JSON, CSV, TXT, JPG, DICOM, MPEG, etc.; stream data from web app backends.
  • New personas and orgs: Support data scientists, data engineers and business analysts; provide scalable governance and access controls. Project-level examples: visualize data in both (a) plotly in notebooks and (b) dashboards in pluggable BI tools; run ML via both (a) custom code and (b) AutoML.
  • Change of platform: User owns their data and ML models, with no egress tax; user owns their code, synced with git. Project-level examples: use open code APIs such as Keras and Spark ML to keep project-level workloads independent of the platform.
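To make the Scaling row above concrete, here is a small, hedged sketch of moving the same aggregation from pandas to Koalas (the package name used at the time; newer Spark versions expose the equivalent functionality as pyspark.pandas):

import pandas as pd
import databricks.koalas as ks

# Prototype the aggregation in pandas...
pdf = pd.DataFrame({"store": ["a", "a", "b"], "sales": [10, 20, 30]})
print(pdf.groupby("store")["sales"].sum())

# ...then scale the same logic out on Spark by switching to Koalas
kdf = ks.from_pandas(pdf)
print(kdf.groupby("store")["sales"].sum())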

Success in planning for change

This customer’s key technology choices that allowed them to adapt to change were a lakehouse architecture, a platform supporting both single-machine and distributed ML, and MLflow as a library-agnostic framework for MLOps. These choices simplified their path of scaling data by 50x, switching to more complex ML models, and scaling their team and its skill sets.

Some of my top picks for customer success stories on change planning and portability are:

  • At Edmunds, data teams needed infrastructure that supported data processing and ML requirements, such as the latest ML frameworks. Maintaining this infrastructure on their own required significant DevOps effort. The Databricks managed platform provided flexibility, while reducing the DevOps overhead.
  • As Quby experienced data growth to multiple petabytes and the number of ML models grew to 1+ million, legacy data infrastructure could not scale or run reliably. Migrating to Delta Lake and MLflow provided the needed scale, and migration was simplified since Databricks supported the variety of tools needed by the data engineering and data science teams.
  • Data teams at Shell range widely both in skills and in analytics projects (160 AI projects with more coming). With Databricks as one of the foundational components of the Shell.ai platform, Shell has the flexibility needed to handle current and future data needs.

Applying the principles

It’s easy to list out big principles and say, “go do it!”  But implementing them requires candid assessments of your tech stack, organization and business, followed by planning and execution. Databricks offers a wealth of experience in building data platforms to support DS & ML.

The most successful organizations we work with follow a few best practices: They recognize that long-term architectural planning should happen concurrently with short-term demonstrations of impact and value. That value is communicated to executives by aligning data science teams with business units and their prioritized use cases. Cross-organization alignment helps to guide organizational improvements, from simplifying processes to creating Centers of Excellence (CoE).

This blog post is just scratching the surface of these topics. Some other great material includes:

The next post will be a deep dive into ML Ops—how to monitor and manage your models post-deployment and how to leverage the full Databricks platform to close the loop on a model’s lifecycle.

--

Try Databricks for free. Get started today.

The post Three Principles for Selecting Machine Learning Platforms appeared first on Databricks.

Using Bayesian Hierarchical Models to Infer the Disease Parameters of COVID-19


In a previous post, we looked at how to use PyMC3 to model the disease dynamics of COVID-19. This post builds on this use case and explores how to use Bayesian hierarchical models to infer COVID-19 disease parameters and the benefits compared to a pooled or an unpooled model. We fit an SIR model to synthetic data, generated from the Ordinary Differential Equation (ODE), in order to estimate the disease parameters such as R0. We then show how this framework can be applied to a real-life dataset (i.e. the number of infections per day for various countries). We conclude with the limitations of this model and outline the steps for improving the inference process.

I have also launched a series of courses on Coursera covering this topic of Bayesian modeling and inference; courses 2 and 3 are particularly relevant to this post. Check them out on the Coursera Databricks Computational Statistics course page.

The SIR model

The SIR model, which we used in our previous post to model COVID-19, is defined by the set of three Ordinary Differential Equations (ODEs) shown below. There are three compartments in this model: S, I and R.

dS/dt = -λ S I
dI/dt = λ S I - μ I
dR/dt = μ I

Here ‘S’, ‘I’ and ‘R’ refer to the susceptible, infected and recovered portions of the population of size ‘N’ such that

S + I + R = N

The assumption here is that once you have recovered from the disease, lifetime immunity is conferred on an individual. This is not the case for a lot of diseases and, hence, may not be a valid model.

λ is the rate of infection and μ is the rate of recovery from the disease. The fraction of people who recover from the infection is given by ‘f’, but for the purpose of this work, ‘f’ is set to 1 here. We end up with an Initial Value Problem (IVP) for our set of ODEs, where I(0) is assumed to be known from the case counts at the beginning of the pandemic and S(0) can be estimated as N – I(0). Here we make the assumption that the entire population is susceptible. Our goal is to accomplish the following:

  • Use Bayesian Inference to make estimates about λ and μ
  • Use the above parameters to estimate I(t) for any time ‘t’
  • Compute R0

Pooled, unpooled and hierarchical models

Suppose you have information regarding the number of infections from various states in the United States. One way to use this data to infer the disease parameters of COVID-19 (e.g. R0) is to sum it all up to estimate a single parameter. This is called a pooled model. However, the problem with this approach is that fine-grained information that might be contained in these individual states or groups is lost. The other extreme would be to estimate an individual parameter R0 per state. This approach results in an unpooled model. However, considering that we are trying to estimate the parameters corresponding to the same virus, there has to be a way to perform this collectively, which brings us to the hierarchical model. This is particularly useful when there isn’t sufficient information in certain states to create accurate estimates. Hierarchical models allow us to share the information from other states using a shared ‘hyperprior’. Let us look at this formulation in more detail using the example for λ:

For a pooled model, we draw a single λ from one distribution with fixed parameters λμ and λσ.

For an unpooled model, we draw a separate λ for each group, each from its own distribution with the same fixed parameters λμ and λσ.

For a hierarchical model, we have a prior that is parameterized by non-constant parameters drawn from other (hyperprior) distributions. We still draw one λ per group (two in our two-group example), but all of them are connected through a shared hyperprior distribution.
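A minimal PyMC3 sketch of these three prior structures, using the same Lognormal/HalfNormal choices as the full model later in this post (the hyperparameter values are illustrative):

import pymc3 as pm

n_groups = 2

# Pooled: a single lambda shared by all groups, with fixed hyperparameters
with pm.Model() as pooled:
    lam = pm.Lognormal("lambda", mu=0.75, sigma=2.0)

# Unpooled: an independent lambda per group, each with the same fixed hyperparameters
with pm.Model() as unpooled:
    lam = pm.Lognormal("lambda", mu=0.75, sigma=2.0, shape=n_groups)

# Hierarchical: per-group lambdas tied together through shared hyperpriors
with pm.Model() as hierarchical:
    lam_mu = pm.Lognormal("lambda_mu", mu=0.75, sigma=2.0)
    lam_sigma = pm.HalfNormal("lambda_sigma", sigma=1.0)
    lam = pm.Lognormal("lambda", mu=lam_mu, sigma=lam_sigma, shape=n_groups)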

Check out course 3 Introduction to PyMC3 for Bayesian Modeling and Inference in the recently-launched Coursera specialization on hierarchical models.

Hierarchical models on synthetic data

To implement and illustrate the use of hierarchical models, we generate data using the set of ODEs that define the SIR model. These values are generated at preset timesteps; here the time interval is 0.25. We also select two groups for ease of illustration; however, one can have as many groups as needed. The values for λ and μ are set as [4.0, 3.0] and [1.0, 2.0] respectively for the two groups. The code to generate this data, along with the resulting time-series curves, is shown below.

Generate synthetic data

 


import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint

def SIR(y, t, p):
    # y = [S, I], p = [lambda, mu]
    ds = -p[0] * y[0] * y[1]
    di = p[0] * y[0] * y[1] - p[1] * y[1]
    return [ds, di]

times = np.arange(0, 5, 0.25)
cases_obs = [0] * 2

# Group 1: lambda = 4.0, mu = 1.0
lam, mu = 4.0, 1.0
y = odeint(SIR, t=times, y0=[0.99, 0.01], args=((lam, mu),), rtol=1e-8)
yobs = np.random.lognormal(mean=np.log(y[1::]), sigma=[0.1, 0.1])  # add observation noise
cases_obs[0] = yobs[:, 1]

plt.plot(times[1::], yobs, marker='o', linestyle='none')
plt.plot(times, y[:, 0], color='C0', alpha=0.5, label=f'$S(t)$')
plt.plot(times, y[:, 1], color='C1', alpha=0.5, label=f'$I(t)$')
plt.legend()
plt.show()

# Group 2: lambda = 3.0, mu = 2.0
lam, mu = 3.0, 2.0
y = odeint(SIR, t=times, y0=[0.99, 0.01], args=((lam, mu),), rtol=1e-8)
yobs = np.random.lognormal(mean=np.log(y[1::]), sigma=[0.1, 0.1])
cases_obs[1] = yobs[:, 1]

plt.plot(times[1::], yobs, marker='o', linestyle='none')
plt.plot(times, y[:, 0], color='C0', alpha=0.5, label=f'$S(t)$')
plt.plot(times, y[:, 1], color='C1', alpha=0.5, label=f'$I(t)$')
plt.legend()
plt.show()

Synthetic S(t) and I(t) curves, with noisy observations, generated from the SIR ODEs for the two groups.

Performing inference using a hierarchical model

 


import numpy as np
import pymc3 as pm
import arviz as az
import sunode.wrappers.as_theano

def SIR_sunode(t, y, p):
    return {
        'S': -p.lam * y.S * y.I,
        'I': p.lam * y.S * y.I - p.mu * y.I,
    }

# covid_data is the prepared dataset object (scaled case counts and sample period)
sample_period = covid_data.sample_period
cases_obs = covid_data.cases_obs
time_range = np.arange(0, len(covid_data.cases_obs[0])) * covid_data.sample_period
I0 = covid_data.data[0]  # data is scaled
S0 = 1 - I0
S_init = S0
I_init = I0
cases_obs_scaled = covid_data.data

with pm.Model() as model4:

    # ------------------- Setup the priors and hyperpriors ---------------#
    prior_lam = pm.Lognormal('prior_lam', 0.75, 2)
    prior_mu = pm.Lognormal('prior_mu', 0.75, 2)
    prior_lam_std = pm.HalfNormal('prior_lam_std', 1.0)
    prior_mu_std = pm.HalfNormal('prior_mu_std', 1.0)

    # One lambda and mu per group, tied together through the shared hyperpriors
    lam = pm.Lognormal('lambda', prior_lam, prior_lam_std, shape=2)
    mu = pm.Lognormal('mu', prior_mu, prior_mu_std, shape=2)

    # -------------------- ODE model --------------- #
    res, _, problem, solver, _, _ = sunode.wrappers.as_theano.solve_ivp(
        y0={
            'S': (S_init, (2,)),
            'I': (I_init, (2,)),
        },
        params={
            'lam': (lam, (2,)),
            'mu': (mu, (2,)),
            '_dummy': (np.array(1.), ()),
        },
        rhs=SIR_sunode,
        # The time points where we want to access the solution
        tvals=time_range[1:],
        t0=time_range[0],
    )

    # ------------------- Setup likelihoods for the observed data ---------------#
    I = pm.Normal('I', mu=res['I'], sigma=0.01, observed=cases_obs_scaled[1:])

    R0 = pm.Deterministic('R0', lam / mu)

    # ------------------- Sample from the distribution ---------------#
    # If the posterior distributions look choppy, increase the tuning sample size
    # (and the total number of samples) so the sampler explores the space more effectively.
    trace = pm.sample(8000, tune=2000, chains=4, cores=4)
    data = az.from_pymc3(trace=trace)

az.plot_posterior(data, point_estimate='mode', round_to=2)
az.summary(trace)
az.plot_trace(data)

Posterior distributions of the variables with the Highest Density Intervals (HDI)

Traceplots and density plots of the variables


Real-life COVID-19 data

The data used here is obtained from the Johns Hopkins CSSE GitHub page, where case counts are regularly updated. Here we plot and use the case count of infections per day for two countries, the United States and Brazil. However, there is no limitation on either the choice or the number of countries that can be used in a hierarchical model. The cases below are from Mar 1, 2020 to Jan 1, 2021. The graphs seem to follow a similar trajectory, even though the scales on the y-axis are different for these countries. Considering that these cases are from the same COVID-19 virus, this is reasonable. However, there are differences to account for, such as different variants, geography, social distancing rules, healthcare infrastructure and so on.

Plot of the number of COVID-19 cases for two countries

Inference of parameters

The sampled posterior distributions are shown below, along with their 94% Highest Density Interval (HDI).

The sampled posterior distributions along with their 94% Highest Density Interval (HDI).

We can also inspect the traceplots for convergence, which shows good mixing in all the variables – a good sign that the sampler has explored the space well. There is good agreement between all the traces. This behavior can be confirmed with the fairly narrow HDI intervals in the plots above.

 The traceplots and the density plots for R0 and other variables

 

The table below summarizes the distributions of the various inferred variables and parameters, along with the sampler statistics. While estimates about the variables are essential, this table is particularly useful for informing us about the quality and efficiency of the sampler. For example, the Rhat values are all equal to 1, indicating good agreement between all the chains. The effective sample size is another critical metric. If this is small compared to the total number of samples, that is a sure sign of trouble with the sampler. Even if the Rhat values look good, be sure to inspect the effective sample size!

Table of the inferred variable distributions along with the sampling statistics

Although this yielded satisfactory estimates for our parameters, often we run into the issue of the sampler not performing effectively. In the next post of this series, we will look at a few ways to diagnose the issues and improve the modeling process. These are listed, in increasing order of difficulty, below:

  1. Increase the tuning size and the number of samples drawn.
  2. Decrease the target_accept parameter for the sampler so as to reduce the autocorrelation among the samples. Use the autocorrelation plot to confirm this.
  3. Add more samples to the observed data, i.e. increase the sample frequency.
  4. Use better priors and hyperpriors for the parameters.
  5. Use an alternative parameterization of the model.
  6. Incorporate changes such as social-distancing measures into the model.

You can learn more about these topics at my Coursera specialization that consists of the following courses:

  1. Introduction to Bayesian Statistics
  2. Bayesian Inference with MCMC 
  3. Introduction to PyMC3 for Bayesian Modeling and Inference

 

SEE THE COURSE LISTINGS

The post Using Bayesian Hierarchical Models to Infer the Disease Parameters of COVID-19 appeared first on Databricks.

Databricks Solutions Showcase


Inspiration doesn’t always come from our peers. Some of the best ideas come from innovators in other industries. This is especially true when it comes to data science and AI/ML innovations being applied to business processes and all aspects of operational decision-making.

Consider fraud: Who knows how to analyze millions of data points in real-time better than a credit card company? Now consider the customer experience; online retailers have mastered using machine learning (ML) to personalize recommendations, and adtech companies wrote the book on how to use internal and external data to create a 360° view of the customer. Every company, regardless of industry, can learn from these innovators.

Making it simpler and more accessible for our customers to learn from each other is what inspired us to create the Databricks Solutions Showcase. This virtual event takes place on July 22nd and features data leaders from some of the most innovative data-driven enterprises. Now, you can explore how top global brands like Disney, Walmart, ExxonMobil, HSBC, Takeda Pharmaceuticals, Rolls Royce Holdings, John Deere and more are leveraging a lakehouse architecture to transform their business, focusing on six hot topic areas that cut across industries:

  • Supply chain
  • Customer 360
  • IoT
  • Real-time analytics
  • Advertising optimization
  • Security & fraud detection

These business leaders and innovators in AI strategy will share, in their own words, the big data challenges they faced and how they are using advanced data engineering and analytics to overcome them and drive topline results. In the meantime, you can explore the Databricks Solution Accelerators, fully functional pre-built code to tackle the most common and high-impact use cases that our customers are facing. Get a hands-on demonstration of their capabilities and business value for a variety of the most common business use cases, such as customer contextual fraud detection, churn prediction, demand forecasting and advertising attribution + optimization.

REGISTER FOR THE EVENT

The post Databricks Solutions Showcase appeared first on Databricks.

Applying Natural Language Processing to Healthcare Text at Scale


This is a co-authored post written in collaboration with John Snow Labs. We thank Moritz Steller, senior cloud solution architect, at John Snow Labs for his contributions.

 
In 2015, HIMSS estimated that the healthcare industry in the U.S. produced 1.2 billion clinical documents. That’s a tremendous amount of unstructured text data. Since that time, the digitization of healthcare has only increased the amount of clinical text data generated annually. Digital forms, online portals, pdf reports, emails, text messages and chatbots all provide the backbone for modern healthcare communications. The amount of text generated across these channels is too vast to measure and too comprehensive for a human to consume. And because these datasets are unstructured, they are not readily analyzable and often remain siloed.

This poses a risk for all healthcare organizations. Locked within these lab reports, provider notes and chat logs is valuable information. When combined with a patient’s electronic health record (EHR), these data points provide a more complete view of a patient’s health. At a population level, these datasets can inform drug discovery, treatment pathways, and real-world safety assessments.

Uncovering novel health insights with natural language processing

There’s good news. Advancements in natural language processing (NLP) – a branch of artificial intelligence that enables computers to understand text that is written, spoken or embedded in images – make it possible to extract insights from text. Using NLP methods, unstructured clinical text can be extracted, codified and stored in a structured format for downstream analysis and fed directly into machine learning (ML) models. These techniques are driving significant innovations in research and care.

In one use case, Kaiser Permanente, one of the largest nonprofit health plans and healthcare providers in the US, used NLP to process millions of emergency room triage notes  to predict the demand for hospital beds, nurses and clinicians, and ultimately improve patient flow. Another study used NLP to analyze non-standard text messages from mobile support groups for HIV-positive adolescents. The analysis found a strong correlation between engagement with the group, improved medication adherence and feelings of social support.

What’s getting in the way of healthcare NLP?

With all this incredible innovation, it begs the question—why aren’t more healthcare organizations making use of their clinical text data? In our experience, working with some of the largest payers, providers and pharmaceutical companies, we see three key challenges:

NLP systems are typically not designed for healthcare. Clinical text is its own language. The data is inconsistent due to the wide variety of source systems (e.g. EHR, clinical notes, PDF reports) and, on top of that, the language varies greatly across clinical specialties. Traditional NLP technology is not built to understand the unique vocabularies, grammars and intents of medical text. For example, in the text string below, the NLP model needs to understand that azithromycin is a drug, 500 mg is a dosage, and that SOB is a clinical abbreviation for “shortness of breath” related to the patient condition pneumonia. It’s also important to infer that the patient is not short of breath, and that they haven’t taken the medication yet since it’s just being prescribed.

Most NLP tools cannot properly codify healthcare text. Spark NLP for Healthcare is purpose-built with algorithms designed to understand domain-specific language.

Inflexible legacy healthcare data architectures. Text data contain troves of information but only provide one lens into patient health. The real value comes from combining text data with other health data to create a comprehensive view of the patient. Unfortunately, legacy data architectures built on data warehouses lack support for unstructured data—such as scanned reports, biomedical images, genomic sequences and medical device streams — making it impossible to harmonize patient data. Additionally, these architectures are costly and complex to scale. A simple ad hoc analysis on a large corpus of health data can take hours or days to run. That is too long to wait when adjusting for patient needs in real-time.

Lack of advanced analytics capabilities. Most healthcare organizations have built their analytics on data warehouses and BI platforms. These are great for descriptive analytics, like calculating the number of hospital beds used last week, but lack the AI/ML capabilities to predict hospital bed use in the future. Organizations that have invested in AI typically treat these systems as siloed, bolt-on solutions. This approach requires data to be replicated across different systems resulting in inconsistent analytics and slow time-to-insight.

Unlocking the power of healthcare NLP with Databricks and John Snow Labs

Databricks and John Snow Labs – the creator of the open-source Spark NLP library, Spark NLP for Healthcare and Spark OCR – are excited to announce our new suite of solutions focused on helping healthcare and life sciences organizations transform their large volumes of text data into novel patient insights. Our joint solutions combine best-of-breed Healthcare NLP tools with a scalable platform for all your data, analytics, and AI.

Unlocking the power of healthcare NLP with Databricks Lakehouse Platform and John Snow Labs.

Serving as the foundation is the Databricks Lakehouse platform, a modern data architecture that combines the best elements of a data warehouse with the low cost, flexibility and scale of a cloud data lake. This simplified, scalable architecture enables healthcare systems to bring together all their data—structured, semi-structured and unstructured—into a single, high-performance platform for traditional analytics and data science.

At the core of the Databricks Lakehouse platform are Apache Spark™ and Delta Lake, an open-source storage layer that brings performance, reliability and governance to your data lake. Healthcare organizations can land all of their data, including raw provider notes and PDF lab reports, into a bronze ingestion layer of Delta Lake. This preserves the source of truth before applying any data transformations. By contrast, with a traditional data warehouse, transformations occur prior to loading the data, which means that all structured variables extracted from unstructured text are disconnected from the native text.
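
As a rough, hedged illustration of this bronze-layer pattern (the landing path and table name below are hypothetical), raw clinical notes can be appended to a Delta table without any transformation:

from pyspark.sql import functions as F

# Read raw provider notes as-is from a hypothetical landing location (one file per note)
raw_notes_df = (spark.read
    .option("wholetext", "true")
    .text("/mnt/landing/provider_notes/"))

# Land the untransformed text in a bronze Delta table, preserving the source of truth
(raw_notes_df
    .withColumn("ingest_ts", F.current_timestamp())
    .write
    .format("delta")
    .mode("append")
    .saveAsTable("bronze.provider_notes"))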

Building on this foundation is John Snow Labs’ Spark NLP for Healthcare, the most widely-used NLP library for healthcare and life science industries. The software seamlessly extracts, classifies and structures clinical and biomedical text data with state-of-the-art accuracy. This is done using production-grade, scalable and trainable implementations of recent healthcare-specific deep learning and transfer learning techniques, together with 200+ pre-trained and regularly updated models.

Notable capabilities of John Snow Labs’ software libraries include:

  • Out-of-the-box named entity recognition of over 100 clinical and biomedical entities – from symptoms & drugs to anatomy, social determinants, labs, imaging and genes
  • Resolving entities to the semantically nearest code of terminologies including SNOMED-CT, ICD-10-CM, ICD-10-PCS, RxNorm, LOINC, UMLS, MeSH, and HPO.
  • Pre-trained relation extraction models to detect 30+ relation types: between medical events, treatments and drugs, genes and phenotypes, and others.
  • Customizable detection, de-identification, and obfuscation of sensitive information from free text, PDF documents, scanned reports, and DICOM images.
  • Healthcare-specific word, chunk and sentence embeddings that are not available elsewhere and are regularly updated with new terminologies and content.

John Snow Labs’ Spark NLP for Healthcare library provides one of the most robust sets of capabilities and models for natural language processing in the industry.

Our joint solutions bring together the power of Spark NLP for Healthcare with the collaborative analytics and AI capabilities of Databricks. Informatics teams can ingest raw data directly into Databricks, process that data at scale with Spark NLP for Healthcare, and make it available for downstream SQL Analytics and ML, all in one platform. Both training and inference processes run directly within Databricks; beyond the benefits of speed and scale, this also means that data is never sent to a third party, a critical privacy and compliance requirement when processing sensitive medical data. Best of all, Databricks is built on Apache Spark™, making it the best place to run Spark applications like Spark NLP for Healthcare.
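
To make the flow concrete, here is a minimal sketch using the open-source Spark NLP pretrained-pipeline API; the clinical models discussed above ship with John Snow Labs' licensed Spark NLP for Healthcare library and are loaded in a similar way, so treat the pipeline name below as an open-source stand-in rather than a healthcare model:

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start (or attach to) a Spark session with Spark NLP available on the cluster
spark = sparknlp.start()

# Open-source pretrained pipeline used purely for illustration
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

note = "Patient was prescribed azithromycin 500 mg for pneumonia and denies SOB."
result = pipeline.annotate(note)

print(result["entities"])  # entities detected by the pipeline's NER stage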

An end-to-end workflow for processing, analyzing and modeling all of your data including clinical text with Databricks and John Snow Labs.

Getting started with healthcare natural language processing at scale

Join us for our upcoming virtual workshop, Extract Real-World Data with NLP, on July 15 to learn how to generate novel patient insights with natural language processing solutions from John Snow Labs and Databricks.

--

Try Databricks for free. Get started today.

The post Applying Natural Language Processing to Healthcare Text at Scale appeared first on Databricks.

A Shared Vision for Data Teams: Why Cubonacci Joined Databricks


Today, we are excited to announce that our company, Cubonacci, has joined the Databricks family. We founded Cubonacci in Amsterdam to enable businesses to build scalable and future-proof data science solutions. Our goal was to bring machine learning to the masses and we are excited to continue this mission through Databricks’ vision and strategy. More specifically: Jan van der Vegt, Sr. Product Manager, will be supporting the foundational layer of machine learning: scalable and secure data storage. Borre Mosch, Sr. Software Developer, will be focusing on jobs and workflows.

One question we keep hearing is: Why leave the entrepreneurial world for Databricks? In reality, it is our entrepreneurial drive that led us here. The company is growing at a nearly unheard-of speed and helping pave the future of data and AI. This move seems even more obvious when we look at how Databricks is making AI/ML more accessible and transforming how enterprises interact with and leverage their data.

In other words, there is no one reason why we joined Databricks. There are countless, but we broke down a few of the reasons we think it’s a perfect fit:

Culture of innovation (and ability to execute)

Every enterprise, every industry will need to adopt a data and AI strategy to stay ahead among an ever-growing amount of data. Data teams are the force behind the innovations that are changing how we live and solving some of the world’s toughest problems. The Databricks Lakehouse Platform transforms how data teams use all of their data for all of their data, analytics and AI workloads without the need to rely on clunky, multi-vendor solutions.

At Cubonacci, we shared this vision of empowering data teams (more specifically for us, data scientists) to shape the future of enterprise with solutions built to do just that. Operating in a competitive and risk-averse market made it difficult to grow as fast as we needed to. Joining Databricks– which has a strong history of introducing cutting-edge solutions to the market – allows us to tap into our strengths and bring innovations to life. We’re very excited to bring our own experience from the startup world into the exceptional product and engineering teams that make this happen.

Customer-first mindset and the adoption to back it up

One of Databricks’ core values is “be customer-obsessed,” which in essence means the company makes decisions based on customer and partner needs above all else. This aligns with our own philosophy as founders since we know every customer matters. Being able to bring this customer obsession and apply this at the scale of Databricks is very exciting. With many fast-growing companies, it can be hard to distinguish innovative technology from just great marketing. But to date, over 5,000 companies across the globe rely on Databricks to unlock value from their data, with many driving novel AI/ML and analytics use cases to tackle tough challenges.

Open source is an essential component of the technology world and something that we highly value as tech founders. Something that sets Databricks apart is its commitment to the community as a whole. The company is rooted in open source and has built 5 major open source projects thus far – Apache Spark™, Delta Lake, MLflow and Koalas. Delta Sharing, the fifth project, was just announced last month and is a total game-changer. Delta Sharing is the industry’s first open protocol for secure data sharing, making it simple to share data with other organizations regardless of which computing platform they use.

World-class talent across the map

Databricks is in a unique position in that its very foundations are in academia. This “researcher” mindset has translated into a corporate culture that is data-driven, analytical and unafraid to question the status quo. And with development happening so fast, there’s room for us to bring our entrepreneurial spirit and make a huge impact on the product roadmap.

The opportunity to learn from the global team is one of the most appealing aspects of joining Databricks. We’re excited to work out of the Amsterdam HQ, which is the second-largest engineering office at Databricks. This team is rapidly growing and operates under the leadership of former innovators from Google, Facebook, Amazon and more.

This is a new chapter for us. Building a company from scratch taught us a lot, and we’re excited to apply these lessons to Databricks. We’re also growing our team in Amsterdam. Interested in joining us? Check out our current openings!

--

Try Databricks for free. Get started today.

The post A Shared Vision for Data Teams: Why Cubonacci Joined Databricks appeared first on Databricks.

Democratizing Data and AI in Finserv: DAIS 2021 Takeaways


For financial services providers, driving business forward with data is a longstanding practice—but as machine learning (ML) and artificial intelligence (AI) technologies improve, understanding the potential for data-based roadblocks is key.

This year’s Data and AI Summit (DAIS) hosted global attendees across more than 200 sessions and highlighted leaders from ABN AMRO, Intuit, Capital One, S&P Global and Northwestern Mutual as they discussed how to unlock the full capabilities of data, ML and AI across all of finance and financial tech services.

Speakers shared their perspectives on the biggest issues facing financial institutions today, including data sharing and openness, cloud migration and Big Data management, as well as data’s critical role in shaping strategic decision-making. Most of our sessions will be available on-demand, but to help you navigate the content, here’s a rundown of what’s top-of-mind for Financial Services data teams and leaders.

Keynote & Experts Panel: The Biggest Data Challenges in FinServ

Kicking off the Financial Services sessions with an inspiring keynote, Northwestern Mutual’s Chief Data Officer Don Vu dug into historical data hurdles like fragmented information, governance and blind spots in the user experience. He detailed Northwestern Mutual’s digital transformation and how the company used Databricks to modernize its data infrastructure, unify all of its data across multiple system inputs, and gain a complete, 360-degree view of its customers. This transformation allows the company to align its data extraction and analysis strategies with the overarching goals of the business, leading to the development of innovative new features and greatly enhancing its integrated digital experience. Today, data is the engine powering every business decision.

Next on the virtual DAIS stage was a panel of industry experts moderated by Junta Nakai, Financial Services Industry Leader & Regional Vice President at Databricks. Panelists discussed their own journeys with data and AI, including common challenges they’ve faced and advice on implementing platforms and tools to support data ingestion and analysis at scale. Here’s what they covered:

The importance of openness

Nakai from Databricks opened the discussion with the open banking initiative and how, with the help of Databricks’ Lakehouse Platform, companies in the financial sector can “take all the transactional data, third party data, alternate data sources, such as social media and different types of data that come in multiple forms, and land them in one place so companies can do data science and machine learning from a single source of truth.”

ABN AMRO’s Head of Data Engineering Marcel Kramer touched on the importance of openness in data systems and how beneficial collaboration between data scientists and engineers can be to create seamless data handovers in the same environment.

With the Databricks Lakehouse Platform, they’re able to deliver new solutions 10x faster than before and with pinpoint accuracy — something their legacy infrastructure was unable to accomplish. A lakehouse architecture enables faster, more reliable data pipelines to feed complete, accurate data into their ML models, improving downstream financial analytics and overall collaboration.

“Openness in data systems helps data scientists and engineers collaborate better with seamless handovers in the same environment,” said Kramer. Additionally, ABN AMRO could identify patterns of human trafficking being perpetrated based on the combination of transactions alone — showing how data can be used for human rights and social good.

Innovate with the customer in mind

Intuit Principal Engineer Bharath Ramarathinam spoke about the necessity of openness and open source software, detailing how they accelerated Intuit’s data transformation by allowing them to leverage multiple partners and their community to achieve business goals. He also mentioned how his team “designs to delight customers, and data is critical to understanding what users want from your business”.

One common thread we often hear from customers is the growing importance of driving social and environmental impact. M&G’s Head of ESG & Research Transformation Priyank Patwa discussed the environmental impact of investments, the importance of quantifying qualitative environmental, social and governance (ESG) information to develop insights, and how using enhanced AI and ML makes that possible. Nakai explained that companies can “literally help to curb emissions through investments and data” by quantifying their ESG efforts.

Databricks partner KX Data is the world’s fastest streaming analytics platform and integrates seamlessly with the Databricks Lakehouse Platform for high-volume, continuous, real-time stock market intelligence. Their Managing Director, Conor Twomey, discussed how fast, real-time ML-powered insights are shaping the way people invest and working with “continuous intelligence is the smartest way to maintain value for customers”.

Main takeaways and moving forward

In addition to sharing anecdotal learnings, speakers offered a few core, overarching insights regarding the need for companies to use data and AI —both to minimize risk and create more engaging customer experiences, as well as  to drive future growth.

Openness in data is mission critical. Streamlining collaboration across different teams and entities accelerates transformation and is vital to the accuracy and depth of your data.

Data transformation depends on cross-functional engagement. Getting stakeholder approval and engagement across business units is important to data transformation; those involved are more invested when benefits are clear and outcomes align with individual goals.

We haven’t seen anything yet. More and more companies are leveraging data, ML and AI for a variety of tangible use cases in financial tech services. These technologies deliver exponential value in both information and profits, and experts believe we’re only just beginning to tap into the potential of data platforms.

Click below to watch the DAIS Financial Services sessions in full, and explore technical sessions and demos on-demand.

WATCH FS FORUM FROM SUMMIT NOW

--

Try Databricks for free. Get started today.

The post Democratizing Data and AI in Finserv: DAIS 2021 Takeaways appeared first on Databricks.

Four E-commerce Challenges That Can Be Addressed With Data + AI


The global health crisis accelerated the adoption of omnichannel shopping and fulfillment. Consumers spent $861.12 billion online with US merchants in 2020, up an incredible 44% compared to the previous year, which marks the highest annual growth in U.S. e-commerce in at least two decades. To keep pace with this shift and sell more effectively, businesses have substantially moved investments to online infrastructure, such as e-commerce platforms, inventory management, product recommendations, chatbots and delivery.

Infographic exploring the four customer challenges driving e-commerce profitability.

On one hand, setting up e-commerce sites and/or optimizing online stores means increased sales and market penetration; on the other, these benefits are potentially outweighed by the increased costs as retailers essentially shift a part of their business to logistics and fulfillment. As businesses make the transition to online retail, they will have to focus on these four key customer areas to ensure profitability: fraud, delivery theft, returns and customer service. Strategically approaching these focus areas with data and artificial intelligence (AI) brings visibility, accuracy and automation, which helps brands better serve customers, provides a competitive advantage and drives loyalty — key success drivers for e-commerce businesses.

The customer challenges retailers face have a significant impact on their bottom line. Here’s how data and AI can address them.

Fraud: Fraud has become all too commonplace; on average, every $1 of fraud costs companies $3.36 in chargeback, replacement and operational costs. Losses associated with fraud soared to $56 billion in 2020 and accompanied a huge dip in customer confidence in the brand. Data and AI can help retailers get ahead of fraud and avoid financial and reputational damage, especially when it comes to proactive approaches.

At Databricks, we have a suite of Solution Accelerators that use rules, machine learning and geospatial data to detect and prevent fraud. Learn how to quickly get started with the solutions here. Additionally, the Databricks Lakehouse Platform, which delivers the data management and performance typically found in data warehouses with the low-cost, flexible object stores offered by data lakes, effectively enables anomaly detection at massive scale to protect against losses caused by fraud in real time.

Delivery package theft: Exacerbated by the rise of online shopping from the pandemic, package theft is a huge operational burden on retailers. Some estimates report that, in 2020, 1.7 million packages were stolen or lost daily. The shipping industry is also using AI to enhance security measures, both within and outside of business grounds. Shipping carriers use drones to patrol the grounds around their warehouses to collect real-time information and data. Big data analytics can help logistics providers identify common sites of traffic accidents or package thefts and design their services around those. Locker services for apartment lobbies, alternative pickup spots, specific time deliveries are some examples. The Databricks Lakehouse Platform enables analytics and AI use cases to identify such hotspots and frame apt responses.

Returns & reverse logistics: Companies have also incorporated predictive analytics using data and AI in their returns and reverse-logistics operations, leading to an improvement in service levels with fewer queries and reported issues. They’ve also achieved freight savings of 5-10% by reducing last-minute load requests. The Databricks Lakehouse Platform allows companies to tap into vast amounts of data and unlock actionable insights to determine high-return items or customer behaviors, helping retailers prepare strategies to minimize returns.

Customer service and cost to serve: Customer satisfaction, customer retention and cost to serve are three factors that can define the long-term profitability for retailers. The drivers of these KPIs are strongly interlinked. Although customer issues can be wide-ranging, many issues will be common amongst customers. Natural Language Processing (NLP) tools can analyze call notes to identify the straightforward and most common issues. These can be tackled by blending digital and call center channels and driving self-service usage for common queries. According to IBM, businesses can reduce customer service costs by up to 30% by implementing such solutions. Additionally, a database of customer context insights can be connected to care flow through IVR to intercept complex calls and route them to the agents who are equipped to resolve the identified issues. This avoids long wait times, multiple transfers and repeat calls — all of which can impact customer satisfaction, retention and cost to serve. Finally, businesses can arm their care agents with an at-a-glance view of customer health, context, history and next-best-step suggestions while the customer is on the call. This is especially useful while dealing with an at-risk customer identified by ML models who has been routed to the retention specialist.
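
As a simplified sketch of the idea (not a production NLP pipeline; the table and column names are hypothetical), common issue terms can be surfaced from call notes with standard Spark ML text utilities:

from pyspark.ml.feature import Tokenizer, StopWordsRemover
from pyspark.sql import functions as F

# Hypothetical table of free-text call-center notes
notes_df = spark.table("support.call_notes")

tokens = Tokenizer(inputCol="note_text", outputCol="words").transform(notes_df)
cleaned = StopWordsRemover(inputCol="words", outputCol="filtered").transform(tokens)

# Rank terms by frequency to surface the most common customer issues
(cleaned
    .select(F.explode("filtered").alias("term"))
    .groupBy("term")
    .count()
    .orderBy(F.desc("count"))
    .show(20))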

Check out our new e-commerce infographic for a look at the quantifiable impact of these challenges affecting every retail business. The issues outlined in this infographic have a severe impact on customer experience and can prevent them from coming back again. Learn how Databricks can help take on these challenges with our ebook The Retail Lakehouse: Build Resiliency and Agility in the Age of Disruption.

Open your world with Data & AI to uncover the customer journey and solve for these challenges. Data and AI can help you achieve customer-centricity, win loyal customers and generate sustainable revenue over time.

--

Try Databricks for free. Get started today.

The post Four E-commerce Challenges That Can Be Addressed With Data + AI appeared first on Databricks.


Down to the Individual Grain: How John Deere Uses Industrial AI to Increase Crop Yields Through Precision Agriculture


Recently, The Verge spoke with Jahmy Hindman, CTO at John Deere, about the transformation of the company’s farm equipment over the last three decades from purely mechanical to, as Jahmy calls them, “mobile sensor suites that have computational capability.” This is in service to John Deere’s “smart industrial” strategy. More than just selling a piece of equipment, smart industrial is about providing the whole system (equipment, data, analysis and automation) that farmers need in order to provide individualized care (exact amount of water, nutrients and pesticides) at scale to each of the tens of thousands of plants per acre (multiplied by thousands of acres per farm), leading to greater yields and lower waste.

At this year’s Data + AI Summit (DAIS), Gregory Finch (Senior Principal Software Engineer, Intelligent Solutions Group) and Jake Sankey (Technical Product Manager, Enterprise Data & Analytics Platforms) from John Deere went in-depth about the data platform that makes this possible during their manufacturing keynote. As the amount of data generated by equipment doubles or triples per year, Deere needed a data platform that could handle this scale of data now and in the future, easily integrate new data sources (e.g., weather) and then unify it so that different downstream teams — like sales, service, or engineering — could improve customer results.

As Jake explained, “our technology stack is really vast…It consists of onboard and offboard components. On the onboard side, we have sensors, tons of them. We have vision systems, guidance systems and wireless connectivity. Offboard we have cloud infrastructure and storage and scalable services that allow us to receive and process and analyze all that data. This stack is what enables us to help our customers be more productive and more successful.”

As an example, he points to the X9 Combine (the machine that harvests grain crops) where, “cameras continually monitor images of the grains down to individual kernels as they’re taken up the combine’s elevator and dumped into the tank. We use machine learning to analyze grain quality and automatically adjust the operating parameters of the machine if any damage is detected to the grains.”

These sorts of advancements don’t just help the farmer, but have broader social benefits as well. Through precision agriculture, farmers can reduce chemical use by 70%, reducing environmental impacts of pesticide overuse.

Throughout this keynote, Jake and Greg talk about how a 184-year-old enterprise is leading the transformation of the industry as data and artificial intelligence (AI) become more prominent tools of the trade—from execution on the shop floor to how things work in the customer’s hand.

--

Try Databricks for free. Get started today.

The post Down to the Individual Grain: How John Deere Uses Industrial AI to Increase Crop Yields Through Precision Agriculture appeared first on Databricks.

Now in Databricks: Orchestrate Multiple Tasks With Databricks Jobs


READ DOCUMENTATION

As companies undertake more business intelligence (BI) and artificial intelligence (AI) initiatives, the need for simple, clear and reliable orchestration of data processing tasks has increased. Previously, Databricks customers had to choose whether to run these tasks all in one notebook or use another workflow tool and add to the overall complexity of their environment.

Today, we are pleased to announce that Databricks Jobs, available in public preview, now supports task orchestration — the ability to run multiple tasks as a directed acyclic graph (DAG). A job is a non-interactive way to run an application in a Databricks cluster, for example, an ETL job or data analysis task you want to run immediately or on a scheduled basis. The ability to orchestrate multiple tasks in a job significantly simplifies creation, management and monitoring of your data and machine learning workflows at no additional cost. Benefits of this new capability include:

Simple task orchestration
Now, anyone can easily orchestrate tasks in a DAG using the Databricks UI and API. This eases the burden on data teams by enabling data scientists and analysts to build and monitor their own jobs, making key AI and ML initiatives more accessible. The following example shows a job that runs seven notebooks to train a recommender machine learning model.

Example recommender workflow showing how easy it is for data teams to use Databricks’ UI and API to easily orchestrate tasks in a DAG.

Orchestrate anything, anywhere
Jobs orchestration is fully integrated in Databricks and requires no additional infrastructure or DevOps resources. Customers can use the Jobs API or UI to create and manage jobs and features, such as email alerts for monitoring. Your data team does not have to learn new skills to benefit from this feature. This feature also enables you to orchestrate anything that has an API outside of Databricks and across all clouds, e.g. pull data from CRMs.
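
As a hedged sketch of what a multi-task job definition could look like through the Jobs API (the endpoint, field names, notebook paths and cluster ID below are assumptions for illustration; consult the linked documentation for the authoritative schema):

import requests

host = "https://<your-workspace>.cloud.databricks.com"  # hypothetical workspace URL
token = "<personal-access-token>"                       # hypothetical token

job_spec = {
    "name": "ingest-then-train",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/demo/ingest"},
            "existing_cluster_id": "<cluster-id>",
        },
        {
            "task_key": "train",
            "depends_on": [{"task_key": "ingest"}],  # DAG edge: run only after 'ingest' succeeds
            "notebook_task": {"notebook_path": "/Repos/demo/train"},
            "existing_cluster_id": "<cluster-id>",
        },
    ],
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec)
print(resp.json())  # expected to return the new job_id on success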

Next steps
Task Orchestration will begin rolling out to all Databricks workspaces as a Public Preview starting July 13th. Over the course of the following months, we will also enable you to reuse a cluster across tasks in a job and restart the DAG such that it only runs the tasks again that had previously failed.

Read more about task orchestration and multiple tasks in a Databricks Job, then go to the admin console of your workspace to enable the capability for free.

--

Try Databricks for free. Get started today.

The post Now in Databricks: Orchestrate Multiple Tasks With Databricks Jobs appeared first on Databricks.

Using Your Data to Stop Credit Card Fraud: Capital One and Other Best Practices


Fraud is a costly and growing problem – research estimates that $1 of fraud costs companies 3.36x in chargeback, replacement and operational costs. Adding to the pain, according to experts, there are not enough regulations to protect small businesses from chargebacks and losses from fraud. Despite significant advancements in credit card fraud detection and adaptations in risk management techniques, fraudsters are still able to find loopholes and exploit the system. For credit card companies, the threat of fraudulent card usage is a constant, which results in the need for accurate credit card fraud detection systems. All organizations are at risk of fraud and fraudulent activities, but that risk is especially burdensome to those in financial services. “Threats can originate from internal or external sources, but the effects can be devastating – including loss of consumer confidence, incarceration for those involved, and even the downfall of corporations,” says Badrish Davay, a Data Engineering and Machine Learning leader at Capital One. CNBC reports that the US is the most credit card fraud-prone country in the world.

Fraud detection using machine learning 
It’s not all bad news, though. With modern advancements, businesses are able to stay ahead of threats by leveraging data and machine learning. As part of a tech talk at the recent Data + AI Summit, we were able to get a glimpse into how Capital One is using data and artificial intelligence (AI) to address fraud. Badrish Davay from Capital One shared how we can utilize state-of-the-art ML algorithms to stay ahead of the attackers and, at the same time, constantly learn new ways a system is being exploited. “In order to more dynamically detect fraudulent transactions, one can train ML models on a dataset, including credit card transaction data, as well as card and demographic information of the cardholder. Capital One uses Databricks to achieve this goal,” noted Davay.

Capital One analyzes all the fraudulent activities to understand what to look for in credit card fraud. Davay presented the six “W” questions they ask – what, who, when, where, why and what if? – to uncover trends in fraudulent activities. Davay highlighted various scenarios in which card information may be compromised and how data can help with anomaly detection and identifying fraud. For example, he shared how geospatial data can detect stolen card information when it is being used away from its actual location, along with temporal data to determine fraud.

As Davay also explained, when a customer physically loses a card but doesn’t notify the organization, contextual information (e.g. work hours, spending habits, etc.) can help determine if transactions are routine or anomalous. A key takeaway from Davay is that we should be able to combine multiple independent signals to get a wider context around transaction and demographic data. With the availability of data and advancements in ML, fraud prevention is a key area in which ML is changing both workflows and outcomes, allowing organizations to stay ahead of increasingly technologically advanced criminals.

 Acceleration of in-depth 6-W analysis of credit card fraud detection.

Today’s businesses are facing an increasingly sophisticated enemy that attacks, responds and changes tactics extremely quickly. Due to dynamics of fraud, organizations need AI to constantly adapt to changing behaviours and patterns. AI brings agility that rules do not. With data analytics and ML, companies can get ahead of threats. Below are some key reasons why ML is apt for taking on fraud:

  • Fraud hides under massive amounts of data: The most effective way to detect fraud is to look at the overall behaviors of end users. Looking at transactions or orders is not enough — we need to follow the events leading up to and after the transaction. This culminates into a lot of structured and unstructured data, and the best way to detect fraud in such huge volumes is with ML and AI.
  • Fraud happens quickly: When an ML system updates in real time, that knowledge can be used within milliseconds to update fraud detection models and prevent an attack.
  • Fraud is always changing: Fraudsters constantly adapt their tactics, making them difficult for humans to detect  – and impossible for static rules-based systems, which don’t learn. ML, however, can adapt to changing behavior.
  • Fraud looks fine on the surface: To the human eye, fraudulent and normal transactions don’t appear any differently from each other. ML has a deeper and more nuanced way of viewing data, which helps avoid false positives.

Davay discussed how ML uses statistical models, such as classifiers and logistic regression, to look at past outcomes and anomalies to predict future outcomes. An ML system can learn, predict and make decisions as data comes in real time. In his presentation, Davay outlined what a good fraud prevention model needs to have:

  • A one-stop shop for users to train the model and orchestrate execution
  • Real-time detection
  • Deep analytics and modeling by leveraging powerful ML tools, such as deep learning and neural networks, for what-if data analysis and testing new hypotheses
  • Adherence to company security policy and compliance requirements
  • A notification service to inform cardholders immediately of suspicious activity
  • Seamless integration with enterprise systems

MLflow in fraud prevention 
Davay highlighted the value of Databricks and MLflow in their fraud prevention efforts. He talked about the platform and how different data and fraud teams collaboratively develop and run experiments with the team using Databricks. “Even though they share experiments and data collaboratively within the team, we can implement stringent security measures in order to respect data privacy, and each experiment can have its own compute environments and requirements,” said Davay.  He referred to Databricks as “a one-stop shop for all of [their] data science and models, making it perfect for data science projects.” When the team has identified features for predicting whether a transaction is fraudulent or not, they pass these data points to Databricks’ hosted environment, where they can then perform feature engineering, data pre-processing and split the data into test and training sets. They then use a variety of supervised or unsupervised ML algorithms, such as SVM, decision tree and random forest, to train a model. They identify the best performing model and use the Databricks Lakehouse Platform to solve for fraud directly from within the platform. The lakehouse is a conducive environment for fraud detection and you can learn more from our solution accelerators here.
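
As a generic, simplified sketch of this train-and-track step (not Capital One's actual code; the feature table, label column and parameters are hypothetical), a candidate model can be trained and logged with MLflow like so:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical table of engineered transaction features with a binary fraud label
pdf = spark.table("fraud.transaction_features").toPandas()
X = pdf.drop("is_fraud", axis=1)
y = pdf["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="random_forest_baseline"):
    model = RandomForestClassifier(n_estimators=200, max_depth=8)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("test_auc", auc)
    mlflow.sklearn.log_model(model, "model")  # logged for later comparison and serving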

Capital One Credit Card Detection ML Workflow, leveraging the Databricks Lakehouse Platform.
Davay mentioned how “MLflow within the Databricks ecosystem is a great feature that we can use because it has numerous advantages in developing the ML workflow pipeline seamlessly.” MLflow allows Capital One to track their ML experiments from end-to-end throughout the ML model lifecycle. During the talk, Davay mentioned they can run experiments directly from GitHub without the need to go through the code and can directly deploy and train models by serializing them while utilizing packages such as Python’s pickle module, Apache Spark, and  MLflow. They then deploy the serialized model and serve it as an API by harnessing MLflow.

With MLflow, Capital One can run credit card anti-fraud ML experiments directly from GitHub.

MLflow and microservices
Davay also touched on microservices and why they are useful in MLflow. A microservice is a gateway to a specific functional aspect of an application. It helps teams like Capital One’s develop applications in a standardized consistent manner over time. Microservices allow Capital One to deploy functionality of applications independent of each other. It helps abstract the functionality while enabling the team to build in a reusable and uniform way of interacting with an application. Furthermore, it lets teams compose complex behavior by combining a variety of other microservices together. Essentially, it empowers companies to use any tech stack in the backend while maintaining compatibility on the front end.

With Capital One’s raw data stored in Amazon S3, they can seamlessly integrate interactions between S3 and their framework through Databricks and massively scale ML model training, validation and deployment pipelines through MLflow. Their team trains and validates models on custom clusters in AWS and deploys them through SageMaker directly by using MLflow APIs. MLflow is not limited to AI; it can embed any piece of business logic (as mentioned in the Databricks Rules + AI accelerator) and, as such, benefits from the same E2E governance and delivery principles as microservices.

 Databricks credit card fraud detection microservice control flow

Putting it all together 
Davay shared how Databricks allows Capital One to query and deploy models and to manage and clean up deployments using MLflow APIs within the AWS ecosystem. In addition, they can ensure secure, conditional access via AWS SSO.

Based on observations from Capital One and many other customers, there are several benefits of using data and AI for fraud prevention, including:

  • Reduced need for manual review. ML automates processes in which behaviors can be learned at the individual level and anomalies detected.
  • The ability to prevent fraud cases without impeding the user experience. AI brings automation to the process seamlessly and prevents fraud in advance without burdening users.
  • Lower operational costs than other approaches. With less manual work and more automation, data and AI require fewer resources and preempt losses associated with fraud.
  • Frees up teams’ time to focus on more strategic tasks. Most companies are not in the business of fraud detection, and an ML fraud prevention process can help them focus on core activities.
  • Adapts quickly. Coupled with human talent and experience, data and AI work together to constantly learn and adjust to new user behaviors and trends.

When it comes to operationalizing data and AI to build customer relationships and drive higher returns on equity, fraud should be considered a top priority. Curbing fraudulent or malicious behavior – such as fraudulent card transactions – is key to mitigating negative revenue impact. To more dynamically detect fraudulent transactions, Capital One uses ML and credit card transaction information, as well as card and demographic information, to get a comprehensive view and identify anomalies. Data-driven innovators such as Capital One are paving the way in fraud detection and provide a successful model to follow to protect customers and the business.

Get started with fraud prevention in Databricks

Get a jump start with our prebuilt code and guides in our Fraud Solution Accelerators.

See all our Financial Services solutions

--

Try Databricks for free. Get started today.

The post Using Your Data to Stop Credit Card Fraud: Capital One and Other Best Practices appeared first on Databricks.

Driving Transformation at Northwestern Mutual (Insights Platform) by Moving Towards a Scalable, Open Lakehouse Architecture


This is a guest authored post by Madhu Kotian, Vice President of Engineering (Investment Products Data, CRM, Apps and Reporting) at Northwestern Mutual.

 
Digital Transformation has been front and center in most contemporary big data corporate initiatives, especially in companies with a heavy legacy footprint. One of the underpinning components in digital transformation is data and its related data store. For 160+ years, Northwestern Mutual has been helping families and businesses achieve financial security. With over $31 Billion in revenue, 4.6M+ clients and 9,300+ financial professionals, there are not too many companies that have this volume of data across a variety of sources.

Data ingestion is a challenge in this day and age when organizations deal with millions of data points coming in different formats, time frames and from different directions at an unprecedented volume. We want to make data ready for analysis so we can make sense of it. Today, I am excited to share our novel approach to transforming and modernizing our data ingestion process, our scheduling process and our journey with data stores. One thing we learned is that an effective approach is multifaceted, which is why, in addition to the technical details, I’ll walk through our plan for onboarding our team.

Challenges faced

Before we embarked on our transformation, we worked with our business partners to really understand our technical constraints and help us shape the problem statement for our business case.

The business pain point we identified was a lack of integrated data, with customer and business data coming from different internal and external teams and data sources. We realized the value of real-time data but had limited access to production/real-time data that could enable us to make business decisions in a timely manner. We also learned that data stores built by the business team resulted in data silos, which in turn caused data latency issues, increased cost of data management and unwarranted security constraints.

Furthermore, there were technical challenges with respect to our current state. With increased demand and additional data needed, we experienced constraints with infrastructure scalability, data latency, cost of managing data silos, data size and volume limitations and data security issues. With these challenges mounting, we knew we had a lot to take on and needed to find the right partners to help us in our transformation journey.

Solution analysis

We needed to become data-driven to be competitive, serve our customers better and optimize internal processes. We explored various options and performed several POCs to select a final recommendation. The following were the must-haves for our go-forward strategy:

  1. An all-inclusive solution for our data ingestion, data management and analytical needs
  2. A modern data platform that can effectively support our developers and business analysts to perform their analysis using SQL
  3. A data engine that can support ACID transactions on top of S3 and enable role-based security
  4. A system that can effectively secure our PII/PHI information
  5. A platform that can automatically scale based on the data processing and analytical demand

Our legacy infrastructure was based on the MSBI stack. We used SSIS for ingestion, SQL Server for our data store, Azure Analysis Services for the tabular model and Power BI for dashboarding and reporting. Although the platform met the needs of the business initially, we had challenges scaling with increased data volume and data processing demand, which constrained our analytical capabilities. With additional data needs, load delays caused data latency issues, and data stores built for specific business needs caused data silos and data sprawl.

Security became a challenge due to the spread of data across multiple data stores. We had approximately 300 ETL jobs, and our daily loads took more than 7 hours to complete. The time to market for any change or new development was roughly 4 to 6 weeks (depending on the complexity).

Northwestern Mutual’s legacy data analytics stack prior to its data modernization initiative.

Figure 1: Legacy Architecture

After evaluating multiple solutions in the marketplace, we decided to move forward with Databricks to help us deliver one integrated data management solution on an open lakehouse architecture.

Databricks, being developed on top of Apache Spark™, enabled us to use Python to build our custom framework for data ingestion and metadata management. It provided us the flexibility to perform ad hoc analysis and other data discovery using notebooks. Databricks Delta Lake (the storage layer built on top of our data lake) provided us the flexibility to implement various database management functions (ACID transactions, metadata governance, time travel, etc.), including the implementation of required security controls. Databricks took the headache out of managing and scaling the cluster and reacted effectively to the pent-up demand from our engineers and business users.
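
For example, the time travel capability mentioned above takes only a couple of lines of standard Delta Lake syntax (the table path and version below are illustrative):

# Read the current state of a Delta table
current_df = spark.read.format("delta").load("/mnt/lake/silver/accounts")

# Time travel: reproduce the table as of an earlier version for audit or debugging
v5_df = (spark.read
    .format("delta")
    .option("versionAsOf", 5)
    .load("/mnt/lake/silver/accounts"))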

Northwestern Mutual’s modernized data analytics architecture with Databricks.

Figure 2: Architecture with Databricks

Migration approach and onboarding resources 

We started with a small group of engineers and assigned them to a virtual team drawn from our existing scrum teams. Their goal was to execute different POCs, build out the recommended solution, develop best practices and then transition back to their respective teams to help with onboarding. Leveraging existing team members worked in our favor because they had legacy system knowledge, understood the current ingestion flows and business rules, and were well versed in at least one programming language (a mix of data engineering and software engineering knowledge). This team first trained themselves in Python, understood the intricate details of Spark and Delta, and closely partnered with the Databricks team to validate the solution and approach. While this team was working on forming the future state, the rest of our developers worked on delivering the business priorities.

Since most of the developers were MSBI Stack engineers, our plan of action was to deliver a data platform that would be frictionless for our developers, business users, and our field advisors.

  • We built an ingestion framework that covered all our data load and transformation needs. It had in-built security controls, which maintained all the metadata and secrets of our source systems. The ingestion process accepted a JSON file that included the source, target and required transformation. It allowed for both simple and complex transformation.
  • For scheduling, we ended up using Airflow but given the complexity of the DAG, we built our own custom framework on top of Airflow, which accepted a YAML file that included job information and its related interdependencies.
  • For managing schema-level changes using Delta, we built our own custom framework which automated different database type operations (DDL) without requiring developers to have break-glass access to the data store. This also helped us to implement different audit controls on the data store.
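
To give a flavor of the first framework above, here is a heavily simplified sketch of the idea (not Northwestern Mutual's implementation; the JSON fields and table names are invented for illustration):

import json

# Illustrative ingestion config; the real framework also manages secrets and security metadata
config = json.loads("""
{
  "source_path": "/mnt/raw/investments/positions/",
  "source_format": "parquet",
  "target_table": "silver.positions",
  "transform_sql": "SELECT account_id, CAST(position_value AS DOUBLE) AS position_value FROM __source__"
}
""")

# Load the source, expose it to SQL, apply the declared transformation, write to Delta
src = spark.read.format(config["source_format"]).load(config["source_path"])
src.createOrReplaceTempView("__source__")
transformed = spark.sql(config["transform_sql"])
transformed.write.format("delta").mode("append").saveAsTable(config["target_table"])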

 

In parallel, the team also worked with our Security team to make sure we understood and met all the criteria for data security (Encryption in Transit, Encryption at Rest and column level encryption to protect PII information).

Once these frameworks were set up, the cohort team deployed an end-to-end flow (Source to target with all transformation) and generated a new set of reports/dashboards on PowerBI pointing to Delta Lake. The goal was to test the performance of our end-to-end process, validate the data and obtain any feedback from our field users. We incrementally improved the product based on the feedback and outcomes of our performance/validation test.

Simultaneously, we built training guides and how-tos to onboard our developers. Soon after, we decided to move the cohort team members to their respective teams while retaining a few to continue to support the platform infrastructure (DevOps). Each scrum team was responsible for managing and delivering their respective set of capabilities/features to the business. Once the team members moved back to their respective teams, they embarked on the task of adjusting the velocity of the team to include the backlog for the migration effort. The team leads gave specific guidance and set appropriate goals to meet the migration targets for different Sprint/Program Increments. The team members who were in the cohort group were now the resident experts, and they helped their teams onboard to the new platform. They were available for any ad-hoc questions or assistance.

As we incrementally built our new platform, we retained the old platform for validation and verification.

The beginning of success 

The overall transformation took us roughly a year and a half, which is quite a feat given that we had to build all the frameworks, manage business priorities, manage security expectations, retool our team and migrate the platform. Overall load time came down remarkably from 7 hours to just 2 hours. Our time to market was roughly 1 to 2 weeks, down significantly from 4-6 weeks. This was a major improvement that I know will extend benefits to our business in several ways.

Our journey is not over. As we continue to enhance our platform, our next mission will be to expand on the lakehouse pattern. We are working on migrating our platform to E2 and deploying Databricks SQL. We are working on our strategy to provide a self-service platform to our business users to perform their ad-hoc analysis and also enable them to bring their own data with an ability to perform analysis with our integrated data. What we learned is that we benefited greatly by using a platform that was open, unified and scalable. As our needs and capabilities grow, we know we have a robust partner in Databricks.


About Madhu Kotian

Madhu Kotian is the Vice President of Engineering (Investment Products Data, CRM, Apps and Reporting) at Northwestern Mutual. He has more than 25 years of experience in the field of Information Technology, with expertise in data engineering, people management, program management, architecture, design, development and maintenance using Agile practices. He is also an expert in data warehouse methodologies and the implementation of data integration and analytics.

--

Try Databricks for free. Get started today.

The post Driving Transformation at Northwestern Mutual (Insights Platform) by Moving Towards a Scalable, Open Lakehouse Architecture appeared first on Databricks.

Feature Engineering at Scale


Feature engineering is one of the most important and time-consuming steps of the machine learning process. Data scientists and analysts often find themselves spending a lot of time experimenting with different combinations of features to improve their models and to generate BI reports that drive business insights. The larger, more complex datasets with which data scientists find themselves wrangling exacerbate ongoing challenges, such as how to:

  1. Define features in a simple and consistent way
  2. Find and reuse existing features
  3. Build upon existing features
  4. Maintain and track versions of features and models
  5. Manage the lifecycle of feature definitions
  6. Maintain efficiency across feature calculations and storage
  7. Calculate and persist wide tables (>1000 columns) efficiently
  8. Recreate features that created a model that resulted in a decision that must be later defended (i.e. audit / interpretability)

In this blog, we present design patterns for generating large-scale features. A reference implementation of the design patterns is provided in the attached notebooks to demonstrate how first-class design patterns can simplify the feature engineering process and facilitate efficiency across the silos in your organization. The approach can be integrated with the recently launched Databricks Feature Store, the first of its kind co-designed with an MLOps and data platform, and can leverage the storage and MLOps capabilities of Delta Lake and MLflow.

In our example, we use the TPC-DS training dataset to demonstrate the benefits of a first-class feature-engineering workflow with Apache Spark™ at scale. We twist and transform base metrics, such as sales and transactions, across dimensions like customer and time to create model-ready features. These complex transformations are self-documenting, efficient and extensible. A first-class feature-engineering framework is not industry specific, yet must be easily extended to facilitate the nuance of specific organizational goals. This extensibility is demonstrated in this blog through the use of adapted, higher-order functions applied within the framework.

Our approach to feature engineering is also designed to address some of the biggest challenges around scale. For nearly every business, data growth is exploding, and more data leads to more features, which exponentially compounds the challenge of feature creation and management – regardless of the industry. The framework discussed in this blog has been explored and implemented across several industries, some of which are highlighted below.

A first-class feature-engineering framework is not industry specific yet must be easily extended to facilitate the nuance of specific organizational goals.

Architectural overview

The design patterns in this blog are based upon the work of Feature Factory. The diagram below shows a typical workflow. First, base features are defined on the raw data; they are the building blocks of further features. For example, a total_sales feature can be defined as a base feature that sums up sales_value grouped by customer. Derived features can then be used as inputs to create more complex manipulations from the base. A multitude of features can be rapidly generated, documented, tested, validated and persisted in a few lines of code.

Typical feature engineering workflow based upon the work of Feature Factory.

Feature definitions are applied to the raw data to generate features as dataframes and can be saved to the Feature Registry using Feature Store APIs. Delta Lake provides multiple optimizations that the feature generation engine leverages. Furthermore, feature definitions are version controlled, enabling traceability, reproduction, temporal interpretability and audit as needed.

The code example below shows how feature definitions can be materialized and registered to the Feature Store.

def compute_customer_features(data):
    features = StoreSales()
    fv_months = features.total_sales.multiply("i_category", ["Music", "Home", "Shoes"]).multiply("month_id", [200012, 200011, 200010])
    df = append_features(data, [features.collector], fv_months)
    return df

customer_features_df = compute_customer_features(src_df)
Example feature definitions materialized and registered to the Feature Store.

Result from simple feature multiplication

from databricks import feature_store

fs = feature_store.FeatureStoreClient()
fs.create_feature_table(
    name="db.customer_features",
    keys=["customer_id"],
    features_df=customer_features_df,
    description="customer feature table",
)
Example features created in Feature Store

Features created in Feature Store

Example: The result shows that total_sales_Music_200012 for customer_id 46952 was 1752.68, which, in plain English, means that this customer purchased $1,752.68 worth of music (as defined by total_sales) in December of the year 2000.

Dataset

The reference implementation is based on, but not limited to, the TPC-DS dataset, which has three sales channels: Web, Store and Catalog. The code examples in this blog show features created from the Store_Sales table joined with the Date_Dim and Item tables (a sketch of this source join follows the list), defined as:

  • Store_Sales: transactional revenue on products generated within a brick-and-mortar store
  • Date_Dim: a calendar-type table representing the date dimension
  • Item: a SKU that can be sold
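For context, here is a hedged sketch of how such a source dataframe might be assembled, assuming the TPC-DS tables are registered as store_sales, date_dim and item and that spark is an active SparkSession; the column renames map the standard TPC-DS names onto the ones used in the feature examples below.

from pyspark.sql import functions as f

# Hypothetical source join; the renamed columns (customer_id, sales_value, month_id)
# match the feature definitions used throughout this post.
src_df = (spark.table("store_sales")
    .join(spark.table("date_dim"), f.col("ss_sold_date_sk") == f.col("d_date_sk"))
    .join(spark.table("item"), f.col("ss_item_sk") == f.col("i_item_sk"))
    .withColumn("month_id", f.col("d_year") * 100 + f.col("d_moy"))
    .withColumnRenamed("ss_customer_sk", "customer_id")
    .withColumnRenamed("ss_net_paid", "sales_value"))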

Base feature definition

The Spark APIs provide powerful functions for data engineering that can be harnessed for feature engineering with a wrapper and some contextual definitions that abstract complexity and promote reuse. The Feature class provides a unified interface to define features with these components:

  • _base_col – a column or another feature, both of which are simply columnar expressions.
  • _filter – a list of conditions, i.e. true/false columnar expressions. If the expressions evaluate to True, the logic defined in _base_col is taken as the feature; otherwise the feature is calculated using _negative_value.
  • _negative_value – the expression to be evaluated if _filter returns False.
  • _agg_func – the Spark SQL function used to aggregate the base column. If _agg_func is not defined, the feature is a non-aggregate (column-level) expression.
from typing import Union
from pyspark.sql import Column

class Feature:
    def __init__(self,
                 _name: str,
                 _base_col: Union[Column, "Feature"],
                 _filter=[],
                 _negative_value=None,
                 _agg_func=None):
        # Constructor body elided in this post; the reference implementation
        # stores these arguments for later materialization into Spark columns.
        ...

This example shows how to define an aggregate feature that sums sales over the first half of 2019:

from pyspark.sql.functions import col, sum, when

total_sales = Feature(_name="total_sales",
                      _base_col=col("sales_value"),
                      _filter=[col("month_id").between(201901, 201906)],
                      _agg_func=sum)

It is equivalent to:

sum(when(col("month_id").between(201901, 201906), col("sales_value")).otherwise(None))

Feature modularization

A common issue with feature engineering is that data science teams are defining their own features, but the feature definitions are not documented, visible or easily shared with other teams. This commonly results in duplicated efforts, code, and worst of all, features with the same intent but different logic / results. Bug fixes, improvements and documentation are rarely accessible across teams. Modularizing feature definitions can alleviate these common challenges.

Sharing across the broader organization and its silos requires another layer of abstraction, as different areas of the organization often calculate similar concepts in different ways. For example, the calculation of net_sales is necessary for all lines of business (store, web and catalog), but the inputs, and even the calculation, are likely different across these lines of business. Enabling net_sales to be derived differently across the broader organization requires that net_sales and its commonalities be promoted to a common_module (e.g. sales_common) into which users can inject their business rules. Many features do not sufficiently overlap with other lines of business and, as such, are never promoted to common. However, this does not mean that such a feature is not valuable for another line of business. Combining features across conceptual boundaries is possible, but when no common superclass exists, usage must follow the rules of the source concept (e.g. channel). For example, machine learning (ML) models that predict store sales often gain value from catalog features. Assuming catalog_sales is a leading indicator of store sales, features may be combined across these conceptual boundaries; it simply requires that the user understand the defining rules according to the construct of the foreign module (e.g. namespace). Further discussion of this abstraction layer is outside the scope of this blog post.

In the reference implementation, a module is implemented as a Feature Family (a collection of features). A read-only property is defined for each feature to provide easy access. A feature family extends an ImmutableDictBase class, which is generic and can serve as a base class for collections of features, filters and other objects. In the code example below, filter definitions are extracted from the features and form a separate Filters class. The common features shared by multiple families are also extracted into a separate CommonFeatures class for reuse. Both filters and common features are inherited by the StoreSales family class, which defines a new set of features based upon the common definitions.

In the code example, there is only one channel; multiple channels would share the same CommonFeatures. Retrieving a feature definition from a specific channel is as simple as accessing a property of that family class, e.g. store_channel.total_sales. A short usage sketch follows the class definitions below.

from pyspark.sql import functions as f

class CommonFeatures(ImmutableDictBase):
    def __init__(self):
        self._dct["CUSTOMER_NUMBER"] = Feature(_name="CUSTOMER_NUMBER", _base_col=f.col("CUSTOMER_ID").cast("long"))
        self._dct["trans_id"] = Feature(_name="trans_id", _base_col=f.concat("ss_ticket_number", "d_date"))

    @property
    def customer(self):
        return self._dct["CUSTOMER_NUMBER"]

    @property
    def trans_id(self):
        return self._dct["trans_id"]


class Filters(ImmutableDictBase):
    def __init__(self):
        self._dct["valid_sales"] = f.col("sales_value") > 0

    @property
    def valid_sales(self):
        return self._dct["valid_sales"]


class StoreSales(CommonFeatures, Filters):
    def __init__(self):
        self._dct = dict()
        CommonFeatures.__init__(self)
        Filters.__init__(self)

        # Distinct count of transactions per grouping key
        self._dct["total_trans"] = Feature(_name="total_trans",
                                           _base_col=self.trans_id,
                                           _filter=[],
                                           _negative_value=None,
                                           _agg_func=f.countDistinct)

        # Sum of sales, counting only rows that pass the valid_sales filter
        self._dct["total_sales"] = Feature(_name="total_sales",
                                           _base_col=f.col("sales_value"),
                                           _filter=[self.valid_sales],
                                           _negative_value=0,
                                           _agg_func=f.sum)

    @property
    def total_sales(self):
        return self._dct["total_sales"]

    @property
    def total_trans(self):
        return self._dct["total_trans"]
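A minimal usage sketch (illustrative only; append_features and the dataframe names follow the conventions used elsewhere in this post, and the grouping key is assumed to be the customer feature):

store_channel = StoreSales()

# Feature definitions are plain properties on the family class
total_sales_def = store_channel.total_sales
total_trans_def = store_channel.total_trans

# Materialize both aggregates per customer on the source dataframe
features_df = append_features(src_df,
                              [store_channel.customer],
                              FeatureVector([total_sales_def, total_trans_def]))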

Feature operations

There are often common patterns found in feature generation; a feature can be extended with higher-order functions that reduce verbosity, simplify reuse and improve the legibility of feature definitions. Examples include:

  • An analyst often wants to measure and compare various product trends over various time periods, such as the last month, last quarter or last year.
  • A data scientist analyzes customer purchasing patterns across product categories and market segments when developing recommender systems for ad placement.

Many very different use cases implement a very similar series of operations (e.g. filters) atop a very similar (or equal) set of base features to create deeper, more powerful, more specific features.

In the reference implementation, a feature is defined by the Feature class, and the operations are implemented as methods of that class. To generate more features, base features can be multiplied using multipliers, such as a list of distinct time ranges, values or a data column (i.e. a Spark SQL expression). For example, a total sales feature can be multiplied by a range of months to generate a feature vector of total sales by month.

total_sales * [1M, 3M, 6M] => [total_sales_1M, total_sales_3M, total_sales_6M]

The multiplication can be applied to categorical values as well. The following example shows how a total_sales feature can derive sales by categorical features:

total_sales * [home, auto] => [total_sales_home, total_sales_auto]

Note that these processes can be combinatorial such that the resulting features of various multipliers can be combined to further transform features.

total_sales_1M * [home, auto] => [total_sales_1M_home, total_sales_1M_auto]

Higher-order lambdas can be applied to allow list comprehensions over lists of features multiplied by other lists of features, and so on, depending on the need. To clarify, the output variable total_sales_1M_home below is the derived total store sales for home goods over the past month; a sketch of such a multiply operation follows the expansion below. Data scientists often spend days wrangling this data through hundreds of lines of inefficient code that only they can read; this framework greatly reduces that cumbersome challenge.

total_sales_by_time = total_sales * [1M, 3M, 6M]
categorical_total_sales_by_time = total_sales_by_time * [home, auto] =>
[
total_sales_1M_home, total_sales_1M_auto,
total_sales_3M_home, total_sales_3M_auto,
total_sales_6M_home, total_sales_6M_auto
]
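Under the hood, a multiply operation only needs to stamp out one new Feature per multiplier value. The sketch below is illustrative rather than the exact Feature Factory implementation; it assumes the Feature constructor shown earlier stores its arguments as name, base_col, filter, negative_value and agg_func attributes.

from pyspark.sql import functions as f

def multiply(self, multiplier_col: str, multiplier_values: list):
    """Create one derived Feature per multiplier value by appending an equality filter."""
    derived = []
    for value in multiplier_values:
        derived.append(Feature(
            _name=self.name + "_" + str(value),            # e.g. total_sales_Music
            _base_col=self.base_col,
            _filter=self.filter + [f.col(multiplier_col) == value],
            _negative_value=self.negative_value,
            _agg_func=self.agg_func))
    return FeatureVector(derived)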

Feature vectors

The feature operations can be further simplified by storing the features with the same operations in a vector. A feature vector can be created from a Feature Dictionary by listing feature names.

features = Features()
fv = FeatureVector.create_by_names(features, ["total_sales", "total_trans"])

A feature vector can create another feature vector through simple multiplication or division, or even via stats functions on an existing vector. A feature vector implements methods such as multiplication, division and statistical analysis to simplify the process of generating features from a list of existing base features. Similarly, Spark's feature transformers can be wrapped to perform common featurization tactics such as scalers, binarizers, etc. Here's an example of multiplying a feature vector across categories (one-hot encoding proper is covered later):

fv2d = fv.multiply_categories("category", ["GROCERY", "MEAT", "DAIRY"])

As a result, new features will be created for total_sales and total_trans for each category (grocery, meat, dairy). To make it more dynamic, the categorical values can be read from a column of a dimension table instead of being hard-coded (see the sketch after the table below). Note that the output of the multiplication is a 2-dimensional vector.

               GROCERY               MEAT                DAIRY
total_sales    total_sales_grocery   total_sales_meat    total_sales_dairy
total_trans    total_trans_grocery   total_trans_meat    total_trans_dairy
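For instance, a hedged sketch of deriving the multiplier values from the item dimension at runtime rather than hard-coding them (the table and column names are illustrative):

# Pull the distinct category values from the dimension table at runtime
categories = [row["i_category"] for row in
              spark.table("item").select("i_category").distinct().collect()]

# Multiply the whole feature vector by the discovered categories
fv2d = fv.multiply_categories("category", categories)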

The code below shows one way to implement FeatureVector.

from typing import List
from pyspark.sql import functions as f

class FeatureVector:

    def __init__(self, features: List[Feature] = None):
        if not features:
            self._features = []
        else:
            self._features = features

    def __add__(self, other):
        """
        Overrides the default add so that two feature vectors can be added to form a new feature vector,
        e.g. fv1 = fv2 + fv3, in which fv1 contains all features from both fv2 and fv3.
        :param other:
        :return:
        """
        return FeatureVector(self._features + other._features)

    @classmethod
    def create_by_names(cls, feature_collection, feature_names: List[str]):
        feat_list = [feature_collection[fn] for fn in feature_names]
        return FeatureVector(feat_list)

    def multiply(self, multiplier_col: str, multiplier_values: List[str]):
        feats = FeatureVector()
        for feature in self._features:
            fv = feature.multiply(multiplier_col, multiplier_values)
            feats += fv
        return feats

    def create_stats(self, base_name: str, stats=["min", "max", "avg", "stdev"]):
        # avg_func and stdev_func are helper column expressions provided by the
        # reference implementation for element-wise stats over an array of columns.
        cols = [f.col(feat.name) for feat in self._features]
        fl = []
        for stat in stats:
            if stat == "min":
                fl.append(Feature(_name=base_name + "_min", _base_col=f.array_min(f.array(cols))))
            elif stat == "max":
                fl.append(Feature(_name=base_name + "_max", _base_col=f.array_max(f.array(cols))))
            elif stat == "avg":
                fl.append(Feature(_name=base_name + "_avg", _base_col=avg_func(f.array(cols))))
            elif stat == "stdev":
                fl.append(Feature(_name=base_name + "_stdev", _base_col=stdev_func(f.array(cols))))
        return FeatureVector(fl)

    def to_cols(self):
        return [f.col(feat.name) for feat in self._features]

    def to_list(self):
        return self._features[:]

One-hot encoding

One-hot encoding is straightforward with feature multiplication and is walked through in the code below. The encoding feature defines _base_col as 1 and _negative_value as 0 so that, when the feature is multiplied by a categorical column, the matched column is set to 1 and all others to 0.

from pyspark.sql import functions as f

src_df = spark.createDataFrame([(1, "iphone"), (2, "samsung"), (3, "htc"), (4, "vivo")], ["user_id", "device_type"])

# base_col of 1 and negative_value of 0 turn each matched category into a 0/1 indicator
encode = Feature(_name="device_type_encode", _base_col=f.lit(1), _negative_value=0)
onehot_encoder = encode.multiply("device_type", ["iphone", "samsung", "htc", "vivo"])
df = append_features(src_df, ["device_type"], onehot_encoder)
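Assuming the derived columns follow the same naming convention as the earlier examples (the base name suffixed with the category value), the resulting dataframe would look roughly like this:

user_id  device_type  device_type_encode_iphone  device_type_encode_samsung  device_type_encode_htc  device_type_encode_vivo
1        iphone       1                          0                           0                       0
2        samsung      0                          1                           0                       0
3        htc          0                          0                           1                       0
4        vivo         0                          0                           0                       1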

Feature Engineering At Scale

Multiplication with estimation

The dynamic approach can generate a large number of features with only a few lines of code. For example, multiplying a feature vector of 10 features by a category column with 100 distinct values generates 1,000 columns, and 3,000 if those are subsequently multiplied by 3 time periods. This is a very common scenario for time-based statistical observations (e.g. the max, min, average and trend of transactions over a year).

As the number of features increases, the Spark job takes longer to finish, and recalculation happens frequently for features with overlapping ranges. For example, to calculate transactions over a month, a quarter, a half year and a year, the same set of data is counted multiple times. How can we keep the easy process of feature generation and still keep performance under control?

One method to improve performance is to pre-aggregate the source table and calculate the features from the pre-aggregated data. total_sales can be pre-aggregated per category and per month, such that subsequent roll-ups are based on this smaller set of data. This method is straightforward for additive aggregations like sum and count, but what about aggregations like median and distinct count, which cannot simply be rolled up from partial results? Below, two methods are introduced that employ estimation to maintain performance while preserving low margins of error for large datasets.
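A hedged sketch of the pre-aggregation idea (store_sales_df and the quarter-key derivation are illustrative):

from pyspark.sql import functions as f

# Aggregate once at the finest granularity needed (customer x category x month)...
monthly_sales = (store_sales_df
    .groupBy("customer_id", "i_category", "month_id")
    .agg(f.sum("sales_value").alias("total_sales_1m")))

# ...then roll the small monthly partials up to quarters instead of rescanning the raw table
quarterly_sales = (monthly_sales
    .withColumn("quarter_id",
                (f.col("month_id") / 100).cast("int") * 10
                + ((f.col("month_id") % 100 - 1) / 3).cast("int") + 1)
    .groupBy("customer_id", "i_category", "quarter_id")
    .agg(f.sum("total_sales_1m").alias("total_sales_3m")))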

HyperLogLog

HyperLogLog (HLL) is a probabilistic algorithm used to approximate the number of distinct values in a dataset. It provides a fast way to count distinct values and, more importantly, produces a binary sketch, a fixed-size structure with a small memory footprint. HLL can compute the union of multiple sketches efficiently, and the distinct count is approximated as the cardinality of a sketch. With HLL, total transactions per month can be pre-aggregated as sketches, and total transactions for a quarter simply become the union of the three pre-computed monthly sketches:

total_trans_quarter = cardinality(total_trans_1m_sketch U total_trans_2m_sketch U total_trans_3m_sketch)

The calculation is shown in the picture below.

HyperLogLog can compute the union of multiple sketches efficiently, and the distinct count is approximated as the cardinality of a sketch.
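A hedged sketch of this pattern, assuming a Spark runtime that ships the DataSketches-based HLL SQL functions (hll_sketch_agg, hll_union_agg, hll_sketch_estimate, available in recent Spark and Databricks releases); older environments can achieve the same with a library such as spark-alchemy:

from pyspark.sql import functions as f

# Pre-aggregate: one HLL sketch of distinct transaction ids per customer per month
monthly_sketches = (src_df
    .groupBy("customer_id", "month_id")
    .agg(f.expr("hll_sketch_agg(trans_id)").alias("trans_sketch")))

# Roll up: union the monthly sketches for a quarter, then estimate the cardinality
quarterly_trans = (monthly_sketches
    .where(f.col("month_id").between(200010, 200012))
    .groupBy("customer_id")
    .agg(f.expr("hll_sketch_estimate(hll_union_agg(trans_sketch))").alias("total_trans_quarter")))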

MinHash

Another approach to approximate the result of multiplication is to intersect the two sketches and compute the cardinality of the intersection.

For example, total_trans_grocery_12m = cardinality(total_trans_12m ∩ total_trans_grocery)

MinHash, another approach to approximate the result of multiplication, is to intersect the two sketches and compute the cardinality of the intersection

The intersection can be computed with the inclusion-exclusion principle:

|A ∩ B| = |A| + |B| − |A ∪ B|

However, the intersection computed this way compounds the estimation errors of the individual sketches.

MinHash is a fast algorithm for estimating the Jaccard similarity between two sets. The tradeoff between accuracy and computing/storage resources can be tuned via the choice of hash function and the number of permutations. With MinHash, the distinct count of the intersection of two sets can be estimated in linear time with a small, fixed memory footprint.
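To make the idea concrete, here is a toy, pure-Python illustration (not the notebook implementation) of estimating an intersection size from MinHash signatures:

import random

PRIME = 2_147_483_647  # a large prime for universal hashing

def make_hash_fns(k: int, seed: int = 42):
    """Generate k hash functions of the form h(x) = (a*x + b) % PRIME."""
    rnd = random.Random(seed)
    return [(rnd.randrange(1, PRIME), rnd.randrange(0, PRIME)) for _ in range(k)]

def minhash_signature(items, hash_fns):
    """Signature = per-hash-function minimum over all items in the set."""
    return [min((a * hash(x) + b) % PRIME for x in items) for a, b in hash_fns]

def jaccard_estimate(sig_a, sig_b):
    """The fraction of slots where the minima agree approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Toy sets: all transactions over 12 months vs. grocery transactions (true overlap = 300)
trans_12m = {f"t{i}" for i in range(1000)}
trans_grocery = {f"t{i}" for i in range(400, 700)}

fns = make_hash_fns(k=256)
j = jaccard_estimate(minhash_signature(trans_12m, fns), minhash_signature(trans_grocery, fns))

# From J = |A ∩ B| / |A ∪ B| and |A ∪ B| = |A| + |B| - |A ∩ B|:
est_intersection = j * (len(trans_12m) + len(trans_grocery)) / (1 + j)
print(f"estimated Jaccard: {j:.3f}, estimated intersection: {est_intersection:.0f}")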

The details of multiplication with MinHash can be found in the notebooks below.

Feature definition governance

Databricks notebooks manage and track the versions of feature definitions and can be linked to and backed by GitHub repos. An automated job runs functional and integration tests against new features, and additional stress tests can be integrated to determine the performance impact in different scenarios. If performance falls outside acceptable bounds, the feature gets a performance warning flag. Once all tests pass and the code is approved, the new feature definitions are promoted to production.

MLflow Tracking logs the code version and source data used when building models from the features. mlflow.spark.autolog() enables and configures logging of the Spark data source path, version and format. The model can then be linked with the training data and the feature definitions in the code repository.

To reproduce an experiment, we also need to use consistent datasets. Delta Lake's Time Travel capability enables queries against a specific snapshot of data in the past. Disclaimer: Time Travel is not meant to be used for long-term, persistent historical versioning; standard archival processes are still required for longer-term historical persistence, even when using Delta.
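For example (the version number and table path are illustrative):

import mlflow.spark

# Log the Spark data source path, version and format with every run
mlflow.spark.autolog()

# Train against a pinned snapshot of the feature table so the run is reproducible
training_df = (spark.read.format("delta")
    .option("versionAsOf", 12)
    .load("/mnt/features/customer_features"))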

Feature exploration

As the number of features increases, it becomes more difficult to browse and find specific feature definitions.

The abstracted feature class enables us to add a description attribute to each feature, and text mining algorithms can be applied to the feature descriptions to cluster features into different categories automatically.

At Databricks, we’ve experimented with an automatic feature clustering approach and seen promising results. This approach assumes that a proper description of each feature is provided as input. Descriptions are transformed into a TF-IDF feature space, and then Birch clustering is applied to gather similar descriptions into the same group. The topics of each group are the highest-ranking terms within that group of features.
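A hedged sketch of this approach, assuming scikit-learn and a small set of illustrative (feature_name, description) pairs collected from a feature registry:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import Birch

descriptions = [
    ("total_sales_1M_home", "total store sales of home goods over the last month"),
    ("total_sales_3M_home", "total store sales of home goods over the last quarter"),
    ("total_trans_12M", "distinct count of store transactions over the last year"),
    ("total_trans_3M", "distinct count of store transactions over the last quarter"),
]

# Turn descriptions into a TF-IDF feature space and cluster similar ones together
texts = [desc for _, desc in descriptions]
X = TfidfVectorizer(stop_words="english").fit_transform(texts)
labels = Birch(n_clusters=2).fit_predict(X)   # the cluster count is a tuning choice

for (name, _), label in zip(descriptions, labels):
    print(label, name)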

The feature clustering can serve multiple purposes. One is to group similar features together so that a developer can explore more easily. Another use case could be to validate the feature description. If a feature is not properly documented, the feature will not be clustered in the same group as expected.

Summary

In this blog, design patterns for feature creation were presented to showcase how features can be defined and managed at scale. With this automated feature engineering, new features can be generated dynamically using feature multiplication, and they can be efficiently stored and manipulated using feature vectors. The calculation of derived features can be sped up by unioning or intersecting pre-computed results via estimation-based methods. Features can be further extended to implement otherwise costly operations simply and efficiently, making them accessible to users from a variety of backgrounds.

Hopefully this blog enables you to bootstrap your own implementation of a feature factory that streamlines workflows, enables documentation, minimizes duplication and guarantees consistency among your feature sets.

Preview the Notebook

Download the notebook

--

Try Databricks for free. Get started today.

The post Feature Engineering at Scale appeared first on Databricks.
