
Automated Background Removal in E-commerce Fashion Image Processing Using PyTorch on Databricks


This is a guest blog from Simona Stolnicu, a data scientist and machine learning engineer at Wehkamp, an e-commerce company, where her team builds data pipelines for machine learning tasks at scale.

 
Wehkamp is one of the biggest e-commerce companies in the Netherlands, with more than 500,000 daily visitors to its website. The wide variety of products offered on the Wehkamp site aims to meet its customers’ many needs.

An important aspect of any customer visit to an e-commerce website is a high-quality, accurate visual experience of the products. At large scale, this is no easy task, with thousands of product photos processed in a local photo studio.

One aspect of creating a great customer experience is consistency. Since these images’ backgrounds are highly varied, before an image goes on the website, the background is removed to create a uniform look on the web pages. If done manually, this is a very tedious and time-consuming job. When it comes to millions of images, the time and resources needed to manually perform background removal are too high to sustain the dynamic flow of the newly arrived products.

In this blog, we describe our automated end-to-end pipeline, which uses machine learning (ML) to reduce image processing time and increase image quality. For that, we employ PyTorch for image processing and Horovod on Databricks clusters for distributed training.

Image processing pipeline overview

In the following diagram, you can observe all the principal components of our pipeline, starting from data acquisition to storing the models which have been trained and evaluated on the processed data. Additionally, you can see the services and libraries that were used at each step in the image processing pipeline. As an outline, we used Amazon S3 buckets to load and save both the raw and processed image data. For the model training and evaluation, we used MLflow experiments to store parameters and results. Also, the models are versioned in the MLflow Model Registry from where they can go to the production environment.

Fashion retailer Wehkamp’s image processing pipeline

Fashion image dataset processing

In order for the machine learning model to learn the distinction between an image’s background and foreground, the model needs to process a pair of the original image and a binary mask showing which pixels belong to the background or foreground.

The dataset used for this project has around 30,000 image pairs. The predicted output of the model is a binary mask, and the final image is obtained by removing the background area marked in the binary mask from the original image.
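As an illustration of this final step, here is a minimal sketch (not the production pipeline) of applying a predicted binary mask to an original image with NumPy and Pillow; the file names are placeholders:

import numpy as np
from PIL import Image

# Load the original photo and the predicted binary mask (placeholder paths).
original = np.asarray(Image.open("product.jpg").convert("RGB"), dtype=np.uint8)
mask = np.asarray(Image.open("predicted_mask.png").convert("L"), dtype=np.float32) / 255.0

# Keep foreground pixels, replace background with white.
mask = mask[..., None]                       # shape (H, W, 1) for broadcasting
white = np.full_like(original, 255)
cutout = (original * mask + white * (1.0 - mask)).astype(np.uint8)

Image.fromarray(cutout).save("product_no_background.jpg")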

Below are some examples from the test dataset: the original image and its label (which, for the training dataset, form the input to the network), followed by the predicted mask and the final image with the background removed.

Example dataset, illustrating the use of computer vision to remove the background from product display images.

Building a general step-by-step pipeline to process images is difficult unless you know precisely what to expect from the data to be processed. Unforeseen variance in the data makes it hard to anticipate which operations need to be performed. In our case, we ran a few training trials to determine the exact areas of the images that needed cleaning or further analysis.

The dataset cleaning consisted of removing mismatched image pairs and resizing the images at the beginning of the process. This was done to improve the effectiveness of the process and to match our network architecture’s input size.

Another step suitable for our case was to split our image data into 6 clusters of product types, namely:

  • Long pants
  • Shorts
  • Short-sleeved tops/dresses
  • Long-sleeved tops/dresses
  • Beachwear/sportswear/accessories
  • Light-colored products

This splitting was needed because of the unbalanced number of product types in the data, which made our model prone to perform worse for product types underrepresented in the training data. A cluster of light-colored products was created because a considerable number of images in our dataset had a light-colored background, and detecting products when the background and product colors are similar proved difficult. Because there were only a small number of such cases, the model had few examples to learn from.

The clustering process was based primarily on the k-means algorithm, which was applied to the original images’ computed features. The result of this process was not fully accurate, so some manual work was also needed.
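A minimal sketch of how such a clustering step might look, assuming simple color-histogram features and scikit-learn’s KMeans (the post does not specify which image features were actually computed):

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def image_features(path, bins=16):
    """Very simple per-channel color histogram as a feature vector."""
    img = np.asarray(Image.open(path).convert("RGB").resize((128, 128)))
    hist = [np.histogram(img[..., c], bins=bins, range=(0, 255), density=True)[0]
            for c in range(3)]
    return np.concatenate(hist)

image_paths = ["img_0001.jpg", "img_0002.jpg"]  # placeholder paths to product photos
features = np.stack([image_features(p) for p in image_paths])

kmeans = KMeans(n_clusters=6, random_state=0).fit(features)
clusters = kmeans.labels_   # one cluster id per image; manual review is still needed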

Even though this work is based on a large amount of data, this didn’t cover all the use cases this model would be used for. For this reason, the data needed some augmentation techniques from which the most important were:

  • Cropping
  • Background color change
  • Combination of images

Below you can see how these transformations look on some of the images presented above.

Example augmentation techniques used to refine the models.
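In code, a minimal sketch of two of these augmentations (random cropping and background color change), assuming a PIL image and its binary mask; the image-combination augmentation is omitted:

import random
import numpy as np
from PIL import Image

def random_crop(image, mask, crop_frac=0.9):
    """Crop the same random window from both the image and its mask."""
    w, h = image.size
    cw, ch = int(w * crop_frac), int(h * crop_frac)
    left = random.randint(0, w - cw)
    top = random.randint(0, h - ch)
    box = (left, top, left + cw, top + ch)
    return image.crop(box), mask.crop(box)

def change_background(image, mask, color=(240, 240, 230)):
    """Replace the background (mask == 0) with a new solid color."""
    img = np.asarray(image.convert("RGB"))
    m = (np.asarray(mask.convert("L")) > 127)[..., None]
    background = np.empty_like(img)
    background[...] = color          # broadcast the solid background color
    return Image.fromarray(np.where(m, img, background))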

One thing to keep in mind when working with images is that training data is difficult to obtain. The manual process of creating the training data takes time and a lot of work to ensure accuracy.

Model architecture for image processing

Separating the foreground from the background is considered to be a saliency detection problem. It consists of detecting the most obvious region in images and predicting the probability of a pixel belonging to either background or foreground.

The most recent state of the art in this area is the U2-Net paper (U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection), which introduces a deep neural network architecture for salient object detection and feature extraction. This architecture has a two-level nested U network shape and can run with low memory and reduced computational cost.

Its main strength is detecting features by using deep layers at many scales in the architecture. In this way, it can capture contextual information without adding computational cost.

Our model is built on an architecture inspired by the paper mentioned above, and it showed significant improvements in results compared to other convolutional neural network (CNN) architectures, such as Mask R-CNN. This is because Mask R-CNN’s backbone is designed for image classification tasks, which makes the network unnecessarily complicated for this particular use case.

Distributed model training and evaluation

Training with large amounts of data is a time-consuming process of trial and error. Our network was trained in a Databricks environment using workers with graphical processing units (GPUs). Horovod helped us set up a distributed training process using PyTorch.

During each epoch, distinct batches are trained on multiple workers, and the results are merged by averaging the parameters across workers. At each epoch, the dataset is split across the workers. Each worker then divides its share of the data into batches of the batch_size set in the training parameters. This means training effectively runs with a larger batch size: batch_size multiplied by the number of workers.

Distributed training pipeline in a Databricks environment using workers with graphical processing units (GPUs).

After each epoch, Horovod averages the workers’ parameters on a root node, then redistributes the new values to all workers so training can continue with the following epochs. This is very easy to set up, especially in a PyTorch environment.

Say there is a need to track different or additional metrics during training, or to run a validation step after each epoch. What then? Because Horovod’s default implementation does not take these additional metrics into account, you have to set up explicitly how the workers’ values are merged. The solution is to use the same averaging operation that Horovod itself uses. This was a bit hard to find in the Horovod documentation, but it made things work.

Code example (from the original post): averaging additional metrics across Horovod workers.
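Since the original screenshot is not reproduced here, below is a minimal sketch of this pattern, assuming Horovod with PyTorch on a GPU cluster. The toy model stands in for the real segmentation network, and hvd.allreduce averages a tensor across workers by default:

import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Toy stand-in for the real network (a U2-Net-style model in the actual pipeline).
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Wrap the optimizer so gradients are averaged across workers,
# and start every worker from the same initial state.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

def average_across_workers(value, name):
    """Average a scalar metric (e.g., validation IoU or loss) over all workers."""
    return hvd.allreduce(torch.tensor(value), name=name).item()

# After each epoch's validation loop on every worker:
# mean_val_iou = average_across_workers(local_val_iou, "val_iou")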

Some of the training and evaluation characteristics used and worth mentioning are:

  • Training parameters settings (Adam optimizer, dynamic learning rate, batch_size=10*nr_gpus, epochs=~30)
  • Training and validation time: 9 hours on a cluster of 20 single-GPU worker instances
  • Evaluation methods: Intersection over Union (IoU) metric
  • TensorBoard metrics: losses, learning rate, IoU metric/epoch for validation dataset

Intersection over Union (IoU) is an evaluation metric used to check the predicted mask’s accuracy by comparing it to the actual truth label mask. Below you can see the values of this metric on two example images. The first has a value of almost 100%, meaning the predicted label is almost identical to the truth label. The second one shows some minor problems, which result in a lower score.
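A minimal sketch of how IoU can be computed for a predicted and a ground-truth binary mask, assuming NumPy arrays of 0s and 1s:

import numpy as np

def iou(pred_mask, true_mask):
    """Intersection over Union for two binary masks of the same shape."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return intersection / union if union > 0 else 1.0

# Example: a prediction that misses part of the object.
true = np.zeros((4, 4), dtype=int); true[1:4, 1:4] = 1   # 9 foreground pixels
pred = np.zeros((4, 4), dtype=int); pred[1:3, 1:4] = 1   # 6 foreground pixels
print(iou(pred, true))  # 6 / 9 ≈ 0.667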

 

Intersection over Union (IoU) is an evaluation metric used to check the predicted mask’s accuracy by comparing it to the actual truth label mask.

PyTorch model performance and evaluation metrics

 

Our best model’s average performance is 99.435%. In terms of the number of images achieving certain scores, the results can be viewed from two perspectives.

First, clustered by accuracy. Below are some numbers:

  • Over 99.3% of images reach at least 95% accuracy
  • Over 98.7% of images reach at least 97% accuracy
  • Over 93.1% of images reach at least 99% accuracy

 

Another way to check the model results is to look at its performance on specific image-type clusters. This view makes our model’s strengths and weaknesses clear. For example, when clustering by image type, the numbers are:

  • Long pants: 99.428% accuracy
  • Shorts: 99.617% accuracy
  • Short-sleeved tops/dresses: 99.502% accuracy
  • Long-sleeved tops/dresses: 99.588% accuracy
  • Beachwear/sportswear/accessories: 98.815% accuracy
  • Light-colored products: 98.868% accuracy

Lessons learned

As with any training or experimentation, you are bound to encounter some errors. The errors we ran into during development were related either to out-of-memory (OOM) errors or, in some cases, to failures caused when worker nodes were detached. For the first case, we tried different values for the batch size until we found the largest size that didn’t raise any errors. For the second, the solution is to check that the cluster is set up to use only on-demand instances instead of spot instances. This is easily selected when creating a Databricks machine learning runtime cluster.

What’s Next

In summary, when using ML in an environment as dynamic as e-commerce fashion, where item styles are constantly changing, it is essential to make sure the models’ performance keeps up with these changes. The project aims to move towards building retraining pipelines that use the results the current models are producing in production.

For the quality check of the output images, an established threshold will determine whether an image will be transferred to an interface, where it can be corrected, or whether it can directly go to data storage for the retraining pipeline. Once we have accomplished this, we can move towards new challenges that can continuously improve the quality of the work and the customers’ delivered experience.

Interested to learn more? Register for the Data + AI Summit 2021 and attend the related session Automated Background Removal Using PyTorch.

 

--


The post Automated Background Removal in E-commerce Fashion Image Processing Using PyTorch on Databricks appeared first on Databricks.


Custom DNS With AWS Privatelink for Databricks Workspaces


This post was written in collaboration with Amazon Web Services (AWS). We thank co-authors Ranjit Kalidasan, senior solutions architect, and Pratik Mankad, partner solutions architect, of AWS for their contributions.

 
Last week, we were excited to announce the release of AWS PrivateLink for Databricks Workspaces, now in public preview, which enables new patterns and functionalities to meet the governance and security requirements of modern cloud workloads.  One pattern we’ve often been asked about is the ability to leverage custom DNS servers for Customer-managed VPC for a Databricks workspace. To provide this functionality in AWS PrivateLink-enabled Databricks workspaces, we partnered with AWS to create a scalable, repeatable architecture. In this blog, we’ll discuss how we implemented Amazon Route 53 Resolvers to enable this use case, and how you can recreate the same architecture for your own Databricks workspace.

Motivation

Many enterprises configure their cloud VPCs to use their own DNS servers. They may do this because they want to limit the use of externally controlled DNS servers, and/or because they have on-prem, private domains that need to be resolved by cloud applications. In general, this is not an issue when using Databricks because our standard deployments, even with Secure Cluster Connectivity (i.e. private subnets), use domains that are resolvable by AWS.

Many enterprise-level customers employ their own DNS servers in their cloud account; they may do this because they want to limit the use of externally controlled DNS servers

AWS PrivateLink for Databricks, however, requires private DNS resolution in order for the back-end and front-end interface connectivity to work. If a customer configures their own DNS servers for their workspace VPC, those servers will not be able to resolve these VPC endpoints on their own, so connectivity between the Databricks Data and Control planes will be broken. In order to deploy Databricks with AWS PrivateLink and custom DNS, Route 53 can be used to resolve these private DNS names in the Data Plane.

What is Amazon Route 53?

Amazon Route 53 is a highly-available and scalable cloud Domain Name System (DNS) web service. It is designed to give developers and businesses an extremely reliable and cost-effective way to route end users to Internet applications by translating names like www.example.com into the numeric IP addresses like 192.0.2.1 that computers use to connect to each other. Route53 consists of different components, such as hosted zones, policies and domains. In this blog, we focus on Route 53 Resolver Endpoints (specifically, Outbound Endpoints) and the applied Endpoint Rules.

High-level architecture

At a high level, the architecture to create Private DNS names for an interface Amazon virtual private cloud (VPC) endpoint on the service consumer side is shown below:

The high-level architecture required to create Private DNS names for an interface VPC endpoint on the service consumer side.

Route 53 in this case provides an outbound resolver endpoint. This essentially provides a way of resolving local, private domains with Route 53, while using the custom DNS server for any remaining, unresolved domains. Technically, this architecture consists of Route 53 outbound resolver endpoints deployed in the DNS server VPC, and Route 53 Resolver rules that tell the service how and where to resolve domains. For more information on how Route 53 Private Hosted Zone entries are resolved by AWS, refer to Private DNS for Interface Endpoints and Working with Private Hosted Zones in the documentation and user guide. Note that this works similarly in the case where a DNS server is hosted on-prem; in that case, the VPC in which the outbound resolvers are deployed should be the same VPC that hosts the Direct Connect endpoint to your on-prem data center.

Step-by-step instructions

Below, we walk through the steps for setting up a Route 53 Outbound Resolver with the appropriate rules. We assume that an AWS PrivateLink-enabled Databricks workspace is already deployed and running.

  1. Ensure that the workspace is deployed properly according to our PrivateLink documentation. If you cannot spin up clusters due to the Custom DNS already in place, try enabling AWS DNS resolution to make sure that cluster creation is unblocked and there are no additional issues.
  2. Gather the following information:
    • The VPC ID used for the Databricks Data Plane (and, if applicable, the User-to-Workspace VPC endpoint)
    • The VPC ID of the VPC containing the custom DNS server
    • The subnets into which Route53 endpoints will be deployed. These must be in the same VPC as the custom DNS server (at least 2 subnets are required, and they should be in separate AZs)
    • The IP of the custom DNS server
    • The Security Group ID that will be applied to the Route 53 endpoints. This should allow inbound connections on UDP port 53 from the Data Plane VPC (10.175.0.0/16 in the above diagram), and should use the default outbound rule (i.e., allow 0.0.0.0/0)
  3. Start by creating a new Route 53 Outbound Resolver (Services > Route53 > Outbound Endpoint > Create Outbound Endpoint). Create this endpoint on the DNS VPC with the VPC ID obtained in step 2b, and on the subnets from step 2c. Select the security group from step 2e. Unless you have a compelling reason to do otherwise, select “Use an IP address that is selected automatically” when selecting the IP addresses.
  4. Create a new resolver rule (Services > Route53 > Rules > Create Rule). This rule will forward DNS queries to the custom DNS server for all domains except for Private DNS names for Databricks VPC endpoints (these endpoints will use Private Hosted Zone for resolution). In “Domain Name”, enter a dot (“.” without quotes), which is translated to all domains. For the VPC, select your Data Plane VPC from Step 2a. The outbound endpoint should be the endpoint created in Step 3. In “Target IP”, use the IP of the custom DNS server. NOTE: if you use a User-to-Workspace PrivateLink endpoint in a separate VPC from the SCC/REST endpoints, also attach the rule to that VPC.
  5. If AWS endpoints are being used for the Data Plane, (i.e., Kinesis, S3 and STS endpoints), add another rule to forward these domain resolution requests to the Route 53 default resolver. This rule should have a domain of “amazonaws.com” (no quotes). The VPC and endpoint settings should be the same as those in Step 4. For the target IP address, use the AWS VPC resolver, which is the second IP of the VPC CIDR range; i.e., for CIDR 10.0.0.0/16, use 10.0.0.2. This should be the VPC from Step 2b; in this example the IP would be 10.100.100.2.
  6. Your Route53 resolver is now set up. Make sure that the DNS and Data Plane VPCs have routing configured correctly; no additional routing is required for Route53 endpoints once they are associated with the appropriate VPCs. No explicit routing is required for the Databricks VPC endpoints (since they are resolved by Route53), but other endpoints, such as Amazon S3 or other services, may have explicit routes.
  7. Open your workspace and try launching a cluster. To validate that the resolution is working, you can run the following command in a notebook:

%sh dig <region>.privatelink.cloud.databricks.com

Where <region> will change depending on the region you are in. For us-east-1, this will be nvirginia. This command should return something similar to the following:

; <<>> DiG 9.11.3-1ubuntu1.13-Ubuntu <<>> nvirginia.privatelink.cloud.databricks.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34414
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;nvirginia.privatelink.cloud.databricks.com. IN A

;; ANSWER SECTION:
nvirginia.privatelink.cloud.databricks.com. 60 IN A 10.175.4.9

;; Query time: 2 msec
;; SERVER: 10.175.0.2#53(10.175.0.2)
;; WHEN: Thu Feb 25 14:55:57 UTC 2021
;; MSG SIZE  rcvd: 87

If this succeeds, you have successfully set up your DNS routing with Route53!
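For teams that prefer to script these steps rather than use the console, here is a minimal sketch using boto3; the subnet IDs, security group ID, VPC IDs and DNS server IP are placeholders standing in for the values gathered in step 2, and the amazonaws.com rule from step 5 is omitted:

import boto3

resolver = boto3.client("route53resolver", region_name="us-east-1")

# Step 3: outbound resolver endpoint in the DNS server VPC (placeholder subnets/SG).
endpoint = resolver.create_resolver_endpoint(
    CreatorRequestId="databricks-custom-dns-endpoint",
    Name="databricks-outbound-endpoint",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    Direction="OUTBOUND",
    IpAddresses=[{"SubnetId": "subnet-aaaa1111"}, {"SubnetId": "subnet-bbbb2222"}],
)["ResolverEndpoint"]

# Step 4: forward all domains (".") to the custom DNS server.
rule = resolver.create_resolver_rule(
    CreatorRequestId="databricks-custom-dns-rule",
    Name="forward-all-to-custom-dns",
    RuleType="FORWARD",
    DomainName=".",
    TargetIps=[{"Ip": "10.100.100.10", "Port": 53}],   # custom DNS server IP (placeholder)
    ResolverEndpointId=endpoint["Id"],
)["ResolverRule"]

# Associate the rule with the Data Plane VPC (placeholder VPC ID).
resolver.associate_resolver_rule(
    ResolverRuleId=rule["Id"],
    Name="data-plane-vpc-association",
    VPCId="vpc-0123456789abcdef0",
)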

--


The post Custom DNS With AWS Privatelink for Databricks Workspaces appeared first on Databricks.

Guide to Manufacturing & Distribution Sessions at Data + AI Summit 2021


Data + AI Summit is the global event for the data community, where 100,000 practitioners, leaders and visionaries come together to engage in thought-provoking discussions and dive into the latest innovations in data and AI. For those in manufacturing and distribution, this will be an opportunity to learn from industry leaders about how they are using AI to create resilient, yet efficient, supply chains; meet sustainability and yield goals through a smart factory; and anticipate and meet customer demand during volatile times.

At this year’s Data + AI Summit, we’re excited to announce a full agenda of sessions for data teams in the manufacturing & distribution industry. Leading innovators from across the industry – including Rolls-Royce, John Deere, Henkel, Thyssenkrupp and ExxonMobil – are joining us to share how they are using data to transform their businesses into Equipment-as-a-Service, leverage IoT and unstructured real-time data for predictive maintenance, and develop the shop floors of tomorrow.

Manufacturing & Distribution Industry Forum

Join us on May 26 for our capstone Manufacturing & Distribution industry event as leaders in the industry engage in a panel discussion on the future of manufacturing and how organizations can thrive during even the most turbulent times with data and AI. John Deere will kick off with a keynote address on how they are leading the charge as a smart industrial company through precision agriculture, helping their customers increase yields by applying AI to the millions of data points generated every day by their machines around the world.

Manufacturing & distribution tech talks

Here’s an overview of some of our most highly anticipated sessions at this year’s summit:

 

  • Manipulating Geospatial Data at Massive Scale (John Deere): In this talk, John Deere will describe some of their data engineering methods for efficiently ingesting and processing petabytes of agriculture data from customers’ farms across the globe to enable their data scientists to perform geospatial analyses.
  • Analytics-enabled Experiences: The New Secret Weapon (Steelcase):  Hear how Steelcase, the world’s largest office furniture manufacturer, is applying product application analytics to make offices safer in the midst of COVID-19 pandemic.
  • NLP-focused Applied ML at Scale for Global Fleet Analytics (ExxonMobil): The data team from ExxonMobil will discuss how they leverage Databricks to perform machine learning to ingest structured and unstructured data from legacy systems, and then sift through millions of records to extract insights using NLP. The insights enable outlier identification, capacity planning, prioritization of cost reduction opportunities and the discovery process for cross-functional teams.
  • Delivering Insights from 20M+ Smart Homes with 500M+ Devices (Plume Design, Inc): This is a story of how Plume Design, a SaaS-based company that delivers smart home experience management, scaled their data processing and boosted team productivity to meet demand for insights from 20M+ smart homes and 500M+ devices across the globe, coming from numerous internal business teams and their 150+ CSP partners.

Check out the full list of manufacturing & distribution talks at Summit.

For practitioners: hands-on demos and expert discussions

Join us for live demos of the hottest data analysis use cases in the manufacturing and distribution industry, including demand forecasting and safety stock analysis. You’ll be able to ask questions to our expert data scientists from the industry. Link coming soon.

Sign-up for the manufacturing & distribution experience at Summit!

Make sure to register for the Data + AI Summit to take advantage of all the amazing manufacturing & distribution sessions, demos and talks scheduled to take place. Registration is free!

--


The post Guide to Manufacturing & Distribution Sessions at Data + AI Summit 2021 appeared first on Databricks.

Databricks on Google Cloud Now Generally Available


Today, we announced the general availability of Databricks on Google Cloud, a jointly developed service that combines an open Lakehouse platform with an open cloud. Since the announcement of Databricks on Google Cloud, we have seen tremendous momentum for this partnership as customers pursue a multi-cloud approach to their analytics and DS/ML workloads. Customers are demanding a simple, unified platform as they move workloads to Databricks on Google Cloud built on open standards with technologies like Delta Lake, MLflow and Google Kubernetes Engine. This GA release is now available in multiple regions in the US and Europe with additional regions coming soon.

What’s new in GA

The GA release includes several new features:
  • Repo and Project support to sync your work with a remote Git repository
  • Table ACLs that let you programmatically grant and revoke access to data from Python and SQL
  • DB Connect to connect to Databricks from your favorite IDE
  • Cluster Tags for DBU usage tracking
  • Notebook-scoped libraries to create and share custom Python environments that are specific to a notebook
  • Local SSD support for caching and improved performance
  • Tableau connector to Databricks on Google Cloud
  • Terraform provider to easily provision and manage Databricks along with the associated cloud infrastructure

Reckitt was among the first to use Databricks on Google Cloud. Databricks delivers tight integrations with Google Cloud’s compute, storage, analytics and management products. This includes the first Google Kubernetes Engine (GKE) based, fully containerized Databricks runtime on any cloud, pre-built connectors to seamlessly and quickly integrate Databricks with BigQuery, Google Cloud Storage, Looker and Pub/Sub. In addition, customers can deploy Databricks from the Google Cloud Marketplace for simplified procurement and user provisioning, Single Sign-On and unified billing. With Databricks on Google Cloud for data and AI workloads, Reckitt unlocks competitive advantages such as cost savings, agility, increased innovation and business continuity planning. Let’s take a closer look.

Reckitt: AI-focused customer analytics platform

Reckitt, a multinational consumer goods company that serves millions of retail customers worldwide, is on a mission to improve their analytics workflows with AI-driven decisions. Reckitt’s struggle was similar to many other companies – they were dealing with tons of data and disjointed pipelines, and each time the team implemented a data science project, they found themselves reinventing the wheel. This led them to make AI a priority at an enterprise-level with a “ubiquitous AI” vision:

To infuse trusted, AI-driven decisions into daily workflows and liberate our people’s limitless potential for innovation

One of the first projects was a Customer Analytics platform aimed at improving marketing ROI across 50 brand-market units in 13 countries with metrics like audience activation and media effectiveness.

Reckitt chose Databricks on Google Cloud to enable their customer analytics efforts. By unifying media data from hundreds of sources for consumer identification, Reckitt is building a highly-modular data platform that can support key use cases such as measuring the performance of a propensity model to drive sales uplift or the impact of first-party data on their conversion funnel.

Databricks on Google Cloud Marketing ROI solution architecture deployed by Reckitt:

The above diagram shows Reckitt’s Databricks on Google Cloud Marketing ROI solution architecture. The main features of the diagram include:

  1. Data Collection: Read structured and unstructured data into BigQuery from 114 unique media datastreams from sources such as Facebook, YouTube and Pinterest and Google Analytics; unstructured data from IoT devices and SaaS applications such as Salesforce is stored in GCS.
  2. Transformation: Apply business rules and aggregation in Delta Lake and calculate KPIs. Delta Lake allows Reckitt to reuse existing data pipelines from other public clouds since it stores the data in the open-source Parquet format, which can easily be stored in GCS (a minimal sketch of this pattern follows this list).
  3. Analyze: Use SQL Analytics, Cloud Natural Language and MLflow for further analysis.
  4. Visualize: Data scientists, business analysts and executives use Power BI and Data Studio for visualization. Insights are leveraged downstream across ad platforms, email, CRM and other systems.
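A minimal PySpark sketch of steps 1 and 2, assuming a Databricks notebook (where spark is the predefined session), the BigQuery connector and a Delta table on GCS; the project, table, column and bucket names are illustrative placeholders, not Reckitt’s actual pipeline:

# Step 1 (sketch): read a media data stream that lands in BigQuery (placeholder table).
media_df = (spark.read.format("bigquery")
            .option("table", "my-gcp-project.media.youtube_daily")
            .load())

# Step 2 (sketch): apply business rules / aggregations and write a Delta table on GCS.
from pyspark.sql import functions as F

kpis = (media_df
        .groupBy("brand", "market", "campaign_id")
        .agg(F.sum("impressions").alias("impressions"),
             F.sum("spend").alias("spend"))
        .withColumn("cpm", F.col("spend") / F.col("impressions") * 1000))

(kpis.write.format("delta")
     .mode("overwrite")
     .save("gs://my-bucket/delta/marketing_kpis"))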

DataOps and MLOps are critical drivers of Reckitt’s multi-cloud architecture. Databricks on Google Cloud makes it possible to reuse PySpark scripts and existing data pipelines on Delta Lake, greatly simplifying data engineering and data science at scale. The result is a hyper-targeted set of audiences that are engaged across multiple media channels, measured by ROI uplift across those channels. Audience activation and the marketing ROI boost have yielded a 44% efficiency gain in cost-per-view, an 11% reduction in cost per 1,000 impressions and a 10% improvement in view-through rate. Learn more about Reckitt’s data analytics journey here.

Broad partner support

Databricks on Google Cloud is supported by our broad ecosystem of partners who share their commitment to open standards, integrations and solution expertise for Databricks on Google Cloud. These partners bring deep experience in the Databricks Lakehouse architecture for building the AI and ML foundation across targeted industry solutions. We are pleased to have these partners invest above and beyond in working with us to support the GA launch of Databricks on Google Cloud.

Talking with customers through the public preview, it is clear that multi-cloud is a growing strategy for cloud data and analytics workloads. The general availability of Databricks on Google Cloud further advances the potential of multi-cloud with the open, simple Databricks Lakehouse platform that brings analytics workloads to multiple clouds.


The post Databricks on Google Cloud Now Generally Available appeared first on Databricks.

Data-driven Software: Towards the Future of Programming in Data Science


This is a guest authored post by Tim Hunter, data scientist, and Rocío Ventura Abreu, data scientist, of ABN AMRO Bank N.V.

Data science is now placed at the center of business decision making thanks to the tremendous success of data-driven analytics. However, more stringent expectations around data quality control, reproducibility, auditability and ease of integration from existing systems have come with this position. New insights and updates are expected to be quickly rolled out in a collaborative process without impacting existing production pipelines.

Essentially, data science is confronting issues that software development teams have worked on for decades. Software engineering built effective best practices such as versioning code, dependency management, feature branches and more. However, data science tools do not integrate well with these practices, which forces data scientists to carefully understand the cascading effects of any change in their data science pipeline. Common consequences of this include downstream dependencies using stale data by mistake and needing to rerun an entire pipeline end-to-end for safety. When data scientists collaborate, they should be able to use the intermediate results from their colleagues instead of computing everything from scratch, just like software engineers reuse libraries of code written by others.

This blog shows how to treat data like code through the concept of Data-Driven Software (DDS). This methodology, implemented as a lightweight and easy-to-use open-source Python package, solves all the issues mentioned above for single user and collaborative data pipelines written in Python, and it fully integrates with a Lakehouse architecture such as Databricks. In effect, it allows data engineers and data scientists to YOLO their data: you only load once — and never recalculate.

Data-driven software: a first example

To get a deeper understanding of DDS, let’s walk through a common operation in sample data science code: downloading a dataset from the internet. In this case, a sample of the Uber New York trips dataset.


import io
import pandas
import requests

data_url = "https://github.com/fivethirtyeight/uber-tlc-foil-response/raw/master/uber-trip-data/uber-raw-data-apr14.csv"

def fetch_data():
    raw_content = requests.get(url=data_url, verify=False).content
    return pandas.read_csv(io.StringIO(raw_content.decode('utf8')))

taxi_dataframe = fetch_data()

This simple function illustrates recurring challenges for a data scientist:

  • every time the function is called, it slows down the execution by downloading the same dataset.
  • adding manual logic to write the content is error-prone. What happens when we want to update the URL to use another month, for example?

DDS consists of two parts: routines that analyze Python code and a data store that caches Python objects or datasets on persistent storage (hard drive or cloud storage). DDS builds the dependency graph of all data transformations done by Python functions. For each function call, it calculates a unique cryptographic signature that depends on all the inputs, dependencies, calls to subroutines and the signatures of these subroutines. DDS uses the signatures to check if the output of a function is already in its store and if it has changed. If the code is the same, so are the signatures of the function call and the output. Here is how we would modify the above Uber example with a simple function decorator:


import dds

@dds.data_function("/taxi_dataset")
def fetch_data():
    raw_content = requests.get(url=data_url, verify=False).content 
    return pandas.read_csv(io.StringIO(raw_content.decode('utf8')))

taxi_dataframe = fetch_data()

Here is the representation inside DDS of the same function. DDS omits most of the details of what the code does and focuses on what this code depends on (the `data_url` variable, the function `read_csv` from pandas and the python modules `io` and `requests`). For our code, the output of `fetch_data()` is associated with a unique signature (fbd5c23cb9). This signature will change if either the URL or the body of the function is updated.


Fun <__main__/fetch_data> /taxi_dataset                          signature:fbd5c23cb9
  |- Dependency data_url -> <__main__/data_url>              signature:9a3f6b9131
  |- ExternalDependency io -> 
  |- ExternalDependency pandas -> 
  `- ExternalDependency requests -> 

When calling this function for the first time, DDS sees that the signature fbd5c23cb9 is not present in its store and has not been calculated yet. It calls the `fetch_data()` function and stores the output dataframe under the key fbd5c23cb9 in its persistent store. When calling this function a second time, DDS sees that the signature fbd5c23cb9 is present in its store. It does not need to call the function and simply returns the stored CSV content. This check is completely transparent and takes milliseconds, which is much faster than downloading the data from the internet again! Furthermore, because the store is persistent, the signature is preserved across multiple executions of the code. When the code gets updated, for example when `data_url` changes, then (and only then) will the calculations be retriggered.

This code shows a few features of DDS:

  • Tracking only the business logic: DDS makes the choice by default of just analyzing the user code and not all the “system” dependencies such as `pandas` or `requests`.
  • Storing all the evaluated outcomes in a shared store: This ensures that all functions called by one user are cached and immediately available to colleagues, even if they are working on different versions of the codebase.
  • Building a high-level view of the data pipeline: There is no need to use different tools to represent the data pipeline. The full graph of dependencies between datasets is extracted by parsing the code. A full example of this feature will be shown in the use case.

Most importantly, users of this function do not have to worry if it depends on complex data processing or I/O operations. They simply call this function as if it was a “well-behaved” function that just instantly returns the dataset they need. Updating datasets is not required, as it is all automatically handled when code changes. This is how DDS breaks down the barrier between code and data.

Data-driven software

The idea of tracking changes of data through software is not new. Even the venerable GNU Make program, invented in 1976, is still used to update data pipelines. A couple of tools have similar automation objectives, with different use cases.

DDS can accommodate Python objects of any shape and size: Pandas or Apache Spark DataFrames, arbitrary python objects, scikit-learn models, images and more. Its persistent storage natively supports a wide variety of storage systems – a local file system and its variants (NFS, Google drive, SharePoint), Databricks File System (DBFS), and Azure Data Lake (ADLS Gen 2) – and can easily be extended to other storage systems.

Use case: how DDS helps a major European bank

DDS has been evaluated on multiple data pipelines within a major European bank. We present here an application in the realm of crime detection.

Challenge

The bank has the legal and social duty to detect clients and transactions that might be associated with financial crime. For a specific form of financial crime, the bank has decided to build a new machine learning (ML) model from scratch that scans clients and transactions to flag potential criminal activities.

The raw data for this project (banking transactions over multiple years in Delta Lake tables) was significant (600+ GB). This presents several challenges during the development of a new model:

  • Data scientists work in teams and must be careful not to use old or stale data.
  • During the exploration phase, data scientists use a combination of different notebooks and scripts, making it difficult to keep track of which code generated which table.
  • This project is highly iterative in nature, with significant changes in the business logic at different steps of the data pipeline on a daily basis. A data scientist simply cannot afford to wait for the entire pipeline to run from the beginning because they made an update in the next-to-last step.

Solution

This project combines all the standard frameworks (Apache Spark, GraphFrames, pandas and scikit-learn), with all code structured in functions that look similar to the following skeleton. The actual codebase generates several dozen ML features coded in thousands of lines of Python code.


import dds
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

@dds.data_function("/table_A")
def get_table_A() -> DataFrame:
    dataA = spark.read.parquet("raw_data/dataA")
    # … transformation steps
    return dataA

@dds.data_function("/table_B")
def get_table_B():
    dataB = spark.read.parquet("raw_data/dataB")
    # … transformation steps
    return dataB

@dds.data_function("/feature1")
def get_feature1():
    tA = get_table_A()
    tB = get_table_B()

    t = tA.join(tB, "Key_AB")
    df = t.filter(F.col("Status") == 1).groupBy("ClientID").count().withColumnRenamed("count", "Feature1")
    check_no_null_and_no_missing_customers(df, "ClientID")  # data quality check defined elsewhere in the codebase
    return df

If something in the code changes for table_A or table_B, that table and feature1 will be re-evaluated. Under any other circumstances, DDS will recognize that nothing has changed and move on. Here is a comparison of running times for the previous example:

  1. Code change in table_B: 28.3 min
  2. Code change in get_feature1: 19.4 min
  3. No change (DDS loading the cached Spark dataframe): 2.7 sec

Compared to running from scratch, that is a reduction of 99.8% in computational time!

Visualizing what is new

DDS includes a built-in visualization tool that shows which intermediate tables will be rerun based on the changes in the code. Here, highlighted in green, we see that because the code that generates table B has changed, both feature1 and feature2 will need to be rerun.

This feature only relies on inspecting the Python code and does not require running the pipeline itself. It was found so useful that every code change (pull request) displays this graph in our CI/CD pipeline. Here is an example of visualization (the actual names have been changed). In this case, one feature is being updated (“feature4”), which is also triggering the update of dependent features (“group2_profiles” and “feature9_group3”):

Example of visualization in DDS where one feature is being updated and also triggering the update of dependent features.

 

As one data scientist put it, “we would not have dared to have so many data dependencies without a tool like DDS.”

DDS also facilitates constructing pipelines with PySpark and can directly take advantage of a Lakehouse architecture:

  • It is a natural solution to checkpoint intermediate tables
  • It can make use of the ACID properties of the underlying storage and can leverage a Delta Lake

Conclusion

DDS reduces the problem of data coherency to the problem of tracking the code that was used to generate it, which has been thoroughly investigated. As seen in the examples, DDS can dramatically simplify the construction of data pipelines and increase collaboration inside data teams of engineers, data scientists and analysts. In practice, current DDS users found that their expectations around collaboration have significantly increased since adopting DDS; they now take for granted that accessing any piece of data (ML models, Spark DataFrames) is instantaneous, and that running any notebook always takes seconds to complete. All the usual collaborative operations of forking or merging can be performed without fear of breaking production data. Rolling out or updating to the latest version of the data is often as fast as a Git checkout.

We believe it is time to break down the barrier between code and data, making any piece of data instantly accessible as if it were a normal function call. DDS was implemented with Python and SQL users in mind. We see it as a stepping stone towards a more general integration of data, engineering and AI for any platform and any programming language.

For a deeper dive into this topic, check out the Tech Talk: Towards Software 2.0 with data-driven programming.

How to get started

To get started using DDS, simply run `pip install dds_py`. We always welcome contributions and feedback, and look forward to seeing where DDS takes you!

As with any software product, the journey is never finished. The package itself should be considered a “stable beta”: the APIs are stable, but the underlying mechanisms for calculating signatures can still evolve (triggering recalculations for the same code) to account for obscure corner cases of the Python language. Contributions and feedback are particularly welcome in this area.

Acknowledgments

The authors are grateful to Brooke Wenig, Hossein Falaki, Jules Damji and Mikaila Garfinkel for their comments on the blog.

--


The post Data-driven Software: Towards the Future of Programming in Data Science appeared first on Databricks.

Guide to Media & Entertainment Sessions at Data + AI Summit 2021


Data + AI Summit is the global event for the data community, where 100,000 practitioners, leaders and visionaries come together to engage in thought-provoking dialogue and share the latest innovations in data and AI.

At this year’s Data + AI Summit, we’re excited to announce a full agenda of sessions for data teams in the Media & Entertainment industry. Leading innovators from across the industry – including Disney, Comcast, Conde Nast, and CBC – are joining us to share how they are using data & AI to acquire and retain subscribers in D2C, maximize advertising engagement, and use real-time data to optimize the audience experience.

Media & Entertainment Tech Talks

Here’s an overview of some of our most highly anticipated sessions at this year’s summit:

  • M&E Industry Forum: Join us for our capstone M&E event as leaders in the industry engage in a panel discussion on the future of content and audience monetization
  • Disney: Customer Experience at Disney+ Through Data Perspective
  • Comcast: SQL Analytics Powering Telemetry Analysis
  • Conde Nast: Modeling customer lifetime value for Subscription Business

Check out the full list of Media & Entertainment talks at Summit.

For Practitioners: Check out the Use Case Solutions Theater for hands-on demos and Ask an Expert

Join us for live demos on the hottest data analysis use cases in the M&E industry, including generating personalized content recommendations, filtering out toxicity in live gaming, and predicting subscriber churn. You’ll be able to ask your questions to our expert data scientists for your industry. Link coming soon.

Sign-up for the M&E Experience at Summit!

Make sure to register for the Data + AI Summit to take advantage of all the amazing Media & Entertainment sessions, demos and talks scheduled to take place. Registration is free!

--


The post Guide to Media & Entertainment Sessions at Data + AI Summit 2021 appeared first on Databricks.

Rise of the Lakehouse


With the fast-moving evolution of the data lake, Billy Bosworth and Ali Ghodsi share their mutual thoughts on the top 5 common questions they get asked about data warehouses, data lakes and lakehouses. Coming from different backgrounds, they each provide unique and valuable insights into this market. Ali has spent more than 10 years on the forefront of research into distributed data management systems; is an adjunct professor at UC Berkeley; and is the co-founder and now CEO of Databricks. Billy has spent 30 years in the world of data as a developer, database administrator and author; has served as CEO and senior executive at software companies specializing in databases; has served on public company boards, and is currently the CEO of Dremio.

What went wrong with Data Lakes?

Ali Ghodsi
Let’s start with one good thing before we get to the problems. They enabled enterprises to capture all their data – video/audio/logs – not just the relational data, and they did so in a cheap and open way. Today, thanks to this, the vast majority of the data, especially in the cloud, is in data lakes. Because they’re based on open formats and standards (e.g. Parquet and ORC), there is also a vast ecosystem of tools, often open sourced (e.g. TensorFlow, PyTorch), which can directly operate on these data lakes. But at some point, just collecting data for the sake of collecting it is not useful, and nobody cares about how many petabytes you’ve collected, but what have you done for the business? What business value did you provide?

It turned out it was hard to provide business value because the data lakes often became data swamps. This was primarily due to three factors. First, it was hard to guarantee that the quality of the data was good because data was just dumped into it. Second, it was hard to govern because it’s a file store, and reasoning about data security is hard if the only thing you see are files. Third, it was hard to get performance because the data layout might not be organized for performance, e.g. millions of tiny comma-separated files (CSVs).

Billy Bosworth
All technologies evolve, so rather than think about “what went wrong” I think it’s more useful to understand what the first iterations were like. First, there was a high correlation between the words “data lake” and “Hadoop.” This was an understandable association, but the capabilities now available in data lake architectures are much more advanced and easier than anything we saw in the on-prem Hadoop ecosystem. The second is that data lakes became more like swamps where data just sat and accumulated without delivering real insight to the business. I think this happened due to overly complex on-premises ecosystems without the right technology to seamlessly and quickly allow the data consumers to get the insights they needed directly from the data in the lake. Finally, like any new technology, it lacked some of the mature aspects of databases such as robust governance and security. A lot has changed, especially in the past couple of years, but those seem to be some of the common early issues.

What do you see as the biggest changes in the last several years to overcome some of those challenges?

Billy
A defacto upstream architecture decision is what really got the ball rolling. In the past few years, application developers simply took the easiest path to storing their large datasets, which was to dump them in cloud storage. Cheap, infinitely scalable and extremely easy to use, cloud storage became the default choice for people to land their cloud-scale data coming out of web and IoT applications. That massive accumulation of data pushed the innovation that was necessary to access the data directly where it lived versus trying to keep up with copies to traditional databases. Today, we have a rich set of capabilities that deliver things previously only possible in relational data warehouses.

Ali
The big technological breakthrough came around 2017 when three projects simultaneously enabled building warehousing-like capabilities directly on the data lake: Delta Lake, Hudi and Iceberg. They brought structure, reliability, and performance to these massive datasets sitting in data lakes. It started with enabling ACID transactions, but soon went beyond that with performance, indexing, security, etc. This breakthrough was so profound that it was published in the top academic conferences (VLDB, CIDR etc).

Why use another new term, “Lakehouse” to describe data lakes?

Ali
Because they’re so radically different from data lakes that it warrants a different term. Data lakes tend to become data swamps for the three reasons I mentioned earlier, so we don’t want to encourage more of that, as it’s not good for enterprises. The new term also gives us the opportunity to guide these enterprises to land a data strategy that can provide much more business value rather than repeating the mistakes of the past.

Billy
If you look at something like Werner Vogels’ blog post from Jan 2020 highlighting the tremendous advantages and capabilities of an open data lake architecture, you see a giant evolution from how data lakes were perceived even just a few years ago. Mostly this is true for data analytics use cases that were only thought to be possible in a data warehouse. Therefore, the term “Lakehouse” brings a new connotation to the current world of open data architectures, allowing for fresh association with rich data analytics capabilities. When underlying technologies evolve dramatically, new names are often created to represent new capabilities. That is what I think we see happening with the term “Lakehouse.”

Why consider Lakehouses at all? Why not just continue to use data warehouses?

Billy
The data problems of today are not just a little different from those of the past; they are radically, categorically different. Among the many issues with data warehouses is time. Not the time it takes them to run a query, but the time it takes data teams to get the data into and out of the data warehouse using a labyrinth of ETL jobs. This highly complex chain of data movement and copying introduces onerous change management (a “simple” change to a dashboard is anything but simple), adds data governance risks and ultimately decreases the scope of data available for analytics because subsets tend to get created with each copy.

Often I hear people talk about the “simplicity” of a data warehouse. Zoom out just a tiny bit and you will always find a dizzying web of interconnected data copy and movement jobs. That is not simple. So the question is, why go through all that copying and moving if you don’t have to? In a Lakehouse, the design principle is that once the data hits data lake storage, that’s where it stays. And the data is already hitting data lake storage, even before the analytics team has anything to say about it. Why? Because as I said earlier, developers now use it as the de facto destination for their data exhaust. So once it’s there, why move it anywhere else? With a Lakehouse, you don’t have to.

Ali
The most important reason has to do with machine learning and AI, which is very strategic for most enterprises. Data warehouses don’t have support for sparse data sets that ML/AI uses, such as video, audio and arbitrary text. Furthermore, the only way to communicate with them is through SQL, which is amazing for many purposes, but not so much for ML/AI. Today, a vast open ecosystem of software is built on Python, for which SQL is not adequate. Finally, the vast majority of the data today is stored in data lakes, so migrating all of that into a data warehouse is nearly impossible and cost-prohibitive.

Other than eliminating data copies, what do you personally consider to be the biggest advantages of a Lakehouse?

Ali
The direct support for ML/AI. This is where the puck is going. Google would not be around today if it wasn’t for AI or ML. The same is true for Facebook, Twitter, Uber, etc. Software is eating the world, but AI will eat all software. Lakehouses can support these workloads natively. If I can mention more than one advantage, I would say that there are already massive datasets in data lakes, and the Lakehouse paradigm enables making use of that data. In short, it lets you clean up your data swamp.

Billy
I’ve spent my entire career working with databases, and almost all of it on the operational side. As I recently moved more into the world of data analytics, frankly, I felt like I was in a time machine when I saw the data warehouse model still being used. On the operational side of the world, architectures have long since moved from big and monolithic to services-based. The adoption of these services-based architectures is so complete that it hardly bears mentioning. And yet, when you look at a data warehouse-centric architecture, it’s like looking at an application architecture from 2000. All the advantages of services-based architectures apply to the analytics world just as much as they do to the operational world. A Lakehouse is designed to make your data accessible to any number of services you wish, all in open formats. That is really key for today and the future. Modular, best-of-breed, services-based architectures have proven to be superior for operational workloads. Lakehouse architectures allow the analytics world to quickly catch up.

Does implementing a Lakehouse mean “ripping and replacing” the data warehouse?

Billy
Perhaps the best thing about implementing a Lakehouse architecture is that your application teams have already likely started the journey. Companies have datasets already available that make it easy to get started implementing a Lakehouse architecture. Unwinding things from the data warehouse is not necessary. The most successful customer implementations we see are ones that start with a single use case, successfully implement it, then ask “what other use cases should we implement directly on the Lakehouse instead of copying data in the data warehouse?”

Ali
No it does not. We haven’t seen anyone do it that way. Rather, the Data Warehouse becomes a downstream application of the Lakehouse, just like many other things. Your raw data lands in the data lake. The Lakehouse enables you to curate it into refined datasets with schema and governance. Subsets of that can then be moved into data warehouses. This is how everyone starts, but as the use cases on the Lakehouse get more successful, almost all enterprises we have worked with end up moving more and more workloads directly to the Lakehouse.

--


The post Rise of the Lakehouse appeared first on Databricks.

Building Forward-Looking Intelligence With External Data


This post was written in collaboration with the Foursquare data team. We thank co-author Javier Soliz, sales engineer specializing in data engineering and geospatial analysis at Foursquare, for his contribution.

 
“In an interlocked global economy, triggering events can quickly set off a chain reaction,” wrote Boston Consulting Group in early 2020 as the world grappled with the COVID pandemic. Already in the first few months of 2021, we have experienced wildfires in western Australia, winter storms causing millions to lose power for days in Texas, a powerful earthquake off the coast of Japan, flooding and evacuations in both eastern Australia and Hawaii, political unrest surrounding the U.S. presidential election and a single ship shutting down a major global shipping route between Europe and Asia – all while the world struggles to recover from a global recession triggered by the pandemic. With no shortage of triggering events, organizations are now investing heavily in resilience.

A common notion of resilience is a return to normalcy following a disruptive event. But as the COVID pandemic illustrates, what was normal before may not be normal after. We’ve seen a remarkable shift in patterns of consumer mobility and spending. Once the initial panic over shortages of staples such as toilet paper subsided, oat milk and sweatpants became the new must-have items. Businesses that could fulfill this demand through online purchasing, home delivery and curbside pickup saw significant growth, while others saw their share of the market decline. Emerging from the pandemic, even more shifts in consumer spending patterns are expected.

The bottom line for businesses is that the uncertainty that affects their internal operations also affects the consumers they serve. Organizations seeking resilience need not only an internal focus on performance management but an external focus on the markets within which they operate.

Building forward-looking intelligence

The Texas-based grocery chain, HEB, provides an excellent example of how organizations may balance an inward focus on performance management with an outward focus on risk detection. Leveraging methodologies that examine potential future scenarios to understand an organization’s particular vulnerabilities, HEB was able to identify key risks to its organization well ahead of the pandemic. As the COVID crisis emerged, the grocer knew to be on the lookout for potential disruptions in regions critical to its supply chain and began the process of stocking up on essential items likely to be affected.

While a pandemic was not a specific threat identified by HEB, its assessment of the organization’s vulnerabilities told it where to look for emerging threats. The signals needed to identify those threats would not be found in its internal data until the threat was already upon the organization, so it looked to outside information sources to provide the early warning it needed to put its planned response in motion. HEB’s ability to successfully navigate the early days of the COVID pandemic is multifaceted, but looking outside the organization for forward-looking signals was a key part of it. For its early, effective and ongoing efforts in managing the pandemic, HEB was recognized as the 2020 Grocer of the Year by GroceryDive, a leading trade journal.

Leveraging external data

The growing awareness of the need for organizations to look beyond their own four walls is driving a surge of interest in external data sources. A recent survey by Forrester indicates that 70% of organizations have acquired or are in the process of acquiring new external data assets, with another 17% intending to do so within the coming year. In response, a growing number of data providers, aggregators and marketplaces are making all types of information, such as weather data, more accessible. (See also alternative data.)


Figure 1. Commonly used external data from a report by McKinsey & Company

Effective use of such information requires careful consideration. Here are a few best practices:

Before acquiring external data, carefully consider the insights your organization wishes to obtain from it. A careful review of the terms and conditions associated with the data, as well as a consideration of how the data is sourced and how customers might respond to your company using it, should help you steer clear of potential problems.

If cleared for use, it is important to understand how the data is collected and prepared for distribution, how far back the data is available, and how fit it is for your organization’s intended uses. Many data providers make both documentation and samples available for just this purpose.

Weigh the technical challenges of leveraging the external data sources. The volume of historical data and periodic updates, the frequency with which it is updated and the mechanisms by which data is made available are key considerations. Also determine how data assembled outside the organization may be reconciled with internally generated data. Differences in temporal and spatial levels of granularity, as well as different ways of expressing overlapping dimensions, may require the data to undergo significant processing to be made available for analysis. For many organizations, the physical and logical challenges of integrating external data necessitate the adoption of new, more flexible and more cost-effective data management approaches over classic data warehousing approaches developed for the analysis of operational information.

Ensure value is derived from the data on an ongoing basis. Careful documentation, education and evangelism, and ongoing utilization monitoring can help ensure the data earns its keep. Many larger data providers assist their customers with this and may be able to provide guidance and best practices. These suggestions and many others for the effective use of external data can be found in published guidance from both McKinsey and Forrester.

Examining foot traffic with Foursquare data

To further explore how external data may be employed, we partnered with Foursquare, a leading provider of location technology and data, to examine the impact of COVID on taco shops in the US.

Why taco shops? Like most quick service restaurants, these establishments are highly dependent on foot traffic, a key aspect of consumer engagement disrupted during the pandemic. These establishments also tend to be smaller, independent businesses and, as has been noted in some regional reporting, are thus more capable of adapting their business models in response to the pandemic. Finally, while this analysis can be applied to any number of businesses represented in the Foursquare dataset, two of our authors are from Texas, where tacos are a much-loved regional staple.

With foot traffic data collected through Foursquare’s Pilgrim SDK and made available through its Places and Visits databases, we examined the visitation rates of customers to taco shops in various regions of the country. Leveraging population estimates from the US Census Bureau, we were able to see a clear picture of the regional importance of these establishments.


Figure 2. Visits to taquerias relative to population size, logarithmically scaled, for the years 2017 through 2020

To align the point locations of individual businesses with the county-level metrics provided by the US Census Bureau, we leveraged the Uber H3 grid system, which maps geographic locations to hexagonal grids of varying resolutions. This system made it easier for us to overlay additional datasets, such as county-level COVID case counts.
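For readers new to H3, the pattern is a one-line conversion from a latitude/longitude pair to a hexagonal cell index, after which datasets mapped to the same resolution can be joined on the cell id. The short Python sketch below is purely illustrative; the coordinates, the resolution and the use of the h3-py v3 API are our assumptions, not code taken from the analysis notebooks.

import h3

# Map a point location to an H3 cell (h3-py v3 API)
lat, lng = 30.2672, -97.7431   # illustrative coordinates
resolution = 7                 # hexagons of roughly 5 km²

cell = h3.geo_to_h3(lat, lng, resolution)
neighbors = h3.k_ring(cell, 1)  # the cell plus its immediate ring of neighbors

Once locations are mapped to cells at a common resolution, overlaying additional datasets becomes a simple join on the cell index.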

Our analysis shows that while the number of taco shops has been increasing over the last few years, customer visits per restaurant had declined prior to the COVID pandemic. While the vast majority of restaurants are independent, the bulk of the traffic to taco shops was consumed by chain establishments.


Figure 4. Per location customer visits for independent vs. chain taquerias

With the emergence of COVID in early 2020, a strong initial dip in visitations was followed by a return of customers to stores in May at about 75% of the levels seen across prior years.


Figure 5. Impact of COVID on store visitations

Examining year-over-year numbers, the independent restaurants appear to have recovered better than chains following this initial dip. As reported in other venues, the agility of smaller, independent establishments may account for some of their better rebound. Shop local efforts may also have contributed to the pattern with customers favoring neighborhood establishments over larger chains. But independent restaurants have also seen better year-over-year visitation numbers relative to chains just prior to the pandemic, indicating that forces favoring them in 2020 predate the pandemic.


Figure 6. Year-over-year changes in shop visits for independent vs. chain restaurants

This is a positive bit of news for these small businesses, which have been losing ground to chain restaurants. Looking ahead, we forecast continued overall improvements in visitation numbers, which should be good news for independents and chains alike. That said, these projections depend on reliable forecasts of COVID numbers, something that has eluded public health experts to date. In our analysis we made what we felt was a reasonable projection for a limited period of time, but in the end we found that forecasts were only reliable for a 2-3 month horizon. All of this is to say that there are still many unknowns, and while we are hopeful for a recovery, this is a scenario that will need to be frequently revisited as new information becomes available. Based on our experience with other QSRs and retailers, we believe this same caveat applies broadly across the industry.


Figure 7. Historical and forecasted store visits for subset of regions for which forecasts could be made

To examine our analysis in more detail, including the data preparation work required to spatially align our datasets, please explore the following notebooks:

Databricks and Foursquare would like to extend our best wishes to all the local restaurateurs and their employees who have and continue to navigate the uncertainty of the pandemic. Please remember to support your local restaurants.

--

Try Databricks for free. Get started today.

The post Building Forward-Looking Intelligence With External Data appeared first on Databricks.


MIT Tech Review Study: Building a High-performance Data and AI Organization — The Data Architecture Matters.


Only 13% of organizations, the “super achievers,” are succeeding at their data and AI strategy, yet the successful application of data and AI has never been a greater necessity for survival than it is now.¹ In order to remain forward-thinking in today’s landscape, data leaders are looking for the ability to eliminate the silos that traditionally separate analytics, data science and machine learning through lakehouse platforms, unifying their data, analytics and AI under a simple, open and collaborative data architecture. The old adage of “this is how things have always been” is a recipe for failure, and the successful use of data and AI by a group of innovative organizations is forcefully transforming every industry. This blog provides insight into what the super achievers attribute their success to, as well as what data and technology leaders cite as a critical enabler for building data cultures, their challenges with ML, their priority investment areas over the next two years, and what they would focus on if given a redo button.

Databricks/MIT report: Deliver measurable business results with Data + AI

Growing importance and struggles with data and AI

Research confirms the obsession with data + AI is extending beyond the practitioner and into the board room. Leaders are also shifting their mindset to no longer just think about what data they have, but rather how that data is being used to fuel innovation and growth. In fact, in the 2021 Big Data and AI Executive Survey, NewVantage Partners found that 92% of executives report the pace of Big Data/AI investment in their organization is accelerating, up 40% from the previous year², and McKinsey & Co. estimates that analytics and AI will create over $15 trillion in new business value by 2030³. Yet despite this growing priority, very few organizations actually implement their strategy successfully – only 13%.¹ One angle we rarely examine is what these so-called “super achievers” are doing to drive their success.

Spoiler alert, the data architecture matters a lot more than you would think

Based on interviews with 9 data and analytics leaders from brands like McDonald’s, CVS Health, L’Oreal, and Northwestern Mutual, in addition to a survey of 350 CIOs, CDOs, CTOs, and other leaders, MIT Tech Review, in collaboration with Databricks, found in its latest report, “Building a high-performance data and AI organization,” that the challenge starts with the data architecture. Organizations need to build four different stacks to handle all of their data workloads: business analytics, data engineering, streaming, and ML. All four of these stacks require very different technologies and, unfortunately, they sometimes don’t work well together. The technology ecosystem across data warehouses and data lakes further complicates the architecture, and it ends up being expensive and resource-intensive to manage. That complexity impacts data teams. Data and organizational silos can inadvertently slow communication, hinder innovation and create conflicting goals among teams. The result is multiple copies of data, no consistent security/governance model, closed systems, and less productive data teams.

Meanwhile, ML remains an elusive goal. With the emergence of lakehouse architecture, organizations are no longer bound by the confines and complexity of legacy architectures. By combining the performance, reliability, and governance of data warehouses with the scalability, low cost, and workload flexibility of the data lake, lakehouse architecture provides flexible, high-performance analytics, data science, and ML.

At Databricks we bring the lakehouse architecture to life through the Databricks Lakehouse Platform which excels in three ways:

  1. It’s simple: Data only needs to exist once to support all workloads on one common platform.
  2. It’s open: Based on open source and open standards, it’s easy to work with existing tools and avoid proprietary formats.
  3. It’s collaborative: Data engineers, analysts, and data scientists can work together more efficiently.

The cost savings, efficiencies, and productivity gains offered by the Databricks Lakehouse Platform are already making a bottom-line impact on enterprises in every industry and geography. Freed from overly complex architecture, Databricks provides one common cloud-based data foundation for all data and workloads across all major cloud providers. Data and analytics leaders can foster a data-driven culture that focuses on adding value by relieving the daily grind of planning and all its complexities, with predictive maintenance.

Additional findings from the study

In addition to an effective and efficient data architecture being the prime reason for success, the study also found:

  • Open standards are the top requirements of future data architecture strategies. If respondents could build a new data architecture for their business, the most critical advantage over the existing architecture would be a greater embrace of open source standards and open data formats.
  • Technology-enabled collaboration is creating a working data culture. The CDOs interviewed for the study ascribe great importance to democratizing analytics and ML capabilities. Pushing these to the edge with advanced data technologies will help end-users to make more informed business decisions — the hallmarks of a strong data culture.
  • ML’s business impact is limited by difficulties managing its end-to-end lifecycle. Scaling ML use cases is exceedingly complex for many organizations. According to 55% of respondents, the most significant challenge is the lack of a central place to store and discover ML models.
  • Enterprises seek cloud-native platforms that support data management, analytics, and machine learning. Organizations’ top data priorities over the next two years fall into three areas, all supported by broader adoption of cloud platforms: improving data management, enhancing data analytics and ML, and expanding the use of all types of enterprise data, including streaming and unstructured data.

From video streaming analytics to customer lifetime value, and from disease prevention to finding life on Mars, data is part of the solution. To succeed with data and AI, organizations need better tooling to handle the data management fundamentals across the enterprise. Download your copy of the report to dive into the analysis and better understand the interviewees’ viewpoints.

 

Attribution

¹ MIT Tech Review – Building a high-performance data and AI organization
² NewVantage Partners – Big Data and AI Executive Survey
³ McKinsey & Company – The executive’s AI playbook

--

Try Databricks for free. Get started today.

The post MIT Tech Review Study: Building a High-performance Data and AI Organization — The Data Architecture Matters. appeared first on Databricks.

Improved Tableau Databricks Connector With Azure AD Authentication Support


With the release of Tableau 2021.1, we have added new functionality to the Tableau Databricks Connector that simplifies security administration and streamlines the connection experience for end users. The updated connector lets Tableau users connect to Azure Databricks with a couple of clicks, using Azure Active Directory (Azure AD) credentials and SSO for Tableau Online users. This integration enables organizations to scale the management of users for Tableau and Databricks by making it seamless with your existing tools and processes.

For more information on our partnership with Tableau, visit www.databricks.com/tableau.

The native Tableau Databricks Connector in combination with the recently launched SQL Analytics service provides Tableau and Databricks customers with a first-class experience for performing business intelligence (BI) workloads directly on their Delta Lake. SQL Analytics allows customers to operate a lakehouse architecture on multiple clouds that provides data warehousing performance at data lake economics.

The Tableau Databricks Connector comes with the following improvements:

Support for Azure AD and SSO when connecting to Azure Databricks

Users can use their Azure AD credentials to connect from Tableau to Azure Databricks. Tableau Online users can access shared reports using SSO, using their own Azure AD credentials when accessing data in Databricks directly. Administrators no longer need to generate Personal Access Tokens for users for authentication.

Simple Connection Configuration to all clouds

The updated Tableau Databricks Connector allows the connection to be configured with a couple of clicks. Users select Databricks from the Tableau Connect menu, enter the Databricks-specific connection details and authentication method (Azure AD for Azure Databricks, Personal Access Tokens or Username / Password). After signing in, users are ready to query their data!

The updated Databricks Connector for Tableau allows the connection to be configured with a couple of clicks.

Faster results via Databricks ODBC on all clouds

The Databricks ODBC driver embedded in Tableau Online has been optimized with reduced query latency, increased result transfer speed based on Apache Arrow™ serialization and improved metadata retrieval performance. For Tableau Desktop and Server, the driver needs to be manually updated (download the latest version here). Advanced ODBC configurations can be set directly in the Advanced tab of the connection dialog.

Update Now!

If you’re using an older version of Tableau, update now to take advantage of these improvements. Tableau Desktop and Server 2021.1 and the latest version of Tableau Online provide a much better experience on Databricks.

Get the updated Databricks ODBC Driver.

--

Try Databricks for free. Get started today.

The post Improved Tableau Databricks Connector With Azure AD Authentication Support appeared first on Databricks.

Your Guide to Retail & Consumer Goods Sessions at Data + AI Summit 2021


Data + AI Summit is the global event for the data community, where 100,000 practitioners, leaders and visionaries from around the globe come together to engage in thought-provoking dialogue and share the latest innovations in data and AI.

At this year’s Data + AI Summit, we’re excited to announce a full agenda of sessions for data teams in the Retail & Consumer Goods (CPG) industry. Leading innovators from across the industry – including Apple, H&M, Albertsons, Mars, Reckitt, Walmart Labs, Anheuser-Busch and Stitch Fix – are joining us to share how they are using data to innovate the shopping and supply-chain experience.

Retail & Consumer Goods Keynotes

On our main stage, we have Sol Rashidi, Chief Analytics Officer at Estée Lauder. With 7 patents to her name and recognition among the “50 Most Powerful Women in Tech” and “Top 100 Innovators in Data & Analytics,” Sol has been in the data and analytics space since before it became “cool.” Her ability to translate complex disciplines into clear concepts for the business has always earned her a seat with the business, with her teams bridging the gap and codifying the partnership between IT and the business.

Also on the main stage, keep a lookout for Patrick Bagniski, a data science and machine learning leader from McDonald’s.


Sol Rashidi, Chief Analytics Officer at Estée Lauder.

Retail & CPG Industry Forum

Join us on Thursday, May 27 at 11 am PT for our capstone Retail & CPG event at Data + AI Summit, which will feature thought leaders from some of the biggest global brands. Hear firsthand how they are unlocking the power of data + AI in novel and different ways. With a keynote led by Marcin Kaluzny from Reckitt and a panel of data and AI leaders from across the industry, you will walk away with new ideas and insights to act on!

 

 

Panelists include:

  • Errol Koolmeister, Head of AI Foundation
  • Colleen Qiu, VP, Head of Data Science
  • Deepak Jose, Head of Business Strategy and Analytics
  • Alberto Rossi, Global Head of Retail Data & Analytics
  • Robert Barham, Director of Data
  • Marcin Kaluzny, Director of Data Analytics

 

Retail & CPG Tech Talks

Here’s an overview of some of our most highly anticipated Retail & CPG sessions at this year’s summit:

Structured Streaming Use-Cases at Apple

Kristine Guo & Liang-Chi Hsieh, Apple

In response to tremendous streaming requirements, Apple has actively worked on developing structured streaming in Apache Spark in the past few months. In this talk, Kristine Guo and Liang-Chi Hsieh will detail some of the issues that arose when applying structured streaming and what was done to address them.

Weekday Demand Sensing at Walmart

John Bowman, Walmart Labs

Walmart Labs will discuss its innovative, cloud-agnostic, scalable platform built to improve Walmart’s ability to predict customer demand while optimizing in-stocks and reducing food waste.

Modularized ETL Writing with Apache Spark

Neelesh Salian, Stitch Fix

The talk will focus on ETL writing at Stitch Fix and the modules that help their data scientists on a daily basis. Having these modules at the time of writing data allows cleaning, validation and testing of data prior to entering the data warehouse, programmatically relieving Stitch Fix of most of its data problems.

Building A Product Assortment Recommendation Engine

Ethan Dubois & Justin Morse, Anheuser Busch (AB)

The ability of retailers and brewers to provide optimal product assortments for their consumers has become a key business goal. Regional heterogeneities and massive product portfolios combine to scale the complexity of assortment selection. This talk will discuss how AB InBev approaches this problem with collaborative filtering and robust optimization techniques to recommend a set of products that enhance retailer revenue and product market share.

Check out the full list of Retail & CPG talks at Summit.

Demos on Popular Data + AI Use Cases in Retail & CPG

Join us for live demos on the hottest data analysis use cases in the Retail & CPG industry:

  • Fine-Grained Time Series Forecasting at Scale – Learn how retailers and manufacturers are cost-effectively generating millions of item- and location-specific forecasts on a daily basis.
  • Segmentation in the Age of Personalization – Explore a structured approach to building and analyzing segments that enables the organization to effectively engage its customers.
  • Personalizing CX with Recommendations – Explore how recommenders can be used in a variety of ways to deliver personalized customer experiences.

Don’t miss the Retail & CPG Experience at Summit!

Make sure to register for the Data + AI Summit to take advantage of all the amazing Retail & CPG sessions, demos and talks scheduled to take place. Registration is free!

--

Try Databricks for free. Get started today.

The post Your Guide to Retail & Consumer Goods Sessions at Data + AI Summit 2021 appeared first on Databricks.

Improving Customer Experience With Transaction Enrichment


The retail banking landscape has dramatically changed over the past five years with the accessibility of open banking applications, mainstream adoption of neobanks and the recent introduction of tech giants into the financial services industry. According to a recent Forbes article, millennials now represent 75% of the global workforce, and 71% claim they’d “rather go to the dentist than take advice from their banks.” The competition has shifted from a 9-to-5 brick-and-mortar branch network to winning over digitally savvy consumers who are becoming more and more obsessed with the notions of simplicity, efficiency, and transparency. Newer generations are no longer interested in hearing generic financial advice from a branch manager but want to be back in control of their finances with personalized insights, in real time, through the comfort of their mobile banking applications. To remain competitive, banks have to offer an engaging mobile banking experience that goes beyond traditional banking via personalized insights, recommendations, setting financial goals and reporting capabilities – all powered by advanced analytics like geospatial or natural language processing (NLP).

These capabilities can be especially profound given the sheer amount of data banks have at their fingertips. According to 2020 research from the Nilson Report, roughly 1 billion card transactions occur every day around the world (100 million transactions in the US alone). That is 1 billion data points that can be exploited every day to benefit the end consumers, rewarding them for their loyalty (and for the use of their data) with more personalized insights. On the flip side, that is 1 billion data points that must be acquired, curated, processed, categorized and contextualized, requiring an analytic environment that supports both data and AI and facilitates collaboration between engineers, scientists and business analysts. SQL does not improve customer experience. AI does.

In this new solution accelerator (publicly accessible notebooks are reported at the end of this blog), we demonstrate how the lakehouse architecture enables banks, open banking aggregators and payment processors to address the core challenge of retail banking: merchant classification. Through the use of notebooks and industry best practices, we empower our customers with the ability to enrich transactions with contextual information (brand, category) that can be leveraged for downstream use cases such as customer segmentation or fraud prevention.

Understanding card transactions

The dynamics of a card transaction are complex. Each action involves a point-of-sale terminal, a merchant, a payment processor gateway, an acquiring bank, a card processor network, an issuing bank and a consumer account. With so many entities involved in the authorization and settlement of a card transaction, the contextual information carried forward from a merchant to a retail bank is complicated, sometimes misleading and oftentimes counter-intuitive for end consumers, and it requires advanced analytics techniques to extract clear brand and merchant information. For starters, any merchant needs to agree on a merchant category code (MCC), a 4-digit number used to classify a business by the types of goods or services it provides (see list). The MCC by itself is usually not enough to understand the real nature of a business (e.g., large retailers selling many different goods), as it is often too broad or too specific.
 


Merchant Category Codes (Source: https://instabill.com/merchant-category-code-mcc-basics/)


In addition to a complex taxonomy, the MCC is sometimes different from one point-of-sale terminal to another, even for the same merchant. Relying on the MCC alone is not sufficient to drive a superior customer experience; it must be combined with additional context, such as the transaction narrative and merchant description, to fully understand the brand, location and nature of the goods purchased. But here is the conundrum: the transaction narrative and merchant description are free-form text filled in by a merchant without common guidelines or industry standards, hence requiring a data science approach to this data inconsistency problem. In this solution accelerator, we demonstrate how text classification techniques such as fasttext can help organizations better understand the brand hidden in any transaction narrative given a reference data set of merchants. How close is the transaction description “STARBUCKS LONDON 1233-242-43 2021” to the company “Starbucks”?

An important aspect to understand is how much data we have at our disposal to learn text patterns from. When it comes to transactional data, it is very common to come across a large disparity in the data available for different merchants. This is perfectly normal and is driven by the shopping patterns of the customer base. For example, it is to be expected that we will have easier access to Amazon transactions than to corner-shop transactions, simply due to the frequency of transactions happening at these respective merchants. Naturally, transaction data will follow a power law distribution (as represented below) in which a large portion of the data comes from a few merchants.
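As a quick illustration of how this skew can be profiled before any modeling starts, the sketch below assumes a Spark DataFrame of labeled transactions with a merchant column; the variable names are ours, not the accelerator’s.

from pyspark.sql import functions as F

# Count labeled transactions per merchant to inspect the long tail
per_merchant = (
   transactions
      .groupBy("merchant")
      .agg(F.count("*").alias("n_transactions"))
      .orderBy(F.desc("n_transactions"))
)

per_merchant.show(20)  # a handful of merchants typically dominate the counts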
 

 

Our approach to fuzzy string matching

The challenge with approaching this problem via fuzzy string matching is that large parts of the description and merchant strings simply do not match. Any string-type distance would be very high and, in effect, any similarity very low. What if we changed our angle? Is there a better way to model this problem? We believe that the problem outlined above is better modeled as document (free text) classification rather than string similarity. In this solution accelerator, we demonstrate how fasttext helps us efficiently solve the description-to-merchant translation and unlock advanced analytics use cases.

A popular approach in recent times is to represent text data as numerical vectors; two prominent concepts here are word2vec and doc2vec (see blog). Fasttext comes with its own built-in logic that converts text into vector representations based on two approaches, cbow and skipgrams (see documentation); depending on the nature of your data, one representation may perform better than the other. Our focus is not on dissecting the internals of the logic used for vectorization of text, but rather on the practical usage of the model to solve text classification problems when we are faced with thousands of classes (merchants) that text can be classified into.
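As a rough sketch of what training such a representation can look like, fasttext trains directly on a text file of cleaned descriptions. The file name and hyperparameters below are illustrative placeholders, not the values used in the accelerator.

import fasttext

# descriptions.txt: one cleaned transaction description per line (hypothetical file)
model = fasttext.train_unsupervised(
   "descriptions.txt",
   model="skipgram",   # or "cbow"; which works better depends on the data
   dim=100,
   minn=3,
   maxn=6              # character n-grams help with noisy merchant strings
)

# Vector representation of a cleaned description
vector = model.get_sentence_vector("starbucks london")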

Generalizing the approach to card transactions

To maximize the benefits of the model, data sanitization and stratification are key! Machine learning (ML) simply scales and performs better with cleaner data. With that in mind, we will ensure our data is stratified with respect to merchants. We want to ensure we can provide a similar amount of data per merchant for the model to learn from. This will avoid the situation in which the model would bias towards certain merchants just because of the frequency at which shoppers are spending with them. For this purpose we are using the following line of code:

result = data.sampleBy(self.target_column, sample_rates)

Stratification is ensured by the Spark sampleBy method, which requires a column over whose values stratification will occur, as well as a dictionary mapping each stratum label to a sample size. In our solution, we have ensured that any merchant with more than 100 rows of available labeled data is kept in the training corpus. We have also ensured that the zero class (unrecognized merchant) is over-represented at a 10:1 ratio due to higher in-text perplexity in the space of transactions our model cannot learn from. We are keeping the zero class as a valid classification option to avoid inflating false positives. Another equally valid approach is to calibrate, for each class, a threshold probability below which we no longer trust the model-produced label and default to the “Unknown Merchant” label. This is a more involved process, so we opted for the simpler approach. You should only introduce complexity in ML and AI if it brings obvious value.
 
Stratified training data
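Building the sample_rates dictionary itself is straightforward. One possible construction is sketched below, assuming a DataFrame named data with a merchant label column; the threshold and variable names are illustrative rather than the accelerator’s exact code.

# Derive per-merchant sampling fractions (illustrative construction)
target_rows = 100   # keep merchants with at least ~100 labeled rows
counts = data.groupBy("merchant").count().collect()

sample_rates = {
   row["merchant"]: min(1.0, target_rows / row["count"])
   for row in counts
   if row["count"] >= target_rows
}

stratified = data.sampleBy("merchant", fractions=sample_rates, seed=42)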
 
From the cleaning perspective, we want to ensure our model is not stifled by time spent learning from insignificant data. One such example is dates and amounts that may be included in the transaction narrative. We can’t extract merchant-level information from the date a transaction happened. Add to this the fact that merchants do not follow the same standard of representation when it comes to dates, and we immediately conclude that dates can safely be removed from the descriptions and that this will help the model learn more efficiently. For this purpose, we have based our cleaning strategy on the information presented in the Kaggle blog. As a data cleaning reference, we present the full logical diagram of how we have cleaned and standardized our data. Because this is a logical pipeline, the end user of this solution can easily modify and/or extend the behavior of any of these steps and achieve a bespoke experience.
 
Data sanitation and cleansing pipeline
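To make the idea concrete, a minimal cleaning function could look like the sketch below. The exact steps and regular expressions in the accelerator’s pipeline differ, so treat this as an illustration of the approach rather than the implementation.

import re

def clean_description(text: str) -> str:
   """Illustrative cleaning: lowercase, drop date-like patterns, digits and punctuation."""
   text = text.lower()
   text = re.sub(r"\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}", " ", text)  # date-like patterns
   text = re.sub(r"[^a-z\s]", " ", text)                         # digits and punctuation
   return re.sub(r"\s+", " ", text).strip()                      # collapse whitespace

clean_description("STARBUCKS LONDON 1233-242-43 2021")   # -> "starbucks london"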
 
After getting the data into the right representation, we leveraged the power of MLflow, Hyperopt and Apache Spark™ to train fasttext models with different parameters. MLflow enabled us to track and compare many different model runs. A critical piece of MLflow functionality is its rich UI, which makes it possible to compare hundreds of different ML model runs across many parameters and metrics:
 
Model performance visualization with MLflow
 
For a reference on how to parameterize and optimize a fasttext model, please refer to the documentation. In our solution, we have used the train_unsupervised training method. Given the number of merchants we had at our disposal (1,000+), we realized that we could not properly compare the models based on a single metric value, and a confusion matrix with 1,000+ classes does not offer the desired simplicity of interpretation. We therefore opted for an accuracy-per-percentile approach: we compared our models based on median accuracy, the worst 25th percentile and the worst 5th percentile. This gave us an understanding of how each model’s performance is distributed across our merchant space.
 
Automation of machine learning model training with Hyperopt and Spark
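The outline below sketches how such a tuning loop can be wired together with Hyperopt, MLflow and the percentile metrics described above. The search space, the train.txt file and the per_merchant_accuracy evaluation helper are hypothetical placeholders, not the accelerator’s code.

import fasttext
import mlflow
import numpy as np
from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe

def objective(params):
   with mlflow.start_run(nested=True):
      model = fasttext.train_unsupervised(
         "train.txt",                  # hypothetical training file
         model=params["model"],
         dim=int(params["dim"]),
         lr=params["lr"],
      )
      # per_merchant_accuracy: hypothetical helper returning one accuracy per merchant
      acc = np.array(per_merchant_accuracy(model))
      mlflow.log_params(params)
      mlflow.log_metric("median_accuracy", float(np.percentile(acc, 50)))
      mlflow.log_metric("p25_accuracy", float(np.percentile(acc, 25)))
      mlflow.log_metric("p5_accuracy", float(np.percentile(acc, 5)))
      # Hyperopt minimizes, so use the negative median accuracy as the loss
      return {"loss": -float(np.percentile(acc, 50)), "status": STATUS_OK}

search_space = {
   "model": hp.choice("model", ["cbow", "skipgram"]),
   "dim": hp.quniform("dim", 50, 300, 50),
   "lr": hp.uniform("lr", 0.01, 0.5),
}

best = fmin(objective, search_space, algo=tpe.suggest,
            max_evals=50, trials=SparkTrials(parallelism=4))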
 
As part of our solution, we have integrated the fasttext model with MLflow; we can load the model via the MLflow APIs and apply the best model at scale via prepackaged Spark UDFs, as in the code below:

# Load the best run's model from the MLflow tracking server
logged_model = f'runs:/{run_id}/model'
loaded_model = mlflow.pyfunc.load_model(logged_model)

# Wrap the same model as a Spark UDF so it can score data at scale
loaded_model_udf = mlflow.pyfunc.spark_udf(
   spark, model_uri=logged_model, result_type="string"
)

# Enrich each transaction with a predicted merchant label
spark_results = (
   validation_data
      .withColumn('predictions', loaded_model_udf("clean_description"))
)

This level of simplicity in applying the solution is critical. One can rescore historical transactional data with a few lines of code once the model has been trained and calibrated. These few lines of code unlock customer data analytics like never before. Analysts can finally focus on delivering complex advanced analytics use cases, in both streaming and batch, such as customer lifetime value, pricing, customer segmentation, customer retention and many other analytics solutions.

Performance, performance, performance!

The reason behind all this effort is simple: to obtain a system that can automate the task of transaction enrichment. And for a solution to be trusted in automated running mode, performance has to be high for every merchant. We have trained several hundred different configurations and compared these models with a focus on low-performing merchants. Our 5th-percentile (worst-case) accuracy came in at around 93%; our median accuracy was 99%. These results give us the confidence to propose automated merchant categorization with minimal human supervision.

These results are great, but a question comes to mind: have we overfitted? Overfitting is only a problem when we expect a lot of generalization from our model, that is, when our training data represents only a very small sample of reality and newly arriving data differs wildly from it. In our case, we have very short documents, and the “grammar” of each merchant is reasonably simple. On the other hand, fasttext generates ngrams and skipgrams, and in transaction descriptions this approach can extract all the useful knowledge. These two considerations combined indicate that even if we overfit these vectors, which by nature exclude some tokens from the knowledge representation, we will generalize nevertheless. Simply put, the model is robust enough against overfitting given the context of our application. It is worth mentioning that all the metrics produced for model evaluation are computed over a set of 400,000 transactions, and this dataset is disjoint from the training data.

Is this useful if we don’t have a labeled dataset?

This is a difficult question to answer with a yes or no. However, as part of our experimentation, we have formulated a point of view: with our framework in place, the answer is yes. We performed several ML model training campaigns with different amounts of labeled rows per merchant, leveraging MLflow, Hyperopt and Spark to train models with different parameters over different data sizes and then cross-reference and compare them over a common set of metrics.
 
Parallel model training over different data volumes
 
This approach has enabled us to answer the question: What is the smallest number of labeled rows per merchant that I need to train the proposed model and score my historical transactional data? The answer is: as low as 50, yes, five-zero!
 
Model performance visualization with MLflow (sample size expressed in number of transactions)
 
With only 50 records per merchant, we maintained 99% median accuracy, and the 5th-percentile accuracy decreased by only a few percentage points to 85%. On the other hand, the results obtained with 100 records per merchant were 91% accuracy for the lowest 5th percentile. This indicates that certain brands do have more perplexing description syntax and might need a bit more data. The bottom line is that the system is operational with great median performance and reasonable performance in edge cases with as few as 50 rows per merchant. This makes the entry barrier to merchant classification very low.

Transaction enrichment to drive superior engagement

While retail banking is in the midst of transformation based on heightened consumer expectations around personalization and user experience, banks and financial institutions can learn a significant amount from other industries that have moved from wholesale to retail in their consumer engagement strategies. In the media industry, companies like Netflix, Amazon and Google have set the table for both new entrants and legacy players around having a frictionless, personalized experience across all channels at all times. The industry has fully moved from “content is king” to experiences that are specialized based on user preference and granular segment information. Building a personalized experience where a consumer gets value builds trust and ensures that you remain a platform of choice in a market where consumers have endless amounts of vendors and choices.
 
Customer centricity model rewarding engagement with customer experience
 
Learning from the vanguards of the media industry, retail banking companies that focus on banking experience rather than transactional data would not only be able to attract the hearts and minds of a younger generation but would create a mobile banking experience people like and want to get back to. In this model centered on the individual customer, any new card transaction would generate additional data points that can be further exploited to benefit the end consumer, drive more personalization, more customer engagement, more transactions, etc. — all while reducing churn and dissatisfaction.
 
Example of a merchant classification user experience
Although the merchant classification technique discussed here does not address the full picture of personalized finance, we believe that the technical capabilities outlined in this blog are paramount to achieving that goal. A simple UI providing customers with contextual information (like the one in the picture above) rather than a simple “SQL dump” on a mobile device would be the catalyst towards that transformation.

In a future solution accelerator, we plan to take advantage of this capability to drive further personalization and actionable insights, such as customer segmentation, spending goals, and behavioral spending patterns (detecting life events), learning more from our end-consumers as they become more and more engaged and ensuring the value-added from these new insights benefit them.
 

In this accelerator, we demonstrated the need for retail banks to dramatically shift their approach to transaction data, from an OLTP pattern on a data warehouse to an OLAP approach on a data lake, and the need for a lakehouse architecture to apply ML at industry scale. We have also addressed the very important consideration of the entry barrier to implementing this solution with respect to training data volumes. With our approach, the entry barrier has never been lower (50 transactions per merchant).

Try the below notebooks on Databricks to accelerate your digital banking strategy today and contact us to learn more about how we assist customers with similar use cases.

--

Try Databricks for free. Get started today.

The post Improving Customer Experience With Transaction Enrichment appeared first on Databricks.

AWS Guide to Data + AI Summit Featuring Disney+, Comcast, Capital One and McDonald’s


This is a guest co-authored post. We thank Igor Alekseev, partner solution architect at AWS, for his contributions.

 
Data + AI Summit: Register now to join this free virtual event May 24-28 and learn from the global data community.

Amazon Web Services (AWS) is a Platinum Sponsor of Data + AI Summit 2021, one of the largest events in the industry. Join this event and learn from joint Databricks and AWS customers like Disney+, Capital One, Takeda and Comcast that have successfully leveraged the Databricks Lakehouse Platform for their business, bringing together data, AI and analytics on one common platform.
 
 
AWS Sessions at Data + AI 2021
At Data + AI Summit, Databricks and AWS are center stage in a number of keynote talks. Attendees will have the opportunity to hear a candid discussion between Databricks CEO Ali Ghodsi and AWS Senior Vice President Matt Garman. Core AWS enterprise customers will also take the keynote stage, including data leaders from Atlassian on Day 1 and McDonald’s on Day 2.

The sessions below are a guide for everyone interested in Databricks on AWS and span a range of topics — from building recommendation engines to fraud detection to tracking patient interactions. If you have questions about Databricks on AWS or service integrations, visit the AWS booth at Data + AI Summit. In the meantime, you can learn more about how Databricks operates on AWS here.

Creating a Lakehouse on AWS

Dream of getting the low cost of a data lake but the performance of a data warehouse? Welcome to the Lakehouse. In this session, learn how to build a Lakehouse on your AWS cloud platform using Amazon S3 and Delta Lake. You’ll also explore how companies have created an affordable and high performance Lakehouse to drive all their analytics efforts.

Disney +: Customer Experience at Disney+ Through Data Perspective

Introduced in November 2019, Disney+ has grown to over 100 million users, and the analytics platform behind that growth is Databricks on AWS. Discover how Disney+ rapidly scaled to provide a personalized and seamless experience to its customers. This experience is powered by a robust data platform that ingests, processes and surfaces billions of events per hour using Delta Lake, Databricks and AWS technologies.

Capital One: Credit Card Fraud Detection using ML in Databricks

Illegitimate credit card usage is a serious problem that can significantly impact all organizations – especially financial services – and results in a need to accurately detect fraudulent transactions vs non-fraudulent transactions. Despite regular fraud prevention measures, these are constantly being put to the test by malicious actors in an attempt to beat the system. In order to more dynamically detect fraudulent transactions, one can train ML models on a set of datasets, including credit card transaction information as well as card and demographic information of the owner of the account. Learn how Capital One is building this use case by leveraging Databricks.

Comcast: SQL Analytics Powering Telemetry Analysis at Comcast

See firsthand how Comcast RDK is providing the backbone of telemetry to the industry. The RDK team at Comcast analyzes petabytes of data, collected every 15 minutes from 70 million devices (video and broadband and IoT devices) installed in customer homes. SQL Analytics on the Databricks platform allows customers to operate a lakehouse architecture that provides data warehousing performance at data lake economics for up to 4x better price/performance for SQL workloads than traditional cloud data warehouses.

Discover the results of the “Test and Learn” initiative with SQL Analytics and the Delta Engine in partnership with the Databricks team. A quick demo will introduce the SQL native interface and the challenges with migration, the results of the execution and the journey of productionizing this at scale.

Northwestern Mutual: Northwestern Mutual Journey – Transform BI Space to Cloud

In this session, explore how Northwestern Mutual leverages data-driven decision making to improve both efficiency and effectiveness in its business. For a financial company, data security is as important as data ingestion. In addition to fast ingestion and compute, Northwestern Mutual needed a solution to support column-level encryption as well as role-based access to their data lake from many diverse teams. Learn how the data team moved hundreds of ELT jobs from an MSBI (Microsoft Business Intelligence) stack to Databricks and built a Lakehouse, resulting in massive time savings.

Asurion: Large Scale Lake House Implementation Using Structured Streaming

Business leads, executives, analysts and data scientists rely on up-to-date information to make business decisions, adjust to the market, meet the needs of their customers and run effective supply chain operations. In this session, learn how Asurion used Databricks on AWS, including Delta Lake, Structured Streaming, AutoLoader and SQL Analytics, to improve production data latency from day-minus-one to near real time. Asurion’s technical team will share battle tested tips and tricks you only get with a certain scale. The Asurion data lake executes 4,000+ streaming jobs and hosts over 4,000 tables in a production data lake on AWS.

Takeda: Empowering Real-time Patient Care Through Spark Streaming

Takeda’s Plasma Derived Therapies (PDT) business unit recently embarked on a project to use Spark Streaming on Databricks to empower how they deliver value to their plasma donation centers. As patients come in and interact with the clinics, Takeda stores and tracks all patient interactions in real time and delivers outputs and results based on those interactions. The entire process is integrated with AWS Glue as the metadata provider. Using Spark Streaming will enable Takeda to replace their existing ETL processes based on Lambdas, step functions and triggered jobs with a purely stream-driven architecture.

Western Governors University: 10 Things Learned Releasing Databricks Enterprise Wide

Western Governors University (WGU) embarked on rewriting all of their ETL pipelines in Scala/Python, as well as migrating their enterprise data warehouse into Delta Lake – all on the Databricks platform. Starting with 4 users and rapidly growing to over 120 users across 8 business units, their Databricks environment turned into an entire unified platform used by individuals of all skill levels, data requirements and internal security requirements.
This session will dive into user management from both an AWS and Databricks perspective, understanding and managing costs, creating custom pipelines for efficient code management and utilizing new Apache Spark snippets that drove massive savings.

The AWS Booth

Visit the AWS booth to see demos and take part in discussions about running Databricks on AWS. There will be three lightning talks in the AWS booth:

  • Quickstart 5/26 12:30 PM PDT
  • Managed Catalog Strategy 5/27, 2:00 PM PDT
  • PrivateLink, Public Preview 5/28 11:00 AM PDT

Come take part in these discussions to learn best practices on running Databricks on AWS.

Register now to join this free virtual event and join the data community. Learn how companies are successfully building their lakehouse architecture with Databricks on AWS to create a simple, open and collaborative data platform.

--

Try Databricks for free. Get started today.

The post AWS Guide to Data + AI Summit Featuring Disney+, Comcast, Capital One and McDonald’s appeared first on Databricks.

Azure Databricks Training and Key Sessions at Data + AI Summit 2021


Diamond sponsor Microsoft and Azure Databricks customers to present keynotes and breakout sessions at Data + AI Summit 2021. Register for free.

Data + AI Summit 2021 is the global data community event, where practitioners, leaders and visionaries come together to shape the future of data and AI. Data teams will participate from all over the world to level up their knowledge on highly-technical topics presented by leading experts from the industry, research and academia. We are excited to have Microsoft as a Diamond sponsor, bringing Microsoft and Azure Databricks customers together for a lineup of great keynotes and sessions.

Rohan Kumar, Corporate Vice President of Azure Data, returns for the fourth consecutive year as a keynote speaker alongside Azure Databricks customers, including Humana, T-Mobile, Anheuser-Busch InBev, Estée Lauder and EFSA. Below are some of the top sessions to add to your agenda:
 
 
KEYNOTE
Keynote with Rohan Kumar
Microsoft THURSDAY MORNING KEYNOTE, 8:30 AM – 10:30 AM (PDT)
Rohan Kumar, Corporate Vice President of Azure Data, will join Databricks CEO Ali Ghodsi for a fireside chat to highlight how Azure customers are leveraging open source and open standards using Azure Databricks and other Azure Data services to accelerate data and AI innovation.
 
KEYNOTE
Keynote with Sol Rashidi
Estée Lauder WEDNESDAY AFTERNOON KEYNOTE, 1:00 PM – 2:30 PM (PDT)
Sol Rashidi, Chief Analytics Officer at Estée Lauder, will be joining us to share insights on how practitioners in the Data + AI community should adopt a product-centric mindset. Prior to Estée Lauder, Sol held executive roles on data strategy at Merck, Sony, Royal Caribbean, EY and IBM.
 
DevOps for Databricks
Advancing Analytics WEDNESDAY, 12:05 PM – 12:35 PM (PDT)
Applying DevOps to Databricks can be a daunting task. This session will break down common DevOps topics, including CI/CD, Infrastructure as Code and Build Agents. Explore how to apply DevOps to Databricks (in Azure), primarily using Azure DevOps tooling.
 
CI/CD in MLOps – Implementing a Framework for Self-Service Everything
J.B. Hunt and Artis Consulting WEDNESDAY, 3:15 PM – 3:45 PM (PDT)
How can companies create predictable, repeatable, secure self-service workflows for their data science teams? Discover how J. B. Hunt, in collaboration with Artis Consulting, created an MLOps framework using automated conventions and well-defined environment segmentation. Attendees will learn how to achieve predictable testing, repeatable deployment and secure self-service Databricks resource management throughout the local/dev/test/prod promotion lifecycle.
 
Predicting Optimal Parallelism for Data Analytics
Microsoft WEDNESDAY, 3:50 PM – 4:20 PM (PDT)
A key benefit of serverless computing is that resources can be allocated on demand, but the number of resources to request and allocate for a job can profoundly impact its running time and cost. For a job that has not yet run, how can we provide users with an estimate of how the job’s performance changes with provisioned resources, so they can make an informed choice upfront about cost-performance tradeoffs?
 
Accelerate Analytics On Databricks
Microsoft and WANdisco WEDNESDAY, 4:25 PM – 4:55 PM (PDT)
Enterprises are investing in data modernization initiatives to reduce cost, improve performance, and enable faster time to insight and innovation. These initiatives are driving the need to move petabytes of data to the cloud without interruption to existing business operations. This session will share some “been there done that” stories of successful Hadoop migrations/replications using a LiveData strategy in partnership with Microsoft Azure and Databricks.
 
ML/AI in the Cloud: Reinventing Data Science at Humana
Humana WEDNESDAY, 4:25 PM – 4:55 PM (PDT)
Humana strives to help the communities it serves achieve the best health – no small task in the past year! The data team at Humana had the opportunity to rethink existing operations and reimagine what a collaborative ML platform for hundreds of data scientists might look like. The primary goal of its ML platform is to automate and accelerate the delivery lifecycle of data science solutions at scale. In this presentation, walk through an end-to-end example of how to build a model at scale on FlorenceAI and deploy it to production. Tools highlighted include Azure Databricks, MLflow, AppInsights and Azure Data Factory.
 
Advanced Model Comparison and Automated Deployment Using ML
T-Mobile WEDNESDAY, 5:00 PM – 5:30 PM (PDT)
At T-Mobile, when a new account is opened, there are fraud checks that occur both pre- and post-activation. Fraud that is missed has a tendency of falling into first payment default, looking like a delinquent new account. In this session, walk through how the team at T-Mobile leveraged ML in an initiative to investigate newly-created accounts headed towards delinquency and find additional fraud.
 
Wizard Driven AI Anomaly Detection with Databricks in Azure
Kavi Global WEDNESDAY, 5:00 PM – 5:30 PM (PDT)
Fraud is prevalent in every industry and growing at an increasing rate, as the volume of transactions increases with automation. The National Healthcare Anti-Fraud Association estimates $350B of fraudulent spending. Forbes estimates $25B spending by US banks on anti-money laundering compliance. At the same time, as fraud and anomaly detection use cases are booming, the skills gap of expert data scientists available to perform fraud detection is widening. The Kavi Global team will present a cloud native, wizard-driven AI anomaly detection solution and two client success stories across the pharmaceutical and transportation industries.
 
Accelerating Data Ingestion with Databricks Autoloader
Advancing Analytics THURSDAY, 11:35 AM – 12:05 PM (PDT)
Tracking which incoming files have been processed has always required thought and design when implementing an ETL framework. The Autoloader feature of Databricks looks to simplify this process, removing the pain of file watching and queue management. However, there can also be a lot of nuance and complexity in setting up Autoloader and managing the process of ingesting data using it. After implementing an automated data loading process in a major US CPMG, Simon Whiteley has some lessons to share from the experience.
 
Building A Product Assortment Recommendation Engine
Anheuser-Busch InBev THURSDAY, 11:35 AM – 12:05 PM (PDT)
Amid the increasingly competitive brewing industry, the ability of retailers and brewers to provide optimal product assortments for their consumers has become a key goal for business stakeholders. Consumer trends, regional heterogeneities and massive product portfolios combine to scale the complexity of assortment selection. At AB InBev, the data team approaches this selection problem through a two-step method rooted in statistical learning techniques.

With the ultimate goal of scaling this approach to over 100k brick-and-mortar retailers and online platforms, the team implemented its algorithms in custom-built Python libraries using Apache Spark. Learn more in this expert-led session.
 
Video Analytics At Scale: DL, CV, ML On Databricks Platform
Blueprint Technologies THURSDAY, 3:15 PM – 3:45 PM (PDT)
Don’t miss this live demo and reflection on lessons learned from building and publishing an advanced video analytics solution in the Azure Marketplace. This deep technical dive covers the engineering and data science employed throughout: the challenges of combining deep learning and computer vision for object detection and tracking, the operational management and tool-building efforts required to scale video processing and insights extraction to large GPU/CPU Databricks clusters, and the machine learning required to detect behavioral patterns, anomalies and scene similarities across processed video tracks.

The entire solution was built using open source Scala, Python, Spark 3.0, MXNet, PyTorch and scikit-learn, as well as Databricks Connect.
 
Raven: End-to-end Optimization of ML Prediction Queries
Microsoft FRIDAY, 10:30 AM – 11:00 AM (PDT)
ML models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, team members from Microsoft identified significant and unexplored opportunities for optimization. They will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
EFSA FRIDAY, 10:30 AM – 11:00 AM (PDT)
EFSA is the European agency providing independent scientific advice on existing and emerging risks across the entire food chain. Earlier this year, a new EU regulation (EU 2019/1381) was enacted, requiring EFSA to significantly increase the transparency of its risk assessment processes towards all citizens. To comply with this new regulation, delaware BeLux is helping EFSA in its digital transformation. The team at delaware has been designing and rolling out a modern data platform running on Azure and powered by Databricks that acts as a central control tower brokering data between a variety of applications. It is built around modularity principles, making it adaptable and versatile while keeping the overall ecosystem aligned with respect to changing processes and data models. Watch this session to learn how they did it.
 
Building a Data Science as a Service platform in Azure
Advancing Analytics FRIDAY, 11:05 AM – 11:35 AM (PDT)
ML in the enterprise is rarely delivered by a single team. In order to enable ML across an organization, you need to target a variety of different skills, processes, technologies and maturities. Doing this is incredibly hard and requires a composite of different techniques to deliver a single platform that empowers all users to build and deploy ML models. This session is delivered in collaboration with Ageas Insurance UK and Advancing Analytics. In this session, explore how Databricks enabled a Data Science-as-a-Service platform for Ageas Insurance UK that empowers users of all skill levels to build and deploy models and realize ROI earlier.
 
Build Real-Time Applications with Databricks Streaming
Insight Digital Innovation FRIDAY, 11:40 AM – 12:10 PM (PDT)
In this presentation, study a use case the team at Insight Digital Innovation recently implemented for a large, metropolitan fire department. Insight has already created a complete analytics architecture for the department based upon Azure Data Factory, Databricks, Delta Lake, Azure SQL and SQL Server Analysis Services (SSAS). While this architecture works very well for the department, they would like to add a real-time channel to their reporting infrastructure. In this presentation, see how they leverage Databricks, Spark Structured Streaming, Delta Lake and the Azure platform to create this real-time delivery channel.

Sign up today!

Register today for Data + AI Summit 2021! Discover new best practices, learn new technologies, connect with your peers. If you have questions about Azure Databricks or Azure service integrations, meet us in the Microsoft Azure portal at Data + AI Summit.

For more information about Azure Databricks, visit databricks.com/azure

--

Try Databricks for free. Get started today.

The post Azure Databricks Training and Key Sessions at Data + AI Summit 2021 appeared first on Databricks.

Your Guide to All Things Financial Services at Data + AI Summit 2021


I joined Databricks earlier this year and continue to be amazed by the value the platform brings to customers and organizations. From bringing alternative and social data (think: Reddit) into real-time financial decision-making to fast-tracking vaccine distribution, Databricks is enabling data-driven innovation for organizations in an unprecedented way.

As I learn about each possibility, I grow more excited about our upcoming Data + AI Summit in May. Data + AI Summit is the global event for the data community, where 100,000 practitioners, leaders and visionaries come together to engage in thought-provoking dialogue and share the latest innovations in data and AI. At this year’s Data + AI Summit, we’re excited to announce a full agenda of sessions for data teams in the Financial Services industry. Leading innovators from across the industry – including Capital One, Intuit, Northwestern Mutual, JP Morgan and S&P Global – are joining us to share how they are using data to minimize risk, tap into value and innovate engaging and prosperous services for their customers.

Financial services keynotes and thought leadership panel

On our mainstage, we have Dr. Manuela M. Veloso from JP Morgan, who will be speaking about the intersection of machine learning and finance. Dr. Veloso is the Head of J.P. Morgan AI Research and a professor of computer science at Carnegie Mellon University (CMU), where she led the Machine Learning Department. She also co-founded the RoboCup international robotics competition. Her talk will be a stellar thought leadership session!
 
 

Our keynote speaker for Financial Services at Summit is Don Vu, who led Data & Analytics at Major League Baseball (MLB) for 13 years and is now revamping how data is leveraged at Northwestern Mutual through organizational transformation, a modern data stack and aligning business objectives to data and AI use cases. Join us on Thursday, May 27 at 9 am ET for our capstone Financial Services event at Data + AI Summit.

Following our keynote, you’ll have the opportunity to join a panel discussion with data analytics and AI leaders in Financial Services. Intuit, ABN AMRO and KX Systems will share their data transformation stories and how they are accelerating change in their teams.
 
 
DAIS 2021 panel discussion with data analytics and AI leaders in Financial Services

Financial Services Tech Talks

Summit also features a great series of tech talk tracks. Here are some of our most highly anticipated Financial Services sessions at this year’s summit:

Commercializing Alternative Data
Jay Bhankharia, S&P Global | Srinivasa Podugu, S&P Global
An end-to-end walkthrough of how S&P Global ingests, structures and links data to make it more usable, and then builds out Sandbox workspaces for clients using the unified analytics platform.

Credit Card Fraud Detection Using ML
Badrish Davay, Capital One
Hear from Capital One on how they are dynamically detecting fraudulent transactions with machine learning on Databricks.

SCOR’s NonLife Risk Modelling on Databricks
In this talk, SCOR shares how they scaled their Non-Life Risk Modelling Application (NORMA) to run countless scenarios for each and every piece of their P&C portfolio in less than half a working day.
Check out the full list of Financial Services talks at Summit.
Also, join us for live demos of the hottest data analysis use cases in the Financial Services industry, my favorite being:

Understanding YOLO, STONKS, and DIAMOND HANDS
Alternative data always offers a competitive advantage to those brave enough to mine it. Investment strategies are widely discussed on public forums and could be used by institutional investors as a leading indicator of near-term events.

Don’t miss the financial services experiences at Summit!

Register for the Data + AI Summit to take advantage of all the amazing Financial Services sessions, demos and talks scheduled to take place. Registration fees have been waived this year!

--

Try Databricks for free. Get started today.

The post Your Guide to All Things Financial Services at Data + AI Summit 2021 appeared first on Databricks.


How to Secure Industrial IoT (And Why You Should Assume You Can’t Prevent a Data Breach)


The Industrial Internet of Things (IIoT) is already driving massive productivity gains in the Manufacturing and Energy & Utilities industries through decreased waste, automated quality control, predictive maintenance (increasing overall equipment effectiveness) and optimized energy consumption…just to name a few.

In this blog, the security team at Databricks takes a look at the challenges industrial businesses face and how AI-driven solutions can help mitigate these risks and disruptions.

At the same time, internet-enabled equipment and IoT devices present a cybersecurity vulnerability, particularly to ransomware, that can cause a business to go offline for days (or even weeks). The cyberattack on Colonial Pipeline, which shut down the largest system of gas pipelines to the East Coast and is still causing major delays and other repercussions, is just the latest example. Even before this event, ransomware accounted for the largest share of cybersecurity attacks against enterprises across all industries.

Unlike traditional malware hacks, in which perpetrators try to go unnoticed in order to siphon valuable information, such as financial accounts or trade secrets, ransomware organizations want their work to be noticed — they seek to disrupt core operations in the most visible way possible, forcing victims to pay or else be unable to conduct business. This has made IIoT a particularly attractive target for ransomware organizations.

Whereas in the analog age “equipment downtime” was primarily the result of maintenance and mechanical failures, IIoT creates new security challenges such as software bugs or malicious attacks shutting down the assembly line. These disruptions have large financial repercussions, as these enterprises continue incurring fixed labor and plant costs while losing revenues and possibly missing contractual SLAs, affecting key customer and vendor relationships.

While prevention is still the main line of cybersecurity defense, security solution architects recognize that it is near impossible to build a complex system that is both 100% secure from outside threats and provides the flexibility to take advantage of the latest technology in the sector. As a result, the strategy for connected devices is shifting from prevention to harm reduction as security professionals work to build redundant and resilient systems to minimize disruptions to overall production.

Their main question has become: assuming a data breach, how do you minimize data exfiltration and, just as importantly, how do you get back to business as quickly as possible?

Data engineering and AI have become key tools for security teams creating resilient systems. For example, engineers will stand up a digital twin recreating their cloud-to-edge environment for security analysts to wargame different attack scenarios. This enables them to proactively identify security issues and vulnerabilities (e.g., a patch or update which hasn’t been installed in a specific system), as well as flag bottlenecks in complex processes as candidates for creating redundancies.

Assuming a hacker organization gets into the network, however, how do you prevent them from causing harm once they’re in? At Databricks, a modern, effective approach we’ve seen from our customers is automating key portions of the security analysis around the functions carried out in their network. Some of our customers, for example, run ETL on Databricks over specific kinds of commands issued to their infrastructure, automatically appending attributes like command provenance or previous alerts. Security analysts then have actionable information on the security risk and whether to flag the command for further inspection or allow it to execute. This data enrichment process has saved one of our customers close to an hour for each of these events, which occur dozens of times a day, so that they can continue operating their business efficiently, but securely.
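To make this enrichment pattern concrete, here is a minimal sketch of what such a pipeline could look like on Databricks; the Delta paths, column names and join key are illustrative assumptions rather than any customer’s actual schema.

# Minimal sketch (illustrative schemas and paths): enrich incoming infrastructure
# commands with provenance and prior-alert context so analysts can triage faster.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

commands = spark.read.format("delta").load("/mnt/security/commands")      # hypothetical path
alerts = spark.read.format("delta").load("/mnt/security/alert_history")   # hypothetical path

enriched = (
    commands
    .join(alerts, on="source_host", how="left")                           # hypothetical join key
    .withColumn("prior_alert_count", F.coalesce(F.col("prior_alert_count"), F.lit(0)))
    .withColumn("needs_review", F.col("prior_alert_count") > 0)
)

enriched.write.format("delta").mode("append").save("/mnt/security/enriched_commands")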

Additionally, the solution for future systems is to isolate and distribute the data and software running their factory floors. That is, organizations will conduct the bulk of data storage and processing in the cloud to take advantage of security best practices around partitions and redundancy. Let’s see what this looks like in action:

Complex machine learning (ML) model development (say, for example, a computer vision model that identifies poor output coming off the production line) happens in the cloud; the model is then deployed as a pickle file at the edge, where processing latency matters most. This way, if individual pieces of smart equipment, or even the entire factory floor, are blocked by a ransomware attack, it becomes easier to reboot the entire system and redeploy the machine learning models to the edge devices.
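The cloud-to-edge pattern described above can be sketched in a few lines; the model choice, file names and class count below are hypothetical stand-ins, not a prescribed implementation.

# Minimal sketch: train a defect-detection model centrally, serialize it, and
# reload it on an edge device. Assumes the edge device has the same Python
# libraries installed; names and paths are illustrative.
import pickle

import torch
import torchvision

# Train (or fine-tune) in the cloud, where compute and security controls live.
model = torchvision.models.resnet18(num_classes=2)  # "good" vs. "defective" output
# ... training loop omitted ...
model.eval()

# Serialize the model artifact; in practice this would be versioned in cloud
# object storage rather than a local file.
with open("defect_detector_v1.pkl", "wb") as f:
    pickle.dump(model, f)

# On the edge device (or after a post-incident rebuild): pull the latest
# known-good artifact and reload it.
with open("defect_detector_v1.pkl", "rb") as f:
    edge_model = pickle.load(f)

with torch.no_grad():
    frame = torch.rand(1, 3, 224, 224)          # stand-in for a camera frame
    is_defective = edge_model(frame).argmax(1)  # 0 = good, 1 = defective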

As 5G and IoT continue to revolutionize the factory floor, we will continue to see new attack vectors for malicious actors seeking to disrupt production. But by designing their systems with the assumption of failure, manufacturers, energy companies and utilities (or any enterprise dependent on network-enabled equipment) can use data engineering and AI to limit disruption to their production.

You can see our listing of security-related sessions at Data + AI Summit here.

--

Try Databricks for free. Get started today.

The post How to Secure Industrial IoT (And Why You Should Assume You Can’t Prevent a Data Breach) appeared first on Databricks.

How Outreach Productionizes PyTorch-based Hugging Face Transformers for NLP


This is a guest blog from the data team at Outreach.io. We thank co-authors Andrew Brooks, staff data scientist (NLP), Yong-Gang Cao, machine learning engineer, and Yong Liu, principal data scientist, of Outreach.io for their contributions.

 
At Outreach, a leading sales engagement platform, our data science team is a driving force behind our innovative product portfolio largely driven by deep learning and AI. We recently announced enhancements to the Outreach Insights feature, which is powered by the proprietary Buyer Sentiment deep learning model developed by the Outreach Data Science team. This model allows sales teams to deepen their understanding of customer sentiment through the analysis of email reply content, moving from just counting the reply rate to classification of the replier’s intent.

We use four primary classifications for email reply content: positive, objection, unsubscribe and referral, as well as finer sub-classifications. For example, for replies classified as an objection, we can break down how many replies are due to budget constraints vs. procurement timing issues. This is a game changer for the sales team, as it provides actionable insights that sales managers can use to coach their sales representatives and improve their strategies and performance.

This blog describes the technical details of how we developed the Buyer Sentiment deep learning model, which is a multiclass classifier for sales engagement email messages. In particular, we will explain the offline model development/experimentation, productionization and deployment steps.

Overview of an ML model lifecycle: development and production

As discussed in many recent articles, the development of a machine learning (ML) model requires three major artifacts: data, model and code. To successfully develop and ship a ML model in production, we need to embrace the full lifecycle development for ML projects. Figure 1 is a schematic view of Outreach’s full lifecycle development and production path, starting from data annotation to offline model development/experimentation, model productionization (model-preproduction), model deployment (staging and production) and, finally, online model monitoring and feedback loops. Databricks is used in model dev/pre-prod and CI/CD pipelines as execution servers (e.g., using GPU clusters in Databricks for model training).
 
A schematic view of Outreach’s full lifecycle ML development and production path
Figure 1: Full Lifecycle View of Model Development and Production at Outreach

During the offline model development/experimentation stage (i.e., the Model Dev step labeled in Figure 1), we tried different types of ML models, such as SVM, FastText and PyTorch-based Hugging Face transformers. Based on our requirements (classification F1 scores initially for the English language, with multiple languages planned for the longer term), we settled on a PyTorch-based Hugging Face transformer (bert-uncased-small) for its high-performance classification results [1].

However, productizing a prototype is still one of the most painful experiences faced by ML practitioners. You can trade speed for discipline by enforcing production-grade standards from the very beginning. However, this is often premature optimization, as ML models require many iterations and nonlinear processes, and many fail or substantially pivot before they ever ship. You can also trade engineering discipline for maximum flexibility from day one. However, this makes the journey from prototype to production more painful once complexity reaches a tipping point where marginal costs exceed marginal gains from each new enhancement.

The trade-off between discipline and flexibility is somewhere in the middle. For us, that means we don’t directly ship our prototype code and experiments, but we enforce the minimal amount of structure needed to 1) register results from each prototype experiment, so we don’t need to repeat them, especially unsuccessful experiments; 2) link prototype experiment results to source code, so we know what logic produced them and ensure reproducibility; and 3) enable historical prototype models to be loaded for offline analysis.

Experiment, test, and deploy with MLflow Projects

Based on our full lifecycle analysis, we use MLflow Projects as the common thread between model development and deployment to achieve this trade-off. MLflow Projects is a reasonably lightweight layer that centralizes and standardizes entry points and environment definitions with a self-documenting framework.

Why we use MLflow Projects:

MLflow Projects adds virtually no weight to your project, especially if you’re already using MLflow Tracking and MLflow Models, for which there are built-in integrations.

  1. Smooth execution of code developed in IDE of choice.
    → Support for running Databricks notebooks is first-class, but it can be cumbersome to run scripts. MLflow Projects provides a smooth CLI for running .py and .sh files without unnecessary overhead like creating Apache Spark™ or Databricks jobs.
  2. Strong provenance tracing from source code to model results.
    → Ability to run a script from a GitHub commit without pulling down code or losing provenance on local uncommitted code.
  3. Flexibility to prototype locally and then scale to remote clusters.
    → The MLflow Projects API enables users to toggle from local to remote execution with the --backend argument, which points to a Databricks or Kubernetes JSON cluster config created for a single-use operation. Dependencies are handled in code (Conda) rather than state (manually configured cluster), ensuring reproducibility.

  4. Model development mirrors CI/CD pattern.
    → While we refactor experiment code before deploying, the CI/CD pipeline invokes the train, test and deploy pipeline following the same pattern from model development, so minimal “extra” effort is needed to go from prototype experiment to production. The ML model artifacts (binaries, results, etc.) and deployment status are centralized into one system, which eases debugging by smoothing provenance tracking back from production traffic and incidents.

How to use (an equivalent call via the Python API is sketched after this list):

  1. Run local code locally (no provenance): mlflow run ./ train
  2. Run remote code locally (provenance, but bound by local compute): mlflow run https://github.com/your-GH/your-repo train --version 56a5aas
  3. Run remote code on a cluster (provenance + compute at scale): mlflow run https://github.com/your-GH/your-repo train --config gpu_cluster_type.json --version 56a5aas
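The same entry points can also be launched programmatically, for example from an orchestration script. Below is a minimal sketch using the MLflow Projects Python API; the repo URL, entry point, commit and cluster config file simply mirror the illustrative CLI examples above and are not Outreach’s actual values.

# Minimal sketch: launching an MLflow Project entry point from Python.
import mlflow

submitted_run = mlflow.projects.run(
    uri="https://github.com/your-GH/your-repo",
    entry_point="train",
    version="56a5aas",                       # pin to a specific commit for provenance
    backend="databricks",                    # or "local" while prototyping
    backend_config="gpu_cluster_type.json",  # single-use Databricks cluster spec
    synchronous=True,
)
print(submitted_run.run_id)  # MLflow run ID for locating results and artifacts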

Three progressively wrapped model artifacts

One of our key considerations when developing a productionizable model is not just the model type (a fine-tuned PyTorch-based Hugging Face transformer model), but also the pre/post-processing steps and the internally developed Python libraries used by those steps. We took a rigorous approach and treat the entire model pipeline as a single serializable artifact in the MLflow artifact store, without external dependencies on accessing a GitHub repo at deployment time. We use the scikit-learn Pipeline API for the model pipeline implementation, which is the most widely-used Python library for ML pipeline building. This opens the door to integrating other pre/post-processing steps that are also scikit-learn Pipeline API compliant. Additional advantages of this pipeline approach include preventing data leakage and maintaining reproducibility.

Taking this approach resulted in three progressively wrapped model artifacts: a fine-tuned PyTorch transformer model that implements the scikit-learn BaseEstimator and ClassifierMixin APIs, a scikit-learn Pipeline API-compatible model pipeline that includes additional pre/post-processing steps (which we call the pre-score filter and post-score filter) and a model pipeline that uses only locally bundled Python libraries without accessing any GitHub repos (Figure 2). Note that in the pre-score filter, we can add extra steps such as caching (for the same email message, we can serve the same prediction) and filtering out certain types of messages (e.g., bypassing out-of-office messages). Similarly, in the post-score filter step, we can return the prediction along with additional provenance-tracking information about model versions and detailed score probabilities for the endpoint consumer app to use.
 
Outreach.io’s approach to the model pipeline utilizes three progressively wrapped model artifacts
Figure 2: Three progressively wrapped models for deployment
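To illustrate the wrapping pattern in Figure 2, here is a minimal sketch; the class names, label handling and filter logic are simplified assumptions and not Outreach’s actual implementation.

# Minimal sketch of the "progressively wrapped" pipeline pattern. Names and
# logic are illustrative; the real pre/post-score filters are richer.
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.pipeline import Pipeline


class TransformerEmailClassifier(BaseEstimator, ClassifierMixin):
    """Wraps a fine-tuned transformer so it behaves like a scikit-learn classifier."""

    def __init__(self, model=None, tokenizer=None):
        self.model = model          # fine-tuned PyTorch transformer (loaded elsewhere)
        self.tokenizer = tokenizer

    def fit(self, X, y=None):
        # Fine-tuning happens in the training job; fit is a no-op so Pipeline.fit() works.
        return self

    def predict(self, X):
        # Tokenize, run the model and map logits to labels; placeholder output here.
        return ["positive" for _ in X]


class PreScoreFilter(BaseEstimator, TransformerMixin):
    """Pre-score step: e.g., bypass out-of-office replies before scoring."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text for text in X if "out of office" not in text.lower()]


pipeline = Pipeline(
    steps=[
        ("pre_score_filter", PreScoreFilter()),
        ("classifier", TransformerEmailClassifier()),
        # a post-score filter step could attach model version and score details here
    ]
)

replies = ["Thanks, let's talk next week!"]
print(pipeline.fit(replies).predict(replies))  # -> ['positive']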

Embrace the automation, CI/CD and monitoring

As with any other software system, the most boring, painful and error-prone part of a machine learning system is the repetitive maintenance work. Continuous Integration and Continuous Deployment/Delivery (CI/CD) is designed to bring automation and guard rails into the workflow – from building to deployments. We designed two flows (Figure 3): one serves as a quick sanity-check round for each pushed commit, which takes under 30 minutes; the other prepares, checks and deploys the entire model, which takes a few hours. (Note: You can watch this video for more details on utilizing MLflow and Databricks.)
 
Outreach.io designed two CI/CD flows. One serves as a quick sanity check round for each pushed commit; the other prepares, checks, and deploys the entire model.
Figure 3: CI/CD flows

Integration with tools

As a SaaS company, Outreach has a wide choice of SaaS and open source tools to leverage. While most of our internal services use Ruby or Go, the data science team opted to use Python, Databricks and MLflow for at-scale job runs. Thus, there was a need to create piping and integrations for all those tools almost from scratch. We dynamically generate conda and Databricks cluster config files for MLflow runs and put effort into synchronizing each step to construct the flows. We even wove CircleCI and Concourse together to let them trigger each other (the same CircleCI flow is reused in CD for the entire model build with different behaviors).

To do this, we exploited most of the capabilities of the APIs from our service providers – thanks to the excellent documentation from those providers, open-source code from Databricks and support from both internal and external teams. There were several pitfalls, including version issues over time. Still, the caveat here is that the tools we chose were not originally designed or tested to work together. It was up to us to overcome those initial drawbacks and provide feedback to the providers to help them work together.

Version controls

No matter which programming language you use, one big headache is dependency complexity. Any version change in the deep dependency graph can be a danger for the production system. In our CI/CD, we scan and freeze all versions and bundle all dependency binaries and models into the Docker image we use for deployments, so that nothing changes in the production environment. Once the image is deployed, it’s no longer affected by external dependencies. We version our data in Amazon S3 and our models at different stages via the Model Registry provided by MLflow.
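For reference, registering a model version and promoting it between stages looks roughly like the sketch below; the run ID and registered model name are placeholders, not the actual production values.

# Minimal sketch: register a logged pipeline artifact as a new model version in
# the MLflow Model Registry and promote it to a staging gate before production.
import mlflow
from mlflow.tracking import MlflowClient

result = mlflow.register_model(
    model_uri="runs:/<run_id>/model",        # artifact logged by the training run
    name="buyer-sentiment-classifier",       # placeholder registered-model name
)

client = MlflowClient()
client.transition_model_version_stage(
    name="buyer-sentiment-classifier",
    version=result.version,
    stage="Staging",                         # promote to "Production" once checks pass
)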

Guarded and staged model releases

As you can see from our CI/CD flows, we added several guard-rail steps along with staged environments for deployments. Besides regular flake8-based style checks, unit tests and liveness checks, we automated and verified the entire lifecycle, from training to the endpoint test after image deployment, for each commit (in less than half an hour). For the full model check, we created a pre-prod environment (in addition to the stage environment) with resources identical to prod for canary testing and staged releases (beta launches of new models). Beyond that, we also added a regression test step for a thorough examination against large datasets (around one hour of load and quality tests) to ensure all quality and throughput variances are captured before we proceed to beta or production release.

As a final defense, we also added human checkpoints against the regression test results or pre-production results to confirm the promotion of changes beyond the automated threshold checks. To help convey the real-world impact of changes, besides producing overall metrics and utilizing MLflow for side-by-side comparisons, we made a polished visualization of the confusion matrix (Figure 4) from the regression test, hosted as MLflow image artifacts, to aid comparison and judgment with details (true positive, false positive and false negative numbers and rates on each label and axis, with colors for emphasis), since overall metrics don’t fully communicate the dangers to individual categories or reveal the error types. The human check could be lifted once we accumulate enough experience from multiple upgrades/iterations, and those data points could be used for later automation.
 
To impart the real-world impact of changes, Outreach.io made a polished visualization of the confusion matrix, as overall metrics don’t fully communicate the dangers to individual categories or reveal the error types.
Figure 4: Polished Visualization of Confusion Matrix for Predictions
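As a sketch of how such a visualization can be attached to a run for side-by-side comparison, an image artifact can be logged alongside the overall metrics; the label set and predictions below are placeholders.

# Minimal sketch: log a confusion matrix as an MLflow image artifact so reviewers
# can compare regression-test results across candidate models.
import matplotlib.pyplot as plt
import mlflow
from sklearn.metrics import ConfusionMatrixDisplay

labels = ["positive", "objection", "unsubscribe", "referral"]
y_true = ["positive", "objection", "objection", "referral"]    # placeholder data
y_pred = ["positive", "objection", "unsubscribe", "referral"]  # placeholder data

fig, ax = plt.subplots(figsize=(6, 6))
ConfusionMatrixDisplay.from_predictions(y_true, y_pred, labels=labels, ax=ax, colorbar=False)

with mlflow.start_run():
    mlflow.log_figure(fig, "regression_test/confusion_matrix.png")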

Optimizing and monitoring of services

Having CI/CD that produces a healthy service is just a start; optimal running behavior and continuous health monitoring are must-haves. To optimize for obvious and repetitive cases, we added shortcuts and a cache layer on the API to speed up serving. We initially used SageMaker for hosting our endpoints, but we found that the metrics related to model performance and results were minimal, so we switched to Kubernetes with Datadog integration for more in-depth monitoring. This brought us many advantages, including closer alignment with other internal teams, security, control and cost savings. Below are our Datadog dashboards, which monitor all types of prediction trends over time, as well as latency percentiles. They also make it easy to compare online predictions between a new model and an old model on a single screen (e.g., when we split traffic 50/50, the two are supposed to be statistically identical if the models are the same). As you can see from the example dashboard (Figure 5), the built-in caching ability plays a positive role (the service latency can drop to nearly zero at times because of caching).
 
Datadog Dashboard Monitoring of the Model Endpoint Service
Figure 5: Datadog Dashboard Monitoring of the Model Endpoint Service
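The cache layer mentioned above can be as simple as keying predictions on a hash of the normalized message text; the sketch below is illustrative only, with normalization, eviction and thread-safety concerns omitted.

# Minimal sketch of an API-side prediction cache: identical email messages are
# served from memory instead of re-running the model.
import hashlib

_prediction_cache = {}

def predict_with_cache(message, model_pipeline):
    key = hashlib.sha256(message.strip().lower().encode("utf-8")).hexdigest()
    if key not in _prediction_cache:
        _prediction_cache[key] = model_pipeline.predict([message])[0]
    return _prediction_cache[key]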

What’s next

This blog has focused on the end-to-end release of the ML lifecycle as part of our product release, using Databricks, MLflow, CircleCI, Concourse and other tools, including Datadog and Kubernetes. However, the iterative training and offline experimental flow can also benefit from additional automation. For example, standardizing how new training data is added and initiating training runs based on dynamic triggers, like newly annotated data or user-provided feedback, could improve overall system efficiency and shorten time-to-market for the top-performing model. More complete quality monitoring built into each stage, with pre-set thresholds for gating releases, could further improve efficiency.

Our deploy gate is still manual because, while we have target thresholds on critical metrics for releasing models, we haven’t codified every constraint and edge case that might give us pause before releasing a new model. Sometimes small offline error analyses are performed to provide the human understanding and confidence needed before releasing.

Another aspect that we have not covered in much detail is the annotation and feedback flow. While annotation provides the labeled data required to train and evaluate the model before releasing, the front-end of the released model can capture feedback directly from the users of the applications. We’ve integrated this feedback mechanism into the user experience such that user-corrected predictions produce data assets that can be incorporated into the training process. These labels are particularly impactful to model improvements as they push the model to change its behavior rather than duplicate simple patterns it already has learned and predicts correctly.

Finally, given our current flexibility to deploy to either Amazon SageMaker or local Kubernetes clusters for hosting services, we are also open to extending to other types of model hosting services, such as TorchServe, in the future.

For a more detailed look, check out the Summit Session on the topic given by the blog’s authors.

 

References:
[1] Liu, Y., Dmitriev, P., Huang, Y., et al. Transfer learning meets sales engagement email classification: Evaluation, analysis, and strategies. Concurrency Computat Pract Exper. 2020;e5759. https://doi.org/10.1002/cpe.5759

--

Try Databricks for free. Get started today.

The post How Outreach Productionizes PyTorch-based Hugging Face Transformers for NLP appeared first on Databricks.

Machine Learning, Alternative Data, Delta Lake and More: My Picks for Data + AI Summit 2021


The Data + AI Summit has become an essential conference for analysts, data scientists, developers, data engineers and data teams across the globe. Once again, I’ve had the pleasure of collaborating with Jules Damji and Jen Aman to put together the agenda for the conference. Built for the data community, Data + AI Summit offers keynotes from leading technologists, hands-on training, 200+ technical deep dives and AMA sessions. Here are just a few of the sessions that I’m looking forward to attending:

Commercializing Alternative Data: Jay Bhankharia (Head of Marketplace Platforms) and Srinivasa Podugu (Head of Marketplace Technology Platforms) of S&P Global explain the end-to-end lifecycle to productize and commercialize alternative datasets at S&P Global Market Intelligence.

Massive Data Processing in Adobe using Delta Lake: Yeshwanth Vijayakumar (Sr. Engineering Manager/Architect at Adobe Experience Platform) describes how the data team built a cost effective and scalable data pipeline using Apache Spark and Delta Lake to manage petabytes of data.

Object Detection with Transformers: Liam Li, who recently completed a PhD in Machine Learning at Carnegie Mellon, dives into cutting-edge methods that use transformers to drastically simplify object detection pipelines in computer vision, while maintaining predictive performance.

Model Monitoring at Scale with Apache Spark and Verta: Manasi Vartak, Founder & CEO at Verta, explains why model monitoring is fundamentally different from application performance monitoring or data monitoring. Attendees will get a deeper understanding of what model monitoring must achieve for batch and real-time model serving use cases.

Real-world Strategies for Debugging Machine Learning Systems: Patrick Hall, Principal Scientist at bnh.ai, introduces model debugging, an emergent discipline focused on finding and fixing errors in the internal mechanisms and outputs of ML models.

FrugalML: Using ML APIs more accurately and cheaply: Lingjiao Chen, PhD Researcher at Stanford University, introduces a principled framework that jointly learns the strength and weakness of each API on different data, and performs an efficient optimization to automatically identify the best sequential strategy to adaptively use the available APIs within a budget constraint.

The Rise of Vector data: Edo Liberty, Founder & CEO of Pinecone, discusses the need for infrastructure for managing high-dimensional vectors. Edo walks through the algorithmic and engineering challenges in working with vector data at scale, and explores open problems we still have no adequate solutions for.

Observability for Data Pipelines with OpenLineage: Julien Le Dem, Co-Founder & CEO of Datakin, discusses Marquez, an open source project that instruments data pipelines to collect lineage and metadata and enable those use cases. Marquez implements the OpenLineage API and provides context by making visible dependencies across organizations and technologies as they change over time.

Becoming a Data-driven Organization with Modern Lakehouse: A year after we formally introduced the lakehouse, we are seeing more companies adopt this exciting data management paradigm. Vini Jaiswal, Customer Success Engineer, explains how you can leverage the Lakehouse platform to make data a part of each business function.

Building an ML Platform with Ray and MLflow: Amog Kamsetty (Software Engineer) and Archit Kulkarni (Software Engineer) of Anyscale describe how two open source projects, Ray and MLflow, work together to make it easy for ML platform developers to add scaling and experiment management to their platform.

Scaling Online ML Predictions At DoorDash: Hien Luu, Sr. Engineering Manager at DoorDash, describes his journey of building and scaling a machine learning platform and, particularly, the prediction service: the various optimizations they experimented with, lessons learned, technical decisions and tradeoffs.

These technical sessions are just a glimpse at what will be covered at Data + AI Summit 2021. Throughout the week, industry leaders will dive into all things AI, MLOps, open source, data use cases and so much more. I’m also incredibly excited about the keynotes we have lined up this year, including:

Bill Inmon
Malala Yousafzai
Michael Lewis and Charity Dean
Manuela Veloso
Shafi Goldwasser
DJ Patil

Register for the Data + AI Summit 2021 for free.

--

Try Databricks for free. Get started today.

The post Machine Learning, Alternative Data, Delta Lake and More: My Picks for Data + AI Summit 2021 appeared first on Databricks.

Guide to Healthcare & Life Sciences Sessions at Data + AI Summit 2021


Download our guide to Healthcare and Life Sciences at Data + AI Summit to help plan your Summit experience.

 

Every year, data leaders, practitioners and visionaries from across the globe and industries join Data + AI Summit to discuss the latest trends in big data. For data teams in the Healthcare and Life Sciences industry, we’re excited to announce a full agenda of Healthcare and Life Science sessions. Leaders from Humana, Providence St. Joseph, Takeda, CMS and other industry organizations will share how they are using data to improve health equity, power real-time patient insights, accelerate drug discovery and innovate with real-world data.

Healthcare and Life Sciences Industry Forum

Join us on Thursday, May 27 at 11am PT for our Healthcare and Life Sciences Forum at Data + AI Summit. During our capstone event, you’ll have the opportunity to join keynotes and panel discussions with data analytics and AI leaders on the most pressing topics in the industry. Here’s a rundown of what attendees will get to explore:

Data + AI Summit 2021 Healthcare Keynote with CEO Carolyn Magill
In this keynote, Carolyn Magill, CEO of Aetion, will share her perspectives on how real-world evidence is changing the way the world thinks about and delivers healthcare.

Panel Discussion
Join our esteemed panel of data and AI leaders from some of the biggest names in healthcare, insurance and pharma as they discuss how data is being used to improve patient engagement, access to care and discovery of new treatments.

Healthcare and Life Sciences Tech Talks

Here’s an overview of some of our most highly-anticipated Healthcare and Life Sciences sessions at this year’s summit:

RWE and Patient Analytics
In this talk, analytics leaders from Sanofi will discuss real-world data (RWD) – specifically how it is generated, what value it drives for life sciences and what kind of analytics are performed – and how they’re unlocking insights buried within RWD with the Databricks Lakehouse platform.
Learn more

Entity Resolution using Patient Records
Learn how the Centers for Medicare & Medicaid Services (CMS), a large Federal agency and the nation’s biggest healthcare payer, uses natural language processing to clean-up their claims data and power advanced analytics use cases on Databricks.
Learn more

Empowering Real Time Patient Care through Spark Streaming
Takeda’s Plasma Derived Therapies (PDT) business is on a journey to provide real-time patient insights to their clinics around the nation. Come hear how they are building a reliable streaming analytics environment with Databricks, AWS and Delta Lake.
Learn more

FlorenceAI: Reinventing data science in the cloud
How does one of the largest health insurers use machine learning to improve member benefits? Join this session to learn how Humana is unleashing the power of AI with Azure Databricks and MLflow.
Learn more

From Vaccine Management to ICU Planning: Unlocking the Power of Data
Data has been a critical asset in the fight against COVID-19. The Chesapeake Regional Information System for our Patients (CRISP), a nonprofit healthcare information exchange (HIE) whose customers include states like Maryland and providers such as Johns Hopkins, knows this better than anyone. Learn how they built a health Lakehouse on Databricks to support a wide range of critical use cases during the pandemic.
Learn more

Check out the full list of Healthcare and Life Sciences talks at Summit.

Demos on Popular Data + AI Use Cases in Healthcare and Life Sciences

In addition to all these great sessions, make sure to catch these live demos on the hottest data analytics and AI use cases in the industry:

  • Building a health lakehouse to improve patient insights
  • Processing EHR records in real-time with Smolder
  • Solving the n+1 problem in Genomics with Glow and Delta Lake
  • Improving drug discovery with QSAR models at scale
  • Predicting opioid misuse with SQL Analytics
  • Analyzing healthcare claims data

Sign-up for the Healthcare and Life Sciences Experience at Summit!

--

Try Databricks for free. Get started today.

The post Guide to Healthcare & Life Sciences Sessions at Data + AI Summit 2021 appeared first on Databricks.

Guide to Public Sector Talks at Data + AI Summit 2021


Download our guide to Public Sector at Data + AI Summit to help plan your Summit experience.

 

The world is being transformed by data, and today’s federal government realizes that it has fallen far behind the private sector. As a result, the President’s Management Office (PMO) has recognized the need to modernize existing infrastructure, federate data for easier access and management, and rethink its approach to data and analytics by establishing mandates around modernization, data openness and the progression of AI innovations.

At this year’s Data + AI Summit, we’re excited to announce a full agenda of sessions for data teams in the Public Sector. Leading innovators from across the industry – including Veterans Affairs, the FBI, the DoD, CMS, Booz Allen Hamilton and Hennepin County – are joining us to share how they use data to deliver on their mission objectives and better serve citizens.

Government Industry Forum

Building a smarter and more innovative government starts by unlocking the power of data analytics and machine learning. Join us on Wednesday, May 26, 11:00 AM – 1:00 PM (PDT) for our capstone Public Sector event at Data + AI Summit. Attendees will have the opportunity to join keynotes and panel discussions with data analytics and AI leaders across federal and local governments. Here’s a sneak peek at what we’ll cover:

Industry keynotes
In the Public Sector, it’s well-known that driving a successful big data initiative is more complex than it should be. Learn how the data team at Booz Allen tackled this problem by developing an innovative approach to big data and codified it into a reference architecture. You’ll also hear from data leaders at DoD and FBI about how they successfully implemented a big data strategy with Booz Allen on the Databricks Lakehouse Platform with outstanding results.

Through this platform, both the DoD and the FBI are able to more easily access all their data to feed analytics and ML use cases. More specifically, at the DoD, they are using advanced analytics to improve the financial health and compliance of their entire organization. In addition, Databricks has empowered the DoD to transform data for the purpose of decision analytics to impact business, operational and mission performance. You don’t want to miss this keynote!

Data + AI Public Sector Keynote

Panel discussion
In addition to our star-studded keynote session, we are pleased to announce an industry expert panel featuring data leaders from Hennepin County, Department of Veterans Affairs (VA) and the Centers for Medicare & Medicaid Services (CMS). Join this discussion as they share insights into their data journey and how Databricks has been core to modernizing their data infrastructure and unlocking new innovations with analytics and AI.

Data + AI Summit 2021 Public Sector panel discussion

Public Sector Tech Talks

Here’s an overview of some of our most highly anticipated Public Sector sessions at this year’s summit:

Creating Reusable Geospatial Pipelines
Pacific Northwest National Lab
The Pacific Northwest National Lab is on a mission to expand the beneficial use of nuclear materials across the country. With massive volumes of geospatial data to process, they have developed data solutions on the Databricks platform to run traditional geospatial hotspot analysis. This talk will go over the pros and cons of various data and ML solutions and show an actionable workflow implementation that any geospatial analyst can leverage.

Improving Power Grid Reliability Using IoT Analytics
Neudesic and DTE Energy
Electrical grid failures have impact and consequences that can range from daily inconveniences to catastrophic events. Ensuring grid reliability means that data is fully-leveraged to understand and forecast demand, predict and mitigate unplanned interruptions to power supply and efficiently restore power when needed. In this session, Neudesic, a Systems Integrator, and DTE Energy, a large electric and natural gas utility serving 2.2 million customers in southeast Michigan, share how they use the Databricks Lakehouse Platform to ingest large IoT datasets and predict sources and causes of reliability issues across DTE’s power distribution network. Because of this and other efforts, DTE has improved reliability by 25% year over year.

Consolidating MLOps at One of Europe’s Biggest Airports
The Royal Schiphol Group
At the Schiphol Airport, the opportunities to leverage data and AI are boundless — from predicting passenger flow to computer vision models that analyze what is happening around the aircraft. Join this talk as the data team at Schiphol Airport discusses how they rely on the Databricks Lakehouse Platform and MLflow to quickly iterate on models and monitor them actively to see if they still fit the current state of affairs. As a result, they are now able to release multiple versions of a model per week in a controlled fashion.

From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data During a Pandemic
Chesapeake Regional Information System for our Patients (CRISP)
When the pandemic started, the Maryland Department of Health reached out to the Chesapeake Regional Information System for our Patients (CRISP), a nonprofit healthcare information exchange (HIE), with a request: get us the demographic data we need to track COVID-19 and proactively support our communities. As a result, CRISP employees spent long hours attempting to handle multiple data sources with complex data enrichment processes. To automate these requests, CRISP partnered with Slalom to build a data platform powered by Databricks and Delta Lake. This session focuses on how the power of the Databricks Lakehouse platform and the flexibility of Delta Lake has helped CRISP process billions of records from hundreds of data sources in an effort to combat the pandemic.

Entity Resolution Using Patient Records at CMMI
NewWave
The Center for Medicare & Medicaid Innovation (CMMI) builds innovation models that test healthcare delivery and payment systems, integrating and parsing huge datasets of varying provenance and quality. This instructional-style presentation will dive into the need for and deployment of a Databricks-enabled entity resolution capability at CMMI within the Centers for Medicare & Medicaid Services (CMS), the federal government agency that is also the nation’s largest healthcare payer. They’ll explore the specific entity resolution use cases, the ML necessary for this data and the unique uses of Databricks for the federal government and CMS in providing this capability.

Check out the full list of Public Sector talks at Summit.

Demos on Popular Data + AI Use Cases in Public Sector

Join us for live demos on the hottest data analytics and AI use cases in the public sector:

Predicting Opioid Misuse with Databricks and SQL Analytics
Every year, prescription opioid misuse results in unnecessary loss of life and places a massive financial burden on the healthcare system. Advanced analytics can be used to identify and flag anomalous opioid distribution patterns. Join this demo to learn how multiple data personas can collaborate using Databricks and SQL Analytics to ingest large volumes of pharma transaction data, identify statistical outliers and build dashboards to distinguish and classify suspicious cases of potential opioid misuse.

Detecting cyber criminals using ML, threat intel and DNS data
Learn how Databricks technologies can be used to augment and help scale Security Operations. In this no-jargon demo for security practitioners you will learn how to detect a remote access trojan – from data ingest to alerting – and the capabilities in the Databricks Lakehouse platform that can help security teams be more effective.

Healthcare Claims Reporting: Healthcare Claims Analytics for Health and Human Services
As more government entities move to deliver value-based healthcare outcomes, analyzing the cost and complexion of healthcare services has never been more timely. This demo takes a look at integrating disparate types of healthcare encounter claims and transforming them into a patient-centric model, which can then be analyzed along different dimensions.

Student Success: Understanding and Predicting Student Success
In today’s learning environment, more students than ever are learning through both in-person and digital means. This demo takes a look at the kinds of data often available to academic institutions and proposes a method for determining students who are at-risk of not matriculating, so that stakeholders can direct interventions and services to them to ensure the best educational outcomes.

Sign-up for the Public Sector Experience at Summit!

Make sure to register for the Data + AI Summit to take advantage of all the amazing Public Sector sessions, demos and talks scheduled to take place. Registration is free!

--

Try Databricks for free. Get started today.

The post Guide to Public Sector Talks at Data + AI Summit 2021 appeared first on Databricks.
