Why Cloud Centric Data Lake is the future of EDW

November 3, 2020, 9:56 am

≫ Next: Learn the Comcast Architecture for Enterprise Metadata and Security

≪ Previous: Quickly Deploy, Test, and Manage ML Models as REST Endpoints with MLflow Model Serving on Databricks

In this first of two blogs, we want to talk about WHY an organization might want to look at a lakehouse architecture (based on Delta Lake) for their data analytics pipelines instead of the standard patterns of lifting and shifting their Enterprise Data Warehouse (EDW) from on-prem or in the cloud. We will follow this case with a second detailed blog on HOW to make just such a transition shortly.

Enterprise Data Warehouse

Enterprise Data Warehouses have stood the test of time in organizations. They deliver massive business value. Businesses need to be data driven and want to glean insights from their data and EDWs are proven work horses to do just that.

However, as time has gone along, there have been several issues identified in the EDW architecture. Broadly speaking these issues can be attributed to the four characteristics of big data, commonly known as the “4 Vs” or volume, velocity, variety, and veracity that are problematic for legacy architectures. The following reasons further illustrate limitations of an EDW based architecture:

Legacy, non-scalable architecture: Over time, architects and engineers have gotten crafty with performant database technologies, and have turned to data warehousing as a complete data strategy. Pressured to get creative with existing tooling, the database has been used to solve more complex solutions that it was not originally intended for. This has caused anti-patterns to proliferate from on-prem to the cloud. Oftentimes, when these workloads are migrated, cloud costs skyrocket – from infrastructure, to resources required for the management and implementation to time it takes to derive value. This leaves everyone questioning this “cloud” strategy.
Variety of Data: The good old days of managing just structured data are over. Today, data management strategies have to account for semi structured text, JSON, and XML documents, as well as unstructured data like, audio, video and binary files. When deciding on technology for the cloud, you have to consider a platform that is able to handle data of all types, not just the structured data that feeds monthly reports.
Velocity of Data: Data Warehousing provided a paradigm shift where we would have the ETL processing happen overnight, where our business aggregates would compute, and business partners would have their fresh data first thing in the morning. The requirements and needs of businesses are rapid and increasing. In such a scenario to conform to daily loads becomes an execution risk. It is imperative to have an always updated data store for analytics, AI and decision making.
Data Science: At first a 2nd class citizen in the siloed data ecosystem of yesterday, organizations are now finding that they need to pave the way for data scientists to do what they do best. Data Scientists need access to as much data as possible. A big part of training a model is selecting the most predictive fields from the raw data, which aren’t always present in the Data Warehouse. A data scientist would not be able to identify which data to include in the warehouse without first analyzing it.
Proliferation of Data: Given the rate of change of business operations today, we require more changes to our data models, and changing the data warehouse can become costly. Alternatively, the use of data marts, extract tables, and desktop databases have fractured the data ecosystem in modern enterprises, causing fractured views of the business. Further still, this model requires capacity planning that looks out 6 months or longer. In the cloud, this design principle translates to significant cost.
Cost Per Column: In the traditional EDW world, the coordination and the planning required to yield a new column in the schema is substantial. This impacts two things – the cost and the lost time to value (i.e. decisions being made when the column is unavailable). An organization ought to look at the flexibility of a cloud data lake that reduces this cost (and time) significantly leading to desired outcomes faster.
ETL vs ELT: In an on-prem world you either pay to have ETL servers idle for most of the day; or you have to be careful in scheduling your ELT jobs against BI workloads in a Data Warehouse. In the cloud you have a choice – follow the same trend (i.e. perform ELT and BI in a Data warehouse) or switch to ETL. With ETL in the cloud you only pay for the infrastructure when your transformations run. And, you should pay a much lower price to execute those transformations. This segmentation of workloads also allows for efficient computation to support high throughput and streaming data. Overall, ETL in cloud can offer an organization tremendous cost and performance benefits.

Figure 1: Typical flow in the EDW world and its limitations

So, really, as a result of these challenges, the requirements became clearer for the EDW community:

Must ingest ALL data, i.e. structured, semi structured and unstructured
Must ingest ALL data at all velocities, i.e. monthly, weekly, daily, hourly, even every second (i.e. streaming) while evolving schema and preventing expensive, time consuming modifications
Must ingest ALL data and by this we mean the entire volume of data
Must ingest ALL of this data reliably – failed jobs up stream should not corrupt data downstream
It is NOT enough to simply have this data available for BI. The organization wants to leverage all of this data to have the edge against competition and be predictive and ask not just “what happened” but also “what will happen”
Must segment computation intelligently to achieve results with optimal costs instead of over provisioning for the “just in case” situations.
Do all of this while eliminating copies of data, replication issues, versioning problems and, possibly, governance issues

Simply put – the architecture must support all velocity, variety and volume of data, enable business intelligence and production grade data science at optimal cost.

Now, if we start talking about a Cloud Data Lake architecture the one major thing that this brings to the table for the organization is extremely cheap storage. When you think about Azure Blob or Azure Data Lake Gen2 as well as AWS S3, you can store TB scale data for a few dollars. This frees the organization from being beholden to analytics apparatus where the disk storage costs are many multiples of that. BUT, this only happens, if the organization takes advantage of separating compute from storage. By this we mean that the data must persist separately from your compute infrastructure. In nominal terms, on AWS, your data would reside on S3 (or ADLS Gen2/Blob on Azure) while your compute would spin up as and when required.

With that in mind, let us take a look at the architecture for a modern cloud data lake architecture

All of your data sources can land in one of these cheap object stores on your preferred cloud whether it is structured, unstructured or semi structured
Now, you build a curated data lake on top of this raw data just landing on the storage tier
And, on top of this curated data lake, you build out exploratory data science, production ML as well as SQL/BI

For this curated data lake, we want to focus on things that an organization has to think about in building this layer to avoid the pitfalls of the data lakes of yesteryear. In those there was a strong notion of “garbage in garbage out”. One of the key reasons for that property in the data lakes of the past was because of the reliability of data. Data could land with the wrong schema, could be corrupted etc. and it would just get ingested into the data lake. Only later, when that data is queried do the problems really arise. So reliability is a major requirement to think about.

Another one that matters, of course, is performance. We could play a lot of tricks to make the data reliable, but it is no good if a simple query takes forever to return.

Yet another one that matters is that, as an organization, you might start to think about data in levels of curation. You might have a raw tier, a refined tier and a BI tier. Generally, raw tier is your incoming data, the refined tier is imposing requirements of schema enforcement and reliability checks and the BI tier has clean data with aggregations ready to build out dashboards for executives. We also need to think about a process to move between these tiers in a simplistic way.

Also we want to keep compute and storage separate – and the reason we want to do this is because in the cloud compute costs can weigh heavily on the organization. You want to store it on the object store giving you a cheap persistent layer. Bring your compute to the data for only as long as you need it and then turn it off. As an example, bring up a very large cluster to perform ETL against your data for a few minutes and shut it off after the process is done. On the query side, you can keep ALL of your data going back decades on S3 and bring up a small cluster in the case you only need to query the last few years. This flexibility is of paramount importance. What this really implies is that the reliability and performance we are talking about have to be inherent properties of how the data is stored.

Figure 3: A Cloud Curated Data Lake architecture

So, say we have a data format for this curated Data Lake layer that gives us inherent reliability and performance properties coupled with the fact that the data stays completely under the organization’s control, you now need a query engine that allows you to access this format. We think the choice here, at least for now, is Apache Spark. Apache Spark is battle tested, supports ETL, streaming, SQL and ML workloads.

So, this data format, from a Databricks perspective, is Delta Lake. Delta Lake in an open source format that is maintained by the Linux Foundation. There are others you will hear about as well – Apache Hudi and Iceberg. They are trying to solve for the reliability property required on the data lake. The big difference, however, is that at this point, Delta Lake processes 2.5 exabytes per month. It is a battle tested data format for the cloud data lake amongst Fortune 500 companies and being leveraged across all verticals from financial services, to ad tech to automotive and public sector.

Delta Lake coupled with Spark gives you the capability to move easily between the data lake curation stages. In fact, you could incrementally ingest incoming data in raw tier and be assured to see it move through the transformation stages all the way through to the BI tier with ACID guarantees.

We at Databricks realize that this is the vision a lot of the organizations are looking to implement. So, when you look at Databricks as a Unified Data Analytics Platform, what you see is:

An open, unified data service – we are the original creators of several open source projects, including Apache Spark, MLflow, and Delta Lake. Their capabilities are deeply integrated into our product.
We cater to the data scientist via a collaborative workspace environment.
We enable and, in fact, accelerate the process of productionizing ML via an end-to-end ML workflow to train, deploy, and manage models.
And on the SQL/BI side, we provide a native SQL interface enabling the data analyst to directly query the data lake using a familiar interface. We have also optimized data lake connectivity with popular no-code BI tools like Tableau.

Figure 5: A Databricks centric Curated cloud Data Lake solution

What’s Next

We will follow this blog on WHY you should consider a Data Lake as you look to modernize in the cloud with a HOW blog. We will focus on specific aspects to think of and know about as you orient yourself from a traditional Data warehouse to a Data Lake.

Try Databricks for free. Get started today.

The post Why Cloud Centric Data Lake is the future of EDW appeared first on Databricks.

↧

Learn the Comcast Architecture for Enterprise Metadata and Security

November 5, 2020, 7:57 am

≫ Next: A Guide to MLflow Talks at Data + AI Summit Europe 2020

≪ Previous: Why Cloud Centric Data Lake is the future of EDW

Comcast will present a live session on their architecture for metadata and security at our upcoming Databricks AWS Cloud Data Lake DevDay. The event includes a hands-on lab with Databricks notebooks that integrate with Amazon Web Services (AWS) Services like AWS Glue and Amazon Redshift. Our partner Privacera will also show how their solution integrates with Databricks to help provide provide our customer Barbara Eckman, Senior Principal Software Architect, Comcast with a consistent security architecture across their AWS cloud and on-premises data lakes.

Building a Cloud Data Lake

Organizations want to leverage the wealth of data accumulated in their data lake for deep analytics insights. However, most organizations struggle with preparing data for analytics and automating data pipelines to leverage new data as data lakes are constantly updated. Making the shift to automated data pipelines can be challenging, but it’s become more urgent as the COVID-19 pandemic accelerates the move to a completely virtual workforce and collaborative problem solving.

Learn how to move from manual management of data pipelines to seamless automation in this collaborative workshop with experienced partners and customers to pave the way. Join us Wednesday, November 11th, at 9:00 AM PST to experience a deep dive into the technology that makes up a modern cloud-based big data and analytics platform. The session provides a valuable live chat opportunity with our system architects to answer all your questions, as well as a set of Notebooks to recreate the entire journey.

Speakers

Barbara Eckman, Senior Principal Software Architect, Comcast

Srikanth Venkat, VP, Product Management, Privacera

Denis Dubeau, AWS Partner Solution Architect Manager, Databricks

An overview of what you’ll learn:

Learn how to build highly scalable and reliable data pipelines for analytics
See how Comcast is using Privacera, Apache Atlas, and AWS Glue to provide an enterprise-wide metadata and security infrastructure
Learn how you can make your existing Amazon S3 data lake analytics-ready with open-source Delta Lake technology
Evaluate options to migrate current on premise data lakes (Hadoop, etc) to AWS with Databricks
Integrate that data with AWS services such as Amazon SageMaker, Amazon Redshift, AWS Glue, and Amazon Athena, as well as leveraging your AWS security and roles without moving your data out of your account
Understand open source technologies like Delta Lake and Apache SparkTM that are portable and powerful at any organization and for any data analytics use case
Get a set of Notebooks that guide you through the entire session
Network virtually and learn from your data professional peers

Get ready

Register: If you have not registered for the event, you can do so here.
Training: If you are new to Databricks and want to learn more, check out our free online training course here.
Learn more about Databricks on AWS at www.databricks.com/aws

Try Databricks for free. Get started today.

The post Learn the Comcast Architecture for Enterprise Metadata and Security appeared first on Databricks.

↧

A Guide to MLflow Talks at Data + AI Summit Europe 2020

November 5, 2020, 2:09 pm

≫ Next: Improving the Spark Exclusion Mechanism in Databricks

≪ Previous: Learn the Comcast Architecture for Enterprise Metadata and Security

In the last two years since its release, MLflow has seen a rapid adoption among enterprises and the data science community. With over 2M downloads, 260 contributors, and 100+ organizations contributing, the momentum seems to grow each year.

We have put together a short list of picks–keynotes, tutorials, and sessions–for MLflow: how the community and organizations manage their models at scale using MLflow and MLOps best practices.

Keynotes

Join Matei Zaharia on Thursday, November 19th for his keynote on Taking Machine Learning to Production with New Features in MLflow to learn more about some of the most recent and new MLflow features. Specifically, Matei will present some of the latest functionality added for productionizing machine learning in MLflow, the popular open source machine learning platform started by Databricks in 2018. These include built-in support for model management and review using the Model Registry, APIs for automatic Continuous Integration and Delivery (CI/CD), model schemas to catch differences in a model’s expected data format, and integration with model explainability tools.

Lin Qiao, engineering director, PyTorch, Facebook, will talk about recent developments in PyTorch and its extended integration with MLflow.

Talks

We have a fantastic lineup of speakers and sessions throughout the conference on MLflow. Join experts from H&M, Facebook, Yotpo, Seldon, Avast, Criteo, and Databricks and more for real-life examples, use cases, and deep dives on MLflow:

Apply MLOps at Scale: MLOps has taken centre stage as more and more machine learning models are put into production. This session will demonstrate how to scale and manage them. Keven Wang, Data + AI lead at H & M, will share a reference architecture for productionising ML models at scale.
Reproducible AI Using PyTorch and MLflow: Geeta Chauhan, Facebook’s AI Partnership Engineering lead, will show how to build reproducible AI models and workflows using PyTorch and MLflow that can be shared across your teams.
Productionising Real-time Serving With MLflow: Ron Barabash, lead software engineer at Yotpo, will explain what they did at Yotpo in order to make MLflow model serving production-grade!
MLflow at Company Scale: Jean-Denis Lesage, lead software engineer at Criteo, will discuss how to manage 50k runs, millions of metrics, parameters or tags, some bursts at 20k QPS.
MATS stack (MLflow, Airflow, Tensorflow, Spark) for Cross-system Orchestration of Machine Learning Pipelines: Avast completes over 17 million phishing detections a day, providing crucial online protection for this type of attacks. In this talk, Joao Da Silva and Yury Kasimov will present the MATS stack for productionisation of machine learning and their journey into integrating model tracking, storage, cross-system orchestration and model deployments for a complete and modern machine learning pipeline at Avast.
Model Experiments Tracking and Registration using MLflow on Databricks: Dash Desai of StreamSets, will cover how to standardize and automate tracking of model experiments and development with MLflow tracking.
Seamless MLOps with Seldon and MLflow: Seldon’s Adrian González Martín, a machine learning engineer, will discuss through an end-to-end hands-on example showing how you can go from research to production without much complexity by leveraging the Seldon Core and MLflow frameworks.
Building an ML Tool to predict Article Quality Scores using Delta & MLflow: Ivana Pejeva, data scientist at element 61, will cover how Roularta – a news & media publishing company – has built an AI-driven article quality scoring solution using Spark for parallelized compute, Delta for efficient data lake use, BERT for NLP and MLflow for model management.
Building a MLOps Platform Around MLflow to Enable Model Productionalization in Just a Few Minutes: Milan Berka, ML architect at DataSentics a.s., will present their MLOps platform, powered by MLflow.
Deterministic Machine Learning with MLflow and mlf-core: Lukas Heumos, Research Software Engineer, will present mlf-core and its demeternistic project templates based on MLflow for PyTorch, Tensorflow, and XGBoost.
Introducing MLflow for End-to-End Machine Learning on Databricks: Sean Owen, a principal solution architect at Databricks, will walk you through an end-to-end machine learning lifecycle example on development to deployment using MLflow using Apache Spark ML.

Next Steps

You can browse through our sessions from the Data + AI Summit schedule, too.

To get started with open source MLflow, follow the instructions at mlflow.org or check out the MLflow release code on Github. We are excited to hear your feedback!

If you’re an existing Databricks user, you can start using managed MLflow on Databricks by importing the Quick Start Notebook for Azure Databricks or AWS. If you’re not yet a Databricks user, visit databricks.com/mlflow to learn more and start a free trial of Databricks and managed MLflow.

Try Databricks for free. Get started today.

The post A Guide to MLflow Talks at Data + AI Summit Europe 2020 appeared first on Databricks.

↧

Improving the Spark Exclusion Mechanism in Databricks

November 6, 2020, 10:23 am

≫ Next: Healthcare and Life Sciences Agenda for Data + AI Summit Europe 2020

≪ Previous: A Guide to MLflow Talks at Data + AI Summit Europe 2020

Ed Note: This article contains references to the term blacklist, a term that the Spark community is actively working to remove from Spark. The feature name will be changed in the upcoming Spark 3.1 release to be more inclusive, and we look forward to this new release.

Why Exclusion?

The exclusion mechanism was introduced for task scheduling in Apache Spark 2.2.0 (as “blacklisting”). The motivation for having exclusion is to enhance fault tolerance in Spark, especially against the following problematic scenario:

In a cluster with hundreds or thousands of nodes, there is a decent probability that executor failures (eg. I/O on a bad disk) happen on one of the nodes during a long-running Spark application — and this can lead to task failure.
When a task failure happens, there is a high probability that the scheduler will reschedule the task to the same node and same executor because of locality considerations. Now, the task will fail again.
After failing spark.task.maxFailures number of times on the same task, the Spark job would be aborted.

Figure 1. A task fails multiple times on a bad node

The exclusion mechanism solves the problem by doing the following. When an executor/node fails a task for a specified number of times (determined by the config spark.blacklist.task.maxTaskAttemptsPerExecutor and spark.blacklist.task.maxTaskAttemptsPerNode), the executor/node would be blocked for the task, and would not receive the same task again. We also count the number of failures of executors/nodes on a stage/application level, and block them for the entire stage/application when the number exceeds the threshold (determined by the config spark.blacklist.application.maxFailedTasksPerExecutor and spark.blacklist.application.maxFailedExecutorsPerNode). Executors and nodes in an application-level exclusion will be released out of the exclusion after a timeout period (determined by spark.blacklist.timeout).

Figure 2. A task succeed after excluding the bad node

Deficiencies in the exclusion mechanism

The exclusion mechanism introduced in Spark 2.2.0 has the following shortcomings that forbid it from being enabled for the user by default.

Exclusion mechanisms never actively decommission nodes. The excluded nodes would just sit idle, and contribute nothing to task completion.
When there are transient and frequent task failures, many nodes will be added to the exclusion list, thus a cluster quickly gets into a scenario where no worker node can be used.
Exclusion mechanism can not be enabled only for shuffle-fetch failures.

New features introduced in DBR 7.3

In DBR 7.3, we improved the Spark exclusion mechanism by implementing the following features.

Enable Node Decommission for Exclusion

In a scenario where some nodes are having permanent failures, all the old exclusion mechanism can do is to put them into application level exclusion list, take them out for another try after a timeout period, and then put them back in. The nodes would be sitting idle, contributing to nothing, and they stay bad.

We address this problem by adding a configuration called spark.databricks.blacklist.decommissionNode.enabled. If spark.databricks.blacklist.decommissionNode.enabled is set to true, when a node is excluded on the application level, it will be decommissioned, and a new node would be launched to keep the cluster to its desired size.

Figure 3. Decommission the bad node, add create new healthy node

Exclusion Threshold

In DBR 7.3, we introduced the feature of thresholding to the exclusion mechanism. By tuning spark.blacklist.application.blacklistedNodeThreshold (default to INT_MAX), users can limit the maximum number of nodes excluded at the same time for a Spark application.

Figure 4. Decommission the bad node until the exclusion threshold is reached

Thresholding is very useful when the failures in a cluster are transient and frequent. In such a scenario, an exclusion mechanism without thresholding is at risk of sending all executors and nodes into application level exclusion, and leaving the user with no resources to use until new healthy nodes are launched.

As shown in Figure 4, we only decommission bad nodes until the exclusion threshold is reached. Thus, with the threshold properly configured, a cluster will not enter the situation that it can only utilize far less worker nodes than expected. Also, as the old bad nodes are replaced by the new healthy nodes, we can still replace the remaining bad nodes in the cluster gradually.

Independent enabling for FetchFailed errors

FetchFailed errors occur when a node fails to fetch a shuffle block from another node. In this case, it is possible that the node being fetched from is having a long-lasting failure, and many tasks would be affected, and fail due to fetch failures. Therefore, we see exclusion for FetchFailed errors as a special case in the exclusion mechanism due to its large impact.

In Spark 2.2.0, exclusion for FetchFailed errors could only be used when general exclusion is enabled (e.g. spark.blacklist.enabled is set to true). In DBR 7.3, we made enablement for FetchFailed errors independent.

Now, the user can set spark.blacklist.application.fetchFailure.enabled alone to enable exclusion for FetchFailed errors.

Conclusion

The improved exclusion mechanism is better integrated with the control plane, it gradually decommissions the excluded nodes. As a result, the bad nodes are recycled and replaced by new healthy nodes, thus reduces the task failures caused by bad nodes, and also saves the cost spent on bad nodes. Get started today and try out the improved exclusion mechanism in Databricks Runtime 7.3.

Try Databricks for free. Get started today.

The post Improving the Spark Exclusion Mechanism in Databricks appeared first on Databricks.

↧

Healthcare and Life Sciences Agenda for Data + AI Summit Europe 2020

November 9, 2020, 9:00 am

≫ Next: Retail and Consumer Goods Agenda for Data + AI Summit Europe 2020

≪ Previous: Improving the Spark Exclusion Mechanism in Databricks

Looking for the best Healthcare and Life Sciences events and sessions at Data + AI Summit Europe 2020 (Nov 17-19)? Below are some highlights. You can also find all Healthcare-related sessions, including customer case studies and extensive how-tos, within the event homepage by selecting “Healthcare & Life Sciences” from the “Industry” dropdown menu. You can still register for this free, virtual event here.

Learn more about the Healthcare and Life Sciences talks, training and events featured at the Data + AI 2020 Virtual Summit.

For Business Leaders

Panel Discussion: Improving Health Outcomes with Data + AI
To drive better outcomes and reduce the cost-of-care, healthcare and life sciences organizations need to deliver the right interventions to the right patients through the right vehicle at the right time. To achieve this, health organizations need to blend and analyze diverse sets of data across large populations, including electronic health records, healthcare claims, SoDH/demographics data, and precision medicine technologies like genomic sequences. Integrating these diverse data sources under a common and reproducible framework is a key challenge healthcare and life sciences companies face in their journey towards powering data driven outcomes. In this session, we explore the opportunities for optimization across the whole healthcare value chain through the unification of data and AI. Attendees will learn best practices for building data driven organizations and hear real-world stories for how advanced analytics is improving patient outcomes.

Panelists

Joe Roemer, Sr. Director, Global Commercial IT Insight and Analytics, AstraZeneca
Iyibo Jack, Director Engineering, Milliman MedInsight
Arek Kaczmarek, Exec Director, Data Engineering, Providence St. Joseph Health

Keynote: AI & Predictive Analytics in Healthcare with Dr. Kira Radinsky

Dr. Kira Radinsky is the chairperson and CTO of Diagnostic Robotics, where the most advanced technologies in the field of artificial intelligence are harnessed to make healthcare better, cheaper, and more widely available. In her keynote, she will discuss some of the latest data + AI advancements in Healthcare.

Patient-centric AI App to reduce public health costs (Swisscom Digital)

Healthcare costs are exploding year by year. Thanks to Artificial Intelligence it is possible to address patient needs in a cost-efficient manner. In this session, speakers from Swisscom Digital demonstrate how as part of a telemedicine service they implemented, they were able to reduce the triage cost of patients by leveraging AI. The app they developed not only reduced costs but also significantly improved the patient experience.

Transforming GE Healthcare with Data Platform Strategy (GE Healthcare)

Data and analytics is foundational to the success of GE Healthcare’s digital transformation and market competitiveness. This talk focuses on their journey to transforming the data analytics platform that underpins GE Healthcare. They will discuss their efforts over the last year to move from an on-prem legacy data platforming strategy to a cloud native and completely services oriented strategy. This was a huge effort for an 18Bn company and executed in the middle of the pandemic. This transformation has enabled GE Healthcare to make huge improvements in their enterprise data analytics strategy.

For Practitioners

Unifying Multi-omics Data Together on the Databricks Platform followed by AMA

Healthcare, life sciences, and agricultural companies are generating petabytes of data, whether through genome sequencing, electronic health records, imaging systems, or the Internet of Medical Things. The value of these datasets grows when we are able to blend them together, such as integrating genomics and EHR-derived phenotypes for target discovery, or blending IoMT data with medical images to predict patient disease severity. In this session, we will look at the challenges customers face when blending these data types together. We will then present an architecture that uses the Databricks Unified Data Analytics Platform to unify these data types into a single data lake, and discuss the use cases this architecture can empower. We will then dive into a workload that uses the whole genome regression method from Project Glow to accelerate the joint analysis of genotypes and phenotypes data.

Afterwards, Amir Kermany, Sr. Solution Architect for Healthcare and Life Sciences at Databricks, will be available to answer questions about this solution or any other use case questions you may have across healthcare, the life sciences, or agriculture

Using NLP to Explore Entity Relationships in COVID-19 Literature (Wisecube AI)

In this talk, speakers from Wisecube AI will cover how to extract entities from text using both rule-based and deep learning techniques. They will also cover how to use rule-based entity extraction to bootstrap a named entity recognition model. Additionally, they will also cover how to infer relationships between entities, and combine them with explicit relationships found in the source data sets. Although this talk is focused on the COVID-19 data set, the techniques covered are applicable to a wide variety of domains. This talk is for those who want to learn how to use NLP to explore relationships in text. What you will learn:

How to extract named entities without a model
How to bootstrap an NLP model from rule-based techniques
How to identify relationships between entities in text.

Looking forward to seeing you at the Data + AI Summit 2020.

Try Databricks for free. Get started today.

The post Healthcare and Life Sciences Agenda for Data + AI Summit Europe 2020 appeared first on Databricks.

↧

Retail and Consumer Goods Agenda for Data + AI Summit Europe 2020

November 9, 2020, 10:00 am

≫ Next: Financial Services Agenda for Data + AI Summit Europe 2020

≪ Previous: Healthcare and Life Sciences Agenda for Data + AI Summit Europe 2020

Looking for the best Retail & CPG events and sessions at Data + AI Summit Europe 2020 (Nov 17-19)? Below are some highlights. You can also find all Retail-related sessions, including customer case studies and extensive how-tos, within the event homepage by selecting “Retail & Consumer Goods” from the “Industry” dropdown menu. You can still register for this free, virtual event here.

For Business Leaders

Charting a path to growth in Retail and Consumer Goods with data + AI
The recent pandemic has accelerated the adoption of digital services and e-commerce by 10 years in 10 weeks. Companies have seen a surge in market share from new customers in digital channels. As we navigate through COVID and economic recovery, companies that are able to engage their customers in a relevant, meaningful way will realize stronger growth rates.
To power real-time hyper-personalized experiences, organizations need to be armed with a unified approach to data and analytics to rethink their ways of understanding and acting on the consumer. Personalized engagement, when done properly, can drive higher revenues, marketing efficiency and customer retention. Through the use of big data, machine learning and AI, companies can refocus their efforts on areas that will rapidly deliver value and drive growth into the future. Join us for a discussion on best practices and real-world uses for data analytics and machine learning in the Retail and Consumer Goods industry.
Presenter

Rob Saker, Retail & CPG GTM Lead, Databricks

Panel

Patrick Baginski, Global Head, Data & Analytics, McDonalds
Tom Mulder, Lead Data Scientist, Wehkamp
Josh Osmon, VP, Product Management, Everseen

(H&M) Apply MLOps at Scale

In this session you will learn about how H&M evolves reference architecture covering the entire MLOps stack addressing a few common challenges in AI and Machine Learning product, like development efficiency, end-to-end traceability, speed to production, etc.

This architecture has been adapted by multiple product teams managing 100s of models across the entire H&M value chain and enables data scientists to develop a model in a highly interactive environment, enabling engineers to manage large scale model training and model serving pipeline with full traceability.

The team presenting is currently responsible for ensuring that best practices and reference architecture are implemented on all product teams to accelerate H&M groups’s data driven business decision making journey.

(Henkel) End-to-End Supply Chain Control Tower

When you look at traditional ERP or management systems, they are usually used to manage the supply chain originating from either the point of origin or point of destination which are all primarily physical locations. And for these, you have several processes like order to cash, source to pay, physical distribution, production etc.

Our supply chain control tower is not tied up to a single location nor confined to a single part in the supply network hierarchy. Our control tower focuses on gathering and storing real-time data, and offers a single point of information related to all data points. We are able to aggregate data from different inventory, warehouse, production, planning, etc. to guide improvements and mitigate exceptions keeping in mind an efficient supply network operations in our end to end value chain.

This allows us to do cross-functional data-based applications, such as digital sales and operations planning. Which is a very powerful tool to align operations execution with our financial goals.

All this is possible by using a future-proof big data architecture and strong partnership with respective suppliers such as Microsoft and Databricks.

For Practitioners

Understanding the Who & Why of Customer Churn, Followed by AMA

Customer retention is essential to the long-term success of any business. Understanding who is about to leave is essential for preventing customer churn and understanding why they might leave is essential for correcting systemic issues which lead customers down this path. While the two questions are closely related, they require the use of very different analytic techniques. In this session, we will leverage real-world data from a subscription music service to explore these techniques.

Afterwards, Retail & CPG Technical Director Bryan Smith will be available to answer questions about this solution or any other retail analytics use case questions you may have.

Context-aware fast food recommendation with Ray on Apache Spark at Burger King

For fast food recommendation use cases, user behavior sequences and context features (such as time, weather, and location) are both important factors to be taken into consideration. At Burger King, we have developed a new state-of-the-art recommendation model called Transformer Cross Transformer (TxT). It applies Transformer encoders to capture both user behavior sequences and complicated context features and combines both transformers through the latent cross for joint context-aware fast food recommendations. Online A/B testings show not only the superiority of TxT compared to existing methods results, but also TxT can be successfully applied to other fast food recommendation use cases outside of Burger King.

In addition, we have built an end-to-end recommendation system leveraging Ray, Apache Spark and Apache MXNet, which integrates data processing (with Spark) and distributed training (with MXNet and Ray) into a unified data analytics and AI pipeline, running on the same cluster where our big data is stored and processed. Such a unified system has been proven to be efficient, scalable, and easy to maintain in the production environment.

In this session, we will elaborate on our model topology and the architecture of our end-to-end recommendation system in detail. We are also going to share our practical experience in successfully building such a recommendation system on big data platforms.

(Guosto) Building a Real-Time Supply Chain View: How Guosto Merges Incoming Streams of Inventory Data at Scale to Track Ingredients Throughout its Supply Chain

Gousto is the leading recipe box company in the UK. Every day we have to keep track of a huge amount of ingredients flowing through our warehouse until they are shipped to customers. In this talk, Gousto’s Data Engineers will describe the challenges faced and the solutions found to merge the incoming stream of inventory events into Delta Tables. Come to hear about the bumps along the way, and to discover the tweaks implemented to improve merge performance. Today Gousto has real-time insight into the flow of ingredients through its supply chain, enabling a smarter, more optimised measure of its inventory performance.

Looking forward to seeing you at the Data + AI Summit 2020.

Try Databricks for free. Get started today.

The post Retail and Consumer Goods Agenda for Data + AI Summit Europe 2020 appeared first on Databricks.

↧

Financial Services Agenda for Data + AI Summit Europe 2020

November 10, 2020, 9:00 am

≫ Next: Media and Entertainment Agenda for Data + AI Summit Europe 2020

≪ Previous: Retail and Consumer Goods Agenda for Data + AI Summit Europe 2020

Looking for the best Financial Services events and sessions at Data + AI Summit Europe 2020 (Nov 17-19)? Below are some highlights. You can also find all Financial-related sessions, including customer case studies and extensive how-tos, within the event homepage by selecting “Financial Services” from the “Industry” dropdown menu. You can still register for this free, virtual event here.

For Business Leaders

Panel Discussion: The Future of Financial Services with Data + AI
In today’s economy, financial services firms are forced to contend with heightened regulatory environments and a variety of market, economic and regulatory uncertainties. Coupled with increasing demand from customers for more personalized experiences and a focus on sustainability/ESG, incumbent banks, insurers and asset managers are reaching the limits of where their current technology can take them with their digital transformation initiatives. It’s more critical than ever for institutions to turn towards big data and AI to meet these demands, and make smarter, faster decisions that reduce risk and protect against fraud. Business and analytics leaders and teams from the Financial Services sector are invited to join this industry briefing to learn new ideas and strategies for driving growth and reducing risk with data analytics and AI.

Presenter

Junta Nakai, FSI GTM Lead, Databricks

Panelists

Jacques Oelofse, VP Data Engineering and ML, HSBC
Mark Avallone, VP, Architecture, S&P Global
Douglas Hamilton, Chief Data Scientist, Nasdaq

Stories from the Financial Service AI Trenches: Lessons learned from building AI models at EY (Ernst & Young)

EY helps clients establish their data- and AI-driven transformation strategies, operationalise their AI governance frameworks, as well as build and monitor AI solutions. In this presentation, speakers from EY discuss how they have approached the nuances of building AI solutions in financial services, and how a highly-regulated industry meets innovation with experiment-driven emerging technologies.

The adoption of AI as a critical component to the future of financial services has been widely recognised. AI does enable the creation of innovative financial products and personalised services. It also derives value from improving processes and services through intelligent automation. AI has made great strides, particularly in machine learning. However, these advanced methods require vast amounts of good quality data for models to learn from. This is a great challenge in the financial sector due to a multitude of factors.

The talk covers EY’s experience in building models where data is scarce or highly restricted, their learnings from deploying models in multiple geographies and jurisdictions, and how they monitor models where data can drift because of changes in customer behaviour, degrading data quality, or new legislation. Good quality data is a big problem in many sectors, but it becomes more prevalent in the financial sector due to incomplete data sources, biases and imbalances, among others. With pressure from regulators, privacy concerns and restrictions, this often leads to very small samples of usable data. EY tackles the above challenges with various approaches, such as synthetic data generation, data anonymization, missing data prediction, and transfer learning, among others. They will also discuss how they have embedded automated and human-in-the-loop guardrails to capture domain knowledge and ensure trust in the AI solutions they build for clients.

Struggles along the way for the holy grail of personalization: Customer 360 (Ceska Sporitelna)

Ceska Sporitelna is one of the largest banks in Central Europe and one of its main goals is to improve the customer experience by weaving together the digital and traditional banking approach. This talk will focus on the real world (both technical and enterprise) challenges for implementing the vision, from powerpoint slides into production:

Implementing Spark and Databricks-centric analytics platform in the Azure cloud combined with a on-prem data lake in the EU-regulated financial environment
Forming a new team focused on solving use cases on top of C360 in the 10,000+ employee enterprise
Demonstrating this effort on real use cases such as client risk scoring using both offline and online data
Spark and its MLlib as an enabler for employing hundreds of millions of client interactions through personalized omni-channel CRM campaigns.

For Practitioners

Data Driven ESG Solution Demo followed by AMA

The future of finance goes hand in hand with social responsibility, environmental stewardship and corporate ethics. In order to stay competitive, businesses are increasingly disclosing more information about their environmental, social and governance (ESG) performance. In this demo, we’ll demonstrate ways to use machine learning to extract the key ESG initiatives as communicated in yearly PDF reports and compare these with the actual media coverage from news analytics data

Afterwards, Antoine Amend, Financial Services Technical Director at Databricks, will be available to answer questions about this solution or any other financial services analytics use case questions you may have.

SHAP & Game Theory For Recommendation Systems (First Digital Bank)

In this talk, First Digital Bank will introduce a game-theoretic approach to the study of recommendation systems with strategic content providers. Such systems should be fair and stable. Showing that traditional approaches fail to satisfy these requirements, they will propose the Shapley mediator and show how the Shapley mediator fulfills the fairness and stability requirements, runs in linear time, and is the only economically efficient mechanism satisfying these properties

Looking forward to seeing you at the Data + AI Summit 2020.

Try Databricks for free. Get started today.

The post Financial Services Agenda for Data + AI Summit Europe 2020 appeared first on Databricks.

↧

Media and Entertainment Agenda for Data + AI Summit Europe 2020

November 10, 2020, 10:00 am

≫ Next: Leveraging ESG Data to Operationalize Sustainability

≪ Previous: Financial Services Agenda for Data + AI Summit Europe 2020

Looking for the best Media and Entertainment (M&E) events and sessions at Data + AI Summit Europe 2020 (Nov 17-19) ? Below are some highlights. You can also find all M&E-related sessions, including customer case studies and extensive how-tos, within the event homepage by selecting “Media and Entertainment” from the “Industry” dropdown menu. You can still register for this free, virtual event here.

For Business Leaders

Winning the Battle for Consumer Attention with Data + AI
Media, broadcasting and gaming companies are in a fierce battle for audience attention for their direct to consumer businesses and the advertising ecosystem is under more pressure than ever before to drive performance based outcomes. The need to personalise the consumer experience is paramount to keeping audiences engaged and driving effective ad targeting solutions. Predictive analytics and real-time data science use cases can help media and entertainment companies increase engagement, reduce churn and maximise customer lifetime value. Join us as we discuss best practices and real-world machine learning use cases in the publishing, streaming video and gaming space as industry leaders move aggressively to personalize, monetize and drive agility around the consumer and advertiser experience.
Presenter

Presenter

Steve Sobel, M&E GTM Lead, Databricks

Panel

Steve Layland, Director, Engineering, Tubi
Arthur Gola de Paula, Manager, Data Science, Wildlife Studios
Krish Kuruppath, SVP, Global Head of AI Platform, Publicis Media-COSMOS

(Kaizen Gaming) Personalization Journey: From single node to Cloud Streaming

In the online gaming industry we receive a vast amount of transactions that need to be handled in real time. Our customers get to choose from hundreds or even thousand options, and providing a seamless experience is crucial in our industry. Recommendation systems can be the answer in such cases but require handling loads of data and need to utilize large amounts of processing power. Towards this goal, in the last two years we have taken down the road of machine learning and AI in order to transform our customer’s daily experience and upgrade our internal services.

In this long journey we have used the Databricks on Azure Cloud to distribute our workloads and get the processing power flexibility that is needed along with the stack that empowered us to move forward. By using MLflow we are able to track experiments and model deployment, by using Spark Streaming and Kafka we moved from batch processing to Streaming and finally by using Delta Lake we were able to bring reliability in our Data Lake and assure data quality. In our talk we will share our transformation steps, the significant challenges we faced and insights gained from this process.
Click here to see all M&E-related customer stories.

(Wildlife Studios) Using Machine Learning at Scale: A Gaming Industry Experience!

Games earn more money than movies and music combined. That means a lot of data is generated as well. One of the development considerations for ML Pipeline is that it must be easy to use, maintain, and integrate. However, it doesn’t necessarily have to be developed from scratch. By using well-known libraries/frameworks and choice of efficient tools whenever possible, we can avoid “reinventing the wheel”, making it flexible and extensible.

Moreover, a fully automated ML pipeline must be reproducible at any point in time for any model which allows for faster development and easy ways to debug/test each step of the model. This session walks through how to develop a fully automated and scalable Machine Learning pipeline by the example from an innovative gaming company whose games are played by millions of people every day, meaning data growth within terabytes that can be used to produce great products and generate insights on improving the product.

Wildlife leverages data to drive product development lifecycle and deploys data science to drive core product decisions and features, which helps the company by keeping ahead of the market. We will also cover one of the use cases which is improving user acquisition through improved LTV models and the use of Apache Spark. Spark’s distributed computing enabled Data Scientists to run more models in parallel and they can innovate faster by onboarding more Machine Learning use cases. For example, using Spark allowed the company to have around 30 models for different kinds of tasks in production.

For Practitioners

Understanding advertising effectiveness with advanced sales forecasting & attribution, followed by AMA.

How do you connect the effectiveness of your ad spend towards driving sales? Introducing the Sales Forecasting and Advertising Attribution Solution Accelerator. Whether you’re an ad agency or in-house marketing analytics team, this solution accelerator allows you to easily incorporate campaign data from a variety of historical and current sources — whether streaming digital or batch TV, OOH, print, and direct mail — to see how these drive sales at a local level and forecast future performance. Normally attribution can be a fairly expensive process, particularly when running attribution against constantly updating datasets. This session will demonstrate how Databricks facilitates the multi-stage Delta Lake transformation, machine learning, and visualization of campaign data to provide actionable insights on a daily basis.

Afterwards, M&E specialist SA Layla Yang will be available to answer questions about this solution or any other media, ad tech, or marketing analytics questions you may have.

(MIQ Digital India Pvt Ltd.) Building Identity Graph at scale for Programmatic Media Buying using Spark and Delta Lake

The proliferation of digital channels has made it mandatory for marketers to understand an individual across multiple touchpoints. In order to develop market effectiveness, marketers need have a pretty good sense of its consumer’s identity so that it can reach him via mobile device, desktop or a big TV screen on living room. Examples of such identity tokens include cookies, app IDs etc.A consumer can use multiple devices at the same time and so the same consumer should not be treated as different people in the advertising space. The idea of identity resolution comes with this mission and goal to have an omnichannel view of a consumer.

Identity Spine is MIQ’s proprietary identity graph, using identity signals across our ecosystem to create a unified source of reference to be consumed by product, business analysis and solutions teams for insights and activation. We have been able to build a strong data pipeline using Spark and Delta Lake, thereby strengthening our connected media products offerings for cross channel insights and activation.

This talk mostly highlights :

The journey of building a scalable data pipeline that handles 10TB+ of data daily
How we were able to save our processing cost by 50%
Optimization strategies implemented to onboard new dataset to enrich the graph

(Roularta) Building an ML Tool to predict Article Quality Scores using Delta & MLFlow

For Roularta, a news & media publishing company, it is of a great importance to understand reader behavior and what content attracts, engages and converts readers. At Roularta, we have built an AI-driven article quality scoring solution on using Spark for parallelized compute, Delta for efficient data lake use, BERT for NLP and MLflow for model management. The article quality score solution is an NLP-based ML model which gives for every article published – a calculated and forecasted article quality score based on 3 dimensions (conversion, traffic and engagement).

The score helps editorial and data teams to make data-driven article-decisions such as launching another social post, posting an article behind the paywall and/or top-listing the article on the homepage.
The article quality score gives editorial a quantitative base for writing more impactful articles and running a better news desk. In this talk, we will cover how this article quality score tool works including:

The role of Delta to accelerate the data ingestion and feature engineering pipelines
The use of the NLP BERT language model (Dutch based) for extracting features from the articles text in a Spark environment
The use of MLflow for experiments tracking and model management
The use of MLflow to serve model as REST endpoint within Databricks in order to score newly published articles

Looking forward to seeing you at the Data + AI Summit 2020.

Try Databricks for free. Get started today.

The post Media and Entertainment Agenda for Data + AI Summit Europe 2020 appeared first on Databricks.

↧

Leveraging ESG Data to Operationalize Sustainability

November 11, 2020, 8:00 am

≫ Next: Analytics on the Data Lake With Tableau and the Lakehouse Architecture

≪ Previous: Media and Entertainment Agenda for Data + AI Summit Europe 2020

The benefits of Environmental, Social and Governance (ESG) are well understood across the financial services industry. In our previous blog post, we demonstrated how asset managers can leverage data and AI to better optimize their portfolios and identify organizations that not only look good from an ESG perspective, but also do good — companies that operate in an environmentally friendly, socially acceptable and sustainable manner. But the benefits of ESG go beyond sustainable investments. Recent experience has taught us that the key to thriving during the COVID-19 pandemic starts with establishing a high bar for social responsibility and sustainable governance. For example, large retailers that already use ESG to monitor their supply chain performance have been able to leverage this information to better navigate the challenges of global lockdowns, ensuring a constant flow of goods and products to communities. In fact, the benefits of operationalizing ESG have been widespread during COVID-19. In Q1 2020, 94% of ESG funds outperformed their benchmarks. High ESG companies have higher resilience because they take a closer look at how they treat their workers, their sourcing, and how vulnerable they are to external shocks. What we are seeing among our ESG-focused customers is consistent with the findings of this article from the Harvard Law School Forum on Corporate Governance, “Companies that invest in [ESG] benefit from competitive advantages, faster recovery from disruptions.”

“High-quality businesses that adhere to sound ESG practices will outperform those that do not.”

Tidjane Thiam
Chair of the audit committee of the Kering Group
Former CEO of Credit Suisse

In this blog post, we’ll demonstrate a novel approach to supply chain analytics by combining geospatial techniques and predictive data analytics for logistics companies not only to reduce their carbon footprint, improve working conditions and enhance regulatory compliance but also to use that information to adapt to emerging threats, in real time. Using a maritime transport company as an example, we’ll uncover the following topics as key enablers toward a sustainable transformation through data and AI:

Leverage geospatial analytics to group IoT sensor data into actionable signals
Optimize logistics using Markov chain models
Predict vessel destination and optimize fuel consumption

The veracity of ESG

The burst of the web 2.0 bubble in the early 2000s has led to a variety of information being collected (e.g., IoT, sensor data) and the adoption of Hadoop-based technologies in 2010 has allowed organizations to efficiently store and process massive amounts of data (e.g., volume, velocity). However, organizations were not able to fully unlock the true potential of data until recently. Today, cloud computing coupled with open source AI technologies have democratized data, allowing businesses to address the biggest hurdle in taking a data-driven approach to ESG: information veracity. Alternative data, sensor data, ratings, and ESG disclosures come at different scales, different qualities, different formats, are often incomplete or unreliable and dramatically change over time, requiring all scientific personas (i.e., data scientists and computer scientists) to work collaboratively and iteratively to convert raw information into actionable signals. By unifying data and analytics, Databricks not only allows the ingestion and processing of massive amounts of data at minimal costs, but also helps enterprises iterate faster through the use of AI, establish new strategies to adapt to changing conditions, inform better decision-making, and transform their operating models to be more data-driven, agile and resilient.

In this demo, we’ll use maritime traffic information from the Automatic Identification System (AIS), an automatic tracking system that captures the exact location of every operating vessel at regular time intervals. This publicly available data set can be accessed either via NOAA archives or live feeds.

As reported in the above workflow, Delta Lake will be used to provide both reliability and performance of AIS data (1), and Apache Spark^TM will be used to group billions of isolated geographical information (2) into well-defined routes (commonly known as “sessionizing” IoT data), leveraging geospatial libraries such as H3 Uber’s Hexagonal Hierarchical Spatial Index (3). Using Markov chains, we will demonstrate how logistics companies can better understand the efficiency of their own fleet (4), and also leverage contextual information from others in order to predict traffic, detect anomalies (5) and minimize associated risks and disruption to their businesses (6).

Acquiring AIS data

The Automated Identification System is a navigation standard all vessels must theoretically comply with. As a result, its structure is relatively simple and contains vessel attributes, including Maritime Mobile Service Identity (MMSI) and call sign (unique to a ship) alongside dynamic characteristics such as the exact location, speed over ground, heading and timestamp. We can read incoming CSV files and append that information as-is onto a Delta bronze table, as represented in the table below.

Given the volume of data at play (2 billion data points collected for the U.S. alone in 2018), we leverage H3, a hierarchical structure that encodes latitude/longitude points as a series of overlapping polygons as represented below. Such a powerful grid structure will help us group points at different resolutions spanning from a million of km2 down to a few cm2.

Using Uber’s third-party library, we wrap this encoding logic as a User-Defined Function (UDF):

import com.uber.h3core.H3Core
import org.apache.spark.sql.functions._

// given a point to encode (latitude and longitude) and a resolution
// we return the hexadecimal representation of the corresponding H3
val toH3 = udf((lat: Double, lon: Double, res: Int) => {
     val h3 = H3Core.newInstance()
     val h3Long = h3.geoToH3(lat, lon, res)
     f"${h3Long}%X"
})

In order to appreciate the complexity of the task at hand, we render all points (grouped by a 10 km large polygon) using KeplerGL visualization. Such a visualization demonstrates the aggressive nature of AIS data and the necessity to address that problem as a data science challenge instead of a simple engineering pipeline or ETL workflow.

Since AIS data only contains latitudes and longitudes, we also acquire the location of 142 commercial ports in the United States using a simple web scraper and BeautifulSoup python library (the process is reported in the associated notebooks).

Transforming raw information into actionable signals

In order to convert raw information into actionable signals, we first need to sessionize points into trips separated by points where a vessel is no longer under command (e.g., when a vessel is anchored). This apparently simple problem comes with a series of computer science challenges: First, a trip is theoretically unbound (in terms of distance or time), using a typical SQL window function would result in each location for a given vessel to be held in memory. Second, some vessels may exhibit half a million data points, and sorting such a large list would lead to major inefficiencies. Lastly, our data set is highly unbalanced. Some vessels account for more traffic than others, a strategy for a given vessel may be suboptimal for others. None of these challenges can be addressed using standard SQL and relational database techniques.

Secondary sorting

We overcome these challenges by leveraging a well-known big data pattern, secondary sorting. This apparent legacy pattern (famous in the MapReduce era) is still incredibly useful with massive data sets and a must-have in the modern data science toolbox. The idea is to leverage Spark Shuffle by creating a custom partitioner on a composite key. The first half of the key is used to define the partition number (i.e., the MMSI) while its second half is used to sort each key within a given partition (i.e., timestamp), resulting in a vessel’s data to be grouped together in a sorted collection.

Although the full code is detailed in the attached notebook, we report below the use of a Partitioner (to tell the Spark framework what executor will be processing which vessel) and a composite key in order to exploit, at its best, the elasticity offered by cloud computing (and, therefore, minimize its cost).

import org.apache.spark.Partitioner

class VesselKeyPartitioner(n: Int) extends Partitioner {
     override def numPartitions: Int = n
     override def getPartition(key: Any): Int = {
     val k = key.asInstanceOf[VesselKey]
     math.abs(k.mmsi % n)
     }
}

object VesselKey {
     implicit def orderingByVesselPosition[A <: VesselKey] : Ordering[A] = {
     Ordering.by(k => (k.mmsi, k.rank))
     }
}

Equipped with a composite key, a partitioner, and a business logic to split sorted data points into sequences (separated by a vessel status), we can now safely address this challenge using Spark repartitionAndSortWithinPartitions framework (another name for secondary sorting). On a relatively small cluster, the process took 3mn to split our entire data set into 30,000 sessions, storing each trip on a silver Delta table that can be further enriched.

Geospatial enrichment

With the complexity of our data dramatically reduced, the next challenge is to further refine this information by filtering out incomplete trips (missing data) using the information of U.S. ports and locations we scraped from the internet. Instead of a complex geospatial query such as finding a “point in polygon” or a brute force approach to find the minimum distance to any known U.S. ports, we leverage the semantic properties of H3 to define a catchment area around the exact location of each port (as per picture below).

Any vessel caught in these areas (i.e., matching a simple INNER JOIN condition) at either end of their journeys will be considered as originating from/at the destination to these specific ports. Through this approach, we successfully reduced a massive data set of 2 billion raw records down to 15,000 actionable trips that can now be used to improve the operational resilience of our shipment company.

Flags of convenience and safety concerns

Maritime transport is the backbone of international trade. The global economy has around 80% of global trade by volume, and over 70% of global trade by value are carried by sea. With over 50,000 merchant ships registered in 150 countries, “Regulatory frameworks such as Basel Convention, OECD and ILO guidelines are looking at better governing shipbreaking activities. However, many boats sail under flags of convenience, including Panama, Liberia, and the Marshall Islands, making it possible to escape the rules laid down by international organizations and governments.”

”Globalization has helped to fuel this rush to the bottom. In a competitive shipping market, FOCs lower fees and minimize regulation, as ship owners look for the cheapest way to run their vessels.”
International Transport Workers’ Federation

There are a variety of factors for sailing under flags of convenience, but the least disciplined ship owners tend to register vessels in countries that impose fewer regulations. Consequently, ships bearing a flag of convenience can be ESG red-flags: Often characterized by poor conditions, inadequately trained crews, and frequent collisions that cause serious environmental and safety concerns that can only be detected and quantified using a data-driven approach. With all of our data points properly classified and stored on Delta Lake, a simple SQL query on MMSI patterns (contains information about flags) can help us identify vessels suspected of operating under a flag of convenience.


SELECT 
     callSign, 
     vesselName, 
     COLLECT_SET(flag(mmsi)) as flags 
FROM esg.cargos_points
GROUP BY callSign, vesselName 
HAVING SIZE(flags) > 1

In the example below, we have been able to identify PEAK PEGASUS, a shipping carrier operating in the Gulf of Mexico in 2018, consecutively sailing under either a Liberia or Gabon flag. Without taking a big data approach to this problem, it would not only be difficult to uncover which ships are changing flags but to predict which kind of potential ESG issues this may cause and where these ships may be heading toward (see later in this blog).

By better addressing the veracity of IoT data, we have demonstrated how transforming raw information into actionable signals offers no place to hide for ship owners to operate out of the sight of regulators (this cargo was hidden among 2 billion data points). Whether or not this particular vessel is breaching any regulatory requirement is outside of the scope of this blog.

A data and AI compass to maximize business value

With our raw information converted into actionable signals, one can easily identify the most common routes across the United States at different times of the year. Using Circos visualization techniques as represented below, we can appreciate the global complexity of the U.S. maritime traffic for 2018 (left picture). When most of the trips originating from San Francisco are headed to Los Angeles, the latter acts as a hub for the whole West Coast (right picture), uncovering some interesting economic insights. Similarly, Savannah, Georgia, seems to be the hub for the East Coast.

If we assume that the number of trips between two ports is positively correlated with the economic activity between two cities, a higher probability of reaching a port is therefore a function of higher profitability for a shipper. This Circos map is the key to economic growth, and Markov chain is its compass.

A Markov chain is a mathematical system that experiences transitions from one state (e.g., a port) to another according to certain probabilistic rules (i.e., the number of observations). Widely employed in economics, communication theory, genetics and finance, this stochastic process can be used to simulate sampling from complex probability distributions, for instance studying queues or lines of customers arriving at an airport or forecasting market crashes and cycles between recession and expansion. Using this approach, we demonstrate how port authorities could better regulate inbound traffic and reduce long queues at anchorage, resulting in cost benefits for industry stakeholders and a major reduction in carbon emission. Long queues at anchorage are a major safety and environmental issue.

Minimizing disruption to a business

As reported in the Financial Times, carriers have dramatically changed their operations during the COVID-19 pandemic, by quickly parking ships, sending vessels on longer journeys, and canceling hundreds of routes to protect profits. As the global economy recovers, we can leverage Markov chains to predict where a cargo operator should redeploy their fleet in order to minimize disruption to their businesses. Starting from a given port, where would a given vessel statistically be after three, four, five consecutive trips? These insights not only optimize profits, but also help protect the well-being of seamen that work on the ships by creating a data-driven framework that takes into account how to help them return home after voyages.

We capture the probability distribution of each port reaching any other port as what is commonly referred to as a transition matrix. Given an initial state vector (a port of origin), we can easily “random walk” these probabilities in order to find the next N most probable routes, factoring for erratic behavior (this is known in the Markovian literature as a “teleport” variable that contributed to Google’s successful algorithm, Page rank).

# create an input vector
# probability is 0 except for our initial state (NYC)
state_vector = np.zeros(shape=(1, transition_matrix.shape[0]))
state_vector[0][new_york] = 1.0

# update state vector for each simulated trip
for i in np.arange(0, trips):
     state_vector = np.dot(state_vector, transition_matrix)

Starting from New York City, we represent the most probable location any ship would be after five consecutive trips (12% chance of being at Savannah, Georgia).

Since the number of trips between two cities should be correlated with high economic activity, this framework will not only tell us what the most probable next N routes are, but what are the next N routes with the highest profitability. When part of their fleet must be redeployed to different locations or re-routed because of external events (e.g., weather conditions), this probabilistic framework will help ship operators think multiple steps ahead, optimizing routes to be the most economically viable and, therefore, minimizing further disruption to their businesses.

In the example below, we represent a journal log (dynamically generated from our framework) that optimizes business value when departing from New York City. For each trip, we can easily report the average duration and distance based on historical records.

An improvement of this model (at the reader’s discretion) would be to factor for additional variables, such as weather data, vessel type, known contracts or seasonality or allow users to input additional constraints (e.g., maximum distance). In fact, an anomaly in our approach was detected where no historical data was found between Sault Ste. Marie and Duluth between January and March. This apparent oddity could certainly be explained by the fact the Great Lakes are mostly frozen in winter, so recommending such a route would not necessarily be appropriate.

Being socially responsible and economically pragmatic

Currently, roughly 250,000 ship workers are believed to be marooned, as authorities have prevented seafarers from disembarking on grounds of infection risk. In this real-life scenario, how could a cargo operator bring its crew home safely while minimizing further disruption to their business? Using graph theory, our framework can also be used where the destination is known. The same Circos map shown earlier (hence its associated Markov transition matrix) can be converted as a graph using the networkX library in order to further study its structure, learning its connections and their shortest paths.

import networkx as nx
G = nx.from_pandas_adjacency(markov)
trips = nx.shortest_path(G,source='Albany',target='FortPierce')

Sailing half-empty from Albany to Fort Pierce may be the fastest but not the most economically viable route. As reported below, our framework indicates that — at this time of the year — a stop at Baltimore and Miami could help shippers maximize their profits while bringing their crew safely home in a timely manner.

Predict and prescribe

In the previous section, we demonstrated the use of Markov chains to better understand economic activity as a steady flow of traffic between U.S. ports. In this section, we use that acquired knowledge to understand the transition from one geographic location to another. Can we machine learn a ship destination given its port of origin and its current location, at any point in time? Besides the evident strategic advantage for the financial services industry to understand and predict shipment of goods and better model supply and demand, cargo operators can better optimize their operations by estimating the traffic at the destination and avoid long queues at anchorage.

Geospatial Markov chains

Although our approach is an extension to our existing framework, our definition of a probabilistic state has changed from a port to an exact geographical location. A high granularity would create a sparse transition matrix while a lower granularity would prevent us from running actual predictions. We will leverage H3 information (as introduced earlier) by approximating locations within a 20 km radius. Another consideration to bear in mind is the “Memorylessness” nature of Markov chains (i.e., the current location does not carry information about previous steps). Since the originating port of each vessel is known, we will create multiple machine learning models, one for each U.S. port of origin.

At any point in time, our system will detect the next probable N states (the next N locations) a ship is heading to. Given an infinite number of “random walks,” our probability distribution will become stationary as all known ports will be reached. Eventually, we want to stop our random walk process when the probability distribution remains unchanged (within a defined threshold). For that purpose, we use Bhattacharyya coefficient as a distance measure between two probability distributions.

def predict(state):

# Given the observed location, we create a new state vector...
state_vector = np.zeros(shape=(1, transition_matrix.shape[0]))
state_vector[0][state] = 1.0

# ... that we update for each random walk recursively
for walk in np.arange(0, max_walks):
     
     new_state_vector = np.dot(state_vector, transition_matrix)
     distance = bhattacharyya(state_vector, new_state_vector)
     state_vector = new_state_vector

     # ... until the probabilities remain unchanged
     if(distance < epsilon):
     break
     
# We return the probability of reaching each port 
return state_vector

Predicting destinations

Given a trip originating from Miami, we extract the most probable destinations at every step of its journey. We represent our model output in the picture below. The most probable destination was Wilmington (30% of chances) until the ship started to head east, moving toward the New York / Philadelphia route (probabilities were similar). Around 80% of trip completion, it became obvious that our ship was heading toward New York City (as the probability of heading toward Philadelphia dramatically dropped to zero).

As represented in the figure below, we observe the probability of reaching New York City increases over time. We are obviously more confident about a prediction being made when the destination port is in sight (and less random walks are required as represented by the polygons’ heights).

After multiple tests, we can observe an evident drawback in our model. There are multiple ports densely packed around specific regions. An example would be the Houston, Texas, area with Freeport, Houston, Galveston, Matagorda ports, all located within a 50–100 km radius, and all sharing the same inbound route pattern (overall direction pointing toward Houston). As a consequence, the most popular port shadows its least popular neighbors resulting in low probability distribution and apparent low accuracy. To fully appreciate the predictive power of our approach, one would need to look at the actual distance between predicted vs. actual locations as a more appropriate success metric.

With fuel costs representing as much as 50%–60% of total ship operating costs, ship owners can leverage this framework to reduce carbon emissions. Owners can not only monitor their own fleets, but also those of their competitors (AIS is publicly available), predict traffic at the destination, and establish data-driven strategies for optimizing fuel consumption in real time (reducing sailing speed, re-routing, etc.).

Preventive measures through AI

As we are now able to predict what the immediate next step of any vessel is, operators can use this information to detect unusual patterns. Anomalies can be observed given a drastic change in the probability distribution between two successive events (such a change can easily be captured using the Bhattacharyya coefficient introduced earlier) and preventive measures can be taken immediately. By monitoring the fleets of their competitors and the environmental and safety concerns related to the least regulated vessels, ship owners now have access to a real-time lense of dense traffic where auxiliary correction can be made in real time to vessels approaching the danger state (it may take about 20 minutes for a fully loaded large tanker to stop when heading at normal speed), navigating with higher safety standards.

Enabling a sustainable transformation

Through this series of real-world examples, we demonstrated why ESG is a data and AI challenge. From environmental (reducing carbon emission), social (ensuring the safety of their crew), and governance (detecting the least regulated activities), we have demonstrated how organizations who successfully embedded ESG at their core have built a strategic resilience to better optimize their operating model to emerging threats. Although we used maritime information, the same framework and its underlying technical capabilities can easily be ported to different sectors besides the logistics industry. In the financial services industry, this framework would be directly applicable to commodity trading, risk management, trade finance, and compliance (ensuring that vessels would not be sailing across sanctioned waters).

Try the below notebooks on Databricks to accelerate your sustainable transformation today and contact us to learn more about how we assist customers with similar use cases.

Try Databricks for free. Get started today.

The post Leveraging ESG Data to Operationalize Sustainability appeared first on Databricks.

↧

Analytics on the Data Lake With Tableau and the Lakehouse Architecture

November 11, 2020, 11:59 pm

≫ Next: Announcing the Launch of SQL Analytics

≪ Previous: Leveraging ESG Data to Operationalize Sustainability

Over the past two years we’ve seen a number of organizations moving their data work to the cloud. It simplifies access and scales to handle the biggest volumes. At Tableau, we’re all about customer choice and flexibility, and we’ve enabled our customers to move to the cloud faster than ever.

Analytics and data science/machine learning efforts are beginning to converge, and we’re seeing growing interest in connecting directly to data lakes for analysis as a result. A lot of data is coming into a cloud data lake very fast, from web logs and IoT sensors and it tends to be messy. We need a way to make sense of the data, and to have it delivered in a reliable and performant manner.

To enable this, we’re seeing more and more of our customers adopting a Lakehouse architecture. This new architecture takes the best of data lakes (low cost, flexible content structures) and data warehouses (high performance, data reliability) into a single place to store your data. With our partner Databricks, we’ve seen a number of joint customers adopt a lakehouse architecture to power their Tableau deployments. Databricks uses Delta Lake to enable a lakehouse architecture by improving the performance and reliability of data lake, so Tableau users can query the data lake directly.

This week Databricks is announcing a new SQL Analytics service that is going to provide Tableau customers with an entirely new experience for analyzing data that resides in the data lake. The performance and scale that they can achieve are unlike anything we’ve seen before.

Tableau users will be the most excited by the new SQL Analytics Endpoints which can be used immediately with our existing Databricks connector, no update required. This will improve access to your data lake for analytics in two ways:

Simple setup. SQL Analytics endpoints simplify the configuration of Databricks clusters used by Tableau to query the data lake, There is no need to deal with cluster management for Tableau users, just connect to Databricks SQL Analytics endpoint and go!
Performance improvements. SQL Analytics uses the Databricks Delta Engine, a vectorized query engine with an improved query optimizer and caching capabilities for really fast query performance.

Delta Engine architecture used with the new SQL Analytics service for Tableau from Databricks.

Figure 1: Delta Engine architecture

Customer Examples

Here are some examples of customers who are using a Lakehouse architecture with Databricks and Tableau.

Wehkamp uses Databricks with Delta Lake as a data lake, serving their entire organization for reporting and ad-hoc analysis using Tableau, and using Databricks for data science. You can read about Wehkamp’s implementation in this case study.
Flipp, a retail service provider, uses Databricks with Delta Lake to create a lakehouse that their data science team uses for machine learning, their engineering team uses for product feature analysis, and their sales team uses to provide analysis to their customers with Tableau. You can watch their session at the Tableau Conference.
The US Air Force uses Databricks with Delta Lake to manage all their cash flow analytics, and then provide the results in Tableau to analyze over 65 million records per quarter. You can watch the US Air Force present their implementation at the Data + AI Industry Leadership Forum.

Sample BI visualization in Tableau demonstrating the powerful analytics capabilities made possible by the Lakehouse architecture pioneered by Databricks.

Figure 2. A sample Flipp visualization

Learn more about Databricks and Tableau here.

Try Databricks for free. Get started today.

The post Analytics on the Data Lake With Tableau and the Lakehouse Architecture appeared first on Databricks.

↧

Announcing the Launch of SQL Analytics

November 12, 2020, 8:00 am

≫ Next: Data Teams Unite! Countdown to Data + AI Summit Europe

≪ Previous: Analytics on the Data Lake With Tableau and the Lakehouse Architecture

Today, we announced the new SQL Analytics service to provide Databricks customers with a first-class experience for performing BI and SQL workloads directly on the data lake. This launch brings to life a new experience within Databricks that data analysts and data engineers are going to love. The service provides a dedicated SQL-native workspace, built-in connectors to let analysts query data lakes with the BI tools they already use, query performance innovations that deliver fast results on larger and fresher data sets than analysts traditionally have access to, and new governance and administration capabilities. With this launch, we are the first to realize the complete vision of lakehouse architecture, combining data warehousing performance with data lake economics to deliver up to 9x better price/performance than traditional cloud data warehouses.

The enemy is complexity

Most customers routinely operate their business with a complex data architecture in the cloud that combines data warehouses and data lakes. As a result, customers’ data is moved around the organization through data pipelines that create a multitude of data silos. A large amount of time is spent maintaining these pipelines and systems rather than creating new value from data, and the downstream consumers of the data struggle to get a single source of truth due to the inherent data silos that get created. The situation becomes very expensive, both financially and operationally, and decision-making speed and quality are negatively affected.

Arriving at this problem was a gradual progression. It began with customers moving data from relational databases to data warehouses to do business intelligence 40 years ago. Then, data lakes began to emerge about 10 years ago because data warehouses couldn’t handle raw, video, audio, image, and natural language data, as well as very large scale structured data.

Data lakes in the cloud have high durability, low cost, and unbounded scale, and they provide good support for the data science and machine learning use cases that many enterprises prioritize today. But, all the traditional analytics use cases still exist. Therefore, customers generally have, and pay for, two copies of their data, and they spend a lot of time engineering processes to keep them in sync. This has a knock-on effect of slowing down decision making, because analysts and line-of-business teams only have access to data that’s been sent to the data warehouse rather than the freshest, most complete data in the data lake.

Finally, as multi-cloud becomes an increasingly common reality for enterprises, all of this data movement is getting repeated across several cloud platforms.

The whole situation is a mess.

The complexity from intertwined data lakes and data warehouses is not desirable, and our customers have told us that they want to be able to consolidate and simplify their data architecture. Advanced analytics and machine learning on unstructured and large-scale data are one of the most strategic priorities for enterprises today, – and the growth of unstructured data is going to increase exponentially – therefore it makes sense for customers to think about positioning their data lake as the center of data infrastructure. However, for this to be achievable, the data lake needs a way to adopt the strengths of data warehouses.

The lakehouse combines the best of data warehouses and data lakes

The answer to this complexity is the lakehouse, a platform architecture that combines the best elements of data lakes and data warehouses. The lakehouse is enabled by a new system design that implements similar data structures and data management features to those in a data warehouse directly on the low-cost storage used for cloud data lakes. The architecture is what you would get if you had to redesign data warehouses in the modern world, now that cheap and highly reliable storage (in the form of object stores) is available. You can read more about the characteristics of a lakehouse in this blog.

The foundation of the lakehouse is Delta Lake. Delta Lake has brought reliability, performance, governance, and quality to data lakes, which is necessary to enable analytics on the data lake. Now, with the right data structures and data management features in place, the last mile to make the lakehouse complete was to solve for how data analysts actually query a data lake.

Introducing SQL Analytics

SQL Analytics allows customers to perform BI and SQL workloads on a multi-cloud lakehouse architecture that provides up to 9x better price/performance than traditional cloud data warehouses. This new service consists of four core components: A dedicated SQL-native workspace, built-in connectors to common BI tools, query performance innovations, and governance and administration capabilities.

A SQL-native workspace

SQL Analytics provides a new, dedicated workspace for data analysts that uses a familiar SQL-based environment to query Delta Lake tables on data lakes. Because SQL Analytics is a completely separate workspace, data analysts can work directly within the Databricks platform without the distraction of notebook-based data science tools (although we find data scientists really like working with the SQL editor too). However, since the data analysts and data scientists are both working from the same data source, the overall infrastructure is greatly simplified and a single source of truth is maintained.

The workspace allows analysts to easily explore schemas, save regularly used code as snippets for quick reuse, and cache query results to keep subsequent run times short. Additionally, query updates can be scheduled to automatically refresh, as well as to issue automatic alerts on refresh, via email or Slack, when meaningful changes occur in the data.

SQL-native query editor

The workspace also allows analysts to make sense of data through rich visualizations, and to organize those visualizations into drag-and-drop dashboards. Once built, dashboards can be easily shared with stakeholders to make sharing data insights ubiquitous across an organization.

Databricks SQL Analytics’ built-in connectors for BI software like Tableau and Microsoft Power BI allow users to query the freshest, most complete data in the data lake.

Visualizations and dashboards

Built-in connectors to existing BI tools and broad partner support

For production BI, many customers have made investments in BI software like Tableau and Microsoft Power BI. To allow those tools to have the best possible experience querying the freshest, most complete data in the data lake, SQL Analytics includes built-in connectors for all major BI tools available today.

Across the data lifecycle, the launch of SQL Analytics is supported by the 500+ partners in the Databricks ecosystem. We’re very pleased to have the following partners investing above and beyond with us in this launch to enable customers to use their favorite analytics tools with SQL Analytics and lakehouse architecture:

BI Partners: Tableau, Power BI, Qlik, Looker, Thoughtspot
Ingest Partners: Fivetran, Fishtown Analytics, Matillion, Talend, Qlik
Catalog Partners: Collibra, Alation
Consulting Partners: Slalom, Thorogood, Advancing Analytics

Fast query performance

A big part of enabling analytics workloads on the data lake is solving for performance. There are two core challenges to solve to deliver great performance: query throughput and user concurrency. By delivering both, we’ve been able to achieve up to 9x better price/performance for SQL workloads using lakehouse architecture than other cloud data warehouses.

30 TB TPC-DS price/performance benchmark test for leading cloud data warehouse solutions.

30TB TPC-DS Price/Performance benchmark test (lower is better)

Earlier this year, we announced Delta Engine, our polymorphic query execution engine. Delta Engine accelerates the performance of Delta Lake for both SQL and data frame workloads through three components: an improved query optimizer, a caching layer that sits between the execution layer and the cloud object storage, and a polymorphic vectorized execution engine that’s written in C++. With Delta Engine, customers have observed query execution times up to 10x faster than Apache Spark 3.0.

With throughput handled, we turned our attention to user concurrency. Historically, data lakes have struggled to maintain fast performance under high user counts. To solve this, SQL Analytics adds new SQL-optimized compute clusters that auto-scale in response to user load to provide consistent performance as the number of data analysts querying the data lake increases. Setting up these clusters is fast and easy through the console, and Delta Engine is built-in to ensure the highest level of query throughput. External BI clients can connect to the clusters via dedicated endpoints.

Governance and administration

Finally, in the SQL Analytics console, we allow admins to apply SQL data access controls (AWS, Azure) onto your tables to gain much greater control over how data in the data lake is used for analytics. Additionally, we provide deep visibility into the history of all executed queries, allowing you to explore the who, when, and where of each query along with the executed code to assist you in compliance and auditing. The query history also allows you to understand the performance of each phase of query execution to assist with troubleshooting.

On the administrative side, you can aggregate details for query runtimes, concurrent queries, peak queued queries per hour, etc. to help you better optimize your infrastructure over time. You can also set controls around runtime limits to prevent bad actors and runaway queries, enqueued query limits, and more.

Getting started

SQL Analytics completes the final step in moving lakehouse architecture from vision to reality, and Databricks is proud to be the first to bring a complete lakehouse solution to market. All members of the data team, from data engineers and architects to data analysts to data scientists, are collaborating more than ever. The unified approach of the Databricks platform makes it easy to work together and innovate with a single source of truth that substantially simplifies data infrastructure and lower costs.

SQL Analytics is available in preview today. Existing customers can reach out to their account team to gain access. Additionally, you can request access via the SQL Analytics product page.

Sign-up for access to SQL Analytics

Try Databricks for free. Get started today.

The post Announcing the Launch of SQL Analytics appeared first on Databricks.

↧

Data Teams Unite! Countdown to Data + AI Summit Europe

November 12, 2020, 1:21 pm

≫ Next: MLflow 1.12 Features Extended PyTorch Integration

≪ Previous: Announcing the Launch of SQL Analytics

Data + AI Summit 2020 Europe takes place virtually in just a few days,from 17-19 November – and it’s free to attend! Formerly known as Spark + AI Summit, Data + AI Summit will bring together thousands of data teams to learn from practitioners, leaders, innovators and the original creators of Spark, Delta Lake, MLflow and Koalas.

In June, we were able to successfully transform Summit into a completely virtual experience. The Data + AI Summit will be even better with optimised engagement throughout the conference platform, plus hundreds of sessions at your fingertips, anytime and from anywhere. Although the conference schedule is scheduled for European attendees, all content will be available on demand for our global audience.

We’ve spent the last two months building a conference that will bring data teams together again with 125+ sessions, an incredible line-up of keynotes and countless opportunities to connect with your peers — more than 20,000 data scientists, data engineers, analysts, business leaders and other data professionals.

Our virtual platform launches on 13 November, but here’s a sneak peek at what awaits you. As soon as the platform launches, get a head start by building your agenda and your personal profile to get the most from your conference experience. If you haven’t already registered, there’s still time! General admission is FREE and content will be available live – and on demand after the event.

Personalised dashboard

As you enter the conference you will be welcomed by your personalised dashboard — a home for everything you need to know about the conference. We have highlighted the most useful links to access content and a quick view of your agenda. The left navigation panel will help you explore every aspect of the Summit. And keep an eye on your inbox for notifications so you don’t miss any updates.

Build your agenda

Our agenda is jam-packed this year with two days filled with technical content for data scientists, engineers, analysts and IT and business leaders. To add sessions to your agenda, simply click on the heart next to the session title.

We have an incredible lineup of keynotes from industry thought leaders such as Ali Ghodsi, Matei Zaharia and Reynold Xin, as well as luminary keynotes from Malcolm Gladwell, Mae Jemison, Dr. Kira Radinsky, Jeremy Singer-Vine and many others. We recommend you take time to play with the agenda filters and explore the speakers’ pages to build your ideal agenda.

Dev Hub + Expo

Connect with your peers and sponsors at the Dev Hub + Expo! We are making live networking happen at our Hallway Chatter rooms where you can chat live with like-minded attendees or hear a lightning talk from community members. You can also learn more about Delta Lake, Apache Spark™, MLflow, Redash and more at the Databricks Booth, and interact with our amazing sponsors.

We want to be able to bring people together and connect in a virtual space. We are doing this in many ways throughout the platform. Check out the Data People page for a directory of who’s around. You can also go to the Suggested for Me tab to meet like-minded individuals and to discover recommended sessions/experiences.

Summit Quest and evening events

Don’t forget to set aside time to do our daily body breaks, get social, and rack up points to hit the top of the Summit Quest leaderboard. The more points you accumulate, the more you have a chance to be one of our Top 50 leaderboard winners for some exclusive Summit merchandise.

Each evening, attendees will be able to ‘choose their own adventure’ by joining a live DJ set from DJ Kingmost, attend our highly engaging live Meetups or explore everything the Dev Hub + Expo hall has to offer.

There is so much more that we can share, but now it is your turn to discover what Data + AI Summit Europe has to offer. If you have registered, join the experience on 13 November and check out this guide for even more reasons to get excited. And if you haven’t yet registered, it’s not too late. Join us for all the action at Data + AI Summit 2020 and we look forward to seeing you there!

Try Databricks for free. Get started today.

The post Data Teams Unite! Countdown to Data + AI Summit Europe appeared first on Databricks.

↧

MLflow 1.12 Features Extended PyTorch Integration

November 13, 2020, 9:46 am

≫ Next: How to Evaluate Data Pipelines for Cost to Performance

≪ Previous: Data Teams Unite! Countdown to Data + AI Summit Europe

MLflow 1.12 features include extended PyTorch integration, SHAP model explainability, autologging MLflow entities for supported model flavors, and a number of UI and document improvements. Now available on PyPI and the docs online, you can install this new release with pip install mlflow==1.12.0 as described in the MLflow quickstart guide.

In this blog, we briefly explain the key features, in particular extended PyTorch integration, and how to use them. For a comprehensive list of additional features, changes and bug fixes read the MLflow 1.12 Changelog.

Support for PyTorch Autologging, TorchScript Models and TorchServing

At the PyTorch Developer Day, Facebook’s AI and PyTorch engineering team, in collaboration with Databricks’ MLflow team and community, announced an extended PyTorch and MLflow integration as part of the MLflow release 1.12. This joint engineering investment and integration with MLflow offer PyTorch developers an “end-to-end exploration to production platform for PyTorch.” We briefly cover three areas of integration:

Autologging for PyTorch models
Supporting TorchScript models
Deploying PyTorch models onto TorchServe

Autologging PyTorch pl.LightningModule Models

As part of the universal autologging feature introduced in this release (see autologging section below), you can automatically log (and track) parameters and metrics from PyTorch Lightning models.

Aside from customized entities to log and track, the PyTorch autolog tracking functionality will log the model’s optimizer names and learning rates; metrics like training loss, validation loss, accuracies; and models as artifacts and checkpoints. For early stopping, model checkpoints, early stopping parameters and metrics are logged too. To understand its mechanics and usage, read the PyTorch autologging example.

Converting PyTorch models to TorchScript

TorchScript is a way to create serializable and optimizable models from PyTorch code. As such any MLflow-logged PyTorch model can be converted into a TorchScript, saved and loaded (or deployed to) a high-performance, independent process, where there is no Python dependency. The process entails following steps:

Create an MLflow Python model
Compile the model using JIT and convert to TorchScript model
Log or save the TorchScript model
Load or deploy the TorchScript model

# Your PyTorch nn.Module or pl.LightningModule
model = Net()
scripted_model = torch.jit.script(model)
…
mlflow.pytorch.log_model(scripted_model, "scripted_model")
model_uri = mlflow.get_artifact_uri("scripted_model")
loaded_model = mlflow.pytorch.load_model(model_uri)
…

For brevity, we have not included all the code here, but you can examine the example code—IrisClassification and MNIST—in the GitHub mlflow/examples/pytorch/torchscript directory.

One thing you can do with a scripted (fitted or logged) model is use the mflow fluent and mlflow.pytorch APIs to access the model and its properties, as shown in the GitHub examples. Another thing you can do with the scripted model is deploy it to a TorchServe server using TorchServer MLflow Plugin.

Deploying PyTorch models with TorchServe MLflow Plugin

TorchServe offers a flexible, easy tool for serving PyTorch models. Through the TorchServe MLflow deployment plugin, you can deploy any MLflow-logged and fitted PyTorch model. This extended integration completes the PyTorch MLOps lifecycle—from developing, tracking and saving to deploying and serving PyTorch models.

Figure 1: Extended end-to-end PyTorch and MLflow Integration

For demonstration, two PyTorch examples—BertNewsClassifcation and MNIST—enumerate steps in how you can use the TorchServe MLflow deployment plugin to deploy a PyTorch saved model to an existing TorcheServe server. Any MLflow-logged and fitted PyTorch model can easily be deployed using mlflow deployments commands. For example:

mlflow deployments create -t torchserve -m models:/my_pytorch_model/production -n my_pytorch_model

Once deployed, you can just easily use mlflow deployments predict command for inference.

mlflow deployments predict --name my_pytorch_model --target torchserve --input-path sample.json --output-path output.json.

SHAP API Offers Model Explainability

As more and more machine learning models are deployed in production as part of business applications that offer suggestive hints or make decisive predictions, machine learning engineers are obliged to explain how a model was trained and what features contributed to its output. One common technique used to answer these questions is SHAP (SHapley Additive exPlanations), a theoretical approach to explain an output of any machine learning model.

Figure 2: SHAP can estimate how each feature contributes to the model output.

To that end, this release includes an mlflow.shap module with a single method mlflow.shap.log_explanation() to generate an illustrative figure that can be logged
as a model artifact and inspected in the UI.

import mlflow

# prepare training data
dataset = load_boston()
X = pd.DataFrame(dataset.data[:50, :8], columns=dataset.feature_names[:8])
y = dataset.target[:50]

# train a model
model = LinearRegression()
model.fit(X, y)

# log an explanation
with mlflow.start_run() as run:
    mlflow.shap.log_explanation(model.predict, X)
…

Figure 3: SHAP explanation saved as an MLflow artifact

You can view the example code in the docs page and try other examples of models with SHAP explanations in the MLflow GitHub mlflow/examples/shap directory.

Autologging Simplifies Tracking Experiments

The mlflow.autolog() method is a universal tracking API that simplifies training code by automatically logging all relevant model entities—parameters, metrics, artifacts such as models and model summaries—with a single call, without the need to explicitly call each separate method to log respective model’s entities.

As a universal single method, under the hood, it detects which supported autologging model flavor is used—in our case scikit-learn—and tracks all its respective entities to log. After the run, when viewed in the MLflow UI, you can inspect all automatically logged entities.

Figure 4: MLflow UI showing automatically logged entities for scikit-learn model

What’s next

Learn more about PyTorch integration at the Data + AI Summit Europe next week, with a keynote from Facebook AI Engineering Director Lin Qiao and a session on Reproducible AI Using PyTorch and MLflow from Facebook’s Geeta Chauhan.

Stay tuned for additional PyTorch and MLflow detailed blogs. For now you can:

Read MLflow and PyTorch — Where Cutting Edge AI meets MLOps
Checkout out the PyTorch and MLFlow mlflow/examples/pytorch/
Examine SHAP GitHub mlflow/examples/shap/
pip install mlflow==1.12.0 and have a go at it.

Community Credits

We want to thank the following contributors for updates, doc changes, and contributions to MLflow release 1.12. In particular, we want to thank the Facebook AI and PyTorch engineering team for their extended PyTorch integration contribution and all MLflow community contributors:

Andy Chow, Andrea Kress, Andrew Nitu, Ankit Mathur, Apurva Koti, Arjun DCunha, Avesh Singh, Axel Vivien, Corey Zumar, Fabian Höring, Geeta Chauhan, Harutaka Kawamura, Jean-Denis Lesage, Joseph Berry, Jules S. Damji, Juntai Zheng, Lorenz Walthert, Poruri Sai Rahul, Mark Andersen, Matei Zaharia, Martynov Maxim, Olivier Bondu, Sean Naren, Shrinath Suresh, Siddharth Murching, Sue Ann Hong, Tomas Nykodym, Yitao Li, Zhidong Qu, @abawchen, @cafeal, @bramrodenburg, @danielvdende, @edgan8, @emptalk, @ghisvail, @jgc128 @karthik-77, @kzm4269, @magnus-m, @sbrugman, @simonhessner, @shivp950, @willzhan-db

Try Databricks for free. Get started today.

The post MLflow 1.12 Features Extended PyTorch Integration appeared first on Databricks.

↧

How to Evaluate Data Pipelines for Cost to Performance

November 13, 2020, 11:00 am

≫ Next: Fatal Force: Exploring Police Shootings With SQL Analytics

≪ Previous: MLflow 1.12 Features Extended PyTorch Integration

Learn best practices for designing and evaluating cost-to-performance benchmarks from Germany’s #1 weather portal.

While we certainly conduct several benchmarks, we know the best benchmark is your queries running on your data. But what are you benchmarking against in your evaluation? The answer seems obvious – cost and integration with your cloud architecture roadmap.

We are finding, however, that many enterprises are only measuring the costs of individual services within a workflow, rather than the entire cost of the workflow. When comparing different architectures, running a complete workflow will demonstrate the total resources consumed (data engine + compute + ancillary support functions).

Without knowing the duration, job failure rate of each architecture, and manual effort required to support a job, comparing list prices of the individual components in two architectures will be misleading at best.

wetter.com case study

wetter.com is the DACH region’s #1 B2C weather portal with up to 20 million monthly unique users along with full cross-media production. To leverage and monetize its data, wetter.com created a new business unit called METEONOMIQS. With METEONOMIQS, the company could now generate new revenue-streams out of their data by developing and selling data-products to business customers. METEONOMIQS provides weather and geo-based data science services to decode the interrelation between weather, consumer behaviour and many other factors used by clients in retail, FMCG, e-commerce, tourism, food and advertising.

METEONOMIQS’ challenge

METEONOMIQS had chosen Amazon EMR for processing their data from raw ingestion through to cleansed and aggregated to serve downstream API users. Originally EMR had been the obvious choice as a best-in-class cloud-based Spark engine that fit into their AWS stack.

However, this architecture soon hit its limits. The data pipeline required substantial manual effort to update rows and clean tables, required high DevOps effort to maintain, and limited the potential to use ML due to prolonged development cycles. The poor notebook experience and risk of errors when handing over ML models from DS to DE made it harder to support multiple models at a time.

The greatest risk to the business however was the inability to implement an automated GDPR-compliant workflow by, for example, easily deleting individual customers. Instead METEONOMIQS had to manually clean the data, leading to days of downtime. With GDPR penalties reaching up to 4% of the parent company’s global revenue, this presented a large risk for parent company ProSiebenSat.1.

Building the test

METEONOMIQS turned to Databricks to see if there was a better way to architect their data ingest, processing, and management on Amazon S3. Working with Databricks, they set up a test to see how running this pipeline on Databricks compared in terms of:

Vector analyzed	Capabilities required
Setup	Ability to set up IAM-access roles by users Ability to integrate into their existing AWS Glue data catalogue as a metastore
Pipeline migration	Ability to migrate code from existing pipeline directly to Databricks without major re-engineering. Note: they did not tackle code optimization in this test
GDPR compliance	Ability to build a table with (test) customer/app-ids which could be removed to fulfill the GDPR requirements (right to be forgotten). Ability to set up automated deletion job removing the IDs from all intermediate and results-tables and validate the outcome
Clean up / Update	Ability to reconstruct an example of a previously updated / cleaned-up procedure. Build a clean-up procedure based on above example and do an update on the affected records
Ease of use	Ease of building visualisations within the databricks-notebooks by using the built-in functionalities and external plotting libraries (like matplotlib). Ability to work on multiple projects/streams by attaching two notebooks to a cluster
ML model management	Select an existing model from the current environment and migrate the code for the training-procedure to Databricks Conduct training-run(s) and use MLFlow tracking server to track all parameters, metrics and artifacts OPTIONAL: Store the artifacts in the currently used proprietary format Register (best) model in the MLflow Model Registry, set it into “production” state and demonstrate the approval process Demonstrate the handover from data domain (model building) to systems of engagement domain (model production) via MLflow Model Registry
Total cost	Use the generated data from the PoC and additional information (further pipelines/size of the data/number of users/ …) to project infrastructure costs, inclusive of Databricks, compute, and storage.

Benchmark results

Data corrections/enhancements without downtime

Vector analyzed	EMR-based architecture	Databricks-based architecture
Setup	✔	✔
Pipeline migration	—	✔
GDPR compliance	✘ GDPR deletes in hours/days with downtime	✔ GDPR deletes in minutes without downtime
Clean up / Update	✘ Requires days of downtime	✔
Ease of use	✘	✔
ML model management	✘	✔ Improved collaboration between Data Scientists and Data Engineers / Dev Team
Total cost	80% of EMR costs were from dedicated dev and analytics clusters leading to unpredictable compute costs. DataOps required substantial developer resources to maintain.	Through cluster sharing, METEONOMIQS could use cloud resources much more efficiently But more importantly, they can now do new use cases like automated GDPR compliance and scale their ML in ways not possible before.

For METEONOMIQS the main gains to the Databricks architecture were:

Adding use cases (e.g., automated data corrections and enhancements) that hadn’t been deployed on EMR due to the high level of development costs
Massively decreasing the amount of manual maintenance required for the pipeline
Simplifying and automating GDPR compliance of the pipeline so that it could now be done in minutes without downtime compared to hours/days with downtime previously

Additionally, the team had high AWS resource consumption in the EMR architecture since shared environments were not possible on EMR. As a result team members had to use dedicated clusters. Databricks’ shared environment for all developers plus the ability to work on shared projects (i.e., notebooks), resulted in a more efficient use of human and infrastructure resources.

Handover of ML models from data scientists to the data engineering team was complicated and led the ML code to diverge. With MLflow the team now has a comfortable way to hand over models and track changes over time.

Further, as Databricks notebooks are much easier to use, METEONOMIQS could enable access to the data lake to a broader audience like, for example, the mobile app team.

As one of their next steps, METEONOMIQS will look to optimize their code for further infrastructure savings and performance gains as well as look at other pipelines to transition to Databricks architecture.

Takeaways

The keys to the team’s successful benchmark relied on

Knowing what they were measuring for: Often clients will only compare list prices of individual services (e.g., compare the cost of one Spark engine versus another) when evaluating different architectures. What we try to advise clients is not to look at individual services but rather the total job cost (data engine + compute + team productivity) against the business value delivered. In this case, wetter.com’s data engineering team aligned their test with the overall business goal – ensuring their data pipelines could support business and regulatory requirements while decreasing infrastructure and developer overhead.
Choosing critical workloads: Instead of trying to migrate all pipelines at once, the team narrowed the scope to their most pressing business case. Through this project they were able to validate that Databricks could handle data engineering, machine learning, and even basic business analytics at scale, on budget, and in a timely manner.
Delivering value quickly: Critical for this team was to move from discussions to PoCs to production as quickly as possible to start driving cost savings. Discussions stretching months or longer was not an option nor a good use of their team’s time. Working with Databricks, they were able to stand up the first benchmark PoCs in less than three weeks.

Ready to run your own evaluation?

If you are looking to run your own tests to compare costs and performance of different cloud data pipelines, drop us a line at sales@databricks.com. We can provide a custom assessment based on your complete job flow and help qualify you for any available promotions. Included in the assessment are:

Tech validation: understand data sources, downstream data use, and resources currently required to run pipeline job
Business value analysis: identify the company’s strategic priorities, to understand how the technical use case (e.g., ETL) drives business use cases (e.g., personalization, supply chain efficiency, quality of experience). This ensures our SAs are designing a solution that fits not just today’s needs but the ongoing evolution of your business.

Below is an outline of our general approach based on best practices for designing and evaluating your benchmark test for data pipelines.

Designing the test

Given data pipelines within the same enterprise can vary widely depending on the data’s sources and end uses – and large enterprises can have thousands of data pipelines spanning supply chain, marketing, product, and operations – how do you test an architecture to ensure it can work across a range of scenarios, end-user personas, and use cases? More importantly, how can you do it within a limited time? What you want is to be able to go from test, to validation, to scaling across as many pipelines as possible as quickly as possible to reduce costs as well as the support burden on your data engineers.

One approach we have seen is to select pipelines that are architecturally representative of most of an enterprise’s pipelines. While this is a good consideration, we find selecting pipelines based primarily on architectural considerations does not necessarily lead to the biggest overall impact. For example, your most common data pipeline architecture might be for smaller pipelines that aren’t necessarily the ones driving your infrastructure costs or requiring the most troubleshooting support from your data engineers.

Instead, we recommend clients limit the scope of their benchmark tests to 3-5 data pipelines based on just two considerations:

Test first on business critical data workloads: Often the first reflex is to start with less important workloads and then move up the stack as the architecture proves itself. However, we recommend running the test on strategic, business critical pipelines first because it is better to know earlier rather than later if an architecture can deliver on the necessary business SLAs. Once you prove you can deliver on the important jobs, then it becomes easier to move less critical pipelines over to a new architecture. But the reverse (moving from less critical to more critical) will require validating twice – first on the initial test and then once again for important workloads.
Select pipelines based on the major stressors affecting performance: What’s causing long lead times, job delays, or job failures? When selecting test pipelines, make sure you know what the stressors are to your current architecture, and select representative pipelines generating long delays, high fail rates, and/or require constant support from your data engineering teams. For example, if you’re a manufacturer trying to get a real-time view of your supply chain, from parts vendors to assembly to shipping, but your IoT pipelines take hours to process large volumes of small files in batches, that is an ideal test candidate.

Evaluating the results

Once you have selected the data pipelines to test, the key metrics to evaluate are:

Total cost to run a job: What are the total resources required to run a job? This means looking not just at the data engine costs for ingest and processing, but also total compute and support function costs (like data validation) to complete the data query. In addition, what is your pipeline’s failure rate? Frequent job failures mean reprocessing the data several times, significantly increasing infrastructure costs.
Amount of time to run a job: How long does it take to run a job once you add cluster spin up and data processing along with the amount of time it takes to identify and remediate any job failures? The longer this period, the higher the infrastructure costs but also, the longer it will take for your data to drive real business value/insights. Enterprises rely on data to make important business decisions and rigid pipelines with long lead times prevent businesses from iterating quickly.
Productivity: How often are your jobs failing and how long does it take your data engineers to go through the logs to find the root cause, troubleshoot, and resolve? This loss of productivity is a real cost in terms of increased headcount plus the opportunity cost of having your data engineers focused on basic data reliability issues instead of solving higher level business problems. Even if your jobs run correctly, are your downstream users working with the most up to date information? Are they forced to deduplicate and clean data before use in reports, analytics, and data science? Particularly with streaming data where you can have out-of-order files, how can you ensure you have consistent data across users?
Extensibility: Will adding new use cases or data sources require full re-engineering of your data pipelines, or do you have a schema that can evolve with your data needs?

Additionally, as enterprises look to create a more future proof architecture, they should look to:

Implementation complexity: How big of a migration will this be? How complex is the re-engineering required? How much and for how long will it take data engineering resources to stand up a new data pipeline? How quickly can your architecture conform to security requirements? When UK-based food box company Guosto rebuilt their ETL pipelines to Delta Lake on Databricks, they noted, “the whole implementation, from the first contact with Databricks to have the job running in production took about two months — which was surprisingly fast given the size of Gousto tech and the governance processes in place.”
Portability: As more enterprises look to multi-cloud, how portable is their architecture across clouds? Is data being saved in proprietary formats resulting in vendor lock in (i.e., will it require substantial costs to switch in the future)?

Try Databricks for free. Get started today.

The post How to Evaluate Data Pipelines for Cost to Performance appeared first on Databricks.

↧

Fatal Force: Exploring Police Shootings With SQL Analytics

November 16, 2020, 9:00 am

≫ Next: How to Train XGBoost With Spark

≪ Previous: How to Evaluate Data Pipelines for Cost to Performance

Introduction

Data has shown that police in the United States kill civilians at a rate far higher than police in other wealthy countries.¹ In 2019, law enforcement in the U.S. killed 33.5 civilians per 10 million people, compared to 1.3 in Germany and 0.5 in the U.K.² This use of deadly force falls disproportionately on people of color and has become a flashpoint as videos surface and protesters take to the streets demanding change. As data scientists, we examine data, offer insights gleaned from data, and add our voice to this conversation by using the new SQL Analytics workspace in Databricks.

For our data source, we set out to track down a federal database of police shootings, which turned out to be much trickier than anticipated. Prior to 2016, there was no federal database on the use of force by law enforcement officers. Since then, the FBI has compiled its own database, however leading newspapers have found that the FBI numbers routinely underreport fatalities by law-enforcement officers by as much as a factor of two.

In the absence of a robust federal database of police shootings, we turned to the database compiled by The Washington Post,³ which holds verified and regularly updated information on fatal shootings by on-duty police officers since January 1, 2015 (our snapshot includes data through October 29, 2020). It is important to note, however, that the Post data does not include deaths of people while in police custody, fatal shootings by off-duty officers, or deaths not caused by a firearm. We joined this data set with state population and demographic data to normalize per capita across states.⁴

Dashboard overview video

In this blog post, we analyze national and state-level trends around police use of fatal force. We will guide you through the insights we have derived from the interactive dashboard we created. After analyzing national statistics, we will do a deep dive comparison of three states in particular — New York, Alaska and California — as they spend the most on law enforcement per capita out of all the states.⁵ However, despite their similar spending, our data analysis reveals drastically different outcomes.

All of our analyses are available in this Github repository for you to reproduce and expand upon. We stored our data in Delta Lake for faster query performance and version control given the frequent updates to The Washington Post data set.

Demographic breakdown of shootings

In terms of absolute numbers, more white people were fatally shot by police than any other race, with 2,591 fatalities since 2015. However, when normalizing the number of fatalities by demographics, Native American and Black people suffered almost three times more fatal police shootings than white people, as shown in the graph below. If we were to look at 2017 alone, roughly 1 out of 100,000 Native Americans were fatally shot by police, compared to 0.2 out of 100,000 white people.

Fatal shootings by race in 2015–2020

In Utah, this race-based discrepancy is even more pronounced. Although Black people make up only 1% of Utah’s population, they accounted for 10% of police fatalities. The same trend persists in Illinois, where the fatality rate for Black people is almost 10 times higher than for white people. And these are just two examples on a long list of states exhibiting disproportionately higher rates of fatal shootings by race. To help address these concerns of racial bias, police departments across the country have instituted unconscious bias training, only to find it had no impact on the numbers in the field.⁶

We can look at the Sankey diagram below to look at the various breakdowns of fatal shootings. Going from left to right:

4% of victims are female
51% White, 26% Black, 19% Hispanic, 2% Asian, and 2% Native American victims
23% of victims showed signs of mental illness, indicated by on-scene mental health crises or news reports
91% of victims were determined to be armed, with objects ranging from toy weapons to pepper spray to tasers to guns
12% of incidents may have been recorded by body cameras

2015–2020 breakdown of fatal shootings in the U.S.

All that it takes to generate these visualizations in the SQL Analytics workspace is a few lines of SQL.


    SELECT 
        gender AS stage1,
        race AS stage2,
        signs_of_mental_illness AS stage3,
        armed AS stage4,
        body_camera AS stage5,
        COUNT(*) AS value
    FROM police.police_shootings
    GROUP BY 1,  2,  3, 4, 5

Once you execute the SQL code above, you simply add a visualization and select “Sankey” as the visualization type.

Use of body cameras is actually declining

Since 2015, the United States has averaged 986 fatal shootings by on-duty police per year. As of this writing, there have only been 14 days in 2020 without a fatality by law enforcement.⁷ Despite the public outcry for police officers to wear body cameras, we found that body camera recordings of such incidents have actually decreased since 2016, as illustrated in the funnel chart below.

Percentage of incidents with body camera recordings

In 2016, 14.9% of fatal shootings had body camera recordings, and every year since then has had a lower rate of body camera recordings, despite the annual fatality rate remaining constant. How is it that 96% of Americans own a cellphone of some kind,⁸ but only 1 in 7 fatal shootings had any police recording of the incident? When six-year-old Jeremy Mardis was killed by police in 2015, “evidence from a police body-worn video camera was cited as being contributory to the speed of the arrests.”⁹ Without the body cameras, law enforcement would not have been able to make an arrest as justly and quickly as without it. According to the Bureau of Justice Statistics, 47% of general-purpose law enforcement agencies in 2016 had acquired body-worn cameras yet the percentage of fatal encounters where body camera recordings were present is far below that number.¹⁰

Now that we have established nationwide trends of fatal encounters, let’s compare how Alaska, California and New York — the three states with the highest police funding per capita — differ in terms of racial disparities, body camera coverage of incidents, and mental health episodes at the scene of interaction with police.

State Comparison: Alaska, California and New York

Alaska ranks first in police fatalities per capita

In the bar chart below, we can visually compare how each state ranks, and we can see there is a roughly 10x difference in per capita fatalities with the states on the left- and right-hand sides of the chart. Alaska has the dubious distinction of ranking first among U.S. states in police fatalities per capita in 2017, 2019 and 2020 (it still remained in the top five in 2015, 2016 and 2018). By comparison, in 2020, California is midrange at 20th highest, whereas New York is near the bottom with the sixth lowest fatality rate per capita.

Fatalities per 100K individuals in 2020

So how is it that these three states spend the most on law enforcement per capita, but have very different outcomes? Following up on the previous section, let’s start by taking a closer look at the breakdown of fatalities by race in these three states as well as their overall fatalities. The table below shows the total fatalities from 2015–2020 per 100,000 individuals.

State	Black	Native American	Hispanic	White	Asian	Overall
Alaska	14.2	8.6	–	4.9	4.1	5.5
California	6.0	2.7	2.1	1.6	0.5	2.2
New York	1.7	–	0.2	0.3	0.1	0.5

Fatality rates from 2015–2020 per 100,000 individuals

Comparing these fatality rates overall, we see that the magnitude of the fatal officer-involved shootings is the highest in Alaska. Alaska has 11x more fatalities per capita than New York, and California has over 4x more fatalities than New York. FiveThirtyEight found that while police fatal shootings are decreasing in urban areas, they are offset by the increase in fatalities in suburban and rural areas.¹¹ One plausible explanation for why the rates are dropping in urban areas is due to “reforms to use-of-force policies implemented in the wake of high-profile deaths.” Additionally, the increase in rural and suburban areas could be due to the increase in police budgets and the transfer of surplus military equipment to local police departments since 1997.¹²

What is also interesting to note is that Alaska has the highest gun ownership per capita,¹³ while New York has the third-lowest and California the ninth-lowest gun ownership per capita, respectively. There is a statistically significant, positive correlation between gun ownership and shootings by police, with a correlation strength value at 0.64 and a p-value of 10^-7. The more individuals per capita who own guns, the more police shootings there are — this is not due to random chance alone. At the time of the fatal incident, 57% of all victims in our data set were armed with a gun.

California has the highest racial disparity

When taking a look at the disparities by race and ethnicity, Black people in Alaska are 8x more likely to be killed by police than Black people in New York, and that figure jumps to 41x when comparing Asian people in Alaska to Asian people in New York. However, if we were to aggregate the non-white fatalities per capita (Black, Hispanic, Asian and Native American), California actually exhibited the highest racial disparity ratio: 11.3 fatalities per 100,000 non-white individuals vs. 1.6 per 100,000 white individuals, with a difference of 7x. In contrast, New York had a 6.7x difference, and Alaska, 5.4x. For a progressive state, California has a very large divide among victims of police violence and brutality.

New York has 15% more victims with mental illness than the U.S. average

Another concerning finding we uncovered is the alarmingly high rates of victims suffering from mental illness. Recall from the Sankey diagram above that the nationwide percentage of individuals with a mental health history or those who show mental distress at the scene of interaction with police is 23%. In New York, 38% of victims showed signs of mental illness, compared to 15% in Alaska and 23% in California. In a study released by the Treatment Advocacy Center,¹⁴ people with untreated mental illnesses are “16 times more likely to be killed during a police encounter than other civilians approached or stopped by law enforcement.” In the United States, over 240 million calls are made to 911 per year,¹⁵ and 1 out of 10 represent mental health–related reports.¹⁶ However, only 15% of police departments require officers to go through Crisis Intervention Training (CIT) that involves training on de-escalating situations and identifying mental health distress signs.¹⁷

On the other hand, police officers are first responders who receive perturbing calls for service for a variety of traumatic events, including homicides and domestic abuse. The repeated exposure to these stressors can lead to burnout, post-traumatic stress disorder (PTSD) and other mental illnesses. In fact, a study by Ruderman Family Foundation in 2017 found that more police died by suicide than in the line of duty. In addition, according to a survey conducted by NBC4 I-Team on over 600 police officers at the Los Angeles Police Department, a staggering 90% of police officers said seeking mental health assistance and therapy is stigmatizing. As a community, we ought to ask, how can we better support both civilians and police officers experiencing mental health illnesses?

Alaska has 0% body camera coverage of fatal encounters in 2015, 2019 and 2020

Let’s now compare the percentage of body camera recordings in these states. Back in 2015, New York had 0 body camera recordings of their 19 fatal encounters that year. Since 2016, body camera coverage of incidents has increased and is now at 36%. California has steadily increased its body camera usage, but its rate is two-thirds that of New York. Alaska, on the other hand, has actually seen a sharp decline in body camera coverage after 2017. Alaska has 0% body camera coverage in 2015, 2019 and 2020, despite the fatality rate remaining disturbingly high. Only four out of 40 fatalities in the past five years in Alaska have any body camera evidence. In an article from the Anchorage Daily News published in September 2020, they cited “funding as a primary reason for the delay” in adopting body cameras. However, Anchorage police have never applied for grants to fund body cameras. The Anchorage police force has 430 officers and a budget of $121 million, but none of that money is currently going toward body cameras.

Body camera rates in Alaska vs. California vs. New York

What the data does NOT tell us

Recognizing the limitations of the data and our analysis is important. The database we accessed to analyze police fatalities is far from complete. It is worth noting that it currently does not include any officer information (e.g., number of officers, tenure, past complaints, racial group, gender, etc.), and it excludes any deaths that resulted from non-shootings as well as non-fatal shootings. The Citizens Police Data Project collects and publishes information about police misconduct in Chicago, and some officers have been accused of misconduct over 100 times, and 1 out of 5 allegations are related to the use of excessive force. We could not locate a data set that includes officer information.

Looking ahead

Through our analysis of police shootings and law-enforcement funding, we hope that more people will be empowered to analyze current events with a more critical, data-driven eye. We also encourage the community to recognize the current limitations to our available data, be inspired to create space for conversations, and push toward movements that demand greater data transparency and availability from law enforcement agencies. As the old adage goes, “What gets measured, gets improved.” If the data on deadly encounters with police remains incomplete, incomprehensive and difficult to obtain, it will remain challenging to enact effective procedures to reduce use of force and improve trust between civilians and police.

How to reproduce and expand upon our analyses

The built-in visualizations in the SQL Analytics workspace helped to quickly derive insight from this data. We have provided a collection of notebooks that contains all of the queries used to generate our findings and visualizations. We encourage you to expand upon our analyses and share the results with your family, friends and colleagues. Further, we will host a tech talk on December 10, 2020, at 9:00 AM PST to show you how to create these dashboards in SQL Analytics.

References

¹“What the data says about police shootings,” Nature, September 4, 2019
²“Not just ‘a few bad apples’: U.S. police kill civilians at much higher rates than other countries,” Prison Policy Initiative, June 5, 2020.
³“How The Washington Post is examining police shootings in the United States,” The Washington Post, July 7, 2016.
⁴“State Population By Race, Ethnicity Data,” Governing: The Future of States and Localities.
⁵“State and Local Finance Initiative: Police and Corrections Expenditures, 2011 to present.
⁶“NYPD Study: Implicit Bias Training Changes Minds, Not Necessarily Behavior,” NPR, September 10, 2020.
⁷“Police Violence Map,” Mapping Police Violence.
⁸“Mobile Fact Sheet,” Pew Research Center, June 12, 2019.
⁹“Shooting of Jeremy Mardis,” Wikipedia.
¹⁰“Body-Worn Cameras in Law Enforcement Agencies, 2016,” Bureau of Justice Statistics, November 2018.
¹¹“Police Are Killing Fewer People In Big Cities, But More In Suburban And Rural America,” FiveThirtyEight, June 1, 2020.
¹²“How police departments got billions of dollars of tactical military equipment,” Marketplace, June 12, 2020.
¹³“Gun Ownership by State 2020,” World Population Review.
¹⁴“People with Untreated Mental Illness 16 Times More Likely to Be Killed by Law Enforcement,” Treatment Advocacy Center.
¹⁵“Understanding Police Enforcement: A 911 Data Analysis,” Vera Institute of Justice.
¹⁶“The Daily Crisis Cops Aren’t Trained to Handle,” Governing, May 2016.
¹⁷“Police have shot people experiencing a mental health crisis. Who should you call instead?,” USA Today, September 18, 2020.

Try Databricks for free. Get started today.

The post Fatal Force: Exploring Police Shootings With SQL Analytics appeared first on Databricks.

↧

How to Train XGBoost With Spark

November 16, 2020, 11:00 am

≫ Next: Key Sessions for AWS Customers at Data + AI Summit Europe 2020

≪ Previous: Fatal Force: Exploring Police Shootings With SQL Analytics

XGBoost is currently one of the most popular machine learning libraries and distributed training is becoming more frequently required to accommodate the rapidly increasing size of datasets. To utilize distributed training on a Spark cluster, the XGBoost4J-Spark package can be used in Scala pipelines but presents issues with Python pipelines. This article will go over best practices about integrating XGBoost4J-Spark with Python and how to avoid common problems.

Best practices: Whether to use XGBoost

This article assumes that the audience is already familiar with XGBoost and gradient boosting frameworks, and has determined that distributed training is required. However, it is still important to briefly go over how to come to that conclusion in case a simpler option than distributed XGBoost is available.

While trendy within enterprise ML, distributed training should primarily be only used when the data or model memory size is too large to fit on any single instance. Currently, for a large majority of cases, distributed training is not required. However, after the cached training data size exceeds 0.25x the instance’s capacity, distributed training becomes a viable alternative. As XGBoost can be trained on CPU as well as GPU, this greatly increases the types of applicable instances. But before just increasing the instance size, there are a few ways to avoid this scaling issue, such as transforming the training data at the hardware level to a lower precision format or from an array to a sparse matrix.

Most other types of machine learning models can be trained in batches on partitions of the dataset. But if the training data is too large and the model cannot be trained in batches, it is far better to distribute training rather than skip over a section of the data to remain on a single instance. So when distributed training is required, there are many distributed framework options to choose from.

When testing different ML frameworks, first try more easily integrable distributed ML frameworks if using Python. For sticking with gradient boosted decision trees that can be distributed by Spark, try PySpark.ml or MLlib. The “Occam’s Razor” principle of philosophy can also be applied to system architecture: simpler designs that provide the least assumptions are often correct. But XGBoost has its advantages, which makes it a valuable tool to try, especially if the existing system runs on the default single-node version of XGBoost. Migration to a non-XGBoost system, such as LightGBM, PySpark.ml, or scikit-learn, might cause prolonged development time. It should also be used if its accuracy is significantly better than the other options, but especially if it has a lower computational cost. For example, a large Keras model might have slightly better accuracy, but its training and inference time may be much longer, so the trade-off can cost more than a XGBoost model, enough to justify using XGBoost instead.

	Requires XGBoost	Does not require XGBoost
Non-Distributed Training	XGBoost	Scikit-learn, LightGBM
Distributed Training	XGBoost4J-Spark	PySpark.ml, MLlib

Table 1: Comparison of Gradient Boosted Tree Frameworks

Best practices: System design

Figure 1. Sample XGBoost4J-Spark Pipelines in PySpark or Scala

One way to integrate XGBoost4J-Spark with a Python pipeline is a surprising one: don’t use Python. The Databricks platform easily allows you to develop pipelines with multiple languages. The training pipeline can take in an input training table with PySpark and run ETL, train XGBoost4J-Spark on Scala, and output to a table that can be ingested with PySpark in the next stage. MLflow also supports both Scala and Python, so it can be used to log the model in Python or artifacts in Scala after training and load it into PySpark later for inference or to deploy it to a model serving applications.

If there are multiple stages within the training job that do not benefit from the large number of cores required for training, it is advisable to separate the stages and have smaller clusters for the other stages (as long as the difference in cluster spin-up time would not cause excessive performance loss). As an example, the initial data ingestion stage may benefit from a Delta cache enabled instance, but not benefit from having a very large core count and especially a GPU instance. Meanwhile, the training stage would be the reverse in that it might need a GPU instance and while not benefiting from a Delta cache enabled instance.

There are several considerations when configuring Databricks clusters for model training and selecting which type of compute instance:
– When multiple distributed model training jobs are submitted to the same cluster, they may deadlock each other if submitted at the same time. Therefore, it is advised to have dedicated clusters for each training pipeline.
– Autoscaling should be turned off so training can be tuned for a certain set amount of cores but autoscaling will have a varied number of cores available.
– Select a cluster where the memory capacity is 4x the cached data size due to the additional overhead handling the data. This is because, typically, the overhead and operations will cause 3x data consumption, which would place memory consumption optimally at 75%.
– Be sure to select one of the Databricks ML Runtimes as these come preinstalled with XGBoost, MLflow, CUDA and cuDNN.

Best practices: Hardware

XGBoost supports both CPU or GPU training. While there can be cost savings due to performance increases, GPUs may be more expensive than CPU only clusters depending on the training time. However, a recent Databricks collaboration with NVIDIA with an optimized fork of XGBoost showed how switching to GPUs gave a 22x performance boost and an 8x reduction in cost. RAPIDS is a collection of software libraries built on CUDA-X AI which provides high-bandwidth memory speed and GPU parallelism through simple Python APIs. RAPIDS accelerates XGBoost and can be installed on the Databricks Unified Analytics Platform. To set up GPU training, first start a Spark cluster with GPU instances (more information about GPU clusters here), and switching the code between CPU and GPU training is simple, as shown by the following example:

For CPU-based training:

xgb_reg = xgboost.XGBRegressor(..., tree_method='hist')

For GPU-based training:

xgb_reg = xgboost.XGBRegressor(..., tree_method='gpu_hist')

However, there can be setbacks in using GPUs for distributed training. First, the primary reason for distributed training is the large amount of memory required to fit the dataset. GPUs are more memory constrained than CPUs, so it could be too expensive at very large scales. This is often overcome by the speed of GPU instances being fast enough to be cheaper, but the cost savings are not the same as an increase in performance and will diminish with the increase in number of required GPUs.

Best practices: Hardware cost example

Performance increases do not have the same increase in cost savings. For example, NVIDIA released the cost results of GPU accelerated XGBoost4J-Spark training where there was a 34x speed-up, there was only a 6x cost saving (note that these experiment’s results were not run on Databricks).

Type	Cluster	Hardware	# of Instances	Instance Type	AWS EC2 Cost per Hour	AWS EMR Cost per Hour	Train Time in Minutes	Training Costs
GPU	AWS	4 x V100	2	p3.8xlarge	$12.24	$0.27	14	$5.81
CPU	AWS	2 x 8 cores	4	r5a.4xlarge	$0.904	$0.226	456	$34.37

This experiment was run with 190 GB of training data, meaning that following the 4x memory rule, it should preferably have a memory limit of at least 760 GB. The 8 V100 GPUs only hold a total of 128 GB yet XGBoost requires that the data fit into memory. However, this was worked around with memory optimizations from NVIDIA such as a dynamic in-memory representation of data based on data sparsity. But with 4 r5a.4xlarge instances that have a combined memory of 512 GB, it can more easily fit all the data without requiring other optimizations. 512 GB is lower than the preferred amount of data, but can still work under the memory limit depending on the particular dataset as the memory overhead can depend on additional factors such as how it is partitioned or the data format.

Note also that these cost estimates do not include labor costs. If training is run only a few times, it may save development time to simply train on a CPU cluster that doesn’t require additional libraries to be installed or memory optimizations for fitting the data onto GPUs. However, if model training is frequently run, it may be worth the time investment to add hardware optimizations. This example also doesn’t take into account CPU optimization libraries for XGBoost such as Intel DAAL (*not included in the Databricks ML Runtime nor officially supported) or showcase memory optimizations available through Databricks.

Best practices: PySpark wrappers

There are plenty of unofficial open-source wrappers available to either install or use as a reference when creating one. Most are based on PySpark.ml.wrapper and use a Java wrapper to interface with the Scala library in Python. However, be aware that XGBoost4J-Spark may push changes to its library that are not reflected in the open-source wrappers. An example of one such open-source wrapper that is later used in the companion notebook can be found here. Databricks does not officially support any third party XGBoost4J-Spark PySpark wrappers.

Solutions to Common Problems

Multithreading — While most Spark jobs are straightforward because distributed threads are handled by Spark, XGBoost4J-Spark also deploys multithreaded worker processes. For a cluster with E executors of C cores, there will be E*C available cores, so the number of threads should not exceed E*C
Careful — If this is not set, training may not start or may suddenly stop
Be sure to run this on a dedicated cluster with the Autoscaler off so you have a set number of cores
Required — To tune a cluster, you must be able to set threads/workers for XGBoost and Spark and have this be reliably the same and repeatable

XGBoost uses num_workers to set how many parallel workers and nthreads to the number of threads per worker. Spark uses spark.task.cpus to set how many CPUs to allocate per task, so it should be set to the same as nthreads. Here are some recommendations:

Set 1-4 nthreads and then set num_workers to fully use the cluster
- Example: For a cluster with 64 total cores, spark.tasks.cpus being set to 4, and nthreads set to 4, num_workers would be set to 16
Monitor the cluster during training using the Ganglia metrics. Watch for memory overutilization or CPU underutilization due to nthreads being set too high or low.
- If memory usage is too high: Either get a larger instance or reduce the number of XGBoost workers and increase nthreads accordingly
- If the CPU is overutilized: The number of nthreads could be increased while workers decrease
- If the CPU is underutilized, it most likely means that the number of XGBoost workers should be increased and nthreads decreased.
- The following table shows a summary of these techniques:

	Memory usage too high	Memory usage nominal
CPU overutilized	Larger instance or reduce num_workers and increase nthreads	Decrease nthreads
CPU underutilized	Reduce num_workers and increase nthreads	Increase num_workers, decrease nthreads
CPU nominal	Larger memory instance or reduce num_workers and increase nthreads	“Everything’s nominal and ready to launch here at Databricks”

Figure 2. Table of best tuning practices

There can be multiple issues dealing with sparse matrices. It’s important to calculate the memory size of the dense matrix for when it’s converted because the dense matrix can cause a memory overload during the conversion. If the data is very sparse, it will contain many zeroes that will allocate a large amount of memory, potentially causing a memory overload. For example, the additional zeros with float32 precision can inflate the size of a dataset from several gigabytes to hundreds of gigabytes. XGBoost by default treats a zero as “missing”, so configuring setMissing can correct this issue by setting the missing value to another value other than zero. For more information about dealing with missing values in XGBoost, see the documentation here.

XGBoost will automatically repartition the input data to the number of XGBoost workers, so the input data should be repartitioned in Spark to avoid the additional work in repartitioning the data again. As a hypothetical example, when reading from a single CSV file, it is common to repartition the DataFrame. It may be repartitioned to four partitions by the initial ETL but when XGBoost4J-Spark will repartition it to eight to distribute to the workers. This causes another data shuffle that will cause performance loss at large data sizes. So always calculate the number of workers and check the ETL partition size, especially because it’s common to use smaller datasets during development so this performance issue wouldn’t be noticed until late production testing.

When dealing with HIPAA compliance for medical data, XGBoost and XGBoost4J-Spark use unencrypted over-the-wire communication protocols that are normally not in compliance to use. Make sure to follow the instructions on how to create a HIPAA-compliant Databricks cluster and deploy XGBoost on AWS Nitro instances in order to comply with data privacy laws. While there are efforts to create more secure versions of XGBoost, there is not yet an established secure version of XGBoost4J-Spark.

There are integration issues with the PySpark wrapper and several other libraries to be made aware of. MLflow will not log with mlflow.xgboost.log_model but rather with mlfow.spark.log_model. It cannot be deployed using Databricks Connect, so use the Jobs API or notebooks instead. When using Hyperopt trials, make sure to use Trials, not SparkTrials as that will fail because it will attempt to launch Spark tasks from an executor and not the driver. Another common issue is that many XGBoost code examples will use Pandas, which may suggest converting the Spark dataframe to a Pandas dataframe. But this will invalidate the reason to use distributed XGBoost since the conversion will localize the data on the driver node, which is not supposed to fit on a single node if requiring distributed training.

If XGBoost4J-Spark fails during training, it stops the SparkContext, forcing the notebook to be reattached or stopping the job. If this occurs during testing, it’s advisable to separate stages to make it easier to isolate the issue since re-running training jobs is lengthy and expensive. The error causing training to stop may be found in the cluster stderr logs, but if the SparkContext stops, the error may not show in the cluster logs. In those cases, monitor the cluster while it is running to find the issue.

Conclusion

XGBoost4J-Spark can be tricky to integrate with Python pipelines but is a valuable tool to scale training. To create a wrapper from scratch will delay development time, so it’s advisable to use open source wrappers. If you decide that distributed training is required and that XGBoost is the best algorithm for the application, avoid overcomplication and excessive wrapper building to support multiple languages being used in your pipeline. Use MLflow and careful cluster tuning when developing and deploying production models. Using the methods described throughout this article, XGBoost4J-Spark can now be quickly used to distribute training on big data for high performance and accuracy predictions.

GET THE NOTEBOOK

Try Databricks for free. Get started today.

The post How to Train XGBoost With Spark appeared first on Databricks.

↧

Key Sessions for AWS Customers at Data + AI Summit Europe 2020

November 16, 2020, 5:14 pm

≫ Next: Key Sessions for Microsoft Azure Customers at Data + AI Summit Europe 2020

≪ Previous: How to Train XGBoost With Spark

Databricks and Summit Gold Sponsor AWS Present on a wide variety of topics at this year’s premier data and AI event.

Amazon Web Services (AWS) is sponsoring Data + AI Summit Europe 2020 and our work with AWS continues to make Databricks better integrated with other AWS services, making it easier for our customers to drive huge analytics outcomes.

As part of Data + AI Summit, we want to highlight some of the top sessions of interest for AWS customers. The sessions below are relevant to customers interested in or using Databricks on the AWS cloud platform, demonstrating key service integrations. If you have questions about your AWS platform or service integrations, visit the AWS booth at Data + AI Summit.

Building a Cloud Data Lake with Databricks and AWS

How are customers building enterprise data lakes on AWS with Databricks? Learn how
Databricks complements the AWS data lake strategy and how Databricks integrates with
numerous AWS Data Analytics services such as Amazon Athena and AWS Glue.

Moving to Databricks & Delta

wetter.com builds analytical B2B data products that heavily use Spark and AWS technologies for data processing and analytics. In this session Carsten Herbe will explain why wetter.com moved from AWS EMR to Databricks and Delta, and share their experiences from different angles like architecture, application logic and user experience. The session will cover how security, cluster configuration, resource consumption and workflow changed by using Databricks clusters as well as how using Delta tables simplified application logic and data operations.
From a data scientist/engineers perspective Carsten will show how daily analytical and development work has improved. Many of these points can also be applied when moving from some other Spark platform like Hadoop to Databricks
.

Speaker: Carsten Herbe, wetter.com GmbH

Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS SageMaker for Enterprise AI Scenarios

Transformer-based pre-trained language models such as BERT, XLNet, Roberta and Albert significantly advance the state-of-the-art of NLP and open doors for solving practical business problems with high-performance transfer learning. However, operationalizing these models with production-quality continuous integration/ delivery (CI/CD) end-to-end pipelines that cover the full machine learning life cycle stages of train, test, deploy and serve while managing associated data and code repositories is still a challenging task. In this presentation, the Outreach team will demonstrate how we use MLflow and AWS Sagemaker to productionize deep transformer-based NLP models for guided sales engagement scenarios at the leading sales engagement platform, Outreach.io.

Outreach will share their experiences and lessons learned in the following areas:

A publishing/consuming framework to effectively manage and coordinate data, models and artifacts (e.g., vocabulary file) at different machine learning stages
A new MLflow model flavor that supports deep transformer models for logging and loading the models at different stages
A design pattern to decouple model logic from deployment configurations and model customizations for a production scenario using MLProject entry points: train, test, wrap, deploy.
A CI/CD pipeline that provides continuous integration and delivery of models into a Sagemaker endpoint to serve the production usage

This session will be of great interest to a broad business community who are actively working on enterprise AI scenarios and digital transformation.

Speakers: Yong Liu, Outreach.io and Andrew Brooks, Outreach.io

From Hadoop to Delta Lake and Glue for Streaming and Batch

The modern data customer wants data now. Batch workloads are not going anywhere, but at Scribd the future of our data platform requires more and more streaming data sets. As such our new data platform built around AWS, Delta Lake, and Databricks must simultaneously support hundreds of batch workloads, in addition to dozens of new data streams, stream processing, and stream/ad-hoc workloads.
In this session we will share the progress of our transition into a streaming cloud-based data platform, and how some key technology decisions like adopting Delta Lake have unlocked previously unknown capabilities our internal customers enjoy. In the process, we’ll share some of the pitfalls and caveats from what we have learned along the way, which will help your organization adopt more data streams in the future.

Speakers: R Tyler Croy, Scribd

Join Us!

We look forward to connecting with you at Data + AI Summit Europe 2020! If you have questions about Databricks running on AWS, please visit the AWS virtual booth at Data + AI Summit.

For more information about Databricks on AWS including customer case studies and integration details, go to databricks.com/aws.

Try Databricks for free. Get started today.

The post Key Sessions for AWS Customers at Data + AI Summit Europe 2020 appeared first on Databricks.

↧

Key Sessions for Microsoft Azure Customers at Data + AI Summit Europe 2020

November 17, 2020, 9:44 am

≫ Next: Databricks Partner Executive Summit at Data + AI Summit 2020 Europe

≪ Previous: Key Sessions for AWS Customers at Data + AI Summit Europe 2020

Databricks, diamond sponsor Microsoft and Azure Databricks customers to present keynotes and breakout sessions at Data + AI Summit Europe.

Data + AI Summit Europe is the free virtual event for data teams — data scientists, engineers and analysts — who will tune in from all over the world to share best practices, discover new technologies, connect and learn. We are excited to have Microsoft as a Diamond sponsor, bringing Microsoft and Azure Databricks customers together for a lineup of great keynotes and sessions.

Rohan Kumar, Corporate Vice President of Azure Data, returns as a keynote speaker for the third year in a row, along with presenters from a number of Azure Databricks customers including Unilever, Daimler, Henkel, SNCF, Fluvius, Kaizen Gaming and DataSentics. Below are some of the top sessions to add to your agenda:

KEYNOTE
Keynote from Phinean Woodward
Unilever: During the WEDNESDAY MORNING KEYNOTE, 8:30 AM – 10:30 AM (GMT)
Phinean Woodward, Head of Architecture, Information and Analytics, Unilever

KEYNOTE
Keynote from Stephan Schwarz
Daimler: During the THURSDAY MORNING KEYNOTE, 8:30 AM – 10:30 AM (GMT)
Stephan Schwarz, Production Planning: Manager Smart Data Processing (Mercedes Operations), Daimler

KEYNOTE
Keynote from Rohan Kumar
Microsoft: During the THURSDAY MORNING KEYNOTE, 8:30 AM – 10:30 AM (GMT)
Rohan Kumar, Corporate Vice President, Azure Data, Microsoft
Sarah Bird, AI Research and Products, Microsoft

Responsible ML is the most talked about field in AI at the moment. With the growing importance of ML, it is even more important for us to exercise ethical AI practices and ensure that the models we create live up to the highest standards of inclusiveness and transparency. Join Rohan Kumar, as he talks about how Microsoft brings cutting-edge research into the hands of customers to make them more accountable for their models and responsible in their use of AI. For the AI community, this is an open invitation to collaborate and contribute to shape the future of Responsible ML.

Building the Next-gen Digital Meter Platform for Fluvius

Fluvius WEDNESDAY, 3:35 PM – 4:05 PM (GMT)

Fluvius is the network operator for electricity and gas in Flanders, Belgium. Their goal is to modernize the way people look at energy consumption using a digital meter that captures consumption and injection data from any electrical installation in Flanders ranging from households to large companies. After full roll-out there will be roughly 7 million digital meters active in Flanders collecting up to terabytes of data per day. Combine this with regulation that Fluvius has to maintain a record of these readings for at least 3 years, we are talking petabyte scale. delaware BeLux was assigned by Fluvius to set up a modern data platform and did so on Azure using Databricks as the core component to collect, store, process and serve these volumes of data to every single consumer and beyond in Flanders. This enables the Belgian energy market to innovate and move forward. Maarten took up the role as project manager and solution architect.

Building a MLOps Platform Around MLflow to Enable Model Productionalization in Just a Few Minutes

DataSentics WEDNESDAY, 3:35 PM – 4:05 PM (GMT)

Getting machine learning models to production is notoriously difficult: it involves multiple teams (data scientists, data and machine learning engineers, operations, …), who often does not speak to each other very well; the model can be trained in one environment but then productionalized in completely different environment; it is not just about the code, but also about the data (features) and the model itself… At DataSentics, as a machine learning and cloud engineering studio, we see this struggle firsthand – on our internal projects and client’s projects as well.

To address the issue, we decided to build a dedicated MLOps platform, which provides the necessary tooling, automations and standards to speed up and robustify the model productionalization process. The central piece of the puzzle is MLflow, the leading open-source model lifecycle management tool, around which we develop additional functionality and integrations to other systems – in our case primarily the Azure ecosystem (e.g. Azure Databricks, Azure DevOps or Azure Container Instances). Our key design goal is to reduce the time spent by everyone involved in the process of model productionalization to just a few minutes.

The Pill for Your Migration Hell

Microsoft WEDNESDAY, 4:45 PM – 5:15 PM (GMT)

This is the story of a great software war. Migrating Big Data legacy systems always involve great pain and sleepless nights. Migrating Big Data systems with Multiple pipelines and machine learning models only adds to the existing complexity. What about migrating legacy systems that protect Microsoft Azure Cloud Backbone from Network Cyber Attacks? That adds pressure and immense responsibility. In this session, we will share our migration story: Migrating a machine learning-based product with thousands of paying customers that process Petabytes of network events a day. We will talk about our migration strategy, how we broke down the system into migrationable parts, tested every piece of every pipeline, validated results, and overcome challenges. Lastly, we share why we picked Azure Databricks as our new modern environment for both Data Engineers’ and Data Scientists’ workloads.

End to End Supply Chain Control Tower

Henkel THURSDAY, 11:00 AM – 11:30 AM (GMT)

When you look at traditional ERP or management systems, they are usually used to manage the supply chain originating from either the point of Origin or point of destination which all are primarily physical locations. And for these, you have several processes like order to cash, source to pay, physical distribution, production etc.

Which allows us to do cross-functional data-based applications, one such example is like Digital Sales and operations planning. Which is a very powerful tool to align operations execution with our financial goals.

All this is possible, by using a future proof big data architecture and strong partnership with their respective suppliers such as microsoft and Databricks.

Bank Struggles Along the Way for the Holy Grail of Personalization: Customer 360

DataSentics THURSDAY, 1:35 PM – 2:05 PM (GMT)

Ceska sporitelna is one of the largest banks in Central Europe and one it’s main goals is to improve the customer experience by weaving together the digital and traditional banking approach. The talk will focus on the real world (both technical and enterprise) challenges during shifting the vision from powerpoint slides into production: Implementing Spark and Databricks-centric analytics platform in the Azure cloud combined with a on-prem data lake in the EU-regulated financial environment Forming a new team focused on solving use cases on top of C360 in the 10 000+ employee enterprise Demonstrating this effort on real use cases such as client risk scoring using both offline and online data Spark and its MLlib as an enabler for employing hundreds of millions of client interactions personalized omni-channel CRM campaigns.

Personalization Journey: From Single Node to Cloud Streaming

Kaizen THURSDAY, 1:35 PM – 2:05 PM (GMT)

Building a Streaming Data Pipeline for Trains Delays Processing

SNCF THURSDAY, 2:10 PM – 2:40 PM (GMT)

SNCF (French National Railway Company) has distributed a network of beacons over its 32,000 km of train tracks, triggering a flow of events at each train passage. In this talk, we will present how we built real-time data processing on these data, to monitor traffic and map the propagation of train delays. During the presentation we will demonstrate how to build an end to end solution, from ingestion to exposure.

Try Databricks for free. Get started today.

The post Key Sessions for Microsoft Azure Customers at Data + AI Summit Europe 2020 appeared first on Databricks.

↧

Databricks Partner Executive Summit at Data + AI Summit 2020 Europe

November 18, 2020, 8:30 am

≫ Next: Databricks and Coursera Launch Data Science Specialization for Data Analysts

≪ Previous: Key Sessions for Microsoft Azure Customers at Data + AI Summit Europe 2020

This week’s Partner Executive Summit, held in concert with Data + AI Summit 2020 Europe, is a feature event for our 500+ partners globally, and we love to share how partners are critical to making a positive impact on our joint customers with their solutions and integrations. Databricks success simply would not and could not happen without these partners. A number of Databricks executives were part of the agenda, including Ali Ghodsi, CEO and Michael Hoff, SVP of business development and partners, who hosted the event.

The event served as a great forum to hear directly from Databricks customers about their data and AI journey with Databricks and our partners to deliver business value. These customer stories were great examples of how joint engagements with our partners make a difference in accelerating time to value and consistently implementing best practices.

Databricks announced two new programs at Partner Executive Summit, and our partner awards to recognize the partners who made exceptional contributions to the Databricks ecosystem. We are excited to share those announcements and award winners below.

Partner Badges

The Databricks partner team officially launched a Digital Badging Program for Partners. As partners complete Databricks training paths and accreditations, they can now earn digital badges that enable them to showcase their skills with their peers and customers via LinkedIn and in their email signatures. This program was in a beta release up to this point, with over 6,000 badges awarded!

Why digital badges matter to our partners:

Visible recognition of Databricks expertise and thought leadership
Ability to show proven capabilities to implement their data and AI vision on Databricks
Makes it easier for customers to select partners with recognized expertise

Partners and customers can learn more about this program on the Partner Badges web page.

Databricks Partner program badges

Demand Generation Kits

Databricks announced a new Hadoop Migration demand generation kit, the first in a series of kits for partners to build awareness and engage customers for a specific type of solution. These kits include all the marketing content needed to get started, and given the success of Hadoop migration programs to date we are excited to provide this first kit to partners to engage customers and maximize the opportunity.

Awards

Always a highlight for the Databricks partner community, we announced the following awards in recognition of special partner achievements.

Innovation Awards

The C&SI partner Innovation award went to Wejo, received by Paul Reynolds. Wejo is using the Databricks Unified Data Analytics platform to build a global Automotive Industry Data Marketplace.

The ISV partner Innovation award went to Fivetran, received by Logan Welley. Fivetran demonstrated excellent engineering work in support of the Lakehouse architecture and SQL Analytics. Read more from Fivetran on how they are doubling down on the Lakehouse architecture and SQL analytics with Databricks.

Rising Star

The C&SI partner Rising Star award went to New Signature, received by Tom Zglobicki. New Signature grew quickly and were recently acquired by Cognizant, a reflection on their success with Azure Databricks and positioning Databricks at the heart of a modern data platform.

The ISV partner Rising Star award went to Immuta, received by Chris Devaney. Immuta is critical to accelerating data governance for several customers around the world with their ability to provide advanced fine-grained access control and security features via native integration with Databricks. You can read Immuta’s press release here.

Customer Impact Awards

The ISV partner Customer Impact award went to Tableau, received by Brian Matsubara. Tableau is a key partner for our work on the Lakehouse architecture and our recently announced SQL Analytics. We share a common vision to help our many joint customers gain valuable insight from massive volumes of data in the Lakehouse architecture.

The C&SI Northern Europe partner award for Customer Impact went to BJSS, received by Simon Dale. BJSS were selected for their business experience and digitally transforming Health and Pharma customers to deliver a modern cloud data architecture that prioritizes ROI.

The C&SI Central Europe partner award for Customer Impact went to Accenture, received by Nick Millman. Accenture was selected for their guidance to a key client on a complex multi-cloud solution.

The C&SI Southern Region partner award for Customer Impact Award went to OpenValue, for the second year in a row, received by Matthieu Reynier. OpenValue has made a consistent impact on a number of key accounts across France.

EMEA Partner of the Year

The EMEA Partner of the Year award went to TCS and was received by Ranjan Mishra. TCS has embraced our Strategic Partner Programme across multiple client successes.

Global Partner of the Year

Our Global Partner of the Year award went to Avanade, recognized in June at our global Summit, received by Alan Grogan, EMEA Data Modernisation Lead. Avanade is heavily engaged in building a Center of Excellence and a number of Joint Business Accelerators.

Partner Champions

This year we awarded special recognition to members of our Partner Champions group, the top technical evangelists in the community, which has grown leaps and bounds this year under the stewardship of Ryan Simpson. These six of the European Partner Champions were recognized for their excellent evangelism of the Databricks platform, through both customer implementations and community support.

Darren Fuller from Elastacloud for his Leadership and Guidance to the Azure UK Community
Ofer Habushi from Talend for championing technical alignment and education programs resulting in success at a number of client engagements
Eli Kling from Cognizant for his leadership and direction on standardising their Data Science and Data Engineering with Databricks
Pierre Troufflard from WanDisco for championing cost and risk reduction for Hadoop migrations
Simon Whiteley from Advancing Analytics for his renowned Video Series
Dael Willamson from Avanade for his constant presence and support across many of our large Azure Databricks clients

Thank you to the entire community of our 500+ partners for another great event!

If you are interested in learning more about becoming a partner, please visit the Databricks Partner page.

Try Databricks for free. Get started today.

The post Databricks Partner Executive Summit at Data + AI Summit 2020 Europe appeared first on Databricks.

↧

Databricks and Coursera Launch Data Science Specialization for Data Analysts

November 18, 2020, 10:00 am

≫ Next: New Features to Accelerate the Path to Production With the Next Generation Data Science Workspace

≪ Previous: Databricks Partner Executive Summit at Data + AI Summit 2020 Europe

Earlier this year, Databricks made a massive investment in training by providing free self-paced courses to all of our customers. Databricks furthers this investment by partnering with Coursera to provide Massive Open Online Courses (MOOC) training to the larger data community. Together we launched a new three-course specialization, Data Science with Databricks for Data Analysts, targeted at data analysts seeking to learn how to work with evolving architectures and develop skills in data science. This specialization is focused at enabling data analysts to work with larger data sets and to help them get closer to their company’s data.

Data analysts pull data, summarize it, and build dashboards to help businesses make critical decisions. However, the key to their success lies in having access to that data. Conversely, data scientists generate data that drive insights, predictive modeling, and can perform more advanced data exploration techniques, such as machine learning, to create more useful data. By leveraging data science practices data analysts can fill in the blanks for data they wouldn’t normally have.

Databricks and Coursera have partnered together to create a specialization targeted at precisely those data analysts who seek to expand their toolbox beyond using spreadsheets and writing SQL to query in data warehouses and relational databases. This specialization will enable data analysts to leverage existing skills to learn advanced technologies (e.g., Apache Spark and Delta Lake) not traditionally linked to their role. By completing this specialization, an individual with the current job role of a data analyst should have the real-world skills needed to be considered an entry-level data scientist, such as probability and statistics, machine learning, and programming with Python. Developing these skills will put individuals in a position to work with larger and more complex data sets of the future.

The specialization consists of approximately thirty hours of training across three courses. It starts with Apache Spark SQL for Data Analysts, which teaches data analysts how to apply their SQL data analysis skills in a Data Lakehouse architecture with Apache Spark and Databricks. Then follows with Data Science Fundamentals for Data Analysts, which covers data science concepts in an easy to understand manner to ease the transition for data analysts. It concludes with Applied Data Science for Data Analysts, which focuses on projects to put those new data science skills in practice. The first two courses are currently available on coursera.org. The third course will be available on November 30, 2020.

Each course is available now for all learners to audit. Click the following link below to start your 7-day free trial with Coursera and get started on Data Science with Databricks for Data Analysts.

GET STARTED!

Try Databricks for free. Get started today.

The post Databricks and Coursera Launch Data Science Specialization for Data Analysts appeared first on Databricks.

↧