
Personalizing the Customer Experience with Recommendations


Go directly to the Recommendation notebooks referenced throughout this post.

Retail made a giant leap forward in the adoption of e-commerce in 2020: e-commerce as a percentage of total retail saw multiple years of progress in a single year. Meanwhile, COVID, lockdowns and economic uncertainty have completely disrupted how we engage and retain customers. Companies need to rethink personalization to effectively compete in this period of rapid change.

In 2020, we saw a rapid shift in consumer behavior, not just in the adoption of e-commerce. Store brands saw increased consumer adoption. Staple goods saw a resurgence in demand. Customers not only rethought their relationships with specific products but retailers as well, spreading their spend across multiple retail partners. The relevance of in-store displays, features and promotions was challenged by leading retailers capable of driving 35% of their revenue through personalized recommendations.

Providing an experience that makes customers feel understood helps retailers stand out from the crowd of mass merchants and build loyalty. This was true before COVID, but shifting consumer preferences make it even more critical for retail organizations. With research showing that the cost of acquiring a new customer can be as much as five times that of retaining an existing one, organizations looking to succeed in the new normal must continue to build deeper connections with existing customers in order to retain a solid consumer base. There is no shortage of options and incentives for today’s consumers to rethink long-established patterns of spending.

Personalization is a must to compete

Presented with overwhelming choice, consumers expect the brands they buy and the organizations they buy them from to deliver an experience aligned with their needs and preferences. Personalization, once presented as an exotic vision for what could be, is increasingly becoming the baseline expectation for consumers continuously connected, short on time and seeking value through an increasingly more complex set of considerations.

Brands that deliver personalized experiences can compete with these retail giants. In a pre-COVID analysis of consumer attitudes and spending patterns, 80% of participants indicated they were more likely to do business with a company offering personalized experiences. Those individuals were found to be 10-times more likely to make 15 or more purchases per year with organizations they believe understood and responded to their personal needs and preferences. In a separate survey, 50% of participants reported seeing the brands they buy as extensions of themselves, driving deeper, more sustained customer loyalty for the brands that get it right.

As COVID forced a shift in consumer focus towards value, availability, quality, safety and community, brands most attuned to changing needs and sentiments saw customers switch from rivals. While some segments gained business and many lost, organizations that had already begun the journey towards improved customer experience saw better outcomes, closely mirroring patterns observed in the 2007-2008 recession (Figure 1).

How CX leaders outperform laggards, even in a down market.

Figure 1. CX leaders outperform laggards, even in a down market, a visualization of the Forrester Customer Experience Performance Index as provided by McKinsey & Company (link)

As we look towards what will be the new normal, it is clear that the personalization of customer experiences will remain a key focus for many B2C and even B2B organizations. Increasingly, market analysts are recognizing customer experience as a disruptive force enabling upstart organizations to upend long-established players. Organizations focused on competing through product, placement, pricing and promotion alone will find themselves under pressure from competitors capable of delivering more value to consumers for each dollar received.

Focus on the customer journey

Personalization starts with a careful exploration of the customer journey. This starts as customers come to recognize a need and move to identify a product to fulfill it. It then shifts towards the selection of a channel for its purchase and concludes with consumption, disposal and the possible repeat purchase. The path is varied and not simply linear, but with every stage, there is an opportunity for value to be created for the customer.

The digitization of each stage provides the customer with flexibility in terms of how they will engage and provides the organization with the ability to assess the health of their model. While part and parcel of the online and mobile experience, digitization can be extended to the in-store, in-transit and even the in-home stages of the customer journey with appropriate considerations of transparency, privacy and value-add for the customer.

Bringing the digital experience into the store can be used to facilitate a personalized engagement

Figure 2. Bringing the digital experience into the store can be used to facilitate a personalized engagement

This customer-generated data as well as third-party inputs provide the organization with the information they need to refine their understanding of the customer and their unique journeys. Individual motivations, goals and preferences can now be better understood and more personalized experiences delivered to the customer.

The examination of the customer journey, its digitization and the analysis of the data generated by it are used to create a feedback loop through which the customer experience improves. To get this loop in motion and sustain it over time, a clear vision for competing on customer experience must be expressed. This vision must bring together the entire organization, not just marketing and their IT-enablers, around shared goals. These goals must then be translated into incentive structures that encourage cross-departmental collaboration and innovation. The organization’s journey towards delivering differentiating customer experiences is fundamentally a journey towards becoming a learning organization, one which puts insights into motion, celebrates the learnings that come with failure, and rapidly scales its successes to drive customer value.

Leverage customer preferences

Personalization is multifaceted, but at various points in the customer journey, organizations will have the opportunity to select content, products, promotions to be presented to the customer. In these moments, we can take into consideration past feedback from customers to select the right items to present. Customer feedback doesn’t always come to us in the form of 1-to-5 star ratings or written reviews. Feedback may be expressed through interactions, dwell times, product searches, and purchase events. Careful consideration of how customers interact with various assets and how these interactions may be interpreted as expressions of preference can unlock a wide range of data with which you can enable personalization.
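As a concrete illustration, the sketch below shows one way implicit feedback events might be rolled up into implied ratings. The event names and weights are hypothetical; the weighting scheme is exactly the kind of assumption a business would need to tune rather than a prescribed standard.

import pandas as pd

# Hypothetical event log: one row per customer interaction with an item
events = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'item_id': ['a', 'b', 'a', 'a', 'c'],
    'event':   ['view', 'purchase', 'view', 'view', 'add_to_cart']
})

# Illustrative weights expressing how strongly each interaction type signals preference
weights = {'view': 1.0, 'add_to_cart': 3.0, 'purchase': 5.0}
events['weight'] = events['event'].map(weights)

# Roll interactions up into an implied rating per user-item pair
implied_ratings = (events.groupby(['user_id', 'item_id'])['weight']
                         .sum()
                         .reset_index(name='implied_rating'))
print(implied_ratings)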

With feedback in hand, we can turn to which items to present. Consider a customer browsing an assortment of recommended products: they click on one, explore alternatives to it, put it into their cart and then explore items frequently bought in combination with it. At each stage of this very narrow slice of the customer’s journey, the customer is interacting with our content with very different goals in mind. The customer’s preferences are unchanged throughout, but their intent leads us to make very different choices about what we might present.

Understand it’s as much art as science

The engines we use to serve content based on customer preferences are known as recommenders. To describe their construction as much art as science would be an understatement. With some recommenders, we focus heavily on the shared preferences of similar customers to expand the range of content we might expose to a given customer. With others, we focus on the properties of the content itself (e.g., product descriptions) and leverage user-specific interactions with related content to quantify the likelihood that an item will resonate with the customer. Each class of recommendation engine orients around a general goal, but within each, there are myriad decisions the business must make to orient its recommendations towards specific goals.

The complexity of these engines and the nature of why we build them are such that any upfront evaluation of their supposed accuracy is suspect. While offline evaluation methods have been proposed and should be employed to ensure that the recommenders we build are not flying off the rails, the reality is that we can only effectively evaluate their ability to assist us in achieving a particular goal by releasing them in limited pilots and assessing customer response. And in those assessments, it’s important to keep in mind that there is no expectation of perfection, only incremental improvement over the prior solution.

Consider tradeoffs between performance & completeness

The primary challenge we must overcome in the assembly of any recommender is scalability. Consider a recommender leveraging user similarities. A small pool of 100,000 users requires the evaluation of approximately 5,000,000,000 user pairs and each of those evaluations may involve a comparison of preferences for each item we might recommend. From a purely technical standpoint, performing this number of calculations is not a problem, but the cost of doing it on a regular basis and within the time-constraints imposed on these systems makes a brute-force evaluation untenable.

It’s for this reason that the technical literature surrounding the development of recommender systems puts a heavy emphasis on approximate similarity techniques. These techniques offer shortcuts that allow us to home in on those users or items most likely to be similar to the objects we are comparing. With these techniques, there is a tradeoff between performance gains and recommendation completeness. So while these techniques are quite technically oriented, there is an important conversation to be had between solution architects and the business stakeholders about the right balance between these two considerations.
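To make that tradeoff concrete, here is a minimal sketch using the MinHash-based locality-sensitive hashing available in Spark ML. The tiny user-preference vectors are hypothetical, and the distance threshold and number of hash tables are exactly the knobs that the conversation between architects and stakeholders would need to settle; a Databricks or PySpark session is assumed to be available as spark.

from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors

# Hypothetical binary user-preference vectors: a 1 marks an item the user interacted with
users = spark.createDataFrame([
    (0, Vectors.sparse(6, [0, 1], [1.0, 1.0])),
    (1, Vectors.sparse(6, [1, 3], [1.0, 1.0])),
    (2, Vectors.sparse(6, [0, 1, 4], [1.0, 1.0, 1.0]))
], ['user_id', 'features'])

# MinHash approximates Jaccard similarity; more hash tables improve recall at higher cost
mh = MinHashLSH(inputCol='features', outputCol='hashes', numHashTables=5)
model = mh.fit(users)

# Approximate self-join keeping only pairs within a Jaccard distance of 0.8,
# rather than scoring all n*(n-1)/2 user pairs exhaustively
pairs = model.approxSimilarityJoin(users, users, 0.8, distCol='jaccard_distance')
pairs.filter('datasetA.user_id < datasetB.user_id').show()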

Jumpstart your efforts with solution accelerators

It goes without saying that careful management of resources goes a long way toward keeping the cost of ongoing recommender development, training and deployment in check. Databricks is purpose-built for scalable development on cloud infrastructure that allows organizations to rapidly provision and then deprovision resources for exactly this reason.

To help our customers understand how they might use Databricks to develop various recommenders, we’ve made available a series of detailed notebooks as part of our Solution Accelerators program. Each notebook leverages a real-world dataset to show how raw data may be transformed into one or more recommender solutions.

The focus of these notebooks is on education. No one should take the techniques demonstrated here as the only way or even the preferred way to solve a specific recommendation challenge. Still, in wrestling with the issues described above, we hope that some portions of the presented code will assist our customers in tackling their own recommender needs.

Collaborative Filter Recommenders

Content-Based Recommenders

You can also view our on-demand webinar around personalization and recommendations.

The post Personalizing the Customer Experience with Recommendations appeared first on Databricks.


Natively Query Your Delta Lake With Scala, Java, and Python


Today, we’re happy to announce that you can natively query your Delta Lake with Scala and Java (via the Delta Standalone Reader) and Python (via the Delta Rust API). Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark™ APIs. The project has been deployed at thousands of organizations and processes exabytes of data each week, becoming an indispensable pillar in data and AI architectures. More than 75% of the data scanned on the Databricks Platform is on Delta Lake!

In addition to Apache Spark, Delta Lake has integrations with Amazon Redshift, Redshift Spectrum, Athena, Presto, Hive, and more; you can find more information in the Delta Lake Integrations. For this blog post, we will discuss the most recent release of the Delta Standalone Reader and the Delta Rust API that allows you to query your Delta Lake with Scala, Java, and Python without Apache Spark.

Delta Standalone Reader

The Delta Standalone Reader (DSR) is a JVM library that allows you to read Delta Lake tables without the need to use Apache Spark; i.e., it can be used by any application that cannot run Spark. The motivation behind creating DSR is to enable better integrations with other systems such as Presto, Athena, Redshift Spectrum, Snowflake, and Apache Hive. For Apache Hive, the connector was rewritten using DSR so that the new release removes the embedded Spark dependency.

To use DSR with sbt, include delta-standalone as well as hadoop-client and parquet-hadoop as dependencies:

libraryDependencies ++= Seq(
  "io.delta" %% "delta-standalone" % "0.2.0",
  "org.apache.hadoop" % "hadoop-client" % "2.7.2",
  "org.apache.parquet" % "parquet-hadoop" % "1.10.1")

Using DSR to query your Delta Lake table

Below are some examples of how to query your Delta Lake table in Java.

Reading the Metadata

After importing the necessary libraries, you can determine the table version and associated metadata (number of files, size, etc.) as noted below.

import io.delta.standalone.DeltaLog;
import io.delta.standalone.Snapshot;
import io.delta.standalone.data.CloseableIterator;
import io.delta.standalone.data.RowRecord;

import org.apache.hadoop.conf.Configuration;

DeltaLog log = DeltaLog.forTable(new Configuration(), "[DELTA LOG LOCATION]");

// Returns the current snapshot
log.snapshot();

// Returns the version 1 snapshot
log.getSnapshotForVersionAsOf(1);

// Returns the snapshot version
log.snapshot().getVersion();

// Returns the number of data files
log.snapshot().getAllFiles().size();

Reading the Delta Table

To query the table, open a snapshot and then iterate through the table as noted below.

// Open the latest snapshot and create a closeable iterator over its rows
Snapshot snapshot = log.snapshot();
CloseableIterator<RowRecord> iter = snapshot.open();

RowRecord row = null;
int numRows = 0;

// Schema of the Delta table is {long, long, string}
while (iter.hasNext()) {
      row = iter.next();
      numRows++;

      Long c1 = row.isNullAt("c1") ? null : row.getLong("c1");
      Long c2 = row.isNullAt("c2") ? null : row.getLong("c2");
      String c3 = row.getString("c3");
      System.out.println(c1 + " " + c2 + " " + c3);
}

// Sample output
175 0 foo-1
176 1 foo-0
177 2 foo-1
178 3 foo-0
179 4 foo-1

Requirements

DSR has the following requirements:

  • JDK 8 or above
  • Scala 2.11 or Scala 2.12
  • Dependencies on parquet-hadoop and hadoop-client

For more information, please refer to the Java API docs or Delta Standalone Reader wiki.

Delta Rust API

delta.rs is an experimental interface to Delta Lake for Rust. This library provides low-level access to Delta tables and is intended to be used with data processing frameworks like datafusion, ballista, rust-dataframe, and vega. It can also act as the basis for native bindings in other languages such as Python, Ruby, or Golang.

QP Hou and R. Tyler Croy of Scribd, which uses Delta Lake to power the world’s largest digital library, are the initial creators of this API. The Delta Rust API has quickly gained traction in the community, with a special callout to the community-driven Azure support added within weeks of the initial release.

How Scribd Uses Delta Lake to Enable the World’s Largest Digital Library.

Reading the Metadata (Cargo)

You can use the API or CLI to inspect the files of your Delta Lake table as well as provide the metadata information; below are sample commands using the CLI via cargo. Once the 0.2.0 release of delta.rs has been published, `cargo install deltalake` will provide the delta-inspect binary.

To inspect the files, check out the source and use delta-inspect files:

❯ cargo run --bin delta-inspect files ./tests/data/delta-0.2.0

part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet
part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet
part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet

To inspect the metadata, use delta-inspect info:

❯ cargo run --bin delta-inspect info ./tests/data/delta-0.2.0
DeltaTable(./tests/data/delta-0.2.0)
version: 3
metadata: GUID=22ef18ba-191c-4c36-a606-3dad5cdf3830, name=None, description=None, partitionColumns=[], configuration={}
min_version: read=1, write=2
files count: 3

Reading the Metadata (Python)

You can also use delta.rs to query Delta Lake from Python via the delta.rs Python bindings.

To obtain the Delta Lake version and files, use the .version() and .files() methods respectively.

from deltalake import DeltaTable
dt = DeltaTable("../rust/tests/data/delta-0.2.0")

# Get the Delta Lake Table version
dt.version()

# Example Output
3

# List the Delta Lake table files
dt.files()

# Example Output
['part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet', 'part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet', 'part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet']

Reading the Delta Table (Python)

To read a Delta table using the delta.rs Python bindings, you first convert the Delta table into a PyArrow table and then into a pandas DataFrame.

# Import Delta Table
from deltalake import DeltaTable

# Read the Delta Table using the Rust API
dt = DeltaTable("../rust/tests/data/simple_table")

# Create a Pandas Dataframe by initially converting the Delta Lake
# table into a PyArrow table
df = dt.to_pyarrow_table().to_pandas()
# Query the pandas DataFrame
df

# Example output
0	5
1	7
2	9

You can also use Time Travel and load a previous version of the Delta table by specifying the version number with the load_version method.

# Load version 2 of the table
dt.load_version(2)

Notes

Currently, you can also query your Delta Lake table through delta.rs using Python and Ruby, and the underlying Rust APIs should be straightforward to integrate into Golang or other languages too. Refer to delta.rs for more information. There are lots of opportunities to contribute to delta.rs, so be sure to check out the open issues: https://github.com/delta-io/delta.rs/issues

Discussion

We’d like to thank Scott Sandre and the Delta Lake Engineering team for creating the Delta Standalone Reader and QP Hou and R. Tyler Croy for creating the Delta Rust API. Try out the Delta Standalone Reader and Delta Rust API today – no Spark required!

Join us in the Delta Lake community through our Public Slack Channel (Register here | Log in here) or Public Mailing list.

--

Try Databricks for free. Get started today.

The post Natively Query Your Delta Lake With Scala, Java, and Python appeared first on Databricks.

How to Manage Python Dependencies in PySpark


Controlling the environment of an application is often challenging in a distributed computing environment – it is difficult to ensure all nodes have the desired environment to execute, it may be tricky to know where the user’s code is actually running, and so on.

Apache Spark™ provides several standard ways to manage dependencies across the nodes in a cluster via script options such as --jars, --packages, and configurations such as spark.jars.* to make users seamlessly manage the dependencies in their clusters.

In contrast, PySpark users often ask how to do it with Python dependencies – there have been multiple issues filed such as SPARK-13587, SPARK-16367, SPARK-20001 and SPARK-25433. One simple example that illustrates the dependency management scenario is when users run pandas UDFs.

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('double')
def pandas_plus_one(v: pd.Series) -> pd.Series:
    return v + 1

spark.range(10).select(pandas_plus_one("id")).show()

If the required dependencies are not installed on all of the other nodes, it fails and complains that PyArrow and pandas have to be installed.

Traceback (most recent call last):
  ...
ModuleNotFoundError: No module named 'pyarrow'

One straightforward method is to use script options such as --py-files or the spark.submit.pyFiles configuration, but this functionality cannot cover many cases, such as installing wheel files or when the Python libraries depend on C and C++ libraries such as pyarrow and NumPy.
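For reference, the runtime equivalent of --py-files looks like the sketch below; the archive name is hypothetical, and, as noted, this route only works for pure-Python code.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Distribute a zip of pure-Python modules to the executors; the archive name is
# hypothetical, and this approach cannot handle packages with native code such
# as pyarrow or NumPy.
spark.sparkContext.addPyFile("pure_python_deps.zip")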

This blog post introduces how to control Python dependencies in Apache Spark comprehensively. Most of the content will be also documented in the upcoming Apache Spark 3.1 as part of Project Zen. Please refer to An Update on Project Zen: Improving Apache Spark for Python Users for more details.

Using Conda

Conda is one of the most widely-used Python package management systems. PySpark users can directly use a Conda environment to ship their third-party Python packages by leveraging conda-pack, a command line tool that creates relocatable Conda environments. It is supported in all types of clusters in the upcoming Apache Spark 3.1. In Apache Spark 3.0 or lower versions, it can be used only with YARN.

The example below creates a Conda environment to use on both the driver and executors and packs it into an archive file. This archive file captures the Conda environment for Python and stores both the Python interpreter and all of its relevant dependencies.

conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz

After that, you can ship it together with scripts or in the code by using the --archives option or spark.archives configuration (spark.yarn.dist.archives in YARN). It automatically unpacks the archive on executors.

In the case of a spark-submit script, you can use it as follows:

export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py

Note that PYSPARK_DRIVER_PYTHON above is not required for cluster modes in YARN or Kubernetes.

For a pyspark shell:

export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
pyspark --archives pyspark_conda_env.tar.gz#environment

If you’re on a regular Python shell or notebook, you can try it as shown below:

import os
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
    "pyspark_conda_env.tar.gz#environment").getOrCreate()

Using Virtualenv

Virtualenv is a Python tool to create isolated Python environments. Since Python 3.3, a subset of its features has been integrated into Python as a standard library under the venv module. In the upcoming Apache Spark 3.1, PySpark users can use virtualenv to manage Python dependencies in their clusters by using venv-pack in a similar way as conda-pack. In the case of Apache Spark 3.0 and lower versions, it can be used only with YARN.

A virtual environment to use on both the driver and executors can be created as demonstrated below. It packs the current virtual environment into an archive file that contains both the Python interpreter and the dependencies.

python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install pyarrow pandas venv-pack
venv-pack -o pyspark_venv.tar.gz

You can directly pass/unpack the archive file and enable the environment on executors by leveraging the --archives option or spark.archives configuration (spark.yarn.dist.archives in YARN).

For spark-submit, you can use it by running the command as follows. Also, notice that PYSPARK_DRIVER_PYTHON is not necessary in Kubernetes or YARN cluster modes.

export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment app.py

In the case of a pyspark shell:

export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
pyspark --archives pyspark_venv.tar.gz#environment 

For regular Python shells or notebooks:

import os
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
    "pyspark_venv.tar.gz#environment").getOrCreate()

Using PEX

PySpark can also use PEX to ship the Python packages together. PEX is a tool that creates a self-contained Python environment. This is similar to Conda or virtualenv, but a .pex file is executable by itself.

The following example creates a .pex file for the driver and executor to use. The file contains the Python dependencies specified with the pex command.

pip install pyarrow pandas pex
pex pyspark pyarrow pandas -o pyspark_pex_env.pex

This file behaves similarly to a regular Python interpreter:

./pyspark_pex_env.pex -c "import pandas; print(pandas.__version__)"
1.1.5

However, a .pex file does not include a Python interpreter itself under the hood, so all nodes in a cluster should have the same Python interpreter installed.

In order to transfer and use the .pex file in a cluster, you should ship it via the spark.files configuration (spark.yarn.dist.files in YARN) or --files option because they are regular files instead of directories or archive files.

For application submission, you run the commands as shown below. PYSPARK_DRIVER_PYTHON is not needed for cluster modes in YARN or Kubernetes.

export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./pyspark_pex_env.pex
spark-submit --files pyspark_pex_env.pex app.py

For the interactive pyspark shell, the commands are almost the same:

export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./pyspark_pex_env.pex
pyspark --files pyspark_pex_env.pex

For regular Python shells or notebooks:

import os
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = "./pyspark_pex_env.pex"
spark = SparkSession.builder.config(
    "spark.files",  # 'spark.yarn.dist.files' in YARN.
    "pyspark_pex_env.pex").getOrCreate()

Conclusion

In Apache Spark, Conda, virtualenv and PEX can be leveraged to ship and manage Python dependencies.

  • Conda: one of the most commonly used package management systems. In Apache Spark 3.0 and lower versions, Conda is supported only with YARN clusters; it works with all other cluster types in the upcoming Apache Spark 3.1.
  • Virtualenv: users can use it without an extra installation because it is a built-in library in Python. Virtualenv works only with YARN clusters in Apache Spark 3.0 and lower versions; all other cluster types support it in the upcoming Apache Spark 3.1.
  • PEX: it can be used with any type of cluster in any version of Apache Spark, although it is arguably less widely used and requires the same Python interpreter to be installed on all nodes, whereas Conda and virtualenv do not.

These package management systems can handle any Python packages that --py-files or spark.submit.pyFiles configuration cannot cover. Users can seamlessly ship not only pandas and PyArrow but also other dependencies to interact together when they work with PySpark.

In the case of Databricks notebooks, we not only provide an elegant mechanism by having a well-designed UI but also allow users to directly use pip and Conda in order to address this Python dependency management. Try out these today for free on Databricks.

--

Try Databricks for free. Get started today.

The post How to Manage Python Dependencies in PySpark appeared first on Databricks.

Bayesian Modeling of the Temporal Dynamics of COVID-19 Using PyMC3


In this post, we look at how to use PyMC3 to infer the disease parameters for COVID-19. PyMC3 is a popular probabilistic programming framework that is used for Bayesian modeling. Two popular methods to accomplish this are the Markov Chain Monte Carlo (MCMC) and Variational Inference methods. The work here looks at using the currently available data for the infected cases in the United States as a time-series and attempts to model this using a compartmental probabilistic model. We want to try to infer the disease parameters and eventually estimate R_0 using MCMC sampling.

The work presented here is for illustration purposes only; real-life Bayesian modeling requires far more sophisticated tools than what is shown here. Various assumptions regarding population dynamics are made here, which may not be valid for large non-homogeneous populations. Also, interventions such as social distancing and vaccinations are not considered here.

This post will cover the following:

  1. Compartmental models for Epidemics
  2. Where the data comes from and how it is ingested
  3. The SIR and an overview of the SIRS model
  4. Bayesian Inference for ODEs with PyMC3
  5. Inference Workflow on Databricks

Compartmental models for epidemics

For an overview of compartmental models and their behavior, please refer to this notebook in Julia.

Compartmental models are a set of Ordinary Differential Equations (ODEs) for closed populations, meaning that individuals move between compartments while the total population neither gains nor loses members. These aim to model disease propagation in compartments of populations that are homogeneous. As you can imagine, these assumptions may not be valid in large populations. It is also important to point out that vital statistics such as the number of births and deaths in the population may not be included in this model. The following list mentions some of the compartmental models along with the various compartments of disease propagation; however, this is not an exhaustive list by any means.

  • Susceptible Infected (SI)
  • Susceptible Infected Recovered (SIR)
  • Susceptible Infected Susceptible (SIS)
  • Susceptible Infected Recovered Susceptible (SIRS)
  • Susceptible Infected Recovered Dead (SIRD)
  • Susceptible Exposed Infected Recovered (SEIR)
  • Susceptible Exposed Infected Recovered Susceptible (SEIRS)
  • Susceptible Exposed Infected Recovered Dead (SEIRD)
  • Maternally-derived Immunity Susceptible Infectious Recovered (MSIR)
  • SIDARTHE

The last one listed above is more recent, specifically targets COVID-19 and may be worth a read for those interested. Real-world disease modeling often involves more than just the temporal evolution of disease stages, since many of the assumptions associated with compartments are violated. To understand how the disease propagates, we would want to look at the spatial discretization and the evolution of the progression of the disease through the population. An example of a framework that models this spatio-temporal evolution is GLEAM (Fig. 1).

Real-world epidemic modeling (spatio-temporal dynamics).

Fig. 1

Tools such as GLEAM use population census data and mobility patterns to understand how people move geographically. GLEAM divides the globe into spatial grids of roughly 25 km x 25 km. There are broadly two types of mobility: global or long-range mobility and local or short-range mobility. Long-range mobility mostly involves air travel, and as such airports are considered central hubs for disease transmission. Travel by sea is another significant factor, and therefore naval ports are another type of access point. Along with the mathematical models listed above, this provides a stochastic framework that can be used to run millions of simulations to draw inferences about parameters and make forecasts.

The data is obtained from the Johns Hopkins CSSE Github page where case counts are regularly updated:

CSSE GitHub

Confirmed cases

Number of deaths

The data is available as CSV files which can be read in through Python pandas.
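A minimal sketch of that ingestion step is shown below; the file path reflects the CSSE repository layout at the time of writing and may need to be adjusted if the repository has changed.

import pandas as pd

# Global confirmed-cases time series from the Johns Hopkins CSSE repository
url = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_confirmed_global.csv")
confirmed = pd.read_csv(url)

# Select the US rows and drop the location columns, leaving the cumulative case time series
us_cases = (confirmed[confirmed['Country/Region'] == 'US']
            .drop(columns=['Province/State', 'Country/Region', 'Lat', 'Long'])
            .sum())
print(us_cases.tail())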

The SIR and SIRS models

SIR model

The SIR model is given by the set of three Ordinary Differential Equations (ODEs) shown below. There are three compartments in this model.

     \begin{gather*} \frac{dS}{dt} = -\lambda\frac{SI}{N} \\ \frac{dI}{dt} = \lambda\frac{SI}{N} - \mu I \\ \frac{dR}{dt} = f \mu I \end{gather*}

Here ‘S’, ‘I’ and ‘R’ refer to the susceptible, infected and recovered portions of the population of size ‘N’ such that

S + I + R = N

The assumption here is that once you have recovered from the disease, lifetime immunity is conferred on an individual. This is not the case for a lot of diseases and hence may not be a valid model.

\lambda is the rate of infection and \mu is the rate of recovery from the disease. The fraction of people who recover from the infection is given by ‘f’ but for the purpose of this work, ‘f’ is set to 1 here. We end up with an Initial Value Problem (IVP) for our set of ODEs where I(0) is assumed to be known from the case counts at the beginning of the pandemic and S(0) can be estimated as N – I(0). Here we make the assumption that the entire population is susceptible. Our goal is to accomplish the following:

  • Use Bayesian Inference to make estimates about \lambda and \mu
  • Use the above parameters to estimate I(t) for any time ‘t’
  • Compute R_0

As already pointed out, \lambda is the disease transmission coefficient. This depends on the number of interactions, in unit time, with infectious people. This in turn depends on the number of infectious people in the population.

\lambda = contact rate x transmission probability

The force of infection or risk at any time ‘t’ is defined as \lambda\frac{I_t}{N}. Also, \mu is the fraction of the infected that recover in unit time; \mu^{-1} is hence the mean recovery time. The ‘basic reproduction number’ R_0 is the average number of secondary cases produced by a single primary case (examples: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6002118/). R_0 is also defined in terms of \lambda and \mu as the ratio given by

R_0 = \frac{\lambda}{\mu} (Assumes S_0 is close to 1)

When R_0 > 1, we have a proliferation of the disease and we have a pandemic. With the recent efforts to vaccinate the vulnerable, this has become even more relevant to understand. If we vaccinate a fraction ‘p’ of the population so that (1-p)R_0 < 1, we can halt the spread of the disease.
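Rearranging the vaccination condition gives the familiar herd-immunity threshold; the value of R_0 below is purely illustrative.

     \begin{gather*} (1-p)R_0 < 1 \implies p > 1 - \frac{1}{R_0} \\ R_0 = 2.5 \implies p > 1 - \frac{1}{2.5} = 0.6 \end{gather*}

In other words, with an illustrative R_0 of 2.5, more than 60% of the population would need to be vaccinated to halt the spread.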

SIRS model

The SIRS model, shown below, makes no assumption of lifetime immunity once an infected person has recovered; instead, one moves from the recovered compartment back to the susceptible compartment. As such, this is probably a better low-fidelity baseline model for COVID-19, where it is suggested that acquired immunity is short-term. The only additional parameter here is \gamma, which refers to the rate at which immunity is lost and a recovered individual moves from the recovered pool back to the susceptible pool.

     \begin{gather*} \frac{dS}{dt} = -\lambda\frac{SI}{N} + \gamma R \\ \frac{dI}{dt} = \lambda\frac{SI}{N} - \mu I \\ \frac{dR}{dt} = \mu I - \gamma R \end{gather*}

For this work, only the SIR model is implemented, and the SIRS model and its variants are left for future work.

Using PyMC3 to infer the disease parameters

We can discretize the SIR model using a first-order or a second-order temporal differentiation scheme; PyMC3 then marches the solution forward in time using these discretized equations. The parameters \lambda and \mu can then be fitted using the Monte Carlo sampling procedure.

First-order scheme

     \begin{gather*} \frac{S_t - S_{t-1}}{\Delta t} = -\lambda\frac{SI}{N} \\ \frac{I_t - I_{t-1}}{\Delta t} = \lambda\frac{SI}{N} - \mu I \\ \frac{R_t - R_{t-1}}{\Delta t} = \mu I \end{gather*}

Second-order scheme

     \begin{gather*} S_t = \left(4 - \frac{2\Delta t \lambda I}{N}\right)\frac{S_{t-1}}{3} - \frac{S_{t-2}}{3} \\ I_t = \left(\frac{2\Delta t \lambda S_{t-1}}{N} - 2\Delta t \mu + 4\right)\frac{I_{t-1}}{3} - \frac{I_{t-2}}{3} \\ R_t = \frac{2\Delta t \mu I_{t-1} + 4R_{t-1} - R_{t-2}}{3} \end{gather*}
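For intuition on what these discretized updates do, here is a minimal NumPy sketch that marches the first-order scheme forward in time with purely illustrative parameter values (PyMC3 performs the equivalent integration internally when sampling).

import numpy as np

def sir_first_order(S0, I0, N, lam, mu, dt, n_steps):
    # Explicit first-order update of the SIR equations
    S, I, R = [S0], [I0], [N - S0 - I0]
    for _ in range(n_steps):
        dS = -lam * S[-1] * I[-1] / N
        dI = lam * S[-1] * I[-1] / N - mu * I[-1]
        dR = mu * I[-1]
        S.append(S[-1] + dt * dS)
        I.append(I[-1] + dt * dI)
        R.append(R[-1] + dt * dR)
    return np.array(S), np.array(I), np.array(R)

# Illustrative values only: lambda = 0.4/day and mu = 0.1/day imply R0 = 4
S, I, R = sir_first_order(S0=999_000, I0=1_000, N=1_000_000,
                          lam=0.4, mu=0.1, dt=1.0, n_steps=100)
print(I.max())  # peak number of simultaneously infected individuals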

The DifferentialEquation method in PyMC3

While we can provide the discretization manually with our choice of a higher-order discretization scheme, this quickly becomes cumbersome and error-prone, not to mention computationally inefficient. Fortunately, PyMC3 has an ODE module to help do exactly this. We can use the DifferentialEquation method from the ODE module, which takes as input a function that returns the value of the set of ODEs as a vector, the time steps where the solution is desired, the number of states (corresponding to the number of equations) and the number of parameters to be solved for. One of the disadvantages of this method is that it tends to be slow. The recommended best practice is to use the ‘sunode’ module (see below) in PyMC3. For example, the same problem took 5.4 minutes using DifferentialEquation vs. 16 s with sunode for 100 samples, 100 tuning samples and 20 time points.

self.sir_model_non_normalized = DifferentialEquation(
    func=self.SIR_non_normalized,
    times=self.time_range[1:],
    n_states=2,
    n_theta=2,
    t0=0)

def SIR_non_normalized(self, y, t, p):
    ds = -p[0] * y[0] * y[1] / self.covid_data.N
    di = p[0] * y[0] * y[1] / self.covid_data.N - p[1] * y[1]
    return [ds, di]

The syntax for using the sunode module is shown below. While there are some syntactic differences, the general structure is the same as that of DifferentialEquation.

import sunode
import sunode.wrappers.as_theano

def SIR_sunode(t, y, p):
    return {
        'S': -p.lam * y.S * y.I,
        'I': p.lam * y.S * y.I - p.mu * y.I}
        
    ...
    ...
    
    sir_curves, _, problem, solver, _, _ = sunode.wrappers.as_theano.solve_ivp(
        y0={ # Initial conditions of the ODE
            'S': (S_init, ()),
            'I': (I_init, ()),
        },
        params={
                # Parameters of the ODE, specify shape
            'lam': (lam, ()),
            'mu': (mu, ()),
            '_dummy': (np.array(1.), ())  # currently, sunode throws an error
        },                                # without this
            # RHS of the ODE
        rhs=SIR_sunode,
            # Time points of the solution
        tvals=times,
        t0=times[0],
    )

The inference process for an SIR model

In order to perform inference on the parameters we seek, we start by selecting reasonable priors for the disease parameters. Based on our understanding of the behavior of these parameters, a lognormal distribution is a reasonable prior. Ideally, we want the mean of this lognormal prior to be in the neighborhood of where we expect the desired parameters to lie. For good convergence and solutions, it is also essential that the data likelihood is appropriate (domain expertise!). It is common to pick one of the following as the likelihood.

  • Normal distribution
  • Lognormal distribution
  • Student’s t-distribution

We obtain the Susceptible (S(t)) and Infectious (I(t)) numbers from the ODE solver and then sample for values of \lambda and \mu as shown below.

with pm.Model() as model4:
    sigma = pm.HalfCauchy('sigma', self.likelihood['sigma'], shape=1)
    lam = pm.Lognormal('lambda', self.prior['lam'], self.prior['lambda_std'])  # 1.5, 1.5
    mu = pm.Lognormal('mu', self.prior['mu'], self.prior['mu_std'])            # 1.5, 1.5
    res, _, problem, solver, _, _ = sunode.wrappers.as_theano.solve_ivp(
        y0={
            'S': (self.S_init, ()), 'I': (self.I_init, ()),},
        params={
            'lam': (lam, ()), 'mu': (mu, ()), '_dummy': (np.array(1.), ())},
        rhs=self.SIR_sunode,
        tvals=self.time_range,
        t0=self.time_range[0]
    )
    if likelihood['distribution'] == 'lognormal':
        I = pm.Lognormal('I', mu=res['I'], sigma=sigma, observed=self.cases_obs_scaled)
    elif likelihood['distribution'] == 'normal':
        I = pm.Normal('I', mu=res['I'], sigma=sigma, observed=self.cases_obs_scaled)
    elif likelihood['distribution'] == 'students-t':
        I = pm.StudentT("I", nu=likelihood['nu'],  # likelihood distribution of the data
                        mu=res['I'],               # likelihood mean: the predictions from the SIR model
                        sigma=sigma,
                        observed=self.cases_obs_scaled)
    R0 = pm.Deterministic('R0', lam / mu)

    trace = pm.sample(self.n_samples, tune=self.n_tune, chains=4, cores=4)
    data = az.from_pymc3(trace=trace)

The inference workflow with PyMC3 on Databricks

Since developing a model such as this for estimating the disease parameters using Bayesian inference is an iterative process, we would like to automate as much of it as possible. It is probably a good idea to instantiate a class of model objects with various parameters and have automated runs. Fortunately, automating the execution is quite easy to accomplish using Databricks notebooks: each cell contains a combination of the desired parameters (see below) and, once executed, outputs the plots without user intervention. It is also a good idea to save the trace information and inference metrics, along with other metadata, for each run. A file format such as NetCDF can be used for this, although it could be as simple as using the Python built-in persistence module ‘shelve’.

covid_obj = COVID_data('US', Population=328.2e6)
covid_obj.get_dates(data_begin='10/1/20', data_end='10/28/20')
sir_model = SIR_model_sunode(covid_obj)
likelihood = {'distribution': 'normal', 
                'sigma': 2}
prior = {'lam': 1.5, 
            'mu': 1.5, 
            'lambda_std': 1.5,
            'mu_std': 1.5 }
sir_model.run_SIR_model(n_samples=500, n_tune=500, likelihood=likelihood, prior=prior)
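As noted above, the trace from each run is worth persisting so experiments can be compared later. Below is a minimal sketch using ArviZ and the data object created in the model code earlier; the file name is illustrative.

import arviz as az

# 'data' is the ArviZ InferenceData object built from the PyMC3 trace above
data.to_netcdf("sir_run_us_oct2020.nc")

# Reload a previous run and summarize the sampled parameters
previous_run = az.from_netcdf("sir_run_us_oct2020.nc")
print(az.summary(previous_run, var_names=["lambda", "mu", "R0"]))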

Sample results

These results are purely for illustration purposes and extensive experimentation is needed before meaningful results can be expected from this simulation. The case count for the United States from January to October is shown below (Fig 2).

Example COVID-19 case count visualization generated by PyMC3 on Databricks.

Fig. 2

Fig. 3 shows the results of an inference run where the posterior distributions of \lambda, \mu and R_0 are displayed. One of the advantages of performing Bayesian inference is that the distributions show the mean value estimate along with the Highest Density Interval (HDI) for quantifying uncertainty. It is a good idea to check the trace (at the very least!) to ensure sampling was done properly.

Example results of an inference run displaying the highest density interval using PyMC3 on Databricks.

Fig. 3

Notes and guidelines

Some general guidelines for modeling and inference:

  • Use at least 5000 samples and 1000 samples for tuning
  • For the results shown above, I have used:
    • Mean: \lambda = 1.5, \mu = 1.5
    • Standard deviation: 2.0 for both parameters
  • Sample from 3 chains at least
  • Set target_accept to > 0.85
  • If possible, sample in parallel with cores=n, where ‘n’ is the number of cores available
  • Inspect the trace for convergence
  • Limited time samples have an impact on inference accuracy; it is always better to have more good-quality data
  • Normalize your data, large values are generally not good for convergence

Debugging your model

  • Since the backend for PyMC3 is Theano, the Python print statement cannot be used to inspect the value of a variable. Use theano.printing.Print(DESCRIPTIVE_STRING)(VAR) to accomplish this (see the sketch after this list)
  • Initialize stochastic variables by passing a ‘testval’. This is very helpful to check those pesky ‘Bad Energy’ errors, which are usually due to poor choice of likelihoods or priors. Use Model.check_test_point() to verify this.
  • Use step = pm.Metropolis() for quick debugging; this runs much faster but results in a rougher posterior
  • If the sampling is slow, check your prior and likelihood distributions
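A minimal, self-contained illustration of the Print wrapper mentioned in the first bullet; inside a PyMC3 model you would wrap an intermediate symbolic quantity the same way.

import numpy as np
import theano
import theano.tensor as tt

# Wrap a symbolic quantity with Print so its value is emitted whenever the
# graph is evaluated (e.g., at every sampling step in a PyMC3 model)
x = tt.dvector('x')
total = theano.printing.Print('sum of x')(x.sum())
f = theano.function([x], total)
f(np.array([1.0, 2.0, 3.0]))  # prints the labeled value of the sum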

Conclusion

This post covered the basics of using PyMC3 for obtaining the disease parameters. In a follow-up post, we will look at how to use the Databricks environment and integrate workflow tools such as MLflow for experiment tracking and HyperOpt for hyperparameter optimization.

Try the Notebook

References

  • The work by the Priesemann Group
    • https://github.com/Priesemann-Group/covid_bayesian_mcmc
  • Demetri Pananos work on the PyMC3 page
    • https://docs.pymc.io/notebooks/ODE_API_introduction.html

--

Try Databricks for free. Get started today.

The post Bayesian Modeling of the Temporal Dynamics of COVID-19 Using PyMC3 appeared first on Databricks.

Lakehouse Architecture Realized: Enabling Data Teams With Faster, Cheaper and More Reliable Open Architectures


Databricks was founded under the vision of using data to solve the world’s toughest problems. We started by building upon our open source roots in Apache Spark™ and creating a thriving collection of projects, including Delta Lake, MLflow, Koalas and more. We’ve now built a company with over 1,500 employees helping thousands of data teams with data analytics, data engineering, data science and AI.

This year has been especially challenging for the world, with the COVID-19 epidemic affecting everyone around the globe. We’ve seen the potential of data realized to help with vaccine research and clinical trials, healthcare delivery, hospital patient allocation, disease-spread prediction and more. We hope the challenges brought by 2021 are less intense but still trust that data teams will be at the forefront of solving them.

Lakehouse Architecture

At the beginning of the year, we published a blog post analyzing a trend we’ve been seeing: the move towards the Lakehouse architecture. This architecture, based on open formats, combines the flexibility of data lakes built on low-cost cloud object stores with the ACID transactions, schema enforcement and performance typically associated with data warehouses. While Delta Lake, one of the key open source technologies enabling the Lakehouse architecture, was launched in 2019, advancements in Delta Lake, Apache Spark and the Databricks Unified Analytics Platform have continued to increase the capabilities and performance of the data Lakehouse architecture. You can read more about the underlying technical challenges in our VLDB research paper Delta Lake: High-Performance ACID Storage over Cloud Object Stores and our CIDR paper Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics.
 

At the beginning of 2020, there was still a significant gap in the tech stack though: high-performance SQL execution to enable decision-making, analytics, business intelligence (BI), and dashboarding workloads on top of the Lakehouse. This gap existed in two forms: underlying query processing engine performance and a UI to simplify analytics for the average data analyst. Throughout the year, we focused on filling these gaps.

The result of this work was showcased in the November Data + AI Summit Europe keynotes: the new SQL Analytics product enabling low latency, competitive price/performance and high-concurrency access to your data lake. Under the hood, SQL Analytics takes advantage of Delta Engine, which combines the Spark 3.0 vectorized query engine with Databricks enhancements to the query optimization and caching layers. It also includes a web UI that makes it easy for BI teams to query, visualize and dashboard massive amounts of data. Of course, it also supports all your favorite data tools, like Power BI and Tableau, so that your existing tools and processes can immediately start using the data lake as the data source.

Databricks launched the new SQL Analytics product as well as integrations with popular analytics tools like Tableau

SQL Analytics and Tableau screenshots

We believe that this focus on improving performance and enabling data warehousing workloads on data lakes led Gartner to name us a Visionary in the 2020 Gartner Magic Quadrant for Cloud Database Management Systems (DBMS).

To learn more about the Lakehouse architecture capabilities, the enabling technologies and the history that led us to this point, watch our new Data Brew vidcast series.

Open Source Enabling the Lakehouse

Many important advances in the Delta Lake and Apache Spark open source projects have enabled us to realize the Lakehouse architecture, as mentioned earlier. An important milestone happened this year for the community building these open source projects: we celebrated the 10th anniversary of the open source release of Apache Spark and the launch of Spark 3.0, which provides 2x performance improvements and better support for Python and SQL, the most popular interfaces to Spark.

In particular, Project Zen has led to significant Python usability improvements, including better PySpark documentation, PySpark type hints, standardized warnings and exceptions, Pandas UDF enhancements—and more improvements are on the way.

Learn about the Spark 3.0 enhancements and more in the newly-published Learning Spark 2nd Edition from O’Reilly, which is available for free as an ebook.

In 2020, Databricks and the data community celebrated the 10th anniversary of Apache Spark.

Of course, the Lakehouse is about more than just having a reliable, authoritative data store for big data analytics and great data engineering infrastructure unifying real-time streaming and batch data. The Lakehouse is also about how you take advantage of the structured and unstructured data stored in the data lake. The ability to easily apply machine learning algorithms, perform data science and use artificial intelligence (AI) on top of the Lakehouse is an important characteristic of the architecture, so it also benefits from advances in projects such as MLflow and Koalas.

MLflow, which joined the Linux Foundation this year, released the model registry, making it easier to version models and transition them through their complete lifecycle. The team has also focused on simplicity of development with UI improvements, including syntax highlighting, as well as integration with popular libraries such as scikit-learn, Spacy models, and PyTorch. The PyTorch work was done in collaboration with the PyTorch team at Facebook and discussed in depth at the Data + AI Summit Europe.

They’ve also released support for MLflow plugins to seamlessly integrate third-party code and infrastructure with MLflow. The community has come together and written a variety of plugins, including an ElasticSearch backend for tracking, model deployment to RedisAI, project execution on YARN, artifact storage on SQL Server + Alibaba Cloud OSS and more.
 
Overview of new developments to the open source MLflow project in 2020.

Koalas is an open source project that continues to simplify the experience for Python developers and data scientists to achieve high-scale analysis. Koalas provides a drop-in replacement for pandas, enabling efficient scaling out to hundreds of worker nodes for everyday data science and machine learning. After over one year of development, Koalas 1.0 was released this summer, quickly followed by several other big releases, culminating in Koalas 1.5 in December. Recent releases have achieved close to 85% coverage of the pandas APIs and significantly faster performance by building upon Spark 3.0 APIs.

The growth of Koalas’ API coverage over 2020.

Data + AI Community

Over the last ten years, data analysts, data scientists and others have joined the Spark community and are working in teams to solve complex data challenges. Born out of this community are key open source technologies such as Delta Lake, MLflow, Redash and Koalas – all of which are growing rapidly. This, in addition to the evolution of Databricks, led to the expansion of the content and community around Spark + AI Summit to be the Data + AI Summit, with almost 10,000 data practitioners joining the inaugural European event this past November.

This has truly been a year spent online for the community, with 44 Data + AI Online Meetups, spanning data science, data engineering, data analytics and more. Some of the more popular series include: Getting Started with Delta Lake tech talks, Diving into Delta Lake internals tech talks, the Introduction to Data Analysis for Aspiring Data Scientists workshops, and the Managing the Machine Learning Lifecycle with MLflow workshops.

The global academic community has overcome some major challenges this year in the switch to online learning (aka “Zoom School”). We’ve been honored to be able to support this transition in a small way through our new University Alliance with over 100 universities joining to share best practices, access self-paced courses and workshops and enable students to practice their skills in the cloud on the Databricks Unified Analytics Platform.

On to 2021

We expect this year to be very exciting for data practitioners and for Databricks as a company. As more and more companies adopt the open data Lakehouse architecture pattern, we look forward to working with them to simplify the journey towards solving the world’s toughest problems. Keep an eye on this blog and on our YouTube channel to hear more stories about the accomplishments of your fellow data teams and to learn about the technology advancements that simplify and improve your work in data science, data engineering, data analytics and machine learning.

--

Try Databricks for free. Get started today.

The post Lakehouse Architecture Realized: Enabling Data Teams With Faster, Cheaper and More Reliable Open Architectures appeared first on Databricks.

Over 200K Enrolled in Databricks Certification and Training


More than 200,000 individuals have participated in Databricks certification and training over the past two years, including thousands of partners. In the past year alone, over 75,000 individuals have been trained and over 1,500 customers and partners have also earned their Databricks Academy Certifications. Today, we are pleased to announce new digital badges so you can share your accomplishments with your network!

The need for continuous learning

As we meet with customers and partners at virtual events and support them on proofs of concept (POCs) and production deployments, one of the most common requests is for training to help them innovate even faster. We are frequently asked for more certification programs and accreditation options to help data teams get started quickly and increase the depth of their knowledge and expertise with specific skills.

Databricks digital badges are now available

We are pleased to now offer digital badges for customers and partners. Digital badges are a great way to showcase your knowledge and experience with Databricks across a range of skill levels and competencies.

Databricks digital badge collections
Databricks digital badge collections are available today

Skill up your team

As your team starts their data + AI journey with Databricks, they receive digital badges that can be shared on social media and in email signatures with a link to a website to validate their progress. As your team expands their knowledge and gains more experience, they can receive digital badges that reflect their growing skillset.

The Databricks Academy Accreditation digital badge for Databricks Unified Data Analytics Essentials

Partner training, certification and digital badging

Databricks partners are critical to help organizations understand the business value of Databricks and effectively deliver customer solutions. We are pleased to announce that partners can now earn digital badges to showcase their skills as they complete Databricks training and certification. Digital badges make it easy for customers to see which partners have the capabilities they need to implement their vision with Databricks-powered data and AI solutions.

The Partner Training digital badge for Azure Databricks Developer Essentials
The Partner Training digital badges for Databricks Developer Essentials and Azure Databricks Developer Essentials

Partner Champions

Partners who go above and beyond can be nominated to receive additional recognition as Databricks Partner Champions. Partner Champions are nominated by active Databricks partners for significant contributions to the community and if selected become part of our extended team. They lead and support the solution architecture activities and oversee delivery of their Databricks practice.

The digital badges for Databricks Partner Champions

In addition to the unique digital badge to share on social media and e-mail signatures, Databricks Partner Champions receive a Champion’s jacket with the Champion Badge on the sleeve.

Databricks Partner Champion jacket

Earn your digital badges

Visit Databricks Academy and (for registered Partners) the Partner portal to learn how you can get started on your learning journey and check out the Databricks digital badges available today. If you have completed a Databricks partner training course, watch your inbox for instructions on how to claim your digital badge, as we reach out to past participants.

--

Try Databricks for free. Get started today.

The post Over 200K Enrolled in Databricks Certification and Training appeared first on Databricks.

Learn How Disney+ Built Their Streaming Data Analytics Platform With Databricks and AWS to Improve the Customer Experience

Martin Zapletal, Software Engineering Director at Disney+, is presenting at re:Invent 2020 with the session How Disney+ uses fast data ubiquity to improve the customer experience (you must be registered to watch, but registration is free!).

In this breakout session, Martin will showcase Disney+’s architecture using Databricks on AWS for processing and analyzing millions of real-time streaming events.

Abstract:

Disney+ uses Amazon Kinesis to drive real-time actions like providing title recommendations for customers, sending events across microservices, and delivering logs for operational analytics to improve the customer experience. In this session, you learn how Disney+ built real-time data-driven capabilities on a unified streaming platform. This platform ingests billions of events per hour in Amazon Kinesis Data Streams, processes and analyzes that data in Amazon Kinesis Data Analytics for Apache Flink, and uses Amazon Kinesis Data Firehose to deliver data to destinations without servers or code. Hear how these services helped Disney+ scale its viewing experience to tens of millions of customers with the required quality and reliability.

Disney+ on Databricks at AWS re:Invent 2020

You can also check out the Databricks Quality of Service blog/notebook based on a similar architecture if you want to see how to process streaming and batch data at scale for video/audio streaming services. This solution demonstrates how to process playback events and quickly identify, flag, and remediate audience experience issues.

See Databricks at AWS re:Invent 2020!

--

Try Databricks for free. Get started today.

The post Learn How Disney+ Built Their Streaming Data Analytics Platform With Databricks and AWS to Improve the Customer Experience appeared first on Databricks.

Data Access Governance and 3 Signs You Need it

This is a guest authored post by Heather Devane, content marketing manager, Immuta.

Cloud data analytics is only as powerful as the ability to access that data for use. Yet, the data stewards responsible for managing data governance often find themselves in a holding pattern, waiting for approval from various stakeholders to operationalize data assets based on access control policies and the data protections they’ve created.

Without the right tools to automate data access governance (DAG), these data stewards and data owners typically are responsible for manually determining access rights by granting or restricting data access individually, as well as curating a data pipeline that delivers secure, compliant data. After all, if any regulatory requirements are violated, they can be held personally accountable.

Immuta, the automated data governance solution, integrates with Databricks, the data and AI company, to help customers overcome DAG challenges while maximizing data’s utility and security, so organizations can reap the time and revenue benefits of fast, compliant data access and analysis.

A Guide to Data Access Governance with Immuta and Databricks spells out in detail how exactly this works. But because the issue of scalable, secure data access governance is becoming vitally important, identifying the signs your data stewards and data owners may need Immuta for Databricks can maximize time, money, and most importantly, your data’s value.

If any of these scenarios sound familiar, it might be time to add automated data access governance to your Databricks platform:

1. Your current data governance framework has led to role explosion.

According to Immuta’s research, 80% of data teams use role-based access control (RBAC) or “all-or-nothing” access control policies for identity and access management. Although these approaches are relatively easy to implement when you have one data platform or only a handful of users, the static nature of role-based or all-or-nothing access controls makes them unscalable.

Why is this? RBAC requires data engineers and architects to create roles for each new user or data set. This can quickly lead to hundreds or thousands of roles, which — even with a data governance strategy in place — becomes difficult to keep track of and manage efficiently. Trying to keep up with which permissions correspond to each role is a drag on data stewards’ time, not to mention that it increases the likelihood of implementing inconsistent data access rights across platforms. This can also lead to overly broad or restrictive access permissions, which can introduce a risk of data breaches and inefficiency.

Immuta’s native integration with Databricks uses attribute-based data access controls (ABAC) to grant or restrict access to data at query time based on distinct sets of attributes like title, data location, or data owner. Databricks customers report reducing the number of roles in their systems by 100 times when using Immuta’s attribute-based access control.
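To make the contrast concrete, here is a toy illustration of the ABAC idea in plain Python (this is not Immuta’s policy engine or syntax, and the attribute names are hypothetical). The point is that access is decided at query time from user and data attributes rather than from an ever-growing catalog of roles:

# Toy example only: an attribute-based access decision, not Immuta's implementation.
def can_access(user_attrs, data_attrs):
    # Hypothetical policy: users may read data owned by their department
    # and stored in their region.
    return (user_attrs.get("department") == data_attrs.get("data_owner")
            and user_attrs.get("region") == data_attrs.get("data_location"))

print(can_access({"department": "finance", "region": "us"},
                 {"data_owner": "finance", "data_location": "us"}))  # True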

2. Regulatory requirements are difficult to decipher, and even more difficult to act upon.


Harry S. Truman once said, “If you can’t convince them, confuse them.” Today, the quote takes on new life as a joke about legal jargon. It’s not hard to see why — legal jargon is notoriously difficult to understand, let alone act upon. Yet, many data stewards and data owners are responsible for this very task.

Translating regulatory requirements into data access control policies is even more challenging in today’s increasingly regulated environment. This is due in part to the fact that rules and regulations are frequently amended and updated, requiring data stewards and data owners to proactively and sufficiently update their existing policies. For example, voters elected to amend the CCPA (soon to be the CPRA) just over two years after it was first signed into law. This means data stewards and data owners must understand how the amendments differ from the original legislation and change all relevant policies accordingly — before the law takes effect in January of 2023.

Immuta Regulatory starter policies for data access governance

Immuta simplifies this process with data access governance starter policies that meet the requirements of the CCPA and HIPAA’s Safe Harbor Policy. Additionally, Immuta enables purpose-based access controls, which help data stewards and data owners comply with the GDPR’s purpose limitation. Together, these features streamline regulatory compliance for Databricks users and help safeguard them from potential data privacy penalties. As a result, Databricks users can multiply permitted use cases for cloud analytics by a factor of four, simply by safely unlocking sensitive data.

3. You are responsible for enabling real-time data access rights.

How valuable is data if it can’t be accessed — let alone used — for months after it enters your active directory? Competitive advantage thrives on the ability to derive data insights in near real time, but often, arbitrary or convoluted data management processes and manual data preparation processes, like RBAC or sensitive data discovery, delay time to data access.

Gartner’s analysis of data science teams shows that nearly half of the time spent on data projects is on tasks that take place before even developing models or conducting problem analysis. Considering the number of new and existing data assets available to data teams, this ratio can and should be reversed. Without the right tools, though, data stewards and data owners remain responsible for time to data access, often without the resources to reduce that time.

For Databricks customers, however, Immuta’s native integration streamlines these time-consuming DAG processes and accelerates time to data access. Databricks users report that Immuta’s ability to provide secure, self-service data access reduces typically months-long processes to mere seconds and increases data engineering productivity by 40%.

Databricks and Immuta seamlessly implement automated data access governance in a best-of-breed data analytics platform, empowering data stewards, data owners, and end-users to do more with their data. To learn more about Immuta’s native integration with Databricks, download A Guide to Data Access Governance with Immuta and Databricks.

Experience Databricks with Immuta for yourself by starting a free trial today.

--

Try Databricks for free. Get started today.

The post Data Access Governance and 3 Signs You Need it appeared first on Databricks.

Leveling the Playing Field: HorovodRunner for Distributed Deep Learning Training

This is a guest post authored by Sr. Staff Data Scientist/User Experience Researcher Jing Pan and Senior Data Scientist Wendao Liu of leading health insurance marketplace eHealth.


None generates Taichi;
Taichi generates two complementary forces;
Two complementary forces generate four aggregates;
Four aggregates generate eight trigrams;
Eight trigrams determine myriads of phenomena.

—Classic of Changes

From the Classic of Changes or I Ching’s primitive concept of binary numbers to the origin of the term “algorithm” (from the Latinized name of al-Khwarizmi, the father of algebra, whose treatise on al-jabr also gave us the word “algebra”), every civilization has placed great emphasis on the power of computation. So too does every modern technology company. Mid-cap companies are joining the race with big tech firms to turn out fast iterations of deep learning services. A scrum master in an agile team often asks their engineers, “How long will it take to finish this development?” But the scrum master, shocked by the length of the training period for any deep learning project, has no way to accelerate the project simply by adding more ML engineers, encouraging longer working hours, or setting KPIs.

See the official page for the eHealth session for more information about the program, including downloadable slides and a transcript.

One of the bottlenecks for deep learning projects is the size of the largest GPU machine available on AWS, Azure, or any other cloud provider. Want to increase your batch size, looking to gain training speed without (hopefully) compromising accuracy? Whoops, GPU out of memory. So you apply the universal wisdom of the 80/20 rule and maybe don’t train for so many epochs, among other time-reduction measures (such as fast-converging optimizers… which raises the question of how many optimizers you are going to try). The result? Your ambitious deep learning-based artificial “intelligence” turns out to be an artificial “intellectually challenged.” And at a mid-cap company, there are a thousand more important challenges to prioritize in the budget. Supercomputer? Customized GPU cluster? Maybe in the long-term roadmap, but for now you’re out of luck. Even if your Silicon Valley company manages to build a successful product and is desperate to deploy it in production, you’ll have to contend with California’s wildfires and preventive power shutdowns. These are the kinds of pain points that can keep a smaller company stuck in what feels like the 1980s, while larger tech giants leap into the future.

HorovodRunner benchmark

Ice breaks, however, with good tools—and in this case, it’s as simple as importing HorovodRunner as hvd in Python. HorovodRunner is a general API to run distributed deep learning workloads on a Databricks Spark cluster using Uber’s Horovod framework (Databricks, 2019). In February 2020, we (on behalf of online health insurance broker eHealth Inc.) presented a paper entitled “Benchmark Tests of Convolutional Neural Network and Graph Convolutional Network on HorovodRunner Enabled Spark Clusters” at the First International Workshop on Deep Learning on Graphs: Methodologies and Applications (held in conjunction with the 34th AAAI Conference on Artificial Intelligence, one of the top AI conferences in the world—the full paper is available at https://arxiv.org/abs/2005.05510). Our research showed that Databricks’s HorovodRunner achieves significant lift in scaling efficiency for convolutional neural network-based (CNN-based, hereafter) tasks on both GPU and CPU clusters, but not the original graph convolutional network (GCN) task. On GPU clusters for the Caltech 101 image classification task, the scaling efficiency ranges from 18.5% to 79.7%, depending on the number of GPUs and models. On CPU clusters for the MNIST handwritten digit classification task, it shows a positive lift in image processing speed where the number of processes is 16 or under. We also implemented the Rectified Adam optimizer for the first time in HorovodRunner. The complete code can be found at https://github.com/psychologyphd/horovodRunnerBenchMark_IPython.

Figure 1. CPU cluster performance on image classification on MNIST dataset

Figure 2. GPU cluster performance on image classification on Caltech101 dataset

HorovodRunner implementation tips

In addition to the overall positive lift in benchmark performance, here are some more technical tips on how to get things done right:

  1. Cluster settings: Get a TensorFlow 1 cluster (GPU or CPU). This will enable most Keras Functional API models to run relatively smoothly, except ResNet, which requires TensorFlow 2 (there are too many compatibility issues with TensorFlow 2 at the moment for it to run on HorovodRunner). The cluster should not have auto-scaling enabled. If you want to use Rectified Adam, you need to run an init script as described below. If you want to use TensorFlow 2 anyway, you can use Databricks Runtime 7.x which includes TF2.
  2. Distributed model retrieval: Since the Keras Model API downloads models from GitHub, and GitHub limits how many downloads you can make per second (a limit that HorovodRunner’s processes will easily exceed), we recommend downloading your model to the driver with the Keras Functional API and then uploading it to an S3 or DBFS location. HorovodRunner can then get the model from that location.
  3. Avoid Horovod Timeline: Previous studies have shown that using Horovod Timeline increases overall training time (Databricks, 2019) and leads to no overall increase in training efficiency (Wu et al., 2018). We get time in the following two ways. First, we get the wall-clock total time from right before calling HorovodRunner and after the HorovodRunner object’s run method finishes running, which includes overhead in loading data, loading the model, pickling functions, etc. Second, we get the time it takes to run each epoch (called the epoch time) from the driver’s standard output. Every time a new line of driver output is printed, we add a timestamp. In the driver’s standard output, in the form of notebook output, it will print out [1,x]<stdout>:Epoch y/z, where x is the xth hvd.np, y is the yth epoch, and z is the total number of epochs. We record the timestamp t1 of the first time Epoch y/z shows up in the standard output and the timestamp t2 of the first time Epoch (y+1)/z shows up, regardless of which process emits the output. The time difference t2 – t1 approximates the time it takes for the epoch y to complete, based on the assumption that only after all processes finish an epoch and the weights have been averaged can the next epoch begin. For MNIST, we got the wall-clock run time for three repetitions (with elapsed time measured in Python), and epoch times from TF1 output. We used the number of images in the training set times the number of repetitions divided by the total wall-clock time to get the number of images per second.
  4. Use Rectified Adam: This is a new optimizer that accurately finds the initial direction for gradient descent (Liu et al., 2019). To use it, first you need to install the Python package keras-rectified-adam. Second, you need to run the following init script on the cluster:
dbutils.fs.put("tmp/tf_horovod/tf_dbfs_timeout_fix.sh","""
#!/bin/bash
fusermount -u /dbfs
nohup /databricks/spark/scripts/fuse/goofys-dbr -f -o allow_other --file-mode=0777 --dir-mode=0777 --type-cache-ttl 0 --stat-cache-ttl 1s --http-timeout 5m /: /dbfs >& /databricks/data/logs/dbfs_fuse_stderr &""", True)
Note, the preceding step may not be necessary due to recent updates to the goofys within Databricks.

The notebook titled np4_VGG horovod benchmark_debug_RAdam in the GitHub repository for our paper can run successfully with the Rectified Adam optimizer. We found %env TF_KERAS = 1 worked, but not os.environ[‘TF_KERAS’] = ‘1’. We imported RAdam from keras_radam, then used optimizer = RAdam(total_steps=5000, warmup_proportion=0.1, learning_rate=learning_rate*hvd.size(), min_lr=1e-5), which we learned from optimizer = keras.optimizers.Adadelta(lr=learning_rate*hvd.size()) in Uber’s Horovod advanced example. We don’t know the exact optimal parameters for Rectified Adam on a HorovodRunner-enabled Spark cluster yet, but at least the code can run.

Note, the %env TF_KERAS = 1 configuration can also be configured at Databricks cluster configuration level.
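Putting those fragments together, the optimizer setup inside the training function looked roughly like the sketch below; the total_steps and warmup_proportion values are simply the ones we ran with, not tuned optima.

# Set the environment first; we found the notebook magic worked while os.environ did not:
#   %env TF_KERAS = 1
import horovod.tensorflow.keras as hvd
from keras_radam import RAdam

hvd.init()
learning_rate = 0.001  # base learning rate, scaled by the number of Horovod processes

optimizer = RAdam(total_steps=5000,
                  warmup_proportion=0.1,
                  learning_rate=learning_rate * hvd.size(),
                  min_lr=1e-5)
# Wrap with Horovod's distributed optimizer, as in the standard Horovod Keras examples
optimizer = hvd.DistributedOptimizer(optimizer)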

  5. Choice of models: HorovodRunner builds on Horovod. Horovod implements data parallelism to take in programs written based on single-machine deep learning libraries to run distributed training fast (Sergeev and Del Balso, 2017). It’s based on the Message Passing Interface (MPI) concepts of size, rank, local rank, allreduce, allgather, and broadcast (Sergeev and Del Balso, 2017; Sergeev and Del Balso, 2018). So, a model that will gain scaling efficiency has to support data parallelism (i.e., partitioning of rows of data on Spark). Regular sequential neural networks, RNN-based models, and CNN-based models will gain scaling efficiency. We haven’t done the coding ourselves, but we think with Petastorm preparing data the way they digest it, HorovodRunner should work with LSTMs too. GCNs, which need the entire graph’s connection matrix to be distributed to a worker, will not. There are partitioning (Liao et al., 2018) and fast GCN (Wu et al., 2019) methods and so on to help with parallelized GCN training, and AWS’s deep graph library is actively implementing those strategies. As for models that can gain scaling efficiency out of the box with HorovodRunner, theoretical modeling of distributed data-parallel training (Castelló et al., 2019) predicts models with lower operational intensity will benefit more, which is consistent with our work and with Uber’s benchmarks. Inception V3’s scaling efficiency is higher than VGG-16’s.

Why HorovodRunner?

In addition to speeding up the deep learning training process, HorovodRunner solves additional pain points for mid-cap companies. Residing on Spark clusters, it eliminates the need for customized GPU clusters which are cumbersome to build and maintain. Since Spark clusters reside in the cloud, it takes (nearly) full advantage of the cloud and removes the burden of maintaining the clusters from the company’s ML engineers and data scientists. It also supports multiple deep learning frameworks, which provides more flexibility and less business risk than sticking to one framework. Maybe the only caveat is that it is not elastic yet, but we expect that once HorovodRunner gains enough momentum in the industry it will move toward being elastic.

There’s more than one distributed training platform on the market, with distributed TensorFlow being the original player. Note that TensorFlow 2 has a better built-in system for distributed training, but it is specific to TensorFlow. Uber developed Horovod because using distributed TensorFlow is hard—both the code and the cluster setup. Then Databricks developed HorovodRunner and HorovodEstimator to make things even easier, especially on the cluster setup side. We had prior experience with Spark and single-machine deep learning training, but not with distributed deep learning training, and we were able to pick up HorovodRunner in about two months with Databricks’s standard user support. Although some people approximate the cognitive load of adopting a new framework with lines of code in GitHub examples, we find the framework’s class design and the readability of the example code to be good indicators of the ease of learning. By either measure, HorovodRunner, in our opinion, is an easy-to-use tool. This means it not only enables business advancement but also technological advancement.

The author of the Taichi programming language for graphics applications, Yuanming Hu, once said that the lack of user-friendly tools is a contributing factor to why computer graphics lacks advancement (https://www.qbitai.com/2020/01/10534.html). The importance of usability also applies to the advancement of distributed deep learning, with HorovodRunner being a good choice of tool. People use Keras because TensorFlow is hard to read. Uber developed Horovod because distributed TensorFlow maintenance is difficult, and even they found the code hard to understand. How many resources can a smaller company dedicate to competing with the tech giants for top tech talent who can understand the backend logic behind the code? Sometimes the difference is a matter of life or death, or the difference between going public or not. Zoom went public based on a mere $7 million annual profit. If they had hired a handful of additional top ML engineers in Silicon Valley, their balance sheet would have gone negative and their billion-dollar IPO could have been jeopardized.

Of course, there are alternatives to HorovodRunner. CollectiveAllReduceStrategy, under development by TensorFlow, is a similar tool, though as of the end of 2020 there are limited resources available online and unknown use in production. A quick search with the keyword “CollectiveAllReduceStrategy” returns several reports of failures and performance issues. At this point, it’s too early to say which one will prove to be a better framework for distributed deep learning, although CollectiveAllReduceStrategy’s deep ties to TensorFlow slightly increase the business risk. HorovodRunner is agnostic to deep learning platforms, which provides flexibility in choosing deep learning tools as well as integrated support of deep learning on different cloud platforms. Not every cloud platform is allowed to operate anywhere in the world, nor does every cloud platform support every deep learning platform.

Looking forward

In our opinion, it will be some time before the AI community no longer needs labels. Until then, the models (sequential models, CNNs, and RNNs) HorovodRunner currently supports will continue to have a huge impact on applications like natural language processing and understanding, translation, image classification, and image transformation. As mentioned previously, even if HorovodRunner doesn’t support them now, any models that allow data parallelism can theoretically work with it. If the model accuracy is highly dependent on the labeling of data, it’s important to get the correct labels and own your labels in-house (instead of buying labeled training data from elsewhere, which is a doomed business model for AI companies).

An urban legend in the AI community says that a shiitake mushroom–picking robot’s performance strongly depends on whether the human labeler is the plant’s senior director, a technician, or a seasonal worker, resulting in descending order of accuracy when classifying mushrooms into different grades. But oftentimes, this wisdom is ignored. There is a slight twist to the concept of owning your labels, advertised as “generating” your own labels, but in our opinion, this understates the importance of proper labeling. Let’s say a medical AI startup wants to create a product that can automatically label images for disease classification. Instead of hiring top physicians to label images, the startup trains novices to do this task, and then builds fancy models on top of those labels. Garbage labels don’t do any AI company, investor, or customer any good. Although GCN models have more complex relationships among nodes than the sequential relationships in RNN (and RNN-like) models, the success of any GCN model is still highly dependent on the accuracy of labeling. It is now well known that reinforcement learning without any labels outperforms any human master at the game of Go, and outperforms its predecessors that used human masters’ labels (Silver et al., 2017). Because of our dependency on labels and the relative scarcity of labeled data, reinforcement learning might well be the future of AI. We are excited to see how it will evolve into training not done from scratch (i.e., leveraging previously available human or model knowledge), adapting to a changing environment, and going distributed.

HorovodRunner’s easy-to-use,  platform-agnostic, production-grade support for the currently popular models has the potential to level the playing field for smaller companies so they can, perhaps, compete with the tech giants. And who doesn’t want to be the next Zoom, anyway?

GET SESSION SLIDES & TRANSCRIPT!

References
Castelló, A., Dolz, M.F., Quintana-Ortí, E.S., and Duato, J. 2019. “Theoretical Scalability Analysis of Distributed Deep Convolutional Neural Networks.” In Proceedings of the 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). https://doi.ieeecomputersociety.org/10.1109/CCGRID.2019.00068

Databricks. 2019. “Record Horovod Training with Horovod Timeline.” https://docs.databricks.com/applications/deep-learning/distributed-training/horovod-runner.html#record-horovod-training-with-horovod-timeline

Liao, R., Brockschmidt, M., Tarlow, D., Gaunt, A.L., Urtasun, R., and Zemel, R. 2018. “Graph Partition Neural Networks for Semi-Supervised Classification.” arXiv preprint arXiv: 1803.06272. https://arxiv.org/abs/1803.06272

Liu, L., Jiang H., He, P., Chen W., Liu X., Gao J., and Han J. 2019. “On the Variance of the Adaptive Learning Rate and Beyond.” arXiv preprint arXiv:1908.03265. https://arxiv.org/abs/1908.03265

Sergeev, A., and Del Balso, M. 2017. “Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow.” https://eng.uber.com/horovod/

Sergeev, A., and Del Balso, M. 2018. “Horovod: Fast and Easy Distributed Deep Learning in TensorFlow.” arXiv preprint arXiv:1802.05799. https://arxiv.org/abs/1802.05799

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., et al. 2017. “Mastering the Game of Go Without Human Knowledge.” Nature 550(7676):354–359. http://augmentingcognition.com/assets/Silver2017a.pdf

Wu, F., Zhang, T., Holanda de Souza Jr., A., Fifty, C., Yu, T., and Weinberger, K.Q. 2019. “Simplifying Graph Convolutional Networks.” In Proceedings of the 36th International Conference on Machine Learning. http://proceedings.mlr.press/v97/wu19e.html

Wu, X., Taylor, V., Wozniak, J. M., Stevens, R., Brettin, T., and Xia, F. 2018. “Performance, Power, and Scalability Analysis of the Horovod Implementation of the CANDLE NT3 Benchmark on the Cray XC40 Theta.” In Proceedings of the 8th Workshop on Python for High-Performance and Scientific Computing. https://sc18.supercomputing.org/proceedings/workshops/workshop_files/ws_phpsc104s2-file1.pdf

--

Try Databricks for free. Get started today.

The post Leveling the Playing Field: HorovodRunner for Distributed Deep Learning Training appeared first on Databricks.

Combining Rules-based and AI Models to Combat Financial Fraud

The financial services industry (FSI) is rushing towards transformational change, delivering transactional features and facilitating payments through new digital channels to remain competitive. Unfortunately, the speed and convenience that these capabilities afford also benefit fraudsters.

Fraud in financial services remains the number one threat to organizations’ bottom lines, given the record-high increase in overall fraud and how it has diversified in recent years. A recent survey by PwC outlines a staggering global impact of fraud. For example, in the United States alone, the cost to businesses in 2019 totaled $42bn, and 47% of surveyed companies experienced fraud in the past 24 months.

So how should companies respond to the ever-increasing threat of fraud? Fraudsters are exploiting the capabilities of the new digital landscape, meaning organizations must fight fraud in real time while still keeping the customer experience in mind. To elaborate further, financial institutions have two key levers for minimizing fraud losses: effective fraud prevention strategies and chargebacks to customers. Both present pros and cons, as they directly affect the customer experience. In this blog, we describe how to build a fraud detection and prevention framework using Databricks’ modern data architecture that can effectively balance fraud prevention strategies and policies to improve recoveries while maintaining the highest possible customer satisfaction.

Challenges in building a scalable and robust framework

Building a fraud prevention framework often goes beyond just creating a highly-accurate machine learning (ML) model due to an ever-changing landscape and customer expectations. Oftentimes, it involves a complex ETL process with a decision science setup that combines a rules engine with an ML platform. The requirements for such a platform include scalability and isolation of multiple workspaces for cross-regional teams built on open source standards. By design, such an environment empowers data scientists, data engineers and analysts to collaborate in a secure environment.

We will first look at using a data Lakehouse architecture combined with Databricks’ enterprise platform, which supports the infrastructure needs of all downstream components of a fraud prevention application. Throughout this blog, we will also reference two core components of the Databricks Lakehouse: Delta Engine, a high-performance query engine designed for scalability and performance on big data workloads, and MLflow, a fully managed tool to track ML experiments, govern models and quickly move them to production.

Customer 360 Data Lake

In financial services, and particularly when building fraud prevention applications, we often need to unify data from various data sources, usually at a scale ranging from multiple terabytes to petabytes. As technology changes rapidly and financial services integrate new systems, data storage systems must keep up with the changing underlying data formats. At the same time, these systems must enable organic evolutions of data pipelines while staying cost-effective. We are proposing Delta Lake as a consistent storage layer built on open-source standards to enable storage and compute of features to keep anomaly detection models on the cutting edge.

Data engineers can easily connect to various external data pipelines and payment gateways using the Databricks Ingestion Partner Network to unify member transactions, performance and trade history. As mentioned, it is critical to compute new features and refresh existing ones over time as the data flows in. Examples of pre-computed features are historical aggregates of member account history or statistical averages important for downstream analytical reporting and accelerating retraining of ML models. Databricks’ Delta Lake and native Delta Engine are built exactly for this purpose and can accelerate the speed of feature development using Spark-compatible APIs to enforce the highest levels of quality constraints for engineering teams.
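As an illustration of the kind of pre-computed feature described above, the sketch below aggregates a hypothetical transactions table into per-customer features and writes them back to Delta Lake; table paths and column names are placeholders, and it assumes a Databricks notebook where spark is predefined.

from pyspark.sql import functions as F

# Hypothetical source table of raw member transactions
transactions = spark.read.format("delta").load("/delta/transactions")

# Aggregate each customer's history into features for rules and ML models
customer_features = (transactions
    .groupBy("customer_id")
    .agg(F.avg("amount").alias("avg_transaction_amount"),
         F.max("amount").alias("max_transaction_amount"),
         F.countDistinct("merchant_id").alias("distinct_merchants")))

# Refresh the feature table; Delta schema evolution lets new features be added later
(customer_features.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save("/delta/customer_360_features"))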

A unified data lake to store and catalog customer data and enable new feature creation at scale

One platform serving all fraud prevention use cases

Since our approach to fraud detection involves a combination of a rules suite and ML, Databricks fits in well, as it is home to a diverse set of personas required to create rules and ML models – namely business analysts, domain experts and data scientists. In the following section, we’ll outline the different components of Databricks that map to personas and how rules meet ML using MLflow, shared notebooks and SQL Analytics.

The ability for users to collaborate using multiple workspaces while providing isolation at the user level is critical in financial services. With Databricks’ Enterprise Cloud Service architecture, an organization can create new workspaces within minutes. This is extremely helpful when orchestrating a fraud detection framework, since it creates isolation between product and business group users and for CI/CD orchestration within each group. For example, credit card business group users can be isolated from deposits, and each line of business can control the development and promotion of model artifacts.

Cross-domain collaboration made possible with multiple workspaces and a shared model registry in production among various teams.

Combining rules-based systems with AI

Mapping users and Databricks’ components

The fraud detection development cycle begins with business analysts and domain experts, who often contribute a major part of the initial discovery, including sample rulesets. These common-sense rules, involving tried-and-true features (such as customer location and distance from home), are:

a) Fast to execute
b) Easily interpretable and defensible by a FSI
c) Decrease false positives (i.e., false declines) through the rules framework
d) Flexible enough to increase the scope of training data required for fraud models

While rules are the first line of defense and an important part of a firm’s overall fraud strategy, the financial services industry has been leading the charge in developing and adopting cutting-edge ML algorithms alongside rulesets. The following design tier shows several components using the approach of combining rule sets and ML models. Now let’s look at each component and the typical workflow for the personas who will be supporting the respective operations.

Defining rules-based and ML models through a unified design tier infrastructure

Exploring rules using SQL functionalities

For exploratory data analysis, Databricks offers two avenues of attacking fraud for the analyst persona: Databricks SQL/Python/R-supported notebooks for data engineering and data science, and SQL Analytics for business intelligence and decision-making. SQL Analytics is an environment where users can build dashboards to capture historical information, query data lake tables with ease, and hook up BI tools for further exploration. As shown below, analysts have the ability to create dashboards with descriptive statistics, then transition to an investigation of individual fraudulent predictions to validate the reasons why a particular transaction was chosen as fraudulent.

Exploring fraudulent patterns using SQL and the dashboard capabilities of SQL Analytics.

In addition, users can edit any of the underlying queries powering the dashboard and access the catalog of data lake objects to inform future features that could be used as part of a rule based / ML based fraud prevention algorithm. In particular, users can start to slice data using rulesets, which ultimately make their way into production models. See the image below, which highlights the SQL query editor as well as the following collaborative and ease-of-use features:

  • Query sharing and reusability – the same query can power multiple dashboards, which places less load on the underlying SQL endpoint, allowing for higher concurrency
  • Query formatting – improved readability and ease of use of SQL on Databricks
  • Sharing – queries can be shared across business analysts, domain experts, and data scientists with the ‘Share’ functionality at the top right-hand side

Creating efficient SQL queries at scale using SQL Analytics.

Rules and model orchestration framework

We have covered the benefits of leveraging rulesets in our fraud detection implementation. However, it is important to recognize the limitations of a strict rules-based engine, namely:

  • Strict rules-based approaches put in place today become stale tomorrow, since fraudsters routinely update their strategies. As new fraud patterns emerge, analysts must scramble to develop new rules to detect new instances, resulting in high maintenance costs. Furthermore, there are hard costs associated with the inability to detect fraud quickly given updated data — ML approaches can help speed up time to detect fraud and thus save merchants potential losses
  • Rules lack a spectrum of conclusions and thus ignore risk tolerance since they cannot provide a probability of fraud
  • Accuracy can suffer due to the lack of interaction between rules when assessing fraudulent transactions, resulting in losses

For fraud detection framework development, we recommend a hybrid approach that combines rulesets with ML models. To this end, we have used a business logic editor to develop rules in a graphical interface, which is commonly used by systems such as Drools to make rule-making simple. Specifically, we interactively code our rules as nodes in a graph and reference an existing MLflow model (using its ML registry URI such as models:/fraud_prediction/production) to signal that an ML model, developed by a data scientist colleague, should be loaded and used to predict the output after executing the rules above it. Each rule uses a feature from a Delta Lake table, which has been curated by the data engineering team and is instantly available once the feature is added (see more details on schema evolution here to see how simple it is to add features to tables that change throughout the life of an ML project).

We create a logical flow by iteratively adding each rule (e.g. authorized amounts should be less than the cash available on the account as a baseline rule) and adding directed edges to visualize the decision-making process. In tandem, our data scientist may have an ML model to catch fraudulent instances discoverable from training data. As a data analyst, we can simply annotate a node to capture the execution of the ML model, giving us a probability of fraud for the transaction being evaluated.

Note: In the picture below, the underlying markup language (DMN) that contains all the rules is XML-based, so regardless of the tools or GUIs used to generate rules, it is common to extract the rulesets and graph structure from the underlying flat file or XML (e.g. a system like Drools).

Augmenting a rule set workflow with ML models

After assembling a combined ruleset and model steps, as shown above, we can now encode this entire visual flow into a decisioning fraud detection engine in Databricks. Namely, we can extract the DMN markup from the Kogito tool and upload directly into Databricks as a file. Since the .dmn file has node and edge contents, representing the order of rules and models to execute, we can leverage the graph structure. Luckily, we can use a network analysis Python package, networkx, to ingest, validate, and traverse the graph. This package will serve as the basis for the fraud scoring framework.

from xml.dom import minidom
import networkx as nx

# Parse the DMN (XML) ruleset exported from the rules editor
xmldoc = minidom.parse('DFF_Ruleset.dmn')
itemlist = xmldoc.getElementsByTagName('dmn:decision')

# Build a directed graph: one node per decision (rule or model step)
G = nx.DiGraph()
for item in itemlist:
    node_id = item.attributes['id'].value
    node_decision = str(item.attributes['name'].value)
    G.add_node(node_id, decision=node_decision)

    # An informationRequirement element points to the upstream decision(s)
    infolist = item.getElementsByTagName("dmn:informationRequirement")
    if len(infolist) > 0:
        info = infolist[0]
        for req in info.getElementsByTagName("dmn:requiredDecision"):
            parent_id = req.attributes['href'].value.split('#')[-1]
            G.add_edge(parent_id, node_id)
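
With the DAG in hand, the orchestrator can walk the decisions in dependency order; a minimal sketch of that traversal (the full logic lives in the attached notebooks) is:

# Visit each decision only after all of its upstream decisions, mirroring the rule flow
for node_id in nx.topological_sort(G):
    decision = G.nodes[node_id]['decision']
    print(node_id, decision)  # either a SQL-style rule or an MLflow model reference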

Now that we have the metadata and tools in place, we’ll use MLflow to wrap the hybrid ruleset and models up into a custom Pyfunc model, which is a lightweight wrapper we’ll use for fraud scoring. The only assumptions are that the model, which is annotated and used in the DAG (directed acyclic graph) above, is registered in the MLflow model registry and has a column called ‘predicted’ as our probability. The framework pyfunc orchestrator model (which leverages networkx) will traverse the graph and execute the rules directly from the XML content, resulting in an ‘approved’ or ‘denied’ state once the pyfunc is called for inference.
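
For the model node, the orchestrator resolves the registry URI referenced in the DAG; a minimal sketch of that lookup is shown below (the input columns are illustrative, not the exact feature set used in the notebooks):

import mlflow.pyfunc
import pandas as pd

# Load the registered fraud model referenced by the DAG's model node
fraud_model = mlflow.pyfunc.load_model("models:/fraud_prediction/production")

# Score a single incoming transaction; the 'predicted' column is the fraud probability
transaction = pd.DataFrame([{"amount": 1250.0, "available_cash": 800.0, "distance_from_home": 42.0}])
fraud_probability = fraud_model.predict(transaction)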

Below is a sample DAG created from the rules editor mentioned above. We’ve encoded the mixture of rules and a model that has been pre-registered (shown in the attached notebooks). The rules file itself is persisted within the model artifacts so, at inference time, all rules and models are loaded from cloud storage, and the models used (in this case the fraud detection model) are loaded from the MLflow model registry in a real-time data pipeline. Note that in a sample run, the third rule is not satisfied for the example input, so the node is marked as red, which indicates a fraudulent transaction.

Traversing directed acyclic graph for fraud detection

To further understand how the model executes rules, here is a snippet from the custom Pyfunc itself, which uses pandasql to encode the string from the XML ruleset inside of a SQL case statement for a simple flag setting. This results in output for the orchestrator, which is used to designate a fraudulent or valid transaction.


import mlflow.pyfunc

class DFF_Model(mlflow.pyfunc.PythonModel):
    import networkx as nx
    import pandas as pd
    from pandasql import sqldf

    def func_sql(self, sql):
        '''
        For rule-based nodes, we simply match the record against a predefined SQL where clause.
        If the rule matches, we return 1, else 0.
        '''
        from pandasql import sqldf
        # We do not execute the match yet, we simply return a function.
        # This allows us to define the function only once.
        def func_sql2(input_df):
            # Wrap the rule's WHERE clause in a CASE statement and evaluate it
            # against the single-record input dataframe
            pred = sqldf("SELECT CASE WHEN {} THEN 1 ELSE 0 END AS predicted FROM input_df".format(sql)).predicted.iloc[0]
            return pred
        return func_sql2

Decisioning and serving

Lastly, we’ll show what an end-to-end architecture looks like for the fraud detection framework. Notably, we have outlined what data scientists and data analysts work on, namely rulesets and models. These are combined in the decisioning tier to test out exactly what patterns will be deemed fraud or valid. The rulesets themselves are stored as artifacts in custom MLflow Pyfunc models and can be loaded in memory at inference time, which is done in a Python conda environment during testing. Finally, once the decisioning framework is ready to be promoted to production, there are a few steps that are relevant to deployment:

  • The decisioning framework is encoded in a custom pyfunc model, which can be loaded into a Docker container for inference in real time.
  • The base MLflow container used for inference should be deployed to ECR (Amazon), ACR (Azure) or generally Docker Hub.
  • Once the framework is deployed to a container service (EKS, AKS, or custom k8s deployments), the service refers to the container repository and MLflow model repository for standing up an application endpoint. Since the serving layer is based on k8s and a lightweight pyfunc model, model inference is relatively fast. In cases where the inference demands sub-second (ms) latency, the logic can be rewritten in C, Go or other frameworks.
  • For fast lookups on historical data when scoring in real time, data can be loaded into an in-memory database from the Customer 360 feature store that was created earlier, as sketched below. Finally, an enterprise case management system can be interfaced with the Customer 360 Data Lake to capture scoring results from the deployment container.
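
To make the last two points concrete, here is a minimal, hypothetical sketch of real-time scoring with the orchestrator pyfunc; the registry name, cached features and column names are placeholders rather than part of the framework above.

import mlflow.pyfunc
import pandas as pd

# Load the decisioning orchestrator (rules + model) from the MLflow registry;
# 'dff_orchestrator' is a placeholder registry name
orchestrator = mlflow.pyfunc.load_model("models:/dff_orchestrator/production")

# Stand-in for an in-memory lookup of pre-computed Customer 360 features
customer_features = {"cust_001": {"avg_transaction_amount": 87.5, "available_cash": 2400.0}}

def score_transaction(event):
    # Enrich the incoming event with cached historical features, then let the
    # orchestrator traverse its rules and model to return approved/denied
    features = customer_features.get(event["customer_id"], {})
    record = pd.DataFrame([{**event, **features}])
    return orchestrator.predict(record)

print(score_transaction({"customer_id": "cust_001", "amount": 1250.0, "distance_from_home": 42.0}))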

End-to-end architecture for fraud framework development and deployment

Building a modern fraud framework

While it’s a shared responsibility between vendors and financial services organizations to combat fraud effectively, by deploying effective fraud prevention strategies, FSIs can minimize direct financial loss and improve customers’ trust from fewer false declines. As we have seen in the surveys, fraud has diversified rapidly and the finance industry has turned to analytical models and ML to manage losses and increase customer satisfaction. It is a big mandate to build and maintain infrastructure to support multiple product teams and personas, which could directly impact a company’s revenue bottom line.

We believe this solution addresses the key areas of scalability in the cloud, fraud prevention workflow management and production-grade open source ML frameworks for organizations to build and maintain a modern fraud and financial crimes management infrastructure by bringing closer alignment between different internal teams.

This Solution Accelerator is part 1 of a series on building fraud and financial crimes solutions using Databricks’ Unified Analytics Platform. Try the below notebooks on Databricks to harness the power of AI to mitigate reputation risk. Contact us to learn more about how we assist FSIs with similar use cases.

Try the notebooks

--

Try Databricks for free. Get started today.

The post Combining Rules-based and AI Models to Combat Financial Fraud appeared first on Databricks.

How to Save up to 50% on Azure ETL While Improving Data Quality

The challenges of data quality

One of the most common issues our customers face is maintaining high data quality standards, especially as they rapidly increase the volume of data they process, analyze and publish. Data validation, data transformation and de-identification can be complex and time-consuming. As data volumes grow and new downstream use cases and applications emerge, expectations of timely delivery of high-quality data increase the importance of fast and reliable data transformation, validation, de-duplication and error correction. Over time, a wide variety of data sources and types adds processing overhead and increases the risk of an error being introduced into the growing number of data pipelines as both streaming and batch data are merged, validated and analyzed.

City-scale data processing

The City of Spokane, located in Washington state, is committed to providing information that promotes government transparency and accountability and understands firsthand the challenges of data quality. The City of Spokane deals with an enormous amount of critical data that is required for many of its operations, including financial reports, city council meeting agendas and minutes, issued and pending permits, as well as map and Geographic Information System (GIS) data for road construction, crime reporting and snow removal. With their legacy architecture, it was nearly impossible to obtain operational analytics and real-time reports. They needed a method of publishing and disseminating city datasets from various sources for analytics and reporting purposes through a central location that could efficiently process data to ensure data consistency and quality.

How the City of Spokane improved data quality while lowering costs

To abstract their entire ETL process and achieve consistent data through data quality and master data management services, the City of Spokane leveraged DQLabs and Azure Databricks. They merged a variety of data sources, removed duplicate data and curated the data in Azure Data Lake Storage (ADLS).

“Transparency and accountability are high priorities for the City of Spokane,” said Eric Finch, Chief Innovation and Technology Officer, City of Spokane. “DQLabs and Azure Databricks enable us to deliver a consistent source of cleansed data to address concerns for high-risk populations and to improve public safety and community planning.”

Using this joint solution, the City of Spokane increased government transparency and accountability and can provide citizens with information that encourages and invites public participation and feedback. Using the integrated golden record view, datasets became easily accessible to improve reporting and analytics. The result was an 80% reduction in duplicates, significantly improving data quality. With DQLabs and Azure Databricks, the City of Spokane also achieved a 50% lower total cost of ownership (TCO) by reducing the amount of manual labor required to classify, organize, de-identify, de-duplicate and correct incoming data as well as lower costs to maintain and operate their information systems as data volumes increase.

City of Spokane ETL/ELT process with DQLabs and Azure Databricks

How DQLabs leverages Azure Databricks to improve data quality

“DQLabs is an augmented data quality platform, helping organizations manage data smarter,” said Raj Joseph, CEO, DQLabs. “With over two decades of experience in data and data science solutions and products, what I find is that organizations struggle a lot in terms of consolidating data from different locations. Data is commonly stored in different forms and locations, such as PDFs, databases, and other file types scattered across a variety of locations such as on-premises systems, cloud APIs, and third-party systems.”

Helping customers make sense of their data and answering even simple questions such as “is it good?” or “is it bad?” is far more complicated than organizations ever anticipated. To solve these challenges, DQLabs built an augmented data quality platform. DQLabs helped the City of Spokane to create an automated cloud data architecture using Azure Databricks to process a wide variety of data formats, including JSON and relational databases. They first leveraged Azure Data Factory (ADF) with DQLabs’ built-in data integration tools to connect the various data sources and orchestrate the data ingestion at different velocities, for both full and incremental updates.

DQLabs uses Azure Databricks to process and de-identify both streaming and batch data in real time for data quality profiling. This data is then staged and curated for machine learning models built with PySpark MLlib.

Incoming data is evaluated to understand its semantic type using DQLabs’ artificial intelligence (AI) module, DataSense. This helps organizations classify, catalog, and govern their data, including sensitive data such as personally identifiable information (PII), which includes contact information and social security numbers.
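
As a generic illustration of the kind of de-identification step such a pipeline performs (this is not DQLabs’ implementation; paths and column names are placeholders, and it assumes a Databricks notebook where spark is predefined), a streaming hash of a PII column into a curated Delta table might look like:

from pyspark.sql import functions as F

# Stream raw records in with Auto Loader and hash the SSN column before curation
raw = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("/mnt/raw/citizen_requests"))

deidentified = raw.withColumn("ssn", F.sha2(F.col("ssn"), 256))

(deidentified.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/curated/_checkpoints/citizen_requests")
    .start("/mnt/curated/citizen_requests"))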

Based on the DataSense classifications, additional checks and custom rules can be applied to ensure data is managed and shared according to the city’s guidelines. Data quality scores can be monitored to catch errors quickly. Master Data Models (MDM) are defined at different levels. For example, contact information can include name, address and phone number.

Refined data are published as golden views for downstream analysis, reporting and analytics. Thanks to DQLabs and Azure Databricks, this process is fast and efficient, putting organizations like the City of Spokane in a leading position to leverage their data for operations, decision-making and future planning.

Get started with DQLabs and Azure Databricks to improve data quality

Learn more about DQLabs by registering for a live event with Databricks, Microsoft, and DQLabs. Get started with Azure Databricks with a Quickstart Lab and this 3-part webinar training series.

Register for Live Event!

--

Try Databricks for free. Get started today.

The post How to Save up to 50% on Azure ETL While Improving Data Quality appeared first on Databricks.

Data Exfiltration Protection With Databricks on AWS

In this blog, you will learn a series of steps you can take to harden your Databricks deployment from a network security standpoint, reducing the risk of data exfiltration in your organization.

Data exfiltration is every company’s worst nightmare, and in some cases, even the largest companies never recover from it. It’s one of the last steps in the cyber kill chain, and with maximum penalties under the General Data Protection Regulation (GDPR) of €20 million (~$23 million) or 4% of annual global turnover – it’s arguably the most costly.

But first, let’s define what data exfiltration is. Data exfiltration, or data extrusion, is a type of security breach that leads to the unauthorized transfer of data. This data often contains sensitive customer information, the loss of which can lead to massive fines, reputational damage, and an irreparable breach of trust. What makes it especially difficult to protect against is that it can be caused by both external and internal actors, and their motives can be either malicious or accidental. It can also be extremely difficult to detect, with organizations often not knowing that it’s happened until their data is already in the public domain and their logo is all over the evening news.

There are tons of reasons why preventing data exfiltration is top of mind for organizations across industries. One that we often hear about are concerns over platform-as-a-service (PaaS). Over the last few years, more and more companies are seeing the benefits of adopting a PaaS model for their enterprise data and analytics needs. Outsourcing the management of your data and analytics service can certainly free up your data engineers and data scientists to deliver even more value to your organization. But if the PaaS service provider requires you to store all of your data with them, or if it processes the data in their network, solving for data exfiltration can become an unmanageable problem. In that scenario, the only assurances you really have are whatever industry-standard compliance certifications they can share with you.

The Databricks Lakehouse Platform enables customers to store their sensitive data in their existing AWS account and process it in their own private virtual network(s), all while preserving the PaaS nature of the fastest-growing data and AI service in the cloud. And now, following the announcement of a cloud-native managed firewall service on AWS, customers can benefit from a new data exfiltration protection architecture, one that has been informed by years of work with the world’s most security-conscious customers.

Data exfiltration protection architecture

We recommend a hub-and-spoke topology reference architecture, powered by AWS Transit Gateway. The hub consists of a central inspection and egress virtual private cloud (VPC), while each spoke VPC houses federated Databricks workspaces for different business units or segregated teams. In this way, you create your own version of a centralized deployment model for your egress architecture, as is recommended for large enterprises.

A high-level view of this architecture and the steps required to implement it are provided below:

A high-level view of the recommended hub & spoke architecture to protect against Data exfiltration with Databricks on AWS

  1. Deploy a Databricks Workspace in your own Spoke VPC
  2. Set up VPC endpoints for the Spoke VPC
  3. (Optional) Set up AWS Glue or an External Hive Metastore
  4. Create a Central Inspection/Egress VPC
  5. Deploy AWS Network Firewall to the Inspection/Egress VPC
  6. Link the Hub & Spoke with AWS Transit Gateway
  7. Validate Deployment
  8. Clear up the Spoke VPC resources

Secure AWS Databricks deployment details

Prerequisites

Please note the Databricks Webapp, Secure Cluster Relay, Hive Metastore endpoints, and Control Plane IPs for your workspace here (map them based on the region you plan to deploy in). These details are needed to configure the firewall rules later on.

Step 1: Deploy a Databricks Workspace in your own spoke VPC

Databricks enterprise security and admin features allow customers to deploy Databricks using their own Customer Managed VPC, which enables greater flexibility and control over the configuration of your spoke architecture. You can also leverage our feature-rich integration with HashiCorp Terraform to create or manage deployments via infrastructure-as-code, so that you can rinse and repeat the operation across the wider organization.

Prior to deploying the workspace, you’ll need to create the following prerequisite resources in your AWS account:

Once you’ve done that, you can create a new workspace using the Account Console or workspace API.

In the example below, a Databricks workspace has been deployed into a spoke VPC with a CIDR range of 10.173.0.0/16 and two subnets in different availability zones with the ranges 10.173.4.0/22 and 10.173.8.0/22. You can use these IP ranges to follow the deployment steps and diagrams below.
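If you would rather script the spoke network than click through the console, a minimal boto3 sketch for the example VPC above might look like this (the region, availability zones and names are illustrative, and many customers would use the Terraform integration mentioned above instead):

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # illustrative region

# Spoke VPC for the Databricks data plane (CIDR from the example above)
vpc_id = ec2.create_vpc(CidrBlock="10.173.0.0/16")["Vpc"]["VpcId"]
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsSupport={"Value": True})
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={"Value": True})

# Two private subnets in different availability zones for the workspace
subnet_a = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.173.4.0/22", AvailabilityZone="eu-west-1a")
subnet_b = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.173.8.0/22", AvailabilityZone="eu-west-1b")

# Security group for the workspace; the clusters only need outbound plus intra-cluster traffic
sg = ec2.create_security_group(GroupName="databricks-spoke-sg",
                               Description="Security group for the Databricks spoke VPC",
                               VpcId=vpc_id)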

Step 2: Set up VPC endpoints for the Spoke VPC

Over the last decade, there have been many well-publicized data breaches from incorrectly configured cloud storage containers. So, in terms of major threat vectors and mitigating them, there’s no better place to start than by setting up your VPC Endpoints.

As well as setting up your VPC endpoints, it’s also well worth considering how these might be locked down further. Amazon S3 has a host of ways you can further protect your data, and we recommend you use these wherever possible.

In the AWS console:

  • Go to Services > Virtual Private Cloud > Endpoints.
  • Select Create endpoint and create VPC endpoints for S3, STS and Kinesis, choosing the VPC, subnets/route table and (where applicable) Security Group of the Customer Managed VPC created as part of Step 1 above:
Service category: AWS services, Service name: com.amazonaws.<region>.s3, Policy: Leave as “Full Access” for now
Service category: AWS services, Service name: com.amazonaws.<region>.kinesis-streams, Policy: Leave as “Full Access” for now
Service category: AWS services, Service name: com.amazonaws.<region>.sts, Policy: Leave as “Full Access” for now

Note – If you want to add VPC endpoint policies so that users can only access the AWS resources that you specify, please contact your Databricks account team as you will need to add the Databricks AMI and container S3 buckets to the Gateway Endpoint Policy for S3.

Please note that applying a regional endpoint to your VPC will prevent cross-region access to any AWS services, for example S3 buckets in other AWS regions. If cross-region access is required, you will need to allow-list the global AWS endpoints in the AWS Network Firewall rules below.
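The same endpoints can also be scripted; a hedged boto3 sketch (placeholder resource IDs, illustrative region) is shown below:

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # illustrative region

# Gateway endpoint for S3, attached to the spoke VPC route table
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0spoke0000000000",                   # spoke VPC from Step 1 (placeholder)
    ServiceName="com.amazonaws.eu-west-1.s3",
    RouteTableIds=["rtb-0spoke0000000000"],         # spoke route table (placeholder)
)

# Interface endpoints for STS and Kinesis in the workspace subnets
for service in ("sts", "kinesis-streams"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0spoke0000000000",
        ServiceName=f"com.amazonaws.eu-west-1.{service}",
        SubnetIds=["subnet-0aaa0000000000000", "subnet-0bbb0000000000000"],  # workspace subnets
        SecurityGroupIds=["sg-0spoke0000000000"],   # workspace security group
        PrivateDnsEnabled=True,
    )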

Step 3: (Optional) set up AWS Glue or an external metastore

For data cataloging and discovery, you can either leverage a managed Hive Metastore running in the Databricks Control Plane, host your own, or use AWS Glue. The steps for setting these up are fully documented in the links below.

Step 4- Create a Central Inspection/Egress VPC

Next, you’ll create a central inspection/egress VPC, which once we’ve finished should look like this:

Example central inspection/egress VPC used with the Databricks data exfiltration prevention architecture for AWS.

For simplicity, we’ll demonstrate the deployment into a single availability zone. For a high availability solution, you would need to replicate this deployment across each availability zone within the same region.

  • Go to Your VPCs and select Create VPC
  • As per the Inspection VPC diagram above, create a VPC called “Inspection-Egress-VPC” with the CIDR range 10.10.0.0/16
  • Go to Subnets and select Create subnet
  • As per the Inspection-Egress VPC diagram above, create 3 subnets in the above VPC with the CIDR ranges 10.10.0.0/28, 10.10.1.0/28 and 10.10.2.0/28. Call them TGW-Subnet-1, NGW-Subnet-1 and Firewall-Subnet-1 respectively
  • Go to Security Groups and select Create security group. Create a new Security Group for the Inspection/Egress VPC as follows:
Name: Inspection-Egress-VPC-SG
Description: SG for the Inspection/Egress VPC
Inbound rules: Add a new rule for All traffic from 10.173.0.0/16 (the Spoke VPC)
Outbound rules: Leave as All traffic to 0.0.0.0/0

Because you’re going from private to public networks, you will need to add both a NAT gateway and an internet gateway. This helps from a security point of view because the NAT gateway will sit on the trusted side of the AWS Network Firewall, giving an additional layer of protection (a NAT gateway not only gives us a single external IP address, it also refuses unsolicited inbound connections from the internet).

    • Go to Internet Gateways and select Create internet gateway
    • Enter a name like “Egress-IGW” and select Create internet gateway
  • On the Egress-IGW page, select Actions > Attach to VPC
  • Select Inspection-Egress-VPC and then click Attach internet gateway
  • Go to Route Tables and select Create route table
  • Create a new route table called “Public Route Table” and associate it with the Inspection-Egress-VPC created above
  • Select the route table > Routes > Edit routes and add a new route to Destination 0.0.0.0/0 with a Target of igw* (if you start typing it, Egress-IGW created above should appear)
  • Select Save routes and then Subnet associations > Edit subnet associations
  • Add an association to Firewall-Subnet-1 created above
  • Go to NAT Gateways and select Create NAT gateway
  • Complete the NAT gateway settings page as follows:
Name: Egress-NGW-1
Subnet: NGW-Subnet-1
Elastic IP allocation ID: Allocate Elastic IP
  • Select Create NAT gateway
  • Go to Route Tables and select Create route table
  • Create a new route table called “TGW Route Table” and associate it with the Inspection-Egress-VPC created above
  • Select the route table > Routes > Edit routes and add a new route to Destination 0.0.0.0/0 with a Target of nat-* (if you start typing it, Egress-NGW-1 created above should appear)
  • Select Save routes and then Subnet associations > Edit subnet associations
  • Add an association to TGW-Subnet-1 created above

At the end of this step, your central inspection/egress VPC should look like this:
Your central inspection/egress VPC at the end of Step 4
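For reference, the console steps above can also be scripted. A minimal boto3 sketch of the VPC, subnets, internet gateway, NAT gateway and public route table (illustrative region, no error handling) might look like this:

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # illustrative region

# Inspection/egress VPC and its three subnets
vpc_id = ec2.create_vpc(CidrBlock="10.10.0.0/16")["Vpc"]["VpcId"]
tgw_subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.10.0.0/28")["Subnet"]["SubnetId"]
ngw_subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.10.1.0/28")["Subnet"]["SubnetId"]
fw_subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.10.2.0/28")["Subnet"]["SubnetId"]

# Internet gateway attached to the VPC
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

# NAT gateway with a new Elastic IP, placed in the NGW subnet
eip = ec2.allocate_address(Domain="vpc")
ngw = ec2.create_nat_gateway(SubnetId=ngw_subnet, AllocationId=eip["AllocationId"])

# Public route table: 0.0.0.0/0 -> IGW, associated with the firewall subnet
public_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=public_rt, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=public_rt, SubnetId=fw_subnet)

# The TGW route table (0.0.0.0/0 -> NAT gateway, associated with the TGW subnet) follows the same
# pattern, using NatGatewayId=ngw["NatGateway"]["NatGatewayId"] as the route target.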

Step 5- Deploy AWS Network Firewall to the Inspection/Egress VPC

Now that you’ve created the networks, it’s time to deploy and configure your AWS Network Firewall.

  • Go to Firewalls and select Create firewall 
  • Create a Firewall as follows (Select Create and associate an empty firewall policy under Associated firewall policy):
Name: <Globally unique name>
VPC: Inspection-Egress-VPC
Firewall subnets: Firewall-Subnet-1
New firewall policy name: Egress-Policy
  • Go to Firewall details and scroll down to Firewall endpoints. Make a note of the Endpoint ID. You’ll need this later to route traffic to and from the firewall.

In order to configure your firewall rules, you’re going to use the AWS CLI. The reason is that for AWS Network Firewall to work in a hub-and-spoke model, you need to provide it with a HOME_NET variable, that is, the CIDR ranges of the networks you want to protect. Currently, this is only configurable via the CLI.

  • Download, install and configure the AWS CLI.
  • Test that it works as expected by running the command aws network-firewall list-firewalls
  • Create a JSON file for your allow-list rules. A template for all of the files you need to create can be found on GitHub here. You’ll need to replace the values for anything capitalized as follows: <REPLACE>. For example <SPOKE VPC CIDR RANGE> and <HUB VPC CIDR RANGE> should be replaced with the CIDR ranges for your Customer Managed VPC and central Inspection/Egress VPC as appropriate.
  • Within the Targets object, replace everything capitalized and surrounded by <> with the appropriate hostname from Firewall appliance infrastructure.
  • Some Databricks features also communicate back to the control plane via the REST API. You therefore also need to allow-list your Databricks Instance Name.
  • In the example below, we’ve also included the managed Hive Metastore URL. If you chose to use Glue or host your own in step 3, this can be omitted.
  • Note also that the HOME_NET variable should contain all of your spoke CIDR ranges, as well as the CIDR range of the Inspection/Egress VPC itself. Save it as “allow-list-fqdns.json.” An example of a valid rule group configuration for the eu-west-1 region would be as follows:
 
{
    "RuleVariables": {
        "IPSets": {
            "HOME_NET": {
                "Definition": [
                    "10.173.0.0/16",
                    "10.10.0.0/16"
                ]
            }
        }
    },
    "RulesSource": {
        "RulesSourceList": {
            "TargetTypes": [
                "TLS_SNI",
                "HTTP_HOST"
            ],
            "GeneratedRulesType": "ALLOWLIST",
            "Targets": [
                "ireland.cloud.databricks.com",
                "tunnel.eu-west-1.cloud.databricks.com",
                "md15cf9e1wmjgny.cxg30ia2wqgj.eu-west-1.rds.amazonaws.com",
                "my-databricks-instance-name.cloud.databricks.com",
                ".pypi.org",
                ".pythonhosted.org",
                ".cran.r-project.org"
            ]
        }
    }
}

Then create a stateful rule group from this file using the AWS CLI:

aws network-firewall create-rule-group --rule-group-name Databricks-FQDNs --rule-group file://allow-list-fqdns.json --type STATEFUL --capacity 100

You also need to add an IP-based rule to allow access to the Databricks control plane infrastructure IP ranges, again as per the guide in Firewall appliance infrastructure. Create another JSON file, this time called “allow-list-ips.json.” An example of a valid rule group configuration for the eu-west-1 region would be as follows:


{
    "RuleVariables": {
        "IPSets": {
            "HOME_NET": {
                "Definition": [
                    "10.173.0.0/16",
                    "10.10.0.0/16"
                ]
            }
        }
    },
    "RulesSource": {
        "StatefulRules": [
            {
                "Action": "PASS",
                "Header": {
                    "Protocol": "TCP",
                    "Source": "10.173.0.0/16",
                    "SourcePort": "Any",
                    "Direction": "ANY",
                    "Destination": "3.250.244.112/28",
                    "DestinationPort": "443"
                },
                "RuleOptions": [
                    {
                        "Keyword": "sid:1"
                    }
                ]
            }
        ]
    }
}


Then create the rule group from this file:

aws network-firewall create-rule-group --rule-group-name Databricks-IPs --rule-group file://allow-list-ips.json --type STATEFUL --capacity 100

Finally, add some basic deny rules to cater for common firewall scenarios such as preventing the use of protocols like SSH/SFTP, FTP and ICMP. Create another JSON file, this time called “deny-list.json.” An example of a valid rule group configuration would be as follows:


{
    "RuleVariables": {
        "IPSets": {
            "HOME_NET": {
                "Definition": [
                    "10.173.0.0/16",
                    "10.10.0.0/16"
                ]
            }
        }
    },
    "RulesSource": {
        "StatefulRules": [
            {
                "Action": "DROP",
                "Header": {
                    "Protocol": "FTP",
                    "Source": "Any",
                    "SourcePort": "Any",
                    "Direction": "ANY",
                    "Destination": "Any",
                    "DestinationPort": "Any"
                },
                "RuleOptions": [
                    {
                        "Keyword": "sid:1"
                    }
                ]
            },
            {
                "Action": "DROP",
                "Header": {
                    "Protocol": "SSH",
                    "Source": "Any",
                    "SourcePort": "Any",
                    "Direction": "ANY",
                    "Destination": "Any",
                    "DestinationPort": "Any"
                },
                "RuleOptions": [
                    {
                        "Keyword": "sid:2"
                    }
                ]
            },
            {
                "Action": "DROP",
                "Header": {
                    "Protocol": "ICMP",
                    "Source": "Any",
                    "SourcePort": "Any",
                    "Direction": "ANY",
                    "Destination": "Any",
                    "DestinationPort": "Any"
                },
                "RuleOptions": [
                    {
                        "Keyword": "sid:3"
                    }
                ]
            }
        ]
    }
}

Then create the rule group:

aws network-firewall create-rule-group --rule-group-name Deny-Protocols --rule-group file://deny-list.json --type STATEFUL --capacity 100

Now add the following rule groups to the Egress-Policy created above.

    • Go to Firewalls and select Firewall policies
    • Select the Egress-Policy created above
    • In Stateless default actions, select Edit and change Choose how to treat fragmented packets to Use the same actions for all packets
    • You should now have Forward to stateful rule groups for all types of packets
    • Scroll down to Stateful rule groups and select Add rule groups > Add stateful rule groups to the firewall policy
  • Select all 3 rule groups created above (Databricks-FQDNs, Databricks-IPs and Deny-Protocols) and then select Add stateful rule group

Our AWS Network Firewall is now deployed and configured; all you need to do now is route traffic to it.

  • Go to Route Tables and select Create route table
  • Create a new route table called “NGW Route Table” and associate it with the Inspection-Egress-VPC created above
  • Select the route table > Routes > Edit routes and add a new route to Destination 0.0.0.0/0 with a Target of vpce-* (if you start typing it, the VPC endpoint id for the firewall created above should appear).
  • Select Save routes and then Subnet associations > Edit subnet associations
  • Add an association to NGW-Subnet-1 created above
  • Select Create route table again
  • Create a new route table called “Ingress Route Table” and associate it with the Inspection-Egress-VPC created above
  • Select the route table > Routes > Edit routes and add a new route to Destination 10.10.1.0/28 (NGW-Subnet-1 created above) with a Target of vpce-* (if you start typing it, the VPC endpoint id for the firewall created above should appear).
  • Select Save routes and then Edge Associations > Edit edge associations
  • Add an association to the Egress-IGW created above

These steps walk through creating a firewall configuration that restricts outbound http/s traffic to an approved set of Fully Qualified Domain Names (FQDNs). So far, this blog has focussed a lot on this last line of defense, but it’s also worth taking a step back and considering the multi-layered approach taken here. For example, the security group for the Spoke VPC only allows outbound traffic. Nothing can access this VPC unless it is in response to a request that originates from that VPC. This approach is enabled by the Secure Cluster Connectivity feature offered by Databricks and allows us to protect resources from the inside out.

At the end of this step, your central inspection/egress VPC should look like this:

Your central inspection/egress VPC at the end of Step 5

Now that our spoke and inspection/egress VPCs are ready to go, all you need to do is link them all together, and AWS Transit Gateway is the perfect solution for that.

Step 6- Link the spoke, inspection and hub VPCs with AWS Transit Gateway

First, let’s create a Transit Gateway and link our Databricks data plane via TGW subnets:

    • Go to Transit Gateways > Create Transit Gateway
    • Enter a Name tag of “Hub-TGW” and uncheck the following:
  • Default route table association 
  • Default route table propagation
  • Select Create Transit Gateway
  • Go to Subnets and select Create subnet
  • Create a new subnet for each availability zone in the Customer Managed VPC with the next available CIDR ranges (for example 10.173.12.0/28 and 10.173.12.16/28) and name them appropriately (for example “TGW-Subnet-1” and “TGW-Subnet-2”)
  • Go to Transit Gateway Attachments > Create Transit Gateway Attachment
  • Complete the Create Transit Gateway Attachment page as follows:
Transit Gateway ID: Hub-TGW
Attachment type: VPC
Attachment name tag: Spoke-VPC-Attachment
VPC ID: the Customer Managed VPC created above
Subnet IDs: TGW-Subnet-1 and TGW-Subnet-2 created above
  • Select Create attachment

Repeat the process to create Transit Gateway attachments for the TGW to Inspection/Egress-VPC:

  • Select Create Transit Gateway Attachment
  • Complete the Create Transit Gateway Attachment page as follows:
Transit Gateway ID: Hub-TGW
Attachment type: VPC
Attachment name tag: Inspection-Egress-VPC-Attachment
VPC ID: Inspection-Egress-VPC
Subnet IDs: TGW-Subnet-1
  • Select Create attachment

All of the logic that determines what traffic gets routed where via a Transit Gateway is encapsulated within Transit Gateway route tables. Next, we’re going to create some TGW route tables for our hub and spoke networks.

  • Go to Transit Gateway Route Tables > Create Transit Gateway Route Table
  • Give it a Name tag of “Spoke > Firewall Route Table” and associate it with Transit Gateway ID Hub-TGW
  • Select Create Transit Gateway Route Table
  • Repeat the process, this time creating a route table with a Name tag of “Firewall > Spoke Route Table” and again associating it with Hub-TGW

Now associate these route tables and, just as importantly, create some routes:

    • Select “Spoke > Firewall Route Table” created above > the Association tab > Create association
    • In Choose attachment to associate, select Spoke-VPC-Attachment and then Create association
    • For the same route table, select the Routes tab and then Create static route
  • In CIDR enter 0.0.0.0/0 and in Choose attachment select Inspection-Egress-VPC-Attachment. Select Create static route.
    • Select “Firewall > Spoke Route Table” created above > the Association tab > Create association
    • In Choose attachment to associate, select Inspection-Egress-VPC-Attachment and select Create association
    • Select the Routes tab for “Firewall > Spoke Route Table”
    • Select Create static route
  • In CIDR, enter 10.173.0.0/16 and in Choose attachment, select Spoke-VPC-Attachment. Select Create static route. This route will be used to return traffic to the Spoke-VPC.

The Transit Gateway should now be set up and ready to go; all that remains is to update the route tables in each of the subnets so that traffic flows through it.

  • Go to Route Tables and find the NGW Route Table that you created in the Inspection-Egress-VPC during Step 5 earlier
  • Select the Routes tab > Edit routes and add a new route with a Destination of 10.173.0.0/16 (the Spoke-VPC) and a Target of Hub-TGW
  • Select Save routes
  • Now find the TGW Route Table that you created in the Inspection-Egress-VPC earlier
  • Again, select the Routes tab > Edit routes and add a new route with a Destination of 10.173.0.0/16 (the Spoke-VPC) and a Target of Hub-TGW
  • Select Save routes
  • Go to Route Tables and find the route table associated with the subnets that make up your Databricks data plane. It will have been created as part of the Customer Managed VPC in Step 1 above.
  • Select the Routes tab > Edit routes. You should see a route to pl-* for com.amazonaws.<region>.s3 and potentially a route to 0.0.0.0/0 to a nat-*
  • Replace the target of the route for Destination 0.0.0.0/0 with the Hub-TGW created above.
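As with the earlier steps, the Transit Gateway pieces can also be scripted. A hedged boto3 sketch (placeholder VPC and subnet IDs) of the gateway, the two attachments and the “Spoke > Firewall” route table might look like this:

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # illustrative region

# Transit gateway with default route table association/propagation disabled
tgw = ec2.create_transit_gateway(
    Description="Hub-TGW",
    Options={"DefaultRouteTableAssociation": "disable",
             "DefaultRouteTablePropagation": "disable"},
)["TransitGateway"]
tgw_id = tgw["TransitGatewayId"]

# Attach the spoke (Databricks) VPC and the inspection/egress VPC via their TGW subnets.
# In practice you would wait for the gateway and attachments to become "available" first.
spoke_attachment = ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId=tgw_id,
    VpcId="vpc-0spoke0000000000",                                    # placeholder
    SubnetIds=["subnet-0tgw10000000000", "subnet-0tgw20000000000"],  # placeholder
)["TransitGatewayVpcAttachment"]["TransitGatewayAttachmentId"]

egress_attachment = ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId=tgw_id,
    VpcId="vpc-0egress000000000",                                    # placeholder
    SubnetIds=["subnet-0tgwegress000000"],                           # placeholder
)["TransitGatewayVpcAttachment"]["TransitGatewayAttachmentId"]

# "Spoke > Firewall" route table: associate the spoke attachment and send 0.0.0.0/0 to the egress VPC
spoke_to_fw = ec2.create_transit_gateway_route_table(TransitGatewayId=tgw_id)
spoke_to_fw_id = spoke_to_fw["TransitGatewayRouteTable"]["TransitGatewayRouteTableId"]
ec2.associate_transit_gateway_route_table(TransitGatewayRouteTableId=spoke_to_fw_id,
                                          TransitGatewayAttachmentId=spoke_attachment)
ec2.create_transit_gateway_route(DestinationCidrBlock="0.0.0.0/0",
                                 TransitGatewayRouteTableId=spoke_to_fw_id,
                                 TransitGatewayAttachmentId=egress_attachment)

# The "Firewall > Spoke" route table follows the same pattern in reverse, with a static
# route for 10.173.0.0/16 pointing at the spoke attachment.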

Step 7- Validate deployment

To ensure there are no errors, we recommend some thorough testing before handing the environment over to end-users.

First you need to create a cluster. If that works, you can be confident that your connection to the Databricks secure cluster connectivity relay works as expected.

Next, check out Get started as a Databricks Workspace user, particularly the Explore the Quickstart Tutorial notebook, as this is a great way to test the connectivity to a number of different sources, from the Hive Metastore to S3.

As an additional test, you could use %sh in a notebook to invoke curl and test connectivity to each of the required URLs.
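Alternatively, a quick connectivity check from a Python notebook cell might look like the sketch below; the URLs are the eu-west-1 examples used in the rule groups above and should be replaced with your own allow-list:

import requests

urls = [
    "https://ireland.cloud.databricks.com",
    "https://pypi.org",
    "https://cran.r-project.org",
    "https://www.google.com",  # not allow-listed, so this one should fail or time out
]

for url in urls:
    try:
        response = requests.head(url, timeout=10)
        print(f"{url}: reachable (HTTP {response.status_code})")
    except requests.exceptions.RequestException as error:
        print(f"{url}: blocked or unreachable ({type(error).__name__})")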

Now go to Firewalls in the AWS console and select the Hub-Inspection-Firewall you created above. On the Monitoring tab you should see the traffic generated above being routed through the firewall:

via the Monitoring tab you can see all of the traffic flowing through your AWS Network Firewall

If you want a more granular level of detail, you can set up specific logging & monitoring configurations for your firewall, sending information about the network traffic flowing through your firewall, and any actions applied to it, to sinks such as CloudWatch Logs or S3. What’s more, by combining these with Databricks audit logs, you can build a 360-degree view of exactly how users are using their Databricks environment, and set up alerts on any potential breaches of the acceptable use policy.
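As a hedged example, the firewall logging configuration can be applied with a single boto3 call; the firewall name and S3 bucket below are placeholders:

import boto3

nfw = boto3.client("network-firewall", region_name="eu-west-1")  # illustrative region

# Send both flow and alert logs to an S3 bucket (placeholder names)
nfw.update_logging_configuration(
    FirewallName="my-egress-firewall",
    LoggingConfiguration={
        "LogDestinationConfigs": [
            {"LogType": "FLOW", "LogDestinationType": "S3",
             "LogDestination": {"bucketName": "my-firewall-logs", "prefix": "flow"}},
            {"LogType": "ALERT", "LogDestinationType": "S3",
             "LogDestination": {"bucketName": "my-firewall-logs", "prefix": "alert"}},
        ]
    },
)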

As well as positive testing, we recommend doing some negative tests of the firewall too. For example:

HTTPS requests to the Databricks Web App URL are allowed.

Data exfiltration test to ensure that approved traffic is allowed through.

Whereas HTTPS requests to google.com fail.

Data exfiltration test to ensure denied traffic is dropped as expected.

Finally, it’s worth testing the “doomsday scenario” as far as data exfiltration protection is concerned- that data could be leaked to an S3 bucket outside of your account. Since the global S3 URL has not been allow-listed, attempts to connect to S3 buckets outside of your region will fail:

Data exfiltration “doomsday scenario” test for malicious or unintentional insider reading/writing of protected data to an unprotected external bucket.

And if you combine this with endpoint policies for Amazon S3, you can tightly enforce which S3 buckets a user can access from Databricks within your region too.

Step 8- Clear up the Spoke VPC resources

Depending on how you set up the Customer Managed VPC, you might find that there are now some unused resources in it, namely:

  • A public subnet
  • A NAT GW and EIP
  • An IGW

Once you have completed your testing, it should be safe to detach and delete these resources. Before you do, it’s worth double-checking that your traffic is routing through the AWS Network Firewall as expected, and not via the default NAT Gateway. You can do this in any of the following ways:

  1. Find the Route Table that applies to the subnets in your Customer Managed VPC. Make sure that the 0.0.0.0/0 route has a target of your TGW rather than the default NAT Gateway.
  2. As described above, go to Firewalls in the AWS console and select the Hub-Inspection-Firewall created. On the Monitoring tab you should see all of the traffic that’s been routed through the firewall.
  3. If you set up logging & monitoring configurations for your firewall, you can see all flow and alert log entries in either CloudWatch or S3. This should include the traffic that is being routed back to the Databricks control plane.

If your Databricks workspace continues to function as expected (for example you can start clusters and run notebook commands), you can be confident that everything is working correctly. In the event of a configuration error, you might see one of these issues:

Issue 1: Cluster creation takes a long time and eventually fails with Container Launch Failure
  • Check the VPC endpoints for S3 and STS. Commonly this is caused by endpoint policies or security groups that prevent us from pulling the container images at cluster startup. Note that in particular, Databricks recommends you use the same security group that was created for your workspace VPC; this configuration causes workspace traffic to STS to use the endpoint route.
Issue 2: Cluster creation takes a long time and eventually times out with Network Configuration Failure
  • Check the route tables and firewall rules. Commonly this is caused by the clusters not being able to connect back to the relay service on cluster startup. Confirm that the routes and firewall rules are correct and that the traffic is flowing through the firewall.
Issue 3: Commands take a long time and eventually time out
  • Check the route tables and firewall rules. Commonly this is caused by the clusters not being able to connect back to the relay service. Confirm that the routes and firewall rules are correct and that the traffic is flowing through the firewall.

If clusters won’t start, and more in-depth troubleshooting is required, you could create a test EC2 instance in one of your Customer Managed VPC subnets and use commands like curl to test network connectivity to the necessary URLs.

Next Steps

You can’t put a price on data security. The cost of lost, exfiltrated data is often just the tip of the iceberg, compounded by the cost of long-term reputational damage, regulatory backlash, loss of IP, and more.

This blog shows an example firewall configuration and how security teams can use it to restrict outbound traffic based on a set of allowed FQDNs. It’s important to note, however, that a one-size-fits-all approach will not work for every organization; the right controls depend on your risk profile and the sensitivity of your data. There’s plenty that can be done to lock this down further, so it’s important to engage the right security and risk professionals in your organization to determine the appropriate access controls for your individual use case. This guide should be seen as a starting point, not the finishing line.

The war against cybercriminals and the many cyber threats faced in this connected, data-driven world is never won, but there are step-wise approaches like protecting against data exfiltration that you can take to fortify your defense.

This blog has focussed on how to prevent data exfiltration with an extra-secure architecture on AWS. But the best security is always based on a defense-in-depth approach. Learn more about the other platform features you can leverage in Databricks to protect your intellectual property, data and models. And learn how other customers are using Databricks to transform their business, and better still, how you can too!

--

Try Databricks for free. Get started today.

The post Data Exfiltration Protection With Databricks on AWS appeared first on Databricks.

Ray & MLflow: Taking Distributed Machine Learning Applications to Production


This is a guest blog from software engineers Amog Kamsetty and Archit Kulkarni of Anyscale and contributors to Ray.io

In this blog post, we’re announcing two new integrations with Ray and MLflow: Ray Tune+MLflow Tracking and Ray Serve+MLflow Models, which together make it much easier to build machine learning (ML) models and take them to production.

These integrations are available in the latest Ray wheels. You can follow the instructions here to pip install the nightly version of Ray and take a look at the documentation to get started. They will also be in the next Ray release — version 1.2.

Two new integrations between Ray.io and MLflow make it easier to bring ML models to production.

Our goal is to leverage the strengths of the two projects: Ray’s distributed libraries for scaling training and serving and MLflow’s end-to-end model lifecycle management.

What problem are these tools solving?

Let’s first take a brief look at what these libraries can do before diving into the new integrations.

Ray Tune scales hyperparameter tuning

With ML models increasing in size and training times, running large-scale ML experiments on a single machine is no longer feasible. It’s now a necessity to distribute your experiment across many machines.

Ray Tune is a library for executing hyperparameter tuning experiments at any scale and can save you tens of hours in training time.

With Ray Tune you can:

  • Launch a multi-node hyperparameter sweep in <10 lines of code
  • Use any ML framework such as PyTorch, TensorFlow, MXNet, or Keras
  • Leverage state-of-the-art hyperparameter optimization algorithms such as Population Based Training, HyperBand, or Asynchronous Successive Halving (ASHA).

Ray Serve scales model serving

After developing your machine learning model, you often need to deploy your model to actually serve prediction requests. However, ML models are often compute intensive and require scaling out to distributed systems in real deployments.

Ray Serve is an easy-to-use scalable model serving library that:

  • Simplifies model serving using GPUs across many machines so you can meet production uptime and performance requirements.
  • Works with any ML framework, such as PyTorch, TensorFlow, MXNet, or Keras.
  • Provides a programmatic configuration interface (no more YAML or JSON!).

MLflow tames end-to-end model lifecycle management

The components of MLflow: taming end-to-end ML lifecycle management.

Ray Tune and Ray Serve make it easy to distribute your ML development and deployment, but how do you manage this process? This is where MLflow comes in.

During experiment execution, you can leverage MLflow’s Tracking API to keep track of the hyperparameters, results, and model checkpoints of all your experiments, as well as easily visualize and share them with other team members. And when it comes to deployment, MLflow Models provides standardized packaging to support deployment in a variety of different environments.
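As a quick illustration of the Tracking API on its own (outside of Ray), a run can record parameters and metrics with just a few calls; the values below are purely illustrative:

import mlflow

with mlflow.start_run(run_name="baseline"):
    # Hypothetical hyperparameters and result, just to show the API shape
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("num_layers", 3)
    mlflow.log_metric("val_accuracy", 0.87)
    # Any local file (e.g. a model checkpoint) can also be logged as an artifact:
    # mlflow.log_artifact("model_checkpoint.pt")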

Key Takeaways

Together, Ray Tune, Ray Serve, and MLflow remove the scaling and management burden from ML engineers, allowing them to focus on the main task: building ML models and algorithms.

Let’s see how we can leverage these libraries together.

Ray Tune + MLflow Tracking

Ray Tune integrates with MLflow Tracking API to easily record information from your distributed tuning run to an MLflow server.

There are two APIs for this integration: an MLflowLoggerCallback and an mlflow_mixin.

With the MLflowLoggerCallback, Ray Tune will automatically log the hyperparameter configuration, results, and model checkpoints from each run in your experiment to MLflow.

from ray import tune
from ray.tune.integration.mlflow import MLflowLoggerCallback
tune.run(
    train_fn,
    config={
        # define search space here
        "parameter_1": tune.choice([1, 2, 3]),
        "parameter_2": tune.choice([4, 5, 6]),
    },
    callbacks=[MLflowLoggerCallback(
        experiment_name="experiment1",
        save_artifact=True)])

You can see below that Ray Tune runs many different training runs, each with a different hyperparameter configuration, all in parallel. These runs can all be seen on the MLflow UI, and on this UI, you can visualize any of your logged metrics. When the MLflow tracking server is remote, others can even access the results of your experiments and artifacts.

If you want to manage what information gets logged yourself rather than letting Ray Tune handle it for you, you can use the mlflow_mixin API.

Add a decorator to your training function to call any MLflow methods inside the function:


import mlflow
from ray.tune.integration.mlflow import mlflow_mixin

@mlflow_mixin
def train_fn(config):
    ...
    mlflow.log_metric(...)

You can check out the documentation here for full runnable examples and more information.

Ray Serve + MLflow Models

MLflow Models can be conveniently loaded as Python functions, which means that they can be served easily using Ray Serve. The desired version of your model can be loaded from a model checkpoint or from the MLflow Model Registry by specifying its Model URI. Here’s how this looks:

import ray
from ray import serve

import mlflow.pyfunc


class MLflowBackend:
    def __init__(self, model_uri):
        self.model = mlflow.pyfunc.load_model(model_uri=model_uri)

    async def __call__(self, request):
        return self.model.predict(request.data)


ray.init()
client = serve.start()


# This can be the same checkpoint that was saved by MLflow Tracking
model_uri = "/Users/ray_user/my_mlflow_model"
# Or you can load a model from the MLflow model registry
model_uri = "models:/my_registered_model/Production"
client.create_backend("mlflow_backend", MLflowBackend, model_uri)
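To actually route HTTP traffic to this backend with the same pre-1.2 Serve API used above, you would then create an endpoint for it; the endpoint name and route below are illustrative:

# Expose the MLflow-backed model over HTTP at /predict
client.create_endpoint("mlflow_endpoint", backend="mlflow_backend", route="/predict")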

Conclusion and outlook

Using Ray with MLflow makes it much easier to build distributed ML applications and take them to production. Ray Tune+MLflow Tracking delivers faster and more manageable development and experimentation, while Ray Serve+MLflow Models simplify deploying your models at scale.

Try running this example in the Databricks Community Edition (DCE) with this notebook. Note: This Ray Tune + MLflow extension has only been tested on DCE runtimes 7.5 and MLR 7.5.

What’s next

Give this integration a try by pip installing the latest Ray nightly wheels and mlflow. Or try this notebook on DCE. Also, stay tuned for a future deployment plugin that further integrates Ray Serve and MLflow Models.

For now you can:

Credits

Thanks to the respective Ray and MLflow team members from Anyscale and Databricks: Richard Liaw, Kai Fricke, Eric Liang, Simon Mo, Edward Oakes, Michael Galarnyk, Jules Damji, Sid Murching and Ankit Mathur.

--

Try Databricks for free. Get started today.

The post Ray & MLflow: Taking Distributed Machine Learning Applications to Production appeared first on Databricks.

Secure Cluster Connectivity Is Generally Available on Azure Databricks


This is a collaborative post co-authored by Principal Product Manager Premal Shah, Microsoft, and Principal Enterprise Readiness Manager Abhinav Garg, Databricks

We’re excited to announce the general availability of Secure Cluster Connectivity (also commonly known as No Public IP) on Azure Databricks. This release applies to Microsoft Azure Public Cloud and Azure Government regions, in both Standard and Premium pricing tiers. Hundreds of our global customers including large financial services, healthcare and retail organizations have already adopted the capability to enable secure and reliable deployments of the Azure Databricks unified data platform. It allows them to securely process company and customer data in private Azure Virtual Networks, thus satisfying a major requirement of their enterprise governance policies.

Secure Cluster Connectivity overview

An Azure Databricks workspace is a managed application on the Azure Cloud enabling you to realize enhanced security capabilities through a simple and well-integrated architecture. Secure Cluster Connectivity enables the following benefits:

  • No public IPs: There are no Public IP addresses for the nodes across all clusters in the workspace, thus eliminating the risk (or perception of it) of any direct public access. The two subnets required for a workspace are thus both private.
  • No open inbound ports: There are no open inbound ports for access from the Control Plane or from other Azure services in the Network Security Group of the workspace. All access from a cluster in the data plane is either outbound (see minimum required) or internal to the cluster. The outbound access includes the connectivity to the Secure Cluster Connectivity relay hosted in the control plane, which acts as the transit for all cluster administration tasks and for running the customer workloads. An egress device with a public IP address is needed per workspace for all such outbound traffic.
  • Increased reliability and scalability: Your data platform becomes more reliable and scalable for large and extra-large workloads, as there is no dependency on launching as many public IP addresses as cluster nodes and attaching them to the corresponding network interfaces.

At a high-level, the product architecture consists of a control/management plane and data plane. The control plane resides in a Microsoft-managed subscription and houses services such as web application, cluster manager, jobs service, etc. The data plane that is in the customer’s subscription consists of the Virtual Network (two subnets), Network Security Group and a root Azure storage account known as DBFS.

All traffic from and to Azure Databricks Control Plane traverses a reverse tunnel requiring no public IPs or open ports in the customer’s network

You can deploy a workspace with Secure Cluster Connectivity in either Managed-VNET or VNET Injection (also known as Bring Your Own VNET) mode, using the Azure Portal or any of the common automation options like ARM templates, the Azure CLI, etc.

  • With the default Managed-VNET deployment, a managed Azure NAT Gateway will be deployed as the default egress device for the workspace and will be attached to the managed subnets.
  • With the VNET Injection deployment, you should bring your own egress device, which could be your own-managed Azure NAT Gateway, Azure Firewall or a third-party appliance. You could also opt for a managed-outbound load balancer option for simpler deployments.

Getting started with Secure Cluster Connectivity

Get started with the enhanced security capabilities by deploying an Azure Databricks workspace with Secure Cluster Connectivity enabled using Azure Portal or ARM Template. Please refer to the following resources:

Please refer to Platform Security for Enterprises and Azure Databricks Security Baseline for a deeper view into how we bring a security-first mindset while building our popular first-party Azure service.

--

Try Databricks for free. Get started today.

The post Secure Cluster Connectivity Is Generally Available on Azure Databricks appeared first on Databricks.


How Lakehouses Solve Common Issues With Data Warehouses


Editor’s note: This is the first in a series of posts largely based on the CIDR paper Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics, with permission from the authors.

Data analysts, data scientists, and artificial intelligence experts are often frustrated with the fundamental lack of high-quality, reliable and up-to-date data available for their work. Some of these frustrations are due to known drawbacks of the two-tier data architecture we see prevalent in the vast majority of Fortune 500 companies today. The open lakehouse architecture and underlying technology can dramatically improve the productivity of data teams and thus the efficiency of the businesses employing them.

Challenges with the two-tier data architecture

In this popular architecture, data from across the organization is extracted from operational databases and loaded into a raw data lake, sometimes referred to as a data swamp due to the lack of care for ensuring this data is usable and reliable. Next, another ETL (Extract, Transform, Load) process is executed on a schedule to move important subsets of the data into a data warehouse for business intelligence and decision making.

Databricks data lakehouse architecture

This architecture gives data analysts a nearly impossible choice: use timely and unreliable data from the data lake or use stale and high-quality data from the data warehouse. Due to the closed formats of popular data warehousing solutions, it also makes it very difficult to use the dominant open-source data analysis frameworks on high-quality data sources without introducing another ETL operation and adding additional staleness.

We can do better

These two-tier data architectures, which are common in enterprises today, are highly complex for both the users and the data engineers building them, regardless of whether they’re hosted on-premises or in the cloud.

Lakehouse architecture reduces the complexity, cost and operational overhead by providing many of the reliability and performance benefits of the data warehouse tier directly on top of the data lake, ultimately eliminating the warehouse tier.

Databricks data lakehouse architecture

Data reliability

Data consistency is an incredible challenge when you have multiple copies of data to keep in sync. There are multiple ETL processes — moving data from operational databases to the data lake and again from the data lake into the data warehouse. Each additional process introduces additional complexity, delays and failure modes.

By eliminating the second tier, the data lakehouse architecture removes one of the ETL processes, while adding support for schema enforcement and evolution directly on top of the data lake. It also supports features like time travel to enable historic validation of data cleanliness.

Data staleness

Because the data warehouse is populated from the data lake, it is often stale. This forces 86% of analysts to use out-of-date data, according to a recent Fivetran survey.

Fivetran report, “Data Analysts: A Critical, Underutilized Resource.”
While eliminating the data warehouse tier solves this problem, a lakehouse can also support efficient, easy and reliable merging of real-time streaming plus batch processing, to ensure the most up-to-date data is always being used for analysis.

Limited support for advanced analytics

Advanced analytics, including machine learning and predictive analytics, often requires processing very large datasets. Common tooling, such as TensorFlow, PyTorch and XGBoost, makes it easy to read the raw data lakes in open data formats. However, these tools won’t read most of the proprietary data formats used by the ETL’d data in the data warehouses. Warehouse vendors thus recommend exporting this data to files for processing, resulting in a third ETL step plus increased complexity and staleness.

Alternatively, in the open lakehouse architecture, these common toolsets can operate directly on high-quality, timely data stored in the data lake.

Total cost of ownership

While storage costs in the cloud are declining, this two-tier architecture for data analytics actually has three online copies of much of the enterprise data: one in the operational databases, one in the data lake, and one in the data warehouse.

The total cost of ownership (TCO) is further compounded when you add the significant engineering costs associated with keeping the data in sync to storage costs.

The data lakehouse architecture eliminates one of the most expensive copies of the data, as well as at least one associated synchronization process.

What about performance for business intelligence?

Business intelligence and decision support require high-performance execution of exploratory data analysis (EDA) queries, as well as queries powering dashboards, data visualizations and other critical systems. Performance concerns were often the reason companies maintained a data warehouse in addition to a data lake. Technology for optimizing queries on top of data lakes has improved immensely over the past year, making most of these performance concerns moot.

Lakehouses provide support for indexing, locality controls, query optimization and hot data caching to improve performance. This results in data lake SQL performance that exceeds leading cloud data warehouses on TPC-DS, while also providing the flexibility and governance expected of data warehouses.

Conclusion and next steps

Forward-leaning enterprises and technologists have looked at the two-tier architecture being used today and said: “there has to be a better way.” This better way is what we call the open data lakehouse, which combines the openness and flexibility of the data lake with the reliability, performance, low latency, and high concurrency of traditional data warehouses.

I’ll cover more detail on improvements in data lake performance in an upcoming post of this series.

Of course, you can cheat and skip ahead by reading the complete CIDR paper, or watching a video series diving into the underlying technology supporting the modern lakehouse.

--

Try Databricks for free. Get started today.

The post How Lakehouses Solve Common Issues With Data Warehouses appeared first on Databricks.

Automatically Evolve Your Nested Column Schema, Stream From a Delta Table Version, and Check Your Constraints


We recently announced the release of Delta Lake 0.8.0, which introduces schema evolution and performance improvements in merge and operational metrics in table history. The key features in this release are:

  • Unlimited MATCHED and NOT MATCHED clauses for merge operations in Scala, Java, and Python. Merge operations now support any number of whenMatched and whenNotMatched clauses. In addition, merge queries that unconditionally delete matched rows no longer throw errors on multiple matches. This will be supported using SQL with Spark 3.1. See the documentation for details.
  • MERGE operation now supports schema evolution of nested columns. Schema evolution of nested columns now has the same semantics as that of top-level columns. For example, new nested columns can be automatically added to a StructType column. See Automatic schema evolution in Merge for details.
  • MERGE INTO and UPDATE operations now resolve nested struct columns by name, meaning that when comparing or assigning columns of type StructType, the order of the nested columns does not matter (exactly in the same way as the order of top-level columns). To revert to resolving by position, set the following Spark configuration to false: spark.databricks.delta.resolveMergeUpdateStructsByName.enabled.
  • Check constraints on Delta tables. Delta now supports CHECK constraints. When supplied, Delta automatically verifies that data added to a table satisfies the specified constraint expression. To add CHECK constraints, use the ALTER TABLE ADD CONSTRAINTS command. See the documentation for details.
  • Start streaming a table from a specific version (#474). When using Delta as a streaming source, you can use the options startingTimestamp or startingVersion to start processing the table from a given version and onwards. You can also set startingVersion to latest to skip existing data in the table and stream from the new incoming data. See the documentation for details.
  • Ability to perform parallel deletes with VACUUM (#395). When using `VACUUM`, you can set the session configuration spark.databricks.delta.vacuum.parallelDelete.enabled to true in order to use Spark to perform the deletion of files in parallel (based on the number of shuffle partitions). See the documentation for details, and see the short sketch after this list.
  • Use Scala implicits to simplify read and write APIs. You can import io.delta.implicits._ to use the `delta` method with Spark read and write APIs such as spark.read.delta(“/my/table/path”). See the documentation for details.
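As promised above, here is a minimal sketch of the parallel VACUUM setting; the table name reuses the espresso example from later in this post:

# Enable parallel deletion of files during VACUUM for this Spark session
spark.conf.set("spark.databricks.delta.vacuum.parallelDelete.enabled", "true")

# Remove files that are no longer referenced by the Delta table (default retention applies)
spark.sql("VACUUM espresso")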

In addition, we also highlight that you can now read a Delta table without using Spark via the Delta Standalone Reader and Delta Rust API. See Use Delta Standalone Reader and the Delta Rust API to query your Delta Lake without Apache Spark™ to learn more.

Automatically evolve your nested column schema

As noted in previous releases, Delta Lake includes the ability to:

With Delta Lake 0.8.0, you can automatically evolve nested columns within your Delta table with UPDATE and MERGE operations.

Let’s showcase this by using a simple coffee espresso example. We will create our first Delta table using the following code snippet.

# espresso1 JSON string
json_espresso1 = [ ... ]

# create RDD
espresso1_rdd = sc.parallelize(json_espresso1)

# read JSON from RDD
espresso1 = spark.read.json(espresso1_rdd)

# Write Delta table
espresso1.write.format("delta").save(espresso_table_path)

The following is a view of the espresso table:
DataFrame table in Delta Lake 8.0.0

The following code snippet creates the espresso_updates DataFrame:

# Create DataFrame from JSON string
json_espresso2 = [...]
espresso2_rdd = sc.parallelize(json_espresso2)
espresso2 = spark.read.json(espresso2_rdd)
espresso2.createOrReplaceTempView("espresso_updates")

with this table view:
DataFrame table in Delta Lake 0.8.0

Observe that the espresso_updates DataFrame has a different coffee_profile column, which includes a new flavor_notes column.

# espresso Delta Table `coffee_profile` schema
 |-- coffee_profile: struct (nullable = true)
 |    |-- temp: double (nullable = true)
 |    |-- weight: double (nullable = true)
# espresso_updates DataFrame `coffee_profile` schema
 |-- coffee_profile: struct (nullable = true)
 |    |-- flavor_notes: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- temp: double (nullable = true)
 |    |-- weight: double (nullable = true)

To run a MERGE operation between these two tables, run the following Spark SQL code snippet:

MERGE INTO espresso AS t
USING espresso_updates u 
   ON u.espresso_id = t.espresso_id
 WHEN MATCHED THEN
      UPDATE SET *
 WHEN NOT MATCHED THEN
      INSERT *

By default, this snippet will fail with the following error, since the coffee_profile columns between espresso and espresso_updates are different.

Error in SQL statement: AnalysisException:
Cannot cast struct<temp:double,weight:double> to struct<flavor_notes:array<string>,temp:double,weight:double>.
All nested columns must match.

AutoMerge to the rescue

To work around this issue, enable autoMerge using the below code snippet; the espresso Delta table will automatically merge the two tables with different schemas including nested columns.

-- Enable automatic schema evolution
SET spark.databricks.delta.schema.autoMerge.enabled=true;

In a single atomic operation, MERGE performs the following:

  • UPDATE: espresso_id = 100 has been updated with the new flavor_notes from the espresso_changes DataFrame.
  • No changes: espresso_id = (101, 102) rows are left unchanged, as appropriate.
  • INSERT: espresso_id = 103 is a new row that has been inserted from the espresso_changes DataFrame.

Tabular View displaying nested columns of the coffee_profile column.

Simplify read and write APIs with Scala Implicits

You can import io.delta.implicits._ to use the delta method with Spark read and write APIs such as spark.read.delta("/my/table/path"). See the documentation for details.

// Traditionally, to read the Delta table using Scala, you would execute the following
spark 
  .read 
  .format("delta")
  .load("/tmp/espresso/")
  .show()
// With Scala implicits, the format is a little simpler
import io.delta.implicits._
spark
  .read 
  .delta("/tmp/espresso/")
  .show()

Check Constraints

You can now add CHECK constraints to your tables, which not only validate existing data but are also enforced on future data modifications. For example, to ensure that espresso_id >= 100, run this SQL statement:

-- Ensure that espresso_id >= 100
-- This constraint will both check and enforce future modifications of data to your table
ALTER TABLE espresso ADD CONSTRAINT idCheck CHECK (espresso_id >= 100);
-- Drop the constraint from the table if you do not need it
ALTER TABLE espresso DROP CONSTRAINT idCheck;

The following constraint will fail as the `milk-based_espresso` column has both True and False values.

-- Check if the column has only True values; NOTE, this constraint will fail.
ALTER TABLE espresso ADD CONSTRAINT milkBaseCheck CHECK (`milk-based_espresso` IN (True));
-- Error output
Error in SQL statement: AnalysisException: 1 rows in profitecpro.espresso violate the new CHECK constraint (`milk-based_espresso` IN ( True ))

The addition or dropping of CHECK constraints will also appear in the transaction log (via DESCRIBE HISTORY espresso) of your Delta table, with the operationParameters articulating the constraint.


Tabular View displaying the constraint operations within the transaction log history

Start streaming a table from a specific version

When using Delta as a streaming source, you can use the options startingTimestamp or startingVersion to start processing the table from a given version and onwards. You can also set startingVersion to latest to skip existing data in the table and stream from the new incoming data. See the documentation for details.

Within the notebook, we will generate an artificial stream:

# Generate artificial stream
stream_data = spark.readStream.format("rate").option("rowsPerSecond", 500).load()

And then generate a new Delta table using this code snippet:

stream = stream_data \
            .withColumn("second", second(col("timestamp"))) \
            .writeStream \
            .format("delta") \
            .option("checkpointLocation", "...") \
            .trigger(processingTime = "2 seconds") \
            .start("/delta/iterator_table")

The code in the notebook will run the stream for approximately 20 seconds to create the following iterator table with the below transaction log history. In this case, this table has 10 transactions.

-- Review history by table path
DESCRIBE HISTORY delta.`/delta/iterator_table/`
-- OR review history by table name
DESCRIBE HISTORY iterator_table;

Tabular View displaying the iterator table transaction log history

Review iterator output

The iterator table has 10 transactions over a duration of approximately 20 seconds. To view this data over a duration, we will run the next SQL statement, which calculates the timestamp of each insert into the iterator table rounded to the second (ts). Note that the value of ts = 0 is the minimum timestamp, and we want to bucket by duration (ts) via a GROUP BY, running the following:

SELECT ts, COUNT(1) as cnt
  FROM (
     SELECT value, (second - min_second) AS ts
       FROM (
          SELECT * FROM iterator_table CROSS JOIN (SELECT MIN(second) AS min_second FROM iterator_table) x
       ) y
  ) z
 GROUP BY ts
 ORDER BY ts

The preceding statement produces this bar graph with time buckets (ts) by row count (cnt).

Notice for the 20-second stream write performed with ten distinct transactions, there are 19 distinct time-buckets.

Start the Delta stream from a specific version

Using .option("startingVersion", "6"), we can specify the version of the table from which we want to start our readStream (inclusive).

# Start the readStream using startingVersion 
reiterator = spark.readStream.format("delta").option("startingVersion", "6").load("/delta/iterator_table/")
# Create a temporary view against the stream
reiterator.createOrReplaceTempView("reiterator")

The following graph is generated by re-running the previous SQL query against the new reiterator table.

Notice for the reiterator table, there are 10 distinct time-buckets, as we’re starting from a later transaction version of the table.

Get Started with Delta Lake 0.8.0

Try out Delta Lake with the preceding code snippets on your Apache Spark 3.1 (or greater) instance (on Databricks, try this with DBR 8.0+). Delta Lake makes your data lakes more reliable, whether you are creating a new one or migrating an existing data lake. To learn more, refer to https://delta.io/, and join the Delta Lake community via Slack and the Google Group. You can track all the upcoming releases and planned features in GitHub milestones and try out Managed Delta Lake on Databricks with a free account.

Credits

We want to thank the following contributors for updates, doc changes, and contributions in Delta Lake 0.8.0: Adam Binford, Alan Jin, Alex Liu, Ali Afroozeh, Andrew Fogarty, Burak Yavuz, David Lewis, Gengliang Wang, HyukjinKwon, Jacek Laskowski, Jose Torres, Kian Ghodoussi, Linhong Liu, Liwen Sun, Mahmoud Mahdi, Maryann Xue, Michael Armbrust, Mike Dias, Pranav Anand, Rahul Mahadev, Scott Sandre, Shixiong Zhu, Stephanie Bodoff, Tathagata Das, Wenchen Fan, Wesley Hoffman, Xiao Li, Yijia Cui, Yuanjian Li, Zach Schuermann, contrun, ekoifman, and Yi Wu.

--

Try Databricks for free. Get started today.

The post Automatically Evolve Your Nested Column Schema, Stream From a Delta Table Version, and Check Your Constraints appeared first on Databricks.

Accelerating ML Experimentation in MLflow


This fall, I interned with the ML team, which is responsible for building the tools and services that make it easy to do machine learning on Databricks. During my internship, I implemented several ease-of-use features in MLflow, an open-source machine learning lifecycle management project, and made enhancements to the Reproduce Run capability on the Databricks ML Platform. This blog post walks through some of my most impactful projects and the benefits they offer Databricks customers.

Autologging improvements

MLflow autologging automatically tracks machine learning training sessions, recording valuable parameters, metrics, and model artifacts.

MLflow autologging, which was introduced last year, offers an easy way for data scientists to automatically track relevant metrics and parameters when training machine learning (ML) models by simply adding two lines of code. During the first half of my internship, I made several enhancements to the autologging feature.

Input examples and model signatures

As a starter project, I worked to implement input example and model signature support for MLflow’s XGBoost and LightGBM integrations. The input example is a snapshot of model input for inference. The model signature defines the input and output fields and types, providing input schema verification capabilities for batch and real-time model scoring. Together, these attributes enrich autologged models, enabling ML practitioners across an organization to easily interpret and integrate them with production applications.
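
As a rough illustration (not code from the original post), the snippet below enables XGBoost autologging so that an input example and a model signature are captured with the logged model; the log_input_examples and log_model_signatures flag names are assumptions and may differ across MLflow versions:

# Hedged sketch: enable XGBoost autologging with input example and signature capture.
# The flag names below are assumptions and may differ across MLflow versions.
import mlflow
import xgboost as xgb
from sklearn.datasets import load_diabetes

mlflow.xgboost.autolog(log_input_examples=True, log_model_signatures=True)

X, y = load_diabetes(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

with mlflow.start_run():
    # Parameters, metrics, the input example, and the model signature are logged automatically.
    xgb.train(params={"objective": "reg:squarederror"}, dtrain=dtrain, num_boost_round=10)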

Efficiently measuring training progress

Next, I expanded iteration/epoch logging support in MLflow autologging. During training, a model goes through many iterations to improve its accuracy. If training takes many hours, it is helpful to track performance metrics, such as accuracy, throughout the training process to ensure that it is proceeding as expected.

Simultaneously, it is also important to ensure that collecting these performance metrics does not slow down the training process. Since each call to our logging API is a network call, naively logging on each iteration means the network latency can easily add up to a significant chunk of time.

We prototyped several solutions to balance ease-of-use, performance, and code complexity. Initially, we experimented with a multithreaded approach in which training occurs in the main thread and logging is executed in a parallel thread. However, during prototyping, we observed that the performance benefit from this approach was minimal in comparison to the implementation complexity.

We ultimately settled on a time-based approach, executing both the training and logging in the same thread. With this approach, MLflow measures time spent on training and logging, only logging metrics when the time spent on training reaches 10x the time spent on logging. This way, if each iteration takes a long time, MLflow logs metrics for every iteration since the logging time is negligible compared to the training time. In contrast, if each iteration is fast, MLflow stores the iteration results and logs them as one bundle after a few training iterations. In both cases, training progress can be observed in near-real time, with an additional latency overhead of no more than 10%.
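
The batching logic can be illustrated with a simplified, standalone sketch; this is not MLflow's actual implementation, just a paraphrase of the 10x rule described above, and the helper names are hypothetical:

import time

def train_with_batched_logging(iterations, run_iteration, log_metrics, ratio=10):
    # Toy sketch: buffer per-iteration metrics and flush them only once the training
    # time accumulated since the last flush is at least `ratio` times the cost of a flush.
    buffer = []
    train_time_since_flush = 0.0
    last_flush_cost = 0.0

    for step in range(iterations):
        start = time.time()
        metrics = run_iteration(step)        # one training iteration, returns a metrics dict
        train_time_since_flush += time.time() - start
        buffer.append((step, metrics))

        if train_time_since_flush >= ratio * last_flush_cost:
            flush_start = time.time()
            for s, m in buffer:
                log_metrics(m, step=s)       # e.g. a network-bound call such as mlflow.log_metrics
            last_flush_cost = time.time() - flush_start
            buffer.clear()
            train_time_since_flush = 0.0

    # Flush whatever remains at the end of training.
    for s, m in buffer:
        log_metrics(m, step=s)

In MLflow itself this batching happens transparently inside autologging; the sketch only makes the timing tradeoff explicit.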

MLflow logging optimizations help to reduce latency for both short and long training lifecycles.

Left: When training iterations are short, we batch metrics together and log them after several iterations have completed. Right: When training iterations are longer, we log metrics after each iteration so that progress can be tracked. Both cases avoid imposing significant latency overhead.

Universal autolog

Finally, I introduced a universal mlflow.autolog() API to further simplify ML instrumentation. This unified API enables autologging for all supported ML library integrations, eliminating the need to add a separate API call for each library used in the training process.
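
For instance, a minimal sketch with scikit-learn (any supported library used in the training code would be picked up the same way):

import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# One call enables autologging for every supported library used below.
mlflow.autolog()

X, y = load_diabetes(return_X_y=True)
with mlflow.start_run():
    # Parameters, metrics, and the fitted model are logged automatically.
    RandomForestRegressor(n_estimators=20).fit(X, y)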

Software environment reproducibility

The performance and characteristics of an ML model depend heavily on the software environment (specific libraries and versions) where it is trained. To help Databricks users replicate their ML results more effectively, I added library support to the ‘Reproduce Run’ feature.

Databricks now stores information about the installed libraries when an MLflow Run is created. When a user wants to replicate the environment used to train a model, they can click ‘Reproduce Run’ from the MLflow Run UI to create a new cluster with the same compute resources and libraries as the original training session.
 
In MLflow, clicking ‘Reproduce Run’ from the MLflow Run UI creates a new cluster with the same compute resources and libraries as the original training session.
 
Clicking “Reproduce Run” opens a dialog modal, allowing the user to inspect the compute resources and libraries that will be reinstalled to reproduce the run. After clicking “Confirm,” the notebook is seamlessly cloned and attached to a Databricks cluster with the same compute resources and libraries as the one used to train the model.
 
In MLflow, clicking “Reproduce Run” opens a dialog modal allowing the user to inspect the compute resources and libraries that will be reinstalled in order to reproduce the run.
 
Engineering this feature involved working across the entire stack. The majority of time was spent on backend work, where I had to coordinate communication between several microservices to create the new cluster and reinstall the libraries on it. It was also interesting to learn about React and Redux when implementing the UI based on the design team’s mockups.

Conclusion

These sixteen weeks at Databricks have been an amazing experience. What really stood out to me was that I truly owned each of my features. I brought each feature through the entire product cycle, including determining user requirements, implementing an initial prototype, writing a design document, conducting a design review, and applying all this feedback to the prototype to implement, test, and ship the final polished feature. Furthermore, everyone at Databricks was awesome to work with and happy to help out, whether with career advice or with feedback about the features I was working on. Special thanks to my mentor Corey Zumar and manager Paul Ogilvie for answering my endless questions, and thanks to everyone at Databricks for making the final internship of my undergrad the best yet!

Visit the Databricks Career page to learn more about upcoming internships and other career opportunities across the company.

--

Try Databricks for free. Get started today.

The post Accelerating ML Experimentation in MLflow appeared first on Databricks.

Amplify Insights into Your Industry With Geospatial Analytics


Data science is becoming commonplace, and most companies are leveraging analytics and business intelligence to help make data-driven business decisions. But are you supercharging your analytics and decision-making with geospatial data? Location intelligence, and specifically geospatial analytics, can help uncover important regional trends and behaviors that impact your business. This goes beyond looking at location data aggregated by zip codes, which, interestingly, are not a good representation of geographic boundaries in the US and other parts of the world.

Are you a retailer who’s trying to figure out where to set up your next store or understand foot traffic that your competitors are getting in the same neighborhood? Or are you looking at real estate trends in the region to guide your next best investment? Do you deal with logistics and supply chain data and have to determine where the warehouses and fuel stops are located? Or do you need to identify network or service hot spots so you can adjust supply to meet demand? These use cases all have one point in common — you can run a point-in-polygon operation to associate these latitude and longitude coordinates to their respective geographic geometries.

Technical implementation

The usual way of implementing a point-in-polygon operation would be to use a SQL function like st_intersects or st_contains from PostGIS, the open-source geographic information system (GIS) project. You could also use Apache Spark™ packages like Apache Sedona (previously known as GeoSpark) or GeoMesa that offer similar functionality executed in a distributed manner, but these functions typically involve an expensive geospatial join that takes a while to run. In this blog post, we will look at how H3 can be used with Spark to accelerate a large point-in-polygon problem, arguably one of the most common geospatial workloads.

We introduced Uber’s H3 library in a past blog post. As a recap, H3 is a geospatial grid system that approximates geo features such as polygons or points with a fixed set of identifiable hexagonal cells. This can help scale large or computationally expensive big data workloads.
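
To make this concrete, here is a small sketch using the h3 package from PyPI (assuming the v3 Python API; the coordinates and polygon are arbitrary examples):

import h3

# Index a point (latitude, longitude) at resolution 7; nearby points share the same cell.
cell = h3.geo_to_h3(40.7128, -74.0060, 7)

# Approximate a polygon (GeoJSON-style, [lng, lat] ordering) with the resolution-7 cells covering it.
polygon = {
    "type": "Polygon",
    "coordinates": [[[-74.02, 40.70], [-73.98, 40.70], [-73.98, 40.74], [-74.02, 40.74], [-74.02, 40.70]]],
}
cells = h3.polyfill(polygon, 7, geo_json_conformant=True)

# A point is (approximately) inside the polygon if its cell is in the polygon's cell set.
print(cell in cells)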

In our example, the WKT dataset that we are using contains MultiPolygons that may not work well with H3’s polyfill implementation. To ensure that our pipeline returns accurate results, we will need to split the MultiPolygons into individual Polygons.

Converting MultiPolygons to individual Polygons before the join ensures the most accurate results when using the H3 grid system for geospatial analysis.

“SFA MultiPolygon” by Mwtoews is licensed under CC BY-SA 3.0


%scala
// Note: Geometry columns and functions such as st_numGeometries assume a spatial
// Spark package (e.g., GeoMesa Spark JTS) has been set up in the notebook.
import org.apache.spark.sql.functions._
import org.locationtech.jts.geom.Geometry
import scala.collection.mutable.ArrayBuffer

// Split a geometry into its individual component geometries
def getPolygon = udf((geometry: Geometry) => {
  val numGeometries = geometry.getNumGeometries()
  val polygonArrayBuffer = ArrayBuffer[Geometry]()
  for (geomIter <- 0 until numGeometries) {
    polygonArrayBuffer += geometry.getGeometryN(geomIter)
  }
  polygonArrayBuffer
})

val wktDF_polygons = wktDF.withColumn("num_polygons", st_numGeometries(col("the_geom")))
                            .withColumn("polygon_array", getPolygon(col("the_geom")))
                            .withColumn("polygon", explode($"polygon_array"))

After splitting the polygons, the next step is to create functions that define an H3 index for both your points and polygons. To scale this with Spark, you need to wrap your Python or Scala functions into Spark UDFs.
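
The Scala snippet below relies on geoToH3 and multiPolygonToH3 UDFs defined in the accompanying notebook. As a hypothetical Python equivalent of the point-indexing function (assuming the h3 v3 Python API), a UDF could look like this:

from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import StringType
import h3

# Hypothetical PySpark counterpart of geoToH3; h3-py returns the cell id directly as a hex string.
@udf(returnType=StringType())
def geo_to_h3_udf(lat, lng, resolution):
    if lat is None or lng is None:
        return None
    return h3.geo_to_h3(lat, lng, resolution)

# Example usage (column names follow the Scala snippet below):
# points = df.withColumn("h3index", geo_to_h3_udf(col("pickup_latitude"), col("pickup_longitude"), lit(7)))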


%scala
val res = 7 //the resolution of the H3 index, 1.2km
val points = df
    .withColumn("h3index", hex(geoToH3(col("pickup_latitude"), col("pickup_longitude"), lit(res))))

points.createOrReplaceTempView("points")

val polygons = wktDF
    .withColumn("h3index", multiPolygonToH3(col("the_geom"), lit(res)))
    .withColumn("h3", explode($"h3index"))
    .withColumn("h3", hex($"h3"))

polygons.createOrReplaceTempView("polygons")

H3 supports resolutions 0 to 15, with 0 being a hexagon with an edge length of about 1,107 km and 15 being a fine-grained hexagon with an edge length of about 50 cm. You should ideally pick a resolution fine-grained enough that each unique polygon in your dataset is covered by multiple hexagons. In this example, we go with resolution 7.

The H3 grid system for geospatial analysis supports resolutions 0-15.

One thing to note here is that using H3 for a point-in-polygon operation will give you approximated results and we are essentially trading off accuracy for speed. Choosing a coarse-grained resolution may mean that you lose some accuracy at the polygon boundary, but your query will run really quickly. Picking a fine-grained resolution will give you better accuracy but will also increase the computational cost of the upcoming join query since you will have many more unique hexagons to join on. Picking the right resolution is a bit of an art, and you should consider how exact you need your results to be. Considering that your GPS points may not be that precise, perhaps forgoing some accuracy for speed is acceptable.

With the points and polygons indexed with H3, it’s time to run the join query. Instead of running a spatial command like st_intersects or st_contains here, which would trigger an expensive spatial join, you can now run a simple Spark inner join on the H3 index column. Your point-in-polygon query can now run in the order of minutes on billions of points and thousands or millions of polygons.


%sql
SELECT *
FROM
    Points p
    INNER JOIN
    Polygons s
    ON p.h3 = s.h3

If you require more accuracy, another possible approach is to leverage the H3 index to reduce the number of rows passed into the geospatial join. Your query would look something like this, where the st_intersects() or st_contains() command would come from third-party packages like Apache Sedona (GeoSpark) or GeoMesa:


%sql
SELECT * 
FROM 
    points p 
    INNER JOIN
    shape s 
    ON p.h3 = s.h3
WHERE st_intersects(st_makePoint(p.pickup_longitude, p.pickup_latitude), s.the_geom);    

Potential optimizations

It’s common to run into data skews with geospatial data. For example, you might receive more cell phone GPS data points from urban areas compared to sparsely populated areas. This means that there may be certain H3 indices that have way more data than others, and this introduces skew in our Spark SQL join. This is true as well for the dataset in our notebook example where we see a huge number of taxi pickup points in Manhattan compared to other parts of New York. We can leverage skew hints here to help with the join performance.

First, determine what your top H3 indices are.


display(points.groupBy("h3").count().orderBy($"count".desc))

Then, re-run the join query with a skew hint defined for your top index or indices. You could also try broadcasting the polygon table if it’s small enough to fit in the memory of your worker node.


SELECT /*+ SKEW('points_with_id_h3', 'h3', ('892A100C68FFFFF')), BROADCAST(polygons) */ 
*
FROM
    points p
    INNER JOIN
    Polygons s
    ON p.h3 = s.h3

Also, don’t forget to have the table with more rows on the left side of the join. This reduces shuffle during the join and can greatly improve performance.

Do note that with Spark 3.0’s new Adaptive Query Execution (AQE), the need to manually broadcast or optimize for skew would likely go away. If your favorite geospatial package supports Spark 3.0 today, do check out how you could leverage AQE to accelerate your workloads!
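
On Spark 3.0+, the relevant AQE settings can be toggled directly (note that on recent Databricks runtimes some of these may already be enabled by default):

# Enable Adaptive Query Execution and its automatic skew-join handling (Spark 3.0+).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")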

Data visualization

A good way to visualize H3 hexagons is to use Kepler.gl, which was also developed by Uber. There's a PyPI library for Kepler.gl that you can leverage within your Databricks notebook. Refer to this notebook example if you're interested in giving it a try.

The Kepler.gl library runs on a single machine, which means you will need to sample down large datasets before visualizing them. You can create a random sample of the results of your point-in-polygon join, convert it into a pandas DataFrame, and pass that into Kepler.gl.
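
A rough sketch of that flow (joined_df and the sampling fraction are placeholders for the output of your point-in-polygon join):

from keplergl import KeplerGl

# Sample the join output down to something a single machine can render, then hand it to Kepler.gl.
sample_pdf = joined_df.sample(fraction=0.001).toPandas()   # joined_df is a placeholder name

kepler_map = KeplerGl(height=600)
kepler_map.add_data(data=sample_pdf, name="pickup_points")
kepler_map.save_to_html(file_name="pickup_points_map.html")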

Visualizing points, polygons, and H3 hexagons on a Kepler.gl map within a Databricks notebook.

Now you can explore your points, polygons, and hexagon grids on a map within a Databricks notebook. This is a great way to verify the results of your point-in-polygon mapping as well!

Try the notebook

Please Note: The notebook may not display correctly when viewed in the browser. For best results, please download and run it in your Databricks Workspace.

--

Try Databricks for free. Get started today.

The post Amplify Insights into Your Industry With Geospatial Analytics appeared first on Databricks.

Azure Databricks Achieves DoD Impact Level 5 (IL5) on Microsoft Azure Government


We are excited to announce that Azure Databricks has received a Provisional Authorization (PA) from the Defense Information Systems Agency (DISA) at Impact Level 5 (IL5), as published in the Department of Defense Cloud Computing Security Requirements Guide (DoD CC SRG). The authorization closely follows our FedRAMP High authorization and further validates Azure Databricks security and compliance for higher-sensitivity Controlled Unclassified Information (CUI), mission-critical information and national security systems across a wide variety of data analytics and AI use cases.

Federal, state and local U.S. government agencies such as the U.S. Department of Veterans Affairs (VA), Centers for Medicare and Medicaid Services (CMS), Department of Transportation (DOT), the City of Spokane and DC Water trust Azure Databricks for their critical data and artificial intelligence (AI) needs. Databricks maintains the highest level of data security by incorporating industry-leading best practices into our cybersecurity program. The DoD IL5 provisional authorization provides customers the assurance that Azure Databricks meets U.S. Department of Defense security and compliance requirements to support sensitive analytics and data science use cases.

“Numerous federal agencies are looking to build cloud data lakes and leverage Delta Lake for a complete and consistent view of all their data,” said Kevin Davis, VP, Public Sector at Databricks. “The power of data and AI are being used to dramatically enhance public services, lower costs and improve quality of life for citizens. Using Azure Databricks, government agencies have aggregated hundreds of data sources to improve citizen outreach, automated processing of hourly utility infrastructure IoT data for predictive maintenance, deployed machine learning (ML) models to predict patient needs and built dashboards to predict transportation needs and optimize logistics. DoD IL5 provisional authorization for Azure Databricks further enables federal agencies to analyze all of their data for improved decision making and more accurate predictions.”

With this provisional authorization, the Pentagon, federal agencies and contractors can now use Azure Databricks to process the most sensitive unclassified, mission-critical and national security data in cloud computing environments, including data related to national security and the protection of life and financial assets. Azure Databricks enables organizations across industries to accelerate innovation while minimizing risk when working with highly sensitive private sector and public sector data.

See an overview of Microsoft Azure compliance offerings and the list of Azure services in FedRAMP and DoD CC SRG audit scope. Learn more about Impact Level 5 (IL5) by reading Microsoft Documentation.

“Azure Databricks helps customers address security and compliance requirements for regulated public sector use cases, such as immunization, chronic disease prevention, transportation, weather, and financial and economic risk analytics,” said David Cook, Chief Information Security Officer at Databricks. “The DoD IL5 provisional authorization validates Azure Databricks security controls and monitoring in accordance with the DoD CC SRG. We are pleased to demonstrate our commitment to security and compliance with the DoD IL5 provisional authorization on Microsoft Azure Government.”

Impact Level 5 (IL5) provisional authorization enables government agencies to securely use cloud services to analyze sensitive data such as insurance statements, financial records and healthcare claims to improve processing times, lower operating costs and reduce claims fraud. For example, government agencies and their vendors can analyze large geospatial datasets from GPS satellites, cell towers, ships and autonomous platforms for marine mammal and fish population assessments, highway construction, disaster relief, and population health.

View the Azure Databricks DoD Impact Level 5 (IL5) provisional authorization and other security compliance documentation

You can view related authorizations (visit the Azure compliance documentation) and details on all Microsoft Azure services, including Azure Databricks at the Microsoft Azure compliance offerings documentation and the list of Azure services in FedRAMP and DoD CC SRG audit scope. Learn more about DoD IL5 by viewing the Azure DoD IL5 provisional authorization documentation and the Azure Databricks isolation guidelines for Impact Level 5 workloads.

As always, we welcome your feedback and questions and commit to helping customers achieve and maintain the highest standard of security and compliance. Please feel free to reach out to the team through Microsoft Azure Support.

Learn more by attending the Azure Databricks Government Forum, as well as future Azure Databricks events. Follow us on Twitter and LinkedIn for more Azure Databricks security and compliance news, customer highlights, and new feature announcements.

--

Try Databricks for free. Get started today.

The post Azure Databricks Achieves DoD Impact Level 5 (IL5) on Microsoft Azure Government appeared first on Databricks.
