
Announcing a New Redash Connector for Databricks


We’re happy to introduce a new, open source connector for Redash, a cloud-based SQL analytics service, that makes it easy to query data lakes with Databricks. Data analyst teams often struggle with stale and partial data that compromises the quality of their work; they want to connect to the most complete and recent data available in their data lakes, but their tools are rarely optimized for querying data lakes. We’ve been working with the Redash team on a new Databricks connector that makes it easy for analysts to run SQL queries and build dashboards directly against data lakes, including open source Delta Lake architectures, for better decision making and insights.

Redash and Databricks Overview

Redash is an open source SQL-based service for analytics and dashboard visualizations. It offers a SQL editor interface for browsing the data schema, building queries, and viewing results that will feel familiar to anyone who has worked with a relational database. Queries can be easily converted into visualizations for quick insights, connected to alerts to notify on specific data events, or managed by API for automated workflows. Live web-based reports can be shared, refreshed, and modified by other teams for easy collaboration. These capabilities are driven by the Redash open source community, with over 300 contributors and more than 7,000 open source deployments of Redash globally.

With the new, performance-optimized and open source connector, Redash offers fast and easy data connectivity to Databricks for querying your data lake, including Delta Lake architectures. Delta Lake adds an open source storage layer to data lakes to improve reliability and performance, with data quality guarantees like ACID transactions, schema enforcement, and time travel, for both streaming and batch data in cloud blob storage. An optimized Spark SQL runtime running on scalable cloud infrastructure provides a powerful, distributed query engine for these large volumes of data. Together, these storage and compute layers on Databricks ensure data teams get reliable SQL queries and fast visualizations with Redash.

How to get started with the Redash Connector

You can connect Redash to Databricks in minutes. After creating a free Redash account, you’re prompted to connect to a “New Data Source”. Select “Databricks” as the data source from the menu of available options.

Easily select Databricks as a data source from the Redash interface.

The next screen prompts you for the necessary configuration details to securely connect to Databricks. You can check out the documentation for more details.

Securely connecting to Databricks using the Redash connector.

With the Databricks data source connected, you can now run SQL queries on Delta Lake tables as if they were any other relational data source, and quickly visualize the query results.

With the Redash connector, you can use Redash to run SQL queries on Delta Lake tables as if they were any other relational data source, and quickly visualize the query results.

Recap

Combining Redash’s open source SQL analytics capabilities with Databricks’ open data lakes gives SQL analysts an easy and powerful way to edit queries and create visualizations and dashboards directly on the organization’s most recent and complete data. Learn more from the Redash documentation.

--

Try Databricks for free. Get started today.

The post Announcing a New Redash Connector for Databricks appeared first on Databricks.


Adaptive Query Execution: Speeding Up Spark SQL at Runtime


This is a joint engineering effort between the Databricks Apache Spark engineering team — Wenchen Fan, Herman van Hovell and MaryAnn Xue — and the Intel engineering team — Ke Jia, Haifeng Chen and Carson Wang.

See the AQE notebook to demo the solution covered below

Over the years, there’s been an extensive and continuous effort to improve Spark SQL’s query optimizer and planner in order to generate high-quality query execution plans. One of the biggest improvements is the cost-based optimization framework that collects and leverages a variety of data statistics (e.g., row count, number of distinct values, NULL values, max/min values, etc.) to help Spark choose better plans. Examples of these cost-based optimization techniques include choosing the right join type (broadcast hash join vs. sort merge join), selecting the correct build side in a hash-join, or adjusting the join order in a multi-way join. However, outdated statistics and imperfect cardinality estimates can lead to suboptimal query plans. Adaptive Query Execution, new in the upcoming Apache Spark™ 3.0 release and available in the Databricks Runtime 7.0 beta, now looks to tackle such issues by reoptimizing and adjusting query plans based on runtime statistics collected in the process of query execution.

The Adaptive Query Execution (AQE) framework

One of the most important questions for Adaptive Query Execution is when to reoptimize. Spark operators are often pipelined and executed in parallel processes. However, a shuffle or broadcast exchange breaks this pipeline. We call them materialization points and use the term “query stages” to denote subsections bounded by these materialization points in a query. Each query stage materializes its intermediate result and the following stage can only proceed if all the parallel processes running the materialization have completed. This provides a natural opportunity for reoptimization, for it is when data statistics on all partitions are available and successive operations have not started yet.

The Adaptive Query Execution (AQE) workflow.

When the query starts, the Adaptive Query Execution framework first kicks off all the leaf stages — the stages that do not depend on any other stages. As soon as one or more of these stages finish materialization, the framework marks them complete in the physical query plan and updates the logical query plan accordingly, with the runtime statistics retrieved from completed stages. Based on these new statistics, the framework then runs the optimizer (with a selected list of logical optimization rules), the physical planner, as well as the physical optimization rules, which include the regular physical rules and the adaptive-execution-specific rules, such as coalescing partitions, skew join handling, etc. Now that we’ve got a newly optimized query plan with some completed stages, the adaptive execution framework will search for and execute new query stages whose child stages have all been materialized, and repeat the above execute-reoptimize-execute process until the entire query is done.

In Spark 3.0, the AQE framework is shipped with three features:

  • Dynamically coalescing shuffle partitions
  • Dynamically switching join strategies
  • Dynamically optimizing skew joins

The following sections will talk about these three features in detail.

Dynamically coalescing shuffle partitions

When running Spark queries on very large data, shuffle usually has a significant impact on query performance, among other factors. Shuffle is an expensive operator, as it needs to move data across the network so that data is redistributed in the way required by downstream operators.

One key property of shuffle is the number of partitions. The best number of partitions is data dependent, yet data sizes may differ vastly from stage to stage, query to query, making this number hard to tune:

  1. If there are too few partitions, then the data size of each partition may be very large, and the tasks to process these large partitions may need to spill data to disk (e.g., when sort or aggregate is involved) and, as a result, slow down the query.
  2. If there are too many partitions, then the data size of each partition may be very small, and there will be a lot of small network data fetches to read the shuffle blocks, which can also slow down the query because of the inefficient I/O pattern. Having a large number of tasks also puts more burden on the Spark task scheduler.

To solve this problem, we can set a relatively large number of shuffle partitions at the beginning, then combine adjacent small partitions into bigger partitions at runtime by looking at the shuffle file statistics.

For example, let’s say we are running the query SELECT max(i) FROM tbl GROUP BY j. The input data tbl is rather small so there are only two partitions before grouping. The initial shuffle partition number is set to five, so after local grouping, the partially grouped data is shuffled into five partitions. Without AQE, Spark will start five tasks to do the final aggregation. However, there are three very small partitions here, and it would be a waste to start a separate task for each of them.

Spark Shuffle without AQE partition coalescing.

Instead, AQE coalesces these three small partitions into one and, as a result, the final aggregation now only needs to perform three tasks rather than five.

Spark Shuffle with AQE partition coalescing.
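
To see the behavior described above end to end, here is a small PySpark sketch that mirrors the example; the table name tbl and the generated data are illustrative, and the coalescing config name reflects our understanding of the Spark 3.0 settings, so verify it against the Spark configuration documentation.

from pyspark.sql import SparkSession

# Illustrative sketch only: a tiny dataset standing in for `tbl` in the example above.
spark = (SparkSession.builder
    .appName("aqe-coalescing-demo")
    .config("spark.sql.adaptive.enabled", "true")                     # turn AQE on
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # allow post-shuffle coalescing
    .config("spark.sql.shuffle.partitions", "5")                      # deliberately generous initial value
    .getOrCreate())

# A small two-partition input standing in for tbl.
(spark.range(0, 1000, 1, 2)
    .selectExpr("id AS i", "id % 3 AS j")
    .createOrReplaceTempView("tbl"))

# With AQE enabled, the small post-shuffle partitions are coalesced at runtime
# before the final aggregation runs.
spark.sql("SELECT max(i) FROM tbl GROUP BY j").show()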

Dynamically switching join strategies

Spark supports a number of join strategies, among which broadcast hash join is usually the most performant if one side of the join can fit well in memory. And for this reason, Spark plans a broadcast hash join if the estimated size of a join relation is lower than the broadcast-size threshold. But a number of things can make this size estimation go wrong, such as the presence of a very selective filter, or the join relation being a series of complex operators other than just a scan.

To solve this problem, AQE now replans the join strategy at runtime based on the most accurate join relation size. As can be seen in the following example, the right side of the join is found to be way smaller than the estimate and also small enough to be broadcast, so after the AQE reoptimization the statically planned sort merge join is now converted to a broadcast hash join.

Example reoptimization performed by Adaptive Query Execution at runtime, which automatically uses broadcast hash joins wherever they can be used to make the query faster.

For the broadcast hash join converted at runtime, we may further optimize the regular shuffle to a localized shuffle (i.e., shuffle that reads on a per mapper basis instead of a per reducer basis) to reduce the network traffic.

Dynamically optimizing skew joins

Data skew occurs when data is unevenly distributed among partitions in the cluster. Severe skew can significantly downgrade query performance, especially with joins. AQE skew join optimization detects such skew automatically from shuffle file statistics. It then splits the skewed partitions into smaller subpartitions, which will be joined to the corresponding partition from the other side respectively.

Let’s take this example of table A join table B, in which table A has a partition A0 significantly bigger than its other partitions.

Skew join without AQE skew join optimization.
The skew join optimization will thus split partition A0 into two subpartitions and join each of them to the corresponding partition B0 of table B.

Skew join with AQE skew join optimization.

Without this optimization, there would be four tasks running the sort merge join with one task taking a much longer time. After this optimization, there will be five tasks running the join, but each task will take roughly the same amount of time, resulting in an overall better performance.

TPC-DS performance gains from AQE

In our experiments using TPC-DS data and queries, Adaptive Query Execution yielded up to an 8x speedup in query performance, and 32 queries had more than a 1.1x speedup. Below is a chart of the 10 TPC-DS queries with the most performance improvement from AQE.

Top 10 most improved queries in TPC-DS benchmark with Adaptive Query Execution vs. without, with the former outperforming the latter by ~1.3 to 8X.

Most of these improvements have come from dynamic partition coalescing and dynamic join strategy switching, since randomly generated TPC-DS data do not have skew. Yet we’ve seen even greater improvements in production workloads in which all three features of AQE are leveraged.

Enabling AQE

AQE can be enabled by setting the SQL config spark.sql.adaptive.enabled to true (default false in Spark 3.0). It applies if the query meets the following criteria (a configuration sketch follows the list):

  • It is not a streaming query
  • It contains at least one exchange (usually when there’s a join, aggregate or window operator) or one subquery
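
For reference, a minimal PySpark configuration sketch for turning on AQE and its three features is shown below; the feature-specific config names are our understanding of the Spark 3.0 settings, so verify them against the Spark configuration documentation for your version.

# Minimal configuration sketch (config names assumed from Spark 3.0 documentation).
spark.conf.set("spark.sql.adaptive.enabled", "true")                     # master switch for AQE
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # dynamically coalesce shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # dynamically optimize skew joins
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")  # localized shuffle after join conversion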

By making query optimization less dependent on static statistics, AQE has solved one of the greatest struggles of Spark cost-based optimization — the balance between the stats collection overhead and the estimation accuracy. To achieve the best estimation accuracy and planning outcome, it is usually required to maintain detailed, up-to-date statistics and some of them are expensive to collect, such as column histograms, which can be used to improve selectivity and cardinality estimation or to detect data skew. AQE has largely eliminated the need for such statistics as well as for the manual tuning effort. On top of that, AQE has also made SQL query optimization more resilient to the presence of arbitrary UDFs and unpredictable data set changes, e.g., sudden increase or decrease in data size, frequent and random data skew, etc.

There’s no need to “know” your data in advance any more. AQE will figure out the data and improve the query plan as the query runs, increasing query performance for faster analytics and system performance. Try it out today free on Databricks as part of our Databricks Runtime 7.0 beta.

--

Try Databricks for free. Get started today.

The post Adaptive Query Execution: Speeding Up Spark SQL at Runtime appeared first on Databricks.

Vectorized R I/O in Upcoming Apache Spark 3.0


R is one of the most popular programming languages in data science. It is dedicated to statistical analysis and offers a number of extensions, such as RStudio addins and other R packages, for data processing and machine learning tasks. Moreover, it enables data scientists to easily visualize their data sets.

By using SparkR in Apache Spark™, R code can easily be scaled. To run jobs interactively, you can execute distributed computations from an R shell.

When SparkR does not require interaction with the R process, the performance is virtually identical to other language APIs such as Scala, Java and Python. However, significant performance degradation happens when SparkR jobs interact with native R functions or data types.

Databricks Runtime introduced vectorization in SparkR to improve the performance of data I/O between Spark and R. We are excited to announce that, using the R APIs from Apache Arrow 0.15.1, vectorization is now available in the upcoming Apache Spark 3.0, with substantial performance improvements.

This blog post outlines Spark and R interaction inside SparkR, the current native implementation, and the new vectorized implementation, along with benchmark results.

Spark and R interaction

SparkR supports not only a rich set of ML and SQL-like APIs but also a set of APIs commonly used to directly interact with R code — for example, the seamless conversion of Spark DataFrame from/to R DataFrame, and the execution of R native functions on Spark DataFrame in a distributed manner.

In most cases, performance is virtually consistent with the other language APIs in Spark. For example, when user code relies on Spark UDFs and/or SQL APIs, the execution happens entirely inside the JVM with no I/O performance penalty. See the cases below, which each take about one second.

// Scala API
// ~1 second
sql("SELECT id FROM range(2000000000)").filter("id > 10").count()
# R API
# ~1 second
count(filter(sql("SELECT * FROM range(2000000000)"), "id > 10"))

However, in cases that require executing R native functions or converting data to/from R native types, the performance differs hugely, as shown below.

// Scala API
val ds = (1L to 100000L).toDS
// ~1 second
ds.mapPartitions(iter => iter.filter(_ < 50000)).count()
# R API
df <- createDataFrame(lapply(seq(100000), function (e) list(value=e)))
# ~15 seconds - 15 times slower
count(dapply(
df, function(x) as.data.frame(x[x$value < 50000,]), schema(df)))

Although this simple case above just filters the values lower than 50,000 for each partition, SparkR is 15x slower.

// Scala API
// ~0.2 seconds
val df = sql("SELECT * FROM range(1000000)").collect()
# R API
# ~8 seconds - 40 times slower
df <- collect(sql("SELECT * FROM range(1000000)"))

The case above is even worse. It simply collects the same data into the driver side, but it is 40x slower in SparkR.

This is because the implementation of the APIs that require interaction with R native functions or data types is not very efficient. Six APIs carry a notable performance penalty:

  • createDataFrame()
  • collect()
  • dapply()
  • dapplyCollect()
  • gapply()
  • gapplyCollect()

In short, createDataFrame() and collect() require (de)serializing and converting the data between the JVM and the R driver side. For example, String in Java becomes character in R. For dapply() and gapply(), conversion between the JVM and the R executors is required because both the R native function and the data need to be (de)serialized. In the case of dapplyCollect() and gapplyCollect(), the overhead is incurred between the JVM and R on both the driver and the executors.

Native implementation

Native implementation of R in Spark without vectorization, which requires inefficient (de)serialization and conversion of the data from JVM to R driver side, resulting in a notable performance penalty.

The computation on SparkR DataFrame gets distributed across all the nodes available on the Spark cluster. There is no communication with the R processes on the driver or executor sides unless the job needs to collect data as an R data.frame or execute an R native function. When it does, the JVM and the R driver/executors communicate over sockets.

This (de)serializes and transfers data row by row between the JVM and R in an inefficient encoding format that does not take modern CPU design, such as CPU pipelining, into account.

Vectorized implementation

In Apache Spark 3.0, a new vectorized implementation is introduced in SparkR by leveraging Apache Arrow to exchange data directly between JVM and R driver/executors with minimal (de)serialization cost.

Implementation of R in Spark with vectorization (available in Spark 3.0), where the data is exchanged between JVM and R executors/drivers with efficient (de)serialization by Apache Arrow, for greater performance.

Instead of (de)serializing the data row by row using an inefficient format between JVM and R, the new implementation leverages Apache Arrow to allow pipelining and Single Instruction Multiple Data (SIMD) with an efficient columnar format.

The new vectorized SparkR APIs are not enabled by default but can be enabled by setting spark.sql.execution.arrow.sparkr.enabled to true in the upcoming Apache Spark 3.0. Note that vectorized dapplyCollect() and gapplyCollect() are not implemented yet. Users are encouraged to use dapply() and gapply() instead.

Benchmark results

The benchmarks were performed with a simple data set of 500,000 records by executing the same code and comparing the total elapsed times when the vectorization is enabled and disabled. Our code, dataset and notebooks are available here on GitHub.

Performance comparison between SparkR with and without vectorization demonstrates the superior performance of the former against the latter.

In the case of collect() and createDataFrame() with an R DataFrame, performance became approximately 17x and 42x faster, respectively, when vectorization was enabled. For dapply() and gapply(), performance was 43x and 33x faster, respectively, than with vectorization disabled.

There was a performance improvement of up to 17x–43x when the optimization was enabled by setting spark.sql.execution.arrow.sparkr.enabled to true. The larger the data, the greater the expected performance gain. For details, see the benchmark performed previously for Databricks Runtime.

Conclusion

The upcoming Apache Spark 3.0 supports the vectorized APIs dapply(), gapply(), collect() and createDataFrame() with R DataFrame by leveraging Apache Arrow. Enabling vectorization in SparkR improved performance by up to 43x, and an even greater boost is expected with larger data.

As for future work, there is an ongoing issue in Apache Arrow, ARROW-4512. Communication between the JVM and R is not yet fully streaming; data has to be (de)serialized in batches because the Arrow R API does not support streaming out of the box. In addition, dapplyCollect() and gapplyCollect() will be supported in Apache Spark 3.x releases. In the meantime, users can work around the limitation by combining dapply() with collect(), and gapply() with collect(), respectively.

Try out these new capabilities today on Databricks, through our DBR 7.0 Beta, which includes a preview of the upcoming Spark 3.0 release.

--

Try Databricks for free. Get started today.

The post Vectorized R I/O in Upcoming Apache Spark 3.0 appeared first on Databricks.

Monitor Your Databricks Workspace with Audit Logs


Cloud computing has fundamentally changed how companies operate: users are no longer subject to the restrictions of on-premises hardware deployments, such as physical limits on resources and onerous environment upgrade processes. With the convenience and flexibility of cloud services come challenges in properly monitoring how your users utilize these readily available resources. Failure to do so could result in problematic and costly anti-patterns (with both cloud provider core resources and a PaaS like Databricks). Databricks is cloud-native by design and thus tightly coupled with public cloud providers such as Microsoft Azure and Amazon Web Services, taking full advantage of this new paradigm, and its audit logs capability provides administrators a centralized way to understand and govern activity happening on the platform. Administrators can use Databricks audit logs to monitor patterns like the number of clusters or jobs created in a given day, the users who performed those actions, and any users who were denied authorization to the workspace.

In the first blog post of the series, Trust but Verify with Databricks, we covered how Databricks admins could use Databricks audit logs and other cloud provider logs as complementary solutions for their cloud monitoring scenarios. The main purpose of Databricks audit logs is to allow enterprise security teams and platform administrators to track access to data and workspace resources using the various interfaces available in the Databricks platform. In this article, we will cover, in detail, how those personas could process and analyze the audit logs to track resource usage and identify potentially costly anti-patterns.

Audit Logs ETL Design

Databricks Audit Log ETL Design and Workflow.

Databricks delivers audit logs for all enabled workspaces as per delivery SLA in JSON format to a customer-owned AWS S3 bucket. These audit logs contain events for specific actions related to primary resources like clusters, jobs, and the workspace. To simplify delivery and further analysis by the customers, Databricks logs each event for every action as a separate record and stores all the relevant parameters into a sparse StructType called requestParams.

In order to make this information more accessible, we recommend an ETL process based on Structured Streaming and Delta Lake.

Databricks ETL process utilizing structured streaming and Delta Lake.

  • Utilizing Structured Streaming allows us to:
    • Leave state management to a construct that’s purpose built for state management. Rather than having to reason about how much time has elapsed since our previous run to ensure that we’re only adding the proper records, we can utilize Structured Streaming’s checkpoints and write-ahead log to ensure that we’re only processing the newly added audit log files. We can design our streaming queries as triggerOnce daily jobs which are like pseudo-batch jobs
  • Utilizing Delta Lake allows us to do the following:
    • Gracefully handle schema evolution, specifically with regards to the requestParams field, which may have new StructField based on new actions tracked in the audit logs
    • Easily utilize table to table streams
    • Take advantage of specific performance optimizations like OPTIMIZE to maximize read performance

For reference, this is the medallion reference architecture that Databricks recommends:

Databricks’ medallion architecture for the ETL process

Bronze: the initial landing zone for the pipeline. We recommend copying data that’s as close to its raw form as possible to easily replay the whole pipeline from the beginning, if needed

Silver: the raw data get cleansed (think data quality checks), transformed and potentially enriched with external data sets

Gold: production-grade data that your entire company can rely on for business intelligence, descriptive statistics, and data science / machine learning

Following our own medallion architecture, we break it out as follows for our audit logs ETL design:

Raw Data to Bronze Table

Stream from the raw JSON files that Databricks delivers using a file-based Structured Stream to a bronze Delta Lake table. This creates a durable copy of the raw data that allows us to replay our ETL, should we find any issues in downstream tables.

Databricks raw data to Bronze Table ETL process, which creates a durable copy of the raw data and allows ETL replay to troubleshoot table issues downstream.

Databricks delivers audit logs to a customer-specified AWS S3 bucket in the form of JSON. Rather than writing logic to determine the state of our Delta Lake tables, we’re going to utilize Structured Streaming’s write-ahead logs and checkpoints to maintain the state of our tables. In this case, we’ve designed our ETL to run once per day, so we’re using a file source with triggerOnce to simulate a batch workload with a streaming framework. Since Structured Streaming requires that we explicitly define the schema, we’ll read the raw JSON files once to build it.

streamSchema = spark.read.json(sourceBucket).schema

We’ll then instantiate our StreamReader using the schema we inferred and the path to the raw audit logs.

streamDF = (
    spark
    .readStream
    .format("json")
    .schema(streamSchema)
    .load(sourceBucket)
)      

We then instantiate our StreamWriter and write out the raw audit logs into a bronze Delta Lake table that’s partitioned by date.

(streamDF
.writeStream
.format("delta")
.partitionBy("date")
.outputMode("append")
.option("checkpointLocation", "{}/checkpoints/bronze".format(sinkBucket))
.option("path", "{}/streaming/bronze".format(sinkBucket))
.option("mergeSchema", True)
.trigger(once=True)
.start()
)   

Now that we’ve created the table on an AWS S3 bucket, we’ll need to register the table to the Databricks Hive metastore to make access to the data easier for end users. We’ll create the logical database audit_logs, before creating the Bronze table.

CREATE DATABASE IF NOT EXISTS audit_logs

spark.sql("""
CREATE TABLE IF NOT EXISTS audit_logs.bronze
USING DELTA
LOCATION '{}/streaming/bronze'
""".format(sinkBucket))    

If you update your Delta Lake tables in batch or pseudo-batch fashion, it’s best practice to run OPTIMIZE immediately following an update.

OPTIMIZE audit_logs.bronze

Bronze to Silver Table

Stream from a bronze Delta Lake table to a silver Delta Lake table such that it takes the sparse requestParams StructType and strips out all empty keys for every record, along with performing some other basic transformations like parsing email address from a nested field and parsing UNIX epoch to UTC timestamp.

Databricks Bronze to Silver Table ETL process, which strips out empty record keys and performs basic transformations, such as parsing email addresses.

Since we ship audit logs for all Databricks resource types in a common JSON format, we’ve defined a canonical struct called requestParams which contains a union of the keys for all resource types. Eventually, we’re going to create individual tables for each service, so we want to strip down the requestParams field for each table so that it contains only the relevant keys for the resource type. To accomplish this, we define a user-defined function (UDF) to strip away all such keys in requestParams that have null values.

import json
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def stripNulls(raw):
    return json.dumps({i: raw.asDict()[i] for i in raw.asDict() if raw.asDict()[i] is not None})

strip_udf = udf(stripNulls, StringType())

We instantiate a StreamReader from our bronze Delta Lake table:

bronzeDF = (
    spark
    .readStream
    .load("{}/streaming/bronze".format(sinkBucket))
    )      

We then apply the following transformations to the streaming data from the bronze Delta Lake table:

  1. strip the null keys from requestParams and store the output as a string
  2. parse email from userIdentity
  3. parse an actual timestamp / timestamp datatype from the timestamp field and store it in date_time
  4. drop the raw requestParams and userIdentity

from pyspark.sql.functions import col, from_unixtime, from_utc_timestamp

query = (
    bronzeDF
    .withColumn("flattened", strip_udf("requestParams"))
    .withColumn("email", col("userIdentity.email"))
    .withColumn("date_time", from_utc_timestamp(from_unixtime(col("timestamp")/1000), "UTC"))
    .drop("requestParams")
    .drop("userIdentity")
)

We then stream those transformed records into the Silver Delta Lake table:

(query
.writeStream
.format("delta")
.partitionBy("date")
.outputMode("append")
.option("checkpointLocation", "{}/checkpoints/silver".format(sinkBucket))
.option("path", "{}/streaming/silver".format(sinkBucket))
.option("mergeSchema", True)
.trigger(once=True)
.start()
)   

Again, since we’ve created a table based on an AWS S3 bucket, we’ll want to register it with the Hive Metastore for easier access.

spark.sql("""
CREATE TABLE IF NOT EXISTS audit_logs.silver
USING DELTA
LOCATION '{}/streaming/silver'
""".format(sinkBucket))    

Although Structured Streaming guarantees exactly-once processing, we can still add an assertion to compare the counts of the Bronze Delta Lake table and the Silver Delta Lake table.

assert(spark.table("audit_logs.bronze").count() == spark.table("audit_logs.silver").count())

As for the bronze table earlier, we’ll run OPTIMIZE after this update for the silver table as well.

OPTIMIZE audit_logs.silver

Silver to Gold Tables

Stream to individual gold Delta Lake tables for each Databricks service tracked in the audit logs.

Databricks Silver to Gold Table ETL process, which provides the pared-down analysis required by the workspace administrators.

The gold audit log tables are what the Databricks administrators will utilize for their analyses. With the requestParams field pared down at the service level, it’s now much easier to get a handle on the analysis and what’s pertinent. With Delta Lake’s ability to handle schema evolution gracefully, as Databricks tracks additional actions for each resource type, the gold tables will seamlessly change, eliminating the need to hardcode schemas or babysit for errors.

In the final step of our ETL process, we first define a UDF to parse the keys from the stripped down version of the original requestParams field.

def justKeys(string):
    return [i for i in json.loads(string).keys()]

just_keys_udf = udf(justKeys, StringType())

For the next large chunk of our ETL, we’ll define a function which accomplishes the following:

  1. gathers the keys for each record for a given serviceName (resource type)
  2. creates a set of those keys (to remove duplicates)
  3. creates a schema from those keys to apply to a given serviceName (if the serviceName does not have any keys in requestParams, we give it one key schema called placeholder)
  4. writes out to individual gold Delta Lake tables for each serviceName in the silver Delta Lake table (a sketch of a possible implementation follows the excerpt below)
def flattenTable(serviceName, bucketName):
    flattenedStream = spark.readStream.load("{}/streaming/silver".format(bucketName))
    flattened = spark.table("audit_logs.silver")
    ...
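
For reference, one possible shape of the elided function body is sketched below, following the four steps above. This is a hedged sketch rather than the notebook’s exact code: it parses the flattened JSON directly instead of using just_keys_udf, and the gold checkpoint and output paths simply mirror the bronze and silver conventions used earlier.

import json
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

def flattenTable(serviceName, bucketName):
    flattenedStream = spark.readStream.load("{}/streaming/silver".format(bucketName))
    flattened = spark.table("audit_logs.silver")

    # 1 & 2: gather the distinct requestParams keys observed for this serviceName
    rows = (flattened
        .filter(col("serviceName") == serviceName)
        .select("flattened")
        .distinct()
        .collect())
    keys = {k for r in rows for k in json.loads(r["flattened"]).keys()}

    # 3: build a schema from those keys, or a single placeholder key if there are none
    schema = StructType()
    for k in (keys or {"placeholder"}):
        schema.add(StructField(k, StringType()))

    # 4: write a gold Delta Lake table for this serviceName
    (flattenedStream
        .filter(col("serviceName") == serviceName)
        .withColumn("requestParams", from_json(col("flattened"), schema))
        .drop("flattened")
        .writeStream
        .format("delta")
        .partitionBy("date")
        .outputMode("append")
        .option("checkpointLocation", "{}/checkpoints/gold/{}".format(bucketName, serviceName))
        .option("path", "{}/streaming/gold/{}".format(bucketName, serviceName))
        .option("mergeSchema", True)
        .trigger(once=True)
        .start())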

We extract a list of all unique values of serviceName to use for iteration and run the above function for each value:

serviceNameList = [i['serviceName'] for i in spark.table("audit_logs.silver").select("serviceName").distinct().collect()]

for serviceName in serviceNameList:
    flattenTable(serviceName, sinkBucket)    

As before, register each Gold Delta Lake table to the Hive Metastore:

for serviceName in serviceNameList:
    spark.sql("""
    CREATE TABLE IF NOT EXISTS audit_logs.{0}
    USING DELTA
    LOCATION '{1}/streaming/gold/{0}'
    """.format(serviceName, sinkBucket))

Then run OPTIMIZE on each table:

for serviceName in serviceNameList:
    spark.sql("OPTIMIZE audit_logs.{}".format(serviceName))

As before, asserting that the counts are equal is not strictly necessary, but we do it nonetheless:

flattened_count = spark.table("audit_logs.silver").count()

total_count = 0
for serviceName in serviceNameList:
    total_count += (spark.table("audit_logs.{}".format(serviceName)).count())

assert(flattened_count == total_count)    

We now have a gold Delta Lake table for each serviceName (resource type) that Databricks tracks in the audit logs, which we can now use for monitoring and analysis.

Audit Log Analysis

In the section above, we processed the raw audit logs using ETL and included some tips on how to make data access easier and more performant for your end users. The first notebook included in this article pertains to that ETL process.

The second notebook we’ve included goes into more detailed analysis on the audit log events themselves. For the purpose of this blog post, we’ll focus on just one of the resource types – clusters, but we’ve included analysis on logins as another example of what administrators could do with the information stored in the audit logs.

It may be obvious to some why a Databricks administrator would want to monitor clusters, but it bears repeating: cluster uptime is the biggest driver of cost, and we want to ensure that our customers get maximum value while they’re utilizing Databricks clusters.

A major portion of the cluster uptime equation is the number of clusters created on the platform and we can use audit logs to determine the number of Databricks clusters created on a given day.

By querying the clusters’ gold Delta Lake table, we can filter where actionName is create and perform a count by date.

SELECT date, count(*) AS num_clusters 
FROM clusters 
WHERE actionName = 'create' 
GROUP BY 1 
ORDER BY 1 ASC    

Graphical view of cluster usage provided by the Databricks Audit Log solution.

There’s not much context in the above chart because we don’t have data from other days. But for the sake of simplicity, let’s assume that the number of clusters more than tripled compared to normal usage patterns and the number of users did not change meaningfully during that time period. If this were truly the case, then one reasonable explanation would be that the clusters were created programmatically using jobs. Additionally, 12/28/19 was a Saturday, so we don’t expect many interactive clusters to have been created anyway.

Inspecting the requestParams StructType for the clusters table, we see that there’s a cluster_creator field, which should tell us who created it.

SELECT requestParams.cluster_creator, actionName, count(*) 
FROM clusters 
WHERE date = '2019-12-28' 
GROUP BY 1,2 
ORDER BY 3 DESC    

Cluster usage table report generated by the Databricks Audit Log solution.

Based on the results above, we notice that JOB_LAUNCHER created 709 clusters, out of 714 total clusters created on 12/28/19, which confirms our intuition.

Our next step is to figure out which particular jobs created these clusters, which we could extract from the cluster names. Databricks job clusters follow the naming convention job-<jobId>-run-<runId>, so we can parse the jobId from the cluster name.

SELECT split(requestParams.cluster_name, "-")[1] AS jobId, count(*) 
FROM clusters 
WHERE actionName = 'create' AND date = '2019-12-28'
GROUP BY 1 
ORDER BY 2 DESC    

The Databricks Audit Log solution enables you to surface the job id responsible for cluster creation, to assist with root cause analysis.

Here we see that jobId “31303” is the culprit for the vast majority of clusters created on 12/28/19. Another piece of information that the audit logs store in requestParams is the user_id of the user who created the job. Since the creator of a job is immutable, we can just take the first record.

SELECT requestParams.user_id 
FROM clusters 
WHERE actionName = 'create' AND date = '2019-12-28' AND split(requestParams.cluster_name, "-")[1] = '31303' 
LIMIT 1    

Now that we have the user_id of the user who created the job, we can utilize the SCIM API to get the user’s identity and ask them directly about what may have happened here.
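
For example, a lookup against the SCIM Users API from a notebook might look like the sketch below. This is a hedged example: the workspace URL, secret scope, and user_id value are placeholders, and the endpoint path reflects the SCIM preview API as we understand it, so confirm it against the Databricks SCIM API documentation.

import requests

workspace_url = "https://<your-workspace>.cloud.databricks.com"  # placeholder
token = dbutils.secrets.get(scope="admin", key="pat")            # assumes a secret scope holding a personal access token
user_id = "12345"                                                # illustrative value from the query above

resp = requests.get(
    "{}/api/2.0/preview/scim/v2/Users/{}".format(workspace_url, user_id),
    headers={"Authorization": "Bearer {}".format(token)}
)
resp.raise_for_status()
print(resp.json().get("userName"))  # typically the user's email address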

In addition to monitoring the total number of clusters overall, we encourage Databricks administrators to pay special attention to all-purpose compute clusters that do not have autotermination enabled. The reason is that such clusters will keep running until manually terminated, regardless of whether they’re idle. You can identify these clusters using the following query:

SELECT date, count(*) AS num_clusters 
FROM clusters 
WHERE actionName = 'create' AND requestParams.autotermination_minutes = 0 AND requestParams.cluster_creator IS null 
GROUP BY 1
ORDER BY 1 ASC    

If you’re utilizing our example data, you’ll notice that there are 5 clusters whose cluster_creator is null, which means they were created by users and not by jobs.

The cluster usage by cluster creator email table report provided by the Databricks Audit Log solution.

By selecting the creator’s email address and the cluster’s name, we can identify which clusters we need to terminate and which users we need to talk to about best practices for Databricks resource management.

How to start processing Databricks Audit Logs

With a flexible ETL process that follows the best practice medallion architecture with Structured Streaming and Delta Lake, we’ve simplified Databricks audit logs analysis by creating individual tables for each Databricks resource type. Our cluster analysis example is just one of the many ways that analyzing audit logs helps to identify problematic anti-patterns that could lead to unnecessary costs. Please use the notebooks included in this post to follow the exact steps and try them out at your end.

For more information, you can also watch the recent tech talk: Best Practices on How to Process and Analyze Audit Logs with Delta Lake and Structured Streaming.


For a slightly different architecture that processes the audit logs as soon as they’re available, consider evaluating the new Auto Loader capability that we discuss in detail in this blog post.

We want our customers to maximize the value they get from our platform, so please reach out to your Databricks account team if you have any questions.

--

Try Databricks for free. Get started today.

The post Monitor Your Databricks Workspace with Audit Logs appeared first on Databricks.

Customer Lifetime Value Part 1: Estimating Customer Lifetimes


Download the Customer Lifetimes Part 1 notebook to demo the solution covered below.

 
The biggest challenge every marketer faces is how to best spend money to profitably grow their brand. We want to spend our marketing dollars on activities that attract the best customers, while avoiding spending on unprofitable customers or on activities that erode brand equity.
 
Too often, marketers just look at spending efficiency. What is the least I can spend on advertising and promotions to generate revenue? Focusing solely on ROI metrics can weaken your brand equity and make you more dependent on price promotion as a way to generate sales.
 
Within your existing set of customers are people ranging from brand loyalists to brand transients. Brand loyalists are highly engaged with your brand, are willing to share their experience with others, and are the most likely to purchase again. Brand transients have no loyalty to your brand and shop based on price. Your marketing spend ideally would focus on growing the group of brand loyalists, while minimizing the exposure to brand transients.
 
So how can you identify these brand loyalists and best use your marketing dollars to prolong their relationship with you?
 
Today’s customer has no shortage of options. To stand out, businesses need to speak directly to the needs and wants of the individual on the other side of the monitor, phone, or station, often in a manner that recognizes not only the individual customer but the context that brings them to the exchange. When done properly, personalized engagement can drive higher revenues, marketing efficiency and customer retention1, and as capabilities mature and customer expectations rise, getting personalization right will become ever more important. As McKinsey & Company puts it, personalization will be “the prime driver of marketing success within the next five years2”.
 
But one critical aspect of personalization is understanding that not every customer carries with him or her the same potential for profitability. Not only do different customers derive different value from our products and services but this directly translates into differences in the overall amount of value we might expect in return. If the relationship between us and our customers is to be mutually beneficial, we must carefully align customer acquisition cost (CAC) and retention rates with the total revenue or customer lifetime value (CLV) we might reasonably receive over that relationship’s lifetime.
 
This is the central motivation behind the customer lifetime value calculation. By calculating the amount of revenue we might receive from a given customer over the lifetime of our relationship with them, we might better tailor our investments to maximize the value of our relationship for both parties. We might further seek to understand why some customers value our products and services more than others and orient our messaging to attract more higher potential individuals. We might also use CLV in aggregate to assess the overall effectiveness in our marketing practices in building equity and monitor how innovation and changes in the marketplace affect it over time3.
 
But as powerful as CLV is, it’s important to appreciate that it’s derived from two separate and independent estimates4. The first of these is the per-transaction spend (or average order value) we may expect to see from a given customer. The second is the estimated number of transactions we may expect from that customer over a given time horizon. This second estimate is often seen as a means to an end, but as organizations shift their marketing spend from acquiring new customers towards retention5, it becomes incredibly valuable in its own right.

How Customers Signal Their Lifetime Intent

In the non-contractual scenarios within which most retailers engage, customers may come and go as they please. Retailers attempting to assess the remaining lifetime in a customer relationship must carefully examine the transactional signals previously generated by customers in terms of the frequency and recency of their engagement. For example, a frequent purchaser who slows their pattern of purchases or simply fails to reappear for an extended period of time may signal they are approaching the end of their relationship lifetime. Another purchaser who infrequently engages may continue to be in a viable relationship even when absent for a similar duration.

Different customers with the same number of transactions but signaling different lifetime intent

Understanding where a customer is in the lifespan of their relationship with us can be critical to delivering the right messages at the right time. Customers signaling their intent to be in a long-term relationship with our brand may respond positively to higher-investment offers which deepen and strengthen their relationship with us and which maximize the long-term potential of the relationship, even while sacrificing short-term revenues. Customers signaling their intent for a short-term relationship may be pushed away by similar offers or, worse, may accept those offers with no hope of us ever recovering the investment.
 
 
Similarly, we may recognize shifts in relationship signals such as when long-lived customers approach the end of their relationship lifetime and promote alternative products and services which transition them into a new, potentially profitable relationship with ourselves or a partner. Even with our short-lived customers, we might consider how best to deliver products and services which maximize revenues during their time-limited engagement and which may allow them to recommend us to others seeking similar offerings.
 
As Peter Fader and Sarah Toms write in The Customer Centricity Playbook, in an effective customer-centric strategy “opportunities to make maximum financial gains are identified and fully taken advantage of, but these high-risk bets must be weighted out and distributed across lower-risk categories of assets as well.” Finding the right balance and tailoring our interactions starts with a careful estimate of where customers are in their lifetime journey with us.

Estimating Customer Lifetime from Transactional Signals

As previously mentioned, in non-subscription models, we cannot know a customer’s exact lifetime or where he or she resides in it, but we can leverage the transactional signals they generate to estimate the probability the customer is active and likely to return in the future. Using what are popularly known as the Buy ‘til You Die (BTYD) models, a customer’s frequency and recency of engagement, relative to the same patterns across a retailer’s customer population, can be used to derive survivorship curves which provide us these values.

Figure 2. The probability of re-engagement (P_alive) relative to a customer’s history of purchases

The mathematics behind these predictive CLV models is quite complex. The original BTYD model proposed by Schmittlein et al. in the late 1980s (and today known as the Pareto/Negative Binomial Distribution or Pareto/NBD model) didn’t take off in adoption until Fader et al. simplified the calculation logic (producing the Beta-Geometric/Negative Binomial Distribution or BG/NBD model) in the mid-2000s. Even then, the math of the simplified model gets pretty gnarly pretty fast. Thankfully, the logic behind both of these models is accessible to us through a popular Python library named lifetimes, to which we can provide simple summary metrics in order to derive customer-specific lifetime estimates.
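
As a quick illustration of how approachable this is, a minimal sketch using the lifetimes library might look like the following; the pandas DataFrame summary and its column names are assumptions standing in for the per-customer metrics described in the next section.

# Minimal sketch: assumes a pandas DataFrame `summary` with frequency, recency and T (age) columns.
from lifetimes import BetaGeoFitter

bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(summary["frequency"], summary["recency"], summary["T"])

# probability that each customer is still "alive" given their purchase history
summary["p_alive"] = bgf.conditional_probability_alive(
    summary["frequency"], summary["recency"], summary["T"])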

Delivering Customer Lifetime Estimates to the Business

While highly accessible, the use of the lifetimes library to calculate customer-specific probabilities in a manner aligned with the needs of a large enterprise can be challenging. First, a large volume of transaction data must be processed in order to generate the per-customer metrics required by the models. Next, curves must be derived from this data, fitting it to expected patterns of value distribution, the process of which is regulated by a parameter which cannot be predetermined and instead must be evaluated iteratively across a large range of potential values. Finally, the lifetimes models, once fitted, must be integrated into the marketing and customer engagement functions of our business for the predictions it generates to have any meaningful impact. It is our intent in this blog and the associated notebook to demonstrate how each of the challenges may be addressed.

Metrics Calculations

The BTYD models depend on three key per-customer metrics:

  • Frequency – the number of time units within a given time period on which a non-initial (repeat) transaction is observed. If calculated at a daily level, this is simply the number of unique dates on which a transaction occurred minus 1 for the initial transaction that indicates the start of a customer relationship.
  • Age – the number of time units from the occurrence of an initial transaction until the end of a given time period. Again, if transactions are observed at a daily level, this is simply the number of days since a customer’s initial transaction to the end of the dataset.
  • Recency – the age of a customer (as previously defined) at the time of their latest non-initial (repeat) transaction.

 
The metrics themselves are pretty straightforward. The challenge is deriving these values for each customer from transaction histories which may record each line item of each transaction occurring over a multi-year period. By leveraging a data processing platform such as Apache Spark which natively distributes this work across the capacity of a multi-server environment, this challenge can be easily addressed and metrics computed in a timely manner. As more transactional data arrives and these metrics must be recomputed across a growing transactional dataset, the elastic nature of Spark allows additional resources to be enlisted to keep processing times within business-defined bounds.
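
A hedged PySpark sketch of those calculations is shown below; the transactions table and its column names are illustrative assumptions, not the exact schema used in the notebook.

# Illustrative only: assumes a `transactions` table with customer_id and transaction_date columns.
from pyspark.sql import functions as f

txns = spark.table("transactions").select("customer_id", "transaction_date").distinct()

# end of the observation period, taken here as the latest transaction date in the dataset
cutoff = txns.agg(f.max("transaction_date").alias("d")).collect()[0]["d"]

metrics = (
    txns.groupBy("customer_id")
        .agg(
            (f.countDistinct("transaction_date") - f.lit(1)).alias("frequency"),                # repeat purchase days
            f.datediff(f.max("transaction_date"), f.min("transaction_date")).alias("recency"),  # age at latest repeat transaction
            f.datediff(f.lit(str(cutoff)), f.min("transaction_date")).alias("T")                # age at end of the period
        )
)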

Model Fitting

With per-customer metrics calculated, the lifetimes library can be used to train one of multiple BTYD models which may be applicable in a given retail scenario. (The two most widely applicable are the Pareto/NBD and BG/NBD models but there are others.) While computationally complex, each model is trained with a simple method call, making the process highly accessible.
 
Still, a regularization parameter is employed during the training process of each model to avoid overfitting it to the training data. What value is best for this parameter in a given training exercise is difficult to know in advance, so the common practice is to train and evaluate model fit against a range of potential values until an optimal value can be determined.
 
This process often involves hundreds or even thousands of training/evaluation runs. When performed one at a time, the process of determining an optimal value, which is typically repeated as new transactional data arrives, can become very time consuming.
 
By using a specialized library named hyperopt, we can tap into the infrastructure behind our Apache Spark environment and distribute the model training/evaluation work in a parallelized manner. This allows the parameter tuning exercise to be performed efficiently, returning to us the optimal model type and regularization parameter settings.
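
A hedged sketch of that distributed tuning loop follows; it assumes a pandas DataFrame summary_cal_holdout produced by lifetimes.utils.calibration_and_holdout_data(), and the column names and evaluation metric are illustrative rather than the notebook's exact choices.

# Sketch of distributed hyperparameter tuning with hyperopt's SparkTrials.
from hyperopt import fmin, tpe, hp, STATUS_OK, SparkTrials
from lifetimes import BetaGeoFitter
from sklearn.metrics import mean_squared_error

def evaluate(penalizer_coef):
    # fit a BG/NBD model on the calibration period with the sampled regularization value
    model = BetaGeoFitter(penalizer_coef=penalizer_coef)
    model.fit(summary_cal_holdout["frequency_cal"],
              summary_cal_holdout["recency_cal"],
              summary_cal_holdout["T_cal"])
    # score the holdout period and compare predicted vs. actual purchase counts
    predicted = model.conditional_expected_number_of_purchases_up_to_time(
        summary_cal_holdout["duration_holdout"],
        summary_cal_holdout["frequency_cal"],
        summary_cal_holdout["recency_cal"],
        summary_cal_holdout["T_cal"])
    return {"loss": mean_squared_error(summary_cal_holdout["frequency_holdout"], predicted),
            "status": STATUS_OK}

best = fmin(
    fn=evaluate,
    space=hp.uniform("penalizer_coef", 0.0, 1.0),  # search range for the regularization parameter
    algo=tpe.suggest,
    max_evals=100,
    trials=SparkTrials(parallelism=8)              # distribute trials across the Spark cluster
)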

Solution Deployment

Once properly trained, our model can determine not only the probability that a customer will re-engage but also the number of engagements expected over future periods. Matrices illustrating the relationship between recency and frequency metrics and these predicted outcomes provide powerful visual representations of the knowledge encapsulated in the now fitted models. But the real challenge is putting these predictive capabilities into the hands of those who determine customer engagement.

Figure 3. Matrices illustrating the probability a customer is alive (left) and the number of future purchases in a 30-day window given a customer’s frequency and recency metrics (right)

Leveraging mlflow, a Machine Learning model management and deployment platform, we can easily map our model to standardized application program interfaces. While mlflow does not natively support the models generated by lifetimes, it is easily extended for this purpose. The end result of this is that we can quickly turn our trained models into functions and applications enabling periodic, real-time and interactive customer scoring of life expectancy metrics.
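
As an illustration, a fitted lifetimes model might be wrapped and logged roughly as follows; the wrapper class, input column names, and the fitted model variable (bgf) are our assumptions rather than the notebook's exact code.

# Minimal sketch of extending mlflow.pyfunc to serve a fitted lifetimes model.
import mlflow
import mlflow.pyfunc

class BtydModelWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, btyd_model):
        self.btyd_model = btyd_model

    def predict(self, context, model_input):
        # expects a DataFrame with frequency, recency and T columns
        return self.btyd_model.conditional_probability_alive(
            model_input["frequency"], model_input["recency"], model_input["T"])

with mlflow.start_run():
    mlflow.pyfunc.log_model("btyd_model", python_model=BtydModelWrapper(bgf))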

Bringing It All Together with Databricks

The predictive capability of the BTYD models, combined with the ease of implementation provided by the lifetimes library, makes widespread adoption of customer lifetime prediction feasible. Still, there are several technical challenges which must be overcome in doing so. But whether it’s scaling the calculation of customer metrics from large volumes of transaction history, performing optimized hyperparameter tuning across a large search space, or deploying an optimal model as a solution enabling customer scoring, the capabilities needed to overcome each of these challenges are available. Still, integrating these capabilities into a single environment can be challenging and time consuming. Thankfully, Databricks has done this work for us. And by delivering these as a cloud-native platform, retailers and manufacturers needing access to these capabilities can develop and deploy solutions in a highly scalable environment with limited upfront cost.

Download the Notebook to get started.

--

Try Databricks for free. Get started today.

The post Customer Lifetime Value Part 1: Estimating Customer Lifetimes appeared first on Databricks.

How the Minnesota Twins Scaled Pitch Scenario Analysis to Measure Player Performance – Part 1

$
0
0

Statistical Analysis in the Game of Baseball 

A single pitch in Major League Baseball (MLB) generates tens of megabytes of data, from pitch movement to ball rotation to hitter behavior to the movement of each individual baseball player in response to a hit. How do you derive actionable insights from all this data over the course of a game and season? Learn how the Baseball Operations Group within the 2019 AL Central Division Champion Minnesota Twins is using Databricks to take reams of sensor data, run thousands or tens of thousands of simulations on each pitch, and then quickly generate actionable insights to analyze and improve player performance, scout the competition, and better evaluate talent. Additionally, learn how they are planning to shorten the analysis cycle even further to get these insights to the coaches in order to optimize in-game strategy based on real-time action.

In part 1, we walk through the challenges the Twins’ front office was facing in running thousands of simulations on their pitch data to improve their player evaluation model. We evaluate the tradeoffs around different methodologies for modeling and inferring pitch outcomes in R, and why we chose to proceed with user-defined functions with R in Spark.

In part 2, we will take a deeper dive into user-defined functions with R in Spark. We will learn how to optimize performance to enable scale by managing the behavior of R inside the UDF as well as the behavior of Spark as it orchestrates execution.

Background

There are several discrete baseball statistics that can be used to evaluate a baseball player’s performance, such as batting average or runs batted in, but the sabermetric baseball community developed Wins Above Replacement (WAR) to estimate a player’s total contribution to the team’s success in a way that makes it easier to compare players. From FanGraphs.com:

You should always use more than one metric at a time when evaluating players, but WAR is all-inclusive and provides a useful reference point for comparing players. WAR offers an estimate to answer the question, “If this player got injured and their team had to replace them with a freely available minor leaguer or a AAAA player from their bench, how much value would the team be losing?” This value is expressed in a wins format, so we could say that Player X is worth +6.3 wins to their team while Player Y is only worth +3.5 wins, which means it is highly likely that Player X has been more valuable than Player Y.

Historically, comparing players’ WAR over the course of hundreds or thousands of pitches was the best way to gauge relative value. The Minnesota Twins had data on over 15 million pitches that included not just final outcome (ball, strike, base hit, etc), but also deeper data like ball speed and rotation, exit velocity, player positioning, fielding independent pitching (FIP), and so on. However, any single pitch can have several variables, like batter-pitcher pairing or weather, that affect the play’s expected run value (e.g., the breeze picks up and what could have been a homerun instead becomes a foul ball). But how do you correct for those variables in order to derive a more accurate prediction of future performance?

Based on the law of large numbers, industries like Financial Services run Monte Carlo simulations on historical data to increase the accuracy of their probabilistic models. Similarly, a 100-fold increase in scenarios (think of each pitch as a scenario) leads to a 10-fold more accurate WAR estimate. To increase the accuracy of their expected run value, the Twins looked for a solution that could generate up to 20,000 simulations on each of the pitches in their database, or up to 300 billion scenarios total (15 million pitches x 20,000 simulations per pitch in the max scenario = 300 billion total scenarios). Running this analysis on-prem with Base R on a single node, they realized it would take almost four years to compute the historical data (~8 seconds to run 20k simulations per pitch x 15 million pitches / 3.15×10^7 seconds per year). If they were going to eventually use run value/WAR estimates to optimize in-game decisions, where each game could generate over 40,000 scenarios to score, they needed:

  1. A way to quickly spin up massive amounts of compute power for short periods of time and
  2. A living model where they could continuously add new baseball data to improve the accuracy of their forecasts and generate actionable insights in near real time.

The Twins turned to Databricks and Microsoft Azure, given our experience with massive data sets that include both structured and unstructured data. With the near limitless on-demand compute available in the cloud, the Twins no longer had to worry about provisioning hardware for a one-time spike in use around analyzing the historical data. Where previously running 40,000 daily simulations would have taken 44 minutes, with Databricks on Azure they unlocked the capability for real-time scoring as data gets generated.

Scaling Pitch Simulation Outcomes by 100X

Modeling and Inferring Pitch Outcomes in R

To model the outcome of a given pitch, the data science team settled on a training dataset consisting of 15 million rows with pitch location coordinates as well as season and game features like pitcher-batter handedness, inning, and so on.  In order to capture the non-linear properties of pitch outcomes in their models, the team turned to the vast repository of open source packages available in the R ecosystem.

Ultimately an R package was chosen that was particularly useful for its flexibility and interpretability when modeling non-linear distributions.  Data scientists could fit a complex model on historical data and still understand the precise effect each predictor has on the pitch outcome.   Since the models would be used to help evaluate player performance and team composition, interpretability of the model was an important consideration for coaches and the business.

Having modeled various pitch outcomes, the team then used R to simulate the joint probability distribution of x-y coordinates for each pitch in the dataset.  This effectively generated additional records that could be scored with their trained models.  Inferring the expected pitch outcome for each simulated pitch sketches an image of expected player performance.   The greater the number of simulations the sharper that image becomes.

A plan was made to generate 20,000 simulations for each one of the 15 million rows in the historical dataset.  This would yield a final dataset of 300 billion simulated pitches ready for inference with their non-linear models, and provide the organization with the data to evaluate their players more accurately.

The problem with this approach was that by nature R operates in a single threaded, single node execution environment.  Even when leveraging the multi-threading packages available in open source and a CPU heavy VM in the cloud, the team estimated that it would take months for the code to complete, if it completed at all.  The question became one of scale:  How can the simulation and inference logic scale to 300 billion rows and complete the job in a reasonable time frame?  The answer lay in the Databricks Unified Analytics Platform powered by Apache Spark.

Scaling R with Spark 

The first step in scaling this simulation pipeline was to refactor the feature engineering code to work with one of the two packages available in R for Spark – SparkR or sparklyr.  Luckily, they had written their logic using the popular data manipulation package dplyr, which is tightly integrated with sparklyr. This integration enables the use of dplyr functions on the tbl_spark objects created when reading data with sparklyr.  In this instance we only had to convert a few ifelse() statements to dplyr::mutate(case_when(...)), and their feature engineering code would scale from a single node process to a massively parallel workload with Spark!  In fact, only about 10% of the existing dplyr code needed to be refactored to work with Spark.  We were now able to generate billions of rows for inference in a matter of minutes.
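As a rough illustration of the kind of refactor described above (not the Twins’ actual code, and with a hypothetical table and column names), an ifelse() call written for a local data frame can be re-expressed with dplyr verbs that sparklyr pushes down to Spark:

library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

# reference an existing (hypothetical) table of pitch-level features as a tbl_spark
pitches <- tbl(sc, "pitch_features")

# single-node base R version (illustrative):
#   pitches$same_handed <- ifelse(pitches$pitcher_hand == pitches$batter_hand, 1, 0)

# Spark-scalable version: the same logic expressed with dplyr,
# which sparklyr translates into Spark SQL for us
pitches_fe <- pitches %>%
  mutate(same_handed = case_when(
    pitcher_hand == batter_hand ~ 1,
    TRUE                        ~ 0
  ))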

How does this dplyr magic work?  To understand this we need to first understand how the R to Spark interface is structured:

R + Spark infrastructure, where most of the R code gets translated into Scala and then sent to JVM where driver tasks are assigned to each worker node.

Most of the functions in the Spark + R packages are wrappers around native Spark classes – your R code gets translated into Scala, and those commands are sent to the JVM on the driver, where tasks are then assigned to each worker.  In a similar vein, dplyr verbs are translated into SQL expressions that are then evaluated by Spark through Spark SQL.  As a result, you should generally see the same performance from the Spark + R packages that you would expect from Scala.
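To see this translation in action, we can ask for the SQL that will be sent to Spark before anything is executed; this continues the hypothetical sketch above and assumes the pitches_fe tbl_spark defined there:

# dplyr verbs on a tbl_spark are lazily translated to Spark SQL;
# show_query() prints the generated SQL without running it
pitches_fe %>% show_query()

# nothing is computed until results are requested, e.g. with collect()
pitches_fe %>% head(10) %>% collect()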

Distributed Inference with R Packages 

With the feature engineering pipeline running in Spark, our attention turned towards scaling model inference. We considered three distinct approaches:

  • Retrain the non-linear models in Spark
  • Extract the model coefficients from R and score in Spark
  • Embed the R models in a user defined function and parallelize execution with Spark

Let’s briefly examine the feasibility and tradeoffs associated with each of these approaches.

Retrain the Non-linear Models in Spark

The first approach was to see if an implementation of the modeling technique from R existed in Spark’s machine learning library (SparkML).  In general, if your model type is available in SparkML it is best to choose that for performance, stability, and simpler debugging. This can require refactoring code somewhat if you are coming from R or Python, but the scalability gains can easily make it worth it.

Unfortunately the modeling approach chosen by the team is not available in SparkML. There has been some work done in academia, but since it involved introducing modifications to the codebase in Spark we decided to deprioritize this approach in favor of alternatives. The amount of work required to refactor and maintain this code made it prohibitive to use as a first approach.

Extracting Model Coefficients 

If we couldn’t retrain in Spark, perhaps we could use the learned coefficients from R models and apply them to a Spark DataFrame?  The idea here is that inference can be performed by multiplying an input DataFrame of features by a smaller DataFrame of coefficients to arrive at predicted values. Again, the tradeoff would be some refactoring code to orchestrate coefficient extraction from R to Spark in order to gain massive scalability and performance.

For this approach to work with this type of non-linear model, we would need to extract a matrix of linear predictors along with the model coefficients.  Due to the nature of the R package being used, this matrix cannot be generated without passing new data through the R model object itself.  Generating a matrix of linear predictors from 300 billion rows is not feasible in R, and so the second approach was abandoned.

Parallelizing R Execution with User Defined Functions in Spark

Given the lack of native support for this particular non-linear modeling approach in Spark and the futility of generating a 300 billion row matrix in R, we turned to User Defined Functions (UDFs) in SparkR and sparklyr.

In order to understand UDFs, we need to take a step back and reconsider the Spark and R architecture shown above.

Spark + R UDF architecture, where UDFs create an R process on each worker, enabling the user to execute arbitrary R code in parallel across the cluster.

With UDFs, the pattern changes somewhat from before.  UDFs create an R process on each worker, enabling the user to execute arbitrary R code in parallel across the cluster.  These functions can be applied to a partition or group of a Spark DataFrame, and will return the results back as a Spark DataFrame. Another way to think of this is that UDFs provide access to the R console on each worker, with the ability to apply an R function to data there and return the results back to Spark.

To perform inference on billions of rows with a model type not available in SparkML, we loaded the simulation data into a Spark DataFrame and applied an R function to each partition using SparkR::dapply. The high level structure of our function was as follows.

results <- dapply(features,
                  # x is a partition of data
                  function(x) {
                    # load the trained model object from DBFS
                    model <- readRDS("/dbfs/pitch_outcome_model.rds")
                    # infer pitch outcomes for this partition
                    x$predictions <- predict(model, data = x)
                    # return the partition with predictions appended
                    x
                  },
                  schema = schema)

Let’s break this function down further.  For each partition of data in the features Spark DataFrame we apply an R function to:

  • Load an R model into memory from the Databricks File System (DBFS)
  • Make predictions in R
  • Return the resulting dataframe back to Spark
  • Specify output schema for the results Spark DataFrame

DBFS is a path mounted on each node in the cluster that points to cloud storage, and is accessible from R.  This makes it easy to load models on the workers themselves instead of broadcasting them as variables across the network with Spark.  Ultimately, this third approach proved fruitful and allowed the team to scale their models trained in R for distributed inference on billions of rows with Spark.
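For teams working in sparklyr rather than SparkR, a roughly equivalent pattern uses spark_apply(); this is a sketch under the same assumptions as the dapply example above (a Spark DataFrame of simulated features and a model object saved to DBFS), not the code used in production:

library(sparklyr)

sc <- spark_connect(method = "databricks")
features_tbl <- tbl(sc, "simulated_pitches")   # hypothetical table of simulated pitches

# spark_apply() runs the supplied R function once per partition,
# much like SparkR::dapply in the example above
results <- spark_apply(
  features_tbl,
  function(df) {
    # load the model from DBFS on the worker itself
    model <- readRDS("/dbfs/pitch_outcome_model.rds")
    df$predictions <- predict(model, data = df)
    df
  }
)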

Conclusion

In this post we reviewed a number of different approaches for scaling simulation and player valuation pipelines written in R for the Minnesota Twins MLB team.  We walked through the reasoning for choosing various approaches, the tradeoffs, and the architectural differences between core SparkR/sparklyr functions and user defined functions.

From feature engineering to model training, coefficient extraction, and finally user defined functions it is clear that plenty of options exist to extend the power of the R ecosystem to big data.

In the next post we will take a deeper dive into user defined functions with R in Spark.  We will learn how to optimize performance by managing the behavior of R inside the UDF as well as the behavior of Spark as it orchestrates execution.

--

Try Databricks for free. Get started today.

The post How the Minnesota Twins Scaled Pitch Scenario Analysis to Measure Player Performance – Part 1 appeared first on Databricks.

Automate continuous integration and continuous delivery on Databricks using Databricks Labs CI/CD Templates


CONTENTS

  • How to learn more
  • Outlook and next steps
  • How to contribute?

    Overview

    Databricks Labs continuous integration and continuous deployment (CI/CD) Templates are an open source tool that makes it easy for software development teams to use existing CI tooling with Databricks Jobs. Furthermore, it includes pipeline templates with Databricks’ best practices baked in that run on both Azure and AWS so developers can focus on writing code that matters instead of having to set up full testing, integration and deployment systems from scratch.

    CI/CD Templates in 3 steps:

1. Install Cookiecutter: pip install cookiecutter
2. Generate the project: cookiecutter https://github.com/databrickslabs/cicd-templates.git
  • Answer the interactive questions in the terminal, such as which cloud you would like to use, and you have a full working pipeline.
  • Install and configure the Databricks CLI: pip install databricks_cli && databricks configure --token
  • Start the pipeline on Databricks by running ./run_pipeline.py pipelines in your project’s main directory.
3. Add your Databricks token and workspace URL to GitHub secrets and commit your pipeline to a GitHub repo.

Your Databricks Labs CI/CD pipeline will now automatically run tests against Databricks whenever you make a new commit into the repo. When you are ready to deploy your code, make a GitHub release and the templates will automatically package and deploy your pipeline to Databricks as a job.

    That’s it! You now have a scalable working pipeline which your development team can use and develop off of. Additionally, you can always modify the template to be more specific to your team or use-case to ensure future projects can be set up with ease.

    For the remainder of this post, we’ll go into depth about why we decided to create Databricks Labs CI/CD templates, what is planned for the future of the project, and how to contribute.

    Why do we need yet another deployment framework?

    As projects on Databricks grow larger, Databricks users may find themselves struggling to keep up with the numerous notebooks containing the ETL, data science experimentation, dashboards etc. While there are various short term workarounds such as using the %run command to call other notebooks from within your current notebook, it’s useful to follow traditional software engineering best practices of separating reusable code from pipelines calling that code. Additionally, building tests around your pipelines to verify that the pipelines are also working is another important step towards production-grade development processes.

Finally, being able to run jobs automatically upon new code changes, without having to manually trigger the job or manually install libraries on clusters, is important for achieving scalability and stability of your overall pipeline. In summary, to scale and stabilize our production pipelines, we want to move away from running code manually in a notebook and move towards automatically packaging, testing, and deploying our code using traditional software engineering tools such as IDEs and continuous integration tools.

Indeed, more and more data teams are using Databricks as a runtime for their workloads, preferring to develop their pipelines using traditional software engineering practices: using IDEs, Git and traditional CI/CD pipelines. These teams usually would like to cover their data processing logic with unit tests and perform integration tests after each change in their version control system.

The release process is also managed using a version control system: after a PR is merged into the release branch, integration tests can be performed and, if the results are positive, the deployment pipelines can be updated as well. Bringing a new version of pipelines to a production workspace is also a complex process, since they can have different dependencies, such as configuration artifacts, Python and/or Maven libraries and other dependencies. In most cases, different pipelines can depend on different versions of the same artifact(s).

    Simplifying CI/CD on Databricks via reusable templates

Many organizations have invested significant resources into building their own CI/CD pipelines for different projects. All those pipelines have a lot in common: they basically build, deploy and test some artifacts. In the past, developers also invested long hours in writing scripts for building, testing and deploying applications before CI tools made most of those tasks obsolete: conventions introduced by CI tools made it possible to provide developers with frameworks that implement most of those tasks in an abstract way, so that they can be applied to any project which follows the conventions. For example, Maven introduced such conventions in Java development, making it possible to automate most of the build process that had previously been implemented in huge Ant scripts.

    Databricks Labs CI/CD Templates makes it easy to use existing CI/CD tooling, such as Jenkins,  with Databricks; Templates contain pre-made code pipelines created according to Databricks best practices. Furthermore, Templates allow teams to package up their CI/CD pipelines into reusable code to ease the creation and deployment of future projects. Databricks Labs CI/CD Templates introduces similar conventions for Data Engineering and Data Science projects which provide data practitioners using Databricks with abstract tools for implementing CI/CD pipelines for their data applications.

Let us go deeper into the conventions we have introduced. Most of the data processing logic, including data transformations, feature generation logic, model training, etc., should be developed in the Python package. This logic can be utilized in a number of production pipelines that can be scheduled as jobs. The aforementioned logic can also be tested using local unit tests that test individual transformation functions, and with integration tests. Integration tests are run on the Databricks workspace and can test the data pipelines as a whole.

    Development lifecycle using Databricks Deployments

    Data Engineers and Data Scientists can rely on Databricks Labs CI/CD Templates for testing and deploying the code they develop in their IDEs locally in Databricks. Databricks Labs CI/CD Templates provides users with the reusable data project template that can be used to jumpstart the development of a new data use case. This project will have the following structure:

The structure of Databricks Labs’ reusable data project templates makes it easy for developers to jumpstart the development of a new data use case.

Data ingestion, validation, and transformation logic, together with feature engineering and machine learning models, can be developed in the Python package. This logic can be utilized by production pipelines and be tested using developer and integration tests. Databricks Labs CI/CD Templates can deploy production pipelines as Databricks Jobs, including all dependencies, automatically. These pipelines must be placed in the ‘pipelines’ directory and can have their own set of dependencies, including different libraries and configuration artifacts. Developers can utilize a local mode of Apache Spark or Databricks Connect to test the code while developing in an IDE installed on their laptop. If they would like to run these pipelines on Databricks, they can use the Databricks Labs CI/CD Templates CLI. Developers can also utilize the CLI to kick off integration tests for the current state of the project on Databricks.

After that, users can push changes to GitHub, where they will be automatically tested on Databricks using the GitHub Actions configuration. After each push, GitHub Actions starts a VM that checks out the code of the project and runs the local pytest tests in this VM. If these tests are successful, it builds the Python wheel, deploys it along with all other dependencies to Databricks, and runs the developer tests on Databricks.
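To make the flow concrete, a push-triggered workflow of this kind might look roughly like the following; this is an illustrative sketch only, since the actual workflow files are generated for you by the template:

name: dev-tests
on: push

jobs:
  test:
    runs-on: ubuntu-latest
    env:
      # the secrets configured on the GitHub repository (described later in this post)
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.7'
      - name: Install project and test dependencies
        run: pip install -r requirements.txt && pip install -e .
      - name: Run local unit tests
        run: pytest tests
      # the generated workflow then builds the wheel and runs the dev-tests
      # on Databricks; that step is omitted here for brevity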

At the end of the development cycle, the whole project can be deployed to production by creating a GitHub release, which will kick off integration tests in Databricks and deployment of production pipelines as Databricks Jobs. In this case the CI/CD pipeline will look similar to the previous one, but instead of developer tests, integration tests will be run on Databricks and, if they are successful, the production job specification on Databricks will be updated.

    How to create and deploy a new data project with Databricks Labs CI/CD Templates in 10 minutes?

Create a new project using the Databricks Labs CI/CD Templates project template

    • Install Cookiecutter python package: pip install cookiecutter
    • Create your project using our cookiecutter template: cookiecutter  https://github.com/databrickslabs/cicd-templates.git
    • Answer the questions…

    After that, the new project will be created for you. It will have the following structure:

    .
    ├── cicd_demo
    │   ├── __init__.py
    │   ├── data
    │   │   ├── __init__.py
    │   │   └── make_dataset.py
    │   ├── features
    │   │   ├── __init__.py
    │   │   └── build_features.py
    │   ├── models
    │   │   ├── __init__.py
    │   │   ├── predict_model.py
    │   │   └── train_model.py
    │   └── visualization
    │       ├── __init__.py
    │       └── visualize.py
    ├── create_cluster.py
    ├── deployment
    │   └── databrickslabs_mlflowdepl-0.2.0-py3-none-any.whl
    ├── deployment.yaml
    ├── dev-tests
    │   ├── pipeline1
    │   │   ├── job_spec_aws.json
    │   │   ├── job_spec_azure.json
    │   │   └── pipeline_runner.py
    │   └── pipeline2
    │       ├── job_spec_aws.json
    │       ├── job_spec_azure.json
    │       └── pipeline_runner.py
    ├── docs
    │   ├── Makefile
    │   ├── commands.rst
    │   ├── conf.py
    │   ├── getting-started.rst
    │   ├── index.rst
    │   └── make.bat
    ├── integration-tests
    │   ├── pipeline1
    │   │   ├── job_spec_aws.json
    │   │   ├── job_spec_azure.json
    │   │   └── pipeline_runner.py
    │   └── pipeline2
    │       ├── job_spec_aws.json
    │       ├── job_spec_azure.json
    │       └── pipeline_runner.py
    ├── notebooks
    ├── pipelines
    │   ├── pipeline1
    │   │   ├── job_spec_aws.json
    │   │   ├── job_spec_azure.json
    │   │   └── pipeline_runner.py
    │   └── pipeline2
    │       ├── job_spec_aws.json
    │       ├── job_spec_azure.json
    │       └── pipeline_runner.py
    ├── requirements.txt
    ├── run_pipeline.py
    ├── runtime_requirements.txt
    ├── setup.py
    └── tests
    └── test_smth.py
    

The name of the project we have created is ‘cicd_demo’, so the Python package name is also ‘cicd_demo’. Our transformation logic will be developed in the ‘cicd_demo’ directory, and it can be used from the pipelines that will be placed in the ‘pipelines’ directory. In the ‘pipelines’ directory we can develop a number of pipelines, each of them in its own directory.

Each pipeline must have an entry point Python script, which must be named ‘pipeline_runner.py’. In this project, we can see two sample pipelines created. Each of these pipelines has a Python script and a job specification JSON file for each supported cloud. These files can be used to define the cluster specification (e.g., number of nodes, instance type, etc.), job scheduling settings, and so on.
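As an illustration, a job specification for one of these pipelines might look roughly like the following; the field values here are placeholders, and the exact contents are generated into job_spec_aws.json / job_spec_azure.json by the template:

{
  "name": "cicd_demo_pipeline1",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2
  },
  "libraries": [],
  "timeout_seconds": 3600
}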

The ‘dev-tests’ and ‘integration-tests’ directories are used to define integration tests that test the pipelines on Databricks. They should also utilize the logic developed in the Python package and evaluate the results of the transformations.

Let’s deploy our project to the target Databricks workspace

Databricks Deployments is tightly integrated with GitHub Actions. We will need to create a new GitHub repository where we can push our code and where we can utilize GitHub Actions to test and deploy our pipelines automatically. In order to integrate the GitHub repository with the Databricks workspace, the workspace URL and a personal access token (PAT) must be configured as GitHub secrets: the workspace URL as the DATABRICKS_HOST secret and the token as DATABRICKS_TOKEN.

Now we can initialize a new git repository in the project directory. After that, we can add all the files to git and push them to the remote GitHub repository. Once we have configured our secrets and made our first push, GitHub Actions will run the dev-tests automatically on the target Databricks workspace, and our first commit will be marked green if the tests are successful.

It is possible to initiate a run of production pipelines or individual tests on Databricks from the local environment by running the run_pipeline.py script:

        ./run_pipeline.py pipelines --pipeline-name test_pipeline
    

    This command will run test_pipeline from the pipelines folder on Databricks.

    Test Automation  using Databricks Deployments

The newly created projects are preconfigured with two standard CI/CD pipelines: one is executed for each push and runs the dev-tests on the Databricks workspace.

The other runs for each GitHub release and executes the integration-tests on the Databricks workspace. If the integration tests pass, the production pipelines are deployed as jobs to the Databricks workspace.


    Deploying production pipelines using Databricks Deployments

In order to deploy pipelines to the production workspace, a GitHub release can be created. It will automatically start the integration tests and, if they pass, deploy the production pipelines as jobs to the Databricks workspace. During the first run, the jobs will be created in the Databricks workspace. During subsequent releases, the definitions of the existing jobs will be updated.

    Dependency and configuration management

    Databricks Deployments supports dependency management on two levels:

• Project level:
  • Project-level Python package dependencies, which are needed during production runtime, can be placed in runtime_requirements.txt
  • It is also possible to use project-level JAR or Python wheel dependencies. They can be placed in the dependencies/jars and dependencies/wheels directories.
• Pipeline dependencies:
  • Pipeline-level Python/Maven/other dependencies can be specified in the job specification JSON, directly in the libraries section
  • JARs and wheels can be placed in the dependencies/jars and dependencies/wheels directories, respectively, in the pipeline folder

Configuration files can be placed in the pipeline directory. They will be logged to MLflow together with the Python script. During execution on Databricks, the job script will receive the path to the pipeline folder as its first parameter. This parameter can be used to open any files that were present in the pipeline directory.

    Let’s discuss how we can manage dependencies using Databricks Deployments using the following example:

    .
    ├── dependencies
    │   ├── jars
    │   │   └── direct_dep.jar
    │   └── wheels
    │       └── someotherwheel-0.1.0-py3-none-any.whl
    ├── job_spec_aws.json
    ├── job_spec_azure.json
    ├── pipeline_runner.py
    └── train_config.yaml
    

This pipeline has two pipeline-level dependencies: one JAR file and one wheel. The train_config.yaml file contains configuration parameters that the pipeline can read using the following code:

import sys
import yaml

def read_config(name, root):
    try:
        # the job receives the DBFS path of the pipeline folder as its first argument;
        # convert the 'dbfs:' URI into the locally mounted '/dbfs' path
        filename = root.replace('dbfs:', '/dbfs') + '/' + name
        with open(filename) as conf_file:
            conf = yaml.load(conf_file, Loader=yaml.FullLoader)
            return conf
    except FileNotFoundError as e:
        raise FileNotFoundError(
            f"{e}. Please include a config file!")

conf = read_config('train_config.yaml', sys.argv[1])
    

    How to learn more?

    You can follow our tutorial on github to build a sample data project using Databricks Labs CI/CD Templates. The tutorial is available under the following link: https://github.com/databrickslabs/cicd-templates/blob/master/tutorial.md

    Outlook and next steps

There are different directions for the further development of Databricks Deployments. We are thinking of extending the set of CI/CD tools we provide templates for. As of now it is just GitHub Actions, but we could add templates that integrate with CircleCI or Azure DevOps.

Another direction could be supporting pipelines developed in Scala.

    How to contribute?

    Databricks Labs CI/CD Templates is an open source tool and we happily welcome contributions to it. You are welcome to submit a PR!

    --

    Try Databricks for free. Get started today.

    The post Automate continuous integration and continuous delivery on Databricks using Databricks Labs CI/CD Templates appeared first on Databricks.

    Modernizing Risk Management Part 2: Aggregations, Backtesting at Scale and Introducing Alternative Data


Understanding and mitigating risk is at the forefront of any financial services institution. However, as previously discussed in the first blog of this two-part series, banks today are still struggling to keep up with the emerging risks and threats facing their business. Plagued by the limitations of on-premises infrastructure and legacy technologies, banks until recently have not had the tools to effectively build a modern risk management practice. Luckily, a better alternative exists today based on open-source technologies powered by cloud-native infrastructure. This modern risk management framework enables intraday views, aggregations on demand and an ability to future-proof and scale risk management. In this two-part blog series, we demonstrate how to modernize traditional value-at-risk calculation through the use of Delta Lake, Apache Spark™ and MLflow in order to enable a more agile and forward looking approach to risk management.

A modern approach to portfolio risk management requires the use of technologies like Delta Lake, Apache Spark™ and MLflow in order to scale value-at-risk calculations, backtest models and explore alternative data.

The first demo addressed the technical challenges related to modernizing risk management practices with data and advanced analytics, covering the concepts of risk modelling and Monte Carlo simulations using MLflow and Apache Spark™. This article focuses on the risk analyst persona and their requirement to efficiently slice and dice risk simulations (on demand) in order to better understand portfolio risks as new threats emerge, in real time. We will cover the following topics:

    • Using Delta Lake and SQL for aggregating value-at-risk on demand
• Using Apache Spark™ and MLflow to backtest models and report breaches to regulators
    • Exploring the use of alternative data to better assess your risk exposure

    Slicing and dicing value-at-risk with Delta Lake

In part one of this two-part blog series, we unveiled what a modern risk management platform looks like and the need for FSIs to shift the lens through which data is viewed: not as a cost, but as an asset. We demonstrated the versatile nature of data and how storing Monte Carlo data in its most granular form would enable multiple use cases, providing analysts with the flexibility to run ad-hoc analysis and contributing to a more robust and agile view of the risks banks are facing.

    A modern portfolio risk management platform utilizes Delta Lake and SQL to aggregate value-at-risk on demand.

In this blog and demo, we uncover the risk of various investments in a Latin American equity portfolio composed of 40 instruments across multiple industries. For that purpose, we leverage the vast amount of data we were able to generate through Monte Carlo simulations (40 instruments x 50,000 simulations x 52 weeks ≈ 100 million records), partitioned by day and enriched with our portfolio taxonomy as follows.

    Sample Latin American equity portfolio used for modern value-at-risk calculations.

    Value-at-risk

Value-at-risk involves simulating random walks that cover possible outcomes as well as worst-case (n) scenarios. A 95% value-at-risk for a period of (t) days is the best-case scenario out of the worst 5% of trials.

As our trials were partitioned by day, analysts can easily access a day’s worth of simulation data and group individual returns by trial Id (i.e. the seed used to generate financial market conditions) in order to access the daily distribution of our investment returns and its respective value-at-risk.  Our first approach is to use Spark SQL to aggregate our simulated returns for a given day (50,000 records) and use in-memory Python to compute the 5% quantile through a simple numpy operation.

import numpy as np
from pyspark.sql import functions as F

# aggregate simulated instrument returns by trial (seed) for a given day
returns = spark \
    .read \
    .table(monte_carlo_table) \
    .filter(F.col('run_date') == '2020-01-01') \
    .groupBy('seed') \
    .agg(F.sum('trial').alias('return')) \
    .select('return') \
    .toPandas()['return']

# the 95% value-at-risk is the 5% quantile of the simulated return distribution
value_at_risk = np.quantile(returns, 5 / 100)
    

    Provided an initial $10,000 investment across all our Latin American equity instruments, the 95% value-at-risk – at that specific point in time – would have been $3,000. This is how much our business would be ready to lose (at least) in the worst 5% of all the possible events.

    Sample data visualization depicting a 95% value-at-risk generated by the use of Spark SQL to aggregate simulated returns for a given day.

The downside of this approach is that we first need to collect all daily trials in memory in order to compute the 5% quantile. While this process can be performed easily when using one day’s worth of data, it quickly becomes a bottleneck when aggregating value-at-risk over a longer period of time.

    A pragmatic and scalable approach to problem solving

Extracting a percentile from a large dataset is a known challenge for any distributed computing environment. A common (albeit inefficient) practice is to 1) sort all of your data and 2) cherry-pick a specific row using takeOrdered, or to find an approximation through the approxQuantile method. Our challenge is slightly different, since our data does not constitute a single dataset but spans multiple days, industries and countries, where each bucket may be too big to be efficiently collected and processed in memory.
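For reference, the approximate route mentioned above would look something like the following in PySpark (reusing the monte_carlo_table name and the F alias from the earlier snippet); the relativeError parameter trades accuracy for speed:

# approximate 5% quantile computed by Spark, without collecting trials to the driver
daily_returns = (
    spark.read.table(monte_carlo_table)
        .filter(F.col('run_date') == '2020-01-01')
        .groupBy('seed')
        .agg(F.sum('trial').alias('return'))
)

approx_var = daily_returns.stat.approxQuantile('return', [0.05], 0.01)[0]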

    In practice, we leverage the nature of value-at-risk and only focus on the worst n events (n small). Given 50,000 simulations for each instrument and a 99% VaR, we are interested in finding the best of the worst 500 experiments only. For that purpose, we create a user defined aggregate function (UDAF) that only returns the best of the worst n events. This approach will drastically reduce the memory footprint and network constraints that may arise when computing large scale VaR aggregation.

    class ValueAtRisk(n: Int) extends UserDefinedAggregateFunction {
    
        // These are the input fields for your aggregate function.
        override def inputSchema: org.apache.spark.sql.types.StructType = {
            StructType(StructField("value", DoubleType) :: Nil)
        }
        
        // These are the internal fields you keep for computing your aggregate.
        override def bufferSchema: StructType = StructType(
            Array(
                StructField("worst", ArrayType(DoubleType))
            )
        )
        
        // This is the output type of your aggregation function.
        override def dataType: DataType = DoubleType
        
        // The order we process dataframe does not matter
        // the worst will always be the worst
        override def deterministic: Boolean = true
        
        // This is the initial value for your buffer schema.
        override def initialize(buffer: MutableAggregationBuffer): Unit = {
            buffer(0) = Seq.empty[Double]
        }
        
        // This is how to update your buffer schema given an input.
        // We sort and keep only the worst n values seen so far to bound the buffer size.
        override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
            buffer(0) = (buffer.getAs[Seq[Double]](0) :+ input.getAs[Double](0)).sorted.take(n)
        }
        
        // This is how to merge two objects with the bufferSchema type.
        // We only keep worst N events
        override def merge(buffer: MutableAggregationBuffer, row: Row): Unit = {
            buffer(0) = (
                buffer.getAs[Seq[Double]](0) ++ row.getAs[Seq[Double]](0)
            ).sorted.take(n)
        }
        
        // This is where you output the final value
        // Our value at risk is best of the worst n overall
        override def evaluate(buffer: Row): Any = {
            return buffer.getAs[Seq[Double]](0).sorted.last
        }
        
    }
    
// The 95% value-at-risk is the best of the worst n events,
// where numSimulations is the number of Monte Carlo trials (i.e. seeds) per day
val n = (100 - 95) * numSimulations / 100
    
// Register the UDAF so it can be called from SQL
    spark.udf.register("VALUE_AT_RISK", new ValueAtRisk(n))
    

By registering our UDAF through the spark.udf.register method, we expose that functionality to all of our users, democratizing risk analysis to everyone without requiring advanced knowledge of Scala, Python or Spark. One simply has to group by trial Id (i.e. seed) in order to apply the above and extract the relevant value-at-risk using plain old SQL across all their data.

    SELECT 
        t.run_date AS day, 
        VALUE_AT_RISK(t.return) AS value_at_risk
    FROM 
        (
        SELECT 
            m.run_date, 
            m.seed, 
            sum(m.trial) AS return
        FROM
            monte_carlo m
        GROUP BY
            m.run_date, 
            m.seed 
        ) t
    GROUP BY 
        t.run_date
    ORDER BY t.run_date ASC
    

    We can easily uncover the effect of COVID-19 on our market risk calculation. A 90-day period of economic volatility resulted in a much lower value-at-risk and therefore a much higher risk exposure overall since early March 2020.

    Sample value-at-risk calculation generated for Covid-19 conditions using Spark SQL

    Holistic view of our risk exposure

In most cases, understanding overall value-at-risk is not enough. Analysts need to understand the risk exposure of different books, asset classes, industries or countries of operation. In addition to Delta Lake capabilities such as time travel and ACID transactions discussed earlier, Delta Lake and Apache Spark™ have been highly optimized on the Databricks runtime to provide fast aggregations at read. High performance can be achieved using our native partitioning logic (by date) alongside z-order indexing applied to both country and industry. This additional indexing will be fully exploited when selecting a specific slice of your data at the country or industry level, drastically reducing the amount of data that needs to be read prior to the VaR aggregation.

    OPTIMIZE monte_carlo ZORDER BY (country, industry)
    

We can easily adapt the above SQL code by using country and industry as grouping parameters for the VALUE_AT_RISK method in order to get a more granular and descriptive view of our risk exposure, as shown below. The resulting data set can be visualized “as-is” using a Databricks notebook and can be further refined to understand the exact contribution each of these countries has to our overall value-at-risk.
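A sketch of that adaptation follows; it assumes the country and industry columns from the portfolio taxonomy are present on the monte_carlo table, as implied by the Z-ORDER example above:

SELECT 
    t.run_date AS day, 
    t.country,
    t.industry,
    VALUE_AT_RISK(t.return) AS value_at_risk
FROM 
    (
    SELECT 
        m.run_date, 
        m.country,
        m.industry,
        m.seed, 
        sum(m.trial) AS return
    FROM
        monte_carlo m
    GROUP BY
        m.run_date, 
        m.country,
        m.industry,
        m.seed 
    ) t
GROUP BY 
    t.run_date,
    t.country,
    t.industry
ORDER BY t.run_date ASC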

    Sample data visualization depicting  value-at-risk by country generated by Spark SQL aggregations.

    In this example, Peru seems to have the biggest contribution to our overall risk exposure. Looking at the same SQL code at an industry level in Peru, we can investigate the contribution of the risk across industries.

    Sample data visualization depicting  in-country investment portfolio value-at-risk by industry generated by Spark SQL aggregations.

With a contribution close to 60% in March 2020, the main risk exposure in Peru seems to be related to the mining industry. An increasingly severe lockdown in response to the COVID-19 virus has been impacting mining projects in Peru, a center for copper, gold and silver production (source).

Stretching the scope of this article, we may wonder if we could have identified this trend earlier using alternative data, specifically the Global Database of Events, Language and Tone (GDELT). In the graph below, we report the media coverage for the mining industry in Peru, color coding positive and negative trends through a simple moving average.

    Sample data visualization depicting financial portfolio risk due to Covid-19 conditions, demonstrating the importance of modernized value-at-risk calculations using augmented historical data with external factors derived from alternative data.

    This clearly exhibits a positive trend in early February, i.e. 15 days prior to the observed stock volatility, which could have been an early indication of mounting risks. This analysis stresses the importance of modernizing value-at-risk calculations, augmenting historical data with external factors derived from alternative data.

    Model backtesting

In response to the 2008 financial crisis, an additional set of measures was developed by the Basel Committee on Banking Supervision. The 1-day VaR 99 results are to be compared against daily P&Ls. Backtests are to be performed quarterly using the most recent 250 days of data. Based on the number of exceedances experienced during that period, the VaR measure is categorized as falling into one of three colored zones.

Level  | Threshold              | Results
Green  | Up to 4 exceedances    | No particular concerns raised
Yellow | Up to 9 exceedances    | Monitoring required
Red    | 10 or more exceedances | VaR measure to be improved

    AS-OF value-at-risk

    Given the aggregated function we defined earlier, we can extract daily value-at-risk across our entire investment portfolio. As our aggregated value-at-risk dataset is small (contains 2 years of history, i.e. 365 x 2 data points), our strategy is to collect daily VaR and broadcast it to our larger set in order to avoid unnecessary shuffles. More details on AS-OF functionalities can be found in an earlier blog post Democratizing Financial Time Series Analysis.

case class VarHistory(time: Long, valueAtRisk: Double)
    
    val historicalVars = sql(s"""
        SELECT t.run_date, VALUE_AT_RISK(t.return) AS valueAtRisk
        FROM 
            (
            SELECT m.run_date, m.seed, sum(m.trial) AS return
        FROM monte_carlo m
            GROUP BY m.run_date, m.seed
            ) t
        GROUP BY 
            t.run_date
        """
        )
        .withColumn("time", convertDate(col("run_date")))
        .orderBy(asc("time"))
        .select("time", "valueAtRisk")
        .as[VarHistory]
        .collect()
        .sortBy(_.time)
        .reverse
    
    val historicalVarsB = spark.sparkContext.broadcast(historicalVars)
    

    We retrieve the closest value-at-risk to our actual returns via a simple user defined function and perform a 250-day sliding window to extract continuous daily breaches.

    val asOfVar = udf((s: java.sql.Date) => {
        val historicalVars = historicalVarsB.value
        if(s.getTime < historicalVars.last.time) {
            Some(historicalVars.last.valueAtRisk)
        } else {
            historicalVarsB
                .value
                .dropWhile(_.time > s.getTime)
                .headOption.map(_.valueAtRisk) 
        }
    })
    
    val windowSpec = Window.orderBy("time").rangeBetween(-3600 * 24 * 250, 0)
    val countBreaches = udf((asOfVar: Double, returns: Seq[Double]) => {
        returns.count(_ < asOfVar)
    })
    
    spark
        .read
        .table(stock_return_table)
        .groupBy("date")
        .agg(sum("return").as("return"))
        .withColumn("var", asOfVar(col("date")))
        .orderBy(asc("date"))
        .withColumn("returns", collect_list("return").over(windowSpec))
        .withColumn("count", countBreaches(col("var"), col("returns")))
        .createOrReplaceTempView("breaches")
    

We can observe a consecutive series of 17 breaches from February onwards that would need to be reported to regulators under the Basel III framework. The same can be reported on a graph, over time.

    Sample backtest generated through the use of Spark SQL and Delta Lake to aggregate VaR over time.

In early 2020, we observed a period of unusual stability that, in hindsight, seems to presage the difficult times we are now facing. We can also observe that our value-at-risk decreases dramatically (as our overall risk increases) but does not seem to decrease as fast as the actual returns. This apparent lag in our value-at-risk calculation is due to the 90-day observation period of volatility required by our model.

With our model registered in MLflow, we may want to record these results as evidence for audit and regulation, providing a single source of truth for our risk models, their accuracies, their technical context (for transparency), as well as the worst-case scenarios identified here.

    Stressed VaR

Introducing a "stressed VaR" helps mitigate the risk we face today by including the worst-ever trading days as part of our ongoing calculation. However, this doesn't change the fact that the whole approach is solely based on historical data and unable to cope with actual volatility driven by new emerging threats. In fact, despite complex "stressed VaR" models, banks are no longer equipped to operate in so-called "unprecedented times" where history no longer repeats itself. As a consequence, most of the top-tier banks are currently reporting severe breaches in their value-at-risk calculations, as reported in the Financial Times article below.

Wall St banks’ trading risk surges to highest since 2011

    [...] The top five Wall St banks’ aggregate “value-at-risk”, which measures their potential daily trading losses, soared to its highest level in 34 quarters during the first three months of the year, according to Financial Times analysis of the quarterly VaR high disclosed in banks’ regulatory filings

    https://on.ft.com/2SSqu8Q

    A forward looking approach

As demonstrated earlier, a modern risk and portfolio management practice should not be based solely on historical returns but must also embrace the variety of information available today, introducing shocks to Monte Carlo simulations augmented with real-life news events as they unfold. For example, a white paper from Atkins et al. describes how financial news can be used to predict stock market volatility better than close price. As indicated via the Peru example above, the use of alternative data can dramatically augment the intelligence available to risk analysts, giving them a more descriptive lens on the modern economy and enabling them to better understand and react to exogenous shocks in real time.
     
In this series of articles, we have demonstrated how Apache Spark™, Delta Lake and MLflow can be used for value-at-risk calculations, and how banks can modernize their risk management practices by moving to the cloud and adopting a unified approach to data analytics with Databricks. In addition, we showed how banks can take back control of their data (considering data as an asset, not a cost) and enrich their view of the modern economy through the use of alternative data, in order to move towards a forward-looking and more agile approach to risk management and investment decisions.

    Modernizing Your Approach to Risk Management: Next Steps

Try the notebooks below on Databricks today! And if you want to learn how unified data analytics can bring data science, business analytics and engineering together to accelerate your data and ML efforts, check out the on-demand workshop - Unifying Data Pipelines, Business Analytics and Machine Learning with Apache Spark™.

    VaR and Risk Management Notebooks:
    https://databricks.com/notebooks/00_context.html
    https://databricks.com/notebooks/01_market_etl.html
    https://databricks.com/notebooks/02_model.html
    https://databricks.com/notebooks/03_monte_carlo.html
    https://databricks.com/notebooks/04_var_aggregation.html
    https://databricks.com/notebooks/05_alt_data.html
    https://databricks.com/notebooks/06_backtesting.html

    Contact us to learn more about how we assist customers with market risk use cases.

    --

    Try Databricks for free. Get started today.

    The post Modernizing Risk Management Part 2: Aggregations, Backtesting at Scale and Introducing Alternative Data appeared first on Databricks.


    Accelerating developers by ditching the data center


    Guest blog by R Tyler Croy, Director of Platform Engineering at Scribd

    People don’t tend to get excited about the data platform. It is often regarded much like road infrastructure: nobody thinks much about how vital it is for them to get from points A to B, unless it’s terribly bad. Imagine my surprise when I started to hear from users: “Wow this is amazing,” or “I can’t wait for my whole team to adopt it”, or “we’re really excited about it!” The enthusiasm doesn’t make the migration project any less challenging, but it certainly makes it more enjoyable.

At Scribd, we’ve had a “conventional data platform” for some time: Hadoop mixed with HDFS and a smattering of Hive. Over time the needs of the business have changed, and we now need more machine learning, more real-time data processing, and more support for teams collaborating to deliver new data products; we needed something better than the “conventional data platform.” Our new data platform is a combination of Airflow, Databricks, Delta Lake, and AWS Glue Catalog, a powerful suite of tools that have already improved our development velocity and collaboration significantly. The transition from “the old” to “the new” has been peppered with equal parts successes and stumbles as we re-platform and rid ourselves of complexity and technical debt.

    Scribd data architecture, before Databricks

The legacy data platform is conventional in more than just the technologies we have deployed; it also runs on fixed data center infrastructure. A static set of machines sitting in a data center rapidly churns through data during peak batch workloads, then wastes money and energy when idle. As the company and our needs have grown, the “peak” of the data platform became more and more noticeable, and much more painful for developers. Some would fire off queries or jobs prior to heading out to lunch, or at the end of the day, with hopes that they would get their results upon their return. A particularly grievous anti-pattern I noticed shortly after my arrival at the company: some machine learning engineers would prepare their data sets, dump them into a personal AWS S3 bucket, launch GPU-capable instances, train their models, and then submit reimbursement requests to their manager at the end of the month.

If the developer horror stories weren’t enough, operationally things were arguably worse! Many conventional data platform technologies are difficult to combine with automation tools like Chef, and as such our legacy data platform suffered from the lack of proper management that the rest of our production infrastructure enjoyed. Every time we added more machines to the environment, the process would take a day or two depending on the request. As such, we would only add nodes when we really needed them, or when drive and system failures required it.

Our conventional data platform was wasting both developers’ and infrastructure engineers’ time. I shudder thinking about what we could have done if all those talented people had instead been working on projects that advanced our business objectives.

    Modernizing our data infrastructure: evaluating options

    By mid-2019 Scribd had hired up a “Core Platform” team, chartered with building out a “Real-time Data Platform”, and a completely new team of data scientists and machine learning engineers. The unanimous agreement from all parties invested in the legacy data platform was that we had to get “to the cloud.” Coupled with a company-wide initiative to migrate into AWS, our list of potential options was relatively short. We needed a data platform that worked well on AWS, relied on S3, would run our queries and Spark jobs well, could provide some semblance of self-service for developers, and could enable new machine-learning workloads we hadn’t even conceived of yet.

    The options I looked at could be placed into two categories:

    • “Looks like a conventional data platform, but with S3, and in the cloud!” (not very compelling).
    • Databricks

Since we knew that our storage options were going to involve AWS S3 in some form or fashion, we had already started to look at how our massive data warehouse and workloads would interoperate with S3. For data platform usage, S3 is a great tool, but it’s not as simple as its name might suggest. The eventually consistent nature of S3 can cause a number of problems for Parquet files accessed through a table-like interface (e.g. Hive). Some of those issues can be worked around with S3Guard, but the additional architectural complexity made us a bit wary. Around this time we were lucky to notice the open sourcing of Delta Lake by Databricks. The initial evaluation of Delta Lake blew us away; we had found our storage layer.

    Delta Lake actually ended up being the gateway to Databricks, which wasn’t part of our initial compute platform evaluation. As we dug into Databricks more and more we found two killer features:

    • Databricks Notebooks proved to be such a killer feature for developers and analysts, who had to date been collaborating by sharing queries to copy and paste into Hue.
• The optimized Spark runtime in Databricks, which executes queries and jobs even faster, helping us get developers their results as soon as possible.

    Migrating Spark workloads to the cloud: calculating costs and benefits

    In AWS, time very directly equals money. The sooner you can shut down a machine in AWS, the less money you will spend. With some help from our sales team at Databricks, I was able to come up with a cost model for our existing Spark workloads if we were to fork-lift them directly into AWS without significant changes. They claimed an optimization of 30-50% for most traditional Spark workloads. “I’d say 30% just to be conservative,” they mentioned. Out of curiosity, I refactored my cost model to account for the price of Databricks and the potential Spark job optimizations. After tweaking the numbers I discovered that at a 17% optimization rate, Databricks would reduce our AWS infrastructure cost so much that it would pay for the cost of the Databricks platform itself.

    After our initial evaluation, I was already sold on the features and developer velocity improvements Databricks would offer. When I ran the numbers in my model, I learned that I couldn’t afford not to adopt Databricks!

    In production

    Scribd data flow infrastructure, after migration to Databricks

    The road from our data center-based data platform to Databricks is a long one, which we’re still traveling. As of today, we have backfilled our entire data warehouse into Delta Lake, a migration which conveniently addressed the egregious small files problem we had in HDFS. We have a little bit of custom tooling which is keeping data in sync from the data center to Delta Lake while we begin to move over some of our heavier-weight batch processing tasks. Despite the long road of batch-task migration ahead, we already have new projects deploying directly on top of Databricks:

    • Ad hoc internal user queries, once serviced by Hue and Hive, are being replaced en masse by powerful Delta Cache-enabled clusters. Those who previously spent much of their time in Hue are already excitedly collaborating with their peers via shared notebooks.
    • New Spark Streaming/Delta Lake projects are already in production. Workloads that were previously not possible have already been developed and deployed. For some data streams we have flowing into Kafka, we have deployed Spark Streaming applications to bring that data directly into Delta Lake, where jobs and users are querying data within minutes of its creation. For some of these workloads, users previously had to wait for 24 hour batch cycles to complete, but now they’re getting fresh production data from Delta Lake within 2-3 minutes.

When many have seen this in action, their thinking immediately has turned to “what data can I turn into streams to get more real-time insight?” Suddenly there are multiple projects in team roadmaps which include either “produce data stream for X” or “process data stream from Y”.

    To me, success for a team delivering data infrastructure and tooling is when your users are both excited to answer their questions using the infrastructure, and when they start to conceive of wholly new problems to solve with the platform. Suffice it to say, I’m already pleased with our results!

Everything isn’t quite as rosy as I would like, however: our administrative policies are still lagging behind the exuberance different teams have had in adopting Databricks. We have given developers fantastic tools to solve their problems, but we don’t yet have policies in place to prevent those users from launching oversized or overpriced clusters. The capabilities to do so exist in the platform, but when considering the cost of some EC2 waste compared to people’s time, we have erred on the side of getting clusters into people’s hands as quickly as possible for now.

    Doing it better next time around

    Re-platforming a massive data infrastructure is not easy. Our biggest pain points to date have been self-inflicted. We had a significant investment in Hive and Hive-based queries, with an assortment of custom UDFs which all needed to be migrated over to Spark and Spark SQL. We wrote some tools to help us automatically convert Hive queries and templates to Spark SQL, which managed to automate converting ~80% of those Hive workloads. The other 20% we have to convert manually, which is frustrating work.

    The migration process also uncovered more technical debt than I'm proud to admit: numerous Spark workloads that still rely on Spark 1 (!), jobs that were speaking directly to HDFS instead of using Hive's table interfaces (!), and having to test big batch jobs for which the original developers neglected to write any tests (!).

    Personally, I’m looking forward to the day when the first new employee to join who never has to see the legacy data platform. They will be gleefully unaware that there was once a time when you had to copy and paste query snippets into chat, wait until tomorrow for fresh data, or waste hours idling while your jobs completed.

    Hear R. Tyler Croy talk about Scribd’s transition into a streaming cloud-based data platform in their upcoming session at SPARK + AI Summit. Register for free here.

    Scribd is also hiring talented remote engineers to help change the way the world reads, learn more at tech.scribd.com

    --

    Try Databricks for free. Get started today.

    The post Accelerating developers by ditching the data center appeared first on Databricks.

    Data Teams Unite! Countdown to Spark + AI Summit


    Spark + AI Summit 2020 is now virtual and free! June 22-26 is just around the corner and the excitement is building! More sessions. More speakers. 4x More training. And more of the world’s data community will attend than ever before.

    When we made the decision to transform Summit into a completely virtual event, we wanted to make sure that we could offer you as much of the in-person experience as possible — but in a completely virtual environment.

    We’ve spent the last two months building a conference that will bring data teams together around 220+ sessions, a stellar lineup of keynotes and countless opportunities to connect with your peers — over 50,000 data scientists, data engineers, analysts, business leaders and other data professionals.

    Our virtual platform launches June 18, but here's a sneak peek at what awaits you. As soon as the platform launches, make sure to get a head start by building your agenda and your profile to get the most out of your conference experience.

    Personalized Dashboard

    As you enter the conference you will be welcomed by your personalized dashboard — home base for everything you need to know about the conference. We have highlighted the most useful links to access content and a quick view of your agenda. The left navigation panel will help you explore every aspect of the conference. And keep an eye on your inbox for notifications so you don’t miss any updates.

    Personalized dashboard provided as part of the Spark + AI Virtual Summit 2020 attendee experience.

    Build your Agenda

    Our agenda is jam-packed this year, with five days of technical content for data scientists, engineers, IT leaders and industry professionals. To add sessions to your agenda, simply click on the heart next to the session title.

    We also have an incredible lineup of keynotes from industry thought leaders like Ali Ghodsi, Matei Zaharia and Reynold Xin, as well as luminary keynotes from Nate Silver, Hany Farid, Amy Heineike, Adam Paszke and many others. We recommend you take time to play with the agenda filters and explore the speaker pages to build your dream agenda.

    Online agenda builder provided as part of the Spark + AI Virtual Summit 2020 attendee experience.

    Dev Hub + Expo

    Connect with your peers and sponsors at the Dev Hub + Expo! You can also book a 1-to-1 meeting with experts in the Advisory Lounge, get content specific to your industry in the Industry Lounges, learn more about Delta Lake, Apache Spark™, MLflow and more at the Databricks Booth, and interact with our amazing sponsors.

    We want to be able to bring people together and connect in a virtual space. We are doing this in many ways throughout the platform. Check out the Data People page for a directory of who’s around. You can also go to the Suggested for Me tab to meet like-minded individuals and recommended sessions/experiences to check out.

    Dev Hub and Expo provided as part of the Spark + AI Summit 2020 attendee experience.

    Summit Quest and Swag Store

    Make sure to set aside time to do our daily body breaks, get social, and rack up points to hit the top of the Summit Quest leaderboard. The more points you accumulate, the more you can shop at our Swag Store!

    Summit Quest Leaderboard provided as part of the Spark + AI Virtual Summit 2020 attendee experience.
    Swag Store provided as part of the Spark + AI Virtual Summit 2020 attendee experience.

    There is so much more that we can share, but now it is your turn to discover what Spark + AI Summit has to offer. If you have registered, join the experience now. And if you haven’t yet registered, it’s not too late. Join us for all of the action at Spark + AI Summit 2020 and we look forward to seeing you there!

    --

    Try Databricks for free. Get started today.

    The post Data Teams Unite! Countdown to Spark + AI Summit appeared first on Databricks.

    Media and Entertainment Sessions You Don’t Want to Miss at Spark + AI Summit 2020


    For years, the Spark + AI Summit has been the premier meeting place for organizations looking to build artificial intelligence (AI) applications at scale with leading technologies such as Apache Spark™, Delta Lake and MLflow. In 2020, we’re continuing the tradition by taking the summit entirely virtual. Data scientists and engineers from anywhere in the world will be able to join June 22-26 to learn and share best practices for delivering the benefits of AI.

    This year we’ve added a robust experience for data teams in the media, entertainment, gaming and communications industries. Join thousands of your peers to learn how the latest innovations in data and AI are helping drive audience engagement and deepen customer loyalty. Register for Spark + AI Summit to take advantage of the full Media and Entertainment experience at Summit. The following is a summary of all the Media and Entertainment content and events we have planned.

    Media and Entertainment Tech Talks

    Here is an overview of some of our most highly anticipated Media and Entertainment session talks at this year’s summit.
     

    Data-Driven Decisions at Scale
    Comcast is a prime example of a long-established company that has successfully made the transition from channel-centric to customer-centric by prioritizing their data-driven efforts. In this talk, they’ll discuss how their Product Analytics & Behavior Science (PABS) team plays a crucial role in the customer experience as an interpreter, transforming data into consumable insights and providing these insights to the broader product teams within Comcast to make smarter decisions and fuel innovation. You’ll learn specifically how the PABS team has been using Databricks and Delta Lake to build highly reliable and performant real-time data pipelines to deliver insights for their analytics needs in a timely manner.
     

    Advertising Fraud Detection at Scale at T-Mobile
    In this session, the T-Mobile Marketing Solutions (TMS) Data Science team will present a platform architecture and production framework supporting TMS internal products and services. Powered by Apache Spark technologies, these services operate in a hybrid of on-premises and cloud environments. They’ll cover key lessons learned and best practices from their Advertising Fraud Detection service as an example, including how they scaled machine learning algorithms outside of the Spark MLlib framework.
     

    Deliver Dynamic Customer Journey Orchestration at Scale
    The traditional one-size-fits-all customer journey is no longer a viable option in today’s omnichannel environment. Publicis’s answer to the ever-expanding landscape is COSMOS, a customer intelligence platform that offers a set of comprehensive and scalable Marketing Machine Learning (MML) Models for recommending the ‘next-best-action’ based on the customer journey. In this session, Publicis will discuss the business benefits of dynamic orchestration, limitations of the classic customer journey models, and demonstrate how COSMOS MML models overcome these limitations.
     

    Productionizing Deep Reinforcement Learning with Spark and MLflow
    Deep Reinforcement Learning has driven exciting AI breakthroughs in the consumer space for years. But how can businesses harness this power for real-world applications? In this talk, learn how Zynga successfully uses RL to personalize games and increase engagement with over 70 million active users. They’ll discuss what works and what doesn’t work when applying cutting edge AI techniques to users, as well as the top lessons learned from productionizing Deep RL applications for millions of players per day using tools like Spark, MLflow and TensorFlow on top of the Databricks unified data analytics platform.
     

    Scaling Production Machine Learning Pipelines with Databricks
    Conde Nast offers over one hundred million users a solution called Spire for user segmentation. Spire consists of thousands of models, many of which require individual scheduling and optimization. From data preparation to model training to inference, they’ve built abstractions around the data flow, monitoring, orchestration, and other internal operations. In this talk, they’ll explore the complexities of building large-scale machine learning pipelines within Spire and discuss some of the solutions they’ve discovered using Databricks and MLflow.

    Media and Entertainment Industry Forum

    Join us on Wednesday, June 24, from 2:00 PM to 3:30 PM PT for an interactive Media and Entertainment Industry Forum at Spark + AI Summit. In this free virtual event, you will have the opportunity to network with your peers and participate in engaging panel discussions with leaders in the Media industry on how data and machine learning are driving innovation across the customer lifecycle.

    Panel Discussion: Data and AI in the Media and Entertainment Industry
    In this panel, hear industry experts speak on the growing importance of personalization, disintermediation, and direct-to-consumer trends in business, and what the next year of their roadmaps looks like in light of COVID-19. Panelists include:

    Dan Morris
    VP, Data Platform

    Stephen Layland
    Head of Data Engineering

    Eric Wasserman
    Sr. Architect

    Kevin Perko
    Head of Data Science, Applied Research

    Demos on Popular Data + AI Use Cases

    Join live demos on the hottest use cases in the media and entertainment industry.

    Quality of Service Analytics
    Streaming services have seen unprecedented demand as consumers turn to digital channels to consume news and entertainment. With streaming video and audio consumption reaching record highs, it is critical that content owners deliver flawless, high-quality service and product stability. Join this live demo to learn how to build a digital media quality of service analytics solution with the Databricks Unified Data Analytics Platform.

    Accelerating the Content Production Cycle with Machine Learning and AI
    Media companies, ad agencies and brands run thousands of promotional campaigns every year, producing an immense amount of image and engagement data whose relationships to one another are complex and fragmented. With Digital Asset Management (DAM) systems making images more accessible and easier to find, marketers are now faced with the more difficult problem of making sense of what images and content will stand out and be the most engaging amidst the ad clutter. In this interactive demo, you will learn how to overcome the challenges of accelerating campaign analysis and content production process with machine learning at scale.

    Sign-up for the Media and Entertainment Experience at Summit!

    To take advantage of the full Media and Entertainment experience at Spark + AI Summit, simply register for our free virtual conference and select Media and Entertainment Forum during the registration process. If you’re already registered for the conference, log into your registration account, edit “Additional Events” and check the forum you would like to attend.

    --

    Try Databricks for free. Get started today.

    The post Media and Entertainment Sessions You Don’t Want to Miss at Spark + AI Summit 2020 appeared first on Databricks.

    Financial Services Sessions You Don’t Want to Miss at Spark + AI Summit 2020


    Radical transformation is the theme of 2020, with customers demanding personalized products, improved protection against fraud, and digital experiences that match every small shift in behavior. Banks, insurance companies, and institutional investors are even more reliant on big data and AI to meet these demands and to outmaneuver the competition.

    For years, the Spark + AI Summit has been the premier meeting place for organizations looking to build AI applications at scale with leading open-source technologies such as Apache Spark™, Delta Lake and MLflow. In 2020, we’re continuing the tradition by taking the summit entirely virtual. Data scientists and engineers from anywhere in the world will be able to join June 22-26 to learn and share best practices for delivering the benefits of AI.

    This year’s Summit features a full agenda of talks from Financial Services industry leaders, including CapitalOne, VISA, Credit Suisse and Intuit, among others. As usual, attendees can also take part in our Financial Services Experience to meet with their peers, participate in engaging discussions with industry thought leaders and participate in interactive demos on the hottest data and AI use cases. The following is a summary of all the Financial Services content we have planned.

    Financial Services Tech Talks

    Here is an overview of some of our most highly anticipated financial services session talks at this year’s summit.
     

    Disrupting Risk Management through Emerging Technologies
    Optimally measuring risk is a critical function for many, but it’s especially so in financial services. Professionals need to understand the performance of products prior to investment in order to make strategic decisions. In this talk with CapitalOne, you’ll hear how senior members of their engineering team are leveraging technology to provide modelers, analysts, and key stakeholders with end-to-end analytic experiences that enable loss forecasting, gaming analysis, result comparison of model runs, intelligent insights and outputs, and the creation of new features.
     

    Cloud and Analytics—From Platforms to an Ecosystem
    Data science powers Zurich’s insurance business like a central nervous system: 70 data scientists work on everything from optimizing claims-handling processes to protecting against the next risk, to revamping the suite of data and analytics for customers. In this talk, you’ll hear exactly how they implemented Zurich’s scalable, self-service data science ecosystem to optimize and scale the activities in the project lifecycle, as well as how they streamline machine learning and predictive analytics efforts with Azure Data Lake and analytical tools.
     

    Using AI to Support Proliferating Merchant Changes
    At VISA, merchants are a core entity in their payments network. Millions of merchants are added to the ecosystem every month, a significant portion of which have created a new identity or changed attributes. In this talk, learn how they use AI, big data, and a suite of tools to look at merchant patterns over regular intervals, detect these changes and, more importantly, track them with accuracy to prevent incorrect offers and delays in queries.
     

    Using Machine Learning Algorithms to Construct All the Components of a Knowledge Graph
    Machine learning algorithms drive product delivery at Reonomy. In this talk, you’ll learn the ins and outs directly from their chief data scientist as she walks through examples of critical code designs, cluster configuration, and the algorithms used for successfully building the components of Reonomy’s knowledge graph. Takeaways include key points to consider when implementing production-quality models, as well as a logical framework for building knowledge graphs that are able to support a diverse set of property intelligence products.
     

    How Intuit uses Apache Spark to Monitor in-production Machine Learning Models at Large-scale
    In this presentation, Intuit will discuss their soon-to-be open source Model Monitoring Service (MMS). MMS is an in-house, Spark-based solution developed by Intuit AI to provide ongoing monitoring for both data and model metrics of in-production ML models. MMS aims to tackle multiple challenges of in-production ML model monitoring, including the integration of multiple data sources from different time ranges, and reusable and extendable metric and segmentation libraries.

    Financial Services Industry Forum

    Join us on Thursday, June 25, from 9:00 AM to 10:30 AM ET for an interactive Financial Services Forum at Spark + AI Summit. In this capstone event, you’ll have the opportunity to engage in interactive discussions with leaders in the Financial Services industry on how data and machine learning are driving innovation across the entire sector. Here’s a snapshot of the presenter lineup for the Financial Services Forum:

    How Credit Suisse is using data analytics and AI on Databricks to rapidly scale new product innovation
    Despite the increasing embrace of big data and machine learning, most financial services companies still experience significant challenges around data types, privacy, and scale. Credit Suisse is overcoming these obstacles and leading the way in employing data analytics and AI for risk management and client focus to drive business growth and operational efficiency. How? Credit Suisse has brought together a core set of partnerships, people, processes and technology—with Databricks as its unified analytics platform— that enables them to collaborate and scale new product innovation, rapidly decreasing the time-to-market it takes from initial business idea to commercially viable product.

    Nasdaq x AI, Dynamic Markets and the New Norm
    AI has become a key differentiator for modern enterprises today. Technology innovations that were once only available to the top 1% of companies are now quickly becoming democratized. At Nasdaq, they are deploying advanced analytics to serve multiple use cases —  from protecting financial markets to enabling new digital markets. Specifically, we will explore how Nasdaq leverages advanced data science techniques like graph processing and deep learning to see relationships between unstructured data (such as images and text) to feed models that will protect, transform and unlock new opportunities in Capital Markets.

    Demos on Popular Data + AI Use Cases in Financial Services

    Join live demos on the hottest use cases in the financial services industry including value at risk modeling, automating claims assessments with computer vision, credit risk analytics and more.

    Modernizing Risk Management Practices
    Traditional banks relying on on-premises infrastructure can no longer effectively manage risk. This demo highlights the value of an agile Modern Risk Management practice capable of rapidly responding to market and economic volatility. Using the value-at-risk use case, you will learn how Databricks is helping FSIs modernize their risk management practices, leveraging Delta Lake, Apache Spark and MLflow to adopt a more agile approach to risk management.

    How to Build Models that Move Quickly Through Validation and Audit
    Financial institutions and banks are increasingly using data and ML to drive competitive insights reliable enough for business to trust and act upon. In this demo focused on credit risk analytics, we show how a unified data analytics platform brings a more disciplined and structured approach to commercial data science, reducing the model lifecycle process from 12 months to a few weeks.

    Accelerating Claim Assessment Through Computer Vision
    With over 15,000 car accidents in the US every day (10 accidents every minute), automotive insurers recognize the need to improve operational efficiency through the use of AI. In this session, we demonstrate how Databricks helps insurance companies kickstart their AI/Computer Vision journey towards claim assessment and damage estimation.

    Financial Services Training

    Practical Problem-Solving in Finance: Real-Time Fraud Detection with Apache Spark
    In this half-day course, you’ll learn how Databricks and Spark can help solve real-world problems one faces when working with financial data. You’ll learn how to deal with dirty data and how to get started with Structured Streaming and Real-Time Fraud Detection. Students will also receive a longer take-home capstone exercise as bonus content to the class where they can apply all the concepts presented. This class is taught concurrently in Python and Scala.

    AI Disruption of Quantitative Finance. From Forecasting, to Probability Density Estimation, to Generative Models, and to Optimisation with Reinforcement Learning
    In this talk, Nima Nooshi, a customer success engineer at Databricks, will showcase an end-to-end asset management pipeline based on recent AI developments. He’ll walk through the steps to build an autonomous portfolio manager, discuss how a predictive AI component, such as a nonlinear-dynamic Boltzmann machine, can improve the learning of the agent, as well as the possibility of using a data generating component to learn the conditional distribution of asset prices. You’ll walk away with a clear picture of an end-to-end data pipeline and how different components of a complex model work together in a unified platform architecture.

    Sign-up for the Financial Services Experience at Summit!

    To take advantage of the full Financial Services Experience at Spark + AI Summit, simply register for our free virtual conference and select Financial Services Forum during the registration process. If you’re already registered for the conference, log into your registration account, edit “Additional Events” and check the forum you would like to attend.

    --

    Try Databricks for free. Get started today.

    The post Financial Services Sessions You Don’t Want to Miss at Spark + AI Summit 2020 appeared first on Databricks.

    A Guide to the MLflow Talk at Spark + AI Summit 2020


    It’s been two years since we originally launched MLflow, an open source platform for the full machine learning lifecycle, and we are thrilled and humbled by the adoption and impact it has gained in the data science and data engineering community. Now with over 2M monthly downloads, 200 code contributors and over 100 contributing organizations, MLflow is the fastest-growing and most widely used open source machine learning platform, confirming the need for an open source approach to managing the complete ML lifecycle.
     
    MLflow-focused talks, trainings, and tutorials featured at the Spark + AI Virtual Summit 2020.
    During a recent virtual conference focused on ML platforms, we provided an overview of how MLflow helps manage the ML lifecycle across a diverse set of use cases and industries, and we have much more coming at Spark + AI Summit. Below is a list of sessions, tutorials, and trainings on MLflow for you to dive into.

    MLflow Training

    Register for our MLflow Learning Path for a full-day course on MLflow. The curriculum includes MLflow: Managing the Machine Learning Lifecycle and Machine Learning Deployment: 3 Model Deployment Paradigms, Monitoring, and Alerting, where you will learn best practices for putting machine learning models into production.

    Keynote

    Join Matei Zaharia on Thursday, June 25th for his keynote on Simplifying Model Development and Management with MLflow to learn more about some of the most recent and new MLflow features. Specifically, he will cover what’s new in MLflow to further streamline the ML lifecycle with simplified experiment tracking, model management, and model deployment with the new MLflow Model Registry. Many organizations face challenges tracking which models are available in the organization and which ones are in production. The MLflow Model Registry provides a centralized database to keep track of these models, share and describe new model versions, and deploy the latest version of a model through APIs.
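
    If you haven’t tried the Model Registry yet, the minimal sketch below shows the general flow of registering a model and promoting a version with the MLflow APIs; the run ID, model name, and version number are placeholders, not part of the keynote.

    import mlflow
    from mlflow.tracking import MlflowClient

    # Register a model that was logged in an earlier tracking run (placeholder run ID)
    model_uri = "runs:/<run-id>/model"
    mlflow.register_model(model_uri, "churn_classifier")

    # Promote a specific version through the registry stages
    client = MlflowClient()
    client.transition_model_version_stage(name="churn_classifier", version=1, stage="Production")

    # Load the latest Production version for batch scoring or serving
    model = mlflow.pyfunc.load_model("models:/churn_classifier/Production")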

    Talks

    We have a fantastic lineup of speakers and sessions throughout the conference on MLflow. Join experts from Accenture, ExxonMobil, Zynga, Atlassian, Databricks and more for real-life examples and deep dives on MLflow (in chronological order):

    Free Tutorial

    Last but not least, you can join Using MLflow for end-to-end machine learning on Databricks for a free 80-minute tutorial presented by Sean Owen of Databricks. In this session, we’ll take a look at a simple example where health data can be used to predict life expectancy. It will start with data engineering in Apache Spark™, data exploration, model tuning and logging with hyperopt and MLflow. It will continue with examples of how the model registry governs model promotion, and simple deployment to production with MLflow as a job or dashboard.
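
    For a feel of the tuning-and-tracking step before the tutorial, here is a minimal sketch that pairs Hyperopt with MLflow tracking on a toy objective; the search space and loss function are stand-ins for the actual health-data model used in the session.

    import mlflow
    from hyperopt import fmin, tpe, hp, Trials

    def objective(params):
        # Stand-in for training a model on the health data and returning a validation loss
        return (params["alpha"] - 0.3) ** 2

    search_space = {"alpha": hp.uniform("alpha", 0.0, 1.0)}

    with mlflow.start_run(run_name="hyperopt_tuning"):
        trials = Trials()
        best = fmin(fn=objective, space=search_space,
                    algo=tpe.suggest, max_evals=20, trials=trials)
        mlflow.log_params(best)                           # best hyperparameters found
        mlflow.log_metric("best_loss", min(trials.losses()))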

    Next Steps

    You can also browse through all the sessions in the Spark + AI Summit 2020 schedule.

    To get started with open source MLflow, follow the instructions at mlflow.org or check out the release code on GitHub. We are excited to hear your feedback!

    If you’re an existing Databricks user, you can start using managed MLflow on Databricks by importing the Quick Start Notebook for Azure Databricks or AWS. If you’re not yet a Databricks user, visit databricks.com/mlflow to learn more and start a free trial of Databricks and managed MLflow.

    Related Blogs:

    --

    Try Databricks for free. Get started today.

    The post A Guide to the MLflow Talk at Spark + AI Summit 2020 appeared first on Databricks.

    Enterprise Cloud Service Public Preview on AWS


    At Databricks, we have had the opportunity to collaborate with companies that have transformed the way people live. Some of our customers have developed life saving drugs, delivered industry-first user experiences, as well as provided edge-of-the-seat entertainment (so needed during shelter in place). These companies transformed their business by building efficiencies in how they operate, delivering delightful customer experiences, and innovating new products and features; all by building a data practice that enables all their engineers, scientists & analysts with the data they need to deliver positive business outcomes.

    Our experience shows us that at the core of such a data-driven enterprise is a data platform that can support all of its data projects and users globally. So far, few companies have had the resources and expertise to build such platforms & capabilities, and those that have done so have dominated their market segments. There are several challenges to building such a platform for broad use – data security & governance at the top, followed by operational simplicity and scalability. The challenge lies in enabling all engineers, scientists & analysts with all the data, while ensuring sensitive data is kept confidential and protected from exfiltration.

    We encountered some of these challenges first hand while deploying the current generally available version of our platform at scale. We identified a set of architectural changes, security and manageability controls that would form the foundation of the next version of our platform and would significantly enhance the simplicity, scalability and security capabilities.

    We are excited to announce the public preview of our Enterprise Cloud Service on AWS. The Enterprise Cloud Service is a simple, scalable and secure data platform delivered as a service that is built to support all data personas for all use cases, globally and at scale. It is built with strong security controls required by regulated enterprises, is API driven so that it can be fully automated by integrating into enterprise specific workflows, and is built for production and business critical operations.

    In this article, we will share major features and capabilities that a data team could utilize to massively scale their Databricks footprint on AWS while complying with their enterprise governance policies. The platform is already generally available for Azure Databricks, though some of the aspects mentioned below are new for that product as well.

    Reference deployment architecture for the Databricks Enterprise Cloud Service now available in preview for AWS

    Enterprise Security

    Security that Unblocks the True Potential of Your Data Lake

    Learn how Databricks helps address the challenges that come with securing a cloud-native data analytics platform.

    Customer-managed VPC

    Deploy Databricks data plane in your own enterprise-managed VPC, in order to do necessary customizations as required by your cloud engineering & security teams. This feature is in public preview.

    Secure Cluster Connectivity

    Databricks establishes secure connectivity between the scalable control plane and the clusters in your private VPC. We don’t need a single Public IP in your cluster infrastructure to interact with the control plane. This feature is in public preview.

    Customer-managed Keys for Notebooks

    Databricks stores customer notebooks in the scalable control plane so as to provide a slick and fast user experience via the web interface. You can now choose to use your own AWS KMS key to encrypt those notebooks. This feature is in private preview.

    IAM Credential Passthrough

    Access S3 buckets and other IAM-enabled AWS data services using the identity that you use to log in to Databricks, either with SAML 2.0 Federation or SCIM. This feature is in public preview.

    Simple Administration

    Manage a Cloud-scale Enterprise Data Platform with Ease

    Deliver cloud-native data environments for your global analytics teams while retaining the visibility, control and scale from a single pane of glass.

    Multiple Workspaces at Global Scale

    Deploy multiple workspaces in a single VPC, or across multiple VPCs in a single AWS account, or across multiple AWS accounts, all mapping to the same Databricks account. This feature is in public preview.

    Trust But Verify with Databricks

    Get visibility into relevant cloud platform activity in terms of who’s doing what and when, by configuring Databricks Audit Logs and other related audit logs in the AWS Cloud. See how you could process the Databricks Audit Logs for continuous monitoring.
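
    As one hedged illustration, the sketch below reads delivered audit-log JSON with Spark and summarizes activity by service and action; the delivery path is hypothetical and the column names (serviceName, actionName) are assumptions based on the audit-log delivery format, so check the documentation for the exact schema.

    # Assumes audit logs are delivered as JSON files to an S3 prefix you configured (hypothetical path)
    audit = spark.read.json("s3://my-audit-bucket/audit-logs/")

    activity = (audit
        .groupBy("serviceName", "actionName")   # column names are assumptions; verify against your logs
        .count()
        .orderBy("count", ascending=False))

    activity.show(20, truncate=False)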

    Cluster Policies

    Implement cluster policies across multiple workspaces to make the cluster creation interface relevant for different data personas, and to enforce different security and cost controls. This feature is in public preview.

    Production Ready

    Productionize and Automate your Data Platform at Scale

    Create fully configured data environments and bootstrap them with users/groups, cluster policies, clusters, notebooks, object permissions, etc., all through APIs.

    Create Workspace using Multi-Workspace API

    We take an API-first approach to building any new feature. The Multi-Workspace API allows you to automate the provisioning of a workspace, and other APIs then allow you to bootstrap it to your needs. If you use Terraform, you can also use the Databricks Terraform Resource Provider to bootstrap and operate a workspace.
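
    As a rough sketch of that automation, the example below provisions a workspace by calling the multi-workspace (Accounts) API with the requests library; the endpoint path, authentication scheme, and field names are assumptions for illustration only, so consult the Multi-Workspace API documentation for the exact contract.

    import requests

    # All identifiers below are placeholders; the endpoint and payload shape are assumptions.
    ACCOUNT_ID = "<databricks-account-id>"
    BASE = f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}"

    payload = {
        "workspace_name": "analytics-prod",
        "aws_region": "us-west-2",
        "credentials_id": "<credentials-id>",
        "storage_configuration_id": "<storage-configuration-id>",
        "network_id": "<network-id>",   # registration of a customer-managed VPC
    }

    resp = requests.post(f"{BASE}/workspaces",
                         auth=("<account-admin-email>", "<password>"),
                         json=payload)
    resp.raise_for_status()
    print(resp.json())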

    CI/CD for your Data Workloads

    Streamline your application development and deployment process with integration to DevOps tools like Jenkins, Azure DevOps, CircleCI etc. Use REST API 2.0 under the hood to deploy your application artifacts and provision workspace-level objects.

    Databricks Pools

    Enable clusters to start and scale faster by creating a managed cache of virtual machine instances that can be acquired for use when needed. This feature is in public preview.

    What’s Next?

    Attend the Enterprise Cloud Service webinar to learn more about the capabilities mentioned above and see how we put them into action. If you want to take part in the public preview, please reach out to your Databricks account team or use this form to contact us.

    --

    Try Databricks for free. Get started today.

    The post Enterprise Cloud Service Public Preview on AWS appeared first on Databricks.

    Accelerating Somatic Variant Calling with the Databricks TNSeq Pipeline


    Genetic analyses are a critical tool in revolutionizing how we treat cancer. By understanding the mutations present in tumor cells, researchers can gain clues that lead to drug targets and eventually new therapies. At the same time, genetic characterizations of individual tumors enable physicians to tailor treatments to individual patients and improve outcomes while reducing side effects.

    However, the full promise of genetic data for cancer therapeutic development has not been realized. There is a lack of robust, scalable, and standardized approaches for processing and analyzing cancer genome sequencing data. In a typical research setting, each scientist picks their own set of algorithms and stitches them together with custom glue code. This creates analytics workflows that are hard to manage, scale or reproduce. Moving from ad hoc analytics to robust, reproducible and well-engineered genomics data pipelines is critical to taking cancer research and treatment to the next level.

    Identifying Genetic Variants Responsible for Cancer

    The first step in this process is identifying the genetic variations in tumor cells compared to the non-tumor cells in an individual. This process is called somatic variant calling.

    Databricks TNSeq pipeline architecture, enabling oncology teams to build and scale rapid data analysis pipelines to support critical cancer research

    In industry, most researchers have standardized these data analytics workflows on the GATK best practices pipeline for somatic variant calling, known as MuTect2. While MuTect2 offers high accuracy, it can take hours to days to run, and the mutations are output in a textual file format that is cumbersome for data scientists to analyze. Combining the mutation calls with clinical or imaging data requires additional complex integrations of distinct systems, slowing down the process of generating clinical reports or population-scale analyses of cancer mutations.

    Fortunately, there’s a path forward with the Databricks Unified Data Analytics Platform for Genomics. More specifically, to address the problems outlined above, we are excited to announce our TNSeq pipeline (AWS | Azure), which builds on top of our DNASeq pipeline. The TNSeq pipeline enables oncology teams to build and scale rapid data analysis pipelines that flow directly into downstream tertiary analyses for critical cancer research. Initially, the sequenced DNA from the tumor and germline samples is processed equivalently to our DNASeq pipeline: the reads are mapped to a reference genome and then common sequencing errors (like PCR duplicates or biased base quality scores) are corrected. Once aligned and preprocessed, somatic mutations are identified by pooling the tumor and germline reads together and looking for genomic locations where different alleles are seen between the tumor and germline data. Ultimately, our pipeline reduces pipeline latency by 6x and total cost by 20%, while producing equivalent somatic variant calls. We output our mutation calls directly into Delta Lake tables, formatted using the Glow schemas. This allows pipelines to feed directly into reports, annotation pipelines (AWS | Azure), and statistical genetics analyses.

    By building on top of the Databricks Unified Data Analytics Platform for Genomics, our open source project Glow, and existing single node tools, our pipeline allows researchers and clinicians to seamlessly blend genomic data engineering and data science pipelines, while reducing pipeline latency, computational cost, and infrastructure complexity. By unifying cancer genomic data with both machine learning techniques and clinical/imaging data, Databricks customers are identifying genes that drive cancer progression, developing more sensitive algorithms for the early detection of cancer, and building the next generation of clinical reports that blend genomics with imaging and other patient data to provide clinicians with a full portrait of the patient’s cancer status.

    Accelerating Somatic Variant Calling in the Genomics Runtime

    Our pipeline uses Apache Spark™ and Glow to parallelize BWA-MEM for alignment and GATK’s MuTect2 tool for variant calling. Alignment is embarrassingly parallel over reads, so we can map each read fragment individually for the tumor and normal samples. We then use Spark to group all reads that are relevant for a given region of the genome into the same partition, duplicating reads across partitions as necessary. This technique allows our pipeline to produce concordant results with the single node MuTect2 tool while still achieving parallelism.

    Accelerating somatic variant calling using Apache Spark and Glow to parallelize BWA-MEM.

    The pipeline operates as a Databricks job, so users can trigger new runs using the UI or programmatically with the Databricks CLI. In addition to producing standard output files like a BAM for aligned reads and a VCF for called variants, our pipeline writes results to a Delta Lake table. This format simplifies organization of thousands of cancer samples and allows for scalable analysis with Glow using the built-in regression tests or by integrating with single node tools, such as Samtools, using the Pipe Transformer.
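
    Reading those results back for downstream analysis is then straightforward; in the sketch below, the Delta table path is hypothetical, and the contigName column assumes the Glow variant schema.

    # Hypothetical output location written by the pipeline job
    calls = spark.read.format("delta").load("dbfs:/genomics/tnseq/variant_calls")

    # Example: count somatic calls per chromosome (contigName assumes the Glow variant schema)
    calls.groupBy("contigName").count().orderBy("contigName").show()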

    Evaluating our Somatic Variant Calling Pipeline

    We benchmarked the accuracy and performance of our pipeline using whole exome sequencing data from the Texas Cancer Research Biobank. The normal sample was sequenced at an average coverage of 95x resulting in 6GB of bzip compressed FASTQ files, while the tumor sample was sequenced at an average coverage of 99x for 6.4GB of bzip compressed FASTQ files.

    Accuracy

    We compared the end-to-end results of our pipeline against command line BWA-MEM and MuTect2. We used som.py to produce concordance metrics.

    Variant type Precision Recall
    SNV 0.9998 0.9994
    indel 0.9963 0.9968
    all 0.9977 0.9978

    Since somatic variant calling tools like MuTect2 emphasize sensitivity to variations that may be supported by few reads, these tools are highly sensitive to slight variations in the alignments produced by BWA-MEM. However, BWA-MEM uses the index of a read within a batch to choose between equally good alignments. Because the index is not stable in a distributed setting, our distributed version of BWA-MEM can report different, although equally likely, alignments than the command line version.

    To verify that all discrepancies between the two variant callsets derive from randomness during alignment, we also ran command line MuTect2 against the aligned BAM files produced by our pipeline. These results were identical to the variant calls produced by the Databricks pipeline.

    Variant type Precision Recall
    SNV 1 1
    indel 1 1
    all 1 1

    Alignments produced by our pipeline differ from command line BWA-MEM because of nondeterminism in the underlying tool. Forcing our pipeline to run alignment for all reads in a single Spark partition produced identical alignments to command line BWA-MEM.

    Performance

    To evaluate the efficiency of our pipeline against standalone command line tools, we compared the runtime on one c5.9xlarge instance against command line MuTect2 and BWA-MEM. Since there is limited parallelism available in the single-node MuTect2 pipeline, we ran it on an i3.2xlarge instance with 8 cores. This analysis excludes cluster initialization time, which is amortized across all samples run in a single batch.

    Pipeline Runtime (minutes) Cores Core-hours
    Command line 259.2 8 32.4
    Databricks 54.27 36 32.56

    The speedup primarily derives from the limited multithreading capabilities in command line MuTect2. By efficiently utilizing cluster resources, our pipeline matches the per-core performance of single node tools while significantly reducing overall runtime.

    However, unlike the open-source GATK somatic variant calling pipeline, the Databricks TNSeq pipeline is designed to scale across multiple nodes. To demonstrate the scalability of our pipeline, we ran an experiment where we added additional worker nodes. In all experiments, each worker node used an AWS c5.9xlarge instance with 36 cores and 72GB of memory.

    Sample visualization, demonstrating the performance gains provided by the Databricks Somatic Variant Calling Pipeline.

    The runtime scaled nearly linearly with the number of workers in the cluster, although beyond six workers the benefit begins to decrease. The point of diminishing returns depends on the total data size. For this dataset, the total runtime of the pipeline with six workers was only 9.95 minutes.

    Try it!

    The TNSeq pipeline offers industry-leading latency and allows mutational data to flow directly from a pipeline into advanced ML and population-scale analyses. This pipeline is available in the Databricks Genomics Runtime (Azure | AWS) and is generally available for all Databricks users. Learn more about our genomics solutions in the Databricks Unified Analytics Platform for Genomics and try out a preview today.

    --

    Try Databricks for free. Get started today.

    The post Accelerating Somatic Variant Calling with the Databricks TNSeq Pipeline appeared first on Databricks.


    Simplify Data Conversion from Apache Spark to TensorFlow and PyTorch


    Petastorm is a popular open-source library from Uber that enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. We are excited to announce that Petastorm 0.9.0 supports the easy conversion of data from Apache Spark DataFrame to TensorFlow Dataset and PyTorch DataLoader. The new Spark Dataset Converter API makes it easier to do distributed model training and inference on massive data, from multiple data sources. The Spark Dataset Converter API was contributed by Xiangrui Meng, Weichen Xu, and Liang Zhang (Databricks), in collaboration with Yevgeni Litvin and Travis Addair (Uber).

    Why is data conversion for Deep Learning hard?

    A key step in any deep learning pipeline is converting data to the input format of the DL framework. Apache Spark is the most popular big data framework. The data conversion process from Apache Spark to deep learning frameworks can be tedious. For example, to convert an Apache Spark DataFrame with a feature column and a label column to a TensorFlow Dataset file format, users need to either save the Apache Spark DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling columns in the Spark DataFrames. Those engineering frictions hinder the data scientists’ productivity.

    Solution at a glance

    Databricks contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious data conversion process. With the new API, it takes only a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.

    from petastorm.spark import SparkDatasetConverter, make_spark_converter
    
    # Specify the cache directory
    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///dbfs/tmp/…')
    
    df = spark.read...
    
    converter = make_spark_converter(df)    # create the converter
    
    with converter.make_tf_dataset() as dataset:    # convert to TensorFlow Dataset
            # Training or inference code with dataset 
            ...
    with converter.make_torch_dataloader() as dataloader:    # convert to PyTorch DataLoader
            # Training or inference code with dataloader 
            ...
    

    What does the Spark Dataset Converter do?

    The Spark Dataset Converter API provides the following features:

    • Cache management. The Converter caches the Spark DataFrame in a distributed filesystem and deletes the cached files on a best-effort basis when the interpreter exits. An explicit deletion API is also provided.
    • Rich parameters to customize the output dataset. Users can customize and control the output dataset by setting batch_size, workers_count and prefetch to achieve the best I/O performance.
    • Transform function defined on pandas dataframe. Many deep learning datasets include images, audio or video bytes, which can be loaded into Spark DataFrames as binary columns. These binary columns need decoding before feeding into deep learning models. The Converter exposes a hook for transform functions to specify the decoding logic. The transform function will take as input the pandas dataframe converted from the Spark DataFrame, and must return a pandas dataframe with the decoded data.
    • MLlib vector handling. Besides primitive data types, the Converter supports Spark MLlib Vector types by automatically converting them to array columns before caching the Spark DataFrame. You can also reshape 1D arrays to multi-dimensional arrays in the transform function.
    • Remote data loading. The Converter can be pickled to a Spark worker and used to create TensorFlow Dataset or PyTorch DataLoader on the worker. You can specify whether to read a specific shard or the whole dataset in the parameters.
    • Easy migration from single-node to distributed computing. Migrating your single-node inference code to distributed inference requires no code change in data handling; it just works on Spark. For distributed training, you only need to add two parameters to the API that indicate the shard index and the total number of shards (see the sketch after this list). In our end-to-end example notebooks, we illustrated how to migrate single-node code to distributed inference and distributed training with Horovod.
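
    To make that last point concrete, here is a minimal sketch of sharding the converted dataset across Horovod workers; everything except the cur_shard and shard_count arguments mirrors the earlier snippet, and the training loop itself is elided.

    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Each Horovod worker reads only its own shard of the cached dataset
    with converter.make_tf_dataset(cur_shard=hvd.rank(),
                                   shard_count=hvd.size(),
                                   batch_size=64) as dataset:
        # ... build the Keras model and call model.fit(dataset, ...) here ...
        pass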

    Check out the links in the Resources section for more details.

    Getting Started

    Try out the end-to-end example notebooks linked below and in the Resources section on Databricks Runtime for Machine Learning 7.0 Beta with all the requirements installed.
     
    AWS Notebooks

    Azure Notebooks

    Acknowledgements

    Thanks to Petastorm authors Yevgeni Litvin and Travis Addair from Uber for the detailed reviews and discussions to enable this feature!

    Resources

    Databricks documentation with end-to-end examples ( AWS | Azure )
    Petastorm GitHub Homepage
    Petastorm SparkDatasetConverter API documentation

    --

    Try Databricks for free. Get started today.

    The post Simplify Data Conversion from Apache Spark to TensorFlow and PyTorch appeared first on Databricks.

    Healthcare and Life Sciences Sessions You Don’t Want to Miss at Spark + AI Summit 2020


    The healthcare industry is in a rapid state of change. The COVID-19 pandemic has shined a light on how critical it is for healthcare payers, providers, pharmaceutical companies and government agencies to have clean, aggregated patient data at population scale, along with technologies that allow them to use AI to interrogate this data. Answering questions like how to identify high-risk patients, predict hospital bed usage, or track disease spread has been road-blocked by data and analytics challenges. Today, more than ever before, healthcare data analytics and AI are needed to improve how we deliver care, drive better outcomes, and effectively manage through turbulent times.

    For years, the Spark + AI Summit has been the premier meeting place for organizations looking to build data analytics and AI applications at scale with leading open-source technologies such as Apache Spark™, Delta Lake and MLflow. In 2020, we’re continuing the tradition by taking the summit entirely virtual. Data scientists and engineers from anywhere in the world will join us June 22-26, 2020 to learn and share best practices for delivering the benefits of AI.

    This year we have a robust experience for data teams in the Healthcare and Life Sciences Industry looking to apply these technologies to industry challenges. Join thousands of your peers to explore how the latest innovations in data and AI are improving how we treat patients and develop life-saving therapeutics. Register for Spark + AI Summit to take advantage of all the healthcare and life sciences sessions and events.

    Healthcare and Life Sciences Tech Talks

    Here is an overview of some of our most highly anticipated healthcare and life sciences session talks at this year’s summit:
     

    Rapid Response to Hospital Operations using Data and AI during COVID-19
    With COVID-19 becoming an unprecedented worldwide medical crisis, the need for an intelligent, real-time system of insights is critical to providing better healthcare. KenSci recently launched a Realtime Command Center for COVID-19 Response to support their customers during these challenging times. During this session, you’ll learn how Indiana University Health leveraged the cloud and data to build self-service offerings for COVID-19 response, what Data and AI teams can do to better respond to future pandemics, and KenSci’s learning and experience with deployment of the Realtime COVID-19 Command Center at multiple large healthcare systems.
     

    How Azure and Databricks Enabled a Personalized Experience for Customers and Patients at CVS Health
    In 2018, CVS Health embarked on a journey to personalize the customer and patient experience through machine learning on a Microsoft Azure Databricks platform. In this talk, they’ll discuss how the Microsoft Azure Databricks environment enabled rapid, in-market deployment of the first ML model within six months on billions of transactions using Apache Spark. They’ll also run through several use cases for how this has driven and delivered immediate value for the business, including test and learn experimentation for how to best personalize content.
     

    Enabling Scalable Data Science Pipelines with MLflow at Thermo Fisher Scientific
    When you have vast amounts of health data, each component of your data science ecosystem, from data engineering, to model development, to delivery, has to be scalable. With that in mind, Thermo Fisher partnered up with Databricks to build an end-to-end data science pipeline with CI/CD standards, further augmenting their capabilities through the use of the latest technologies such as MLflow, Spark ML, and Delta Lake. This platform gives them an unprecedented view into the lifecycle of a Thermo Fisher customer. In this session, they’ll summarize their journey from past to current state, as well as give you a peek into what the future of their platform looks like and how it is improving the experiences of Thermo Fisher customers.
     

    All In: Migrating a Genomics Pipeline from BASH/Hive to Spark and Azure Databricks—A Real-World Case Study
    Atrium Health uses Azure Databricks to manage the precision medicine test results and clinical trials matching across their oncology patients. Migrating to Databricks was critical for Atrium Health due to the time-sensitivity of oncology data and the organization’s commitment to personalized treatment for oncology patients. This presentation will detail some of the challenges with their previous environment, why they chose Apache Spark and Databricks, migration plans and lessons learned, new technology used after the migration (Data Factory/Databricks, PowerApp/Power Automate/Logic App, Power BI), and how the business has been impacted post-migration.
     

    Improving Therapeutic Development at Biogen with UK Biobank Data
    Turning petabytes of genomics data into actionable links between genotype and phenotype is crucial, but out of reach for companies using legacy technologies. In this talk, Biogen will describe how they collaborated with DNAnexus and Databricks to move their on-premises data infrastructure into the AWS cloud. By combining the DNAnexus platform with the Databricks Genomics Runtime, Biogen was able to use the UK Biobank dataset to identify genes containing protein-truncating variants that impact human longevity and neurological status.

    You can see the full list of talks on our Healthcare and Life Sciences Industry page.

    Healthcare and Life Sciences Industry Forum

    Join us on Thursday, June 25, 11:30am-1:00pm PST for an interactive Healthcare and Life Sciences Forum at Spark + AI Summit. In this free virtual event, you will have the opportunity to network with your peers and participate in engaging panel discussions on how data and machine learning are driving innovation in patient care.

    Panel Discussion: Data and AI in the Healthcare and Life Sciences Industry

    In this panel, hear industry experts speak on how healthcare organizations are collaborating on large-scale population health datasets to improve patient outcomes and accelerate pharmaceutical research, including lessons learned along the way and visions for the future. Julie Yoo, General Partner and Healthcare Lead at Andreessen Horowitz, will give a keynote on industry trends prior to the panel. Session speakers include:

    Dr. Binu Mathew
    VP, Medical Intelligence and Analytics

    Sanji Fernando
    SVP, AI & Analytics Platforms

    Joanne Hackett, PhD
    General Partner, Healthcare; Former Chief Commercial Officer of Genomics England

    Dr. Jeff Reid
    VP, Genome Informatics & Data Engineering

    Julie Yoo
    General Partner, Healthcare Lead

    Fireside Chat: Building a Modern Unified Data Analytics Architecture for Real-time COVID Response at the Medical University of South Carolina

    Join this interactive fireside chat with Matt Turner, Chief Data Officer of MUSC, to learn how they built a modern unified data analytics architecture that enables their teams to unlock insights buried within their clinical data and build powerful predictive models. More specifically, you’ll learn how this strategy prepared MUSC to quickly respond to the dynamic environment of COVID-19.

    Demos on Popular Data + AI Use Cases in Healthcare and Life Sciences

    Join us for live demos on the hottest data analysis use cases in the healthcare and life sciences industry:

    Real-time Predictive Analytics for Modern Biopharmaceutical Manufacturing
    While advanced technologies and AI continue to transform drug discovery, biopharmaceutical companies are looking to digitize the post-discovery landscape to accelerate drug development & manufacturing. This demo by Fluxa, a provider of cutting-edge software and services for the biopharmaceutical industry, will examine how to apply advanced technologies like Databricks to enable real-time predictive analytics on bioprocess data during clinical and commercial manufacturing.

    Sepsis Prediction: A Unified Workflow from EMR Data to ML Model Design
    Learn how Prominence Advisors builds end-to-end machine learning workflows from raw EMR data through ML model design on the Databricks Unified Data Analytics Platform. We’ll demonstrate how the Databricks stack facilitates the data extraction, data prep, data exploration and data science that carry us all the way from source systems to a sepsis prediction model.

    The Data Lakehouse: The Key to Success for Alternative Payment Models
    Despite the promise of value-based care, improving the design of these contracts within the healthcare industry has been slow. One of the key reasons is that payers and healthcare systems lack a unified data and technology platform that promotes the collaboration required to succeed under new payment models. In this demo, Tensile AI will discuss the limitations of traditional Data Warehousing and Data Lake implementations and demonstrate how the new Data Lakehouse paradigm can help those engaged in value-based analytics by enabling rapid ingestion of new data sources and improving the speed and governance around deployment of the new data into existing analyses and models.

    Sign-up for the Healthcare and Life Sciences Experience at Summit!

    To take advantage of the full Healthcare and Life Sciences Experience at Spark + AI Summit, simply register for our free virtual conference and select Healthcare and Life Sciences Forum during the registration process. If you’re already registered for the conference, log into your registration account, edit “Additional Events” and check the forum you would like to attend.

    --

    Try Databricks for free. Get started today.

    The post Healthcare and Life Sciences Sessions You Don’t Want to Miss at Spark + AI Summit 2020 appeared first on Databricks.

    Retail and Consumer Goods Sessions You Don’t Want to Miss at Spark + AI Summit 2020

The current economic environment is having a significant impact on the Retail and Consumer Goods sector. Rapid changes in how consumers shop are forcing companies to rethink their sales, marketing, and supply chain strategies. Companies can still reduce costs and win market share to drive stronger growth, but this requires new ways of understanding and acting on the consumer. Through the use of big data and AI, retailers and consumer goods companies can refocus their efforts on areas that will rapidly deliver value and drive growth into the future.

For years, the Spark + AI Summit has been the premier meeting place for organizations looking to build AI applications at scale with leading open-source technologies such as Apache Spark™, Delta Lake and MLflow. In 2020, we’re continuing the tradition by taking the summit entirely virtual. Data scientists and engineers from anywhere in the world will be able to join June 22-26, 2020 to learn and share best practices for delivering the benefits of AI.

    This year we have a robust experience for data teams in the Retail and Consumer Packaged Goods Industry. Join thousands of your peers to explore how the latest innovations in data and AI are providing new ways to optimize supply chains and connect more deeply with the modern shopper. Register for Spark + AI Summit to take advantage of all the retail analytics sessions and events.

    Retail and Consumer Goods Tech Talks

    Here is an overview of some of our most highly anticipated Retail and CPG talks at this year’s summit:

    Starbucks

    Keynote: How Starbucks is Achieving its ‘Enterprise Data Mission’ to Enable Data and ML at Scale and Provide World-Class Customer Experiences
    A key aspect to ensuring the excellent customer experiences Starbucks is known for is data. Tremendous amounts of data. This keynote highlights how the company makes decisions powered by data at scale, including processing customer data on a petabyte level with governed processes, deploying platforms at the speed-of-business, and enabling ML across the enterprise. Join to learn the ins and outs of building a world-class enterprise data platform to drive world-class CX.

    Columbia

    Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with Delta Lake
    Today, Columbia seamlessly integrates data from all line-of-business-systems to manage its wholesale and retail businesses — but it wasn’t always so. In this presentation, hear how they achieved a 70% reduction in pipeline creation time and reduced ETL workload times from four hours to just minutes when they made the switch from multiple legacy data warehouses to Azure Databricks and Delta Lake, enabling a serious boost in efficiency and near real-time analytics.

    Walmart

    Building Identity Graphs over Heterogeneous Data
    Customers and service providers interact in a variety of modes and channels, making a unified identity view an often complex challenge. Since every interaction or transaction event contains some form of identity, a highly scalable platform is required to identify and link the identities belonging to a single user as a connected component. Walmart solved this problem by building its Identity Graph platform using the Spark processing engine. Join this talk to hear how they were able to create a solution that handles 25+ billion vertices and 30+ billion edges and an incremental 200M new linkages every day.

    Mars

    Building the Petcare Data Platform using Delta Lake and ‘Kyte’: Our Spark ETL Pipeline
Mars Petcare is focused on building the Petcare Data Platform of the future to support analytics and customer insights for its many petcare brands such as Iams, Whiskas, Pedigree and Nutro. In this talk, learn how Mars leveraged Spark and Databricks to build ‘Kyte’, a bespoke pipeline tool that massively accelerated their ability to ingest, cleanse, and process new data sources. Hear more about why they chose a Spark-heavy ETL design and a Delta Lake-driven platform, and why they’re committing to Spark and Delta Lake as the core of their platform to support their mission: Making a Better World for Pets!

    iFood

    Building a Real-Time Feature Store at iFood
iFood is the largest food-tech company in Latin America. In order to maintain their top-tier status, they’ve built several machine learning models to provide accurate answers to questions such as how long an order will take to be completed, which restaurants and dishes to recommend to a consumer, and whether a form of payment is fraudulent, among others. To generate the training datasets for those models, and to serve features in real time so predictions can be made correctly, it’s necessary to create efficient, distributed data processing pipelines. In this talk, learn how iFood uses Databricks and Spark Structured Streaming to process event streams, store them in a historical Delta Lake table and a Redis low-latency access cluster, and how they structure their development processes.

    You can see the full list of talks on our Retail and Consumer Packaged Goods summit page.

    Retail and Consumer Goods Forum

    Join us on Thursday, June 25 at 11:30am-1:00pm PST for an interactive Retail and CPG Forum at Spark + AI Summit. In this free virtual event, you’ll have the opportunity to network with your peers and participate in engaging panel discussions with industry leaders on how data and machine learning are driving innovation across the entire retail value chain. Panelists include:

    Ojas Nivsarkar
    Sr. Director, Enterprise Data & CRM

    Brad Kent
    VP, Analytics & Insights

    Saritha Ivaturi
    Director of Data Systems

    Demos on Popular Data + AI Use Cases in Retail and CPG

    Join us at Summit for live demos on the hottest use cases in the retail and consumer goods industry:

    Demand Forecast
    The computational limitations that forced companies to compromise their demand forecasting are a thing of the past. In this demonstration, we’ll show you how to take advantage of the elastically-scalable patterns used by many companies to generate timely forecasts at levels of granularity that were out of reach in years past.

    Safety Stock Analysis
A key application of demand forecasts is the calculation of the buffer inventory (aka the safety stock) required to ensure customer demand for goods is immediately met. This aspect of inventory management has become even more critical as traditional businesses pivot to curbside fulfillment and at-home delivery in the wake of the COVID crisis, with customer loyalty shifting to those retailers best able to deliver on the promises made through their online applications. In this demonstration, we’ll examine how a common substitution made in the safety stock calculation puts demand fulfillment at risk.

    Customer Lifetime Value
Maintaining a healthy, profitable relationship with customers requires an understanding of their individual revenue potential. Customer Lifetime Value (CLV) is a popular metric for capturing this potential, but in non-subscription retail models, determining probable future spend and variable retention rates is highly difficult. In this demonstration, we will introduce the Buy ’Til You Die (BTYD) models as a means to overcome these challenges and provide reliable CLV estimates.

    Customer Segmentation
Not every customer has the same potential for revenue and profitability. In recognition of this, many organizations tier their customers based on transactional performance and align their promotional activities with these segments based on their differing potential for return. In this demonstration, we will take a look at a tried-and-true technique for this known as RFM (recency, frequency, monetary value) segmentation.

    Retail and Consumer Goods Training

Don’t miss this opportunity to sharpen your technical skills with these Retail and CPG focused training sessions at Summit:

    Building Time Series Forecasting Models using Neural Network and Statistical Models
Walmart Labs

Wrangling, analyzing, and systematically modeling time-series data for forecasting requires a unique set of techniques due to the temporal dependence inherent in time series. In this training, Walmart will discuss some of the most fundamental concepts and techniques to build and deploy time-series forecasting. Join to learn the key characteristics of time-series data, statistics for summarizing time series, graphical techniques to describe the characteristics of time series, and the essential concepts and techniques required to appropriately apply autoregressive-type and neural network models in practice.

    Practical Problem-solving in Retail: Real-time Data Analytics with Apache Spark
    Databricks Training

    In this half-day course, you will learn how Databricks and Spark can help solve real-world problems you face when working with retail data. You’ll learn how to deal with dirty data, and get started with Structured Streaming and real-time analytics. Students will also receive a longer take-home capstone exercise as bonus content to the class where they can apply all the concepts presented. This class is taught concurrently in Python and Scala.

    Sign-up for the Retail and Consumer Goods Experience at Summit!

    To take advantage of the full Retail and Consumer Packaged Goods Experience at Spark + AI Summit, simply register for our free virtual conference and select Retail and Consumer Packaged Goods Forum during the registration process. If you’re already registered for the conference, log into your registration account, edit “Additional Events” and check the forum you would like to attend.

    --

    Try Databricks for free. Get started today.

    The post Retail and Consumer Goods Sessions You Don’t Want to Miss at Spark + AI Summit 2020 appeared first on Databricks.

    On-Demand Virtual Session: Customer Lifetime Value

Before you can provide personalized services and offers to your customers, you need to know who they are. In this virtual workshop, retail and media experts demonstrate how to build advanced customer lifetime value (CLV) models. From there, companies can make the right investment in each customer in order to create personalized offers, save tactics, and experiences.

    In this on-demand virtual session, Steve Sobel and Rob Saker talk about the need, impact and challenges of companies pursuing customer lifetime value, and why Databricks Unified Data Analytics Platform is optimal for helping to simplify how data is processed and analyzed for CLV. Then Bryan Smith, Databricks Global Technical Leader for Retail, walks through different ways of calculating CLV using retail data, though this will be applicable across all industries looking to understand the value of each customer using historical behavioral patterns. There was strong audience participation during the session. We’ve provided written responses to questions below.

    Watch now

    Notebooks from the webinar

    Relevant blog posts

    Q&A from chat not answered live

    Q: I understand CLV is important but shouldn’t there be an emphasis on VLC – Value to Customers as the customers see it… corporations create the churn because one arm does not know what the other arm is doing. What is your POV?
    A: We absolutely agree. CLV is one slice of understanding the customer. It sits within a broad ecosystem of analysis.

    Q: How does customer value decline over time? isn’t that by definition monotonically increasing over time? or is the plot showing profit rather than value? Is there an underlying cost that makes the curve eventually bend downwards? or is it going downwards because the plot is about PREDICTED value and that can change as a function of events happening?
A: Cumulatively, the value of the customer increases up to the point where they stop engaging with you. The CLV curve shown is more about the value of that customer at that point in their life stage. As an example, if we have a declining customer but our spend on them is stable, their relative value will decline.

    Q: What is Databricks actively doing with Microsoft to close gaps so that Delta Lake is fully available throughout the data ecosystem?
    A: Delta Lake has been open sourced for about a year now.  We are seeing fantastic adoption of the Delta Lake pattern and technology by customers, vendors and within the open source community at large.  While we can’t speak to specific MSFT roadmaps, they are one of our closest partners and we work closely with them on many platform integrations. Delta Lake and the modern data architecture are quickly becoming the de facto approach for modern data & AI organizations.

    Q: Any reason why we chose t-SNE over other clustering approaches? Why is t-SNE well suited to this problem space?
A: t-SNE, like PCA, is a dimensionality-reduction technique. We’re simply using it to help us visualize our data in advance of clustering. For actual clustering, we’ll use k-means a bit later. There are certainly other techniques we could use, but just think of t-SNE (and PCA) as a feature engineering step that enables us to get to visualizations.
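
As a minimal sketch of that workflow (using scikit-learn, with a randomly generated matrix standing in for the engineered customer features), the projection and the clustering are two separate steps:

import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

features = np.random.rand(500, 8)                          # hypothetical per-customer feature matrix
embedded = TSNE(n_components=2).fit_transform(features)    # 2-D projection used only for visualization
clusters = KMeans(n_clusters=8).fit_predict(features)      # k-means clustering runs on the original features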

    Q: Is there a reason why we are not going with 4 clusters as it yields the similar y-value compared to 8 clusters?
    A: The choice of cluster counts (k) is a bit subjective.  While we use the elbow technique with a silhouette score, there’s always a balancing of the metrics with what is practical/useful.  I chose 8 but you could choose another number if that worked better for you.

    Q: What about gap statistic of selecting # of clusters?
    A: We use silhouette scores to look at inter-cluster and intra-cluster distances but you could use other metrics that focused on one aspect or the other.

    Q: How do you productionalize this model?
    A: In the blogs, we show how you can transform the CLV model into a function which you could use within batch ETL, streaming jobs or interactive queries. There are other ways to deploy the model too but hopefully this will give you an idea of how to approach the task.
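
As a rough sketch of one way to approach this (not the exact code from the blogs; it uses the sample dataset shipped with the lifetimes library, a 90-day horizon chosen only for illustration, and the Spark 3.0-style pandas UDF syntax), a fitted BG/NBD model can be wrapped in a pandas UDF and then applied within batch ETL, streaming jobs or interactive queries:

import pandas as pd
from lifetimes import BetaGeoFitter
from lifetimes.datasets import load_cdnow_summary
from pyspark.sql.functions import pandas_udf

# sample frequency/recency/T summary shipped with lifetimes; replace with your own customer summary
summary = load_cdnow_summary()
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(summary['frequency'], summary['recency'], summary['T'])

@pandas_udf('double')
def purchases_next_90_days(frequency: pd.Series, recency: pd.Series, T: pd.Series) -> pd.Series:
    # expected number of purchases per customer over the next 90 days
    return bgf.conditional_expected_number_of_purchases_up_to_time(90, frequency, recency, T)

# e.g. customers_sdf.withColumn('purchases_90d', purchases_next_90_days('frequency', 'recency', 'T'))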

    Q: Have you found any relation between your RFM segments and Sales (Pareto Rule)?
    A: The reason we use segmentation techniques such as CLV is to avoid generalized rules. Even if true in aggregate, a rule-of-thumb such as “20% of your customers generate 80% of your sales,” is very broad. Not every customer has the same potential and it’s critical you balance your engagement around an understanding of that potential value in order to maintain profitability. An approach such as CLV enables us to be precise with how we engage customers and maximize the ROI of our marketing dollars.

    Q: In a non contractual and continuous setting as this use case, we could use the pareto/NBD model to calculate the retention/survival rate. is this something you are considering?
    A: Absolutely.  In the blog, we consider both the Pareto/NBD and the BG/NBD for this scenario.  We focused on just one in the webinar for expediency.

    Q: In BTYD models, can you incorporate seasonality? Some retail businesses have very seasonal distributions in order count
    A: Not really. Instead, you might want to predict future spend using a regression technique like we demonstrate in the final notebook.

    Q: Have you used Koalas for getting the aggregates by any chance?
    A: We haven’t but it would certainly be an option to help make some of this more distributed in places.

    Q: What was the reason to filter the distribution < 2000?
A: It was an arbitrary value that cuts off outliers that make the histogram harder to render. We keep the outliers in the model but just excluded them from this one visualization.

    Q: Can we save these model results using mlflow?
    A: Absolutely.  This is demonstrated in the blogs.

    Q: How do we validate the CLV result?
A: You can use a cutoff date, or holdout, prior to the end of the dataset to train the model up to that point and then forecast the remainder of the data. In the blogs, we address this practice.

    Q: Is there a way to evaluate the accuracy of this model?
    A: The simple way is to validate all model assumptions and then calculate an MSE against the holdout set as demonstrated in the blogs.
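
As a minimal sketch of that validation (the transaction DataFrame, column names and cutoff dates below are hypothetical, not the ones used in the blogs):

from lifetimes import BetaGeoFitter
from lifetimes.utils import calibration_and_holdout_data
from sklearn.metrics import mean_squared_error

# `transactions` is assumed to be a pandas DataFrame of your raw transaction log
summary = calibration_and_holdout_data(transactions, 'customer_id', 'invoice_date',
                                       calibration_period_end='2011-06-01',
                                       observation_period_end='2011-12-01')

bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(summary['frequency_cal'], summary['recency_cal'], summary['T_cal'])

predicted = bgf.conditional_expected_number_of_purchases_up_to_time(
    summary['duration_holdout'], summary['frequency_cal'], summary['recency_cal'], summary['T_cal'])

print(mean_squared_error(summary['frequency_holdout'], predicted))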

    Q: I have encountered cases where fitting the model fails because of the distribution of the data set? Any suggestions on how to overcome this?
A: I’ve encountered some problems with the latest build of lifetimes that sound similar to what you are describing. Notice in the notebooks that we are explicitly pinning to an earlier build of the library.

Q: We are noticing a marked change in shopper (in-store and online) buying behavior. So how do you recommend running these models given the pre-March 2020 (COVID) and post-March 2020 periods?
    A: This is a really good question and relevant across more than just CLV. We’ve seen a big inflection point that may change the fundamental relationship with customers. Buying patterns have fundamentally shifted in ways we don’t fully understand yet, and may not understand until quarantine and recession patterns stabilize. Key inflection points that fundamentally change the behavior of consumers happen every 7-10 years.

One approach to answering this would be to isolate customer engagement data for the period of quarantine and beyond. The reliability of this model will not be as strong as longer-running models, but it would reflect more of the current volatility. Comparing this to analysis of data prior to COVID might generate unique insights on key shifts in consumer behavior.

If you have a long enough history, you could also look for other extreme events that resemble the current period. What happened in the recession of 2008? Have you had any other large-scale disruptions to consumption, such as natural disasters, where behavior was disrupted and returned in a different way?

    Q: Is it possible to integrate explanatory characteristics of the customer into the BTYD model?
    A: In the BTYD models, the answer is no. In survival models which are frequently employed in contractual situations, the Cox Proportional Hazards model is a popular choice for explaining why customers leave.

    Q: In this use case you are calculating the CLV of existing customers, What about the new customers who joined recently?
    A: We include them as well.  Remember, the models consider individual “signals” in the context of population patterns.  With this in mind, they can make some pretty sizeable leaps for new customers (such that they look more like the general population) until more information comes in which can be used to tailor the curves to the individual.

    Q: Can we think about adding segmentation feature from the previous notebook as feature engineering for CLV?
    A: Absolutely.  If you know you have segments that behave differently, you can build separate models for each.

    Q: For CLV, I saw that you were using pandas. Do you have suggestions for when the data doesn’t fit in memory?
    A: The dataset that’s used to build the pandas DF is often very large.  For that, we use Spark.  But then the resulting dataset is one record per customer. For many organizations, we can squeeze that data set into a pandas DF so that we can use these standard libraries. Remember, we only have a couple numeric features per entry so the summary dataset is pretty light.  If it was just too big, we might then do a random sample of the larger dataset to keep things manageable.

    Q: Is RMSE the best way to score this model?
    A: It is a way. 🙂 You can really use any error metric you feel works well for you.  MAE or MAPE might work well.

    Q: In the DL approach we are not estimating the retention part? We assume that all customers will remain active. right?
A: In the regression techniques, we aren’t really addressing retention, which is why they aren’t true CLV predictors. These are commonly referred to as alternative means of calculating CLV, but we need to recognize they actually do something different (and still potentially useful).

Q: When using Keras, were you using a GPU-enabled cluster?
    A: We didn’t use GPUs here but we could have.

    Q: Have you had any experience with exploring auto-regressive models for predicting CLV?
    A: We haven’t but it might be worth exploring. We suspect they might fall into the “spend prediction” camp like our regression models instead of being CLV estimators.

    Q: Do you have model interpretability built into the platform(eg: SHAP, Dalex)
    A: Not on this model.  It certainly would be interesting to explore but we simply didn’t get to it for this demonstration.

    Q: Bryan just mentioned a pattern for partitioning data and running models in parallel across those partitions, saying it’s often used in forecasting. Can you please share that in your written follow-up to questions?
A: Sure. Check out our blog on "Databricks Fine Grained Forecasting". That provides the most direct explanation of the pattern.
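
As a minimal sketch of that pattern on Spark 3.0+ (with a hypothetical sales_sdf Spark DataFrame and a deliberately naive per-store "forecast"), the idea is to group by the partitioning key and train or score one model per group in parallel:

import pandas as pd

def forecast_store(pdf: pd.DataFrame) -> pd.DataFrame:
    # placeholder logic: in practice, fit a proper forecasting model on this store's history here
    return pd.DataFrame({'store_id': [pdf['store_id'].iloc[0]],
                         'forecast': [float(pdf['sales'].mean())]})

forecasts = (sales_sdf                      # hypothetical Spark DataFrame with store_id and sales columns
             .groupBy('store_id')
             .applyInPandas(forecast_store, schema='store_id long, forecast double'))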

    Q: How do you include shopping channels in this modeling process?
    A: You could segment on channel.  That said, you might want to explore how customers who operate cross-channel would be handled.

Q: Do you have any good references/sources for learning more about enhancing an organization’s clustering practices?
    A: The Springer Open book “Market Segmentation Analysis” is a good read.

    Q: The lifetimes package in python uses maximum likelihood estimate for estimating the best fit. Have you tried using a bayesian approach using pymc3?
    A: We haven’t but it might be interesting to explore if it gives faster or more accurate results.

    Q: Would it be safe to say that the generative models you demonstrated adequately attempt to capture the realizations of renewal processes that often explain customer behavior? For example a gamma process?
    A: The models make tailored generalizations and in that regard balance a bit of the individual with the population.  Remember that we’re not necessarily looking for a perfect prediction at an individual level but instead seeking probable guidance for future investments that average out to be correct.

    Q: Does keras take Spark dataframe?
    A: Not that we’re aware of.  We believe you must pass it pandas/numpy.

    Watch now

    --

    Try Databricks for free. Get started today.

    The post On-Demand Virtual Session: Customer Lifetime Value appeared first on Databricks.

    Simplify Python environment management on Databricks Runtime for Machine Learning using %pip and %conda

Today we announce the release of %pip and %conda notebook magic commands to significantly simplify Python environment management in Databricks Runtime for Machine Learning. With the new magic commands, you can manage Python package dependencies within a notebook scope using familiar pip and conda syntax. For example, you can run %pip install -U koalas in a Python notebook to install the latest koalas release. The change only impacts the current notebook session and associated Spark jobs. With simplified environment management, you can save time in testing different libraries and versions and spend more time applying them to solve business problems and make your organization successful.

    Why We Are Introducing This Feature
    Enable %pip and %conda magic commands
    Adding Python packages to a notebook session
    Managing notebook-scoped environments
    Reproducing environments across notebooks
    Best Practices & Limitations
Future Plans
    Get started with %pip and %conda

    Why We Are Introducing This Feature

Managing Python library dependencies is one of the most frustrating tasks for data scientists. Library conflicts significantly impede the productivity of data scientists, as they prevent them from getting started quickly. Oftentimes the person responsible for providing an environment is not the same person who will ultimately perform development tasks using that environment. In some organizations, data scientists need to file a ticket with a different department (e.g., IT or Data Engineering), further delaying resolution time.

Databricks Runtime for Machine Learning (aka Databricks Runtime ML) pre-installs the most popular ML libraries and resolves any conflicts associated with pre-packaging these dependencies. The feedback has been overwhelmingly positive, as evidenced by the rapid adoption among Databricks customers. However, ML is a rapidly evolving field, and new packages are being introduced and updated frequently. Databricks users often want to customize their environments further by installing additional packages on top of the pre-configured packages or upgrading/downgrading pre-configured packages. It’s important to note that environment changes need to be propagated to all nodes within a cluster before they can be leveraged by the user.

    Improving dependency management within Databricks Runtime ML has three primary use cases:

    • Use familiar pip and conda commands to customize Python environments and handle dependency management.
    • Make environment changes scoped to a notebook session and propagate session dependency changes across cluster nodes.
    • Enable better notebook transportability.

    Enable %pip and %conda magic commands

Starting with Databricks Runtime ML version 6.4, this feature can be enabled when creating a cluster. To do so, set spark.databricks.conda.condaMagic.enabled to true under “Spark Config” (Edit > Advanced Options > Spark). See Figure 1.

    Using the Databricks Runtime ML UI to enable the %pip and %conda magic commands at cluster creation.

    Figure 1. Enable the Feature at Cluster Creation
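
For reference, the entry to add under “Spark Config” is the single key/value pair named above:

spark.databricks.conda.condaMagic.enabled true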

    After the cluster has started, you can simply attach a Python notebook and start using %pip and %conda magic commands within Databricks!

    Note: This feature is not yet available in PVC deployments and Databricks Community Edition.

    Adding Python packages to a notebook session

    If you want to add additional libraries or change the versions of pre-installed libraries, you can use %pip install. For example, the following command line adds koalas 0.32.0 to the Python environment scoped to the notebook session:

    %pip install koalas==0.32.0

    Pinning the version is highly recommended for reproducibility. The change only impacts the current notebook session, i.e., other notebooks connected to this same cluster won’t be affected. The installed libraries will be available on the driver node as well as on all the worker nodes of the cluster in Databricks for your PySpark jobs launched from the notebook.
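
As a quick illustrative sketch (the tiny DataFrame below is hypothetical), the newly installed library can be used immediately in the same notebook, including in Spark jobs launched from it:

import databricks.koalas as ks

kdf = ks.DataFrame({'customer_id': [1, 2, 3]})   # hypothetical toy data
print(kdf.to_spark().count())                    # koalas is available to Spark jobs launched from this notebook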

    Databricks recommends using %pip if it works for your package. If the package you want to install is distributed via conda, you can use %conda instead. For example, the following command upgrades Intel MKL to the latest version:

    %conda update mkl

    The notebook session restarts after installation to ensure that the newly installed libraries can be successfully loaded. To best facilitate easily transportable notebooks, Databricks recommends putting %pip and %conda commands at the top of your notebook.

    Managing notebook-scoped environments

    In Databricks Runtime ML, the notebook-scoped environments are managed by conda. You can use %conda list to inspect the Python environment associated with the notebook.

    Figure 2. Inspect the Python environment associated with a notebook using %conda list.

    From the Databricks Runtime ML UI, you can inspect the Python environment associated with the notebook using the %conda list magic command.

    Conda provides several advantages for managing Python dependencies and environments within Databricks:

    • Environment and dependency management are handled seamlessly by the same tool.
    • Conda environments support both pip and conda to install packages.
    • Conda’s powerful import/export functionality makes it the ideal package manager for data scientists.

Through conda, notebook-scoped environments are ephemeral to the notebook session. So if a library installation goes awry or dependencies become messy, you can always reset the environment to the default one provided by Databricks Runtime ML and start again by detaching and reattaching the notebook.

For advanced conda users, you can use %conda config to change the configuration of the notebook-scoped environment, e.g., to add channels or to configure proxy servers.

    Reproducing environments across notebooks

    For a team of data scientists, easy collaboration is one of the key reasons for adopting a cloud-based solution. The %conda magic command makes it easy to replicate Python dependencies from one notebook to another.

You can use %conda env export -f /dbfs/path/to/env.yml to export the notebook environment specifications as a yaml file to a designated location. Figure 4 shows saving the yaml file to a DBFS folder using its local file interface:

Figure 4. Use %conda env export to export environment specifications to a designated DBFS location

    Using the Databricks magic command %conda env export to export environment specifications to a designated DBFS location.

A different user can import the yaml file in her notebook by using %conda env update -f. By doing so, she installs all the libraries and dependencies from the yaml file into her current notebook session. See Figure 5.

    Figure 5. Use %conda env update to import the environment specifications from a designated DBFS location

    Using the Databricks magic command %conda env update to import the environment specifications from a designated DBFS location

    Databricks recommends using the same Databricks Runtime version to export and import the environment file for better compatibility.
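
For example, with an illustrative DBFS path, the round trip looks like this (the first command runs in the source notebook, the second in the destination notebook):

%conda env export -f /dbfs/tmp/team_env.yml

%conda env update -f /dbfs/tmp/team_env.yml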

    Best Practices & Limitations

Databricks does not recommend using %sh pip/conda install in Databricks Runtime ML. %sh commands might not change the notebook-scoped environment and might affect only the driver node. It’s not a stable way to interface with dependency management from within a notebook.

As discussed above, libraries installed via %conda commands are “ephemeral”, and the notebook will revert to the default environment after it is detached and reattached to the cluster. If you need libraries that are always available on the cluster, you can install them in an init script or using a Docker container. Libraries installed via the Databricks Library UI/APIs (which support only pip packages) will also be available across all notebooks on the cluster that are attached after library installation. Conda package installation is currently not available via the Library UI/API.

    Currently, %conda activate and %conda env create are not supported. We are actively working on making these features available.

    Future Plans

We introduced dbutils.library.* APIs in Databricks Runtime to install libraries scoped to a notebook, but they are not available in Databricks Runtime ML. Conversely, this new %conda/%pip feature is only available in Databricks Runtime ML, not in Databricks Runtime. Our long-term goal is to unify the two experiences with a minimal-effort migration path. We will start by bringing %pip to Databricks Runtime soon.

    We introduced Databricks Runtime with Conda (Beta) in the past. This Runtime is meant to be experimental. With the new %pip and %conda feature now available in Databricks Runtime for ML, we recommend users running workloads in Databricks Runtime with Conda (Beta) to migrate to Databricks Runtime for ML. We do not plan to make any more releases of Databricks Runtime with Conda (Beta).

    As discussed above, we are actively working on making additional Conda commands available in ML Runtime, most notably %conda activate and %conda env create. For a complete list of available or unavailable Conda commands, please refer to our Documentation.

    Get started with %pip and %conda

For more usage of %pip and %conda, please see our user guide and example notebooks (AWS/Azure).

    --

    Try Databricks for free. Get started today.

    The post Simplify Python environment management on Databricks Runtime for Machine Learning using %pip and %conda appeared first on Databricks.
