Customer Lifetime Value Part 2: Estimating Future Spend

Download the Customer Lifetimes Part 2 notebook to demo the solution covered below, and watch the on-demand virtual workshop to learn more. You can also go to Part 1 to learn how to estimate customer lifetime duration.

20% of your customers account for 200% of your profits.

You read that correctly. This seems a mathematical impossibility at first glance, but as Harvard Business School professor Sunil Gupta points out in Driving Digital Strategy, this calculation works when you realize that many of your customers are unprofitable.

While the exact ratio may vary by business, it is crucial that each business identifies its high-value customers, cultivates long-term relationships with them, and attracts more customers of this caliber.

While some companies may go to the extent of firing unprofitable customers, at a minimum, firms should identify unprofitable customers and minimize additional investment in them. Bain & Company 1 is famous for its analysis that shows increasing customer retention rates by 5% increases profitability by 25-to-95%, but the critical lesson is that this is achieved when we retain the right customers.

The potential profitability of any given customer is not always apparent, and the development of long-term, high-value relationships often requires a significant upfront investment. In non-subscription models where customers can come and go as they please, the best we can do is interpret the signals generated by individual customers in terms of the frequency, recency, and monetary value of their transactional interactions and from these estimate future revenue potential.

Figure 1. Three customers with the same number of historical transactions but differing expectations for future engagement and spend

WHY CLV IS SO IMPORTANT

Customer Lifetime Value (CLV) is a cornerstone metric in modern marketing. Whether you are selling men’s fashion 2, craft spirits 3 or rideshare services 4, the net present value of future spend by a customer helps guide investments in customer retention and provides a measuring stick for overall marketing effectiveness. When calculated at the individual level, CLV can help us separate our best customers from our worst and position every customer in between.

This recognition of the differing potential of various customers, coupled with an understanding of their personal preferences, provides us a basis for effective personalization. In a 2019 survey 5 of 600 senior marketers in the retail, travel, and hospitality industries, companies reporting the highest ROI from personalization were twice as likely to name customer lifetime value as a primary business objective compared to those who achieved lower returns. CLV is foundational to customer-centric engagement. That said, CLV is a tricky metric to calculate correctly 6.

Deriving Customer Lifetime Value

The simplest CLV formulas multiply average annual revenue (or profit) by average customer lifetime to arrive at the total potential profit or revenue we may obtain from a typical customer.  Formulations of CLV, which operate on these simple averages, are helpful in orienting us to the two key levers which drive CLV, namely customer lifespan and customer spend. But if you’ve read the first part of this two-part blog series (or watched this entertaining presentation by Peter Fader 7), you know that simple averages with their assumptions of a balanced (normal) distribution of values do not reflect the reality of these measures. While this sounds a little esoteric, what’s important to understand is that in failing to account for the skewed range of frequencies and spend surrounding customer transactions, these formulas can severely misrepresent the real CLV of our customer base.
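To make the simple formulation concrete, here is the naive arithmetic described above, sketched in Python with purely hypothetical numbers:

# Naive CLV: average annual revenue multiplied by average customer lifetime.
# The figures below are hypothetical and used only to illustrate the formula.
avg_annual_revenue = 100.0      # average revenue per customer per year
avg_customer_lifetime = 3.0     # average customer relationship length, in years

naive_clv = avg_annual_revenue * avg_customer_lifetime   # 300.0 per "typical" customer

As the rest of this section argues, this single number can be badly misleading when transaction frequency and spend are heavily skewed across the customer base.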

Figure 2. Per-customer daily spend totals show a long right-hand tail of higher spenders that simple averages ignore

In addition, these averages articulate something about the general state of the overall customer population and not the individual customers we are attempting to serve in a more personalized manner.  Many organizations attempt to correct for this by segmenting their customers and deriving segment-specific CLVs.  While a bit more tailored to the customers in a segment, such approaches miss shifts in individual customer behavior that may indicate their lowered or elevated potential for returns.

A proper formulation of CLV examines individual customers’ patterns of engagement relative to patterns observed across the customer population.  Popular models for this purpose emerged in the late 1980s but were underutilized because of the mathematical complexity involved.  These Buy ’til You Die (BTYD) models experienced a renaissance in the mid-2000s, when revisions allowed the math to be greatly simplified.  Still, to call the BTYD models easy to calculate for most practitioners would be an overstatement.  Thankfully, the logic behind these models has been encapsulated in popular libraries that make the calculations far more accessible to traditional enterprises.
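The companion notebook walks through this in detail; as a rough illustration only, here is a minimal sketch of the BTYD approach using the open-source lifetimes Python library, assuming a pandas DataFrame named transactions with hypothetical customer_id, date, and sales columns:

import pandas as pd
from lifetimes import BetaGeoFitter, GammaGammaFitter
from lifetimes.utils import summary_data_from_transaction_data

# Derive per-customer frequency, recency, age (T) and average spend
# from raw transactions (column names here are placeholders)
summary = summary_data_from_transaction_data(
    transactions, "customer_id", "date", monetary_value_col="sales"
)

# BG/NBD model: estimates the expected number of future transactions
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(summary["frequency"], summary["recency"], summary["T"])

# Gamma-Gamma model: estimates expected spend per transaction (repeat customers only)
repeat = summary[summary["frequency"] > 0]
ggf = GammaGammaFitter(penalizer_coef=0.001)
ggf.fit(repeat["frequency"], repeat["monetary_value"])

# Combine the two models into a per-customer 12-month CLV,
# discounted at 1% per month (as in Figure 3 below)
clv = ggf.customer_lifetime_value(
    bgf,
    summary["frequency"], summary["recency"], summary["T"], summary["monetary_value"],
    time=12,
    discount_rate=0.01,
)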

Bringing CLV to the Enterprise

As discussed in the previous blog, the use of these libraries makes the proper calculation of individualized CLV much easier, but there are still several technical hurdles that need to be overcome. These challenges are well addressed through a collection of capabilities popular with Data Engineering and Data Science practitioners and available through the Databricks platform. (You can read more about these challenges and see how they are addressed by reviewing the previous blog post and its associated notebook.)

Figure 3. Twelve-month CLV for individual customers calculated using a 1% monthly discount rate

So if the technical challenge is largely addressed, how then might we bring these per-customer CLV calculations into our day-to-day processes?   First, we need to recognize that CLV is never a given.  Product innovation, shifts in customer needs and preferences, and changes in the competitive marketplace can alter individual patterns of engagement and estimated CLV.  As such, aggregate CLV (both in total and normalized for the size of our customer base) is a metric that should be monitored on an ongoing basis to assess shifts in customer equity.

Figure 4. Five-year projections of aggregate CLV presented in 6-month intervals

In addition, we should seek to understand what separates our higher valued customers from our lower valued ones. Differences in customer characteristics and behaviors may illuminate how different customers value our offerings and allow us to steer these in directions that maximize profitability. Similarly, we may be able to enhance our customer acquisition strategy, targeting new, look-alike customers that are likely to join our pool of high-valued customers.

Investments in capabilities and experiences may also be assessed in terms of which resonate with which customer tiers.  Higher investment offerings such as loyalty programs, mobile applications, or personalized services that lengthen customer relationship lifetimes or increase per-transaction spend may justify on-going investments.  Failure to move the needle on CLV may justify changes or abandonment of such offerings.

Finally, we need to bring CLV to the forefront of our customer engagements.  When deciding which offers or promotions to present to customers via advertisements, mailings, or banners, CLV can be used to better ensure we invest the right way in customer relationships. When handling an issue of customer satisfaction, CLV may similarly inform us of the lengths we may go to  preserve a healthy relationship with a specific customer.

No relationship need ever be managed as if it were governed by pure calculus, but, still, we might carefully consider that not every customer has the same potential for return and meter our investments appropriately. The technical barriers to process integration are largely a non-concern.  Today, it’s a matter of shifting our practices to deliver valued products and services to customers while also maintaining a healthy, profitable relationship.

Getting Started

Watch the On-Demand Virtual Workshop

Download the Notebook

Customer Lifetimes Part 2 notebook

--

Try Databricks for free. Get started today.

The post Customer Lifetime Value Part 2: Estimating Future Spend appeared first on Databricks.


Time Traveling with Delta Lake: A Retrospective of the Last Year

Try out Delta Lake 0.7.0 with Spark 3.0 today!

It has been a little more than a year since Delta Lake became an open-source project under the Linux Foundation.  While a lot has changed over the last year, the challenge for most data lakes remains stubbornly the same – their inherent unreliability.  To address this, Delta Lake brings reliability and data quality to data lakes and Apache Spark; learn more by watching Michael Armbrust’s session at Big Things Conference.

Watch Michael Armbrust discuss Delta Lake: Reliability and Data Quality for Data Lakes and Apache Spark in the on-demand webcast.

With Delta Lake, you can simplify and scale your data engineering pipelines and improve data quality with the medallion data flow of the Delta Architecture.

Delta Lake Primer

To provide more detail, the following section gives an overview of the features of Delta Lake, with links to various blogs and tech talks that dive into the technical aspects, including the Dive into Delta Lake Internals Series of tech talks.

The Delta Architecture with the medallion data quality data flow

Building upon the Apache Spark Foundation

Transactions

  • ACID transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level. Learn more at the Diving into Delta Lake: Unpacking the Transaction Log blog and tech talk.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.  To see this in action, try out the Delta Lake Tutorial from Spark + AI Summit EU 2019.
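As a quick illustration of that unified behavior, here is a minimal PySpark sketch that reads the same Delta table as a batch DataFrame and as a stream; the path /delta/events is a placeholder:

# Batch read of a Delta table
events_batch = spark.read.format("delta").load("/delta/events")

# Streaming read of the same table
events_stream = spark.readStream.format("delta").load("/delta/events")

# Streaming write into another Delta table (a checkpoint location is required)
(events_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/events_copy/_checkpoints")
    .outputMode("append")
    .start("/delta/events_copy"))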

Data Lake Enhancements

Schema Enforcement and Evolution

  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption. For more information, refer to Diving Into Delta Lake: Schema Enforcement & Evolution blog and tech talk.
  • Schema Evolution: Business requirements continuously change, therefore the shape and form of your data does as well. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL. For more information, refer to Diving Into Delta Lake: Schema Enforcement & Evolution blog and tech talk.
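A minimal PySpark sketch of the two behaviors, assuming an existing Delta table at the placeholder path /delta/events and an incoming DataFrame new_df whose schema contains an extra column:

# Schema enforcement: by default, an append whose schema does not match
# the table's schema is rejected with an AnalysisException
new_df.write.format("delta").mode("append").save("/delta/events")

# Schema evolution: opting in with mergeSchema adds the new columns
# to the table schema as part of the write
(new_df.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .save("/delta/events"))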

Checkpoints from the last year

In April 2019, we announced that Delta Lake would be open-sourced with the Linux Foundation; the source code for the project can be found at https://github.com/delta-io/delta.  In that time span, the project has quickly progressed with releases (6 so far), contributors (65 so far), and stars (>2500).  At this time, we wanted to call out some of the cool features.

Execute DML statements

With Delta Lake 0.3.0, you now have the ability to run DELETE, UPDATE, and MERGE statements using the Spark API.  Instead of running a convoluted mix of INSERTs, file-level deletions, and table removals and re-creations, you can execute DML statements within a single atomic transaction.

import io.delta.tables._

val deltaTable = DeltaTable.forPath(sparkSession, pathToEventsTable)
deltaTable.delete("date < '2017-01-01'")        // predicate using SQL formatted string

import org.apache.spark.sql.functions._
import spark.implicits._

deltaTable.delete($"date" < "2017-01-01")       // predicate using Spark SQL functions and Scala implicits

In addition, this release included the ability to query commit history to understand what operations modified the table.

import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToTable)

val fullHistoryDF = deltaTable.history()    // get the full history of the table.

val lastOperationDF = deltaTable.history(1) // get the last operation.

The returned DataFrame will have the following structure.

+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+
|version|          timestamp|userId|userName|operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+
|      5|2019-07-29 14:07:47|  null|    null|   DELETE|[predicate -> ["(...|null|    null|     null|          4|          null|        false|
|      4|2019-07-29 14:07:41|  null|    null|   UPDATE|[predicate -> (id...|null|    null|     null|          3|          null|        false|
|      3|2019-07-29 14:07:29|  null|    null|   DELETE|[predicate -> ["(...|null|    null|     null|          2|          null|        false|
|      2|2019-07-29 14:06:56|  null|    null|   UPDATE|[predicate -> (id...|null|    null|     null|          1|          null|        false|
|      1|2019-07-29 14:04:31|  null|    null|   DELETE|[predicate -> ["(...|null|    null|     null|          0|          null|        false|
|      0|2019-07-29 14:01:40|  null|    null|    WRITE|[mode -> ErrorIfE...|null|    null|     null|       null|          null|         true|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+    

Delta Lake 0.4.0 added Python APIs for executing DML statements, as noted in Simple, Reliable Upserts, and Deletes on Delta Lake Tables using Python APIs.

Sample merge executed by a DML statement made possible by Delta Lake 0.4.0
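A minimal sketch of the Python DML API introduced in Delta Lake 0.4.0, assuming an existing Delta table at the placeholder path pathToEventsTable:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, pathToEventsTable)

# Delete rows matching a predicate
deltaTable.delete("date < '2017-01-01'")

# Update rows matching a predicate (the assigned value is a SQL expression string)
deltaTable.update(
    condition="eventType = 'clck'",
    set={"eventType": "'click'"}
)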

Support for other processing engines

A foundational principle of Delta Lake is that, while it is a storage layer originally conceived to work with Apache Spark, it can work with many other processing engines.  As part of the Delta Lake 0.5.0 release, we included the ability to create manifest files so that you can query Delta Lake tables from Presto and Amazon Athena.

The blog post Query Delta Lake Tables from Presto and Athena, Improved Operations Concurrency, and Merge performance provides examples of how to create the manifest file to query Delta Lake from Presto; for more information, refer to Presto and Athena to Delta Lake Integration.  Included as part of the same release was the experimental support for Snowflake and Redshift Spectrum. More recently, we’d like to call out integrations with dbt and koalas.
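As a rough sketch of the manifest mechanism described above, the following generates the symlink manifest files that Presto and Athena read, assuming the Python DeltaTable API and a Delta table at the placeholder path /delta/events (an equivalent Scala API is also available):

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/delta/events")

# Write _symlink_format_manifest files alongside the table so that
# Presto or Athena can be pointed at them as an external table
deltaTable.generate("symlink_format_manifest")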

Delta Lake Connectors allow you to standardize your big data storage by making it accessible from various tools, such as Amazon Redshift and Athena, Snowflake, Presto, Hive, and Apache Spark.

With Delta Connector 0.1.0, your Apache Hive environment can now read Delta Lake tables.  With this connector, you can create a table in Apache Hive using STORED BY syntax to point it to an existing Delta table like this:

CREATE EXTERNAL TABLE deltaTable(col1 INT, col2 STRING)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION '/delta/table/path'

Simplifying Operational Maintenance

As your data lakes grow in size and complexity, it becomes increasingly difficult to maintain them.  With Delta Lake, each release has included more features to simplify the operational overhead.  For example, Delta Lake 0.5.0 includes improvements in concurrency control and support for file compaction.  Delta Lake 0.6.0 made further improvements, including support for reading Delta tables from any file system, improved merge performance, and automatic repartitioning.
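A minimal sketch of the file-compaction pattern enabled by Delta Lake 0.5.0, which rewrites a table into fewer, larger files without changing its contents; the path and target file count are placeholders:

path = "/delta/events"   # placeholder table path
num_files = 16           # placeholder target number of files

(spark.read
    .format("delta")
    .load(path)
    .repartition(num_files)
    .write
    .option("dataChange", "false")   # marks this as a layout-only rewrite
    .format("delta")
    .mode("overwrite")
    .save(path))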

As noted in Schema Evolution in Merge Operations and Operational Metrics in Delta Lake, Delta Lake 0.6.0 introduces schema evolution and performance improvements in merge and operational metrics in table history. By enabling automatic schema evolution in your environment,

# Enable automatic schema evolution
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")

you can run a single atomic operation to update values as well as merge together the new schema with the following example statement.

from delta.tables import *
deltaTable = DeltaTable.forPath(spark, DELTA_PATH)

# Schema Evolution with a Merge Operation
# new_data is the incoming DataFrame; with autoMerge enabled, any columns in its
# schema that are missing from the target table are added as part of the merge
deltaTable.alias("t").merge(
    new_data.alias("s"),
    "s.col1 = t.col1 AND s.col2 = t.col2"
).whenMatchedUpdateAll(  
).whenNotMatchedInsertAll(
).execute()

Improvements to operational metrics were also included in the release so that you can review them from both the API and the Spark UI.  For example, running the statement:

deltaTable.history().show()

provides the abbreviated output of the modifications that had happened to your table.

+-------+------+---------+--------------------+
|version|userId|operation|    operationMetrics|
+-------+------+---------+--------------------+
|      1|100802|    MERGE|[numTargetRowsCop...|
|      0|100802|    WRITE|[numFiles -> 1, n...|
+-------+------+---------+--------------------+    

For the same action, you can view this information directly within the Spark UI as visualized in the following animated GIF.

Schema Evolution in Merge Operations and Operational Metrics Spark UI example

For more details surrounding this action, refer to Schema Evolution in Merge Operations and Operational Metrics in Delta Lake.

Enhancements coming with Spark 3.0

While the preceding section has been about our recent past, let’s get back to the future and focus on the enhancements coming with Spark 3.0.

Support for Catalog Tables

Delta tables can be referenced in an external catalog, such as the Hive Metastore, with Delta Lake 0.7.0. Look out for the Delta Lake 0.7.0 release, which works with Spark 3.0, in the coming weeks.
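As a rough sketch, assuming a Spark 3.0 session configured with the Delta catalog extensions, a Delta table could then be defined in the metastore and addressed by name rather than by path; the table names below are placeholders:

# Create a new Delta table in the metastore
spark.sql("CREATE TABLE events (id LONG, eventType STRING) USING delta")

# Or register an existing path-based Delta table under a name
spark.sql("CREATE TABLE events_by_path USING delta LOCATION '/delta/events'")

# Query it by name like any other catalog table
spark.sql("SELECT count(*) FROM events_by_path").show()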

Expectations - NOT NULL columns

Delta tables can be created by specifying columns as NOT NULL. This will prevent any rows containing null values for those columns from being written to your tables.

CREATE TABLE events (
    eventTime TIMESTAMP NOT NULL,
    eventType STRING NOT NULL,
    source STRING,
    tags MAP<STRING, STRING>
)
USING delta

More support is on the way, for example, the definition of arbitrary SQL expressions as invariants, as well as the ability to define these invariants on existing tables.

DataFrameWriterV2 API

DataFrameWriterV2 is a much cleaner interface for writing a DataFrame to a table. Table creation operations such as create and replace are separate from data modification operations such as append and overwrite, giving users a better understanding of what to expect. DataFrameWriterV2 APIs are only available in Scala with Spark 3.0.

// Create a table using the DataFrame or replace the existing table
df.writeTo("delta_table")
  .tableProperty("delta.appendOnly", "true")
  .createOrReplace()

// Insert more data into the table
df2.writeTo("delta_table").append()

Get Started with Delta Lake

Try out Delta Lake with the preceding code snippets on your Apache Spark 2.4.5 (or greater) instance (on Databricks, try this with DBR 6.6+). Delta Lake makes your data lakes more reliable (whether you create a new one or migrate an existing data lake).  To learn more, refer to https://delta.io/, and join the Delta Lake community via Slack and Google Group.  You can track all the upcoming releases and planned features in GitHub milestones. You can also try out Managed Delta Lake on Databricks with a free account.

--

Try Databricks for free. Get started today.

The post Time Traveling with Delta Lake: A Retrospective of the Last Year appeared first on Databricks.

Government and Education Sessions You Don’t Want to Miss at Spark + AI Summit 2020

For years, the Spark + AI Summit has been the premier meeting place for organizations looking to build data analytics and AI applications at scale with leading open-source technologies such as Apache Spark™, Delta Lake and MLflow. In 2020, we’re continuing the tradition by taking the summit entirely virtual. Data scientists and engineers from anywhere in the world will join us June 22-26, 2020 to learn and share best practices for delivering the benefits of AI.

This year we have a robust experience for data teams in the Federal, State, and Local Government and Education Sector. Join thousands of your peers to explore how the latest innovations in data and AI are improving how we serve our citizens, secure our homeland, and fight waste and fraud. Register for Spark + AI Summit to take advantage of all the government and education sessions and events.

Government and Education Tech Talks

Here is an overview of some of our most highly anticipated government and education session talks at this year’s summit:

Automating Federal Aviation Administration’s (FAA) System Wide Information Management (SWIM) Data Ingestion and Analysis
Department of Transportation

The System Wide Information Management (SWIM) Program is a National Airspace System information system that supports Next Generation Air Transportation System (NextGen) goals. SWIM essentially provides the data-sharing backbone for NextGen, while the SWIM Cloud Distribution Service provides publicly available Federal Aviation Administration SWIM content to approved consumers via Solace JMS messaging. In this session, Microsoft will showcase the work they did at USDOT-BTS on automating the required infrastructure, configuration, ingestion, and analysis of public SWIM data sets.

Using Apache Spark for Predicting Degrading and Failing Parts in Aviation
U.S. Navy

Throughout naval aviation, data lakes provide the raw material for generating insights into predictive maintenance and increasing readiness across many platforms. Civilian and military aviation datasets are extremely large and heterogeneous, but Apache Spark has enabled a small team to handle the volume and variety across hundreds of schemas. In this talk, learn about the successful utilization of these tools, and how Spark will play a major role in aviation reporting and analysis in the future.

Lessons Learned from Modernizing USCIS Data Analytics Platform
U.S. Citizenship and Immigration Services

USCIS seeks to secure America’s promise as a nation of immigrants by providing accurate and useful information to customers, granting immigration and citizenship benefits, promoting an awareness and understanding of citizenship, and ensuring the integrity of our immigration system. Although a recent move to the cloud improved their capabilities, they required a dynamically scalable platform that could adapt and cater to the growing data demand. Join this presentation and technical demo for a deep dive into what it took to accomplish this goal, as well as lessons learned while using Databricks and related technologies like Apache Spark, Delta Lake and MLflow.

Geospatial Options in Apache Spark
Pacific Northwest National Lab

Geospatial data appears to be simple right up until the part when it becomes intractable. In this talk, Pacific Northwest National Lab will take you through the many gotcha moments with geospatial data in Spark, as well as why geospatial data in general can be so challenging. Critically, they’ll run through how they’ve approached these issues to limit errors and reduce cost, the pros and cons of each geospatial package, and how they migrate geospatial data. This talk will also include their best practices for ingesting geospatial data as well as how they store it for long term use.

Using Apache Spark and Differential Privacy for Protecting the Privacy of the 2020 Census Respondents
U.S. Census Bureau

One of the data challenges of the 2020 Census is making high-quality data usable while protecting respondent confidentiality. The U.S. Census Bureau is achieving this with differential privacy, a mathematical approach that allows them to balance the requirements for data accuracy and privacy protection. In this talk, they’ll present the design of their custom-written, Spark-based differential privacy application, and discuss the monitoring systems they built in Amazon’s GovCloud for multiple clusters and thousands of application runs.

You can see the full list of talks on our Government and Education Summit page.

Government Industry Forum

Join us on Wednesday, June 24, 11:30 AM – 1:00 PM PST for an interactive Government Forum at Spark + AI Summit. In this free virtual event, you will have the opportunity to network with your peers and participate in engaging panel discussions with Federal and State and Local Government leaders on how data and machine learning are driving innovation across agencies. Panelists include:

Chase Baker
Unit Chief

Rob Brown
CTO

Scott Porter
CIO

Eileen M. Vidrine
CDO

Education Industry Forum

Join us on Wednesday, June 24, 11:30 AM – 1:00 PM PST for an interactive Education Industry Forum at Spark + AI Summit. In this free virtual event, you will have the opportunity to network with your peers and participate in engaging panel discussions with leaders in academia on how data and machine learning are driving innovation in student success, the online delivery of education, and more. Panelists include:

Colby Ford
Faculty Member

Charlie Lindville
Visiting Associate Professor of Statistics

Donghwa Kim
Adjunct Professor

Government and Education Fireside Chats

Matt Turner
Chief Data Officer

Building a Modern Unified Data Analytics Architecture for Real-time COVID Response

Join this interactive fireside chat with Matt Turner, Chief Data Officer of MUSC, to learn how they built a modern unified data analytics architecture that enables their teams to unlock insights buried within their data and build powerful predictive models. More specifically, you’ll learn how this strategy prepared MUSC to quickly respond to the dynamic environment of COVID-19. 

Patrick Munis
CEO

Powering Innovation in Public Sector and Healthcare with Unified Data Analytics and AI in the Cloud

The opportunity to improve government operations and citizen services with data and AI is massive. Unfortunately, most government agencies and healthcare organizations are limited by inflexible legacy data warehouses and analytics architectures that create data silos leading to poor and disjointed analytics capabilities. Join NewWave for an interactive fireside chat to learn how agencies can take advantage of all their data and build powerful predictive models with a secure, unified approach to data and AI including real-world stories from the Centers for Medicare and Medicaid Services (CMS). The audience will also hear about the journey NewWave had with Databricks and Azure to bring Databricks onto the Microsoft Azure Government (MAG) cloud.

Scaling Mission AI Faster with Accenture AIP IQ + Databricks

The Accenture Insights Platform for Government (AIP IQ) brings together best-of-breed analytics and data management as a FedRAMP authorized, cloud-based service. Databricks Unified Data Analytics Platform is now integrated into and accessible via AIP IQ. Join this discussion to learn how federal agencies are capitalizing on the power of AIP IQ + Databricks to support popular federal use cases. We’ll also share how agencies are using the platform to reduce the cost of advanced data science through the development of repeatable workflows and models.

Demos on Popular Data + AI Use Cases in Government and Education

Join live demos on the hottest data and AI use cases in the government and education sector covering topics such as:

  • Opioid Epidemic Modeling
  • Building a Data Lake for Citizen 360
  • Real Time Cyber Threat Detection
  • Imagery Analysis & Detection
  • Improving Student Success with Predictive Analytics
  • Detecting Financial Fraud at Scale
  • Predictive Analytics in a Real Time World

Sign-up for the Government and Education Experience at Summit!

To take advantage of the full Government and Education Experience at Spark + AI Summit, simply register for our free virtual conference and select Government and Education Forum during the registration process. If you’re already registered for the conference, log into your registration account, edit “Additional Events” and check the forum you would like to attend.

--

Try Databricks for free. Get started today.

The post Government and Education Sessions You Don’t Want to Miss at Spark + AI Summit 2020 appeared first on Databricks.

Introducing Apache Spark 3.0

We’re excited to announce that the Apache Spark™ 3.0.0 release is available on Databricks as part of our new Databricks Runtime 7.0. The 3.0.0 release includes over 3,400 patches and is the culmination of tremendous contributions from the open-source community, bringing major advances in Python and SQL capabilities and a focus on ease of use for both exploration and production. These initiatives reflect how the project has evolved to meet more use cases and broader audiences, with this year marking its 10-year anniversary as an open-source project.

Here are the biggest new features in Spark 3.0, each covered in more detail below:

  • Adaptive Query Execution (AQE) for better runtime query plans
  • Dynamic partition pruning
  • ANSI SQL compliance improvements and new join hints
  • Significant improvements to the pandas APIs in PySpark, including Python type hints and new pandas UDFs
  • Better Python error handling
  • Accelerator-aware scheduling (Project Hydrogen)
  • A new UI for structured streaming, observable metrics, and a new catalog plug-in API

No major code changes are required to adopt this version of Apache Spark. For more information, please check the migration guide.

Celebrating 10 years of Spark development and evolution

Spark started out of UC Berkeley’s AMPlab, a research lab focused on data-intensive computing. AMPlab researchers were working with large internet-scale companies on their data and AI problems, but saw that these same problems would also be faced by all companies with large and growing volumes of data. The team developed a new engine to tackle these emerging workloads and simultaneously make the APIs for working with big data significantly more accessible to developers.

Community contributions quickly came in to expand Spark into different areas, with new capabilities around streaming, Python and SQL, and these patterns now make up some of the dominant use cases for Spark. That continued investment has brought Spark to where it is today, as the de facto engine for data processing, data science, machine learning and data analytics workloads. Apache Spark 3.0 continues this trend by significantly improving support for SQL and Python — the two most widely used languages with Spark today — as well as optimizations to performance and operability across the rest of Spark.

Improving the Spark SQL engine

Spark SQL is the engine that backs most Spark applications. For example, on Databricks, we found that over 90% of Spark API calls use DataFrame, Dataset and SQL APIs along with other libraries optimized by the SQL optimizer. This means that even Python and Scala developers pass much of their work through the Spark SQL engine. In the Spark 3.0 release, 46% of all the patches contributed were for SQL, improving both performance and ANSI compatibility. As illustrated below, Spark 3.0 performed roughly 2x better than Spark 2.4 in total runtime. Next, we explain four new features in the Spark SQL engine.

The new Adaptive Query Execution (AQE) framework improves performance and simplifies tuning by generating a better execution plan at runtime, even if the initial plan is suboptimal due to absent/inaccurate data statistics and misestimated costs. Because of the storage and compute separation in Spark, data arrival can be unpredictable. For all these reasons, runtime adaptivity becomes more critical for Spark than for traditional systems. This release introduces three major adaptive optimizations:

  • Dynamically coalescing shuffle partitions simplifies or even avoids tuning the number of shuffle partitions. Users can set a relatively large number of shuffle partitions at the beginning, and AQE can then combine adjacent small partitions into larger ones at runtime.
  • Dynamically switching join strategies partially avoids executing suboptimal plans due to missing statistics and/or size misestimation. This adaptive optimization can automatically convert sort-merge join to broadcast-hash join at runtime, further simplifying tuning and improving performance.
  • Dynamically optimizing skew joins is another critical performance enhancement, since skew joins can lead to an extreme imbalance of work and severely downgrade performance. After AQE detects any skew from the shuffle file statistics, it can split the skew partitions into smaller ones and join them with the corresponding partitions from the other side. This optimization can parallelize skew processing and achieve better overall performance.
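These optimizations are driven by configuration; as a rough sketch (not an exhaustive list, and defaults may differ by environment), the relevant Spark 3.0 settings can be enabled from PySpark as follows:

# Turn on Adaptive Query Execution and its major sub-features (Spark 3.0)
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Coalesce small shuffle partitions at runtime
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Split skewed partitions detected from shuffle statistics
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")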

Based on a 3TB TPC-DS benchmark, compared with Spark without AQE, Spark with AQE yields more than 1.5x performance speedups for two queries and more than 1.1x speedups for another 37 queries.

TPC-DS 3TB Parquet With vs. Without Adaptive Query Execution

Dynamic Partition Pruning is applied when the optimizer is unable to identify at compile time the partitions it can skip. This is not uncommon in star schemas, which consist of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In a TPC-DS benchmark, 60 out of 102 queries show a significant speedup of between 2x and 18x.

TPC-DS 1TB Parquet With vs. Without Dynamic Partition Pruning

ANSI SQL compliance is critical for workload migration from other SQL engines to Spark SQL. To improve compliance, this release switches to the Proleptic Gregorian calendar and also enables users to forbid using the reserved keywords of ANSI SQL as identifiers. Additionally, we’ve introduced runtime overflow checking in numeric operations and compile-time type enforcement when inserting data into a table with a predefined schema. These new validations improve data quality.
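A minimal sketch of the configuration switches involved, assuming a Spark 3.0 session (the values shown are illustrative rather than defaults):

# Opt in to ANSI-compliant behavior such as runtime overflow checking
# and reserved-keyword handling
spark.conf.set("spark.sql.ansi.enabled", "true")

# Enforce ANSI store-assignment rules when inserting into tables
# with a predefined schema
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")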

Join hints: While we continue to improve the compiler, there’s no guarantee that the compiler can always make the optimal decision in every situation — join algorithm selection is based on statistics and heuristics. When the compiler is unable to make the best choice, users can use join hints to influence the optimizer to choose a better plan. This release extends the existing join hints by adding new hints: SHUFFLE_MERGE, SHUFFLE_HASH and SHUFFLE_REPLICATE_NL.
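A minimal sketch of the new hints, usable from SQL or through DataFrame.hint(); the tables t1 and t2 and DataFrames df1 and df2 are placeholders:

# SQL: request a shuffle hash join, building the hash table on t1
spark.sql("SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 JOIN t2 ON t1.key = t2.key")

# DataFrame API: the same hint names are accepted by DataFrame.hint()
df1.join(df2.hint("SHUFFLE_MERGE"), "key")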

Enhancing the Python APIs: PySpark and Koalas

Python is now the most widely used language on Spark and, consequently, was a key focus area of Spark 3.0 development. 68% of notebook commands on Databricks are in Python. PySpark, the Apache Spark Python API, has more than 5 million monthly downloads on PyPI, the Python Package Index.

Many Python developers use the pandas API for data structures and data analysis, but it is limited to single-node processing. We have also continued to develop Koalas, an implementation of the pandas API on top of Apache Spark, to make data scientists more productive when working with big data in distributed environments. Koalas eliminates the need to build many functions (e.g., plotting support) in PySpark, to achieve efficient performance across a cluster.

After more than a year of development, the Koalas API coverage for pandas is close to 80%. Monthly PyPI downloads of Koalas have rapidly grown to 850,000, and Koalas is evolving quickly with a biweekly release cadence. While Koalas may be the easiest way to migrate your single-node pandas code, many still use the PySpark APIs, which are also growing in popularity.

Weekly PyPI Downloads for PySpark and Koalas

Spark 3.0 brings several enhancements to the PySpark APIs:

  • New pandas APIs with type hints: pandas UDFs were initially introduced in Spark 2.3 for scaling user-defined functions in PySpark and integrating pandas APIs into PySpark applications. However, the existing interface is difficult to understand when more UDF types are added. This release introduces a new pandas UDF interface that leverages Python type hints to address the proliferation of pandas UDF types (see the sketch after this list). The new interface becomes more Pythonic and self-descriptive.
  • New types of pandas UDFs and pandas function APIs: This release adds two new pandas UDF types, iterator of series to iterator of series and iterator of multiple series to iterator of series. These are useful for data prefetching and expensive initialization. Also, two new pandas function APIs, map and co-grouped map, are added. More details are available in this blog post.
  • Better error handling: PySpark error handling is not always friendly to Python users. This release simplifies PySpark exceptions, hides the unnecessary JVM stack trace, and makes them more Pythonic.
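A minimal sketch of the type-hinted pandas UDF interface referenced above, assuming a Spark 3.0 session and a DataFrame df with a numeric column v:

import pandas as pd
from pyspark.sql.functions import pandas_udf

# The UDF type is inferred from the Python type hints (series -> series)
@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

df.select(plus_one(df.v)).show()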

Improving Python support and usability in Spark continues to be one of our highest priorities.

Hydrogen, streaming and extensibility 

With Spark 3.0, we’ve finished key components for Project Hydrogen as well as introduced new capabilities to improve streaming and extensibility.

  • Accelerator-aware scheduling: Project Hydrogen is a major Spark initiative to better unify deep learning and data processing on Spark. GPUs and other accelerators have been widely used for accelerating deep learning workloads. To make Spark take advantage of hardware accelerators on target platforms, this release enhances the existing scheduler to make the cluster manager accelerator-aware. Users can specify accelerators via configuration with the help of a discovery script. Users can then call the new RDD APIs to leverage these accelerators.
  • New UI for structured streaming: Structured streaming was initially introduced in Spark 2.0. After 4x YoY growth in usage on Databricks, more than 5 trillion records per day are processed on Databricks with structured streaming. This release adds a dedicated new Spark UI for inspection of these streaming jobs. This new UI offers two sets of statistics: 1) aggregate information of streaming query jobs completed and 2) detailed statistics information about streaming queries.

Trend in the number of records processed by Structured Streaming on Databricks

  • Observable metrics: Continuously monitoring changes to data quality is a highly desirable feature for managing data pipelines. This release introduces monitoring for both batch and streaming applications. Observable metrics are arbitrary aggregate functions that can be defined on a query (DataFrame). As soon as the execution of a DataFrame reaches a completion point (e.g., finishes batch query or reaches streaming epoch), a named event is emitted that contains the metrics for the data processed since the last completion point.
  • New catalog plug-in API: The existing data source API lacks the ability to access and manipulate the metadata of external data sources. This release enriches the data source V2 API and introduces the new catalog plug-in API. For external data sources that implement both catalog plug-in API and data source V2 API, users can directly manipulate both data and metadata of external tables via multipart identifiers, after the corresponding external catalog is registered.

Other updates in Spark 3.0

Spark 3.0 is a major release for the community, with over 3,400 Jira tickets resolved. It’s the result of contributions from over 440 contributors, including individuals as well as companies like Databricks, Google, Microsoft, Intel, IBM, Alibaba, Facebook, Nvidia, Netflix, Adobe and many more. We’ve highlighted a number of the key SQL, Python and streaming advancements in Spark for this blog post, but there are many other capabilities in this 3.0 milestone not covered here. Learn more in the release notes and discover all the other improvements to Spark, including data sources, ecosystem, monitoring and more.

Major Features in Spark 3.0 and Databricks Runtime 7.0.

Get started with Spark 3.0 today

If you want to try out Apache Spark 3.0 in the Databricks Runtime 7.0, sign up for a free trial account and get started in minutes. Using Spark 3.0 is as simple as selecting version “7.0” when launching a cluster.

Using Spark 3.0 in Databricks Runtime 7.0 is as simple as selecting it from the drop-down menu when launching a cluster.

Learn more about feature and release details:

--

Try Databricks for free. Get started today.

The post Introducing Apache Spark 3.0 appeared first on Databricks.

Avanade and Databricks Partner to Deliver Data and AI Solutions with Azure Databricks

We’re excited to share the announcement of our partnership with Avanade to enable enterprise clients to scale their Azure Data and artificial intelligence (AI) investments and to generate positive business results. In addition to the hundreds of trained Microsoft Azure Databricks specialists, Avanade has a set of solution accelerators to help operationalize data engineering, data science and machine learning on top of Azure cloud solutions. The combination of expertise from both companies, especially on Azure, will make it easier for our joint customers to modernize and implement advanced analytics using Azure Databricks.

This partnership builds upon the work Avanade and Databricks have already done together for several years to deliver client solutions that embed AI throughout various business processes and experiences. We’ve seen that despite 88% of global decision makers investing in machine learning, only 8% of these companies are engaging in core practices that support AI adoption at scale. Most companies are applying AI to just a single part of their business or running ad-hoc data science pilot programs, missing opportunities to generate new sources of revenue and engage with customers. Avanade and Databricks are working together to help data teams address these gaps and move towards a competitive advantage in their industry.

Focus on Modernization and Scalability

Avanade and Databricks have worked together on a number of solutions and projects. Luke Pritchard, the Global Data Lead for Avanade, put it this way: “This partnership builds upon the work we have already done together for several years to deliver solutions that help our clients scale their Azure Data and AI investments to generate business results.” Here are a few key areas of our partnership to highlight:

1. Modernization and cloud migration

Limitations with on-premise data systems, like Hadoop, are pushing data teams to explore new cloud-computing alternatives. However, planning and migrating business applications from one environment to another is no easy feat. It takes a lot of time and technical expertise to develop a proper migration plan, refactor the data architecture, and validate outcomes with the desired results. This is where Avanade and Databricks enable a smoother migration path from legacy data systems to modern data architectures.

2. Production ready machine learning

Every enterprise has opportunities to accelerate innovation by building data science and machine learning into their business. When it comes time to automate and govern the preparation of large datasets for analytics, and to establish processes and automation for moving models from development to production, the extent of what is needed becomes clear. Avanade and Databricks help streamline the full machine learning lifecycle with a repository of industry-specific ML models and pipeline templates that automate data preparation and promote reuse of data transformation scripts.

3. Data science at scale

The power of bringing data together across business units and systems gives organizations a competitive edge, but doing so often takes months of infrastructure and DevOps work. It also requires multiple handoffs between data engineering and data science, which is error prone and increases risk. Together, Avanade and Databricks help organizations develop an enterprise analytics strategy specific to their industry, bridge the talent gap in advanced analytics, and ensure scalability and sustainability through built-in security and maintenance.

Industry Case Studies

Avanade and Databricks have helped customers across industries leverage open-source software, big data analytics, machine learning and AI to modernize their data platforms and engage customers. Here are a few examples:

Hadoop Migration for Global Pharma

A global pharmaceutical company wanted to operationalize and expand their data science capabilities when Avanade helped them move from their on-premise system to Azure. By building a data platform that leveraged Azure and Azure Databricks, the company’s data scientists were able to automate the majority of their data preparation, experiment with their models, and train algorithms faster. As a result, the reduction in repeat work and improvement of data science capabilities allowed the company to reduce costs and uncover new revenue opportunities.

Industrial Supply Chain Optimization

When thyssenkrupp, an industrial engineering and steel production company, wanted to optimize their delivery network to address rising supply and transportation costs, they immediately thought of AI in the cloud. Together with Avanade and Databricks, thyssenkrupp built the cloud-based platform alfred.simOne to automatically analyze and run simulations. The completed simulations led to optimized operations, increased cost savings, and reduced emissions. Internally, thyssenkrupp was better able to bring their data engineering and data science teams together for improved collaboration and development of innovative solutions that had a real impact on the way they do business.

Financial Services Customer Personalization

One financial services company repeatedly saw that customers were abandoning credit card applications. They chose to work with Avanade after realizing they needed to scale their data science initiatives to maximize insights and create a more personalized customer experience. Avanade helped implement a real-time data platform using Azure Databricks with a unified view into each customer across various time intervals. The solution made it easier for their marketing team to segment customers by type, serve them a relevant application, and ultimately reduce abandonment and churn.

Learn More

To learn more, please see the Databricks page on the Avanade website, or Contact Us.

Be sure to also check out the Avanade session at Spark + AI Summit on June 25, 2020 at 11:00-11:30 am PT. In this session, you’ll learn how to scale the use of data science and artificial intelligence (AI) for accelerated business results. You will also gain insights into high-impact use cases and learn why a design-led approach helps you achieve a higher success rate and accelerate value enterprise-wide.

--

Try Databricks for free. Get started today.

The post Avanade and Databricks Partner to Deliver Data and AI Solutions with Azure Databricks appeared first on Databricks.

Online Learning with Databricks

Recently, a gentleman named Scott Galloway was featured in an article in New York Magazine about how COVID-19 is breaking the higher education business model.  The article led to an invitation to a live interview with Anderson Cooper (a full transcript of that interview can be found here).  When I watched that segment and saw Anderson Cooper’s eyes go wide, it brought me back to a faculty meeting circa 1998, wherein a roomful of tenured faculty members worried about their relevance as pre-millennium venture capital met the business opportunities of online education. Now it appears that the COVID-19 pandemic has brought this concern to our front door.

So here we are now, in uncertain times for our politics, our society, our economy, and for all of the educational systems that will produce the leaders of tomorrow.  Tough problems for a tough time… Where do we go from here?

Making Online Learning Better

Online learning is the new normal, the primary learning environment.  In this digital world, teachers are provided new lenses through which to engage students.  While teachers no longer have a brick and mortar classroom and face-to-face office hours for instruction, the digital world gives them a different set of teaching tools.

In this digital world, every interaction leaves a data footprint: between the student and the teacher, the student and the content, the student and his or her project groups, and with external information sources.  The digital learning environment can tell us what content students interact with, how long they spend on a page or video replay, whether they are writing originally or executing a ‘copy and paste’ function, how they are collaborating with classmates, and a host of other information about their experience.  In a video conference session, it can tell us if they are attentive to the screen as well as whether they are interacting with quiz, testing, or survey content.

Data from populations of students can identify how they respond to different combinations of classes, identifying class combinations that are toxic as well as those likely to lead them to flourish in their learning experience.  Their social media feeds can inform satisfaction with their learning experience and their engagement with their education community, allowing educators to make real-time adjustments to teaching and engagement.

COVID-19 has forced this mode of education on everyone everywhere.  We can’t change that reality, but we can certainly learn from it!  It is an educational experiment mediated by technology whose scope will never be repeated (we hope), so let us pose some timely questions:

Can online instruction scale up without losing impact?  At what scale do we have diminishing returns, if any?  One instructor for a physical classroom filled with 30 students is a norm; what is a good norm for online instruction, and how should we leverage technology to make it as effective as possible?

Is there a way to look at student interactions and student grades so that we have early warning systems for identification and intervention with at-risk students?  Can we look at data to help us tailor courses and learning materials to maximize the success potential for students with non-traditional backgrounds? Can we identify mental health concerns to help students get access to resources and support before things get worse?

If the business model of higher education is broken, how must it evolve?  Over the long run, what determines the value of a degree from a specific school with a specific major?  In the short run, how can schools leverage data from across its enterprise to optimize recruiting, retention, and job placement upon graduation?  If a university must focus on scarcity to maintain its brand value, how can it best engage its network of graduates, students, and parents to participate in school activities?

None of these questions are new, but their urgency has certainly spiked, and data analysis at scale can certainly help solve these tough problems, as shown by the following examples.

Real-world Examples

A virtual classroom at Berkeley with global reach and key technology partners

According to Scott Galloway in the transcript of his interview with Anderson Cooper, the University of California, Berkeley will graduate more kids from low-income households this year than the entire Ivy League.  In addition, the university has been using tools and methods for instruction that have a global reach for many years.  This recent article from InformationWeek highlights one such course that has been using technology to run a global classroom in real time since 2014.  Kyle Hamilton, one of the instructors for Data Science W261: Machine Learning at Scale course, teaches students the skills needed to process the big data needed to address the tough problems mentioned above.  The course builds on and goes beyond the collect-and-analyze phase of big data by focusing on how machine learning algorithms can be rewritten and extended to scale to work on petabytes of data, both structured and unstructured, to generate sophisticated models used for real-time predictions.  Databricks is honored that Kyle has chosen us as a technology partner.

Using data at scale to improve the business model of online education

Western Governors University (WGU) was founded in 1997 by 19 US governors as a non-profit, all-online competency-based university offering undergraduate and graduate degrees.  It has more than 180,000 graduates, 123,000 active students and more than 6,800 employees.  WGU has made a commitment to understanding its students, employees, and educational platforms through data.  Databricks helps Western Governors improve student success by providing a one-stop for data access, democratizing data to allow for access to all employees, freeing up employees to do deep-dives on course- and student-related dashboards, and improving time-to-insight via streamlined ETL.

WGU uses data not only to deliver education content, but for real-time assessment of each student’s learning experience, with real-time data systems to inform teachers what is working, and to intervene directly with students where the process is not working.  WGU also applies data analytics over time with AI models that help educators learn to optimize the learning experience.

Using Data to Map the Way Forward

Data has always had the potential to make education better, improving its reach and impact across digital divides, and help the millennia-old model of the classroom adjust to modern-day economic and social constructs.  It is up to us now to map the path forward, using data to find the truths that will help the current educational system efficiently produce a generation of diverse, well-educated and workforce-ready students.

Getting Started

--

Try Databricks for free. Get started today.

The post Online Learning with Databricks appeared first on Databricks.

Announcing the 2020 Databricks Data Team Award Finalists

Spark + AI Summit, the world’s biggest gathering of data and artificial intelligence (AI) professionals, has arrived, and this year’s theme is ‘Data Teams Unite!’ So what better time to announce the finalists for the inaugural Databricks Data Team Awards?

The Databricks Data Team Awards celebrate the data teams of engineers, scientists and analysts who are leveraging data and AI to solve the world’s toughest problems. Here at Databricks, we are proud to support these teams with our Unified Analytics platform, which provides a single environment and common data sets to enable collaboration across organizations.

Our selected finalists have helped to redefine what a unified data team can accomplish when they work together on one platform to achieve a common goal—delivering innovation, impact, and helping to make the world a better place.

Here are the finalists in each of the three categories:

Data Team for Good Award

Some data teams are tackling issues that impact us all. And right now, there’s no problem more urgent for data teams than helping healthcare providers, governments, and life sciences organizations find ways to better manage and treat individuals and communities impacted by the COVID-19 pandemic.

Aetion

Aetion’s data team is working on a high-impact use case related to the COVID-19 crisis. Specifically, Aetion has partnered with HealthVerity to use Databricks to ingest and process data from multiple inputs into real-time data sets used to analyze COVID-19 interventions and to study the pandemic’s impact on health care utilization. Their integrated solution includes a Real-Time Evidence Platform that enables biopharma, regulators, and public health officials to generate evidence on the usage, safety, and effectiveness of prospective treatments for COVID-19 and to continuously update and expand this evidence over time. This new, high-priority use case for Aetion has already produced a social impact: it will be employed in the company’s new research collaboration with the U.S. FDA, which will support the agency’s understanding of and response to the pandemic.

Alignment Healthcare

Alignment Healthcare, a rapidly growing Medicare insurance provider, serves one of the most at-risk groups of the COVID-19 crisis—seniors. While many health plans rely on outdated information and siloed data systems, Alignment processes a wide variety and large volume of near real-time data into a unified architecture to build a revolutionary digital patient ID and comprehensive patient profile by leveraging Azure Databricks. This architecture powers more than 100 AI models designed to effectively manage the health of large populations, engage consumers, and identify vulnerable individuals needing personalized attention—with a goal of improving members’ well-being and saving lives.

Medical University of South Carolina (MUSC)

MUSC is dedicated to delivering the highest quality patient care available while training generations of competent, compassionate health care providers to serve the people of South Carolina and beyond. Known as a pioneer, MUSC stepped forward with ingenuity to assist patients during the COVID-19 pandemic. MUSC has developed machine learning models, trained on its AI Workbench and Databricks, to predict COVID-19-positive patients and prioritize testing for high-risk individuals. As a result, MUSC has been able to greatly increase the percentage of high-risk patients tested for COVID-19 and use the application to target at-risk populations across South Carolina.

Data Team Impact Award

These are the data teams delivering impact to their organizations through measurable outcomes such as more engaging customer and user experiences, reduced risk, and accelerated time-to-market.

Disney+

Disney+ surpassed 50 million paid subscribers in just five months and is available in more than a dozen countries around the world.  Data is essential to understanding customer growth and improving the overall customer experience for any streaming business.  Disney+ uses Databricks as a core component of its data lake, and with Delta Lake it has been able to build streaming and batch data pipelines supporting petabytes of data.  The platform enables teams to collaborate on ideas, explore data, and apply machine learning across the entire customer journey to foster growth in its subscriber base.

Unilever

Unilever’s Information and Analytics team has enabled over 50 use cases that drive the business by optimizing data products from the Unilever data lake.  At the heart of this data and analytics architecture are cloud-based platforms such as Azure and Delta Lake, through which Unilever’s unified data team is able to process data more rapidly than before, unlocking new business insights that deliver impactful value to business analysts, data scientists and business leaders.

YipitData

YipitData provides data-driven research to empower investors by combining alternative data sources with web data for comprehensive coverage. By leveraging Databricks, YipitData’s data team has been able to reduce processing time by up to 90 percent, increasing their analysts’ ability to deliver impactful, reliable insights to their clients. Additionally, by moving to Databricks on AWS and decoupling querying from storage, YipitData has reduced database expenses by almost 60%, from $1.2M per year to less than $500K.

Data Team Innovation Award

This award recognizes data teams that have pushed the boundaries of what’s possible with data and AI, implementing compelling new use cases that will not only help their organization, but also drive the whole community forward.

Comcast

As one of the key media and telecommunications leaders in the US, Comcast connects millions of people to the moments and experiences that matter most. The Product Analytics & Behavioral Science (PABS) organization makes that mission possible by translating customer product interaction data into insights that internal teams can use to prescriptively improve existing products and innovate with new ones. This unified team of data engineers and scientists has built an end-to-end data pipeline on top of Delta Lake that generates data at a rate of more than 25 TB per day, with over 3 PB of data used for consumable insights. aIQ, Comcast’s customer experience platform, uses this data to build a representative state of each customer’s products and services and to contextually resolve customer questions through digital options in an efficient and timely manner, so customers don’t have to pick up the phone and call Comcast.

Goldman Sachs

To better support its clients, the Goldman Sachs Marcus Data team continues to innovate its offerings and, in this instance, leveraged Databricks to build a next-generation big data analytics platform that addresses diverse use cases, spanning credit risk assessment, fraud detection, marketing analytics and compliance. The unified data team not only built a robust and reliable infrastructure but also activated and empowered hundreds of analysts and developers within a matter of months.

Zalando

Zalando is Europe’s leading online platform for fashion and lifestyle, based in Berlin, Germany. The company follows a platform approach, offering fashion and lifestyle products to customers in 17 European markets. Databricks is the go-to solution for batch and streaming workloads on large-scale data. Data engineers and Business Intelligence practitioners at Zalando appreciate the ease of use and performance of Databricks.

The countdown is on

The Databricks Data Team Award winners for 2020 will be announced on Friday, June 26, and will be celebrated in an upcoming blog post, so check back to see which of the finalists earned the top spots.

--

Try Databricks for free. Get started today.

The post Announcing the 2020 Databricks Data Team Award Finalists appeared first on Databricks.

Key sessions for AWS customers at Spark + AI Summit


At Databricks, we are excited to have Amazon Web Services (AWS) as a sponsor of the first virtual Spark + AI Summit. Our work with AWS continues to make Databricks better integrated with other AWS services, making it easier for our customers to build and scale their analytics workloads.
 
Key sessions at the Spark + AI 2020 Summit for AWS customers
As part of Spark + AI Summit, we wanted to highlight some of the top sessions that AWS customers might find interesting. A number of customers who are running Databricks on AWS are speaking at Spark + AI Summit, from organizations such as Airbnb, Capital One, Lyft, Zynga and Atlassian. The sessions below were chosen based on their relevance for customers using Databricks on the AWS cloud platform, demonstrating key service integrations. If you have questions about your AWS platform or service integrations, visit the AWS booth at Spark + AI Summit.

Building A Data Platform For Mission Critical Analytics
Hewlett Packard WEDNESDAY, 11:00 AM (PDT)

This session will feature a presentation by Sally Hoppe, Big Data System Architect at HP. Sally and her team have developed an impressive data platform to handle IoT information from HP printers, enabling them to derive insights into customer use and to deliver a better, more continuous service experience. The HP platform includes integrations between Databricks and Delta Lake with AWS services such as Amazon S3 and Amazon Redshift as well as other technologies such as Apache Airflow and Apache Kafka. This session will also feature Igor Alekseev, Partner Solution Architect at AWS, and Denis Dubeau, Partner Solution Architect at Databricks.

Data Driven Decisions at Scale

Comcast WEDNESDAY, 11:35 AM (PDT)

Comcast is the largest cable and internet provider in the US, reaching more than 30 million customers, and continues to grow its presence in the EU with the acquisition of Sky. Comcast has rolled out its Flex device, which allows customers to stream content directly to their TVs without needing an additional cable subscription. The customer data analytics program generates data at a rate of more than 25 TB per day, with over 3 PB of data used for consumable insights. To continue driving consumable insights on massive data sets while controlling the amount of data being stored, the Product Analytics & Behavioral Science (PABS) team has been using Databricks and Delta Lake on S3 to perform highly concurrent, low-latency reads and writes, building reliable real-time data pipelines that deliver insights and also support efficient deletes in a timely manner.
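
The abstract above describes the core Delta Lake pattern at work: many concurrent readers and writers against the same S3-backed table, plus efficient deletes. As a rough orientation only (not Comcast’s code), a minimal sketch of those operations might look like the following; the table path and column name are assumptions.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical Delta table on S3 holding product-interaction events
events_path = "s3://example-bucket/events/delta"

# Readers always see a consistent snapshot via Delta's transaction log,
# even while streaming jobs continue to append new files.
events = spark.read.format("delta").load(events_path)

# Deletes rewrite only the affected data files rather than the whole table,
# which is what makes retention and cleanup jobs efficient and timely.
DeltaTable.forPath(spark, events_path).delete("event_date < '2019-01-01'")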

Deep Learning Enabled Price Action with Databricks and AWS

WEDNESDAY, 2:30 PM (PDT)

Predicting the movements of price action instruments such as stocks, ForEx, commodities, etc., has been a demanding problem for quantitative strategists for years. Simply applying machine learning to raw price movements has proven to yield disappointing results. New deep learning tools can substantially improve results when applied to traditional technical indicators, including their corresponding entry and exit signals, rather than to raw prices. In this session, Kris Skrinak and Igor Alekseev explore the use of Databricks analysis tools combined with deep learning training in Amazon SageMaker to enhance the predictive quality of two technical indicators: MACD and slow stochastics.
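
For readers unfamiliar with the two indicators named in the abstract, here is a minimal pandas sketch of how MACD and slow stochastics are conventionally computed from price series. This is not the presenters’ code, and the standard parameter values (12/26/9 and 14/3) are assumptions about their setup.

import pandas as pd

def macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """MACD line, signal line and histogram from a closing-price series."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return pd.DataFrame({"macd": macd_line, "signal": signal_line,
                         "hist": macd_line - signal_line})

def slow_stochastic(high: pd.Series, low: pd.Series, close: pd.Series,
                    k: int = 14, smooth: int = 3) -> pd.DataFrame:
    """Slow stochastic oscillator (smoothed %K plus %D)."""
    lowest_low = low.rolling(k).min()
    highest_high = high.rolling(k).max()
    fast_k = 100 * (close - lowest_low) / (highest_high - lowest_low)
    slow_k = fast_k.rolling(smooth).mean()
    slow_d = slow_k.rolling(smooth).mean()
    return pd.DataFrame({"slow_k": slow_k, "slow_d": slow_d})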

Saving Energy in Homes with a Unified Approach to Data and AI

Quby THURSDAY, 11:35 AM (PDT)

Quby is an Amsterdam-based technology company offering solutions to empower homeowners to stay in control of their electricity, gas and water usage. Using Europe’s largest energy dataset, consisting of petabytes of IoT data, the company has developed AI-powered products that are used by hundreds of thousands of users daily to maintain a comfortable climate in their homes and reduce their environmental footprint. This session will cover how Quby leverages the full Databricks stack to quickly prototype, validate, launch and scale data science products. They will also cover how Delta Lake allows batch and streaming on the same IoT data and the impact these tools have had on the team itself.

The 2020 Census and Innovation in Surveys

US Census Bureau THURSDAY, 12:10 PM (PDT)

The U.S. Census Bureau is the leading source of quality data about the people of the United States and its economy. The Decennial Census is the largest mobilization and operation conducted in the United States – enlisting hundreds of thousands of temporary workers – and requires years of research, planning, and development of methods and infrastructure to ensure an accurate and complete count of the U.S. population, currently estimated at 330 million.

Census Deputy Division Chief, Zack Schwartz, will provide a behind the scenes overview on how the 2020 Census is conducted with insights into technical approaches and adopting industry best practices. He will discuss the application monitoring systems built in Amazon’s GovCloud to monitor multiple clusters and thousands of application runs that were used to develop the Disclosure Avoidance System for the 2020 Census. This presentation will leave the audience with an appreciation for the magnitude of the Census and the role that technology plays, along with a vision for the future of surveys, and how attendees can do their part in ensuring that everyone is counted.

Building a Real-Time Feature Store at iFood

iFood THURSDAY, 2:30 PM (PDT)

iFood is the largest food tech company in Latin America, serving more than 26 million orders each month from more than 150 thousand restaurants. In this talk, you will see how iFood built a real-time feature store, using Databricks and Spark Structured Streaming to process event streams, store them in both a historical Delta Lake table and a low-latency Redis cluster, and structure its development processes so that all of this is done with production-grade, reliable and validated code.
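
As a hedged illustration of the pattern described (not iFood’s actual pipeline), a Structured Streaming job that lands an event stream in a Delta table might look like the sketch below; the Kafka broker, topic, schema, and paths are assumptions, and the Redis sink is omitted.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical event schema
event_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("feature_name", StringType()),
    StructField("feature_value", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # assumption
       .option("subscribe", "feature-events")             # assumption
       .load())

events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
             .select("e.*"))

# Historical store: append the event stream to a Delta table
(events.writeStream
 .format("delta")
 .option("checkpointLocation", "/tmp/checkpoints/feature-events")  # assumption
 .outputMode("append")
 .start("/tmp/delta/feature-events"))                              # assumption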

Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS Sagemaker for Enterprise AI Scenarios

Outreach Corporation FRIDAY, 10:00 AM (PDT)

In this presentation, Outreach will demonstrate how they use MLflow and AWS Sagemaker to productionize deep transformer-based NLP models for guided sales engagement scenarios at the leading sales engagement platform, Outreach.io.

Example of how Outreach uses MLflow and AWS Sagemaker to productionize deep transformer-based NLP models.

We are super excited about this session from an MLOps perspective. The Outreach team will share their experiences and lessons learned in the following areas:

  1. A publishing/consuming framework to effectively manage and coordinate data, models and artifacts (e.g., vocabulary file) at different machine learning stages
  2. A new MLflow model flavor that supports deep transformer models for logging and loading the models at different stages
  3. A design pattern to decouple model logic from deployment configurations and model customizations for a production scenario using MLProject entry points: train, test, wrap, deploy.
  4. A CI/CD pipeline that provides continuous integration and delivery of models into a Sagemaker endpoint to serve the production usage

We think this session is not to be missed.

We hope you find these sessions useful for learning more ways to integrate Databricks with your AWS platform. Please visit the AWS booth to discuss your use cases and get the perspective of AWS experts. More information about our AWS partnership and integrations is available at www.databricks.com/aws

--

Try Databricks for free. Get started today.

The post Key sessions for AWS customers at Spark + AI Summit appeared first on Databricks.


Key sessions for Microsoft Azure customers at Spark + AI Summit


The first-ever virtual Spark + AI Summit is this week: the premier event for data teams (data scientists, engineers and analysts) who will tune in from all over the world to share best practices, discover new technologies, network and learn. At Databricks, we are extremely excited to have Microsoft as a Diamond sponsor, bringing Microsoft and Azure Databricks customers together for a lineup of great keynotes and sessions. Rohan Kumar, Corporate Vice President of Azure Data, will deliver a keynote on Thursday morning, and additional Microsoft speakers and Azure Databricks practitioners will present a wide variety of topics in breakout sessions as well.
 
Key Sessions for Azure customers featured at the Spark + AI 2020 Summit
Rohan Kumar, Corporate Vice President of Azure Data, returns as a keynote speaker for the third year in a row, along with presenters from a number of Azure Databricks customers including Starbucks, Credit Suisse, CVS, ExxonMobil, Mars, Zurich North America and Atrium Health. Below are some of the top sessions to add to your agenda:

KEYNOTE
How Starbucks is achieving its ‘Enterprise Data Mission’ to enable data and ML at scale and provide world-class customer experiences
Starbucks During the WEDNESDAY MORNING KEYNOTE, 8:30 AM – 10:30 AM (PDT)
Vishwanath Subramanian, Director of Data and Analytics Engineering, Starbucks

Starbucks makes sure that everything we do is through the lens of humanity: from our commitment to the highest quality coffee in the world, to the way we engage with our customers and communities to do business responsibly. A key aspect of ensuring those world-class customer experiences is data. This talk highlights the Enterprise Data Analytics mission at Starbucks, which helps make decisions powered by data at tremendous scale. This includes everything from processing data at petabyte scale with governed processes, to deploying platforms at the speed of business, to enabling ML across the enterprise. This session will detail how Starbucks has built world-class enterprise data platforms to drive world-class customer experiences.

KEYNOTE
Responsible ML – Bringing Accountability to Data Science
Microsoft During the THURSDAY MORNING KEYNOTE, 9:00 AM – 10:30 AM (PDT)
Rohan Kumar, Corporate Vice President of Azure Data, Microsoft
Sarah Bird, AI Research and Products, Microsoft

Responsible ML is the most talked about field in AI at the moment. With the growing importance of ML, it is even more important for us to exercise ethical AI practices and ensure that the models we create live up to the highest standards of inclusiveness and transparency. Join Rohan Kumar, as he talks about how Microsoft brings cutting-edge research into the hands of customers to make them more accountable for their models and responsible in their use of AI. For the AI community, this is an open invitation to collaborate and contribute to shape the future of Responsible ML.

KEYNOTE
How Credit Suisse is Leveraging Open Source Data and AI Platforms to Drive Digital Transformation, Innovation and Growth
Credit Suisse During the THURSDAY MORNING KEYNOTE, 9:00 AM – 10:30 AM (PDT)
Anurag Sehgal, Managing Director, Credit Suisse Global Markets

Despite the increasing embrace of big data and AI, most financial services companies still experience significant challenges around data types, privacy, and scale. Credit Suisse is overcoming these obstacles by standardizing on open, cloud-based platforms, including Azure Databricks, to increase the speed and scale of operations, and the democratization of ML across the organization. Now, Credit Suisse is leading the way by successfully employing data and analytics to drive digital transformation, delivering new products to market faster, and driving business growth and operational efficiency.

Automating Federal Aviation Administration’s (FAA) System Wide Information Management ( SWIM ) Data Ingestion and Analysis
Microsoft, Databricks and U.S. DOT WEDNESDAY, 12:10 PM (PDT)

The System Wide Information Management (SWIM) Program is a National Airspace System (NAS)-wide information system that supports Next Generation Air Transportation System (NextGen) goals. SWIM facilitates the data-sharing requirements for NextGen, providing the digital data-sharing backbone of NextGen. The SWIM Cloud Distribution Service (SCDS) is a Federal Aviation Administration (FAA) cloud-based service that provides publicly available FAA SWIM content to FAA-approved consumers via Solace JMS messaging. In this session, we will showcase the work we did at USDOT-BTS on automating the required infrastructure, configuration, ingestion and analysis of public SWIM data sets.

How Azure and Databricks Enabled a Personalized Experience for Customers and Patients at CVS Health

CVS Health WEDNESDAY, 2:30 PM (PDT)

CVS Health delivers millions of offers to over 80 million customers and patients on a daily basis to improve the customer experience and put patients on a path to better health. In 2018, CVS Health embarked on a journey to personalize the customer and patient experience through machine learning on a Microsoft Azure Databricks platform. This presentation will discuss how the Microsoft Azure Databricks environment enabled rapid in-market deployment of the first machine learning model within six months on billions of transactions using Apache Spark. It will also discuss several use cases for how this has driven and delivered immediate value for the business, including test and learn experimentation for how to best personalize content to customers. The presentation will also cover lessons learned on the journey in the evolving industries of cloud computing and machine learning in a dynamic healthcare environment.

Productionizing Machine Learning Pipelines with Databricks and Azure ML

ExxonMobil WEDNESDAY, 2:30 PM (PDT)

Deployment of modern machine learning applications can require a significant amount of time, resources, and experience to design and implement – thus introducing overhead for small-scale machine learning projects.

In this tutorial, we present a reproducible framework for quickly jumpstarting data science projects using Databricks and Azure Machine Learning workspaces that enables easy production-ready app deployment for data scientists in particular. Although the example presented in the session focuses on deep learning, the workflow can be extended to other traditional machine learning applications as well.

The tutorial will include sample-code with templates and recommended project organization structure and tools, along with shared key learnings from our experiences in deploying machine learning pipelines into production and distributing a repeatable framework within our organization.

Cloud and Analytics—From Platforms to an Ecosystem

Zurich North America WEDNESDAY, 3:05 PM (PDT)

Zurich North America is one of the largest providers of insurance solutions and services in the world, with customers representing a wide range of industries from agriculture to construction and more than 90 percent of the Fortune 500. Data science is at the heart of Zurich’s business, with a team of 70 data scientists working on everything from optimizing claims-handling processes, to protecting against the next risk, to revamping the suite of data and analytics for customers.

In this presentation, we will discuss how Zurich North America implements a scalable self-service data science ecosystem built around Databricks to optimize and scale the activities in the data science project lifecycle and integrates the Azure data lake with analytical tools to streamline machine learning and predictive analytics efforts.

Building the Petcare Data Platform using Delta Lake and ‘Kyte’: Our Spark ETL Pipeline

Mars THURSDAY, 12:10 PM (PDT)

At Mars Petcare (in a division known as Kinship Data & Analytics) we are building out the Petcare Data Platform, a cloud-based data lake solution. Leveraging Microsoft Azure, we were faced with important decisions around tools and design. We chose Delta Lake as a storage layer to build out our platform and bring insight to the science community across Mars Petcare. We leveraged Spark and Databricks to build ‘Kyte’, a bespoke pipeline tool which has massively accelerated our ability to ingest, cleanse and process new data sources from across our large and complicated organisation. Building on this, we have started to use Delta Lake for our ETL configurations and have built a bespoke UI for monitoring and scheduling our Spark pipelines. Find out more about why we chose a Spark-heavy ETL design and a Delta Lake-driven platform, and why we are committing to Spark and Delta Lake as the core of our platform to support our mission: Making a Better World for Pets!

Leveraging Apache Spark for Large Scale Deep Learning Data Preparation and Inference

Microsoft THURSDAY, 3:05 PM (PDT)

To scale out deep learning training, a popular approach is to use Distributed Deep Learning Frameworks to parallelize processing and computation across multiple GPUs/CPUs. Distributed Deep Learning Frameworks work well when input training data elements are independent, allowing parallel processing to start immediately. However preprocessing and featurization steps, crucial to Deep Learning development, might involve complex business logic with computations across multiple data elements that the standard Distributed Frameworks cannot handle efficiently. These preprocessing and featurization steps are where Spark can shine, especially with the upcoming support in version 3.0 for binary data formats commonly found in Deep Learning applications. The first part of this talk will cover how Pandas UDFs together with Spark’s support for binary data and Tensorflow’s TFRecord formats can be used to efficiently speed up Deep Learning’s preprocessing and featurization steps. For the second part, the focus will be techniques to efficiently perform batch scoring on large data volume with Deep Learning models where real-time scoring methods do not suffice. Upcoming Spark 3.0’s new Pandas UDFs’ features helpful for Deep Learning inference will be covered.
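
To make the Spark 3.0 features mentioned above a little more concrete, here is a minimal, hypothetical sketch of a scalar pandas UDF applied to binary image data; the featurization logic is a placeholder, not the approach from the talk, and the input path is an assumption.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType

spark = SparkSession.builder.getOrCreate()

@pandas_udf(ArrayType(FloatType()))
def featurize(content: pd.Series) -> pd.Series:
    # Placeholder preprocessing: turn each binary payload into a small
    # float vector. A real pipeline would decode the image and run the
    # model's preprocessing here (loading the model once per worker).
    return content.apply(lambda b: [float(len(b))])

# Spark 3.0's binary file source reads images and similar payloads directly.
df = spark.read.format("binaryFile").load("/data/images/*.jpg")  # path is an assumption
features = df.withColumn("features", featurize("content"))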

All In – Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databricks) – A Real World Case Study

Atrium Health THURSDAY, 3:40 PM (PDT)

Molecular profiling provides precise and individualized cancer treatment options and decision points. By assessing DNA, RNA, proteins, etc., clinical teams are able to understand the biology of the disease and provide specific treatment plans for oncology patients. An integrated database with demographic, clinical and molecular data was created to summarize individualized genomic reports. Oncologists are able to review the reports and receive assistance interpreting results and potential treatment plans. The architecture supporting the current environment includes WASB storage, bash/cron/PowerShell, Hive and Office 365 (SharePoint). Via an automated process, personalized genomics data is delivered to physicians. As we supported this environment, we noted unique challenges and brainstormed a plan for the next generation of this critical business pipeline.

After researching different platforms we felt that Databricks would allow us to cut cost, standardize our workflow and easily scale for a large organization. This presentation will detail some of the challenges with the previous environment, why we chose Apache Spark and Databricks, migration plans and lessons learned, new technology used after the migration (Data Factory/Databricks, PowerApp/Power Automate/Logic App, Power BI), and how the business has been impacted post migration. Migration to Databricks was critical for our organization due to the time sensitivity of the data and our organizational commitment to personalized treatment for oncology patients.

SparkCruise: Automatic Computation Reuse in Apache Spark

Microsoft FRIDAY, 10:35 AM (PDT)

Queries in production workloads and interactive data analytics are often overlapping, i.e., multiple queries share parts of the computation. These redundancies increase the processing time and total cost for the user. To reuse computations, many big data processing systems support materialized views. However, it is challenging to manually select common computations in the workload given the size and evolving nature of the query workloads. In this talk, we will present SparkCruise, an automatic computation reuse system developed for Spark. It can automatically detect overlapping computations in the past query workload and enable automatic materialization and reuse in future Spark SQL queries.

SparkCruise requires no active involvement from the user, as materialization and reuse are applied automatically in the background as part of query processing. We can perform all these steps without changing the Spark code, demonstrating the extensibility of the Spark SQL engine. SparkCruise has been shown to improve the overall runtime of TPC-DS queries by 30%. Our talk will be divided into three parts. First, we will explain the end-to-end system design, with a focus on how we added workload awareness to the Spark query engine. Then, we will demonstrate all the steps, including analysis, feedback, materialization, and reuse, on a live Spark cluster. Finally, we will show the workload insights notebook. This Python notebook displays the information from the query plans of the workload in a flat table, which helps users and administrators understand the characteristics of their workloads and the cost/benefit tradeoff of enabling SparkCruise.

Deploy and Serve Model from Azure Databricks onto Azure Machine Learning

Microsoft FRIDAY, 11:10 AM (PDT)

We demonstrate how to deploy a PySpark-based multi-class classification model, trained on Azure Databricks, onto Azure Kubernetes Service (AKS) using Azure Machine Learning (AML) and expose it as a web service. This presentation covers the end-to-end development cycle, from training the model to using it in a web application.

Machine learning problem formulation: Current solutions for detecting the semantic types of tabular data mostly rely on dictionaries/vocabularies, regular expressions and rule-based lookups. However, these solutions are (1) not robust to dirty and complex data and (2) not generalized to diverse data types. We formulate this as a machine learning problem by training a multi-class classifier to automatically predict the semantic type of tabular data.

Model training on Azure Databricks: We chose Azure Databricks to perform featurization and model training using PySpark SQL and its machine learning library. To speed up the featurization process, we leverage PySpark user-defined functions (UDFs) to register and distribute the featurization functions. For model training, we picked random forests as the classification algorithm and optimized the model hyperparameters using PySpark MLlib.

Model deployment using Azure Machine Learning: Azure Machine Learning provides reusable and scalable capabilities to manage the lifecycle of machine learning models. We developed an end-to-end deployment pipeline on Azure Machine Learning covering model preparation, compute initialization, model registration, and web service deployment.

Serving as a web service on Azure Kubernetes Service: AKS provides fast response and autoscaling capabilities for serving the model as a web service, together with security authorization. We customized the AKS cluster with a PySpark runtime to support PySpark-based featurization and model scoring. Our model and scoring service are deployed onto an AKS cluster and served as HTTPS endpoints with both key-based and token-based authentication.
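
As a rough sketch of the deployment portion of such a workflow (not the presenters’ code), the AML v1 Python SDK can register a model trained on Databricks and deploy it to AKS roughly as follows; the scoring script, environment file, compute target name, paths and service names are all assumptions.

from azureml.core import Workspace, Model, Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AksWebservice
from azureml.core.compute import AksCompute

ws = Workspace.from_config()  # assumes a downloaded config.json for the workspace

# Register the model artifact exported from the Databricks training run
model = Model.register(workspace=ws,
                       model_path="outputs/model",             # assumption
                       model_name="semantic-type-classifier")  # assumption

# Scoring script and environment are user-supplied assumptions
env = Environment.from_conda_specification("scoring-env", "conda.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# Deploy to an existing AKS compute target attached to the workspace
aks_target = AksCompute(ws, "aks-cluster")  # assumption
deploy_config = AksWebservice.deploy_configuration(cpu_cores=2, memory_gb=4,
                                                   auth_enabled=True)

service = Model.deploy(ws, "semantic-type-service", [model],
                       inference_config, deploy_config, aks_target)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)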

We look forward to connecting with you at Spark + AI Summit!  If you have questions about Azure Databricks or Azure  service integrations, please visit the Microsoft Azure virtual booth at Spark + AI Summit.

For more information about Azure Databricks, go to www.databricks.com/azure

--

Try Databricks for free. Get started today.

The post Key sessions for Microsoft Azure customers at Spark + AI Summit appeared first on Databricks.

Introducing Delta Engine


Today, we announced Delta Engine, which ties together a 100% Apache Spark-compatible vectorized query engine, designed to take advantage of modern CPU architecture, with optimizations to Spark 3.0’s query optimizer and caching capabilities that were launched as part of Databricks Runtime 7.0. Together, these features significantly accelerate query performance on data lakes, especially those enabled by Delta Lake, and make it easier for customers to adopt and scale a lakehouse architecture.

Scaling Execution Performance

One of the big hardware trends over the last several years is that CPU clock speeds have plateaued. The reasons are outside the scope of this blog, but the takeaway is that we have to find new ways to process data faster beyond raw compute power. One of the most impactful methods has been to improve the amount of data that can be processed in parallel. However, data processing engines need to be specifically architected to take advantage of this parallelism.

In addition, data teams are being given less and less time to properly model data as the pace of business increases. Poorer modeling in the interest of better business agility drives poorer query performance. Naturally, this is not a desired state, and organizations want to find ways to maximize both agility and performance.

Announcing Delta Engine for high performance query execution

Delta Engine accelerates the performance of Delta Lake for SQL and data frame workloads through three components: an improved query optimizer, a caching layer that sits between the execution layer and the cloud object storage, and a native vectorized execution engine that’s written in C++.

Delta Engine brings increased performance to all your data workloads through several components

The improved query optimizer extends the functionality already in Spark 3.0 (cost-based optimizer, adaptive query execution, and dynamic runtime filters) with more advanced statistics to deliver up to 18x increased performance in star schema workloads.
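
Delta Engine’s optimizer improvements are applied automatically in Databricks Runtime 7.0 and require no code changes. For orientation only, the open-source Spark 3.0 features it extends can be seen (and toggled) through standard Spark configuration, as in the hedged sketch below; the table paths and column names are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Open-source Spark 3.0 features that the improved optimizer builds upon
spark.conf.set("spark.sql.cbo.enabled", "true")       # cost-based optimizer
spark.conf.set("spark.sql.adaptive.enabled", "true")  # adaptive query execution
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")  # dynamic runtime filters

# A star-schema style aggregation over Delta tables, the workload class cited above
sales = spark.read.format("delta").load("/delta/sales")     # assumption
dates = spark.read.format("delta").load("/delta/dim_date")  # assumption
sales.join(dates, "date_key").groupBy("year").agg({"amount": "sum"}).show()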

Delta Engine’s caching layer automatically chooses which input data to cache for the user, transcoding it along the way in a more CPU-efficient format to better leverage the increased storage speeds of NVMe SSDs. This delivers up to 5x faster scan performance for virtually all workloads.

However, the biggest innovation in Delta Engine for tackling the challenges facing data teams today is the native execution engine, which we call Photon. (We know. It’s an engine within the engine…) This completely rewritten execution engine for Databricks has been built to maximize performance from the changes in modern cloud hardware. It brings performance improvements to all workload types, while remaining fully compatible with open Spark APIs.

In the near future, we’ll dive under the hood of Photon in another blog to show you how it works and, most importantly, how it performs.

Getting started with Delta Engine

By linking these three components together, we think it will be easier for customers to understand how improvements in multiple places within the Databricks code aggregate into significantly faster performance for analytics workloads on data lakes. The improved query optimizer and caching improvements are available today, and we’ll be making Photon available to increasingly more customers throughout the rest of the year.

We’re excited about the value that Delta Engine delivers to our customers. While the time and cost savings are already valuable, its role in the lakehouse pattern supports new advances in how data teams design their data architectures for increased unification and simplicity.

--

Try Databricks for free. Get started today.

The post Introducing Delta Engine appeared first on Databricks.

Introducing Koalas 1.0


Koalas was first introduced last year to provide data scientists using pandas with a way to scale their existing big data workloads by running them on Apache Spark™ without significantly modifying their code. Today at Spark + AI Summit 2020, we announced the release of Koalas 1.0. It now implements the most commonly used pandas APIs, with 80% coverage of all the pandas APIs. In addition, Koalas supports Apache Spark 3.0, Python 3.8, Spark accessor, new type hints, and better in-place operations. This blog post covers the notable new features of this 1.0 release, ongoing development, and current status.

If you are new to Koalas and would like to learn more about how to use it, please read the launch blog post, Koalas: Easy Transition from pandas to Apache Spark. 

Rapid growth and development

The open-source Koalas project has evolved considerably. At launch, the pandas API coverage in Koalas was around 10%–20%. With heavy development from the community over many, frequent releases, the pandas API coverage ramped up very quickly and is now close to 80% in Koalas 1.0.

Increase in API coverage as Koalas development progressed from 0.1.0 to the current 1.0.0 release.

In addition, the number of Koalas users has increased rapidly since the initial announcement, comprising one-fifth of PySpark downloads, roughly suggesting that 20% of PySpark users use Koalas.
 
Koalas’ use and adoption have grown rapidly since its April 2019 release, with it now comprising 20% of all PySpark downloads.

Better pandas API coverage

Koalas implements almost all widely used APIs and features in pandas, such as plotting, grouping, windowing, I/O, and transformation.

In addition, Koalas APIs such as transform_batch and apply_batch can directly leverage pandas APIs, enabling almost all pandas workloads to be converted into Koalas workloads with minimal changes in Koalas 1.0.0.
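
A minimal sketch of the batch APIs mentioned above, as documented for Koalas 1.x; the function bodies here are arbitrary examples, not code from the release.

import databricks.koalas as ks

kdf = ks.DataFrame({"B": [1, 2, 3], "C": [4, 6, 5]})

# apply_batch hands each batch to the function as a pandas DataFrame,
# so existing pandas code can often be reused as-is.
def pandas_logic(pdf):
    # pdf is a pandas DataFrame
    return pdf[["B", "C"]] / pdf[["B", "C"]].sum()

result = kdf.koalas.apply_batch(pandas_logic)

# transform_batch is similar, but the output must preserve the input length.
same_length = kdf.koalas.transform_batch(lambda pdf: pdf + 1)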

Apache Spark 3.0, Python 3.8 and pandas 1.0

Koalas 1.0.0 supports Apache Spark 3.0, and Koalas users can switch their Spark version with near-zero changes. Apache Spark 3.0 includes more than 3,400 fixes, and Koalas shares those fixes in many components. Please see the blog post, Introducing Apache Spark 3.0.

With Apache Spark 3.0, Koalas supports the latest Python 3.8 version, which has many significant improvements, as you can see in the Python 3.8.0 release notes. Koalas exposes many APIs similar to pandas in order to execute native Python code against a DataFrame, which benefits from the Python 3.8 support. In addition, Koalas aggressively leverages Python type hints, which are under heavy development in Python. Some type hinting features in Koalas will likely only be available with newer Python versions.

One of the goals in Koalas 1.0.0 is to track the latest pandas releases and cover most of the APIs in pandas 1.0. API coverage has been measured and improved in addition to keeping up to date with API changes and deprecation. Koalas also supports the latest pandas version as a Koalas dependency, so users of the latest pandas version can easily jump into Koalas.

Spark accessor

The Spark accessor was introduced in Koalas 1.0.0 so that Koalas users can leverage existing PySpark APIs more easily. For example, you can apply PySpark functions as shown below:

import databricks.koalas as ks
import pyspark.sql.functions as F

# Apply a PySpark function to a Koalas Series via the Spark accessor.
kss = ks.Series([1, 2, 3, 4])
kss.spark.apply(lambda s: F.collect_list(s))

You can even convert a Koalas series to a PySpark column and use it with Series.spark.transform.

from databricks import koalas as ks

df = ks.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
# Convert another Series to a PySpark column and use it inside the transform.
df.a.spark.transform(lambda c: c + df.b.spark.column)

PySpark features such as caching the DataFrame are also available under Spark accessor:

from databricks import koalas as ks

df = ks.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)])
df = df.transform(lambda x: x + 1) # transform Koalas DataFrame

with df.spark.cache() as cached_df:
    # Transformed Koalas DataFrame is cached,
    # and it only requires to transform once even
    # when you trigger multiple executions.
    print(cached_df.count())
    print(cached_df.to_pandas())

Faster performance

Many Koalas APIs depend on pandas UDFs under the hood. Apache Spark 3.0 introduces new pandas UDFs, which Koalas uses internally to speed up performance in APIs such as DataFrame.apply(func) and DataFrame.apply_batch(func).

Koalas 1.0.0 achieves significant performance gains over previous versions when running with Spark 3.0.0.

In Koalas 1.0.0 with Spark 3.0.0, we’ve seen 20%–25% faster performance in benchmarks.

Better type hint support

Most Koalas APIs that execute Python native functions actually take and output pandas instances. Previously, it was necessary to use Koalas instances for the return type hints, which looked slightly awkward.

def pandas_div(pdf) -> ks.DataFrame[float, float]:
    # pdf is actually a pandas DataFrame.
    return pdf[['B', 'C']] / pdf[['B', 'C']]

df = ks.DataFrame({'A': ['a', 'a', 'b'], 'B': [1, 2, 3], 'C': [4, 6, 5]})
df.groupby('A').apply(pandas_div)

In Koalas 1.0.0 with Python 3.7 and later, you can also use pandas instances in the return type:

def pandas_div(pdf) -> pd.DataFrame[float, float]:
    return pdf[['B', 'C']] / pdf[['B', 'C']]

In addition, a new type hinting syntax has been experimentally introduced to allow users to specify column names in the type hints:

def pandas_div(pdf) -> pd.DataFrame['B': float, 'C': float]:
    return pdf[['B', 'C']] / pdf[['B', 'C']]

Users can also experimentally use pandas dtype instances and column indexes for the return type hint:

def pandas_div(pdf) -> pd.DataFrame[new_pdf.dtypes]:
    return pdf[['B', 'C']] / pdf[['B', 'C']]

def pandas_div(pdf) -> pd.DataFrame[zip(new_pdf.columns, new_pdf.dtypes)]:
    return pdf[['B', 'C']] / pdf[['B', 'C']]

Broader plotting support

The API coverage in Koalas’ plotting capabilities has reached 90% in Koalas 1.0.0. Visualization can now easily be done in Koalas, the same way it is done in pandas. For example, the same API call used in pandas to draw area charts can also be used against a Koalas DataFrame.

kdf = ks.DataFrame({
    'sales': [3, 2, 3, 9, 10, 6, 3],
    'signups': [5, 5, 6, 12, 14, 13, 9],
    'visits': [20, 42, 28, 62, 81, 50, 90],
}, index=pd.date_range(start='2019/08/15', end='2020/03/09', freq='M'))
kdf.plot.area()

The example draws an area chart and shows the trend in the number of sales, sign-ups, and visits over time.
Example area chart, demonstrating Koalas 1.0.0’s increased API coverage and plotting capabilities.

Wider support of in-place update

In Koalas 1.0.0, in-place updates on a Series are naturally applied to its DataFrame, as if the DataFrame were fully mutable. Previously, several cases of in-place updates on a Series were not reflected in the DataFrame.

For example, an in-place Series.fillna updates its DataFrame as well.

import numpy as np
import databricks.koalas as ks

kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6]})
kser = kdf.x
kser.fillna(0, inplace=True)  # kdf.x is updated in place as well

In addition, it is now possible to use the accessors to update a Series and have the changes reflected in the DataFrame, as shown below.

import numpy as np
import databricks.koalas as ks

# Update via the Series accessor...
kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6]})
kser = kdf.x
kser.loc[2] = 30

# ...or update via the DataFrame; the change is reflected in the Series too.
kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6]})
kser = kdf.x
kdf.loc[2, "x"] = 30

Better support of missing values, NaN and NA

There are several subtle differences in handling missing data between PySpark and pandas. For example, missing data is often represented as None in PySpark but as NaN in pandas. In addition, pandas has introduced a new experimental NA value, which is currently not supported very well in Koalas.

Most other cases are now fixed, and Koalas is under heavy development to incrementally address this issue. For example, Series.fillna now handles NaN properly in Koalas 1.0.0.

Get started with Koalas 1.0

There are many ways to install Koalas, such as with package managers like pip or conda. The instructions are available in the Koalas installation guide. For Databricks Runtime users, you can follow these steps to install a library on Databricks.

Please also refer to the Getting Started section in the Koalas documentation, which contains many useful resources.

If you have been holding off on trying Koalas, now is the time. Koalas brings a more mature implementation of pandas that’s designed to help you scale your work on Spark. Large data sets should never be a blocker to data science projects, and Koalas helps make it easy to get started.

--

Try Databricks for free. Get started today.

The post Introducing Koalas 1.0 appeared first on Databricks.

Welcoming Redash to Databricks


This morning at Spark + AI Summit, we announced that Databricks has acquired Redash, the company behind the popular open source project of the same name. With this acquisition, Redash joins Apache Spark, Delta Lake, and MLflow to create a larger and more thriving open source ecosystem to give data teams best-in-class tools. I would like to take this opportunity to send a warm public welcome to the Redash team and the open source community, and share with you our thinking behind the acquisition.
 
Databricks welcomes Redash
As part of the announcement, we also shared our plan for a hosted version of Redash that will be fully integrated into the Databricks platform to create a rich visualization and dashboarding experience. The integrated experience is currently available in private preview, and you can sign up for the private preview waitlist to be the first to try it out.

What is Redash?

Redash is a collaborative visualization and dashboarding platform designed to enable anyone, regardless of their level of technical sophistication, to share insights within and across teams. SQL users leverage Redash to explore, query, visualize, and share data from any data sources. Their work in turn enables anybody in their organization to use the data. Every day, millions of users at thousands of organizations around the world use Redash to develop insights and make data-driven decisions.

Redash includes the following features:

  1. Query editor: Quickly compose SQL and NoSQL queries with a schema browser and auto-complete.
  2. Visualization and dashboards: Create beautiful visualizations with drag and drop, and combine them into a single dashboard.
  3. Sharing: Collaborate easily by sharing visualizations and their associated queries, enabling peer review of reports and queries.
  4. Schedule refreshes: Automatically update your charts and dashboards at regular intervals you define.
  5. Alerts: Define conditions and be alerted instantly when your data changes.
  6. REST API: Everything that can be done in the UI is also available through the REST API (a brief example of calling it follows this list).
  7. Broad support for data sources: Extensible data source API with native support for a long list of common SQL, NoSQL databases and platforms.
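
As a small, hedged illustration of the REST API item above, listing and refreshing queries from Python might look like the sketch below. The endpoints and key-based auth header follow Redash’s documented conventions, but treat the exact paths, the base URL, the API key, and the query id as assumptions for your deployment.

import requests

REDASH_BASE = "https://redash.example.com"  # assumption: your Redash instance
API_KEY = "<your user API key>"             # assumption

headers = {"Authorization": f"Key {API_KEY}"}

# List queries visible to this user
queries = requests.get(f"{REDASH_BASE}/api/queries", headers=headers).json()

# Trigger a refresh of a single query by id (hypothetical id)
requests.post(f"{REDASH_BASE}/api/queries/42/refresh", headers=headers)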

Easily run SQL queries against Delta Lake or any other data sources

Redash features a query editor, which allows you to easily run SQL queries against Delta Lake or any other data source.

Quickly turn results into visualizations

Use Redash to quickly turn results into visualizations.

Share live dashboards with collaborators

Use Redash to build dashboards and share them live with collaborators.

Redash and Databricks

We first heard about Redash a few years ago through some of our early customers. As time progressed, more and more of them asked us to improve the integration between Databricks and Redash. Earlier this year, we invited Arik Fraimovich, Redash’s founder and CEO, to visit Databricks to discuss how we can collaborate and make data easier to consume.

Within the first hour of meeting Arik, it became obvious to us that the two companies have a great deal in common. Our acquisition of Redash was driven not only by the great community and product they’ve developed, but also by the core values we share. Both our organizations have sought to make it easy for data practitioners to collaborate around data and to democratize its access for all teams. Most important, though, has been the alignment of our cultures to help data teams solve the world’s toughest problems with open technologies.

We’re excited to welcome Arik and the Redash team to Databricks, and to further develop Redash together and deliver a seamless and more powerful experience for our customers and the broader open source communities.

Manage, use, and now consume data with a single platform

The new integrated Redash service gives SQL analysts a familiar home in Databricks, and gives data scientists and data engineers a place to easily query and visualize data in Delta Lakes and other data sources.

It integrates seamlessly with the existing Databricks platform: the service will be available in all data centers Databricks operates in; identity management and data governance are unified without additional configuration; SQL endpoints are automatically populated in Redash; and catalogs and metadata are shared by the two products.

And most importantly, for customers who are already using the two products: shift+enter (the keyboard shortcut to execute a query in Databricks) will also now function the same way in Redash! 🙂

Creating an open, unified platform for all data teams

Our vision at Databricks has been to deliver a unified data analytics platform that can help every data team throughout a company solve the world’s toughest problems — including data analysts, data engineers, data scientists, and machine learning engineers. By giving each team the tools they need for their own work, while also having a shared platform where they can collaborate, every data team can be successful together. This ultimately helps to deliver on the promise of the lakehouse data management paradigm, by combining the best capabilities of data lakes and data warehouses together, in a unified architecture where every team can work together on the same complete and authoritative source of data.

Again, we’re excited to be bringing this new data visualization and dashboarding experience to our customers. The integrated Redash experience is currently available in private preview, and you can sign up for the private preview waitlist to be the first to try it out.

Read the Redash team’s blog post on redash.io

--

Try Databricks for free. Get started today.

The post Welcoming Redash to Databricks appeared first on Databricks.

Introducing GlowGR: An industrial-scale, ultra-fast and sensitive method for genetic association studies


Today, we announce that we are making a new whole genome regression method available to the open source bioinformatics community as part of Project Glow.

Large cohorts of individuals with paired clinical and genome sequence data enable unprecedented insight into human disease biology. Population studies such as the UK Biobank, Genomics England, and Genome Asia 100k are driving a need for innovation in methods for working with genetic data. These methods include genome wide association studies (GWAS), which enrich our understanding of the genetic architecture of disease and are used in cutting-edge industrial applications, such as identifying therapeutic targets for drug development. However, these datasets pose novel statistical and engineering challenges. The statistical challenges have been addressed by tools such as SAIGE and Bolt-LMM, but they are difficult to set up and prohibitively slow to run on biobank-scale datasets.

In a typical GWAS, a single phenotype (the observable traits of an organism) such as cholesterol levels or diabetes diagnosis status is tested for statistical association with millions of genetic variants across the genome. Sophisticated mixed model and whole genome regression-based approaches have been developed to control for relatedness and population structure inherent to large genetic study populations when testing for genetic associations; several methods such as BOLT-LMM, SAIGE, and fastGWA use a technique called whole genome regression to sensitively analyze a single phenotype in biobank-scale projects. However, deeply phenotyped biobank-scale projects can require tens of thousands of separate GWASs to analyze the full spectrum of clinical variables, and current tools are still prohibitively expensive to run at scale. In order to address the challenge of efficiently analyzing such datasets, the Regeneron Genetics Center has just developed a new approach for the whole-genome regression method that enables running GWAS across upwards of hundreds of phenotypes simultaneously. This exciting new tool provides the same superior test power as current state-of-the-art methods at a small fraction of the computational cost.

This new whole genome regression (WGR) approach recasts the whole genome regression problem as an ensemble model of many small, genetic region-specific models. This method is described in a preprint released today, and implemented in the C++ tool regenie. As part of the collaboration between the Regeneron Genetics Center and Databricks on the open source Project Glow, we are excited to announce GlowGR, a lightning-fast and highly scalable distributed implementation of this WGR algorithm, designed from the ground up with Apache Spark™ and integrated with other Glow functionality. With GlowGR, performing WGR analyses on dozens of phenotypes can be accomplished simultaneously in a matter of minutes, a task that would require hundreds or thousands of hours with existing state-of-the-art tools. Moreover, GlowGR distributes along both the sample and genetic variant matrix dimensions, allowing for linear scaling and a high degree of data and task parallelism. GlowGR plugs seamlessly into any existing GWAS workflow, providing an immediate boost to association detection power at a negligible computational cost.

Achieving High Accuracy and Efficiency with Whole-Genome Regression

This whole genome regression tool has a number of virtues. First, it is more efficient: as implemented in the single-node, open-source regenie tool, whole genome regression is orders of magnitude faster than SAIGE, Bolt-LMM, or fastGWA, while producing equivalent results (Figure 1). Second, it is straightforward to parallelize: in the next section, we describe how we implemented whole genome regression using Apache Spark in the open-source Project Glow.


Figure 1: Comparison of GWAS results for three quantitative phenotypes from the UK Biobank project, produced by REGENIE/GloWGR, BOLT-LMM, and fastGWA.

In addition to performance considerations, the whole genome regression approach produces covariates that are compatible with standard GWAS methods, and which eliminate spurious associations caused by population structure that are seen with traditional approaches. The Manhattan plots in figure 2 below compare the results of a traditional linear regression GWAS using standard covariates, to a linear regression GWAS using the covariates generated by WGR. This flexibility of GlowGR is another tremendous advantage over existing GWAS tools, and will allow for a wide variety of exciting extensions to the association testing framework that is already available in Glow.


Figure 2: Comparison of GWAS results of the quantitative phenotype bilirubin from the UK Biobank project, evaluated using standard linear regression and linear regression with GlowGR. The heightened peaks in the highlighted regions show the increase in power to detect subtler associations that is gained with GlowGR.

Figure 3 shows performance comparisons between GlowGR, REGENIE, BoltLMM, and fastGWA. We benchmarked the whole genome regression test implemented in Glow against the C++ implementation available in the single-node regenie tool to validate the accuracy of the method. We found that the two approaches achieve statistically identical results. We also found that the Apache Spark™ based implementation in Glow scales linearly with the number of nodes used.


Figure 3: Left: end-to-end GWAS runtime comparison for 50 quantitative traits from the UK Biobank project. Right: runtime comparison to fit WGR models against 50 quantitative phenotypes from the UK Biobank project. GlowGR scales well with cluster size, allowing for modeling of dozens of phenotypes in minutes without costing additional CPU efficiency. The exact list of phenotypes and computation environment details can be found here.

Scaling Whole Genome Regression within Project Glow

Performing WGR analysis with GlowGR has five steps (a conceptual sketch follows this list):

  • Dividing the genotype matrix into contiguous blocks of SNPs (~1000 SNPs per block, referred to as loci)
  • Fitting multiple ridge models (~10) within each locus
  • Using the resulting ridge models to reduce the locus from a matrix of 1,000 features to 10 features (each feature is the prediction of one of the ridge models)
  • Pooling the resulting features of all loci into a new reduced feature matrix X (N individuals by L loci x J ridge models per locus)
  • Fitting a final model from X for the genome-wide contribution to phenotype Y.
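
The following is a purely conceptual, single-node NumPy/scikit-learn rendering of those five steps. It is not the GlowGR API, which distributes the same idea with Apache Spark, and it omits the cross-validation scheme the real method uses to keep the stacked predictions out-of-sample; the dimensions and alpha values are arbitrary assumptions.

# Conceptual sketch only -- not the GlowGR API.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_individuals, n_snps, block_size = 500, 5000, 1000
alphas = [0.1, 1.0, 10.0]  # J ridge models per locus

G = rng.integers(0, 3, size=(n_individuals, n_snps)).astype(float)  # genotype matrix
y = rng.normal(size=n_individuals)                                  # quantitative phenotype

# Steps 1-4: per-locus ridge models reduce each block of ~1,000 SNPs to J features
reduced = []
for start in range(0, n_snps, block_size):
    block = G[:, start:start + block_size]
    for alpha in alphas:
        model = Ridge(alpha=alpha).fit(block, y)
        reduced.append(model.predict(block))
X = np.column_stack(reduced)  # N individuals x (L loci * J models)

# Step 5: a final model estimates the genome-wide contribution to the phenotype
genome_wide_prediction = Ridge(alpha=1.0).fit(X, y).predict(X)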

Glow provides the easy-to-use abstractions shown in figure 4 for transforming large genotype matrices into the blocked matrix (below, left) and then fitting the whole genome regression model (below, right). These can be applied to data loaded in any of the genotype file formats that Glow understands, including VCF, Plink, and BGEN formats, as well as genotype data stored in Apache Spark™ native file formats like Delta Lake.


Figure 4: Creating a matrix grouped by locus and fitting mixed ridge regression models using GlowGR

Glow provides an implementation of the WGR method for quantitative traits, and a binary trait variant is in progress. The covariate-adjusted phenotype created by GlowGR can be written out as an Apache Parquet™ or Delta Lake dataset, which can easily be loaded by and analyzed within Apache Spark, pandas, and other tools. Ultimately, using the covariates computed with WGR in a genome-wide association study is as simple as running the command shown in Figure 5, below. This command is run by Apache Spark™, in parallel, across all of the genetic markers under test.


Figure 5: Updating phenotypes with the WGR results and running a GWAS using the built-in association test methods from Glow

Join us and try whole genome regression in Glow!

Whole genome regression is available in Glow, an open source project hosted on GitHub under the Apache 2 license. You can get started with this notebook that shows how to use GlowGR, by reading the preprint, by reading our project docs, or by creating a fork of the repository to start contributing code today. Glow is installed in the Databricks Genomics Runtime (Azure | AWS), and you can start a preview today.

--

Try Databricks for free. Get started today.

The post Introducing GlowGR: An industrial-scale, ultra-fast and sensitive method for genetic association studies appeared first on Databricks.

MLflow Joins the Linux Foundation to Become the Open Standard for Machine Learning Platforms

Watch Spark + AI Summit Keynotes here

At today’s Spark + AI Summit 2020, we announced that MLflow is becoming a Linux Foundation project.

Two years ago, we launched MLflow, an open source machine learning platform to let teams reliably build and productionize ML applications. Since then, we have been humbled and excited by its adoption by the data science community. With more than 2.5 million monthly downloads, 200 contributors from 100 organizations, and 4x year-on-year growth, MLflow has become the most widely used open source ML platform, demonstrating the benefits of an open platform to manage ML development that works across diverse ML libraries, languages, and cloud and on-premise environments.

Together with the community, we intend to keep growing MLflow. Thus, we're happy to announce that we've moved MLflow to the Linux Foundation, a vendor-neutral non-profit organization, to manage the project long-term. We are excited to see how this will bring even more contributions to MLflow.

At Databricks, we’re also doubling down on our investment in MLflow. At Spark+AI Summit we talked about three ongoing efforts to further simplify the machine learning lifecycle: autologging, model governance and model deployment.

Autologging: data versioning and reproducibility

MLflow already has the ability to track metrics, parameters and artifacts as part of experiments. You can manually declare each element to record, or simply use the autologging capability to log all this information with just one line of code for the supported libraries. Since introducing this feature last year, we’ve seen rapid adoption of autologging, so we’re excited to extend the capabilities of this feature.

One of the biggest challenges machine learning practitioners face is how to keep track of intermediate data sets (training and testing) used during model training. Therefore, we introduced autologging for Apache Spark data sources in MLflow 1.8, our first step into data versioning with MLflow. This means that if you’re using Spark to create your features or training pipelines, you can turn on Spark autologging and automatically record exactly which data was queried for your model.

And if you’re using Delta Lake — which supports table versioning and traveling back in time to see an old version of the data — we also record exactly which version number was used. This means that if you’re training a model based on a Delta table and use Spark autologging, MLflow automatically records which version of the data was used. This information can be useful for debugging your models or reproducing a previous result.
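
For example, turning on Spark datasource autologging is a one-line call before you read your training data. This is a minimal sketch: the Delta path and the logged parameter are placeholders, spark is the SparkSession provided in Databricks notebooks, and mlflow.spark.autolog() needs the MLflow Spark listener available on the cluster (preinstalled on Databricks ML runtimes).

import mlflow
import mlflow.spark

# Record which Spark data sources (path, format, and Delta table version) are read during the run
mlflow.spark.autolog()

df = spark.read.format("delta").load("/mnt/features/training")  # hypothetical Delta table
pdf = df.toPandas()

with mlflow.start_run():
    # ... train a model on pdf here; the datasource info is attached to the run automatically
    mlflow.log_param("num_rows", len(pdf))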


Figure 1: MLflow 1.8 introduced autologging from Spark data sources including Delta table versions

Autologging currently supports six libraries: TensorFlow, Keras, Gluon, LightGBM, XGBoost and Spark. There is also ongoing work from Facebook to add support for PyTorch soon, and from Databricks to add support for scikit-learn.

For users of the Databricks platform, we’re also integrating autologging with the cluster management and environment features in Databricks. This means that  if you’re tracking your experiments on Databricks — from a notebook or a job — we will automatically record the snapshot of the notebook that you used, the cluster configuration, and the full list of library dependencies.

This will allow you and your peers to quickly recreate the same conditions from when the run was originally logged. Databricks will clone the exact snapshot of the Notebook, create a new cluster with the original cluster specification, and install all library dependencies needed. This makes it easier than ever to pick up from a previous run and iterate on it, or to reproduce a result from a colleague.


Figure 2: MLflow supports reproducibility by allowing data teams to replicate a run based on the auto-logged notebook snapshot, cluster configuration, and library dependencies on Databricks.

Stronger model governance with model schemas and MLflow Model Registry tags

Once you’ve logged your experiments and have produced a model, MLflow provides the ability to register your model into one centralized repository – the MLflow Model Registry – for model management and governance. The MLflow Model Registry rate of adoption is growing exponentially and we’re seeing hundreds of thousands of models being registered on Databricks every week. We’re excited to add more features to strengthen model governance with the Model Registry.

One of the most common pain points when deploying models is making sure that the schema of the production data used to score your models is compatible with the schema of the data used to train the model, and that the output from a new model version is what you expect in production. Therefore, we're extending the MLflow model format to include support for model schemas, which will store the feature and prediction requirements for your models (input/output names and data types). A mismatch of model schemas when a new model is deployed is one of the most common sources of production outages in ML. With the integration of the model schema and the Model Registry, MLflow will allow you to compare model versions and their schemas, and alert you if there are incompatibilities.
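
With MLflow 1.9 you can already attach such a schema when logging a model by supplying a model signature. Here is a minimal sketch with scikit-learn; the dataset and model are illustrative.

import mlflow
import mlflow.sklearn
import pandas as pd
from mlflow.models.signature import infer_signature
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
model = RandomForestClassifier(n_estimators=50).fit(X, iris.target)

# Infer the input/output schema from the training data and the model's predictions
signature = infer_signature(X, model.predict(X))

with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        signature=signature,      # stored in the MLmodel file for downstream compatibility checks
        input_example=X.head(3),  # example payload, useful when testing the served model
    )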

Figure 3: With built-in model schema compatibility checking, MLflow Model Registry helps eliminate mismatched model schemas, one of the biggest sources of ML production outages.

To make custom model management workflows easier and more automated, we’re introducing custom tags as part of the MLflow Model Registry.

Many organizations have custom internal processes for validating models. For example, models may have to pass legal review for GDPR compliance or pass a performance test before deploying them to edge devices. Custom tags allow you to add your own metadata for these models and keep track of their state. This capability is also provided through APIs so you can run automatic CI/CD pipelines that test your models, add these tags, and make it very easy to check whether your model is ready for deployment.
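
As a sketch of what such an automated check might look like, the snippet below uses the MlflowClient tag APIs that arrived in later MLflow releases; the model name, version, tag keys, and the validation helper are all illustrative assumptions rather than a prescribed workflow.

from mlflow.tracking import MlflowClient

def run_gdpr_checks():
    # Placeholder for your organization's validation suite
    return True

client = MlflowClient()
name, version = "churn-classifier", 3   # hypothetical registered model and version

# A CI/CD job runs its checks and records the outcome as tags on the model version
client.set_model_version_tag(name, version, "gdpr_review", "passed" if run_gdpr_checks() else "failed")
client.set_model_version_tag(name, version, "perf_test", "passed")

# Downstream automation can gate deployment on those tags
mv = client.get_model_version(name, version)
ready_to_deploy = mv.tags.get("gdpr_review") == "passed" and mv.tags.get("perf_test") == "passed"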

Figure 4: The introduction of custom tags in MLflow Model Registry makes it easier for data teams to validate and monitor the state of their models.

Accelerating model deployment with simplified API and model serving on Databricks

MLflow already has integrations with several model deployment options, including batch or real-time serving platforms. Because we’ve seen an increasing number of contributions in this space, we wanted to provide the community with a simpler API to manage model deployment.

The new Deployments API for managing and creating deployment endpoints will give you the same commands to deploy to a variety of environments, removing the need to write custom code for the individual specifications of each. This is already being used to develop two new endpoints for RedisAI and Google Cloud Platform, and we are working on porting a lot of the past integrations (including Kubernetes, SageMaker and AzureML) to this API. This will give you a simple and uniform way to manage deployments and to push the models to different serving platforms as needed.

mlflow deployments create -t gcp -n spam -m models:/spam/production

mlflow deployments predict -t gcp -n spam -f emails.json
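
The same operations are available from Python through the plugin-based deployments API. This is a sketch mirroring the CLI example above: the gcp target assumes the corresponding deployment plugin is installed, and the endpoint name, model URI, and input file are placeholders.

import pandas as pd
from mlflow.deployments import get_deploy_client

client = get_deploy_client("gcp")   # resolves the installed deployment plugin for this target

# Create (or update) an endpoint serving the latest Production version of the model
client.create_deployment(name="spam", model_uri="models:/spam/production")

# Score a small batch of records against the deployed endpoint
emails = pd.read_json("emails.json")
predictions = client.predict("spam", emails)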

Finally, for Databricks customers, we are excited to announce that we’ve integrated Model Serving as a turnkey solution on Databricks.

Setting up environments to serve ML models as REST endpoints can be cumbersome and require significant integration work. With this new capability Databricks streamlines the process of taking models from experimentation to production. While this service is in preview, we recommend its use for low throughput and non-critical applications.


Figure 5: MLflow’s integrated Model Serving solution streamlines the process of taking models from experimentation to production.

Next Steps

You can watch the official announcement and demo by Matei Zaharia and Sue Ann Hong at Spark + AI Summit:

Ready to get started with MLflow? You can read more about MLflow and how to use it on AWS or Azure. Or you can try an example notebook [AWS] [Azure]

If you are new to MLflow, read the open source MLflow quickstart with the latest MLflow 1.9 to get started with your first MLflow project. For production use cases, read about Managed MLflow on Databricks and get started using the MLflow Model Registry.

--

Try Databricks for free. Get started today.

The post MLflow Joins the Linux Foundation to Become the Open Standard for Machine Learning Platforms appeared first on Databricks.

Introducing the Next-Generation Data Science Workspace

At today’s Spark + AI Summit 2020, we unveiled the next generation of the Databricks Data Science Workspace: An open and unified experience for modern data teams.

Existing solutions make data teams choose from three bad options. Giving data scientists the freedom to use any open-source tools on their laptops doesn’t provide a clear path to production and governance and creates compliance risks. Simply hosting those same tools in the cloud may solve some of the data privacy and security issues but doesn’t provide a clear path to production, nor improve productivity and collaboration. Finally, the most robust and scalable DevOps production environments can hinder innovation and experimentation by slowing data scientists down.

The next-generation Data Science Workspace on Databricks navigates these trade-offs to provide an open and unified experience for modern data teams. Specifically, it will provide you with the following benefits:

  • Open and collaborative notebooks on a secure and scalable platform: Databricks is built on the premise that developer environments need to be open and collaborative. Because Databricks is rooted in open source, your tools of choice are made available on an open and collaborative platform capable of running all your analytics workloads at massive big data scale while helping you meet security and compliance needs. With native support for the Jupyter notebook format, the next-generation Data Science Workspace eliminates the trade-off between open standards and collaborative features provided by Databricks.
  • Best-of-breed developer environment for Git-based collaboration and reproducibility: The industry is already leveraging best practices for robust code management in complex settings, and they are Git based. So we further integrated our platform with the Git ecosystem to help bring those best practices to data engineering and data science, where reproducibility is becoming more and more important. To facilitate this integration, we’re introducing a new concept called Databricks Projects. This will allow data teams to keep all project dependencies in sync via Git repositories.
  • Low-friction CI/CD pipelines from experimentation to production deployments: With a new API surface based on the aforementioned Git-based Projects functionality, we’re introducing new capabilities to more seamlessly integrate developer workflows with automated CI/CD pipelines. This will allow data teams to take data science and ML code from experimentation to production faster, leveraging scalable production jobs, MLflow Model Registry, and new model serving capabilities — all on an open and unified platform that can scale to meet any use case.

We’re very excited about bringing these innovations to the Unified Data Analytics Platform. Over the past few years, we continuously gathered feedback from thousands of users to help shape our road map and design these features. In order to enable this new experience, we’ll release new capabilities in phases, as described below.

Watch Spark + AI Summit Keynotes here

Available in preview: Git-based Databricks Projects

First, we’re introducing a new Git-based capability named Databricks Projects to help data teams keep track of all project dependencies including notebooks, code, data files, parameters, and library dependencies via Git repositories (with support for Azure DevOps, GitHub and BitBucket as well as newly added support for GitLab and the on-premises enterprise/server offerings of these Git providers).

Databricks Projects allow practitioners to create new or clone existing Git repositories on Databricks to carry out their work, quickly experiment on the latest data, and easily access the scalable computing resources they need to get their job done while meeting security and compliance needs.


Figure 1: Databricks Projects allows data teams to quickly create or clone existing Git repositories as a project.

This also means that exploratory data analysis, modeling experiments and code reviews can be done in a robust, collaborative and reproducible way. Simply create a new branch, edit your code in open and collaborative notebooks, commit and push changes.


Figure 2: The new dialog for Databricks' Git-based Projects allows developers to switch between branches, create new branches, pull changes from a remote repository, stage files, and commit and push changes.

In addition, this will also help accelerate the path from experimentation to production by enabling data engineers and data scientists to follow best practices of code versioning and CI/CD. As part of the new Projects feature, a new set of APIs allows developers to set up robust automation to take data science and ML code from experimentation to production faster.


Figure 3: With Git-based Projects and associated APIs, the new Databricks Data Science Workspace makes the path from experimentation to production easier, faster and more reliable.

As a result, setting up CI/CD pipelines to manage data pipelines, keep critical dashboards up to date, or iteratively train and deploy new ML models to production has never been this seamless. Data engineers and data scientists using the Git-based Projects feature can deliver their code to repositories easily and on time, where Git automation can pick it up and run tests before the code is deployed to production Projects on Databricks, improving the reliability and availability of production systems.

This enables a variety of use cases, from performing exploratory data analysis, to creating dashboards based on the most recent data sets, training ML models and deploying them for batch, streaming or real-time inference — all on an open and unified platform that can scale to meet demanding business needs.

Coming soon: Project-scoped environment configuration with Conda

At the intersection of Git-based Projects and environment management is the ability to store your environment configuration alongside your code. We'll integrate the Databricks Runtime for Machine Learning with Projects to automatically detect the presence of environment configuration files (e.g., requirements.txt or conda.yml) and activate an environment scoped to your project. This means you'll no longer have to worry about installing library dependencies, such as NumPy, yourself.


Figure 4: Integration between Databricks Runtime and Projects allows data teams to automatically detect the presence of environment specification files (e.g. requirements.txt) and install the library dependencies.

To go beyond what you are used to on your laptop, Databricks makes sure that, once an environment is created for your Project, all workers of an autoscaling cluster consistently have the exact same environment enabled.

Coming soon: Databricks Notebook Editor support for Jupyter notebooks

The Databricks Notebook Editor already provides collaborative features like co-presence, co-editing and commenting, all in a cloud-native developer environment with access control management and the highest security standards. To unify data teams, the Databricks Notebooks Editor also supports switching between the programming languages Python, R, SQL and Scala, all within the same notebook. Today, the Databricks Notebook Editor is used by tens of thousands of data engineers, data scientists and machine learning engineers daily.

To bring the real-world benefits of the Databricks Notebook Editor to a broader audience, we’ll support Jupyter notebooks in their native format on Databricks, providing you with the ability to edit Jupyter notebooks directly in the Databricks Notebook Editor. As a result, you’ll no longer have to make the trade-off between collaborative features and open-source standards like Jupyter.


Figure 5: Support for opening Jupyter notebooks with the Databricks Notebook Editor  provides data teams with collaborative features for standard file formats.

However, if your tool of choice is Jupyter, you’ll still be able to edit the same notebook using Jupyter embedded directly in Databricks, as shown below.


Figure 6: Support for opening Jupyter notebooks with JupyterLab embedded within the Databricks Workspace.

Next steps

You can watch the official announcement and demo by Clemens Mewald and Lauren Richie at Spark + AI Summit:

As shared in our keynote today, we’ve been testing these capabilities in private preview for a while and are now excited to open up access to existing customers in preview. Sign up here to request access. We look forward to your feedback!

--

Try Databricks for free. Get started today.

The post Introducing the Next-Generation Data Science Workspace appeared first on Databricks.


Announcing MLflow Model Serving on Databricks

Databricks MLflow Model Serving provides a turnkey solution to host machine learning (ML) models as REST endpoints that are updated automatically, enabling data science teams to own the end-to-end lifecycle of a real-time machine learning model from training to production.

When it comes to deploying ML models, data scientists have to make a choice based on their use case. If they need a high volume of predictions and latency is not an issue, they typically perform inference in batch, feeding the model with large amounts of data and writing the predictions into a table. If they need predictions at low latency, e.g. in response to a user action in an app, the best practice is to deploy ML models as REST endpoints. This allows apps to send requests to an endpoint that’s always up and receive the prediction immediately.

On Databricks, we have already simplified the workflow of deploying ML models in a batch or streaming fashion to big data, using MLflow’s spark_udf. For situations that require deploying models in a real-time fashion, we are introducing Databricks MLflow Model Serving: a new turnkey service that simplifies both the workflow of initially deploying a model and also of keeping it updated. Databricks MLflow Model Serving ties directly into the MLflow Model Registry to automatically deploy new versions of a model and route requests to them, making it easy for ML developers to directly manage which models they are serving.
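
For the batch and streaming path mentioned above, scoring with spark_udf looks roughly like the sketch below; the registered model name, stage, and table paths are placeholders.

import mlflow.pyfunc
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Load the latest Production version from the Model Registry as a Spark UDF
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn-classifier/Production")

features = spark.read.format("delta").load("/mnt/features/customers")   # hypothetical feature table
feature_cols = [c for c in features.columns if c != "customer_id"]

scored = features.withColumn("prediction", predict_udf(*[F.col(c) for c in feature_cols]))
scored.write.format("delta").mode("overwrite").save("/mnt/predictions/churn")  # hypothetical output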

Watch Spark + AI Summit Keynotes here

Serving models

Today, serving models can be complex because it requires running a separate serving system, such as Kubernetes, which ML developers might not have access to. Moreover, developers must be careful to update the versions of the model used there as they design new models, and route requests to the right model.

Databricks MLflow Model Serving solves this issue by integrating with the Model Registry. The model registry can store models from all machine learning libraries (TensorFlow, scikit-learn, etc), and lets you store multiple versions of a model, review them, and promote them to different lifecycle stages such as Staging and Production. Model Serving makes use of these stages;  you can make the latest production model available at “/model/<model_name>/Production” and other models available at URIs for those specific models. Under the hood, Model Serving manages compute clusters to execute the requests and ensure that they are always up to date and healthy.

Databricks MLflow architecture highlighting Model Serving, giving data teams end-to-end control of the real-time machine learning model development and deployment lifecycle.

Once Model Serving is enabled, a Databricks cluster launches, which hosts all active model versions associated with the registered model as REST endpoints. Each model runs in a conda environment that reflects the environment it was trained with.

Once the endpoint is running, you can test queries from the Databricks UI, or submit them yourself using the REST API. We also integrate with the recently released model schema and examples (available in MLflow 1.9 to allow annotating models with their schema and example inputs) to make it even easier and safer to test out your served model.

Once Model Serving is enabled, a Databricks cluster launches, thus hosting all active model versions associated with the registered model as REST endpoints.

The same request can be sent through the REST API using standard Databricks authentication, for example using curl:

curl -u token:XXX \
  https://dogfood.staging.cloud.databricks.com/model/model_with_example/Production/invocations \
  -H 'Content-Type: application/json; format=pandas-records' \
  -d '[[5.1,3.5,1.4,0.2]]'

Note that the URL contains “Production”, meaning that this is a stable URL that points to the latest Production version. You can also directly reference a model version by number, if you want to lock your application to a specific version (for example “/model/model_with_example/1”).
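
The same request from Python, for reference; the workspace URL and token are placeholders, and the payload format matches the pandas-records content type shown above.

import requests

url = "https://<your-workspace>.cloud.databricks.com/model/model_with_example/Production/invocations"
headers = {
    "Authorization": "Bearer <personal-access-token>",  # a Databricks personal access token
    "Content-Type": "application/json; format=pandas-records",
}

response = requests.post(url, headers=headers, json=[[5.1, 3.5, 1.4, 0.2]])
response.raise_for_status()
print(response.json())   # model predictions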

Evolving your model

Many use cases start with an initial model as a proof-of-concept, but in the course of model development, data scientists often iterate and produce newer and better versions of models. Model Serving makes this process as easy as possible.

Suppose you have Version 1 of your model in production, and are ready to try out and release the next version. You first register the second model in the Model Registry and promote it to “Staging”, indicating that you want to test it out a bit more before replacing your Production version.

Since the model has model serving enabled, new model versions are automatically launched onto the existing cluster as they’re added. You can see below that you have both versions and can query either of them.

Sample MLflow Model Serving UI, demonstrating how it facilitates the iterative process by allowing data teams to automatically launch and test new models in their clusters.

Note the URL for each model: you can query either by the version number (1 or 2) or by the stage (Production or Staging). This way you can have your live site point to the current Production version and have a test site pointed to the Staging version, and it will automatically pick up the latest model versions as they’re promoted through the Registry.

When you’re ready to promote a model version to Production, you simply transition its stage in the Registry, moving it from Staging to Production. This change will be reflected within the served model and REST endpoints within a few seconds — the URL for Production will now point to Version 2.
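
Programmatically, that promotion is a single client call; the model name and version below are placeholders.

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote version 2 to Production; the /Production serving URL picks it up within seconds
client.transition_model_version_stage(
    name="model_with_example",
    version=2,
    stage="Production",
)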

Because everything is running in the same cluster, the marginal resource and time cost of spinning up a new version is very small. You don’t have to worry about a multi-minute iteration cycle, or losing track of old versions.

Monitoring your model

Since model servers are long-lived, it’s important to be able to easily monitor and maintain the availability of your models. Model Serving makes this easy by exposing two kinds of information: logs and events.

Logs for each model version are available via UI and API, allowing you to easily emit and see issues that are related to malformed data or other runtime errors. Events supplement the model’s own logs by detailing when a model process crashed and was restarted, or when a whole virtual machine was lost and replaced. As simple as it sounds, having easy access to these logs and events makes the process of developing, iterating, and maintaining model servers much less time-consuming.

To recap, Model Serving on Databricks provides cost-effective, one-click deployment of models for real-time inference, integrated with the MLflow model registry for ease of management. Use it to simplify your real-time prediction use cases! Model Serving is currently in Private Preview, and will be available as a Public Preview by the end of July. While this service is in preview, we recommend its use for low throughput and non-critical applications.

Happy serving!

--

Try Databricks for free. Get started today.

The post Announcing MLflow Model Serving on Databricks appeared first on Databricks.

Announcing GPU-aware scheduling and enhanced deep learning capabilities

Databricks is pleased to announce the release of Databricks Runtime 7.0 for Machine Learning (Runtime 7.0 ML) which provides preconfigured GPU-aware scheduling and adds enhanced deep learning capabilities for training and inference workloads.

Preconfigured GPU-aware scheduling

Project Hydrogen is a major Apache Spark™ initiative to bring state-of-the-art artificial intelligence (AI) and Big Data solutions together. Its most recent major component, accelerator-aware scheduling, was made available in Apache Spark 3.0 through a collaboration among developers at Databricks, NVIDIA, and other community members.

In Runtime 7.0 ML, Databricks preconfigures GPU-aware scheduling for you on GPU clusters. The default configuration uses one GPU per task, which is ideal for distributed inference workloads and distributed training if you use all GPU nodes. If you want to do distributed training on a subset of nodes, Databricks recommends setting spark.task.resource.gpu.amount to the number of GPUs per worker node in the cluster Spark configuration, to help reduce communication overhead during distributed training.

For PySpark tasks, Databricks automatically remaps assigned GPU(s) to indices 0, 1, …. Under the default configuration that uses one GPU per task, your code can simply use the default GPU without checking which GPU is assigned to the task. This is ideal for distributed inference. See our model inference examples (AWS | Azure).
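
If you do need to inspect the assignment, Spark 3.0 exposes it through the task context. The sketch below outlines a distributed-inference pattern; the column name, return type, and model-loading step are placeholder assumptions.

import pandas as pd
from pyspark import TaskContext
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def predict_udf(features: pd.Series) -> pd.Series:
    # GPU addresses assigned to this task; under the default one-GPU-per-task configuration
    # this is a single device, already remapped to index 0 on Databricks.
    gpus = TaskContext.get().resources()["gpu"].addresses
    # ... load your model onto gpus[0] and run inference on the batch here (placeholder logic)
    return features * 0.0

# scored = df.withColumn("prediction", predict_udf("feature_col"))   # hypothetical usage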

For distributed training tasks with HorovodRunner (AWS | Azure), users do not need to make any modifications when migrating their training code from older versions to the new release.

Simplified data conversion to Deep Learning frameworks

Databricks Runtime 7.0 ML includes Petastorm 0.9.2 to simplify data conversion from Spark DataFrame to TensorFlow and PyTorch. Databricks contributed a new Spark Dataset Converter API to Petastorm to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader. For more details, check out the blog post for Petastorm in Databricks and our user guide (AWS | Azure).
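
A minimal sketch of the converter API follows; the cache directory, feature table, and batch size are placeholders, spark is the SparkSession provided in Databricks notebooks, and the API names reflect Petastorm 0.9's spark module.

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Petastorm materializes the DataFrame to a cache directory before feeding the DL framework
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, "file:///dbfs/tmp/petastorm/cache")

df = spark.read.format("delta").load("/mnt/features/training")   # hypothetical feature table
converter = make_spark_converter(df)

# TensorFlow path
with converter.make_tf_dataset(batch_size=64) as dataset:
    pass   # e.g., model.fit(dataset, ...)

# PyTorch path
with converter.make_torch_dataloader(batch_size=64) as dataloader:
    for batch in dataloader:
        break  # e.g., iterate batches inside your training loop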

NVIDIA TensorRT for high-performance inference

Databricks Runtime 7.0 ML now also includes NVIDIA TensorRT. TensorRT is an SDK that focuses on optimizing pre-trained networks to run efficiently for inference, especially on GPUs. For example, you can optimize the performance of a pre-trained model by using reduced precision (e.g., FP16 instead of FP32) for production deployments of deep learning inference applications. For a pre-trained TensorFlow model, the model can be optimized with the following Python snippet:

# Convert a pre-trained TensorFlow SavedModel into a TensorRT-optimized SavedModel at FP16 precision
from tensorflow.python.compiler.tensorrt import trt_convert as trt

conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(precision_mode='FP16')
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=saved_model_dir,
    conversion_params=conversion_params,
)
converter.convert()                       # build the TensorRT-optimized graph
converter.save(output_saved_model_dir)    # write the optimized SavedModel

After a deep learning model is optimized with TensorRT, it can be used for inference just like an unoptimized model. See our example notebook for using TensorRT with TensorFlow (AWS | Azure).
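
For example, the optimized SavedModel can be loaded and called like any other TF2 SavedModel. In the sketch below, the output directory refers to the snippet above, and the signature key and input shape are assumptions that depend on your model.

import numpy as np
import tensorflow as tf

output_saved_model_dir = "/dbfs/tmp/tensorrt_model"          # path written by converter.save(...) above
loaded = tf.saved_model.load(output_saved_model_dir)
infer = loaded.signatures["serving_default"]                 # default signature key for most SavedModels

batch = tf.constant(np.random.rand(1, 224, 224, 3), dtype=tf.float32)  # hypothetical input shape
outputs = infer(batch)
print({name: tensor.shape for name, tensor in outputs.items()})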

To achieve the best performance and cost for reduced-precision inference workloads, we highly recommend using TensorRT with the newly supported G4 instance types on AWS.

Support for TensorFlow 2

Runtime 7.0 ML includes TensorFlow 2.2. TensorFlow 2 contains many new features as well as major breaking changes. If you are migrating from TensorFlow 1.x, Databricks recommends reading TensorFlow’s official migration guide and Effective TensorFlow 2. If you have to stay with TensorFlow 1.x, you can enable %pip and downgrade TensorFlow, e.g., to 1.15.3, using the following command in a Python notebook:

%pip install tensorflow==1.15.3

Read our blog post and user guide (AWS | Azure) to learn how to enable and use %pip and %conda on Runtime 7.0 ML.

Resources

--

Try Databricks for free. Get started today.

The post Announcing GPU-aware scheduling and enhanced deep learning capabilities appeared first on Databricks.

Meet the 2020 Databricks Data Team Awards Winners

On the closing day of Spark + AI Summit, Databricks CEO Ali Ghodsi recognized three exceptional data teams for how they came together to solve a tough problem: delivering impact, driving innovation, and helping to make the world a better place.

These are the inaugural Databricks Data Team Awards, and we were blown away by the submissions and the finalists, representing a wide variety of data science use cases spanning multiple industries. Across the board, they showcased how a unified analytics platform like Databricks can help bring together the diverse talents of data engineers, data scientists and data analysts to focus their ideas, skills and energy toward accomplishing amazing things.

Here are the 2020 Databricks Data Team Awards winners:

Data Team for Good Award:  Aetion

Aetion's data team is working on a high-impact use case related to the COVID-19 crisis. Specifically, Aetion has partnered with HealthVerity to use Databricks to ingest and process data from multiple inputs into real-time data sets used to analyze COVID-19 interventions and to study the pandemic's impact on health care utilization. Their integrated solution includes a Real-Time Evidence Platform that enables biopharma, regulators, and public health officials to generate evidence on the usage, safety, and effectiveness of prospective treatments for COVID-19 and to continuously update and expand this evidence over time. This new, high-priority use case for Aetion has already produced a social impact: it will be employed in the company's new research collaboration with the U.S. FDA, which will support the agency's understanding of and response to the pandemic.

Accepting the award on behalf of the team at Aetion was John Turek, CTO.

Runners-up:  Alignment Healthcare, Medical University of South Carolina 

Data Team Impact Award: Disney+

Disney+ surpassed 50 million paid subscribers in just five months and is available in more than a dozen countries around the world. Data is essential to understanding customer growth and to improving the overall customer experience for any streaming business. Disney+ uses Databricks as a core component of its data lake and, using Delta Lake, has been able to build streaming and batch data pipelines supporting petabytes of data. The platform is enabling teams to collaborate on ideas, explore data, and apply machine learning across the entire customer journey to foster growth in its subscriber base.

Accepting the award on behalf of his team at Disney+ was Tom LeRoux, VP of Engineering.

Runners-up: Unilever, YipitData

Data Team Innovation Award: Goldman Sachs

To better support its clients, the Marcus by Goldman Sachs data team continues to innovate its offerings and, in this instance, leveraged Databricks to build a next-generation big data analytics platform that addresses diverse use cases spanning credit risk assessment, fraud detection, marketing analytics, and compliance. The unified data team not only built a robust and reliable infrastructure but also activated and empowered hundreds of analysts and developers in a short number of months.

Accepting the award for the Marcus by Goldman Sachs team was Executive Director Karthik Ravindra.

Runners-up:  Comcast, Zalando

Data Teams Unite!

Congratulations to the Data Team Award winners for their exceptional achievements!

We look forward to seeing data teams worldwide leverage the power of Databricks to unite. And we will continue to celebrate them for using data and artificial intelligence to help to solve the world’s toughest problems.

--

Try Databricks for free. Get started today.

The post Meet the 2020 Databricks Data Team Awards Winners appeared first on Databricks.

Celebrating Pride Month at Databricks

Members of Databricks' Queeries Network Employee Resource Group celebrate Pride Month.

With Pride month coming to an end, Databricks’ newest Employee Resource Group, Queeries Network, is looking back at how we celebrated. The mission of this group is to create opportunities to come together to discuss topics important to the LGBTQ+ community, network and build community. To celebrate Pride Month, our Queeries Network Employee Resource Group hosted virtual pride socials, participated in virtual pride parades across the globe, and had dialogues around the history of Pride and the intersections with current events impacting the Black community.

When we look back at how Pride started, from the Stonewall uprising to what is now an international celebration of the LGBTQ+ community and their contributions to the world, there has been a lot of progress in the fight for gay rights and equality, but there is also more we can do.

As we continue this reflection, we’ve asked a few Bricksters, who have made amazing contributions to Databricks, for their thoughts on how we can all create a more inclusive workspace that encourages everyone to bring their authentic self to work. Read more below. 

How can we create an environment that encourages everyone to be their authentic self and do their best work?

One of our core values at Databricks is to be customer obsessed, which at first glance, can sound like a narrow and hyperfocused attention to a singular goal. In practice, what I think it actually means is to be genuinely curious about our customers — to struggle along with them in their challenges and to celebrate in their victories. I think the same applies to how we treat each other. Taking the time to be genuinely curious about someone’s lived experience is hard because it requires a willingness to not have all the answers. But when we spend time and energy investing in and learning from each other, that’s when, in my experience, real success is realized.

— Jess Chace (she/her), Customer Success Engineer (New York)

It’s critical to show everyone in the company, particularly the newer members who might have less confidence, that there are many ways to be, to find personal success, and to develop as an individual. Offices are usually pretty homogeneous. We have to raise up the less obvious people in our midst, to showcase the realities and the successes of people of all colours, creeds, kinds, sexual and gender identities. A lot of people I speak to who don’t fit naturally into the white, male, heteronormative milieu will talk about the first time they met someone who was different, but successful in their own way. Someone they wanted to emulate. That’s powerful — to show that there are many paths, all of them perfectly valid. It creates new leaders, and it creates the foundation for the next generation of difference and variety, of authenticity.

— Rob Anderson (he/him), Director of Field Engineering (London)

I start by focusing on the details of things like inclusive language, especially within an interview setting. Those early interactions are some of the most stressful moments in anyone’s life, we shouldn’t make them more difficult. I’ve been particularly thoughtful about my choice of words while answering questions during the hiring process about the challenges of relocating with a partner or family or when discussing parental benefits. I’ve had some people mention, after joining Databricks, how significant these small choices were to them as a signal of our values.

— Stacy Kerkela (she/her), Director of Engineering (Amsterdam)

Creating an environment at work for Bricksters to be their most authentic selves is critical to building an inclusive and collaborative culture.  Being “authentic” might seem like a simple concept; however, many individuals from marginalized backgrounds are oftentimes code switching or shrinking pieces of themselves to fit into the majority culture. One of our company values that is most salient to me is: Teamwork makes the dream work. To promote authenticity within the workplace, we should encourage our team members to build genuine connections with one another, amplify our strengths and, most importantly, actively listen and promote empathy — even during challenging or unpredictable times. 

As a queer Black femme, who holds many other marginalized identities, oftentimes I am navigating various environments and interacting with diverse audiences and stakeholders. I am always looking to build strong partnerships and bring my authentic self to interactions with candidates, colleagues and executives.  Promoting an inclusive environment where trust and vulnerability are at the core is extremely important for me to do my best work and feel supported in doing so. 

— Kaamil Al-Hassan (she/her), Talent Acquisition (San Francisco)

How does focusing on nurturing this inclusive environment help us have a greater impact at Databricks and in our communities?

Another one of our core values at Databricks is to let the data decide, and it’s no secret that having a more diverse and inclusive work environment correlates with many positive business outcomes like higher innovation revenues, greater market share and greater returns to shareholders. The true value of promoting diversity and inclusion, however, is so much more than monetary gains. When our work reflects the voices and minds of our actual communities, it is better positioned to serve us all.

— Jess Chace (she/her), Customer Success Engineer (New York)

Databricks is a member of the technology community, where there is a great opportunity to continue to increase the representation of women, ethnic minorities, and people of various ages as well as sexual and gender identities. I believe that we will make the biggest impact by investing in how we hire diverse teams and support the incredible talent from these under-represented groups. There is a lot we can do to have a big impact on our companies and our communities at large.

— Rob Anderson (he/him), Director of Field Engineering (London)

At Databricks, I build teams that I would want to belong to. That means having the safety to be my authentic self and providing that same comfort zone to everyone around me. This year as entire companies have moved to work-from-home models, there’s a lot of visibility into all of our home lives that makes it impossible for me to live less openly. Engaging with teammates via video calls at home can make it difficult to create separate home and work identities, especially for members of the LGBTQ+ community. It is important that we focus on creating an inclusive culture so everyone can feel comfortable and confident being their authentic selves and doing their best work without the stress or anxiety of wondering if they will be accepted. The impact of that care we show for one another and our dedication to inclusion is huge for our entire community. 

— Stacy Kerkela (she/her), Director of Engineering (Amsterdam)

Being queer means that every day I am "coming out." When I go to the grocery store with my partner, when someone inquires about my "boyfriend," or when I attend our company holiday party and my partner is slaying in her tuxedo but is also wearing makeup. Coming out is not a singular moment. Creating an inclusive culture includes feeling supported by my colleagues and exemplifying my authentic queer self.

Our impact will be even greater when we invite members of the LGBTQ+ community to join our global teams through investing in targeted outreach events. Building an inclusive culture means exploring the possibility of gender-neutral bathrooms and encouraging pronoun usage in everyday conversations. I am excited for us to celebrate Pride and for the resilience of this particular community that is still fighting every day for equal rights and visibility. I look forward to getting more involved in the newly formed LGBTQ+ ERG and exploring ways to create inclusivity and highlight the queer experience here at Databricks.    

— Kaamil Al-Hassan (she/her), Talent Acquisition (San Francisco)

It was great to launch our Queeries Network Employee Resource Group and celebrate Pride Month together. To learn more about how you can join us, check out the Careers page.

--

Try Databricks for free. Get started today.

The post Celebrating Pride Month at Databricks appeared first on Databricks.

Databricks Announces 2020 North America Partner Awards

Databricks has a great partner ecosystem of over 450 partners that are critical to building and delivering the best data and AI solutions in the world for our joint customers.  We are proud of this collaboration and know it’s the result of the mutual commitment and investment that spans on-going training, solution development, field programs and workshops for customers. As a result, Databricks’ partners are highly-qualified with the expertise to accelerate success for data teams with the right software, services and strategic consulting expertise.

Databricks hosted our first-ever virtual Partner Executive Summit on June 23 and recognized a select few of these partners for their exceptional accomplishments in the past year. The award winners, by category, are:

Consulting & System Integrator Partners

Innovation Award: Accenture

Accenture developed the Industrialized Machine Learning solution that leverages Databricks and reusable components to streamline methodologies proven successful for large-scale ML deployments.

Congratulations to Atish Ray and the Accenture team.

Databricks 2020 North America Partner Innovation Award winner Accenture

Rising Star Award: phData

As a new partner, phData quickly ramped new programs that engaged customers for on-premises Hadoop migration, Delta Lake, and Machine Learning.

Congratulations to Jordan Birdsell and the phData team.

Databricks 2020 North America Partner Rising Star Award winner phData

National Consulting & SI Partner of the Year: Insight

Insight engaged customers across North America with repeatable solutions for several industries as part of the Insight Connected Platform.

Congratulations to Brandon Ebken and the Insight team.

Databricks 2020 North America National C&SI Partner of the Year Award winner Insight.

Customer Impact Award: Pariveda

Pariveda led successful implementations at a number of large, enterprise Databricks customers including a heavy equipment manufacturer, a frozen foods producer and a Fortune 500 CPG company.

Congratulations to Ryan Gross and the Pariveda team.

Databricks 2020 North America Customer Impact Award winner Pariveda.

Global Consulting & SI Partner of the Year Award: Avanade

Avanade achieved significant impact at global scale with customer engagements in each region, and made strong investments in training, marketing and solutions combined with strong executive support and leadership alignment.

Congratulations to Luke Pritchard and the Avanade team.

Databricks 2020 North America Global Partner of the Year Award winner Avanade.

Technology Partner Awards

Customer Impact Award: Privacera

Privacera has consistently gone above and beyond to meet the security and governance needs of several of our large enterprise customers, especially for Hadoop migration projects.

Congratulations to Balaji Ganesan and the Privacera team.

Databricks 2020 North America Customer Impact Award winner Privacera.

Innovation Award: MathWorks

MathWorks has helped make data science more approachable to all domain experts, not just data scientists. Their product, MATLAB, enables scientists and engineers to use Databricks without requiring Python or Scala.

Congratulations to Yuval Zukerman and the MathWorks team.

Databricks 2020 North America Innovation Award winner MathWorks.

Momentum Award: Talend

The Talend integration footprint with Databricks and Delta Lake grew at an impressive rate — Talend Stitch joined the Databricks Data Ingestion Network, and Talend Studio Cloud was fully integrated into Delta Lake.

Congratulations to Mike Pickett and the Talend team.

Databricks 2020 North America Momentum Award winner Talend.

Congratulations to these incredible partners!  We sincerely appreciate their positive impact on the Databricks community, and we look forward to working with all of our partners to help more and more data teams succeed every year.

--

Try Databricks for free. Get started today.

The post Databricks Announces 2020 North America Partner Awards appeared first on Databricks.
