
Guide to Healthcare & Life Sciences Sessions at Data + AI Summit 2022


Every year, data leaders and their teams from across the globe join Data + AI Summit to discuss the latest trends in data, analytics, and machine learning. For data teams in the Healthcare and Life Sciences industry, we’re excited to announce a full agenda of Healthcare and Life Sciences sessions. Leaders from Providence Health, Walgreens, Amgen, Cigna, California Healthcare Eligibility, Enrollment, and Retention System (CalHEERS), and many other organizations across the health ecosystem will share how they are using data to power real-time patient insights, accelerate drug discovery and improve health equity for all. Join us live in San Francisco or tune in virtually for free.

Healthcare and Life Sciences Industry Forum

Our Healthcare and Life Sciences Forum kicks off on Wednesday, June 29 at 3:30pm PT. During this capstone event, you’ll have the opportunity to join keynotes and panel discussions with data analytics and AI leaders on the most pressing topics in the industry. Here’s a rundown of the agenda:

Keynote: The Journey Toward Delivering Real-time Pharmacy Insights at Walgreens with the Lakehouse

In this keynote, Luigi Guadagno and Sashi Venkatesan from Walgreens will share how Walgreens is building the pharmacy of the future with a modern data lakehouse to better meet the evolving needs of the millions of patients they serve. With over 850 million prescriptions a year, delivering real-time and personalized insights to pharmacists and patients across their 10,000 U.S. stores is critical to driving better outcomes. Hear their vision for the future and what it takes to unlock the value of data at scale.

Luigi Guadagno
VP, Rx Renewal & HC Platform Technology, Walgreens

Sashi Venkatesan
Director of Product Engineering, Walgreens

Panel Discussion: Life’s Not Fair, But AI Can Be – Responsible and Equitable AI in Healthcare

Join our esteemed panel of data leaders from some of the biggest names in healthcare, insurance and pharma as they discuss biases in healthcare model development, best practices for building safe and responsible AI systems, and how data can be used to address health inequities.

Gayathri Namasivayam,
Sr. Data Scientist, McKesson
Lindsay Mico,
Director of Data Science, Providence Health
Jeffrey Reid,
Chief Data Officer, Regeneron Genetics Center
Surekha Durvasula,
Senior Director Data & Analytics Platform, Walgreens
Miguel Martinez,
Senior Data Scientist, Optum

Healthcare and Life Sciences Tech Talks

Here’s an overview of some of our most highly-anticipated Healthcare and Life Sciences sessions at this year’s summit:

Amgen’s Journey To Building a Global 360 View of its Customers with the Lakehouse

Amgen is a leading global biotech company focused on developing therapies that have the power to save lives. Delivering on this mission requires their commercial teams to regularly meet with healthcare providers to discuss new treatments that can help patients in need. With the onset of the pandemic, where face-to-face interactions with doctors and other Healthcare Providers were impacted, Amgen had to rethink how their teams interacted with these audiences through digital channels underpinned by data analytics and AI.

Learn more

Solving Healthcare Price Transparency with Databricks and Delta Lake

CMS published the Price Transparency mandate requiring healthcare providers and payers to publish the cost of services based on procedure codes. In this talk, Cigna shares how they embarked on this journey by embracing the scalability of the AWS cloud, Apache Spark, Databricks, and Delta Lake to generate and host files ranging from megabytes to hundreds of gigabytes.

Learn more

The Semantics of Biology — Vaccine and Drug Research with Knowledge Graphs and Logical Inferencing on Apache Spark

In this talk, GSK presents the new logical inferencing capabilities that they’ve built into the Bellman library — an open-source project for graph queries. GSK will demonstrate how connections between biological entities that are not explicitly connected in the data are deduced from ontologies. These inferred connections are returned to the scientist to aid in the discovery of new connections, with the intent of accelerating gene-to-disease research.

Learn more

Lessons Learned from Deidentifying 700 Million Patient Notes

Providence embarked on an ambitious journey to de-identify all of their clinical electronic medical record (EMR) data to support medical research and the development of novel treatments. This talk shares how this was done for patient notes and how you can achieve the same.

Learn more

Data Lake for State Health Exchange Analytics Using Databricks

The California Healthcare Eligibility, Enrollment, and Retention System (CalHEERS) — one of the largest state-based health exchanges in the country — was looking to modernize their data warehouse environment to support the vision that every decision to design, implement and evaluate their state-based health exchange portal is informed by timely and rigorous evidence about its consumers’ experience. The scope of the project was to replace the existing Oracle-based data warehouse with an analytics platform that could support a much broader range of requirements, including ML. Learn about their journey to building this modern architecture on the Lakehouse and the outcomes they achieved.

Learn more

Accelerating the Pace of Autism Diagnosis with Machine Learning Models

A formal autism diagnosis can be an inefficient and lengthy process. Families may wait months or longer before receiving a diagnosis for their child despite evidence that earlier intervention leads to better treatment outcomes. In this talk, Anish Lakkapragada, a sophomore at Lynbrook High School and researcher at Stanford’s Wall Lab, shares his work on how digital technologies and deep learning can be used to analyze unstructured home videos to aid in the rapid detection of autism.

Learn more

Check out the full list of Healthcare and Life Sciences talks at Summit, including speakers from Bayer, OncoHealth, Humana, and many others.

Demos on Popular Data + AI Use Cases in Healthcare and Life Sciences

  • Delta Sharing for Healthcare and Life Sciences
  • Healthcare Data Interoperability and Patient Analytics
  • Patient Cohort Building with NLP and Knowledge Graphs
  • Real-world Evidence and Propensity Score Matching

Sign up for the Healthcare and Life Sciences Experience at Summit!

  • Make sure to register for the Data + AI Summit to take advantage of all the amazing Healthcare and Life Sciences sessions, demos, and talks scheduled to take place! All content will be recorded and shared with virtual attendees!
  • Download our Guide to Healthcare and Life Sciences Sessions at Data + AI Summit 2022.

--



Apache Spark and Photon Receive SIGMOD Awards


This week, many of the most influential engineers and researchers in the data management community are convening in-person in Philadelphia for the ACM SIGMOD conference, after two years of meeting virtually. As part of the event, we were thrilled to see the following two awards:

  • Apache Spark was awarded the SIGMOD Systems Award
  • Databricks Photon was awarded the Best Industry Paper award

We thought we would take this opportunity to discuss the background to this and how we got here.

What is ACM SIGMOD and what are the awards?

ACM SIGMOD stands for the Association for Computing Machinery’s Special Interest Group on Management of Data. We know, long name. Everybody just says SIGMOD. It is the most prestigious conference for database researchers and engineers, as many of the most seminal ideas in the field of databases, from column stores to query optimizations, have been published in this venue.

The SIGMOD Systems Award is given annually to one “system whose technical contributions have had significant impact on the theory or practice of large-scale data management systems.” These systems tend to have large-scale real-world applications as well as having influenced how future database systems are designed. The past winners include Postgres, SQLite, BerkeleyDB, and Aurora.

The Best Industry Paper Award is awarded annually to one paper based on the combination of real-world impact, innovation, and quality of the presentation.

Apache Spark’s Data and AI Origin

About a decade ago, Netflix started a competition called the Netflix Prize, in which they anonymized their vast collection of user movie ratings and asked competitors to come up with algorithms to predict how users would rate movies. The $1M USD prize would go to the team with the best machine learning model.

A group of PhD students at UC Berkeley decided to compete. The first challenge they ran into was that the tooling simply wasn’t good enough. In order to build better models, they needed a fast, iterative way to clean, analyze, and process large amounts of data (that didn’t fit on a student laptop), and they needed a framework expressive enough to compose experimental ML algorithms on.

Data warehouses, which were the standard for enterprise data, could not deal with the unstructured data and lacked expressiveness. They discussed this challenge with another PhD student, Matei Zaharia. Together, they designed a new parallel computing framework called Spark, with an innovative new distributed data structure called Resilient Distributed Datasets (RDDs). Spark enabled its users to run data-parallel operations quickly and concisely.

Put differently, it is fast to write code in and fast to run. Fast to write is important because it makes programs more understandable and easier to compose into more complex algorithms. Fast to run means users get feedback sooner and can build their models on ever-growing data.
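To make "fast to write and fast to run" concrete, here is a minimal, hypothetical PySpark example (not from the original post): a word count expressed in a few lines that Spark executes in parallel across a cluster.

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session; on Databricks one already exists as `spark`.
    spark = SparkSession.builder.appName("concise-parallel-ops").getOrCreate()

    # A classic word count: a few lines of code that Spark executes in parallel
    # across however many cores or machines the cluster provides.
    lines = spark.sparkContext.parallelize([
        "spark makes parallel data processing concise",
        "concise code is easier to reason about",
    ])
    counts = (
        lines.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
    )
    print(counts.collect())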

It turned out the students were not alone. These were the early days of data and AI applications in the industry, and everybody faced similar challenges. Owing to popular demand, the project moved to the Apache Software Foundation and grew into a massive community.

Today, Spark is the de facto standard for data processing, and growing:

  • It was downloaded 45 million times last month, on PyPI and Maven Central alone. This represents 90% year-over-year growth in downloads.
  • It is used in at least 204 countries and regions.
  • It ranked #1 among the top-paying technologies in Stack Overflow’s 2021 developer survey.

The SIGMOD Systems Award is a validation of the project’s adoption, as well as its influence on the generations of systems to come that treat data and AI as a unified package.

Photon: New Workloads and Lakehouse

As Apache Spark grew in popularity, we found that organizations wanted to do more than large-scale data processing and machine learning with it: they wanted to run traditional interactive data warehousing applications on the same datasets they were using elsewhere in their business, eliminating the need to manage multiple data systems. This led to the concept of lakehouse systems: a single data store that can do large-scale processing and interactive SQL queries, combining the benefits of data warehouse and data lake systems.

To support these types of use cases, we developed Photon, a fast C++, vectorized execution engine for Spark and SQL workloads that runs behind Spark’s existing programming interfaces. Photon enables much faster interactive queries as well as much higher concurrency than Spark, while supporting the same APIs and workloads, including SQL, Python and Java applications. We’ve seen great results with Photon on workloads of all sizes, from setting the world record in the large-scale TPC-DS data warehouse benchmark last year to offering 3x higher performance on small, concurrent queries.

10 GB TPC-DS Queries/Hr at 32 Concurrent Streams (Higher is better)

Designing and implementing Photon was challenging because we needed the engine to retain the expressiveness and flexibility of Spark (to support the wide range of applications), to never be slower (to avoid performance regressions), and to be significantly faster on our target workloads. In addition, unlike a traditional data warehouse engine that assumes all the data has been loaded into a proprietary format, Photon needed to work in the lakehouse environment, processing data in open formats such as Delta Lake and Apache Parquet, with minimal assumptions about the ingestion process (e.g., availability of indexes or data statistics). Our SIGMOD paper describes how we tackled these challenges and many of the technical details of Photon’s implementation.

We were thrilled to see this work recognized as the Best Industry Paper and we hope it gives database engineers and researchers good ideas about what’s challenging in this new model of lakehouse systems. Of course, we have also been very excited about what our customers have done with Photon so far — the new engine has already grown to a significant fraction of our workload.

If you are attending SIGMOD, drop by the Databricks booth and say hi. We would love to chat about the future of data systems together. In return, we will give you a “the best data warehouse is a lakehouse” t-shirt!

--


Announcing New Partner Integrations in Partner Connect


We are excited to announce six new integrations in Databricks Partner Connect. Our expanding partnerships enable our users to integrate the freshest data into the Databricks Lakehouse Platform, use pre-built deep learning models to process clinical and biomedical text at scale, use cutting-edge notebooks for analytics and data science collaboration and automatically track and improve data quality. This is part of our commitment to bringing the best partners in the world to our customers through Partner Connect, a one-stop portal for discovering, trying, and connecting the best-validated data, analytics, and AI tools with Databricks.

This quarter, we have added the following partner integrations to Partner Connect.

Data Ingestion

Arcion
Arcion is a cloud-native, change data capture-based data replication platform that integrates data from relational databases such as Oracle, Oracle Exadata, Oracle RAC, IBM Db2, and Microsoft SQL Server, as well as SaaS applications such as Salesforce, into Delta Lake on Databricks. Arcion offers transactional integrity, automatic schema conversion, and zero latency. This makes it an ideal solution for business-critical analytics and AI workloads that benefit from real-time, freshest-possible data, for example, click-stream data, customer 360, threat protection, etc. Arcion is also one of the first zero-code data mobility platforms in the Databricks Ventures portfolio. Click here to view a demo.

Data Transformation

Matillion
Matillion ETL for Delta Lake on Databricks brings no-code/low-code data integration to a lakehouse architecture. Users across the business can take ownership of their data, leveraging best-in-class data transformation from Matillion ETL to enable on-demand machine learning, faster reporting, and BI improvements powered by Delta Lake. Availability in Partner Connect allows users to easily extract and load business-critical data from operational databases, files, NoSQL, and API sources into Delta Lake without any prior pre-configuration.

Artificial intelligence and machine learning

John Snow Labs
Over the past year, John Snow Labs has developed solutions for key use cases, such as detecting adverse events, extracting oncology insights and automating the removal of PHI, that are optimized to run on Databricks to help healthcare and life science companies transform large volumes of unstructured text data into patient insights. With Partner Connect, John Snow Labs’ Spark NLP for Healthcare – a best-of-breed and the most widely-used NLP library – is easily available to all Databricks customers. This allows healthcare organizations to unify clinical and biomedical text data at scale into a single, high-performance lakehouse platform for analytics and data science.

Hex
Together, Databricks and Hex are powering the next generation of analytics. Databricks provides unparalleled data processing and storage, and Hex makes that data available for data science exploration and analytics. Hex simply and securely connects to the Databricks Lakehouse, so users can query structured and unstructured data in collaborative SQL and Python-powered notebooks, and then share their work as interactive data apps that are usable by anyone. Hex empowers users of all skill sets to make decisions backed by data. Learn more about how Hex & Databricks work seamlessly together and sign up for a free trial today.

Data quality

Anomalo
Anomalo automatically detects data issues and understands their root cause, ensuring you can trust the quality of your data in your Databricks Lakehouse before it is consumed by BI, analytics, and machine learning frameworks. Unlike traditional rules-based approaches to data quality, Anomalo provides automated checks for data quality using machine learning. When issues are detected, Anomalo provides a rich set of visualizations to contextualize and explain issues as well as an instant root-cause analysis that points to the likely source of the problem. This means customers spend more time making data-driven decisions, and less time investigating and fire-fighting issues with data. To learn more about Anomalo, please visit Anomalo.com, or request a demo here.

Lightup
Lightup empowers data-driven companies to quickly and continuously perform accurate and comprehensive data quality checks with ease. With deep integration with the Databricks Lakehouse Platform, Lightup provides an additional layer of data reliability across the Lakehouse, empowering enterprises to perform in-place data quality checks and accelerating data migrations from legacy big data systems to Databricks, all with a single click.
Start optimizing the data quality in your Lakehouse today! To learn more about Lightup, visit lightup.ai.

Customer spotlight: Enabling data literacy at Aktify

Partner Connect eliminates the complexity of finding and configuring partner tools securely with Databricks to support your analytics and AI use cases. For example, Aktify, a conversational AI company, is on a mission to delight customers by powering conversations using robust machine learning. A one-click Tableau connector in Partner Connect allows all users at Aktify to spin up their own Tableau dashboards to answer key business questions instead of waiting on data engineers. As a result, Aktify now has over two dozen data professionals focused on high-value tasks, working collaboratively and sharing data insights seamlessly.

“We no longer stage data in Snowflake as all data (including ~85 GB of operational data) is instantly available in the Databricks Lakehouse,” explained Brandon Smith, the Director of Data & Analytics at Aktify. “Partner Connect also helps the team discover new data and AI solutions, bringing us closer to our vision of organization-wide data literacy.”

Get started with Partner Connect

Sign up for a free trial of Databricks to try Partner Connect at no additional cost. Stay tuned as we announce integrations with Matillion, Hightouch, Preset, Thoughtspot and many others in the coming months.

You will have the opportunity to meet all Partner Connect partners in-person or online at the Data + AI Summit, the world’s largest data and AI conference from June 27-30. If you haven’t already, be sure to register here for talks, technical demos, hands-on training and networking.

--


Defining the Future of Data & AI: Announcing the Finalists for the 2022 Databricks Data Team OSS Award


The annual Databricks Data Team Awards recognize data teams who are harnessing the power of data and AI to deliver solutions for some of the world’s toughest problems.

Nearly 250 teams were nominated across six categories from all industries, regions, and companies – all with impressive stories about the work they are doing with data and AI. As we lead up to Data and AI Summit, we will be showcasing the finalists in each of the categories over the coming days.

First up: The Data Team OSS Award celebrates those who are making the most out of leveraging, or contributing to, the open-source technologies that are defining the future of data and AI, including Delta Lake, MLflow, and Apache Spark™.

Meet the five finalists for the Data Team OSS Award category:

Apple
As one of the most iconic and recognizable brands in the world, with over 1 billion iPhone users worldwide, Apple puts data and AI at the forefront of its innovation strategy. Part of what has contributed to the loyalty of Apple customers is peace of mind — knowing that their devices and data are secure from malicious attacks. Apple’s early commitment to Delta Lake has allowed them to build a foundation to operate massive streaming, SQL, graph, and ML workloads, ingesting hundreds of terabytes of daily log and telemetry data required to detect, diagnose and respond to cyber threats in real time. Apple has been instrumental in building Delta Lake since its inception, contributing code and key design elements, while actively participating in the various Delta Lake community forums to help other organizations looking to democratize data and AI with Delta Lake.

Back Market
The refurbished consumer electronics market is growing significantly as an alternative to purchasing new devices. With over 6 million customers, Back Market is the leading dedicated renewed-tech marketplace, bringing high-quality, professionally refurbished electronic devices and appliances, including smartphones, laptops, gaming consoles, and more. The key to ensuring they meet the needs of each customer and seller is data, but as analytical workloads rose, so did the need to consume their data in a rapid, efficient and secure manner. To enable this, Florian Valeye (a Delta Lake Committer) and the engineering team at Back Market have been important contributors to the creation of the Delta Rust API and associated Python bindings, in an effort to enable low-latency queries of Delta tables without having to spin up a Spark cluster. They’ve also been actively involved in Delta Lake community office hours, contributed to the AWS Labs Athena Federation, reviewed code, and even supported our partnership with Google BigQuery.

Samba TV
As more people consume content across internet-connected TVs, the data captured is enabling unparalleled levels of audience targeting and personalized experiences. Samba TV provides first-party data from tens of millions of televisions, across more than 20 TV brands sold in over 100 countries, providing advertisers and media companies a unified view of the entire consumer journey. Because S3 lacks putIfAbsent transactional consistency and therefore only supported single-cluster writes, the team at Samba TV became an instrumental driver of S3 multi-cluster writes, which allow writes to S3 from multiple clusters and/or Spark drivers while ensuring that only one writer succeeds for each transaction. This maintains atomicity and prevents file contents from ever being overwritten. Their contributions don’t stop there, as Samba TV continues to actively participate in the Delta Lake community, forums, and feature discussions.

Scribd
Scribd is on a mission to change the way the world reads. Scribd offers a monthly subscription, providing online access to the best ebooks, audiobooks, magazines, and podcasts for over a million consumers and 100 million monthly visitors. With the world’s largest library of digital content spanning more than 60 million titles, Scribd relies on the lakehouse architecture built on Delta Lake, to build performance-optimized data pipelines that easily support both historical and streaming data to power recommendations that serve compelling and interesting content to users. As invaluable contributors to the Delta Lake community, the team at Scribd has leveraged their deep understanding of the Delta Lake ecosystem to create the Delta Rust API, kafka-delta-ingest, sql-import, and has provided an immense amount of feedback on Apache Spark, Delta Lake, MLflow, machine learning, Databricks, and more. They have also helped us with community office hours, reviewing Delta code, and working with the Delta community.

T-Mobile
T-Mobile’s mission is to build the nation’s best 5G network while reducing customer pain points every day. To meet the Un-carrier’s aggressive build plans and customer-focused goals, they embarked on a digital transformation — relying on their data to optimize back-office business processes, streamline network builds, mitigate fraud, and improve the overall experience for the enterprise’s business teams. At the heart of their data strategy is the lakehouse architecture and Delta Lake — democratizing access to data for BI and ML workloads at the speed of business. As valuable members of the Delta Lake community, they have been pushing the boundaries of Delta Lake to solve their toughest data problems, from optimizing their procurement and supply chain processes to ensure billions of dollars of cell-site equipment is in the right place at the right time, to streamlining internal initiatives that better engage customers, save money and drive revenue.

Check out the award finalists in the other categories and come raise a glass and celebrate these amazing data teams during the awards ceremony at the Data and AI Summit on June 29.

--


Introducing Apache Spark™ 3.3 for Databricks Runtime 11.0


Today we are happy to announce the availability of Apache Spark™ 3.3 on Databricks as part of Databricks Runtime 11.0. We want to thank the Apache Spark community for their valuable contributions to the Spark 3.3 release.

The number of monthly PyPI downloads of PySpark has rapidly increased to 21 million, and Python is now the most popular API language. Monthly PySpark downloads have doubled over the last year. The number of monthly Maven downloads has also exceeded 24 million. Spark has become the most widely-used engine for scalable computing.

The number of monthly PyPI downloads of PySpark has rapidly increased to 21 million.

Continuing with the objectives to make Spark even more unified, simple, fast, and scalable, Spark 3.3 extends its scope with the following features:

  • Improve join query performance via Bloom filters with up to 10x speedup.
  • Increase the Pandas API coverage with the support of popular Pandas features such as datetime.timedelta and merge_asof.
  • Simplify the migration from traditional data warehouses by improving ANSI compliance and supporting dozens of new built-in functions.
  • Boost development productivity with better error handling, autocompletion, performance, and profiling.

Performance Improvement

Bloom Filter Joins (SPARK-32268): Spark can inject and push down Bloom filters in a query plan when appropriate, in order to filter data early on and reduce intermediate data sizes for shuffle and computation. Bloom filters are row-level runtime filters designed to complement dynamic partition pruning (DPP) and dynamic file pruning (DFP) for cases when dynamic file skipping is not sufficiently applicable or thorough. As shown in the following graphs, we ran the TPC-DS benchmark over three different variations of data sources: Delta Lake without tuning, Delta Lake with tuning, and raw Parquet files, and observed up to ~10x speedup by enabling this Bloom filter feature. Performance improvement ratios are larger for cases lacking storage tuning or accurate statistics, such as Delta Lake data sources before tuning or raw Parquet file based data sources. In these cases, Bloom filters make query performance more robust regardless of storage/statistics tuning.

Performance of TPC-DS queries with Bloom filters
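For readers who want to try this, here is a minimal, illustrative PySpark sketch (not from the original post). The configuration key shown is the runtime Bloom filter flag associated with SPARK-32268; both the key name and its default should be verified against the release notes for your Spark or Databricks Runtime version.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bloom-filter-join").getOrCreate()

    # Runtime Bloom filter injection (SPARK-32268); verify the key and its default
    # for your Spark/Databricks Runtime version before relying on it.
    spark.conf.set("spark.sql.optimizer.runtime.bloomFilter.enabled", "true")

    # A selective dimension table lets Spark build a Bloom filter from the join keys
    # and push it down to prune the much larger fact table before the shuffle.
    fact = spark.range(0, 10_000_000).withColumnRenamed("id", "fk")
    dim = spark.range(0, 100_000).withColumnRenamed("id", "pk").where("pk % 97 = 0")

    joined = fact.join(dim, fact.fk == dim.pk)
    joined.explain()  # look for a runtime/Bloom filter in the physical plan
    print(joined.count())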

Query Execution Enhancements: A few adaptive query execution (AQE) improvements have landed in this release:

  1. Propagating intermediate empty relations through Aggregate/Union (SPARK-35442)
  2. Optimizing one-row query plans in the normal and AQE optimizers (SPARK-38162)
  3. Supporting eliminating limits in the AQE optimizer (SPARK-36424).

Whole-stage codegen coverage has also been further improved in multiple areas in this release.

Parquet Complex Data Types (SPARK-34863): This improvement adds support in Spark’s vectorized Parquet reader for complex types such as lists, maps, and arrays. As micro-benchmarks show, Spark obtains an average of ~15x performance improvement when scanning struct fields, and ~1.5x when reading arrays comprising elements of struct and map types.
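A minimal sketch of exercising the nested-column vectorized reader follows; the configuration key shown is the one associated with SPARK-34863 and should be verified against your runtime's documentation, and the output path is a placeholder.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("nested-parquet").getOrCreate()

    # Opt in to the vectorized reader for nested Parquet columns (SPARK-34863);
    # check the key name and default for your runtime.
    spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")

    # Write a small Parquet file containing struct, array and map columns ...
    df = spark.range(1000).select(
        F.struct(F.col("id"), (F.col("id") * 2).alias("double_id")).alias("s"),
        F.array(F.col("id"), F.col("id") + 1).alias("a"),
        F.create_map(F.lit("k"), F.col("id")).alias("m"),
    )
    df.write.mode("overwrite").parquet("/tmp/nested_demo")  # placeholder path

    # ... and scan it back; the nested fields now go through the vectorized path
    # instead of the slower row-based fallback.
    spark.read.parquet("/tmp/nested_demo").select("s.double_id", "a", "m").show(3)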

Scale Pandas

Optimized Default Index: In this release, the Pandas API on Spark (SPARK-37649) switched the default index from ‘sequence’ to ‘distributed-sequence’, which is amenable to optimization by the Catalyst optimizer. Scanning data with the default index in the Pandas API on Spark became two times faster in a benchmark on a five-node i3.xlarge cluster.

The performance of 5 GB data scans between different index types 'sequence' and 'distributed-sequence'

Pandas API Coverage:
PySpark now natively understands datetime.timedelta (SPARK-37275, SPARK-37525) across Spark SQL and the Pandas API on Spark. This Python type now maps to the day-time interval type in Spark SQL. Also, many previously missing parameters and new API features are now supported in the Pandas API on Spark in this release. Examples include ps.merge_asof (SPARK-36813), ps.timedelta_range (SPARK-37673) and ps.to_timedelta (SPARK-37701).
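A short, illustrative pandas-on-Spark sketch of these additions; the API names follow the JIRAs cited above, and the small in-memory data is made up for demonstration.

    import pyspark.pandas as ps

    # The default index type is now 'distributed-sequence', which the Catalyst
    # optimizer can handle efficiently; it can also be set explicitly.
    ps.set_option("compute.default_index_type", "distributed-sequence")

    # timedelta support (SPARK-37701): values map to Spark SQL's day-time interval type.
    print(ps.to_timedelta([1, 2, 3], unit="h"))

    # merge_asof (SPARK-36813): match each trade with the most recent earlier quote.
    quotes = ps.DataFrame({"time": [1, 3, 7], "quote": [100.0, 101.5, 102.0]})
    trades = ps.DataFrame({"time": [2, 6], "qty": [10, 20]})
    print(ps.merge_asof(trades, quotes, on="time").to_string())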

Migration Simplification

ANSI Enhancements: This release completes support for the ANSI interval data types (SPARK-27790). We can now read and write interval values from and to tables, and use intervals in many functions and operators to do date/time arithmetic, including aggregation and comparison. Implicit casting in ANSI mode now supports safe casts between types while protecting against data loss. A growing library of “try” functions, such as try_add and try_multiply, complements ANSI mode, letting users embrace the safety of ANSI mode rules while still writing fault-tolerant queries.

Built-in Functions: Beyond the try_* functions (SPARK-35161), this release also includes nine new linear regression and statistical functions, four new string processing functions, AES encryption and decryption functions, generalized floor and ceiling functions, “to_number” formatting, and many others.
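A hedged SQL sketch (run through PySpark) of the try_* and to_number functions described above; the literal values and format string are illustrative only.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ansi-try-functions").getOrCreate()
    spark.conf.set("spark.sql.ansi.enabled", "true")

    # try_add / try_multiply return NULL on overflow rather than failing the whole
    # query, even with ANSI mode on; to_number parses a formatted string to a decimal.
    spark.sql("""
        SELECT
          try_add(2147483647, 1)                 AS add_overflow,  -- NULL, not an error
          try_multiply(9223372036854775807L, 2L) AS mul_overflow,  -- NULL, not an error
          to_number('12,345.67', '99,999.99')    AS parsed_number
    """).show()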

Boosting Productivity

Error Message Improvements: This release begins introducing explicit error classes, such as “DIVIDE_BY_ZERO”, which make it easier to search online for more context about errors, including in the formal documentation.

For many runtime errors Spark now returns the exact context where the error occurred, such as the line and column number in a specified nested view body.

An example of error message improvements
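As a small, hypothetical illustration of the new error classes, the following snippet triggers a division by zero under ANSI mode and prints the resulting message, which carries the DIVIDE_BY_ZERO class:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("error-classes").getOrCreate()
    spark.conf.set("spark.sql.ansi.enabled", "true")

    try:
        spark.sql("SELECT 1 / 0").collect()
    except Exception as e:
        # The message now carries an explicit error class such as DIVIDE_BY_ZERO,
        # plus a pointer back to the offending fragment of the query text.
        print(e)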

Profiler for Python/Pandas UDFs (SPARK-37443): This release introduces a new Python/Pandas UDF profiler, which provides deterministic profiling of UDFs with useful statistics. Below is an example of running PySpark with the new profiler enabled:


Output examples of Python/Pandas UDF profiler
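The original post shows the profiler's output as a screenshot; as a stand-in, here is a hedged sketch of how one might enable it, assuming the profiler is driven by the existing spark.python.profile setting that SPARK-37443 extends to Python/Pandas UDFs (worth verifying against your runtime's documentation):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    # Assumption: the UDF profiler is driven by the existing spark.python.profile
    # setting, which SPARK-37443 extends to Python/Pandas UDFs.
    spark = (
        SparkSession.builder.appName("udf-profiler")
        .config("spark.python.profile", "true")
        .getOrCreate()
    )

    @pandas_udf("double")
    def plus_one(s: pd.Series) -> pd.Series:
        return s + 1.0

    df = spark.range(0, 100_000).select(plus_one("id").alias("x"))
    print(df.selectExpr("sum(x)").first())  # force the UDF to actually run

    # Dump the per-UDF profiling statistics that were collected during execution.
    spark.sparkContext.show_profiles()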

Better Auto-Completion with Type Hint Inline Completion (SPARK-39370):
In this release, all type hints have migrated from stub files to inline type hints in order to enable better autocompletion. For example, showing the types of parameters can provide useful context.

Better auto-completion by type hint inline completion

In this blog post, we summarize some of the higher-level features and improvements in Apache Spark 3.3.0. Please keep an eye out for upcoming posts that dive deeper into these features. For a comprehensive list of major features across all Spark components and JIRA tickets resolved, please visit the Apache Spark 3.3.0 release notes.

The Apache Spark 3.3 release includes a long list of major and minor enhancements, focused on usability, stability and refinement, and reflects the work of 226 contributors across 1604 JIRA tickets.

Get started with Spark 3.3 today

To try out Apache Spark 3.3 in Databricks Runtime 11.0, please sign up for the Databricks Community Edition or Databricks Trial, both of which are free, and get started in minutes. Using Spark 3.3 is as simple as selecting version “11.0” when launching a cluster.

Databricks Runtime version selection when creating a cluster.

Databricks Runtime 11.0 (Beta)

--


Using a Knowledge Graph to Power a Semantic Data Layer for Databricks


This is a collaborative post between Databricks and Stardog. We thank Aaron Wallace, Sr. Product Manager at Stardog, for their contribution.

 

Knowledge Graphs have become ubiquitous; we just don’t know it. We experience them every day when we search on Google or scroll the feeds that run through our social media accounts of people we know, companies we follow or the content we like. Similarly, Enterprise Knowledge Graphs provide a foundation for structuring your organization’s content, data and information assets by extracting, relating and delivering knowledge as answers, recommendations and insights to every data-driven application, from chatbots to recommendation engines, or supercharging your BI and analytics.

In this blog, you will learn how Databricks and Stardog solve the last-mile challenge in democratizing data and insights. Databricks provides a lakehouse platform for data, analytics and artificial intelligence (AI) workloads on a multi-cloud platform. Stardog provides a knowledge graph platform that can model complex relationships against data that is wide, not just big, to describe people, places, things and how they relate. The Databricks Lakehouse Platform, coupled with Stardog’s knowledge graph-enabled semantic layer, provides organizations with a foundation for an enterprise data fabric architecture that makes it possible for cross-functional, cross-enterprise or cross-organizational teams to ask and answer complex queries across domain silos.

The growing need for a Data Fabric Architecture

Rapid innovation and disruption in the data management space are helping organizations unlock value from data available both inside and outside the enterprise. Organizations operating across physical and digital boundaries are finding new opportunities to serve customers in the way they want to be served.

These organizations have connected all relevant data across the data supply chain to create a complete and accurate picture in the context of their use-cases. Most industries that look to operate and share data across organizational boundaries to harmonize data and enable data sharing are adopting open standards in the form of prescribed ontologies, from FIBO in Financial Services to D3FEND in the Cybersecurity domain. These business ontologies (or semantic models) reflect how we think about data with meaning attached, i.e. “things” rather than how data is structured and stored, i.e. “strings”, and make data sharing and re-use possible.

The idea of a semantic layer is not new. It has been around for over 30 years, often promoted by BI vendors helping companies build purpose-built dashboards. However, broad adoption has been impeded, given the embedded nature of that layer as part of a proprietary BI system. This layer is often too rigid and complex, suffering from the same limitations as a physical relational database system, which models data to optimize for its structured query language rather than for how data is related in the real world: many-to-many. A knowledge graph-powered semantic data layer that operates between your storage and consumption layers provides the glue that connects all data and delivers it, in the context of the business use case, to citizen data scientists and analysts who would otherwise be unable to participate and collaborate in data-centric architectures reserved for a handful of specialists.

Enable a use case around insurance

Let’s look at a real-world example of a multi-carrier insurance organization to illustrate how Stardog and Databricks work together. Like most large companies, many insurance companies struggle with similar challenges when it comes to data, such as the lack of broad availability of data from internal and external sources for decision-making by critical stakeholders. Everyone from underwriting risk assessment to policy administration to claims management and agencies struggle with leveraging the right data and insights to make critical decisions. They all need an enterprise-wide data fabric that brings the elements of a modern data and analytics architecture to make data FAIR – Findable, Accessible, Interoperable and Reusable. Most companies start their journey by bringing all data sources into a data lake. The Databricks lakehouse approach provides companies with a great foundation for storing all their analytics data and making all data accessible to anyone inside the enterprise. In this data layer, all cleansing, transformation, and disambiguation takes place. The next step in that journey is data harmonization, connecting data based on its meaning to provide richer context. A semantic layer, delivered by a knowledge graph, shifts the focus to data analysis and processing and provides a connected fabric of cross-domain insights to underwriters, risk analysts, agents and customer service teams to manage risk and deliver an exceptional customer experience.

We will examine how this would work with a simplified semantic model as a starting point.

Easily model domain-specific entities and cross-domain relationships

Visually creating a semantic data model through a whiteboard-like experience is the initial step in creating a semantic data layer. Inside the Stardog Designer project, just click to create specific classes (or entities) that are critical in answering your business questions. Once a class is created, you can add all the necessary attributes and data types to describe this new entity. Linking classes (or entities) together is easy. With an entity selected, just click to add a link and drag the point of the new relationship until it snaps to the other entity. Give this new relationship a name that describes the business meaning (e.g., a “Customer” “owns” a “Vehicle”).

Add a new class and link it to an existing class to create a relationship

Map metadata from the Databricks Lakehouse Platform

What’s a model without data? Stardog users can connect to a variety of structured, semi-structured and unstructured data sources by persisting or virtualizing data, or some combination, when and where it makes sense. In Designer, it is easy to connect data from existing sources like Delta Lake to connect the metadata from user-specified tables. This enables initial access to that data through its virtualization layer without moving or copying it into the knowledge graph. The virtualization layer automatically translates incoming queries from Stardog from its open-standards based SPARQL to optimized push-down SQL queries in Databricks SQL.

Add a new data source as a project resource

Click to add a new project resource and select from one of the available connections, such as Databricks. This connection leverages the new SQL endpoint recently released by Databricks. Define a scope for the data and specify any additional properties. Use the preview pane to quickly glance at the data before adding to your project.

Incorporate additional data from a variety of locations

Designer makes it simple to incorporate data from other data sources and files such as CSVs, for teams looking to conduct ad-hoc data analysis, combining data from Delta with this new information. Once added as a resource, you simply add a link and drag and drop to a class to map the data. Give the mapping a meaningful name, specify a data column for the primary identifier, the label, and any other data columns that match the attributes for the entity.

Map data from a project resource to a class

Publish your work

Within Designer you can publish this project’s model and data directly to your Stardog server for use in Stardog Explorer. Designer also allows you to publish and consume the output of the knowledge graph in various ways. For example, you can publish a zipped folder of files, including your model and mappings, to your version control system.

Publish directly to a Stardog database

Once the data has been published to Stardog, data analysts can also use popular BI tools like Tableau to connect through Stardog’s BI/SQL endpoint and pull data through the semantic layer into a report or dashboard. An auto-generated schema within any SQL-compatible tool allows users to write SQL queries against the knowledge graph. Queries coming through the SQL layer are automatically translated to SPARQL, the query language of the knowledge graph, and pushed down through the virtual layer as auto-generated, source-optimized queries for computation at the source, in this case Databricks via the Databricks SQL endpoint. The same information can also be made available to Databricks users in a notebook using Stardog’s Python API, pystardog. You can also embed the virtual graph for use directly inside your applications using Stardog’s GraphQL API. The semantic layer on top of the lakehouse provides a single environment for all types of users and their preferred tools, keeping operations backed by a consistent set of data.

Exploration of the connected data via Stardog Explorer Application

Data visualization of the connected knowledge graph in Tableau via Stardog’s BI-SQL end-point

Data Science notebook in Databricks using pystardog to query data from the Knowledge Graph
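As a rough illustration of the pystardog path mentioned above, the sketch below queries a published knowledge graph from a notebook; the database name, endpoint, credentials, prefix and SPARQL query are all placeholders rather than anything from the original post.

    import stardog  # pystardog

    # Placeholders: point these at your Stardog (Cloud) endpoint and credentials.
    conn_details = {
        "endpoint": "https://your-stardog-endpoint:5820",
        "username": "admin",
        "password": "admin",
    }

    # A placeholder SPARQL query against the model sketched above; the prefix and
    # predicate names are illustrative, not from the original post.
    query = """
        PREFIX : <http://example.com/insurance#>
        SELECT ?customer ?vehicle
        WHERE { ?customer :owns ?vehicle }
        LIMIT 10
    """

    with stardog.Connection("insurance", **conn_details) as conn:
        results = conn.select(query)
        for binding in results["results"]["bindings"]:
            print(binding["customer"]["value"], binding["vehicle"]["value"])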

Increase productivity & develop new insights

By organizing data in a knowledge graph, data teams increase their productivity by decreasing the amount of time they spend wrangling data from external sources in support of ad hoc data analysis. Data outside Databricks can be federated through Stardog’s virtualization layer and connected to data inside Databricks. Additionally, new relationships can be inferred between entities, without explicitly modeling them in the knowledge graph, using techniques like statistical and/or logical inference. Because Databricks and Stardog work seamlessly together, the combination provides a true end-to-end experience that simplifies complex cross-domain query and analysis. Moreover, the semantic layer becomes a living, shared and easy-to-use layer as part of an enterprise data fabric foundation, providing enterprise-wide knowledge in support of new data-driven initiatives.

Getting started with Databricks and Stardog

In this blog, we’ve provided a high-level overview of how Stardog enables a knowledge graph-powered semantic data layer on top of the Databricks Lakehouse Platform. To get an in-depth overview, check out our deep dive demo. Stardog provides knowledge workers with critical just-in-time insight across a connected universe of data assets to supercharge their analytics and accelerate the value of their data lake investments. By using Databricks and Stardog together, data and analytics teams can quickly establish a data fabric that evolves with your organization’s growing needs.

To get started with Databricks and Stardog, request a free trial below:
https://databricks.com/try-databricks
https://cloud.stardog.com/get-started
https://www.stardog.com/learn-stardog/

--


Deploy Fully Managed Change Data Capture Pipelines With Arcion and Databricks Partner Connect


This is a collaborative post between Databricks and Arcion. We thank Rajkumar Sen, Founder & CTO of Arcion, for their contribution.

 

We are thrilled to announce that Arcion, the cloud-native, distributed change data capture replication platform for simpler real-time data pipelines, is now available in Databricks Partner Connect. Arcion enables real-time data ingestion from transactional databases like Oracle and MySQL into the Databricks Lakehouse Platform with their fully-managed cloud service.

Arcion and Databricks have been working towards simplifying data replication and real-time data ingestion for over two years now. This integration is the latest in our continued effort to make real-time data sync with the lakehouse even easier for our joint customers and will result in faster and highly-automated analytics and AI and ML workflows.

Real-time data ingestion to Databricks starts with just a click

Transactional databases like Oracle have become a critical part of modern data infrastructure. They are extremely secure and often store mission-critical business data. Unfortunately, the design of transactional databases limits collaboration across teams, especially analytics teams, resulting in stale data and limited business visibility. Arcion solves this issue – while combating the slow, expensive batch processes and brittle pipelines of traditional solutions – with its fully-managed, distributed change data capture (CDC) technology that ensures a lower cost of ownership, reduced DevOps, and peace of mind with end-to-end data consistency. Arcion’s pipelines can be stopped and resumed at will without causing data loss, and have minimal impact on the production source.

Arcion brings high-volume, concurrent data ingestion into Databricks through data pipelines that can achieve 10k ops/sec/table and support tables with billions of rows. But connecting the platforms still required users to configure, transfer credentials, and validate the connection manually. Or it did until today.

With Partner Connect, users can simply choose Arcion as the data ingestion partner of choice, and Databricks will automatically configure resources, provision an SQL endpoint, and transfer credentials. Once a secure connection has been established, users will be taken to Arcion directly where they can log in (or start a free trial).

Easy access to fully-managed data pipelines


Deploying pipelines and starting real-time data ingestion in Arcion only takes a few steps:

  • Select the Replication Mode
  • Choose a Source (launching with Oracle, Oracle Exadata, Oracle RAC, MySQL and Snowflake, with more sources arriving in the coming months). For the destination, Databricks is automatically pre-selected and pre-configured as the target.
  • Filter the data (schemas, tables, and columns)
  • Start replication

And that’s it. Once the replication completes, you can go into Databricks and view the ingested Delta tables in the Databricks Data Explorer, query them, or go straight to analytics in the Lakehouse.

Databricks and Arcion support some of the most demanding data requirements across a myriad of industries, AI-based or otherwise. From real-time fraud detection in finance to more accurate demand forecasting in retail, and hundreds of other use cases in between – Arcion + Databricks can boost your data strategy and results.

Arcion + Databricks for data-driven enterprises

Our partnership with Arcion transcends connectors and integrations: Databricks and Arcion share a common philosophy of greater data accessibility and improved data analytics. For instance, Arcion handles schema changes out of the box, requiring no user intervention. This helps mitigate data loss and eliminate downtime caused by pipeline-breaking schema changes by intercepting changes in the source database and propagating them while ensuring compatibility with the target’s schema evolution. Pairing this technology with Partner Connect’s automatic configuration helps enterprises unify data silos much faster and more reliably.

Try out Arcion for yourself (for free)

Not an existing Arcion user? No worries, Arcion offers a 14-day free trial so you can try out Partner Connect and start ingesting data into Databricks in real time right away. For a more detailed walkthrough of real-time data ingestion into Databricks via Partner Connect using Arcion, read this Arcion blog with a step-by-step breakdown.

--


Build Reliable Production Data and ML Pipelines With Git Support for Databricks Workflows


We are happy to announce native support for Git in Databricks Workflows, which enables our customers to build reliable production data and ML workflows using modern software engineering best practices. Customers can now use a remote Git reference as the source for tasks that make up a Databricks Workflow; for example, a notebook from the main branch of a repository on GitHub can be used in a notebook task. By using Git as the source of truth, customers eliminate the risk of accidental edits to production code. They also remove the overhead of maintaining a production copy of the code in Databricks and keeping it updated, and they improve reproducibility, as each job run is tied to a commit hash. Git support for Workflows is available in Public Preview and works with a wide range of Databricks-supported Git providers, including GitHub, GitLab, Bitbucket, Azure DevOps and AWS CodeCommit.

Customers have asked us for ways to harden their production deployments by only allowing peer-reviewed and tested code to run in production. Further, they have asked for the ability to simplify the automation and improve reproducibility of their workflows. Git support in Databricks Workflows has already helped numerous customers achieve these goals.

“Being able to tie jobs to a specific Git repo and branch has been super valuable. It has allowed us to harden our deployment process, instill more safeguards around what gets into production, and prevent accidental edits to prod jobs. We can now track each change that hits a job through the related Git commits and PRs.” – said Chrissy Bernardo, Lead Data Scientist at Disney Streaming

“We used the Databricks Terraform provider to define jobs with a git source. This feature simplified our CI/CD setup, replacing our previous mix of python scripts and Terraform code and relieved us of managing the ‘production’ copy. It also encourages good practices of using Git as a source for notebooks, which guarantees atomic changes of a collection of related notebooks” – said Edmondo Procu, Head of Engineering at Sapient Bio.

“Repos are now the gold standard for our mission critical pipelines. Our teams can efficiently develop in the familiar, rich notebook experience Databricks offers and can confidently deploy pipeline changes with Github as our source of truth – dramatically simplifying CI/CD. It is also straightforward to set up ETL workflows referencing Github artifacts without leaving the Databricks UI.” – says Anup Segu, Senior Software Engineer at YipitData

“We were able to reduce the complexity of our production deployments by a third. No more needing to keep a dedicated production copy and having a CD system, invoke APIs to update it.” – says Arash Parnia, Senior Data Scientist at Warner Music Group

Getting started

It takes just a few minutes to get started:

  1. First, add your Git provider personal access token (PAT) to Databricks. This can be done in the UI via Settings > User Settings > Git Integration, or programmatically via the Databricks Git credentials API.
  2. Next, create a job and specify a remote repository, a Git ref (branch, tag or commit) and the path to the notebook, relative to the root of the repository.

    A sample job creation, demonstrating one of the four simple steps involved with setting up the new Databricks feature for running notebook tasks against remote repositories.

    Designating a Git repository, demonstrating one of the four simple steps involved with setting up the new Databricks feature for running notebook tasks against remote repositories.

    These actions can also be performed via v2.1 and v2.0 of the Jobs API (see the sketch after these steps).

  3. Add more tasks to your job. Once you have added the Git reference, you can use the same reference for other notebook tasks in a job with multiple tasks.

    Adding more tasks to a job, demonstrating one of the four simple steps involved with setting up the new Databricks feature for running notebook tasks against remote repositories.

    Every notebook task in that job will now fetch the pre-defined commit/branch/tag from the repository on every run. For each run the Git commit SHA is logged, and it is guaranteed that all notebook tasks in a job run from the same commit.

    Please note that in a multitask job, there can’t be a notebook task that uses a notebook in the Databricks Workspace or Repos alongside another task that uses a remote repository. This restriction doesn’t apply to non-notebook tasks.

  4. Run the job and view its details.

    Running and viewing job details, demonstrating the last of four simple steps involved with setting up the new Databricks feature for running notebook tasks against remote repositories.

All Databricks notebook tasks in the job run from the same Git commit. For each run, the commit is logged and visible in the UI. You can also get this information from the Jobs API.
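To illustrate the API route mentioned in step 2, here is a hedged sketch of creating such a job via the Jobs API 2.1 using Python's requests library; the workspace URL, token, repository, notebook path and cluster settings are placeholders, and field names should be double-checked against the Jobs API documentation.

    import requests

    # Placeholders: set these for your workspace.
    DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
    TOKEN = "<personal-access-token>"

    # Create a job whose notebook task is sourced from a remote Git repository.
    # Field names follow the public Jobs API 2.1 docs; double-check them there.
    job_spec = {
        "name": "nightly-etl-from-git",
        "git_source": {
            "git_url": "https://github.com/<org>/<repo>",
            "git_provider": "gitHub",
            "git_branch": "main",
        },
        "tasks": [
            {
                "task_key": "etl",
                "notebook_task": {
                    "notebook_path": "notebooks/etl",  # relative to the repo root
                    "source": "GIT",
                },
                "new_cluster": {
                    "spark_version": "11.0.x-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "num_workers": 2,
                },
            }
        ],
    }

    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=job_spec,
    )
    resp.raise_for_status()
    print(resp.json())  # contains the new job_id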

Ready to get started? Take Git support in workflows for a spin or dive deeper with the below resources:

  • Dive deeper into Databricks Workflows documentation
  • Check out this code sample and the accompanying webinar recording showing an end-to-end notebook production flow using Git support in Databricks Workflows

--



Impacting the World with Data & AI: Announcing the Finalists for the 2022 Databricks Data Team for Good Award


The annual Databricks Data Team Awards recognize data teams who are harnessing the power of data and AI to deliver solutions for some of the world’s toughest problems.

Nearly 250 teams were nominated across six categories from all industries, regions, and companies – all with impressive stories about the work they are doing with data and AI. As we lead up to Data and AI Summit, we will be showcasing the finalists in each of the categories over the coming days.

The Data Team for Good Award salutes the data teams who are making a positive impact in the world, delivering solutions for global challenges — from healthcare to sustainability.

Meet the five finalists for the Data Team for Good Award category:

Cognoa
Cognoa’s mission is to enable earlier and more equitable access to care and improve the lives and outcomes of children living with behavioral health conditions. Research demonstrates that the sooner a diagnosis is made and interventions can begin, the more positive the outcome that can be achieved in a child’s life. To that end, Cognoa has developed the first FDA-authorized diagnosis aid, Canvas Dx, to help physicians diagnose or rule out autism in children as early as age 18 months through to 72 months. Powered by data and AI, the Software as a Medical Device leverages the Databricks Lakehouse Platform to tap into the power of AI and ML to help clinicians uncover the relationships between thousands of data points gathered from multiple video recordings and a questionnaire completed by the caregiver and by the physician, all to identify non-obvious patterns that point towards or away from autism. The result is an accurate and data-driven tool that empowers primary care providers to more efficiently diagnose or rule out autism in young children, enabling connection of children and families to appropriate therapy and supportive resources.

Karius
Karius has developed a liquid biopsy test for infectious diseases, using innovations across chemistry, data, and AI, to non-invasively detect over 1,000 pathogens from a single blood sample. The Karius Test, offered to hundreds of hospitals across the country, can help decrease the time and effort it takes clinicians to accurately diagnose an infection, without the need for an invasive diagnostic procedure or the application of slower, less-effective methods like a blood culture. To go beyond the diagnosis of an infection in a single patient, Karius is leveraging Databricks Lakehouse to unlock the promise of a new data type — microbial cell-free DNA — with AI to “see” patterns across infections, expanding from a few pathogens to the wider microbial landscape. The new capability allows Karius to identify novel biomarkers connecting microbes to opportunities across human health and disease. Furthermore, the organization has super-charged its biomarker discovery platform by developing a de-identified clinicogenomics database, which connects Karius molecular data to clinical data, empowering scientists and physicians to better interpret the patterns. Karius is now looking to apply its new data and AI capabilities beyond infectious disease, including opportunities across oncology, autoimmune disease, and response to therapy.

National Heavy Vehicle Regulator
The National Heavy Vehicle Regulator (NHVR) is on a mission to lower driver fatalities on Australian roads by mitigating risks associated with driver fatigue. NHVR leverages Databricks to use data and AI to provide preventative incident monitoring and insights from crash prediction models that help save lives. With Databricks Lakehouse, they are able to capture and analyze high volumes of data, such as 4.5 million monthly vehicle sightings from around the country, in real time to identify patterns that help predict risks and administer timely and effective intervention across a fleet of almost 1 million heavy commercial vehicles. NHVR is able to send real-time alerts to safety and compliance officers in the field to intercept vehicles potentially posing a danger to public safety. Among the many insights they gather and act on are vehicle weight, travel times, and the frequency and duration of driver breaks, all contributing to more effective regulation. In addition, the data team has enabled NHVR to create a more reliable crash prediction model by leveraging AI to identify vehicles and operators that have a higher probability of being involved in a fatal or serious incident.

Regeneron Genetics Center
The Regeneron Genetics Center (RGC) is on a mission to tap into the power of genomic data to bring new medicines to patients in need. But genomic and clinical data is highly decentralized and both difficult and costly to scale, which is why the RGC data team turned to the Databricks Lakehouse to help it scale its data systems from supporting thousands of patient participants to millions over only a few years. On top of Databricks Lakehouse, RGC has built one of the largest genomics databases in the world, and the ability to derive faster insights from this data has led to important discoveries in cardiovascular disease, obesity, immunology, oncology, COVID-19, and much more. The RGC has contributed to Glow, an open-source data toolkit that enables scaling genomic analyses to millions of samples contributed by research organizations across the world, resulting in new findings like determining genetic susceptibility to COVID-19.

US DoD Chief Data and Artificial Intelligence Office, Advana Program
Advana (a mash-up of the words “Advancing Analytics”) is a division of the US Department of Defense (DoD) that supports multiple missions – from defending US soil against foreign aggression, to addressing climate change, to protecting global citizens from the risks of the COVID-19 pandemic. They are leveraging Databricks Lakehouse to ethically provide a unified view of all their data and deliver actionable insights from the boardroom to the battlefield for those moment-to-moment responses that decision-makers need. Today, Advana offers more than 250 applications in production drawing from more than 390 data sources. The Lakehouse provides them with the right-time data, data tools, AI and ML enablers, and other self-service products to put the power of data in the hands of more than 30,000 users across many organizations. For instance, Advana continues to expand its COVID-19 analytic capabilities to help the DoD actively manage its ongoing response. They launched new functionality around HPCON tracking, travel and installation support, COVID cases, and PPE, as well as data that informed school opening decisions and vaccine dose administration. Disclaimer: receipt of this award does not constitute a DoD endorsement of Databricks, Booz Allen, or any other non-Federal entity.

Check out the award finalists in the other five categories and come raise a glass and celebrate these amazing data teams during an award ceremony at the Data and AI Summit on June 29.

--

Try Databricks for free. Get started today.

The post Impacting the World with Data & AI: Announcing the Finalists for the 2022 Databricks Data Team for Good Award appeared first on Databricks.

Your Guide to the Databricks Experience at 2022 Data & AI Summit


We’re just days away from this year’s Data + AI Summit! Whether you’re attending in-person or virtually, you will have access to hundreds of talks, training, demos, and workshops from June 27–30.

To help you get the best from the event, we’ve introduced the Databricks Experience (DBX), a curated set of content and sessions to help you learn more about the amazing innovation happening at Databricks. Designed to help you make the most of your time at the summit, DBX offers quick access to the subjects most relevant to your needs. Within DBX, you can learn from our product experts and customers on how to build the modern data stack with the data lakehouse.

With DBX, you’ll get access to handpicked content, including:

  • 25+ product deep-dive sessions covering the Databricks Lakehouse Platform, with sessions on Delta Lake, Unity Catalog, Databricks SQL, Delta Live Tables, MLflow and more.
  • 30+ training and certification sessions – level up your data and AI skills through training and new certification bundles designed to support all data roles and skillsets. Whether you’re new to the lakehouse architecture or a seasoned pro looking to dive deep, we have a course for you.
  • News on upcoming products and features.

DBX Sessions You Don’t Want to Miss

Here’s an overview of some of our most highly-anticipated DBX sessions that can help you build and implement a data lakehouse that makes all your data available for any number of data-driven use cases.

Destination Lakehouse: All your data, analytics and AI on one platform – In this session, learn how the Databricks Lakehouse Platform can meet your needs for every data and analytics workload, with examples of real-customer applications, reference architectures, and demos to showcase how you can create modern data solutions on your own.

Unified governance for your Data and AI assets on Lakehouse – Modern data assets take many forms: not just files or tables, but dashboards, ML models, and unstructured data like video and images, all of which cannot be governed and managed by legacy data governance solutions. In these curated sessions, learn how you can use Unity Catalog to centrally manage all data and AI assets with a common governance model based on familiar ANSI SQL, ensuring improved native performance and security. If you’re deploying Unity Catalog for your lakehouse, our governance experts will prepare you for a smooth governance implementation with tips, tricks and best practices. In a connected digital economy, data sharing and data collaboration have become important. Learn how Databricks Lakehouse Platform simplifies secure data sharing and enables data collaboration across organizations in a privacy-centric way.

Data warehousing is one of the most business-critical workloads for data teams, and the best data warehouse is a lakehouse. You will hear from experts about customer success stories, use cases, and best practices learned in the field, discover what’s under the hood of Databricks SQL, learn how to scale workloads with Databricks Serverless, and see how you can radically improve performance with Photon. Additionally, you can find sessions on how to ingest data into the lakehouse, and how to store and govern business-critical data at scale with curated data warehousing, SQL and BI workloads in the lakehouse.

Data engineering and streaming on the lakehouse. Come and learn how the lakehouse provides end-to-end data engineering and how you can realize the promise of streaming with the Databricks Lakehouse Platform. You will explore how data engineering and streaming solutions automate the complexity of building and maintaining pipelines and running ETL/streaming workloads so data engineers and analysts can focus on quality and reliability to drive valuable insights. You’ll also have the opportunity to dive into Delta Live Tables (DLT) so you can apply modern software engineering and management techniques to ETL. Following the launch of Databricks Workflows, we have also created a session introducing this fully managed orchestration service for all your workloads, built into the Databricks Lakehouse Platform.

Data science and machine learning (DSML) on the lakehouse. Learn how using a data-centric AI platform can simplify your production ML projects through the use of a feature store and see how easy it is to launch machine learning experiments on your data with only a few lines of code. Then take a walk through the “front door” of the lakehouse and experience how Databricks Notebooks enable users to explore, discover, and share data insights with each other.

Now that you’ve started using Databricks, your initial set of users, access controls, and workspace are configured. Everything is great, but you’re onboarding more and more teams to the platform. How do you make the most of the Databricks administration capabilities to operate at scale? Come to the admin session, where we will share best practices and tips for future-proofing your Databricks Account, helping you speed up onboarding for new users and give them a positive experience, while following best practices to secure and manage your Databricks workspace.

Simplify migration to the lakehouse – we know organizations are migrating from legacy on-premises Hadoop architectures to a data lakehouse architecture. Learn how we, at Databricks, formulated a successful migration methodology to help organizations minimize risks and simplify the process of migrating to Databricks Lakehouse Platform.

Lastly, as you have learned, the Databricks Lakehouse Platform is open and provides the flexibility to continue using existing infrastructure, to easily share data, and build your modern data stack with unrestricted access to the ecosystem of open source data projects and the broad Databricks partner network. Come and learn more about partners and the modern data stack as we take a deep dive into our technology.

Register for the Data + AI Summit to take advantage of all the Databricks Experience: sessions, training and certifications scheduled to take place at the summit.

Registration to attend virtually is free!

--

Try Databricks for free. Get started today.

The post Your Guide to the Databricks Experience at 2022 Data & AI Summit appeared first on Databricks.

Lakehouse for Financial Services Blueprints


The Lakehouse architecture has tremendous momentum and is being realized by hundreds of our customers. As a data-driven organization in the regulated industries segment, how often have you wondered whether there is a Lakehouse blueprint tailored to your unique security and industry needs? That blueprint has now arrived.

Databricks is excited to introduce a new set of automation templates to deploy a data lakehouse, specifically defined for Financial Services (FS). Lakehouse for FS Blueprints is a set of Terraform templates that incorporates best practices and patterns from over 600 FS customers. It is tailored for key security and compliance policies and provides FS-specific libraries as quickstarts for key use cases, including Regulatory Reporting and Post-Trade Analytics. You can now be up and running in a matter of minutes vs. weeks or months. All of this work builds upon the widely adopted Databricks Terraform provider, which is deployed at 1,000+ Databricks customers as of this writing.

The core components of the automated deployment templates include:

  • Secure connectivity for AWS, Azure and GCP.
  • Secure access to external cloud storage (AWS S3, Azure Blob Storage), configured to allow fine-grained access permissions based on the sensitivity of the data.
  • Creation of Databricks Groups tailored to personas across an FSI’s organization, with configurable restricted access that is useful for Personally Identifiable Information (PII) restrictions.
  • Pre-installed Libraries, Quickstarts and Clusters to handle key FS use cases, including data quality enforcement, data model schema enforcement and time series ETL packages.

The Databricks Lakehouse for Financial Services Blueprints: a set of Terraform templates tailored for key FinServ security and compliance policies that provides industry-specific libraries as quickstarts for key use cases, including Regulatory Reporting and Post-Trade Analytics.

Let’s walk through in more detail the key capabilities that the Lakehouse for FS Blueprints provides to accelerate your journey on deploying the Lakehouse architecture.

Security

FSIs have to deal with an increasing number and sophistication of security threats, as well as a continuously evolving regulatory landscape — and all of this is happening as the sheer volume (and importance) of data grows. For FSIs, ensuring data security, privacy and compliance is absolutely critical. Databricks has created Lakehouse for FS Blueprints to better incorporate key security and compliance policies directly into the deployment configuration.

Based on FSI adopters of the Databricks Lakehouse Platform, there are standard security best practices already established in the market. Banking, insurance, and capital markets firms demand features such as secure connectivity (no public IPs), secure communication via the cloud backbone (for example, PrivateLink on AWS), and well-defined data isolation between groups of users. In our Terraform templates, we have codified all of these best practices for automated deployment.

Data governance

As many FSIs build out their data lakehouse, they’re able to democratize their data and make it accessible throughout the organization. FSIs must know how sensitive data is processed and be able to control and audit access to it. To govern data lakes, administrators have often relied on cloud-vendor-specific security controls, such as IAM roles or role-based access control (RBAC) and file-oriented access control, to manage data. We have assumed the need for groups to restrict certain data classifications and encoded these in the workspace setup.

Note that Databricks has launched Unity Catalog in public preview, which brings fine-grained governance and security to lakehouse data using a familiar, open interface. Unity Catalog lets organizations manage fine-grained data permissions using standard ANSI SQL or a simple UI, enabling them to safely open their lakehouse for broad internal consumption. It works uniformly across clouds and data types. Finally, it goes beyond managing tables to govern other types of data assets, such as machine learning (ML) models and files. Thus, enterprises get a simple way to govern all their data and artificial intelligence (AI) assets. The Lakehouse for FS Blueprints will be updated to incorporate Unity Catalog when it is generally available.

Financial services quickstarts

We often hear that data teams and data leaders need to deliver value in weeks, not months or years. Data teams will often spend many weeks understanding the problem, before acquiring, integrating, and transforming the data. Only then can the data team begin developing, optimizing, and deploying models into production. This lag from identifying the need, working through potential solutions, finalizing an implementation, and seeing results takes away momentum from even the most important data science initiatives.

To help our customers overcome these challenges, Databricks created Python libraries, which help accelerate use cases in Financial Services. As part of the Lakehouse for FS Blueprints, we’ve pre-installed these libraries on a standard cluster, starting with two quickstarts to help enterprises get up to speed with best practices:

  • Waterbear: Waterbear can interpret enterprise-wide data models (such as regulatory reporting) and pre-provision tables, processes and data quality rules that accelerate the ingestion of production data and the development of production workflows. This allows FSIs to deploy their Lakehouse for Financial Services with resilient data pipelines and minimal development overhead. For more information, read this blog.
  • Tempo: Tempo is a set of time-series utilities that makes time-series processing simpler on Databricks. By combining the versatile nature of tick data, reliable data pipelines and Tempo, FSIs can unlock exponential value from a variety of use cases at minimal cost and with fast execution cycles. For more information, read this blog, or see the sketch after this list.
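
As an illustration only (not part of the blueprints themselves), here is a minimal sketch of how Tempo's TSDF wrapper is commonly used for an as-of join between trades and quotes in a Databricks notebook. The table and column names are made up, and parameter names may differ slightly between Tempo versions, so treat it as a sketch rather than a reference implementation.

# Minimal sketch of a Tempo as-of join; assumes the dbl-tempo package is installed
# and that "trades" and "quotes" Delta tables exist with an event_ts column and a
# symbol key (all hypothetical names).
from tempo import TSDF

trades = spark.table("trades")     # e.g. symbol, event_ts, trade_price
quotes = spark.table("quotes")     # e.g. symbol, event_ts, bid, ask

trades_tsdf = TSDF(trades, ts_col="event_ts", partition_cols=["symbol"])
quotes_tsdf = TSDF(quotes, ts_col="event_ts", partition_cols=["symbol"])

# Attach the latest quote at or before each trade's timestamp.
joined = trades_tsdf.asofJoin(quotes_tsdf, right_prefix="quote")
joined.df.display()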

This is only the beginning. As we continue to build out our portfolio and create standardized libraries for our existing Solution Accelerators, we will be adding new libraries or preconfigured clusters into the Lakehouse for FS Blueprints to provide an ever growing set of key capabilities for our FS customers.

Key benefits of Lakehouse for FS Blueprints

Built for Financial Services

Lakehouse for FS Blueprints are designed specifically to support the compliance and security needs of Financial Services. Key security and governance controls are implemented out of the box, based on patterns that we’ve seen across 600+ customers, giving you a starting point that you can build upon to configure additional policies as required.

Save time and resources through automation

With the Lakehouse for FS Blueprints, you don’t need to spend lots of time configuring Databricks. Instead, build upon an open source deployment framework and focus on the customizations that are unique to your company. Developers can move faster, and data migrations are less time consuming. You can now be up and running within a matter of minutes, instead of weeks or months.

Acceleration to value

Preconfigured clusters simplify and accelerate the deployment of core FS use cases to allow you and your business stakeholders to get to value faster. These libraries reduce your time spent in a variety of areas including data engineering, schema development, and model development. To address even more use cases, check out all of the Databricks Solution Accelerators, where you can easily download and import into your workspace.

Multi-cloud

Multi-cloud adoption is gaining momentum, and Gartner predicts that by 2022, 75% of enterprise customers using cloud infrastructure as a service (IaaS) will adopt a deliberate multi-cloud strategy. With that in mind, Databricks has created the Lakehouse for FS Blueprints for each major public cloud — AWS, Azure and GCP. You can avoid duplication of best practices across clouds and better ensure consistent deployment of the lakehouse across your clouds.

Getting Started

The Terraform modules can be downloaded from the Databricks GitHub repository under the Lakehouse for FS Blueprints project, and more detailed markdown documentation for the Terraform provider is available.

We’ve also included an instructional video that provides a step-by-step tutorial for deploying the Lakehouse for FS in a matter of minutes.

Lakehouse for FS Blueprints will continue to evolve as we incorporate new solution accelerators and Databricks capabilities as they become Generally Available, such as Unity Catalog for Governance and Delta Sharing. Stay tuned for related blog posts in the future, and the launch of new blueprints covering additional Industries.

Read more

 

Lakehouse for FS Blueprints

--

Try Databricks for free. Get started today.

The post Lakehouse for Financial Services Blueprints appeared first on Databricks.

Databricks Terraform Provider Is Now Generally Available


Today, we are thrilled to announce that Databricks Terraform Provider is generally available (GA)! HashiCorp Terraform is a popular open source infrastructure as code (IaC) tool for creating safe and reproducible cloud infrastructure across cloud providers.

The first Databricks Terraform Provider was released more than two years ago, allowing engineers to automate all management aspects of their Databricks Lakehouse Platform. Since then, adoption has grown more than tenfold.

Number of monthly downloads of the Databricks Terraform Provider, growing 10x from May 2021 to May 2022

More importantly, we also see significant growth in the number of customers using the Databricks Terraform Provider to manage their production and development environments:

Databricks Terraform provider customer adoption trend

Customers win with the Lakehouse as Code

There are multiple areas where Databricks customers successfully use the Databricks Terraform Provider.

Automating all aspects of provisioning the Lakehouse components and implementing DataOps/DevOps/MLOps

That covers multiple use cases – promotion of jobs between dev/staging/prod environments, making sure that upgrades are safe, creating reproducible environments for new projects/teams, and many other things.

In our joint DAIS talk, Scribd talked about how their data platform relies on Platform Engineering to put tools in the hands of developers and data scientists to “choose their own adventure”. With Databricks Terraform Provider, they can offer their internal customers flexibility without acting as gatekeepers. Just about anything they might need in Databricks is a pull request away.

Other customers are also very complimentary about their usage of Databricks Terraform Provider: “swift structure replication”, “maintaining compliance standards”, “allows us to automate everything”, “democratized or changed to reduce the operational burden from our SRE team”.

Implementing an automated disaster recovery strategy

Disaster recovery is a must-have for all regulated industries and for any company that realizes the importance of data accessibility. Terraform plays a significant role in making sure that failover processes are correctly automated and will be performed in a predictable amount of time, without the errors that are common when there is no automation in place.

For example, illimity’s data platform is centered on Azure Databricks and its functionalities as noted in our previous blog post. They designed a data platform DR scenario using the Databricks Terraform Provider, ensuring RTOs and RPOs required by the regulatory body at illimity and Banca d’Italia (Italy’s central bank). Stay tuned for more detailed blogs on disaster recovery preparation with our Terraform integration!

Implementing secure solutions

Security is a crucial requirement in the modern world. But making sure that your data is secure is not a simple task, especially for regulated industries. These solutions have many requirements, such as preventing data exfiltration and controlling users’ access to data.

Let’s take deployment of a workspace on AWS with data exfiltration protection as an example. It is generally recommended that customers deploy multiple Databricks workspaces in a hub-and-spoke topology reference architecture, powered by AWS Transit Gateway. As documented in our previous blog, this setup goes through the AWS UI and includes several manual steps. With the Databricks Terraform Provider, it can now be automated and deployed in just a few steps with a detailed guide.

Enabling data governance with Unity Catalog

Databricks Unity Catalog brings fine-grained governance and security to lakehouse data using a familiar, open interface. Combining Databricks Terraform Provider and Unity Catalog enables customers to govern their Lakehouse with ease and at scale through automation. And this is very critical for big enterprises.

Provider quality and support

The provider is now officially supported by Databricks and has established issue tracking through GitHub. Pull requests are always welcome. Code undergoes heavy integration testing for each release and has significant unit test coverage.

Terraform supports management of all Databricks resources and underlying cloud infrastructure

Most used resources

The use cases described above are the most typical ones – there are many different things that our customers implement using the Databricks Terraform Provider. The diagram below gives another view of which resource types are most often managed by customers via Terraform.

Percentage of customers using specific resources / data sources in the Databricks Terraform Provider

Migrating to GA version of Databricks Terraform Provider

To make Databricks Terraform Provider generally available, we have moved it from https://github.com/databrickslabs to https://github.com/databricks. We have worked closely with the Terraform Registry team at Hashicorp to ensure a smooth migration. Existing terraform deployments continue to work as expected without any action from your side. You should have a .terraform.lock.hcl file in your state directory that is checked into source control. terraform init will give you the following warning:

Warning: Additional provider information from registry

The remote registry returned warnings for registry.terraform.io/databrickslabs/databricks:
- For users on Terraform 0.13 or greater, this provider has moved to databricks/databricks. Please update your source in required_providers.

After you replace databrickslabs/databricks with databricks/databricks in the required_providers block, the warning will disappear. Do a global “search and replace” in *.tf files. Alternatively, you can run python3 -c "$(curl -Ls https://dbricks.co/updtfns)" from the command line, which will do all the boring work for you.

However, you may run into one of the following problems when running the terraform init command: “Failed to install provider” or “Failed to query available provider packages”. This is because you didn’t check in .terraform.lock.hcl to source code version control.

Error: Failed to install provider

Error while installing databrickslabs/databricks: v1.0.0: checksum list has no SHA-256 hash for "https://github.com/databricks/terraform-provider-databricks/releases/download/v1.0.0/terraform-provider-databricks_1.0.0_darwin_amd64.zip"

You can fix it by following three simple steps:

  • Replace databrickslabs/databricks with databricks/databricks in all your .tf files via python3 -c "$(curl -Ls https://dbricks.co/updtfns)" command.
  • Run the terraform state replace-provider databrickslabs/databricks databricks/databricks command and approve the changes. See Terraform CLI docs for more information.
  • Run terraform init to verify everything is working.

That’s it. The terraform apply command should work as expected now.

You can check out the Databricks Terraform Provider documentation, and start automating your Databricks lakehouse management with Terraform beginning with this guide and examples repository. If you already have an existing Databricks setup not managed through Terraform, you can use the experimental exporter functionality to get a starting point.

Going forward, our engineers will continue to add Terraform Provider support for new Databricks features, as well as new modules, templates and walkthroughs.

--

Try Databricks for free. Get started today.

The post Databricks Terraform Provider Is Now Generally Available appeared first on Databricks.

Guide to Delta Lake Sessions at Data + AI Summit 2022


Looking to learn more about Delta Lake? Want to see what’s the latest development in the project? Want to engage with other community members? If so, we invite you to attend this year’s Data + AI Summit! This global event brings together thousands of practitioners, industry leaders, and visionaries to engage in thought-provoking dialogue and share the latest innovations in data and AI.

At this year’s summit, we are very excited to have visionaries from the Data and AI community, including Andrew Ng, Zhamak Dehghani, Christopher Manning, Matei Zaharia, Tarika Barrett, Peter Norvig, Daphne Koller, and Ali Ghodsi, as well as companies that are building innovative products, such as NASDAQ, Scribd, and Apple. They will share how they are leveraging Delta Lake to solve high-impact data-driven use cases that can benefit any organization. From learning how to deliver interactive analytics at a massive scale to solving healthcare price transparency to modernizing big data for finance and more – this conference will provide high-value insights for all technical and business-focused stakeholders.

Events for the Delta Lake Community!

Whether you are an active contributor, a regular user of Delta Lake, or just curious about the fast-growing Delta Lake community, we invite you to these community-focused events at the Data + AI Summit. This is a great opportunity for the community to come together, celebrate and engage with the project maintainers and leading contributors. Don’t forget to tune into Day 1 Opening Keynote on Delta Lake by Michael Armbrust scheduled for Tuesday, June 28 at 8:30 AM PST. Then head over to the following sessions:

Ask Me Anything: Delta Lake
Tuesday, June 28 @10:30 AM PST
Speakers: Allison Portis, Tathagata Das, Ryan Zhu, Scott Sandre, Vini Jaiswal (Delta Lake)

Bring any and all questions regarding Delta Lake. Are you curious about the Delta Lake roadmap, upcoming features, and recent releases by the community? This is an in-person version of our Delta Lake Community Office Hours. In this rapid-fire question format, our panel will field your toughest questions! That’s right, ask them anything about Delta Lake and how to engage with the community!

Ask Me Anything: Delta Lake Committers
Wednesday, June 29 @11:40 AM PST
Speakers: Christian Williams, R. Tyler Croy and QP Hou (delta-rs)

This is your chance to get your questions answered and learn about what others are asking. In this AMA-style session, we are bringing a panel of Delta Lake Committers – and Rust programming experts too! That’s right! So join them and ask away your questions!

Delta Lake Contributor Meetup with Delta Lake Birthday Party
Wednesday, June 29 @6:30 PM PST

Featured Guests: Dominique Brezinski (Apple) and Michael Armbrust (Databricks)

It’s a Delta Lake birthday party, so come meet and greet with Delta Lake contributors and committers on all things data engineering, data architecture, and Delta Lake. But we’re not here just to enjoy the festivities – come with your technical questions, as we will have multiple panels ready to answer them. You will have the opportunity to learn more about how Delta Lake started with a fireside chat between Dominique Brezinski from Apple and Michael Armbrust from Databricks.

Can’t-Miss Sessions Featuring Delta Lake

The volumes of data that are being collected and stored for analysis and to drive decisions are reaching levels that make it difficult for even the most seasoned data engineering and data science teams to manage and extract insights. With the advent of Delta Lake, data that was previously locked inside data lakes or proprietary data warehouses can be processed and operationalized, turning data into insights quickly and reliably.

Here are five sessions that put Delta Lake front and center and are sure to capture the attention of data scientists or ML engineers interested in maximizing the value of their data lake.

Diving into Delta Lake Integrations, Features, and Roadmap
Thursday, June 30 @9:15 AM

  • Tathagata Das, Databricks
  • Denny Lee, Databricks

The Delta ecosystem rapidly expanded with the release of Delta Lake 1.2, which included a variety of integrations and feature updates. Join this session to learn how the wider Delta community collaborated to bring these features and integrations together, as well as the current roadmap. This will be an interactive session, so come prepared with your questions – we should have answers!

Delta Lake, the Foundation of Your Lakehouse
Tuesday, June 28 @2:05 PM

  • Himanshu Raja, Databricks
  • Hagai Attias, Akamai Technologies

Delta Lake is the open source storage layer that makes the Databricks Lakehouse Platform possible by adding reliability, performance, and scalability to your data, wherever it is located. Join this session for an inside look at what is under the hood of Databricks – see how Delta Lake, by adding ACID transactions and versioning to Parquet files together with the Photon engine, provides customers with huge performance gains and the ability to address new challenges. This session will include a demo and overview of customer use cases unlocked by Delta Lake and the benefits of running Delta Lake workloads on Databricks.

Ensuring Correct Distributed Writes to Delta Lake in Rust with Formal Verification
Tuesday, June 28 @4:00 PM PST

  • QP Hou, Neuralink

Rust ensures memory safety, but bugs can still make it into implementations, so what can be done to avoid this? In this session, the concept of common, formal verification methods used in distributed system designs and implementations will be reviewed, as well as the use of TLA+ and stateright to formally model delta-rs multi-writer S3 backend implementation. Learn how the combination of both Rust and formal verification in this way results in an efficient, native Delta Lake implementation that is free of both memory and logical bugs!

A Modern Approach to Big Data for Finance
Wednesday, June 29 @2:05 PM PDT

  • Bill Dague, Nasdaq
  • Leonid Rosenfeld, Nasdaq

There are unique challenges associated with working with big data for applications in finance including the impact of high data volumes, disparate storage, variable sharing protocols, and more. By leveraging open source technologies such as Databricks’ Delta Sharing, in combination with a flexible data management stack, organizations can be more nimble and innovative with their analytics strategies. See this in action with a live demonstration of Delta Sharing in combination with Nasdaq Data Fabric.

Near Real-Time Analytics with Event Streaming, Live Tables, and Delta Sharing
Thursday, June 30 @10:45 AM PDT

  • Christina Taylor, Carvana

While microservice architectures are embraced by application teams because they enable services to be developed and scaled independently and in a distributed fashion, data teams must take a consolidated approach with centralized repositories where data from these services actually comes together to be joined and aggregated for analysis. Learn how these approaches can work together through a streaming architecture leveraging Delta Live Tables and Delta Sharing to enable near real-time analytics and business intelligence – even across clouds.

Additional Sessions Featuring Delta Lake

Journey to Solving Healthcare Price Transparency with Databricks and Delta Lake
Tuesday, June 28 @10:45 AM PDT

  • Ross Silberquit, Cigna
  • Narayanan Hariharasubramanian, Cigna

The US Government’s Price Transparency mandate requires the healthcare industry to generate Machine-Readable Files (MRF) of different types for different procedures, which involves ingesting, transforming, aggregating, and hosting terabytes of data in the public domain. Join this session and learn how Cigna was able to deliver on this mandate with the help of Databricks running on AWS, Delta Lake, and Apache Spark.

Streaming ML Enrichment Framework Using Advanced Delta Table Features
Tuesday, June 28 @10:45 AM PDT

  • Peter Vasko, Socialbakers

The challenge for Socialbakers’ marketing SaaS platform was how to build a scalable framework for data scientists and ML engineers that could accommodate hundreds of generic or customer-specific ML models, running both in streaming and batch, capable of processing 100+ million records per day from customer social media networks. The goal was achieved using Apache Spark™, Delta Lake, and clever usage of Delta Table features.
In this session we will share the ideas behind the framework and how to efficiently combine Spark structured streaming and Delta Tables.

The Road to a Robust Data Lake: Utilizing Delta Lake and Databricks to Map 150 Million Miles of Roads a Month
Tuesday, June 28 @11:30 AM PDT

  • Itai Yaffe, Databricks
  • Ofir Kerker, Nexar

In the past, stream processing over data lakes required a lot of development efforts from data engineering teams. Today, with Delta Lake and Databricks Auto Loader, this becomes a few minutes’ work and unlocks a new set of ways to efficiently leverage your data.
In this talk, learn how Nexar, a leading provider of dynamic mapping solutions, utilizes Delta Lake and advanced features such as Auto Loader to map 150 million miles of roads a month and provide meaningful insights to cities, mobility companies, driving apps, and insurers.

Automate Your Delta Lake or Practical Insights on Building Distributed Data Mesh
Tuesday, June 28 @2:05 PM PDT

  • Serge Smertin, Databricks

We all live in exciting times amid the hype of the Distributed Data Mesh (or just mess!). In this session, we will discuss a couple of architectural and organizational approaches for achieving a Distributed Data Mesh, which is essentially a combination of mindset, fully automated infrastructure, continuous integration for data pipelines, dedicated team collaborative environments, and security enforcement. This should appeal to data leaders, data scientists, and anyone interested in DevOps.

Self-Serve, Automated and Robust CDC pipeline using AWS DMS, DynamoDB Streams and Delta Lake
Tuesday, June 28 @ 2:05 PM PDT

  • Dibyendu Karmakar, Swiggy

In this session, learn how the team at Swiggy designed and developed a CDC-based system to solve the challenges of ingesting transactional data into Delta Lake, deal with late-arriving updates and deletes, enable near real-time availability of data, eliminate bulk ingestion, and optimize costs.

Productionizing Ethical Credit Scoring Systems with Delta Lake, Feature Store and MLFlow
Tuesday, June 28 @4:00 PM PDT

  • Jeanne Choo, Databricks

Although Fairness, Ethics, Accountability, and Transparency (FEAT) are must-haves for high-stakes machine learning models (eg. credit scoring systems), a lack of concrete guidelines, common standards, and technical templates make productionizing responsible AI systems challenging. In this talk, we demonstrate how an open-source code example of a responsible credit scoring application, developed by the Monetary Authority of Singapore’s Veritas Consortium, might be put into production using tools such as Delta Lake and MLflow.

Enabling BI in a Lakehouse Environment: How Spark and Delta Can Help With Automating a DWH Development
Wednesday, June 29 @10:45 AM PDT

  • Ivana Pejeva, element61
  • Yoshi Coppens, element61

Traditional data warehouses typically struggle when it comes to handling large volumes of data and traffic, particularly when it comes to unstructured data. By contrast, data lakes overcome such issues and have become the central hub for storing data. In this session, learn how the team at element61 uses a framework that includes Apache Spark™ and Delta Lake, to bridge BI with modern-day use cases, such as machine learning and real-time analytics. The session outlines the original challenges, the steps taken, and the technical hurdles that were faced.

Streaming Data into Delta Lake with Rust and Kafka
Wednesday, June 29 @11:30 AM PDT

  • Christian Williams, Scribd

The future of Scribd’s data platform is trending toward real-time. A notable challenge has been streaming data into Delta Lake in a fast, reliable, and efficient manner. To help address this problem, Scribd developed two foundational open source projects: delta-rs and kafka-delta-ingest. Join this session for a closer look at the architecture of kafka-delta-ingest, and how it fits into a larger, real-time data ecosystem at Scribd.

Lakehouses: A portmanteau of Data Lakes and Data Warehouses
Thursday, June 30 @8:30 AM PDT

  • Vini Jaiswal, Databricks

A lakehouse architecture combines the data management capabilities of the data warehouse – including reliability, integrity, and quality – with the low cost and open approach of data lakes, and supports workloads from different data domains, including advanced analytics and artificial intelligence. Data practitioners will learn the core concepts of building an efficient lakehouse with Delta Lake.

DBA Perspective—Optimizing Performance Table-by-Table
Thursday, June 30 @9:15 AM PDT

  • Douglas Moore, Databricks

As a DBA for your organization’s lakehouse, it’s your job to stay on top of performance & cost optimization techniques. In this session, learn how to use the available Delta Lake tools to tune your jobs and optimize your tables.

Discover Data Lakehouse With End-to-End Lineage
Thursday, June 30 @10:00 AM PDT

  • Tao Feng, Databricks

Data lineage is key for managing change, ensuring data quality, and implementing data governance in an organization. In this talk, we will cover how to capture table and column lineage for Spark, Delta Lake, and Unity Catalog, and how users can leverage data lineage to serve various use cases.

GIS Pipeline Acceleration with Apache Sedona
Thursday, June 30 @10:00 AM PDT

  • Alihan Zihna, CKDelta
  • Fernando Ayuso Palacios, CKDelta (Hutchison Group)

CKDelta ingests and processes a massive amount of geospatial data to deliver insights for their customers. Using Apache Sedona together with Databricks, they have been able to accelerate their data pipelines many times over. In this session, learn how the CKDelta data team migrated their existing data pipelines to Sedona and PySpark and the pitfalls encountered along the way.

Expert Training featuring Delta Lake

Take your understanding of Delta Lake to the next level. Check out the following training session designed to broaden your experience with and usage of Delta Lake features and functionality.

Training: Lakehouse with Delta Lake Deep Dive
Monday, June 27 @8:00 AM PDT, and @1:00 PM PDT
Thursday, June 30 @ 8:00 AM PDT, and @1:00 PM PDT

  • Audience: All Audiences
  • Duration: 1 half-day
  • Hands-on labs: No

Want to develop your expertise on building end-to-end data pipelines using Delta Lake?
In this course, you will learn about applying software engineering principles with Databricks and how to build end-to-end OLAP data pipelines using Delta Lake for batch and streaming data. The course also discusses serving data to end-users through aggregate tables and Databricks SQL Analytics.

Prerequisites:

  • Familiarity with data engineering concepts
  • Basic knowledge of Delta Lake core features and use cases

Sign-up for Delta Lake Talks at Summit!

Make sure to register for the Data + AI Summit to take advantage of all the amazing sessions and training featuring Delta Lake. Registration is free!

And… be engaged in the Delta Lake Community beyond the Summit!

Your active participation doesn’t have to be limited to the Summit. If you want to stay connected beyond the summit, we have active GitHub, Slack, Google Group, Linux Foundation chapter, YouTube, Community Office Hours, Twitter, and LinkedIn channels where you can connect with the community, participate in the discussions, get help on getting started with Delta Lake, or start contributing to the project with a good first issue. We hope to see you on one of these channels.

--

Try Databricks for free. Get started today.

The post Guide to Delta Lake Sessions at Data + AI Summit 2022 appeared first on Databricks.

Architecting MLOps on the Lakehouse


Here at Databricks, we have helped thousands of customers put Machine Learning (ML) into production. Shell has over 160 active AI projects saving millions of dollars; Comcast manages 100s of machine learning models with ease with MLflow; and many others have built successful ML-powered solutions.

Before working with us, many customers struggled to put ML into production—for a good reason: Machine Learning Operations (MLOps) is challenging. MLOps involves jointly managing code (DevOps), data (DataOps), and models (ModelOps) in their journey towards production. The most common and painful challenge we have seen is a gap between data and ML, often split across poorly connected tools and teams.

To solve this challenge, Databricks Machine Learning builds upon the Lakehouse architecture to extend its key benefits—simplicity and openness—to MLOps.

Our platform simplifies ML by defining a data-centric workflow that unifies best practices from DevOps, DataOps, and ModelOps. Machine learning pipelines are ultimately data pipelines, where data flows through the hands of several personas. Data engineers ingest and prepare data; data scientists build models from data; ML engineers monitor model metrics; and business analysts examine predictions. Databricks simplifies production machine learning by enabling these data teams to collaborate and manage this abundance of data on a single platform, instead of silos. For example, our Feature Store allows you to productionize your models and features jointly: data scientists create models that are “aware” of what features they need so that ML engineers can deploy models with simpler processes.
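
As a minimal sketch of what that joint productionization looks like with the Databricks Feature Store Python client (table, column and model names below are made up, and exact client signatures should be checked against the Feature Store documentation for your runtime):

# Minimal sketch: train a model against Feature Store features so that the logged
# model carries the metadata describing which features it needs. Runs in a
# Databricks notebook where `spark` is available; all names are hypothetical.
import mlflow
from sklearn.linear_model import LogisticRegression
from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

lookups = [
    FeatureLookup(
        table_name="prod.customer_features",          # existing feature table
        feature_names=["num_purchases_30d", "avg_basket_value"],
        lookup_key="customer_id",
    )
]

labels_df = spark.table("prod.churn_labels")           # customer_id, churned
training_set = fs.create_training_set(labels_df, feature_lookups=lookups, label="churned")

train_pdf = training_set.load_df().toPandas()
X = train_pdf.drop(["churned", "customer_id"], axis=1)
y = train_pdf["churned"]
model = LogisticRegression().fit(X, y)

# Logging through the Feature Store client packages the feature lookups with the
# model, so downstream scoring jobs can resolve the right features automatically.
with mlflow.start_run():
    fs.log_model(model, "model", flavor=mlflow.sklearn,
                 training_set=training_set,
                 registered_model_name="churn_classifier")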

The Databricks approach to MLOps is built on open industry-wide standards. For DevOps, we integrate with Git and CI/CD tools. For DataOps, we build upon Delta Lake and the lakehouse, the de facto architecture for open and performant data processing. For ModelOps, we build upon MLflow, the most popular open-source tool for model management. This foundation in open formats and APIs allows our customers to adapt our platform to their diverse requirements. For example, customers who centralize model management around our MLflow offering may use our built-in model serving or other solutions, depending on their needs.

We are excited to share our MLOps architecture in this blog post. We discuss the challenges of joint DevOps + DataOps + ModelOps, overview our solution, and describe our reference architecture. For deeper dives, download The Big Book of MLOps and attend MLOps talks at the upcoming Data+AI Summit 2022.

Building MLOps on top of a lakehouse platform helps to simplify the joint management of code, data and models.

Jointly managing code, data, and models

MLOps is a set of processes and automation to manage code, data, and models to meet the two goals of stable performance and long-term efficiency in ML systems. In short, MLOps = DevOps + DataOps + ModelOps.

Development, staging and production

In their journey towards business- or customer-facing applications, ML assets (code, data, and models) pass through a series of stages. They need to be developed (“development” stage), tested (“staging” stage), and deployed (“production” stage). This work is done within execution environments such as Databricks workspaces.

All the above—execution environments, code, data and models—are divided into dev, staging and prod. These divisions can be understood in terms of quality guarantees and access control. Assets in development may be more widely accessible but have no quality guarantees. Assets in production are generally business critical, with the highest guarantees of testing and quality but with strict controls on who can modify them.

MLOps requires jointly managing execution environments, code, data and models. All four are separated into dev, staging and prod stages.

Key challenges

The above set of requirements can easily explode in complexity: how should one manage code, data and models, across development, testing and production, across multiple teams, with complications like access controls and multiple technologies in play? We have observed this complexity leading to a few key challenges.

Operational processes
DevOps ideas do not directly translate to MLOps. In DevOps, there is a close correspondence between execution environments, code and data; for example, the production environment only runs production-level code, and it only produces production-level data. ML models complicate the story, for model and code lifecycle phases often operate asynchronously. You may want to push a new model version before pushing a code change, and vice versa. Consider the following scenarios:

  • To detect fraudulent transactions, you develop an ML pipeline that retrains a model weekly. You update the code quarterly, but each week a new model is automatically trained, tested and moved to production. In this scenario, the model lifecycle is faster than the code lifecycle.
  • To classify documents using large neural networks, training and deploying the model is often a one-time process due to cost. But as downstream systems change periodically, you update the serving and monitoring code to match. In this scenario, the code lifecycle is faster than the model lifecycle.

Collaboration and management
MLOps must balance the need for data scientists to have flexibility and visibility to develop and maintain models with the conflicting need for ML engineers to have control over production systems. Data scientists need to run their code on production data and to see logs, models, and other results from production systems. ML engineers need to limit access to production systems to maintain stability and sometimes to preserve data privacy. Resolving these needs becomes even more challenging when the platform is stitched together from multiple disjoint technologies that do not share a single access control model.

Integration and customization
Many tools for ML are not designed to be open; for example, some ML tools export models only in black-box formats such as JAR files. Many data tools are not designed for ML; for example, data warehouses require exporting data to ML tools, raising storage costs and governance headaches. When these component tools are not based on open formats and APIs, it is impossible to integrate them into a unified platform.

Simplifying MLOps with the Lakehouse

To meet the requirements of MLOps, Databricks built its approach on top of the Lakehouse architecture. Lakehouses unify the capabilities from data lakes and data warehouses under a single architecture, where this simplification is made possible by using open formats and APIs that power both types of data workloads. Analogously, for MLOps, we offer a simpler architecture because we build MLOps around open data standards.

Before we dive into the details of our architectural approach, we explain it at a high level and highlight its key benefits.

Operational processes
Our approach extends DevOps ideas to ML, defining clear semantics for what “moving to production” means for code, data and models. Existing DevOps tooling and CI/CD processes can be reused to manage code for ML pipelines. Feature computation, inference, and other data pipelines follow the same deployment process as model training code, simplifying operations. A designated service—the MLflow Model Registry—permits code and models to be updated independently, solving the key challenge in adapting DevOps methods to ML.
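
For instance, once a new model version has passed its checks, it can be promoted through registry stages with the MLflow client, without any change to the pipeline code that trains or serves it. The model name and version below are placeholders; this is a sketch, not a prescribed workflow.

# Minimal sketch: promote a validated model version independently of any code release.
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="churn_classifier",          # hypothetical registered model
    version="7",                      # the version that just passed validation
    stage="Production",
    archive_existing_versions=True,   # retire whatever was serving before
)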

Collaboration and management
Our approach is based on a unified platform that supports data engineering, exploratory data science, production ML and business analytics, all underpinned by a shared lakehouse data layer. ML data is managed under the same lakehouse architecture used for other data pipelines, simplifying hand-offs. Access controls on execution environments, code, data and models allow the right teams to get the right levels of access, simplifying management.

Integration and customization
Our approach is based on open formats and APIs: Git and related CI/CD tools, Delta Lake and the Lakehouse architecture, and MLflow. Code, data and models are stored in your cloud account (subscription) in open formats, backed by services with open APIs. While the reference architecture described below can be fully implemented within Databricks, each module can be integrated with your existing infrastructure and customized. For example, model retraining may be fully automated, partly automated, or manual.

Reference architecture for MLOps

We are now ready to review a reference architecture for implementing MLOps on the Databricks Lakehouse platform. This architecture—and Databricks in general—is cloud-agnostic, usable on one or multiple clouds. As such, this is a reference architecture meant to be adapted to your specific needs. Refer to The Big Book of MLOps for more discussion of the architecture and potential customization.

Overview

This architecture explains our MLOps process at a high level. Below, we describe the architecture’s key components and the step-by-step workflow to move ML pipelines to production.


This diagram illustrates the high-level MLOps architecture across dev, staging and prod environments.

Components

We define our approach in terms of managing a few key assets: execution environments, code, data and models.

Execution environments are where models and data are created or consumed by code. Environments are defined as Databricks workspaces (AWS, Azure, GCP) for development, staging, and production, with workspace access controls used to enforce separation of roles.
In the architecture diagram, the blue, red and green areas represent the three environments.

Within environments, each ML pipeline (small boxes in the diagram) runs on compute instances managed by our Clusters service (AWS, Azure, GCP). These steps may be run manually or automated via Workflows and Jobs (AWS, Azure, GCP). Each step should by default use a Databricks Runtime for ML with preinstalled libraries (AWS, Azure, GCP), but it can also use custom libraries (AWS, Azure, GCP).

Code defining ML pipelines is stored in Git for version control. ML pipelines can include featurization, model training and tuning, inference, and monitoring. At a high level, “moving ML to production” means promoting code from development branches, through the staging branch (usually `main`), and to release branches for production use. This alignment with DevOps allows users to integrate existing CI/CD tools. In the architecture diagram above, this process of promoting code is shown at the top.

When developing ML pipelines, data scientists may start with notebooks and transition to modularized code as needed, working in Databricks or in IDEs. Databricks Repos integrate with your git provider to sync notebooks and source code with Databricks workspaces (AWS, Azure, GCP). Databricks developer tools let you connect from IDEs and your existing CI/CD systems (AWS, Azure, GCP).

Data is stored in a lakehouse architecture, all in your cloud account. Pipelines for featurization, inference and monitoring can all be treated as data pipelines. For example, model monitoring should follow the medallion architecture of progressive data refinement from raw query events to aggregate tables for dashboards. In the architecture diagram above, data are shown at the bottom as general “Lakehouse” data, hiding the division into development-, staging- and production-level data.

By default, both raw data and feature tables should be stored as Delta tables for performance and consistency guarantees. Delta Lake provides an open, efficient storage layer for structured and unstructured data, with an optimized Delta Engine in Databricks (AWS, Azure, GCP). Feature Store tables are simply Delta tables with additional metadata such as lineage (AWS, Azure, GCP). Raw files and tables are under access control that can be granted or restricted as needed.

Models are managed by MLflow, which allows uniform management of models from any ML library, for any deployment mode, both within and without Databricks. Databricks provides a managed version of MLflow with access controls, scalability to millions of models, and a superset of open-source MLflow APIs.

In development, the MLflow Tracking server tracks prototype models along with code snapshots, parameters, metrics, and other metadata (AWS, Azure, GCP). In production, the same process saves a record for reproducibility and governance.

For continuous deployment (CD), the MLflow Model Registry tracks model deployment status and integrates with CD systems via webhooks (AWS, Azure, GCP) and via APIs (AWS, Azure, GCP). The Model Registry service tracks model lifecycles separately from code lifecycles. This loose coupling of models and code provides flexibility to update production models without code changes, and vice versa. For example, an automated retraining pipeline can train an updated model (a “development” model), test it (“staging” model) and deploy it (“production” model), all within the production environment.
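
As a rough illustration of how a CD job might promote a model through these Registry stages, here is a minimal sketch using the MLflow client API; the model name and version are hypothetical placeholders.

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote a newly registered version to Staging for validation tests
# (model name and version are placeholders for illustration).
client.transition_model_version_stage(name="fraud_model", version="3", stage="Staging")

# ... run validation tests against the Staging version here ...

# If the tests pass, promote to Production and archive the previous version.
client.transition_model_version_stage(
    name="fraud_model", version="3", stage="Production", archive_existing_versions=True
)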

The summary below describes the semantics of “development,” “staging” and “production” for code, data and models.

Code
  • Semantics of dev/staging/prod: Dev = untested pipelines; Staging = pipeline testing; Prod = pipelines ready for deployment.
  • Management: ML pipeline code is stored in Git, separated into dev, staging and release branches.
  • Relation to execution environments: The prod environment should only run prod-level code. The dev environment can run code of any level.

Data
  • Semantics of dev/staging/prod: “Dev” data means data produced in the dev environment (and likewise for staging and prod).
  • Management: Data sits in the Lakehouse, shareable as needed across environments via table access controls or cloud storage permissions.
  • Relation to execution environments: Prod data may be readable from the dev or staging environments, or it may be restricted to meet governance requirements.

Models
  • Semantics of dev/staging/prod: Dev = new model; Staging = testing versus current prod models; Prod = model ready for deployment.
  • Management: Models are stored in the MLflow Model Registry, which provides access controls.
  • Relation to execution environments: Models can go through their dev -> staging -> prod lifecycle within each environment.

Workflow

With the main components of the architecture explained above, we can now walk through the workflow of taking ML pipelines from development to production.

Development environment: Data scientists primarily operate in the development environment, building code for ML pipelines which may include feature computation, model training, inference, monitoring, and more.

  1. Create dev branch: New or updated pipelines are prototyped on a development branch of the Git project and synced with the Databricks workspace via Repos.
  2. Exploratory data analysis (EDA): Data scientists explore and analyze data in an interactive, iterative process using notebooks, visualizations, and Databricks SQL.
  3. Feature table refresh: Featurization logic is encapsulated as a pipeline which can read from the Feature Store and other Lakehouse tables and which writes to the Feature Store. Feature pipelines may be managed separately from other ML pipelines, especially if they are owned by separate teams.
  4. Model training and other pipelines: Data scientists develop these pipelines either on read-only production data or on redacted or synthetic data. In this reference architecture, the pipelines (not the models) are promoted towards production; see the full whitepaper for discussion of promoting models when needed.
  5. Commit code: New or updated ML pipelines are committed to source control. Updates may affect a single ML pipeline or many at once.

Staging environment: ML engineers own the staging environment, where ML pipelines are tested.

  1. Merge (pull) request: A merge request to the staging branch (usually the “main” branch) triggers a continuous integration (CI) process.
  2. Unit tests (CI): The CI process first runs unit tests which do not interact with data or other services.
  3. Integration tests (CI): The CI process then runs longer integration tests which test ML pipelines jointly. Integration tests which train models may use small data or few iterations for speed.
  4. Merge: If the tests pass, the code can be merged to the staging branch.
  5. Cut release branch: When ready, the code can be deployed to production by cutting a code release and triggering the CI/CD system to update production jobs.

Production environment: ML engineers own the production environment, where ML pipelines are deployed.

  1. Feature table refresh: This pipeline ingests new production data and refreshes production Feature Store tables. It can be a batch or streaming job which is scheduled, triggered or continuously running.
  2. Model training: Models are trained on the full production data and pushed to the MLflow Model Registry. Training can be triggered by code changes or by automated retraining jobs.
  3. Continuous Deployment (CD): A CD process takes new models (in Model Registry “stage=None”), tests them (transitioning through “stage=Staging”), and if successful deploys them (promoting them to “stage=Production”). CD may be implemented using Model Registry webhooks and/or your own CD system.
  4. Inference & serving: The Model Registry’s production model can be deployed in multiple modes: batch and streaming jobs for high-throughput use cases and online serving for low-latency (REST API) use cases.
  5. Monitoring: For any deployment mode, the model’s input queries and predictions are logged to Delta tables. From there, jobs can monitor data and model drift, and Databricks SQL dashboards can display status and send alerts. In the development environment, data scientists can be granted access to logs and metrics to investigate production issues.
  6. Retraining: Models can be retrained on the latest data via a simple schedule, or monitoring jobs can trigger retraining.

Implement your MLOps architecture

We hope this blog has given you a sense of how a data-centric MLOps architecture based around the Lakehouse paradigm simplifies the joint management of code, data and models. This blog is necessarily short, omitting many details. To get started with implementing or improving your MLOps architecture, we recommend the following:

For more background on MLOps, we recommend:

--

Try Databricks for free. Get started today.

The post Architecting MLOps on the Lakehouse appeared first on Databricks.

AWS Guide to Data + AI Summit 2022 featuring Capital One, McAfee, Cigna and Carvana

This is a collaborative post from Databricks and Amazon Web Services (AWS). We thank Venkatavaradhan Viswanathan, Senior Partner Solutions Architect at AWS, for his contributions.

Data + AI Summit 2022: Register now to join this in-person and virtual event June 27-30 and learn from the global data community.

Amazon Web Services (AWS) is a Platinum Sponsor of Data + AI Summit 2022, one of the largest events in the industry. Join this event and learn from joint Databricks and AWS customers like Capital One, McAfee, Cigna and Carvana, who have successfully leveraged the Databricks Lakehouse Platform for their business, bringing together data, AI and analytics on one common platform.

At Data + AI Summit, Databricks and AWS customers will take the stage for sessions to help you see how they achieved business results using the Databricks on AWS Lakehouse. Attendees will have the opportunity to hear data leaders from McAfee and Cigna on Tuesday, June 28, then join Capital One on Wednesday, June 29 and Carvana on Thursday, June 30.

The sessions below are a guide for everyone interested in Databricks on AWS and span a range of topics — from building recommendation engines to fraud detection to tracking patient interactions. If you have questions about Databricks on AWS or service integrations, connect with Databricks on AWS Solutions Architects at Data + AI Summit.

Databricks on AWS customer breakout sessions

Capital One: Running a Low Cost, Versatile Data Management Ecosystem with Apache Spark at Core

Data is the key component of any analytics, AI or ML platform. Organizations may not be successful without a platform that can source, transform, quality-check and present data in a reportable format that drives actionable insights. This session will focus on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale and build data storage (Redshift) in a form that can be easily consumed by AI/ML programs – using AWS services in combination with open-source software (Spark) and the Enterprise Edition of Hydrograph (a UI-based ETL tool with Spark as the backend).

Learn more


How McAfee Leverages Databricks on AWS at Scale

McAfee, a global leader in online protection security, enables home users and businesses to stay ahead of fileless attacks, viruses, malware, and other online threats. Learn how McAfee leverages Databricks on AWS to create a centralized data platform as a single source of truth to power customer insights. We will also describe how McAfee uses additional AWS services, specifically Amazon Kinesis and Amazon CloudWatch to provide real time data streaming and monitor and optimize their Databricks on AWS deployment. Finally, we’ll discuss business benefits and lessons learned during McAfee’s petabyte scale migration to Databricks on AWS using Databricks Delta clone technology coupled with network, compute, storage optimizations on AWS.

Learn more


Cigna: Journey to Solving Healthcare Price Transparency with Databricks and Delta Lake

The Centers for Medicare & Medicaid Services (CMS) published the Price Transparency mandate, which requires healthcare providers and payers to publish the cost of services, based on procedure codes, in the public domain. This led us to create a comprehensive solution that can process tens of terabytes of data to create machine-readable files in the form of JSON files and host them in the public domain. We embarked on a journey that embraces the scalability of the AWS cloud, Apache Spark, Databricks and Delta Lake to generate and host files ranging in size from megabytes to hundreds of gigabytes.

Learn more


Carvana: Near Real-Time Analytics with Event Streaming, Live Tables, and Delta Sharing

Microservices is an increasingly popular architecture much loved by application teams, for it allows services to be developed and scaled independently. Data teams, though, often need a centralized repository where all data from different services come together to join and aggregate. The data platform can serve as a single source of company facts, enable near real time analytics, and secure sharing of massive data sets across clouds. A viable microservices ingestion pattern is Change Data Capture, using AWS Database Migration Services or Debezium. CDC proves to be a scalable solution ideal for stable platforms, but it has several challenges for evolving services: Frequent schema changes, complex, unsupported DDL during migration, and automated deployments are but a few. An event streaming architecture can address these challenges.

Learn more


Amgen: Building Enterprise Scale Data and Analytics Platforms at Amgen

Over the past few years, Amgen has developed a suite of modern enterprise platforms that serve as a core foundational capability for data and analytics transformation across our business functions. We operate in mature agile teams, with a dedicated product team for each of our platforms, to build reusable capabilities and integrate with business programs in line with SAFe. Our platforms have created massive business impact, whether for business teams looking to self-serve onboarding data into our Data Lake or those looking to build advanced analytics applications powered by advanced NLP, knowledge graphs, and more. Our platforms are powered by modern technologies, extensively using Databricks, AWS native services, and several open source technologies.

Learn more


Amgen: Amgen’s Journey To Building a Global 360 View of its Customers with the Lakehouse

Serving patients in over 100 countries, Amgen is a leading global biotech company focused on developing therapies that have the power to save lives. Delivering on this mission requires our commercial teams to regularly meet with healthcare providers to discuss new treatments that can help patients in need. With the onset of the pandemic, where face-to-face interactions with doctors and other Healthcare Providers (HCPs) were severely impacted, Amgen had to rethink these interactions. With that in mind, the Amgen Commercial Data and Analytics team leveraged a modern data and AI architecture built on the Databricks Lakehouse to help accelerate its digital and data insights capabilities. This foundation enabled Amgen’s teams to develop a comprehensive, customer-centric view to support flexible go-to-market models and provide personalized experiences to our customers. In this presentation, we will share our recent journey of how we took an agile approach to bringing together over 2.2 petabytes of internally generated and externally sourced vendor data and onboarding it into our AWS Cloud and Databricks environments, enabling standardized, scalable and robust capabilities to meet the business requirements in our fast-changing life sciences environment.

Learn more


Sapient: Turning Big Biology Data into Insights on Disease – The Power of Circulating Biomarkers

Profiling small molecules in human blood across global populations gives rise to a greater understanding of the varied biological pathways and processes that contribute to human health and diseases. Herein, we describe the development of a comprehensive Human Biology Database, derived from non-targeted molecular profiling of over 300,000 human blood samples from individuals across diverse backgrounds, demographics, geographical locations, lifestyles, diseases, and medication regimens, and its applications to inform drug development. Built on a customized AWS and Databricks “infrastructure-as-code” Terraform configuration, we employ streamlined data ETL and machine learning-based approaches for rapid rLC-MS data extraction.

Learn more


Scribd: Streaming Data into Delta Lake with Rust and Kafka

Scribd’s data architecture was originally batch-oriented, but in the last couple years, we introduced streaming data ingestion to provide near-real-time ad hoc query capability, mitigate the need for more batch processing tasks, and set the foundation for building real-time data applications. In this talk I will describe Scribd’s unique approach to ingesting messages from Kafka topics into Delta Lake tables. I will describe the architecture, deployment model, and performance of our solution, which leverages the kafka-delta-ingest Rust daemon and the delta-rs crate hosted in auto-scaling Amazon ECS services. I will discuss foundational design aspects for achieving data integrity such as distributed locking with Amazon DynamoDB to overcome S3’s lack of “PutIfAbsent” semantics, and avoiding duplicates or data loss when multiple concurrent tasks are handling the same stream. I’ll highlight the reliability and performance characteristics we’ve observed so far. I’ll also describe the Terraform deployment model we use to deliver our 70-and-growing production ingestion streams into AWS.

Learn more


Scribd: Doubling the Capacity of the Data Platform Without Doubling the Cost

The data and ML platform at Scribd is growing. I am responsible for understanding and managing its cost, while enabling the business to solve new and interesting problems with our data. In this talk we’ll discuss each of the following concepts and how they apply at Scribd and more broadly to other Databricks customers. Optimize infrastructure costs: Compute is one of the main cost line items for us in the cloud. We are early adopters of Photon and Databricks Serverless SQL, which help us to minimize these costs. We combine these technologies with off the shelf analysis tools in AWS and some helpful optimizations around Databricks and Delta Lake that we’d like to share.

Learn more


Huuuge Games: Real-Time Cost Reduction Monitoring and Alerting

Huuuge Games is building a state-of-the-art data and AI platform that serves as a unified data hub for all company needs and for all data and AI business insights. We built an advanced architecture based on Databricks which is built on top of AWS. Our Unified data infrastructure handles several billions of records per day in batch and real-time mode, generating players’ behavioral profiles, predicting their future behavior, and recommending the best customization of game content for each of our players.

Learn more


Databricks on AWS breakout sessions

Secure Data Distribution and Insights with Databricks on AWS

Every industry must comply with some form of compliance or data security in order to operate. As data becomes more mission critical to the organization, so does the need to protect and secure it. Public Sector organizations are responsible for securing sensitive data sets and complying with regulatory programs such as HIPAA, FedRAMP, and StateRAMP.

Learn more


Building a Lakehouse on AWS for Less with AWS Graviton and Photon

AWS Graviton processors are custom-designed by AWS to enable the best price performance for workloads in Amazon EC2. In this session we will review benchmarks that demonstrate how AWS Graviton based instances run Databricks workloads at a lower price and better performance than x86-based instances on AWS, and when combined with Photon, the new Databricks engine, the price performance gains are even greater. Learn how you can optimize your Databricks workloads on AWS and save more.

Learn more


Securing Databricks on AWS Using Private Link

Minimizing data transfers over the public internet is among the top priorities for organizations of any size, both for security and cost reasons. Modern cloud-native data analytics platforms need to support deployment architectures that meet this objective. For Databricks on AWS such an architecture is realized thanks to AWS PrivateLink, which allows computing resources deployed on different virtual private networks and different AWS accounts to communicate securely without ever crossing the public internet.

Learn more


Register now for this free virtual event and join the data and AI community. Learn how companies are successfully building their Lakehouse architecture with Databricks on AWS to create a simple, open and collaborative data platform. Get started using Databricks with $50 in AWS credits and a free trial on AWS Marketplace.

--

Try Databricks for free. Get started today.

The post AWS Guide to Data + AI Summit 2022 featuring Capital One, McAfee, Cigna and Carvana appeared first on Databricks.


Equipping all Teams with Data & AI: Announcing the Finalists for the 2022 Databricks Data Team Democratization Award

The annual Databricks Data Team Awards recognize data teams who are harnessing the power of data and AI to deliver solutions for some of the world’s toughest problems.

Nearly 250 teams were nominated across six categories from all industries, regions, and companies – all with impressive stories about the work they are doing with data and AI. As we lead up to Data and AI Summit, we will be showcasing the finalists in each of the categories over the coming days.

The Data Team Democratization Award recognizes data teams who are driving the CoEs that are delivering data into the hands of empowered users across the organization — making every team a data team.

Meet the five finalists for the Data Team Democratization Award category:

Centers for Disease Control and Prevention
The Centers for Disease Control and Prevention (CDC) has been on the front lines guiding communities, governments and healthcare workers in response to the COVID-19 pandemic. Throughout this time, data and AI have played a critical role in helping to deliver fast insights across the U.S., helping to save lives. The Databricks Lakehouse has empowered the CDC to democratize data at massive scale — ingesting high volumes of all kinds of data on CDC’s Enterprise Analytics and Visualization (EDAV) platform. The lakehouse paradigm was implemented at the CDC for COVID-19 vaccine data coming in from states and federal agencies (at a pace of 5+ million new records per day), sharing vaccination and mortality rate metrics with cities, states, the White House and the general public so that they can make more informed decisions at the local, regional and national levels. These decisions included when to reopen businesses, enforce mask mandates, close schools, and more. Through the democratization of data and unification with analytics, they’ve been able to deliver on many more use cases to inform the people of the US of current health situations and provide the government and general public with the actionable insights needed to ensure the highest levels of health within the US.

Condé Nast
Condé Nast is at the forefront of the publishing industry’s digital revolution, delivering engaging online content to millions of readers of iconic titles like Vogue, The New Yorker, Vanity Fair, and Wired. To do so, the data team at Condé Nast has harnessed data and AI to improve content performance and enrichment, fuel process innovation, and increase market revenue, all built upon Databricks Lakehouse. The Lakehouse supports “Evergreen,” a unified platform — from data teams to data consumers — using ML and analytics to derive faster insights that expand the reach and impact of their content, and enable data-driven decision-making to steer operations at a global level. As a result, Condé Nast has been able to increase revenue across multiple networks through predictive ad optimization; use SQL to power dashboards and reports to help teams improve syndication volume and performance; unify performance insights across its global brands; and enable its video division to report on a global scale with a single source of truth for analysis. With the help of Databricks, Condé Nast has fostered a culture where data is the common language and core of everything they do, helping to enhance the reach and impact of the distinctive content that the company is renowned for.

Corning Incorporated
Corning, a leading innovator in materials science, develops products that transform industries and enhance people’s lives. As data and AI continue to play a critical role for Corning to advance its leadership position, they’ve been focused on the use of data across all business units. Corning has created an Emerging Technologies team within the IT function that is leveraging the Databricks Lakehouse. The data lakehouse enables Corning scientists, engineers and business knowledge workers to access vast amounts of data supporting many use cases across all of Corning’s businesses. The lakehouse enables teams to use data and AI for predictive maintenance, predictive demand planning, image recognition, and advanced supply chain analytics to explore and identify business value metrics. Whether it’s data scientists looking to build advanced machine learning and deep learning models or analysts using SQL to explore data and build BI reports for business stakeholders, Corning continues to look for new opportunities to advance data consumption and empower users across the organization.

Gap Inc.
Gap Inc. is a collection of purpose-led lifestyle brands including Old Navy, Gap, Banana Republic, and Athleta, and the largest American specialty apparel company. The company uses omni-channel capabilities to bridge the digital world and physical stores to further enhance the shopping experience for its customers. To enable this at scale, Gap Inc. fundamentally redesigned its data architecture to make it simple, secure and accessible. With principles of federated data ownership, kappa architecture, and powered by Databricks Lakehouse, they have eliminated data silos and brought consistency across data science models, analytics and BI. With the Gap Data Platform COE driving data governance, federation and self-service, teams across Gap Inc. can qualify and publish data, search data and request access, comment and collaborate with SMEs and peers, and get a common understanding of shared datasets. With a petabyte of customer and foundational data in the Lakehouse, and other domains in progress, they continue to lower the time and cost of insights and innovation across Gap Inc. and their partners. Case in point: The Gap Inc. Data Sciences teams now have easy access to qualified, granular, near-real-time data and metadata – plus a 40x decrease in query times and a 5x decrease in data latency. Along with a migration to the Databricks ML Platform (feature stores, MLflow, distributed training/scoring, integration with Lakehouse), the teams achieved a 95% decrease in end-to-end time for the largest production models, and substantially reduced complexity and time to market overall.

Sam’s Club
Sam’s Club provides superior products and savings to millions of members with its highly curated assortment of items. It’s a huge undertaking that requires exemplary demand forecasting, supply chain optimization, and overall member experience. As an early adopter of the lakehouse architecture, Sam’s Club has built the Common Data Platform — an internal analytics service that has put the power of all their data in the hands of over 1,600 monthly active users. With Databricks Lakehouse powering its data platform, teams have been able to transform data from billions of transactions and events into actionable insights and predictive ML models. These models can yield more accurate financial forecasts, optimize pricing, boost engagement with home delivery, curbside pickup, Scan & Go™, fight credit card fraud and forecast supply which has helped reduce food waste.

Check out the award finalists in the other five categories and come raise a glass and celebrate these amazing data teams during an award ceremony at the Data and AI Summit on June 29.

--

Try Databricks for free. Get started today.

The post Equipping all Teams with Data & AI: Announcing the Finalists for the 2022 Databricks Data Team Democratization Award appeared first on Databricks.

Automating PHI Removal from Healthcare Data With Natural Language Processing

Minimum necessary standard and PHI in healthcare research

Under the Health Insurance Portability and Accountability Act (HIPAA) minimum necessary standard, HIPAA-covered entities (such as health systems and insurers) are required to make reasonable efforts to ensure that access to Protected Health Information (PHI) is limited to the minimum necessary to achieve the intended purpose of a particular use, disclosure, or request.

In Europe, the GDPR lays out requirements for anonymization and pseudo-anonymization that companies must meet before they can analyze or share medical data. In some cases, these requirements go beyond US regulations by also requiring that companies redact gender identity, ethnicity, religious, and union affiliations. Almost every country has similar legal protections on sensitive personal and medical information.

The challenges of working with personally identifiable health data

Minimum necessary standards such as these can create obstacles to advancing population-level healthcare research. This is because much of the value in healthcare data is in the semi-structured narrative text and unstructured images, which often contain personally identifiable health information that is challenging to remove. Such PHI makes it difficult to enable clinicians, researchers, and data scientists within an organization to annotate, train, and develop models that have the power to predict disease progression, as an example.

Beyond compliance, another key reason for the de-identification of PHI and medical data before analysis — especially for data science projects — is to prevent bias and learning from spurious correlations. Removing data fields such as patients’ addresses, last names, ethnicity, occupation, hospital names, and doctor names prevents machine learning algorithms from relying on these fields when making predictions or recommendations.

Automating PHI removal with Databricks and John Snow Labs

John Snow Labs, the leader in Healthcare natural language processing (NLP), and Databricks are working together to help organizations process and analyze their text data at scale with a series of Solution Accelerator notebook templates for common NLP use cases. You can learn more about our partnership in our previous blog, Applying Natural Language Processing to Health Text at Scale.

To help organizations automate the removal of sensitive patient information, we built a joint Solution Accelerator for PHI removal that builds on top of the Databricks Lakehouse for Healthcare and Life Sciences. John Snow Labs provides two commercial extensions on top of the open-source Spark NLP library — both of which are useful for de-identification and anonymization tasks — that are used in this Accelerator:

  • Spark NLP for Healthcare is the world’s most widely-used NLP library for the healthcare and life science industries. Optimized to run on Databricks, Spark NLP for Healthcare seamlessly extracts, classifies, and structures clinical and biomedical text data with state-of-the-art accuracy at scale.
  • Spark OCR provides production-grade, trainable, and scalable algorithms and models for a variety of visual image tasks, including document understanding, form understanding, and information extraction. It extends the core library’s ability to analyze digital text so that it can also read and write PDF and DOCX documents, as well as extract text from images – either within such files or from JPG, TIFF, DICOM, and similar formats.

A high-level walkthrough of our Solution Accelerator is included below.

PHI removal in action

In this Solution Accelerator, we show you how to remove PHI from medical documents so that they can be shared or analyzed without compromising a patient’s identity. Here is a high-level overview of the workflow:

  • Build an OCR pipeline to process PDF documents
  • Detect and extract PHI entities from unstructured text with NLP models
  • Use obfuscation to de-identify data, such as PHI text
  • Use redaction to de-identify PHI in the visual document view

You can access the notebooks for a full walkthrough of the solution.

End-to-end workflow for automating PHI removal from documents and images using the Databricks Lakehouse Platform.

Parsing the files through OCR

As a first step, we load all PDF files from our cloud storage, assign a unique ID to each one, and store the resulting DataFrames into the Bronze layer of the Lakehouse. Note that the raw PDF content is stored in a binary column and can be accessed in the downstream steps.

Sample Delta bronze table, assigning a unique ID to each PDF file, created as part of the Databricks-John Snow Labs PHI de-identification solution.
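
A minimal sketch of this ingestion step is shown below; the storage path and table name are hypothetical, and the exact logic is in the Accelerator notebook.

from pyspark.sql import functions as F

# Load raw PDFs as binary content; each row carries the file path and raw bytes.
pdfs_df = (
    spark.read.format("binaryFile")
    .load("/mnt/clinical-docs/pdfs/*.pdf")   # hypothetical storage path
    .withColumn("id", F.expr("uuid()"))      # assign a unique ID per document
)

# Persist to the Bronze layer as a Delta table for the downstream OCR steps.
pdfs_df.write.format("delta").mode("overwrite").saveAsTable("phi_bronze_pdf_docs")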

In the next step, we extract raw text from each file. Since PDF files can have more than one page, it is more efficient to first transform each page into an image (using PdfToImage()) and then extract the text from the image by using ImageToText() for each image.

# Imports assumed for this snippet (Spark OCR transformers and Spark ML pipeline)
from sparkocr.transformers import PdfToImage, ImageToText
from pyspark.ml import PipelineModel

# Transform PDF document to images per page
pdf_to_image = PdfToImage()\
     .setInputCol("content")\
     .setOutputCol("image")

# Run OCR
ocr = ImageToText()\
     .setInputCol("image")\
     .setOutputCol("text")\
     .setConfidenceThreshold(65)\
     .setIgnoreResolution(False)

ocr_pipeline = PipelineModel(stages=[
   pdf_to_image,
   ocr
])

As in Spark NLP, transform() is a standardized step in Spark OCR, aligning with other Spark transformers, and can be executed in one line of code.

ocr_result_df = ocr_pipeline.transform(pdfs_df)

Note that you can view each individual image directly within the notebook, as shown below:

After applying this pipeline, we then store the extracted text and raw image in a DataFrame. Note that the linkage between image, extracted text and the original PDF is preserved via the path to the PDF file (and the unique ID) within our cloud storage.

Often, scanned documents are low quality (due to skewed images, poor resolution, etc.), which results in less accurate extracted text and poor data quality. To address this problem, we can use built-in image pre-processing methods within Spark OCR to improve the quality of the extracted text.

Skew correction and image processing

In the next step, we process images to increase confidence. Spark OCR has ImageSkewCorrector which detects the skew of the image and rotates it. Applying this tool within the OCR pipeline helps to adjust images accordingly. Then, by also applying the ImageAdaptiveThresholding tool, we can compute a threshold mask image based on a local pixel neighborhood and apply it to the image. Another image processing method that we can add to the pipeline is the use of morphological operations. We can use ImageMorphologyOperation which supports Erosion (removing pixels on object boundaries), Dilation (adding pixels to the boundaries of objects in an image), Opening (removing small objects and thin lines from an image while preserving the shape and size of larger objects in the image) and Closing (the opposite of opening and useful for filling small holes in an image).

To remove background objects, ImageRemoveObjects can be used, and ImageLayoutAnalyzer can be added to the pipeline to analyze the image and determine the regions of text. The code for our fully developed OCR pipeline can be found within the Accelerator notebook.
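
As a rough sketch of how skew correction slots into the pipeline (the fully tuned version, including thresholding, morphology and object removal, lives in the notebook), the following is illustrative only and assumes the Spark OCR transformer defaults.

from pyspark.ml import PipelineModel
from sparkocr.transformers import PdfToImage, ImageSkewCorrector, ImageToText

# Illustrative chain: render pages to images, correct skew automatically,
# then re-run OCR on the straightened image. Parameter tuning is omitted here.
pdf_to_image = PdfToImage().setInputCol("content").setOutputCol("image")

skew_corrector = ImageSkewCorrector() \
    .setInputCol("image") \
    .setOutputCol("corrected_image") \
    .setAutomaticSkewCorrection(True)

ocr = ImageToText().setInputCol("corrected_image").setOutputCol("corrected_text")

corrected_ocr_pipeline = PipelineModel(stages=[pdf_to_image, skew_corrector, ocr])
corrected_result_df = corrected_ocr_pipeline.transform(pdfs_df)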

Let’s see the original image and the corrected image.

Applying a Skew corrector within the OCR pipeline helps to straighten an improperly rotated document.

After the image processing, we have a cleaner image with an increased confidence of 97%.

After the image processing, the Databricks-John Snow Labs PHI de-identification solution produces a cleaner image with an increased confidence, or model accuracy, of 97%.

Now that we have corrected for image skew and background noise and extracted the corrected text from the images, we write the resulting DataFrame to the Silver layer in Delta.

Extracting and obfuscating PHI entities

Once we’ve finished using Spark OCR to process our documents, we can use a clinical Named Entity Recognition (NER) pipeline to detect and extract entities of interest (like name, birthplace, etc.) in our document. We covered this process in more detail in a previous blog post about extracting oncology insights from lab reports.

However, there are often PHI entities within clinical notes that can be used to identify and link an individual to the identified clinical entities (for example disease status). As a result, it is critical to identify PHI within the text and obfuscate those entities.

There are two steps in the process: extract the PHI entities, then hide them, while ensuring that the resulting dataset still contains valuable information for downstream analysis.

Similar to clinical NER, we use a medical NER model (ner_deid_generic_augmented) to detect PHI and then we use the “faker method” to obfuscate those entities. Our full PHI extraction pipeline can also be found in the Accelerator notebook.
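
As a rough sketch of what such a PHI NER pipeline can look like (the complete, tested deid_pipeline is in the Accelerator notebook, so treat the stages below as illustrative), it assembles documents, detects sentences, tokenizes, embeds with clinical embeddings, and applies the ner_deid_generic_augmented model.

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel
from pyspark.ml import Pipeline

# Illustrative PHI NER stages; column names mirror those used by the
# obfuscation step later in this post.
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")

ner_converter = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

deid_ner_pipeline = Pipeline(stages=[
    document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter
])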

The pipeline detects PHI entities, which we can then visualize with the NerVisualizer as shown below.

The Databricks-John Snow Labs PHI de-identification solution detects PHI entities which can be visualized with the NerVisualizer.

Now to construct an end-to-end deidentification pipeline, we simply add the obfuscation step to the PHI extraction pipeline which replaces PHI with fake data.

# Imports assumed for this snippet; deid_pipeline is the PHI extraction pipeline
# built earlier in the Accelerator notebook
from sparknlp_jsl.annotator import DeIdentification
from pyspark.ml import Pipeline

obfuscation = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate")\
    .setObfuscateRefSource("faker")\
    .setObfuscateDate(True)

obfuscation_pipeline = Pipeline(stages=[
    deid_pipeline,
    obfuscation
])

In the following example, we redact the birthplace of the patient and replace it with a fake location:

In addition to obfuscation, SparkNLP for Healthcare offers pre-trained models for redaction. Here is a screenshot showing the output of those redaction pipelines.

With the Databricks-John Snow Labs PHI de-identification solution, PDF images are updated with a black line to redact PHI.


SparkNLP and Spark OCR work well together for de-identification of PHI at scale. In many scenarios, Federal and industry regulations prohibit the distribution or sharing of the original text file. As demonstrated, we can create a scalable and automated production pipeline to classify text within PDFs, obfuscate or redact PHI entities, and write the resulting data back into the Lakehouse. Data teams can then comfortably share this “cleansed” data and de-identified information with downstream analysts, data scientists, or business users without compromising a patient’s privacy. Included below is a summary chart of this data flow on Databricks.

Data flow chart for PHI obfuscation using Spark OCR and Spark NLP on Databricks.

Start building your PHI removal pipeline

With this Solution Accelerator, Databricks and John Snow Labs make it easy to automate the de-identification and obfuscation of sensitive data contained within PDF medical documents.

To use this Solution Accelerator, you can preview the notebooks online and import them directly into your Databricks account. The notebooks include guidance for installing the related John Snow Labs NLP libraries and license keys.

You can also visit our Lakehouse for Healthcare and Life Sciences page to learn about all of our solutions.

--

Try Databricks for free. Get started today.

The post Automating PHI Removal from Healthcare Data With Natural Language Processing appeared first on Databricks.

Databricks Now Provides HIPAA Compliance Features on Google Cloud

Customers in regulated industries rely on Databricks on Google Cloud to analyze and gain insights from their most sensitive data using the data lakehouse paradigm. Our security program incorporates industry-leading best practices to fulfill our customers’ security needs. We are pleased to announce a new set of security controls (available in public preview) that can be enabled for your Databricks account and assist with Health Insurance Portability and Accountability Act (HIPAA) compliance. The new security controls include features such as encryption of data at rest and in transit between the cluster nodes.

Visit the HIPAA on Google Cloud page to learn more about the new security controls. Visit the Databricks Google Cloud pricing page to learn about the pricing and please fill out this sign up form to request access to the preview.

--

Try Databricks for free. Get started today.

The post Databricks Now Provides HIPAA Compliance Features on Google Cloud appeared first on Databricks.

Accelerating Business With Data & AI: Announcing the Finalists for the 2022 Databricks Data Team Transformation Award

The annual Databricks Data Team Awards recognize data teams who are harnessing the power of data and AI to deliver solutions for some of the world’s toughest problems.

Nearly 250 teams were nominated across six categories from all industries, regions, and companies – all with impressive stories about the work they are doing with data and AI. As we lead up to Data and AI Summit, we will be showcasing the finalists in each of the categories over the coming days.

The Data Team Transformation Award honors the data teams who are taking their business to the next level with data-driven transformation, accelerating operations that lead to clear, impactful results.

Meet the five finalists for the Data Team Transformation Award category:

Compass
Compass provides an end-to-end platform for real estate agents to manage and grow their business, from customer relationship management, to marketing and brokerage services. To fuel these innovations, they integrated data and AI into a single cloud platform with Databricks Lakehouse. The Compass data team has securely onboarded 100+ data sources, migrated all workspaces to Unity Catalog, and created a gold layer for key stakeholders to tap into, to gain insights into areas like customer, product, usage, revenue and more — creating a single source of truth, with complete access controls across the company. They now have over 300 monthly active users across the business, with insights being delivered via a wide variety of dashboards: retention and renewal reports that enable executives to identify trends and best practices, and transaction dashboards for sales managers to review listings, closed deals and financial performance, helping to identify opportunities to recognize individuals or provide extra coaching. In addition, the ML team has productionized a “likely-to-sell model” that predicts high-value sellers, leading to millions of dollars in incremental revenue for Compass.

H&R Block
H&R Block is leveraging big data and artificial intelligence to drive tax preparation and financial services and achieve their Block Horizons 2025 long-term growth and transformation strategy. Core to their data-transformation strategy is their Enterprise Data Platform (EDP), built upon the Databricks Lakehouse Platform and utilizing Delta Live Tables. The EDP helped H&R Block evolve from disparate, on-premises, legacy data-warehousing sources to an aggregated, cloud-based, and scalable data-lakehouse solution. With this technology foundation in place, H&R Block is dedicated to embedding data and analytics into every aspect of their business, allowing the company to better know and interact with its customers, streamline the customer experience, introduce data-driven products, innovate with operational models, and automate processes.

Providence
Providence’s vision is health for a better world, and predictive data-driven solutions are a part of achieving that vision. Leveraging an architecture built on top of Azure and Databricks Lakehouse, Providence can now unify a diversity of clinical and operational source data in a single cloud environment for both analytics and machine learning (ML). By parsing and scoring streaming messages from electronic medical records (EMR) and other on-premises systems, they can deliver insights that help hospital workers better manage operations and clinical care. For example, Providence is applying ML at scale to forecast patient census, acuity, and length of stay to optimize staffing levels across its 50+ hospitals. Looking ahead, Providence is building on its streaming capabilities to deploy a Mission Control Center designed to monitor and manage facility activity in real-time to further improve patient care and operations across all of its facilities.

Samsung Electronics
Samsung Electronics strives to delight its customers and provide the best connected experience across platforms. Personalization is a key focus area, where Samsung Electronics leverages data to provide viewers with the most relevant content attuned to their interests. The data team at Samsung Electronics implemented the Databricks Lakehouse Platform to build and transform their modeling environment. Since moving to Databricks to transform their approach in an intelligent and cost-effective way, the results have been highly impactful and improved viewer satisfaction and user experience. They are among the most innovative in the industry, leading the way with new and creative ways to provide a better user experience.

Toyota
Toyota’s mission is to “continuously strive to transform the very nature of movement.” For the company to deliver on that promise, they are moving forward with an aggressive commitment to sustainability–not only by shifting their focus to electrified vehicles, but also changing how those vehicles are manufactured to achieve carbon neutrality by 2035. The data team at Toyota is using Databricks Lakehouse to help power this change, moving away from their legacy on premise data warehouse, to a cloud-based unified platform for data and AI. With the Lakehouse ingesting and standardizing petabytes of data, we are able to use machine learning and advanced analytics to analyze trillions of batch and real-time records to optimize manufacturing processes, predict energy demand, improve the utilization of renewable energy sources, and identify opportunities to further reduce our carbon footprint. With the Lakehouse, Toyota is tackling the challenge of decarbonization head-on, while also solving a range of other problems– from supply chain and revenue forecasting, to quality assurance– to make sure that the company never stops moving while surpassing customers’ expectations.

Check out the award finalists in the other categories and come raise a glass and celebrate these amazing data teams during an award ceremony at the Data and AI Summit on June 29.

--

Try Databricks for free. Get started today.

The post Accelerating Business With Data & AI: Announcing the Finalists for the 2022 Databricks Data Team Transformation Award appeared first on Databricks.

Production ML Application in 15 Minutes With Tecton and Databricks

Getting machine-learning systems to production is (still) hard. For many teams, building real-time ML systems that operate on real-time data is still a daydream – in most cases, they are tied to building batch prediction systems. There are a lot of challenges on the road to real-time ML, including building scalable real-time data pipelines, scalable model inference endpoints, and integrating it all into production applications.

This blog will show you how you can dramatically simplify these challenges with the right tools. With Tecton and Databricks, you’ll be able to build the MVP for a real-time ML system in minutes, including real-time data processing and online inference.

Building a Real-Time ML System in One Notebook

In this example, we’ll focus on building a real-time transaction fraud system that decides whether to approve or reject transactions. Two of the most challenging requirements of building real-time fraud detection systems are:

  • Real-time inference: predictions need to be extremely fast – typically model inference should happen in < 100ms to not slow down payment processing
  • Real-time features: often the most critical data needed to detect fraudulent transactions describes what has happened in the last few seconds. To build a great fraud detection system, you’ll need to update features within seconds of a transaction occurring.

With Tecton and Databricks, these challenges can be greatly simplified:

  • Databricks and its native MLflow integration will allow us to create and test real-time serving endpoints to make real-time predictions.
  • Tecton helps build performant stream aggregations that will compute features in real-time

Let’s walk through the following four steps to build a real-time production ML system:

  • Building performant stream processing pipelines in Tecton
  • Training a model with features from Tecton using Databricks and MLflow
  • Creating a model serving endpoint in Databricks using MLflow
  • Making real-time predictions using the model serving endpoint and real-time features from Tecton

Building stream processing pipelines in Tecton
Tecton’s feature platform is built to make it simple to define features, and to help make those features available to your ML models – however quickly you need them. You’ll need a few different types of features for a fraud model, for example:

  • Average transaction size in a country for the last year (computed once daily)
  • Number of transactions by a user in the last 1 minute (computed continuously from a stream of transactions)
  • Distance from the point of the transaction to the user’s home (computed on-demand at the time of a transaction)

Each type of feature requires a different type of data pipeline, and Tecton can help build all three of these types of features. Let’s focus on what would typically be the most challenging type of feature, real-time streaming features.

Here’s how you can implement the feature “Number of transactions by a user in the last 1 minute and last 5 minutes” in Tecton:

# Imports assumed from the Tecton SDK version used in this example
from datetime import datetime
from tecton import stream_window_aggregate_feature_view, Input, FeatureAggregation

@stream_window_aggregate_feature_view(
    inputs={'transactions': Input(transactions_stream)},
    entities=[user],
    mode='spark_sql',
    aggregation_slide_period='continuous',
    aggregations=[
        FeatureAggregation(column='counter', function='count', time_windows=['1m', '5m'])
    ],
    online=True,
    offline=True,
    feature_start_time=datetime(2022, 4, 1),
    family='fraud',
    tags={'release': 'production'},
    owner='david@tecton.ai',
    description='Number of transactions a user has made recently'
)
def user_continuous_transaction_count(transactions):
    return f'''
        SELECT
            user_id,
            1 as counter,
            timestamp
        FROM
            {transactions}
        '''

In this example we take advantage of Tecton’s built-in support for low-latency streaming aggregations, allowing us to maintain a count of the number of transactions a user has made in real-time.

A feature pipeline built in Tecton for computing a count of user’s transactions

When this feature is applied, Tecton starts orchestrating data pipelines in Databricks to make this feature available in real time (for model inference) and offline (for model training). Historical features are stored in Delta Lake, meaning all of the features you build are natively available in your data lakehouse.

Training a model with features from Tecton using Databricks and MLflow

Once our features are built-in Tecton, we can train our fraud detection model. Check out the notebook below where we:

  • Generate training data using Tecton’s time-travel capabilities
  • Train a SKLearn model to predict whether or not a transaction is fraudulent
  • Track our experiments using MLflow
# Imports assumed for this notebook; `ws` is the Tecton workspace and
# `fraud_detection_feature_service` is the Tecton feature service defined earlier
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 1. Fetching a Spark DataFrame of historical labeled transactions
# 2. Renaming columns to match the expected join keys for the Feature Service
# 3. Selecting the join keys, request data, event timestamp, and label
training_events = ws.get_data_source("transactions_batch").get_dataframe().to_spark() \
                        .filter("partition_0 == 2022").filter("partition_2 == 05") \
                        .select("user_id", "merchant", "timestamp", "amt", "is_fraud") \
                        .cache()


training_data = fraud_detection_feature_service.get_historical_features(spine=training_events, timestamp_key="timestamp").to_spark()
training_data_pd = training_data.drop("user_id", "merchant", "timestamp", "amt").toPandas()
y = training_data_pd['is_fraud']
x = training_data_pd.drop('is_fraud', axis=1)
X_train, X_test, y_train, y_test = train_test_split(x, y)
with mlflow.start_run() as run:
  n_estimators = 100
  max_depth = 6
  max_features = 3
  # Create and train model
  rf = RandomForestRegressor(n_estimators = n_estimators, max_depth = max_depth, max_features = max_features)
  rf.fit(X_train, y_train)
  # Make predictions
  predictions = rf.predict(X_test)
  
  # Log parameters
  mlflow.log_param("num_trees", n_estimators)
  mlflow.log_param("maxdepth", max_depth)
  mlflow.log_param("max_feat", max_features)
  mlflow.log_param("tecton_feature_service", feature_service_name)
  
  # Log model
  mlflow.sklearn.log_model(rf, "random-forest-model")
  
  # Create metrics
  mse = mean_squared_error(y_test, predictions)
    
  # Log metrics
  mlflow.log_metric("mse", mse)

Creating a model serving endpoint in Databricks using MLflow
Now that we have a trained model, we’ll use MLflow in Databricks to create a model endpoint. First, we’ll register the model in the MLflow Model Registry:

Register your trained model in the MLflow model registry

Create a new model called “tecton-databricks-fraud-model”
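
If you prefer to register the model programmatically rather than through the UI, a minimal sketch with the MLflow API looks like the following, assuming run is the training run from the previous step.

import mlflow

# Register the logged random-forest artifact from the training run under the
# model name used in the UI above.
model_uri = f"runs:/{run.info.run_id}/random-forest-model"
registered_model = mlflow.register_model(model_uri, "tecton-databricks-fraud-model")

print(registered_model.name, registered_model.version)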

Next, we’ll use MLflow to create a serving endpoint:

Enable serving from the MLflow model registry UI

Once our model is deployed, we’ll take note of the endpoint url:

Now, we have a prediction endpoint that can perform real-time transaction scoring – the only thing left is collecting the features needed at prediction-time.

Making real-time predictions using the model serving endpoint and real-time features from Tecton

The model endpoint that we just created takes features as inputs and outputs a prediction of the probability that a transaction is fraudulent. Retrieving those features poses some challenging problems:

  • Latency constraints: we need to look up (or compute) the features very quickly (<50 ms) to fit within our overall latency budget
  • Feature freshness: we expect the features we defined (like the one-minute transaction count) to be updated in real-time as transactions are occurring.

Tecton provides feature serving infrastructure to solve these challenging problems. Tecton is built to serve feature vectors at high scale and low latency. When we built our features, we already set up the real-time streaming pipelines that will be used to produce fresh features for our models.

Thanks to Tecton, we can retrieve features in real-time with a simple REST call to Tecton’s feature serving endpoint:

curl -X POST https://app.tecton.ai/api/v1/feature-service/get-features\
     -H "Authorization: Tecton-key $TECTON_API_KEY" -d\
'{
  "params": {
    "feature_service_name": "fraud_detection_feature_service",
    "join_key_map": {
      "user_id": "USER_ID_VALUE"
    },
    "request_context_map": {
      "amt": 12345678.9
    },
    "workspace_name": "tecton-databricks-demo"
  }
}'

Check out the rest of the notebook where we’ll wire it all together to retrieve features from Tecton and send them to our model endpoint to get back real-time fraud predictions:

# `requests` is assumed imported; `my_token` is a Databricks personal access token
# and `model_url` is the serving endpoint URL noted above
import requests

def score_model(dataset):
  headers = {'Authorization': f'Bearer {my_token}'}
  data_json = dataset.to_dict(orient='split')
  response = requests.request(method='POST', headers=headers, url=model_url, json=data_json)
  if response.status_code != 200:
    raise Exception(f'Request failed with status {response.status_code}, {response.text}')
  return response.json()

amount=12345.0
df = fraud_detection_feature_service.get_online_features(
  join_keys={'user_id': 'user_131340471060', 'merchant': 'fraud_Schmitt Inc'},
  request_data={"amt": amount}
).to_pandas().fillna(0)

prediction = score_model(df)

print(prediction[0])

Conclusion

Building real-time ML systems can be a daunting task! Individual components like building streaming-data pipelines can be months-long engineering projects if done manually. Luckily, building real-time ML systems with Tecton and Databricks can simplify a lot of that complexity. You can train, deploy, and serve a real-time fraud detection model in one notebook – and it only takes about 15 minutes.

To learn more about how Tecton is powering real-time ML systems, check out talks about Tecton at Data and AI Summit: Scaling ML at CashApp with Tecton and Building Production-Ready Recommender Systems with Feature Stores

--

Try Databricks for free. Get started today.

The post Production ML Application in 15 Minutes With Tecton and Databricks appeared first on Databricks.
