
Analyzing Okta Logs With Databricks Lakehouse Platform to Detect Unusual Activity


With the recent social media reports of an Okta incident through a third-party contractor, security teams ran to their logs and asked vendors like Databricks for detection and analytics advice. Prior to being notified by Okta that we were not among the potentially impacted customers, we used the Databricks Lakehouse for our own investigation. We would like to show how we performed that investigation and share both insights and technical details. We will also provide notebooks that you can import into your Databricks deployment or our community edition to ingest your Okta logs, so that by the end of this blog you can perform the same analysis for your company.

Background

Okta is a market-leading cloud-based identity platform that provides Single Sign-on (SSO) authentication and authorization, multi-factor authentication, and user management services for its customers’ enterprise and business applications.

In January 2022 hackers gained access to an endpoint (user system) owned and operated by a third-party organization providing support services to Okta customers. These actors were potentially able to perform actions as if they were the employee assigned to that endpoint. Like most organizations, Databricks immediately launched an investigation into the incident, analyzing several years of Okta data we have stored in our Lakehouse. We built our own queries, but we also found tremendous value in the posts and tweets from others in the industry.

Our favorite industry blog post was from Cloudflare. Two statements in particular resonated with our security team:

“Even though logs are available in the Okta console, we also store them in our own systems. This adds an extra layer of security as we are able to store logs longer than what is available in the Okta console. That also ensures that a compromise in the Okta platform cannot alter evidence we have already collected and stored.”

Thanks to this approach, they were able to “search the Okta System logs for any signs of compromise (password changes, hardware token changes, etc.). Cloudflare reads the system Okta logs every five minutes and stores these in our SIEM so that if we were to experience an incident such as this one, we can look back further than the 90 days provided in the Okta dashboard.”

Figure 1. Quote from industry post from Cloudflare

In the wake of the incident, many of our customers reached out to us asking if Databricks can help them ingest and analyze their Okta System Logs, and the answer is a resounding YES! The Databricks Lakehouse Platform lets you store, process and analyze your data at multi-petabyte scale, allowing for much longer retention and lookback periods and advanced threat detection with data science and machine learning. What’s more, you can even query them via your SIEM tool, providing a 360 view of your security events.

In this blog post, we will demonstrate how to integrate Okta System Logs with your Databricks Lakehouse Platform, and collect and monitor them. This integration allows your security teams far greater visibility into the authentication and authorization behaviors of your applications and end-users, and lets you look for specific events tied to the recent Okta compromise.

If your goal is to quickly get started, you can skip reading the rest of the blog and use these notebooks in your own Databricks deployment, referring to the comments in each section of the notebook if you get stuck.

Please read on for a technical explanation of the integration and the analysis provided in the notebooks.

About Okta System Logs

The Okta System Log records system events that are related to your organization in order to provide an audit trail that can be used to understand platform activity and diagnose problems. The Okta System Log API provides near real-time, read-only access to your organization’s system log. These logs provide critical insights into user activity, categorized by Okta event type. Each event type represents a specific activity (e.g., login attempt, password reset, creating a new user). You can search on event types and correlate activity with other Okta log attributes such as the event outcome (e.g., SUCCESS or FAILURE), IP address, user name, browser type, and geographic location.

There are many methods to ingest Okta System Log events into other systems, but we are using the System Log API to retrieve the latest System Log events.

Lakehouse architecture for Okta System Logs

Databricks Lakehouse is an open architecture that combines the best elements of data lakes and data warehouses. We recommend the following lakehouse architecture for cybersecurity workloads, such as Okta System Log analysis:

  • Step 1: The Okta System Log records system events that are related to your organization in order to provide an audit trail that can be used to understand platform activity and to diagnose problems.
  • Step 2: The Okta System Log API provides near real-time, read-only access to your organization’s system log.
  • Step 3: You can use the notebook provided to connect to Okta System Log API and ingest records into Databricks Delta automatically at short intervals (optionally, schedule it as a Databricks job).
  • Step 4: At the end of this blog, and with the notebooks provided, you will be ready to use the data for analysis.

Figure 2. Lakehouse architecture for Okta System Logs

In the next sections, we’ll walk through how you can ingest Okta log attributes to monitor activity across your applications.

Ingesting Okta System Logs into Databricks Delta

If you are following along at work or home (or, these days, most often both) with this notebook, we will use Delta Lake batch capability to ingest data from the Okta System Log API into a Delta table, fetching the ordered list of log events from your Okta organization’s system log. We will use bounded requests, which are intended for situations where you know the definite time frame of the logs you want to retrieve.

For a request to be a bounded request, it must meet the following request parameter criteria:

  • since must be specified.
  • until must be specified.

Bounded requests to the /api/v1/logs API have the following semantics:

  • The returned events are time filtered by their associated published field (unlike Polling Requests).
  • The returned events are guaranteed to be in order according to the published field.
  • They have a finite number of pages. That is, the last page doesn’t contain a next link relation header.
  • Not all events for the specified time range may be present; events may be delayed. Such delays are rare but possible.

For performance, we use an adaptive watermark approach: query the last 72 hours to find the latest ingest time; if nothing is found within that window, re-query the whole table to find the latest ingest time. This avoids scanning the whole table on every run.


from datetime import datetime, timedelta

# Adaptive watermark: look back 72 hours for the most recent ingested event
d = datetime.today() - timedelta(days=3)
beginDate = d.strftime("%Y-%m-%d")

watermark = sql("SELECT max(published) FROM okta_demo.okta_system_logs WHERE date >= '{0}'".format(beginDate)).first()[0]
if not watermark:
  # Nothing ingested in the last 72 hours: fall back to scanning the whole table
  watermark = sql("SELECT max(published) FROM okta_demo.okta_system_logs").first()[0]

Figure 3. Cmd 3 of “2.Okta_Ingest_Logs” notebook

We will construct an API request using the Okta API token, as shown below, and break the records apart into individual JSON rows.


import json
import requests

# TOKEN, URL_BASE, LIMIT and SINCE are defined earlier in the notebook
headers = {'Authorization': 'SSWS ' + TOKEN}
url = URL_BASE + "api/v1/logs?limit=" + str(LIMIT) + "&sortOrder=ASCENDING&since=" + SINCE
r = requests.get(url, headers=headers)

# One JSON string per log event
jsons = [json.dumps(x) for x in r.json()]

Figure 4. Cmd 4 of “2.Okta_Ingest_Logs” notebook
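
The snippet above fetches a single page of events. If your time window contains more events than the limit parameter allows, the bounded request semantics described earlier let you follow the next link relation header until Okta stops returning one. The following is a minimal, hedged sketch of that loop (not the notebook’s exact code), reusing the headers and url variables from above:

# Illustrative pagination sketch: keep requesting pages until no "next" link is returned
jsons = []
next_url = url
while next_url:
    resp = requests.get(next_url, headers=headers)
    jsons.extend(json.dumps(x) for x in resp.json())
    # requests exposes parsed Link headers via resp.links; the last page of a
    # bounded request has no "next" relation, which ends the loop
    next_url = resp.links.get("next", {}).get("url")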

Transform the JSON rows into a DataFrame:


from pyspark.sql import Row
from pyspark.sql import functions as f

# okta_schema (the expected Okta event schema) is defined earlier in the notebook
df = (
    sc.parallelize([Row(recordJson=x) for x in jsons]).toDF()
    .withColumn("record", f.from_json(f.col("recordJson"), okta_schema))
    .withColumn("date", f.col("record.published").cast("date"))
    .select(
        "date",
        "record.*",
        "recordJson",
    )
)

Figure 5. Cmd 4 of “2.Okta_Ingest_Logs” notebook

Persist the records into a Delta table:


df.write \
   .option("mergeSchema", "true")\
   .format('delta') \
   .mode('append') \
   .partitionBy("date") \
   .save(STORAGE_PATH)

Figure 6. Cmd 4 of “2.Okta_Ingest_Logs” notebook

As shown above, the Okta data collection takes less than 50 lines of code, and you can run it automatically at short intervals by scheduling it as a Databricks job.
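
If you would rather define that schedule programmatically than through the Jobs UI, the sketch below shows one way to do it with the Databricks Jobs API (2.1). It is a hedged example, not part of the published notebooks: the workspace URL, token, notebook path, cluster ID and 30-minute cron interval are all placeholders to replace with your own values.

import requests

# Placeholders -- substitute your own workspace URL, token, notebook path and cluster ID
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
DATABRICKS_TOKEN = "<personal-access-token>"

job_spec = {
    "name": "okta-system-log-ingest",
    "tasks": [
        {
            "task_key": "ingest_okta_logs",
            "notebook_task": {"notebook_path": "/Repos/security/2.Okta_Ingest_Logs"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
    # Quartz cron expression: run every 30 minutes
    "schedule": {"quartz_cron_expression": "0 0/30 * * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    DATABRICKS_HOST + "/api/2.1/jobs/create",
    headers={"Authorization": "Bearer " + DATABRICKS_TOKEN},
    json=job_spec,
)
print(resp.json())  # returns the new job_id on success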

Your Okta system logs are now in Databricks. Let’s do some analysis!

Analyzing Okta System Logs

For our analysis, we will be referring to the “System Log queries for attempted account takeover” knowledge content that the nice folks at Okta published along with their docs.

Okta Impersonation Session Search

Reportedly, an attacker compromised the endpoint of a third-party support employee with elevated permissions (such as the ability to force a password reset on an Okta customer account). Customer security teams may want to start looking for a few events in the logs for any indication of compromise to their Okta tenant.

Let us start with administrator activity. This query searches for impersonation events reportedly used in the LAPSUS$ activity. user.session.impersonation events are rare, normally triggered only when an Okta support person requests admin access for troubleshooting, so you probably won’t see many.


SELECT
  eventType,
  count(eventType)
from
  okta_demo.okta_system_logs
where
  date >= date('2021-12-01')
  and eventType in (
    "user.session.impersonation.initiate",
    "user.session.impersonation.grant",
    "user.session.impersonation.extend",
 "user.session.impersonation.end",
    "user.session.impersonation.revoke"
  )
group by eventType

Figure 7. Cmd 4 of “3.Okta_Analytics” notebook

In the results, if you see a user.session.impersonation.initiate event (triggered when a support staff member impersonates an admin) but no user.session.impersonation.grant event (triggered when an admin grants access to support), that is cause for concern! We provided a detailed query in the notebooks that detects “impersonation initiations” that are missing a corresponding “impersonation grant” or “impersonation end”. You can review user.session.impersonation events and correlate them with legitimately opened Okta support tickets to determine whether these are anomalous. See Okta API event types for documentation and Cloudflare’s investigation of the January 2022 Okta compromise for a real-world scenario.
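
The notebooks contain the full detection logic, but as a rough sketch of the idea, a query along the following lines surfaces sessions that contain an impersonation initiation without a matching grant. This is illustrative rather than the notebook’s exact query, and it assumes the session is identified by authenticationContext.externalSessionId, as in the MFA query later in this post.

# Hedged sketch: impersonation sessions with an initiate event but no grant event
suspect_sessions = spark.sql("""
  SELECT
    authenticationContext.externalSessionId AS sessionId,
    collect_set(actor.alternateId)          AS actors,
    min(published)                          AS firstSeen,
    collect_set(eventType)                  AS impersonationEvents
  FROM okta_demo.okta_system_logs
  WHERE date >= date('2021-12-01')
    AND eventType LIKE 'user.session.impersonation.%'
  GROUP BY authenticationContext.externalSessionId
  HAVING array_contains(collect_set(eventType), 'user.session.impersonation.initiate')
     AND NOT array_contains(collect_set(eventType), 'user.session.impersonation.grant')
""")
display(suspect_sessions)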

Employees Who Recently Reset Their Password or Modified Their MFA in Okta

Now, let’s look for any employee account that had its password reset or its multi-factor authentication (MFA) modified in any way since December 1. Event types within Okta that help with this search are: user.account.reset_password, user.mfa.factor.update, system.mfa.factor.deactivate, user.mfa.attempt_bypass, and user.mfa.factor.reset_all (you can look into the Okta docs to capture more events and expand your analysis as needed). We’re looking for an “actor.alternateId” of system@okta.com, which appears when the Okta support organization initiates a password reset. Note that although we are also looking for the “Update Password” event below, Okta’s support representatives do not have the capability of updating passwords – they can only reset them.


SELECT
  actor.alternateId,
  *
from
  okta_demo.okta_system_logs
where
  (
    (
      eventType = "user.account.update_password"
      and actor.alternateId = "system@okta.com"
    )
    or (
      eventType = "user.account.reset_password"
      and actor.alternateId = "system@okta.com"
    )
    or eventType = "user.mfa.factor.update"
    or eventType = "system.mfa.factor.deactivate"
    or eventType = "user.mfa.attempt_bypass"
    or eventType = "user.mfa.factor.reset_all"
  )
  and date >= date('2021-12-01')

Figure 8. Cmd 8 of “3.Okta_Analytics” notebook

If you see results from this query, you may have Okta user accounts which require further investigation – especially if they are privileged or sensitive users.

MFA Fatigue Attacks

Multi-factor authentication (MFA) is among the most effective security controls at preventing account takeovers, but it isn’t infallible. It was reportedly abused during the SolarWinds compromise and by LAPSUS$. This technique is called an MFA fatigue attack or MFA prompt bombing. With it, an adversary uses previously stolen usernames and passwords to log in to an account protected by push MFA and triggers many push notifications to the victim (typically to their phone) until they tire of the alerts and approve a request. Simple! How would we detect these attacks? We took inspiration from this blog post by James Brodsky at Okta to address just that.


SELECT
    authenticationContext.externalSessionId externalSessionId, actor.alternateId, min(published) as firstTime, max(published) as lastTime,  
    count(eventType) FILTER (where eventType="system.push.send_factor_verify_push") pushes,
    count(legacyEventType) FILTER (where legacyEventType="core.user.factor.attempt_success") as successes, 
    count(legacyEventType) FILTER (where legacyEventType="core.user.factor.attempt_fail") as failures,  
    unix_timestamp(max(published)) -  unix_timestamp(min(published)) as elapsetime 
from
  okta_demo.okta_system_logs
where
  eventType = "system.push.send_factor_verify_push" 
   OR 
  ((legacyEventType = "core.user.factor.attempt_success") AND (debugContext.debugData like "%OKTA_VERIFY_PUSH%"))
  OR 
  ((legacyEventType = "core.user.factor.attempt_fail") AND (debugContext.debugData like "%OKTA_VERIFY_PUSH%"))
  
group by authenticationContext.externalSessionId, actor.alternateId
having elapsetime <=600  and successes>0 AND pushes>=3 and failures >= 1

Figure 9. Cmd 11 of “3.Okta_Analytics” notebook

Here is what the above query looks for: First, it reads in MFA push notification events and their matching success or failure events, per unique session ID and user. It then calculates the time elapsed during the login period (limited to 10 minutes), along with the number of push notifications sent and the number responded to affirmatively and negatively. Then it makes simple decisions based on the combination of results returned. If three or more pushes are seen within that window, at least one of them fails, and one is ultimately approved, then this could be something worth more investigation.

Recommendations

If you are an Okta customer, we recommend reaching out to your account team for further information and guidance.

We also suggest the following actions:

  • Enable and strengthen MFA implementation for all user accounts.
  • Ingest and store Okta logs in your Databricks Lakehouse.
  • Investigate and respond:
    • Continuously monitor rare support-initiated events, such as Okta impersonation sessions, using Databricks jobs.
    • Monitor for suspicious password resets and MFA-related events.

Conclusion

In this blog post you learned how easy it is to ingest Okta system logs into your Databricks Lakehouse. You also saw a couple of analysis examples to hunt for signs of compromise within your Okta events. Stay tuned for more blog posts that build even more value on this use case by applying ML and using Databricks SQL.

We invite you to log in to your own Databricks account, or to Databricks Community Edition, and run these notebooks. Please refer to the docs for detailed instructions on importing and running the notebooks.

We look forward to your questions and suggestions. You can reach us at: cybersecurity@databricks.com. Also if you are curious about how Databricks approaches security, please review our Security & Trust Center.

Acknowledgments

Thank you to all of the staff across the industry, at Okta, and at Databricks, who have been working to keep everyone secure.

--

Try Databricks for free. Get started today.



How CareSource Modernized Its Data Architecture to Provide Better Healthcare to Members


Data needs in a rapidly growing health care organization

CareSource is nationally recognized for leading the industry in providing member-centric health care coverage. The company’s managed care business model was founded in 1989, and today CareSource is one of the nation’s largest Medicaid managed care plans, serving more than 2 million members across six states with the support of a growing workforce of 4,500. CareSource’s holistic model of care addresses the clinical and social barriers that can lead to poorer health outcomes. The company’s regional, community-based, multi-disciplinary care management teams comb through the data and social factors that affect physical, mental and psychosocial health, and integrate those insights into efforts to improve the health and overall well-being of its members and the populations it serves.

CareSource has grown exponentially over the past 30 years. As a result, our legacy data systems couldn’t keep up with the influx of new members, and we had to start applying band-aids just to get by. For example, we had to run jobs that were designed to run monthly on a daily basis, which the systems were simply not built to handle. Consequently, we expended significant resources just to maintain our systems.

All of this meant that CareSource needed to implement a more modern data platform — one that lived in the cloud and could scale, be performant, and be future-proof.

Creating a modern data platform for modern data needs

Deploying the Databricks Lakehouse Platform on Azure helped CareSource fast-track our data analytics journey by removing data silos and creating a single source of truth for all of the data coming in. The driving force was to create a data platform that was agile, efficient, secure, and highly trusted by decision-makers.

To accomplish this, we turned to the experts at Databricks to help with implementing a three-layer architecture, with the first layer serving as a way to ingest large amounts of raw data. Then, we wrote all our transformation logic and processing logic in Databricks notebooks to feed into an integrated layer that leverages an industry-standard data model: the IBM Universal Health Care Data Model. From there, the second level of transformation (again, through Databricks notebooks) occurs in the final layer to get the data into an easily consumable structure primed for downstream analytics use cases designed to better serve our members and their health care needs.

Modern data platform utilized by CareSource and featuring the Databricks Lakehouse
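
As a rough illustration of that three-layer flow, each layer is essentially a notebook that reads the previous layer’s Delta tables, applies transformation logic and writes the next layer. The sketch below is hypothetical: the paths, table names and columns are placeholders, not CareSource’s actual schema or the IBM Universal Health Care Data Model.

from pyspark.sql import functions as F

# Layer 1 (raw): land source claim files as-is in Delta -- hypothetical path and table names
raw = spark.read.json("/mnt/landing/claims/")
raw.write.format("delta").mode("append").saveAsTable("raw.claims")

# Layer 2 (integrated): conform the raw claims to the standard data model
integrated = (
    spark.table("raw.claims")
    .dropDuplicates(["claim_id"])
    .withColumn("service_date", F.to_date("service_date"))
    .select("claim_id", "member_id", "service_date", "billed_amount")
)
integrated.write.format("delta").mode("overwrite").saveAsTable("integrated.claims")

# Layer 3 (consumption): shape the data for downstream analytics
consumption = (
    spark.table("integrated.claims")
    .groupBy("member_id", F.year("service_date").alias("service_year"))
    .agg(F.sum("billed_amount").alias("total_billed"))
)
consumption.write.format("delta").mode("overwrite").saveAsTable("analytics.member_claim_summary")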

With this modern lakehouse architecture in place, our analysts have also been excited to turn to Databricks SQL to better analyze the data in a more intuitive and visual way. In fact, some analysts are even using SQL directly within the Databricks SQL interface, and are planning to rewrite our existing predictive models in a more efficient way within the Databricks environment.

Creating real health care impact with data

Thanks to the power of Databricks’ Lakehouse, we’ve been able to drive real impact on the way we process and analyze data. As an example, when our members get a prescription filled, the claim goes to a third-party company. While we receive data about the claim being adjudicated and paid, we also receive an invoice from this third party.

All this data flows between our two entities, but we’ve never been able to automatically reconcile that data. With Databricks serving as the foundation for our modern data platform, we’re able to proactively generate our own invoice dollar amount before actually receiving the invoice, and it matches down to the penny. This is a huge differentiator for CareSource. Most peers in the industry can’t do this, both from an operating perspective and in terms of the reliability and usability of the data.

Looking ahead, our work on predictive modeling with the help of Databricks will also have a huge impact on the well-being of our members. We’re currently in the process of rewriting a predictive model for high-risk pregnancies, for instance. By processing data points and setting certain parameters, we can essentially flag when someone is potentially going to have a high-risk pregnancy, and a workflow can automatically be created for a nurse to do proactive outreach. None of that would be possible without Databricks.

There’s still a lot we hope to accomplish with our modern data platform. Not only is more effective data sharing with Delta Sharing on our roadmap, but so is a deeper exploration of AI and ML to create even more robust and accurate predictive models. The high-performance platform will enable us to build and execute complex models more frequently leading to proactive health interventions. With the Databricks Lakehouse Platform powering our new data architecture, we can rest easy knowing that we’ll be making the best use of the data to better the lives of our millions of members.

--

Try Databricks for free. Get started today.


Unlocking the Power of Data: AT&T’s Modernization Journey to the Lakehouse


AT&T started its data transformation journey in late 2020 with a sizable mission: to move from our core on-premises Hadoop data lake to a modernized cloud architecture. Our strategy was to empower data teams by democratizing data, as well as scale AI efforts without overburdening our DevOps team. We saw enormous potential in increasing our use of insights for improving the AT&T customer experience, growing the AT&T business and operating more efficiently.

While some businesses might accomplish smaller migrations more readily, we at AT&T had a lot to consider. Our data platform ecosystem of technologies ingests over 10 petabytes of data per day, and we manage over 100 million petabytes of data across the network. It was extremely important for us to take our time selecting the right tool for the job. Not only because of the large data volumes but also because we have 182 million wireless subscribers and 15 million broadband households to support who are using data. In addition, we have important systems that protect our customers against breaches and fraud. Essentially, we needed to democratize our data in order to use it to its full potential but balance that democratization with privacy, security, and data governance.

Our legacy architecture, which includes over six different data management platforms, enabled data teams to work closely with data and act on it quickly. But at the same time, it locked those efforts in silos. These distributed pockets of work led to challenges accessing and acquiring data, as well as data duplication and latency issues. Without a single truth from which to draw information, metrics were created out of different versions of data that reflected different points in time and levels of quality.

Ultimately, to realize the data-driven innovations we desired, we needed to modernize our infrastructure by moving to the cloud and adopting a data architecture built on the premise of open formats, simplicity, and cross-team collaboration. We chose Databricks Lakehouse as a critical component for this monumental initiative.

Accessible data leads to better insights and a center of excellence

2021 was all about getting AT&T’s on-premises data into the Databricks Lakehouse Platform. I’m excited to say that with Lakehouse as our unified platform, we’ve successfully moved all our core data lake data to the cloud.

Our data science team, who were the first adopters, adjusted to this change with ease. They have since been able to move their machine learning (ML) workloads to the cloud. This has enabled faster data retrieval, more data (if you can believe it), and access to modernized technologies that have brought fraud down to the lowest level in years. For example, we’ve been able to train and deploy models that detect fraudulent phone purchase attempts and then communicate that fraud across all channels to stop it completely. We’ve also seen a significant increase in operational efficiency, a reduction in customer churn, and an increase in customer LTV.

Within the CDO, we’ve been onboarding a large data engineering and data science community. We’re ingesting structured customer data into Delta Lake, as well as a large amount of raw, unstructured, real-time data, to help continue powering these important use cases.

But the value doesn’t stop at our ability to scale data science. Our business users have also been able to extract data insights through integrations that run Power BI and Tableau dashboards off the data in Delta Lake. The sales organization uses data-driven insights fed through Tableau to uncover new upsell opportunities. They are also able to generate recommendations on ideal responses based on the questions customers are asking.

Most importantly, moving to Databricks Lakehouse has enabled AT&T to move to the analytics center of excellence (COE) model. As we decentralize our technology team to support businesses more closely, we’re ultimately aiming to empower each business unit to serve themselves. This includes knowing who to reach out to if they have a question, where to find training, how to get a deeper understanding of how much they’re spending, and more. And for all of those reasons, the center of excellence has been key. It’s led to greater product adoption, and so much meaningful trust and appreciation from our partners.

Retiring on-prem entirely, making cost-saving gains, and accelerating success in 2022

In 2020, we succeeded in making the case and proving the benefits of moving to the cloud. The ability to rapidly execute our transformation plan helped us exceed our savings targets for 2021, and I’m expecting to do the same in 2022. The real win, however, is going to be the increased business benefits we expect to see this year as we continue moving our data over to Delta Lake so we can retire our on-prem system entirely.

This move will enable us to do really exciting things, like standardize our artificial intelligence (AI) tooling, scale data science and AI adoption across the business, support business agility through federation, and leverage more capabilities as our roadmap evolves.

I’m certain that the Databricks Lakehouse architecture will enable our future here at AT&T. It’s the target architecture for our AI use cases, and we are confident it will increase our business agility because in less than a year we have already seen the results of the federation and business value it enables. Critically, it also supports required data security and the governance for a single version of truth across our complex data ecosystem.

--

Try Databricks for free. Get started today.


Delivering Real-Time Data to Retailers with Delta Live Tables


Register for the Deliver Retail Insights webinar to learn more about how retailers are enabling real-time decisions with Delta Live Tables.

The pandemic has driven a rapid acceleration in omnichannel adoption. Retailers who were able to re-envision customer experiences for digital channels and deliver a variety of fast, convenient and safe alternatives for order fulfillment separated themselves from their peers during the crisis. As the virus abates, consumers are increasingly returning to physical stores while a sizable demand for the blending of online and in-store experiences remains.

For some retailers, the movement to omnichannel began well before the emergence of COVID-19. For these organizations, the pandemic moved forward timelines already in place. For others, a wait-and-see approach to buy-online pickup-in-store, curbside, home delivery and other fulfillment innovations meant a hurried bootstrapping of capabilities in the midst of lockdowns and labor disruptions. But whether an organization found itself picking up its existing pace or exiting the stands to join a race it was previously content to observe, most struggled with operational gaps and a diminished profitability in individual transactions.

Inventory challenges are becoming more visible

Well before supply chain disruptions became a topic for the evening news, retailers reported significant on-shelf availability challenges causing nearly a third of shoppers to not find an item they were looking for and retailers to miss out on nearly $1T in sales. Often detected after the fact through customer surveys and analysis of historical data, day-to-day occurrences remained somewhat hidden through customer-elected substitutions and inbound replenishment stock.

It wasn’t until the adoption of in-store fulfillment practices that sent employees to shelves on behalf of customers that the extent of the problem more fully came to light. Notifications sent by pickers informing customers of an item’s unavailability provided a more timely record of gaps between reported and actual inventory. As retailers juggled substitutions and order cancellations, customers found themselves switching to retail businesses that could best meet the expectations set through online experiences.

Resolving these challenges requires transformation

In-store inventory management practices have long been a sore point for the retail industry. Large footprints supporting numerous customers and housing a sizable number of SKUs are simply difficult to keep consistently stocked. Products can be easily misplaced, left in the backroom, restocked without being properly recorded, or otherwise lost in the shuffle of in-store activity. Periodic inventory counts provide a point-in-time assessment of units on-hand, but the infrequency of these events means the current state of in-store inventory is infrequently known with any certainty.

For some retailers, the solution is to move away from in-store fulfillment toward local order fulfillment centers optimized for this kind of activity. But this option may not be viable in every market and for every retailer. Instead, many retailers are focused on transforming existing store locations to better serve a variety of fulfillment options.

Real-time insights are key

Regardless of which path is taken, the more fragmented fulfillment landscape coupled with increasing expectations for faster delivery puts pressure on retailers to improve their knowledge of what products reside where. To address this, many are turning to real-time technologies which process large volumes of inventory-relevant event data from across multiple locations to track units on-hand. This information may be used to alert store associates of an issue requiring attention, trigger automated replenishment or adjust which items are presented as shoppers navigate online platforms.

Real-time data processing technologies are not new, but with advances in technology, like the introduction of the Lakehouse architecture, the Delta Lake data format and Delta Live Tables, organizations are finding that the development of enterprise-scale real-time inventory management systems is within reach.

Technology innovations enabling real-time insights

The Lakehouse architecture breaks down the lengthy and complex logic required to transform raw event data into actionable insight. Sometimes referred to as a medallion architecture, this architectural pattern breaks the end-to-end data flow into three stages or layers referred to as the bronze, silver and gold layers (Figure 1).


Figure 1. The end-to-end data flow of point-of-sale data through the Lakehouse architecture to calculate current inventory.

In the bronze layer, raw data received from in-store inventory management systems is persisted as-is to provide a record of the data in its original state. In the silver layer, the bronze data has been deduplicated, restructured and otherwise transformed to improve its accessibility to the more technical users responsible for building downstream workflows. In the gold layer, data is delivered as business-aligned information assets, such as current-state inventories, accessible across the organization.

This decomposition of work into digestible steps not only simplifies implementation; it also enables developers to more easily reuse information assets as they deliver various business-aligned outputs. The key challenge then becomes how to keep data in motion as it moves through the different stages so that the data objects the business consumes provide current state information.

This problem is addressed through the use of Delta Lake. Delta Lake builds on the highly popular Parquet data format, preserving the performance characteristics of Parquet while enhancing it with a transaction log.

The transaction log allows Delta Lake to support traditional data modification patterns that greatly simplify workflow development. It also enables workflows to recognize upstream data modifications so that even though data is persisted in each of the bronze, silver and gold layers of the lakehouse architecture, end-to-end, real-time data processing can continue uninterrupted.

If the lakehouse architecture, in combination with Delta Lake, provides the foundation for a robust real-time data processing and analytics infrastructure, how exactly are the workflows defining the movement of data through that infrastructure assembled? In the past, real-time data processing took place through specialized technologies that employed novel ways of thinking about data and their interactions. With Spark Structured Streaming, real-time data processing mechanics were brought inline with batch processing to make development much simpler. And now with Delta Live Tables, the definition of persistent streaming objects, the scheduling and orchestration of data movement between spans of objects, and the resource provisioning, monitoring and alerting that surrounds real-time workflows have been simplified even further.
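
To make that concrete, here is a minimal, hedged sketch of what a bronze-silver-gold inventory pipeline can look like using Delta Live Tables’ Python interface. The landing path, table names and columns (store_id, item_id, quantity_change and so on) are hypothetical placeholders rather than the schema used in the accompanying notebooks.

import dlt
from pyspark.sql import functions as f

@dlt.table(comment="Raw point-of-sale inventory events (bronze)")
def inventory_bronze():
    # Auto Loader incrementally picks up new files from a hypothetical landing path
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/pos/raw/")
    )

@dlt.table(comment="Deduplicated, typed inventory change events (silver)")
def inventory_silver():
    return (
        dlt.read_stream("inventory_bronze")
        .withColumn("event_ts", f.col("event_ts").cast("timestamp"))
        .dropDuplicates(["store_id", "item_id", "event_ts"])
    )

@dlt.table(comment="Current units on hand per store and item (gold)")
def inventory_current():
    return (
        dlt.read("inventory_silver")
        .groupBy("store_id", "item_id")
        .agg(f.sum("quantity_change").alias("units_on_hand"))
    )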

To learn more about Delta Live Tables, please check out this blog announcing its general availability. To see how the Lakehouse architecture, Delta Lake and Delta Live Tables can be employed together to deliver real-time insights into in-store inventories, please check out these notebooks.

Retailers’ need for real-time insights has never been greater, and the analytics solutions producing those insights have never been more accessible. We hope this information and the associated notebooks help your organization deliver the functionality it requires to achieve its omnichannel objectives.

Register for our webinar to learn more about how retailers are enabling real-time decisions with Delta Live Tables.

--

Try Databricks for free. Get started today.


DoD Data Decrees and the path to Lakehouse


Throughout both private industry and government, data-driven decision-making has made the quantity and quality of information critical for organizations. In 2018, the Foundations for Evidence-Based Policymaking Act was signed into law, establishing a framework for using data to facilitate the use of evidence in policy making. And more recently, in May of 2021, the Department of Defense (DoD) issued five decrees to “create a data advantage” by improving data sharing throughout the department, which will help eliminate data silos. According to the memo, the DoD rightly understands that becoming a “data-centric organization is critical to improving performance and creating decision advantage at all echelons from the battlespace to the board room.”

These five decrees recognize data’s importance as a strategic asset and aim to help the DoD build capabilities around data engineering and analysis, which are just as important to national security as any weapon system. Without good data and strategies around data management, engineering and security, the DoD could fall behind in critical command and control of data.

The DoD’s Five Data Decrees

As data grows across government, a number of bad patterns can emerge if there is not careful attention paid to management. Whether it’s data being siloed in databases, resulting in sharing challenges, or data of uncertain quality being applied to decision making, without clearly defined governance patterns, poor data practices can and will proliferate. The DoD data decrees aim to reduce these negative scenarios.

The five decrees are as follows:

  • Maximize data sharing and rights for data use: All DoD data is an enterprise resource.
  • Publish data assets in the DoD federated data catalog along with common interface specifications.
  • Use automated data interfaces that are externally and machine-readable; ensure interfaces use industry-standard, non-proprietary, preferably open source, technologies, protocols and payloads.
  • Store data in a manner that is platform- and environment-agnostic, uncoupled from hardware or software dependencies.
  • Implement industry best practices for secure authentication, access management, encryption, monitoring and protection of data at rest, in transit and in use.

Meeting the tenets of the Decrees

The Databricks Lakehouse Platform meets each of the tenets described in the memo by combining the best elements of data warehouses and data lakes to provide strong data governance with the performance of a data warehouse.

Sub: Data Sharing and Catalog

Although the DoD has moved from a “need to know” security basis – where access is only granted to those data which are necessary for one to conduct official duties – to a “need to share” approach which fosters much broader sharing of data between departments and agencies, sharing data between agencies can be challenging – and risky – with legacy tools that enable copies of data to proliferate. Data sharing can improve information-gathering within the department, and facilitate better cooperation with allies. However, unencrypted legacy technologies, like FTP for example, make it challenging to easily and securely share data. Having technology that advances data sharing in a secure and open fashion is the key to this tenet.

Databricks’ Unity Catalog meets the data catalog tenet and provides a single interface to discover, audit, trace lineage for, and govern data assets in one place. Some of its features include the ability to add role- and attribute-based security and metadata, such as tags on columns or tables, which help make data more identifiable and secure. Unity Catalog also provides a single interface, built on the open source Delta Sharing protocol, to manage and govern shared assets within an organization. This allows you to publish data assets in the federated DoD catalog, along with common interface specifications.
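
One practical consequence of building on the open Delta Sharing protocol is that a recipient does not need to run Databricks to consume shared data; any client that speaks the protocol can read a shared table. As a hedged illustration (the profile file and the share, schema and table names below are placeholders), the open source delta-sharing Python client reads a shared table like this:

import delta_sharing

# The data provider issues a profile file describing the sharing server and access token
profile_path = "/path/to/config.share"  # placeholder
table_url = profile_path + "#example_share.example_schema.example_table"  # share.schema.table, placeholder

# Load the shared table into a pandas DataFrame (load_as_spark is also available)
df = delta_sharing.load_as_pandas(table_url)
print(df.head())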

Each of these guiding principles aims to get data out of closed systems that cannot be easily shared across DoD. Tenets 3 and 4 set a direct course toward modern data platforms like Databricks, which allow datasets to be stored in low-cost object-based storage, and to be separated from compute. This allows for flexibility in choosing your compute tier and, more importantly, allows for much easier sharing of data than those locked in proprietary databases within agency and department walls.

Sub: Using open source technologies and uncoupling storage from compute

By moving data out of proprietary databases — where it needs to be extracted and loaded into another system — and into a data lake model, sharing data within the DoD becomes a much easier task. While there is always a lot of data gravity involved when you have petabytes and exabytes of data, modern storage makes sharing much easier.

Traditionally, analytics, business intelligence, data science and machine learning workloads were separate systems, creating organizational silos within an organization. With the data lakehouse architecture, you bring all of those tools together in one open system with Databricks.

Databricks for DoD

The Databricks platform is built on top of the open source Delta Lake storage platform, which brings reliability, security and performance to your data lake for streaming and batch workloads. This provides a single location where you can store structured data like CSV files, semi-structured data like JSON or XML, and unstructured files like video and audio. Delta Lake, an open source project powered by Databricks, provides a single source of truth, with the transactional support and schema enforcement that many data lakes and other big data platforms lack.
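
As a small, hedged illustration of that schema enforcement (the table name and columns below are invented for this example), Delta rejects an append whose schema conflicts with the table, unless a compatible schema change is explicitly allowed:

from pyspark.sql import Row

# Create a small Delta table -- hypothetical name and columns
spark.createDataFrame([Row(unit="Unit A", readiness=0.92)]) \
    .write.format("delta").saveAsTable("demo.readiness_scores")

# An append with an incompatible schema is rejected (schema enforcement)
bad_batch = spark.createDataFrame([Row(unit="Unit B", readiness="high")])  # wrong type for readiness
try:
    bad_batch.write.format("delta").mode("append").saveAsTable("demo.readiness_scores")
except Exception as e:
    print("Rejected by schema enforcement:", type(e).__name__)

# Adding a new column succeeds only when schema evolution is explicitly requested
new_cols = spark.createDataFrame([Row(unit="Unit C", readiness=0.88, last_reported="2022-04-01")])
new_cols.write.format("delta").mode("append") \
    .option("mergeSchema", "true").saveAsTable("demo.readiness_scores")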

Databricks also provides a single interface for managing access to those data assets in the catalog. In addition to multicloud support, Databricks implements best practices for secure authentication, access management and data protection, and meets the high demands of federal compliance programs such as FedRAMP High and DoD IL6.

Making the shift

Making the organizational shift to being more open with data can be a challenge both organizationally and technically. However, the shift becomes easier when it has broad organizational support. Databricks can help the DoD meet these goals and support better data sharing throughout the department in the following ways:

  1. Allowing for better data access and easier sharing of key data assets across the DoD.
  2. Federating data assets into a catalog that is accessible throughout the organization.
  3. Building on robust open source technologies like Apache Spark™ and Delta Lake.
  4. Decoupling storage from compute, which allows for a great deal of flexibility in the tools used for data analysis and reduces data gravity.
  5. Providing strong security controls that meet the highest-level DoD standards.

To learn more about how Databricks enables the DoD to create a data advantage, visit our Federal solution page.

--

Try Databricks for free. Get started today.


Launching dbt Cloud in Databricks Partner Connect


We are delighted to announce that dbt Cloud, the fastest and most reliable way to build, manage and monitor dbt projects, is now available in Databricks Partner Connect. Users who want to use dbt’s industry standard data transformation framework to transform data in their lakehouse can now connect Databricks to dbt Cloud with a few clicks, and even start a free trial if they do not have an existing account. They can develop, test, and deploy data models in dbt Cloud, while leveraging Photon-accelerated compute on Databricks which makes running data transformation workflows much faster than on legacy cloud data warehouses. We believe this latest integration, following on the heels of the recently announced native dbt-databricks adapter, makes Databricks the best place to build and productionize dbt projects.

Connect dbt Cloud to Databricks with a few clicks

Previously, connecting dbt Cloud to Databricks required multiple steps, including transferring credentials. Partner Connect lets you easily try and integrate partner offerings, ranging from ingest to ETL and ML/AI, into the lakehouse. Now, with Partner Connect, you have a seamless experience to try dbt Cloud with just a few clicks. The integration securely configures resources on Databricks and sets up dbt Cloud as well. You can run your first dbt models on Databricks in just a few minutes.

Once you have connected dbt Cloud with Databricks, you can use it to orchestrate SQL data pipelines, transforming raw data into model-ready data that can be used for downstream analytics and BI use cases.

Collaboratively develop in dbt Cloud

As your data team grows and dbt projects become more complex, you need to support CI/CD, monitor data models and receive alerts when things go wrong. dbt Cloud offers a collaborative IDE which is fully hosted and managed, allowing teams to onboard new members quickly without needing to manage infrastructure. dbt Cloud also offers turnkey support for CI/CD, robust version control, job scheduling, and testing as well as the ability to serve documentation and lineage. dbt Cloud generates standard SQL which runs on Databricks compute, including on SQL endpoints.

Databricks is a first class place to run dbt

We are excited about the power of dbt, and continue to add improvements which make the Databricks Lakehouse a great place to run dbt models. The Photon execution engine, only available on Databricks, automatically accelerates and improves SQL generated by dbt. This means your data models run faster and you don’t need any additional code changes or optimizations. Furthermore, data teams can continue to use their existing access control and governance process when using dbt, making it more scalable and easier to maintain.

Take dbt Cloud for a spin

dbt Cloud is now available to try in Partner Connect at no additional cost. To learn more, sign up for a live, hands-on workshop on building a modern data stack with dbt and Databricks. We will walk you through a step-by-step guide to achieving scalable data transformation pipelines entirely from scratch. Or you can join us on Slack: #db-databricks-and-spark.

--

Try Databricks for free. Get started today.


Announcing Databricks Support for AWS Graviton2 With up to 3x Better Price-Performance


Today, we are excited to announce the public preview of Databricks support for AWS Graviton2-based Amazon Elastic Compute Cloud (Amazon EC2) instances. The Graviton processors are custom designed and optimized by AWS to deliver the best price-performance for cloud workloads running in Amazon EC2. When used with Photon, the high performance Databricks query engine, Graviton2-based Amazon EC2 instances can deliver up to 3x-4x better price-performance than comparable Amazon EC2 instances for your data lakehouse workloads. In this blog post, we will go over the price-performance of Photon with Graviton2, and also give you additional tips to further reduce your AWS infrastructure cost.

Price-performance with Photon and Graviton2

To determine the price-performance of Photon + Graviton2, we ran a simple test with two different workloads (TPC-DS and a standard ETL workload with bulk inserts and merge statements) on a Graviton2-based R6gd EC2 instance and a comparable I3 EC2 instance. We found that the Photon engine alone significantly improved the price-performance of an EC2 instance. But Photon on the Graviton2-based instance took it a step further and delivered 3.3x better price-performance for the ETL workload and 3.7x better price-performance for the TPC-DS workload compared to the previous Databricks Runtime on the I3 instance. Customers who tried Graviton2-based instances have reported similar results and share our excitement! Here’s a quote from a Databricks customer who happens to know all about Arm-based Graviton instances:

“Cloud computing is driving significant innovation in semiconductor design, and by moving our design workloads to Arm-based AWS Graviton2-based instances that provide significant price performance gains, we see first-hand the benefits enabled by the Arm Neoverse N1 platform,” said Mark Galbraith, VP of productivity engineering, Arm. “This is especially evident for Databricks on Graviton2 and we look forward to migrating our production use of Databricks to Graviton2 to further enhance user experience and reduce expenses.”

Price-performance Comparison for Graviton and Photon


Additional cost savings with Amazon EC2 Spot Instances and Amazon EBS gp3 volumes support

In addition to Graviton2 and Photon, there are other ways to improve price-performance for your Databricks workloads on AWS. These include:

  • Amazon EC2 Spot Instances – Spot Instances enable you to take advantage of spare EC2 capacity and are available at up to a 90% discount compared to On-Demand prices. Depending on the nature of your workload, you may be able to replace the On-Demand or Reserved EC2 instances in your Databricks cluster with Spot Instances and save cost.
  • Amazon EBS gp3 volumes – Storage can be a big part of your cloud infrastructure cost. Databricks supports gp3 volumes. gp3 SSD volumes for Amazon Elastic Block Store (Amazon EBS) enable you to provision performance independent of storage capacity and can provide up to 20% better price-performance per GB than existing gp2 volumes.

To learn more about price-performance optimizations, please read our cluster best practices documentation.
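
To tie these options together, the following is a hedged sketch of a cluster specification that pairs a Graviton2-based instance type with a Photon runtime and runs workers on spot capacity (with an on-demand driver), submitted through the Clusters API. The instance type, runtime version, worker count and autotermination setting are illustrative placeholders, not sizing recommendations.

import requests

# Placeholders -- substitute your workspace URL and personal access token
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "graviton-photon-demo",
    "spark_version": "10.4.x-photon-scala2.12",  # a Photon runtime; choose a currently supported version
    "node_type_id": "r6gd.xlarge",               # Graviton2-based instance type
    "num_workers": 4,
    "autotermination_minutes": 30,
    "aws_attributes": {
        # Keep the driver on demand and run workers on spot capacity with fallback
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "auto",
    },
}

resp = requests.post(
    HOST + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + TOKEN},
    json=cluster_spec,
)
print(resp.json())  # returns cluster_id on success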

Get Started with Graviton2

AWS Graviton2-based instance support in public preview is currently rolling out and will be available in all supported regions in the next few weeks. To get started and for guidance on migration to Graviton2 and Photon please read our Graviton documentation.

--

Try Databricks for free. Get started today.


Increasing Healthcare Equity With Data


Social determinants of health (SDOH) have an indisputable impact on health equity. They have long been a concern of the CDC, healthcare professionals, and many government agencies whose constituents experience health inequities due to nonmedical social and economic factors, such as race, income and sexual orientation. According to the CDC, “Health inequities are reflected in differences in length of life; quality of life; rates of disease, disability and death; severity of disease; and access to treatment.” Negative consequences of health inequities include lower quality of life, but the good news is that use of data relevant to social determinants of health can play a large role in helping to identify disparities and prioritize health equity.

Closing the gap on health disparities requires analyzing many rich sources of data, which can be challenging. The pandemic and the accompanying vaccine distribution rates among various socioeconomic and social groups provide the most recent example. It can be helpful to use COVID-19 to bring visibility to this issue and illustrate such disparities through the use of data. However, it’s important to note that health equity is relevant to many use cases across local, state and federal governments.

Using the example of COVID-19 vaccinations, existing data sources can provide valuable insights into the causes that may underlie lower vaccination rates in certain communities. COVID-19 vaccines have been widely available across the United States for at least a year, but vaccination rates vary widely not only across states but within counties and at subcounty levels. While basic information about those who have been vaccinated — for example, age, ethnicity and gender — provides limited insight into groups of underserved people, there are many additional data sources that can be leveraged to gain a more comprehensive view. For our analysis, we’ll use existing and public data sets such as income, educational attainment, population density and health traits such as asthma, cancer rates, obesity rates and medical insurance coverage, among others.

How’s it done today?

Although the above data sets and other private data sets exist within various county and state departments such as health, labor, justice and family services, the challenge that has historically faced decision makers is the inability to access all of these data sets. To help visualize these challenges, let’s consider an all too common conversation between Heather, a biostatistician who is looking for correlations between cost of claims and social determinants of health, and Ryan, a database administrator for the Medicare and Medicaid database.

Sample conversation between a biostatistician who is looking for correlations between cost of claims and social determinants of health and a public health database administrator.

A similarly aggravating process plays out for each additional data source that is needed. Even though in this scenario, accessing sensitive public health data like medical claims would likely require a security review no matter the data platform, consider what would change if Heather had nonsensitive data she sourced locally on her laptop and just needed more compute power than what her laptop was capable of. She would still need:

  • The infrastructure team to provide compute
  • A data platform to process the data
  • The ETL team to load the data into the data platform
  • Analytics tools to perform the analysis

Even in a cloud environment, biostatisticians and data analysts are not expected to know how to provide their own database, ETL tools and compute, and so additional teams would need to be involved.

A better way: a modern data platform

Now let’s look at how Heather would use Databricks Lakehouse Platform, a modern data platform, to support her initiative. She would:

  • Upload her data to her S3, ADLS or GCS account
  • Perform any data cleansing required using R, SQL or Python
  • Use ephemeral compute for data cleansing and analytics
  • Leverage collaborative notebooks to conduct her analysis
  • Share the results of her analysis both within Databricks and externally to other BI tools

Note the key differences between the lakehouse and the “way it’s always been done.” Using the existing skill set of Python, R or SQL, Heather can ingest, cleanse and use the data without going through a lengthy and complex process of coordinating across multiple IT teams.

COVID-19 vaccination rates

Using the lakehouse, we will perform an analysis very similar to the one Heather was attempting to do. Using JSON and CSV files collected from various public data sources, we will upload the data to our cloud storage account, cleanse it and identify what factors are most influential for COVID-19 vaccination rates.

The data is aggregated at a county level and covers the population percentage that is fully vaccinated, as well as data for racial and population density, education and income level, and obesity, cancer, smoking, asthma and health insurance coverage rates. Initially, we will ingest the data in a mostly raw form. This allows for quick data exploration. Below is the step that takes the vaccination rates from a CSV file, performs a simple date parsing step, then saves the data to a Delta table.

from pyspark.sql.functions import to_date, col

# Read the raw vaccination CSV and parse the Date column
dfVaccs = spark.read.csv(storageBase + "/COVID-19_Vaccinations_in_the_United_States_County.csv",
                         header=True, inferSchema=True)
dfVaccs = dfVaccs.withColumn("Date", to_date(col("Date"), "MM/dd/yyyy"))
display(dfVaccs)

# Persist as the Bronze (raw) Delta table
dfVaccs.write.format("delta").mode("overwrite").option("mergeSchema", True) \
    .option("path", storageBase + "/delta/bronze_vaccinations").saveAsTable("sdoh.bronze_vaccinations")

Similar steps are repeated for the other data sets to complete the Bronze, or raw, layer of data. Next, the Silver layer of refined data is created where missing data, such as FIPS code, are added and unneeded data is filtered out. The following step creates a health traits table that includes only the traits we are interested in and pivots the table to make it easier to work with for our use case.

-- {storageBase} is substituted by the notebook before this SQL runs
create table sdoh.silver_health_stats
using delta
location '{storageBase}/sdoh/delta/silver_health_stats'
as select * from sdoh.bronze_health_stats
pivot(
   MAX(data_value) AS data_v
   FOR measure IN ('Current_smoking_among_adults_aged_18_years' AS SmokingPct,
   'Obesity_among_adults_aged_18_years' AS ObesityPct,
   'Coronary_heart_disease_among_adults_aged_18_years' AS HeartDiseasePct,
   'Cancer_excluding_skin_cancer_among_adults_aged_18_years' AS CancerPct,
   'Current_lack_of_health_insurance_among_adults_aged_18-64_years' AS NoHealthInsPct,
   'Current_asthma_among_adults_aged_18_years' AS AsthmaPct)
)
order by LocationID

After the data cleansing is complete, we have one row per county that includes each attribute we intend on analyzing. Below is a partial listing of the data.

A sample health traits table including only the traits of interest, generated by the Databricks Healthcare Lakehouse.

To perform the analysis, we are going to use XGBoost to train a regression model. For brevity, only the model setup and training are shown.

from xgboost import XGBRegressor

# max_depth, learning_rate and reg_alpha are hyperparameters defined earlier in the notebook
xgb_regressor = XGBRegressor(objective='reg:squarederror', max_depth=max_depth, learning_rate=learning_rate, reg_alpha=reg_alpha, n_estimators=3000, importance_type='total_gain', random_state=0)

xgb_model = xgb_regressor.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='rmse', early_stopping_rounds=25)

The model has a mean squared error of 6.8%, meaning the predicted vaccination rate could be within +/- 6.8 percentage points of the actual rate. While we are not interested in predicting future vaccination rates, we can use the model to explain how each attribute influenced the vaccination rate. To perform this analysis we will use SHAP. There is a dedicated Databricks blog entry on SHAP that shows why it is so powerful for quantifying how much each attribute influenced the model.
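
As a rough sketch of what that SHAP step can look like (assuming the xgb_model and X_train objects from the training step above), the attribute-level influence can be computed and summarized in a few lines:

import shap

# Explain the trained XGBoost model with SHAP (sketch; xgb_model and X_train
# come from the training step above)
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_train)

# Summarize how strongly each attribute pushed the predicted vaccination rate
shap.summary_plot(shap_values, X_train, plot_type="bar")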

Results

When we summarize and visualize the results for all attributes in every county, we see that lack of health insurance was the most influential factor in determining vaccination rates. What makes this interesting is that the COVID-19 vaccine has been free to everyone, so health insurance or a lack thereof should not have been a barrier to getting vaccinated. After health insurance, level of income and population density rounded out the top three factors.

Sample health equity chart, visualizing the factors influencing vaccination rates.

While creating a model that covers the entire United States is interesting and insightful, local trends may not be apparent on such a large scale. Creating the same model but with the data limited to counties within the state of California produces a very different picture.

Sample health equity chart, visualizing the factors influencing vaccination rates by California county.

By a large margin, population density was the most influential factor in the vaccination rate of the counties within California. The percentage of the population who identified as smokers was a distant second, whereas health insurance status was not even in the top six factors.

Finally, we can take the top factor for every county from our whole country model and visualize it as a map (below). These details can show us factors that are relevant by state or region and compare them to those of an individual county to understand outliers or patterns. This knowledge can help us begin to address gaps in health equity impacting the most vulnerable members of our constituency.

Sample health equity chart, visualizing the factors influencing vaccination rates across the US.

What’s next

Publicly available data sets provide a great starting point in visualizing gaps in population health, as you can see through this example with COVID-19 vaccinations. However, this is one small use case that I hope illustrates the insights possible and progress toward health equity that is within reach when leveraging the Databricks Lakehouse. When we are able to bring together more data from a variety of sources, we can achieve greater insight and positively impact health policy and outcomes for citizens who need our support in ensuring a more equitable distribution of health in the future.

Read more about Data Analytics and AI for State and Local Governments on our Databricks industry page.

--

Try Databricks for free. Get started today.

The post Increasing Healthcare Equity With Data appeared first on Databricks.


Supercharge Your Machine Learning Projects With Databricks AutoML — Now Generally Available!

Machine Learning (ML) is at the heart of innovation across industries, creating new opportunities to add value and reduce cost. At the same time, ML is hard to do and it takes an enormous amount of skill and time to build and deploy reliable ML models. Databricks AutoML — now generally available (GA) — automatically trains models on a dataset and generates customizable source code, significantly reducing the time-to-value of ML projects. This glass-box approach to automated ML provides a realistic path to production with low to no code, while also giving ML experts a jumpstart by creating baseline models that they can reproduce, tune, and improve.

In a few clicks, AutoML users can ingest a dataset and start training machine learning models

What can Databricks AutoML do for you?

No matter your background in data science, AutoML can help you get to production machine learning quickly. All you need is a training dataset and AutoML does the rest. AutoML prepares the data for training, runs data exploration, trials multiple model candidates, and generates a Python notebook with the source code tailored to the provided dataset for each trial run. It also automatically distributes hyperparameter tuning and records all experiment artifacts and results in MLflow. It is ridiculously easy to get started with AutoML, and hundreds of customers are using this tool today to solve a variety of problems.

AutoML generates a Notebook with the source code for each machine learning model generated in the trial runs

Fabletics, for example, is leveraging AutoML — from data prep to model deployment — to predict customer churn. Allscripts, a leader in electronic healthcare systems, is applying AutoML to improve their customer service experience by predicting outages. Both customers chose AutoML not just for its simplicity, but also its transparency and openness. While most automated machine learning solutions in the market today are opaque boxes with no visibility under the hood, Databricks AutoML generates editable notebooks with the source code of the model, visualizations and summary of the input data, and explanations on feature importance and model behavior.

AutoML generates Shapley values that explain relative feature importance in a trial model

Our customers’ use-cases also signify the broad relevance of AutoML. The team of data scientists at Fabletics uses AutoML to quickly generate baseline models that they can tune and improve. At Allscripts, on the other hand, 3 customer success engineers with no prior background in data science were able to train and deploy classification models in a few weeks. In both cases, the results were impressive – Fabletics was able to generate and tune models in 30 minutes (which previously had taken them days), and Allscripts saw massive improvement in their customer service operations when they put their AutoML model into production. AutoML is now the starting point for new ML initiatives at both companies, and their deployments are part of multi-task workflows they’ve built within the Databricks Lakehouse.

AutoML is now generally available – here’s how to get started

Databricks AutoML is now generally available (GA); here’s how you can get up and running with AutoML in a few quick steps –

Step 1: Ingest data into the lakehouse. For this example, where we want a predictive troubleshooting model based on server logs, we have generated some training data. We have done this right in our notebook, which you can import here, and in just a few seconds ingested this data into your lakehouse.

In this example, there are 5 million rows of network logs being generated, with some of the data biased toward causing network failures and the rest random, simulating noise or uncorrelated data. Each row of data is labeled with a classification stating whether the system had failed recently or not.

Step 2: Let AutoML automatically train the models for you. We can simply feed the data into Databricks AutoML, tell it which field we’d like it to predict, and AutoML will begin training and tracking many different approaches to creating the best model for the provided data. Even as a seasoned ML practitioner, the amount of time saved by automatically iterating over many models and surfacing the resulting metrics is amazing.
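
As a minimal sketch of what that looks like from a notebook (the DataFrame name train_df and the "failure" label column are placeholders for the generated server-log dataset; see the AutoML documentation for the full set of arguments):

from databricks import automl

# Kick off an AutoML classification experiment (sketch; train_df and the
# "failure" label column are placeholders for the generated server-log data)
summary = automl.classify(
    dataset=train_df,
    target_col="failure",
    timeout_minutes=30,
)

# The returned summary links to the generated trial notebooks and MLflow runs
print(summary.best_trial.model_path)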

Step 3: Choose the model that best fits your needs and optimize. In only a couple of minutes, AutoML trains several models and churns out model performance metrics. For this specific set of data, the highest-performing model was a decision tree, but there was also a logistic regression model that performed well. Both models had satisfactory F1 scores, which shows they fit the validation data well. But that’s not all – each model created by AutoML comes with customizable source-code notebooks specific to the dataset and model. This means that once a trained model shows promise, it’s exceptionally easy to begin tailoring it to meet the desired threshold or specifications.

Step 4: Deploy with MLflow. Select the best model – as defined by your metrics – and register it to the MLflow Model Registry. From here, you can serve the model using MLflow Model Serving on Databricks as a REST endpoint.
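
A hedged sketch of that last step, assuming the summary object from the AutoML run above (the registry name is illustrative):

import mlflow

# Register the best AutoML trial in the MLflow Model Registry (sketch; the
# registry name "network_failure_classifier" is illustrative)
model_uri = summary.best_trial.model_path
registered_model = mlflow.register_model(model_uri, "network_failure_classifier")

# From the registry, the model can be enabled for MLflow Model Serving on
# Databricks and queried as a REST endpoint.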

Ready to get started? Take it for a spin, or dive deeper into AutoML with the below resources.

Learn more about AutoML

  • Take it for a spin! Check out the AutoML free trial
  • Dive deeper into the Databricks AutoML documentation
  • Check out this introductory video: AutoML – A glass-box approach to automated machine learning
  • Check out this fabulous use-case with our customer Fabletics: Using AutoML to predict customer churn
  • Learn more about time-series forecasting with AutoML in this blog

--

Try Databricks for free. Get started today.

The post Supercharge Your Machine Learning Projects With Databricks AutoML — Now Generally Available! appeared first on Databricks.

Model Evaluation in MLflow

Many data scientists and ML engineers today use MLflow to manage their models. MLflow is an open-source platform that enables users to govern all aspects of the ML lifecycle, including but not limited to experimentation, reproducibility, deployment, and model registry. A critical step during the development of ML models is the evaluation of their performance on novel datasets.

Motivation

Why Do We Evaluate Models?

Model evaluation is an integral part of the ML lifecycle. It enables data scientists to measure, interpret, and explain the performance of their models. It accelerates the model development timeframe by providing insights into how and why models are performing the way that they are performing. Especially as the complexity of ML models increases, being able to swiftly observe and understand the performance of ML models is essential in a successful ML development journey.

State of Model Evaluation in MLflow

Until now, users could evaluate the performance of their MLflow models of the python_function (pyfunc) flavor through the mlflow.evaluate API, which supports the evaluation of both classification and regression models. It computes and logs a set of built-in task-specific performance metrics, model performance plots, and model explanations to the MLflow Tracking server.

To evaluate MLflow models against custom metrics not included in the built-in evaluation metric set, users would have to define a custom model evaluator plugin. This would involve creating a custom evaluator class that implements the ModelEvaluator interface, then registering an evaluator entry point as part of an MLflow plugin. This rigidity and complexity could be prohibitive for users.

According to an internal customer survey, 75% of respondents say they frequently or always use specialized, business-focused metrics in addition to basic ones like accuracy and loss. Data scientists often utilize these custom metrics as they are more descriptive of business objectives (e.g. conversion rate), and contain additional heuristics not captured by the model prediction itself.

In this blog, we introduce an easy and convenient way of evaluating MLflow models on user-defined custom metrics. With this functionality, a data scientist can easily incorporate this logic at the model evaluation stage and quickly determine the best-performing model without further downstream analysis.

Usage

Built-in Metrics

MLflow bakes in a set of commonly used performance and model explainability metrics for both classifier and regressor models. Evaluating models on these metrics is straightforward. All we need to do is create an evaluation dataset containing the test data and targets and make a call to mlflow.evaluate.

Depending on the type of model, different metrics are computed. Refer to the Default Evaluator behavior section under the API documentation of mlflow.evaluate for the most up-to-date information regarding built-in metrics.

Example

Below is a simple example of how a classifier MLflow model is evaluated with built-in metrics.

First, import the necessary libraries

import xgboost
import shap
import mlflow
from sklearn.model_selection import train_test_split

Then, we split the dataset, fit the model, and create our evaluation dataset

# load UCI Adult Data Set; segment it into training and test sets
X, y = shap.datasets.adult()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# train XGBoost model
model = xgboost.XGBClassifier().fit(X_train, y_train)

# construct an evaluation dataset from the test set
eval_data = X_test
eval_data["target"] = y_test

Finally, we start an MLflow run and call mlflow.evaluate

with mlflow.start_run() as run:
   model_info = mlflow.sklearn.log_model(model, "model")
   result = mlflow.evaluate(
       model_info.model_uri,
       eval_data,
       targets="target",
       model_type="classifier",
       dataset_name="adult",
       evaluators=["default"],
   )

We can find the logged metrics and artifacts in the MLflow UI:

Using the MLflow UI to find the logged metrics and artifacts.

Custom Metrics

To evaluate a model against custom metrics, we simply pass a list of custom metric functions to the mlflow.evaluate API.

Function Definition Requirements

Custom metric functions should accept two required parameters and one optional parameter in the following order:

  1. eval_df: a Pandas or Spark DataFrame containing a prediction and a target column.

    E.g. If the output of the model is a vector of three numbers, then the eval_df DataFrame would look something like:

  2. builtin_metrics: a dictionary containing the built-in metrics

    E.g. For a regressor model, builtin_metrics would look something like:

    {
       "example_count": 4128,
       "max_error": 3.815,
       "mean_absolute_error": 0.526,
       "mean_absolute_percentage_error": 0.311,
       "mean": 2.064,
       "mean_squared_error": 0.518,
       "r2_score": 0.61,
       "root_mean_squared_error": 0.72,
       "sum_on_label": 8520.4
    }
    
  3. (Optional) artifacts_dir: path to a temporary directory that can be used by the custom metric function to temporarily store produced artifacts before logging to MLflow.

    E.g. Note that this will look different depending on the specific environment setup. For example, on macOS it might look something like this:

    /var/folders/5d/lcq9fgm918l8mg8vlbcq4d0c0000gp/T/tmpizijtnvo
    

    If file artifacts are stored elsewhere than artifacts_dir, ensure that they persist until after the complete execution of mlflow.evaluate.

Return Value Requirements

The function should return a dictionary representing the produced metrics and can optionally return a second dictionary representing the produced artifacts. For both dictionaries, the key for each entry represents the name of the corresponding metric or artifact.

While each metric must be a scalar, there are various ways to define artifacts:

  • The path to an artifact file
  • The string representation of a JSON object
  • A pandas DataFrame
  • A numpy array
  • A matplotlib figure
  • Any other object, which will be pickled with the default protocol

Refer to the documentation of mlflow.evaluate for more in-depth definition details.

Example

Let’s walk through a concrete example that uses custom metrics. For this, we’ll create a toy model from the California Housing dataset.

from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
import mlflow
import os

Then, we set up our dataset and model

# loading the California housing dataset
cali_housing = fetch_california_housing(as_frame=True)

# split the dataset into train and test partitions
X_train, X_test, y_train, y_test = train_test_split(
   cali_housing.data, cali_housing.target, test_size=0.2, random_state=123
)

# train the model
lin_reg = LinearRegression().fit(X_train, y_train)

# creating the evaluation dataframe
eval_data = X_test.copy()
eval_data["target"] = y_test

Here comes the exciting part: defining our custom metrics function!

def example_custom_metric_fn(eval_df, builtin_metrics, artifacts_dir):
   """
   This example custom metric function creates a metric based on the ``prediction`` and
   ``target`` columns in ``eval_df`` and a metric derived from existing metrics in
   ``builtin_metrics``. It also generates and saves a scatter plot to ``artifacts_dir`` that
   visualizes the relationship between the predictions and targets for the given model to a
   file as an image artifact.
   """
   metrics = {
       "squared_diff_plus_one": np.sum(np.abs(eval_df["prediction"] - eval_df["target"] + 1) ** 2),
       "sum_on_label_divided_by_two": builtin_metrics["sum_on_label"] / 2,
   }
   plt.scatter(eval_df["prediction"], eval_df["target"])
   plt.xlabel("Predictions")
   plt.ylabel("Targets")
   plt.title("Targets vs. Predictions")
   plot_path = os.path.join(artifacts_dir, "example_scatter_plot.png")
   plt.savefig(plot_path)
   artifacts = {"example_scatter_plot_artifact": plot_path}
   return metrics, artifacts

Finally, to tie all of these together, we’ll start an MLflow run and call mlflow.evaluate:

with mlflow.start_run() as run:
   mlflow.sklearn.log_model(lin_reg, "model")
   model_uri = mlflow.get_artifact_uri("model")
   result = mlflow.evaluate(
       model=model_uri,
       data=eval_data,
       targets="target",
       model_type="regressor",
       dataset_name="cali_housing",
       custom_metrics=[example_custom_metric_fn],
   )

Logged custom metrics and artifacts can be found alongside the default metrics and artifacts. The red boxed regions show the logged custom metrics and artifacts on the run page.

Example MLflow scatter plot displaying target vs. predicted model results.

Accessing Evaluation Results Programmatically

So far, we have explored evaluation results for both built-in and custom metrics in the MLflow UI. However, we can also access them programmatically through the EvaluationResult object returned by mlflow.evaluate. Let’s continue our custom metrics example above and see how we can access its evaluation results programmatically. (Assuming result is our EvaluationResult instance from here on).

We can access the set of computed metrics through the result.metrics dictionary containing both the name and scalar values of the metrics. The content of result.metrics should look something like this:

{
   'example_count': 4128,
   'max_error': 3.8147801844098375,
   'mean_absolute_error': 0.5255457157103748,
   'mean_absolute_percentage_error': 0.3109520331276797,
   'mean_on_label': 2.064041664244185,
   'mean_squared_error': 0.5180228655178677,
   'r2_score': 0.6104546894797874,
   'root_mean_squared_error': 0.7197380534040615,
   'squared_diff_plus_one': 6291.3320597821585,
   'sum_on_label': 8520.363989999996,
   'sum_on_label_divided_by_two': 4260.181994999998
}

Similarly, the set of artifacts is accessible through the result.artifacts dictionary. The value of each entry is an EvaluationArtifact object. result.artifacts should look something like this:

{
   'example_scatter_plot_artifact': ImageEvaluationArtifact(uri='some_uri/example_scatter_plot_artifact_on_data_cali_housing.png'),
   'shap_beeswarm_plot': ImageEvaluationArtifact(uri='some_uri/shap_beeswarm_plot_on_data_cali_housing.png'),
   'shap_feature_importance_plot': ImageEvaluationArtifact(uri='some_uri/shap_feature_importance_plot_on_data_cali_housing.png'),
   'shap_summary_plot': ImageEvaluationArtifact(uri='some_uri/shap_summary_plot_on_data_cali_housing.png')
}
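
As a small sketch of working with these objects (using the artifact key from the custom metrics example above), the storage location of a logged artifact can be read off its uri attribute:

# Look up a logged artifact programmatically (sketch); each EvaluationArtifact
# exposes where the underlying file was stored via its uri attribute
scatter_artifact = result.artifacts["example_scatter_plot_artifact"]
print(scatter_artifact.uri)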

Example Notebooks

Underneath the Hood

The diagram below illustrates how this all works under the hood:

MLflow Model Evaluation under the hood

Conclusion

In this blog post, we covered:

  • The significance of model evaluation and what’s currently supported in MLflow.
  • Why having an easy way for MLflow users to incorporate custom metrics into their MLflow models is important.
  • How to evaluate models with default metrics.
  • How to evaluate models with custom metrics.
  • How MLflow handles model evaluation behind the scenes.

--

Try Databricks for free. Get started today.

The post Model Evaluation in MLflow appeared first on Databricks.

Announcing Gated Public Preview of Unity Catalog on AWS and Azure

At the Data and AI Summit 2021, we announced Unity Catalog, a unified governance solution for data and AI, natively built-into the Databricks Lakehouse Platform. Today, we are excited to announce the gated public preview of Unity Catalog for AWS and Azure.

In this blog, we will summarize our vision behind Unity Catalog, some of the key data governance features available with this release, and provide an overview of our coming roadmap.

Why Unity Catalog for data and AI governance?

Key challenges with data and AI governance

Diversity of data and AI assets

The increased use of data and the added complexity of the data landscape have made it difficult for organizations to manage and govern all types of data-related assets. Modern data assets are not just files or tables; they take many forms, including dashboards, machine learning models, and unstructured data like video and images that legacy data governance solutions simply weren’t built to govern and manage.

Two disparate and incompatible data platforms

Organizations today use two different platforms for their data analytics and AI efforts – data warehouses for BI and data lakes for big data and AI. This results in data replication across two platforms, presenting a major governance challenge: it becomes difficult to create a unified view of the data landscape, see where data is stored and who has access to it, and consistently define and enforce data access policies across two platforms with different governance models.

Data warehouses offer fine-grained access controls on tables, rows, columns, and views on structured data, but they don’t provide the agility and flexibility required for ML/AI or data streaming use cases. In contrast, data lakes hold raw data in its native format, providing data teams the flexibility to perform ML/AI. However, existing data lake governance solutions don’t offer fine-grained access controls, supporting only permissions for files and directories. Data lake governance also lacks the ability to discover and share data – making it difficult to discover data for analytics or machine learning.

Two disparate and incompatible data platforms

Rising multi-cloud adoption

More and more organizations are now leveraging a multi-cloud strategy for optimizing cost, avoiding vendor lock-in, and meeting compliance and privacy regulations. With nonstandard cloud-specific governance models, data governance across clouds is complex and requires familiarity with cloud-specific security and governance concepts such as Identity and Access Management (IAM).

Disjointed tools for data governance on the Lakehouse

Today, data teams have to manage a myriad of fragmented tools/services for their data governance requirements such as data discovery, cataloging, auditing, sharing, access controls etc. This inevitably leads to operational inefficiencies and poor performance due to multiple integration points and network latency between the services.

Our vision for a governed Lakehouse

Our vision behind Unity Catalog is to unify governance for all data and AI assets including dashboards, notebooks, and machine learning models in the lakehouse, with a common governance model across clouds, providing much better native performance and security. With automated data lineage, Unity Catalog provides end-to-end visibility into how data flows in your organization from source to consumption, enabling data teams to quickly identify and diagnose the impact of data changes across their data estate. Detailed audit reports on how data is accessed and by whom support data compliance and security requirements. With rich data discovery, data teams can quickly discover and reference data for BI, analytics and ML workloads, accelerating time to value.

Unity Catalog also natively supports Delta Sharing, the world’s first open protocol for data sharing, enabling seamless data sharing across organizations while preserving data security and privacy.

Finally, Unity Catalog also offers rich integrations across the modern data stack, providing the flexibility and interoperability to leverage tools of your choice for your data and AI governance needs.

Key features of Unity Catalog available with this release

Centralized Metadata Management and User Management

Without Unity Catalog, each Databricks workspace connects to a Hive metastore, and maintains a separate service for Table Access Controls (TACL). This requires metadata such as views, table definitions, and ACLs to be manually synchronized across workspaces, leading to issues with consistency on data and access controls.

Unity Catalog introduces a common layer for cross workspace metadata, stored at the account level in order to ease collaboration by allowing different workspaces to access Unity Catalog metadata through a common interface. Further, the data permissions in Unity Catalog are applied to account-level identities, rather than identities that are local to a workspace, enabling a consistent view of users and groups across all workspaces.

Create a single source of truth for your data estate with Unity Catalog

Unity Catalog also enables consistent data access and policy enforcement on workloads developed in any language – Python, SQL, R, and Scala.

Three-level namespace in SQL

Unity Catalog also introduces three-level namespaces to organize data in Databricks. You can define one or more catalogs, which contain schemas, which in turn contain tables and views. This gives data owners more flexibility to organize their data and lets them see their existing tables registered in Hive as one of the catalogs (hive_metastore), so they can use Unity Catalog alongside their existing data.

For example, you can still query your legacy Hive metastore directly:


SELECT * from hive_metastore.prod.customer_transactions

You can also distinguish between production data at the catalog level and grant permissions accordingly:


SELECT * from production.sales.customer_address

Or


SELECT * from staging.sales.customer_address

This gives you the flexibility to organize your data in the taxonomy you choose, across your entire enterprise and environment scopes. You can use a catalog as an environment scope, an organizational scope, or both.

Three-level namespaces are also now supported in the latest version of the Databricks JDBC Driver, which enables a wide range of BI and ETL tools to run on Databricks.

Unified Data Access on the Lakehouse

Unity Catalog offers a unified data access layer that provides a simple and streamlined way to define and connect to your data through managed tables, external tables, or files, as well as to manage access controls over them. Using External Locations and Storage Credentials, Unity Catalog can read and write data in your cloud tenant on behalf of your users.
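
As a rough sketch of what that setup can look like (SQL run here through spark.sql; the location name, bucket path, and the pre-existing storage credential finance_cred are illustrative):

# Define an External Location that Unity Catalog uses to access cloud storage
# on behalf of users (sketch; assumes a storage credential named finance_cred
# has already been created, and the bucket path is illustrative)
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS finance
  URL 's3://depts/finance'
  WITH (STORAGE CREDENTIAL finance_cred)
""")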

Unity Catalog enables fine-grained access control for managed tables, external tables, and files

Centralized Access Controls

Unity Catalog centralizes access controls for files, tables, and views. It leverages dynamic views for fine grained access controls so that you can restrict access to rows and columns to the users and groups who are authorized to query them.

Centrally grant access permissions
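
For instance, a central grant on a table governed by Unity Catalog might look like the following sketch (the catalog, schema, table, and group names are illustrative):

# Grant table-level permissions centrally (sketch; names are illustrative)
spark.sql("GRANT SELECT ON TABLE production.sales.customer_address TO `data_analysts`")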

Access Control on Tables and Views

Unity Catalog’s current support for fine-grained access control includes column-level permissions, row filtering, and data masking through the use of Dynamic Views.

A Dynamic View is a view that lets you conditionally display or mask data depending on the user or the user’s group membership.

For example, the following view only allows the admin@example.com user to view the email column.


CREATE VIEW sales_redacted AS
SELECT
 user_id,
 CASE WHEN
  current_user() = 'admin@example.com' THEN email
  ELSE 'REDACTED'
 END AS email,
 country,
 product,
 total
FROM sales_raw
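
Row filtering follows the same pattern. As a sketch (the view, table, and group names are illustrative), a dynamic view can restrict which rows non-privileged users see:

# A dynamic view that filters rows based on group membership (sketch; the
# view, table, and group names are illustrative). is_member() checks whether
# the current user belongs to the given group.
spark.sql("""
  CREATE VIEW sales_filtered AS
  SELECT *
  FROM sales_raw
  WHERE
    CASE
      WHEN is_member('managers') THEN TRUE
      ELSE country = 'US'
    END
""")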

Access Control on Files

External Locations control access to files which are not governed by an External Table. For example, in the examples above, we created an External Location at s3://depts/finance and an External Table at s3://depts/finance/forecast.

This means we can still provide access control on files within s3://depts/finance, excluding the forecast directory.

For example, consider the following:


GRANT READ_FILE ON EXTERNAL LOCATION finance TO finance_dataengs;

Open, simple, and secure data sharing with Delta Sharing

During the Data + AI Summit 2021, we announced Delta Sharing, the world’s first open protocol for secure data sharing. Delta Sharing is natively integrated with Unity Catalog, which enables customers to add fine-grained governance, and data security controls, making it easy and safe to share data internally or externally, across platforms or across clouds.

Delta Sharing allows customers to securely share live data across organizations, independent of the platform on which the data resides or is consumed. Organizations can simply share existing large-scale datasets based on the Apache Parquet and Delta Lake formats without replicating data to another system. Delta Sharing also empowers data teams with the flexibility to query, visualize, and enrich shared data with their tools of choice.

Delta Sharing Ecosystem

One of the new features available with this release is partition filtering, which allows data providers to share a subset of an organization’s data with different data recipients by adding a partition specification when adding a table to a share. We have also improved Delta Sharing management and introduced recipient token management options for metastore admins. Today, a metastore admin can create recipients using the CREATE RECIPIENT command, and an activation link is automatically generated for a data recipient to download a credential file, including a bearer token, for accessing the shared data. With the token management feature, metastore admins can now set an expiration date on the recipient bearer token and rotate the token if there is any security risk of it being exposed.
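
A rough sketch of those operations (SQL run here through spark.sql; the share, table, partition value, and recipient names are illustrative, and the exact clauses are covered in the Delta Sharing documentation referenced below):

# Share only one partition of a table and create a recipient (sketch; all
# names and the partition value are illustrative)
spark.sql("CREATE SHARE IF NOT EXISTS retail_share")
spark.sql("""
  ALTER SHARE retail_share
  ADD TABLE sales.transactions
  PARTITION (region = 'EMEA')
""")
spark.sql("CREATE RECIPIENT IF NOT EXISTS emea_partner")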

To learn more about Delta Sharing on Databricks, please visit the Delta Sharing documentation [AWS and Azure].

Centralized Data Access Auditing

Unity Catalog also provides centralized fine-grained auditing by capturing an audit log of actions performed against the data. This provides fine-grained details about who accessed a given dataset and helps you meet your compliance and business requirements.

What’s coming next

This is just the beginning, and there is an exciting slate of new features coming soon as we work towards realizing our vision for unified governance on the lakehouse. Below you can find a quick summary of what we are working on next:

End-to-end Data lineage
Unity Catalog will automatically capture runtime data lineage, down to column and row level, providing data teams an end-to-end view of how data flows in the lakehouse, for data compliance requirements and quick impact analysis of data changes.

End-to-end data lineage

Deeper Integrations with enterprise data catalogs and governance solutions
We are working with our data catalog and governance partners to empower our customers to use Unity Catalog in conjunction with their existing catalogs and governance solutions.

Data discovery and search
With built-in data search and discovery, data teams can quickly search and reference relevant data sets, boosting productivity and accelerating time to insights.

Governance and sharing of machine learning models/dashboards
We are also expanding governance to other data assets such as machine learning models and dashboards, providing data teams a single pane of glass for managing, governing, and sharing different data asset types.

Fine-grained governance with Attribute Based Access Controls (ABACs)
We are also adding a powerful tagging feature that lets you control access to multiple data items at once based on user and data attributes, further simplifying governance at scale. For example, you will be able to tag multiple columns as PII and manage access to all columns tagged as PII in a single rule.

Unity Catalog on Google Cloud Platform (GCP)
Unity Catalog support for GCP is also coming soon.

Getting Started with Unity Catalog on AWS and Azure

Unity Catalog is currently in gated public preview on AWS and Azure and is available to customers upon request. Existing Databricks customers can request access to Unity Catalog by contacting their Databricks account executives or by requesting access here. Visit the Unity Catalog documentation [AWS, Azure] to learn more.

--

Try Databricks for free. Get started today.

The post Announcing Gated Public Preview of Unity Catalog on AWS and Azure appeared first on Databricks.

Introducing Lakehouse for Media & Entertainment

There are few industries that have been disrupted more by the digital age than media & entertainment. For decades, media organizations acted as wholesalers for content, which was a vehicle monetized mostly through advertising – with very little focus on the consumer side. Beyond the advent of cable in the 1980s, broadcasting, outdoor, publishing and entertainment saw very little change over a long period of time. Then came digital.

The rise of FANG companies has heightened consumer expectations around smarter, personalized experiences, making data and AI table stakes for success. Brands have shifted their ad budgets to digital channels such as connected TV, mobile and search advertising to more definitively target their ad spend, while also driving compliance with increasing privacy regulations.

Driving better AI outcomes for consumers, advertisers and employees is now a board-level initiative for most media and entertainment companies. The problem? Traditional data architectures weren’t built to support AI/ML use cases, especially across broad teams of data engineers, data scientists and analysts, or to deliver the scale and agility media companies need to meet evolving customer demands. This has led to heavy investments in more modern data technologies and industry partnerships that help organizations use data more thoughtfully to shape the entire consumer, advertising and content lifecycle. This is achieved by:

  • Having a single view of all data in a single architecture, including unstructured data like video, images and voice content.
  • Ensuring data is in a ready state for all analytics and AI/ML use cases.
  • Having a cloud infrastructure environment based on open source and open standards so IT and data teams can move with agility.

In a nutshell, ensuring all of your data is AI and business intelligence (BI) ready and being able to move fast to stay ahead of consumer and employee expectations is a critical strategy for every media organization.

Introducing the Lakehouse for Media & Entertainment

Today, we are thrilled to announce the Lakehouse for Media & Entertainment (M&E), which enables organizations across the media ecosystem to deliver better outcomes for consumers, advertisers, partners and employees with the power of data and AI. By eliminating the technical limitations of legacy systems, the Lakehouse for M&E empowers organizations to leverage all of their data to build a holistic view of consumers and advertisers, make real-time decisions and drive innovation in engagement and advertising outcomes with advanced analytics.

So, why is Lakehouse for M&E critical for success? Through purpose-built capabilities, such as solution accelerators, libraries for common use cases and a certified ecosystem of partners, the platform brings together learnings from industry innovators to foster collaboration and accelerate the analytics and AI use cases that make it possible to personalize, monetize and innovate across the consumer and content lifecycle. Here are the biggest challenges around transforming into a data-driven M&E organization (and how Lakehouse addresses them):

Creating a unified audience profile

Audience data has traditionally been captured, stored and managed directly in disparate systems (e.g., DMP, ESP, data lake, data warehouse), depending on size/granularity, intended use case(s), and data types. This siloed approach is incredibly complex, especially when it comes to managing customer data as an asset that can be used to support a variety of use cases (e.g., content recommendations, next best offer).

How Lakehouse Helps: Lakehouse supports the use of all data types (structured, unstructured and semi-structured) with Delta Lake and Apache Spark™ at the foundation and data stored in an open-source format that prevents vendor lock-in. Additionally, Databricks provides technical assets in the form of notebooks, deployment guides and reference architectures to help customers stand up new use cases in days to weeks – not months – specifically aligned to helping organizations build and maintain their audience profiles. And as data sharing becomes critical to every media organization, Delta Sharing provides an open-source sharing capability that promotes data collaboration.

Delivering a 1:1 user experience

A byproduct of media consumers having more choice than ever before is that delivering a flawless customer experience is now merely table stakes. At the same time, doing so requires being able to identify quality-of-service issues in near real-time, a capability that is not directly supported by the existing tech stack at many companies. Legacy data warehouses cannot support data processing at B2C scale, nor are they the right place to handle streaming ML workloads for real-time consumer lifecycle use cases.

How Lakehouse Helps: The Lakehouse for Media & Entertainment overcomes these challenges with a scalable platform built in the cloud with:

  • Lightning-fast performance at B2C scale. With Spark and Delta Lake – the de facto enterprise standards for driving more performance and reliability for data at massive scale – under the hood, the Lakehouse delivers massive scale and speed. And because it’s optimized with performance features like indexing and caching, Databricks customers have seen ETL workloads execute up to 50% faster.
  • Elastic cloud scale. Built in the cloud, the Databricks Lakehouse provides scalable resources at the click of a button to meet the demands of any sized job. Autoscaling compute clusters scale up or down based on the size of your workload so you only use as much processing power as needed to meet the demands of your workloads.

Moving beyond aggregation to advanced analytics

Prior to using analytical techniques, such as media mix modeling for spend optimization or survival analysis for churn mitigation, a big lift is often needed to acquire and harmonize data at scale. In some cases, this work requires a capital investment and cross-team coordination.

The Lakehouse for M&E combines your consumer, content, advertiser and operational data with a full suite of capabilities to deliver on all of your analytics and AI use cases.

  • Ability to Handle All Data: Lakehouse has an end-to-end environment for unstructured data workflows – a query engine built around Delta Lake, fast annotation tools, and a powerful ML compute environment. This allows users to unlock the value of unstructured data, an impossibility for most data warehousing solutions.
  • Collaborative data science: The Lakehouse provides an interactive notebook environment that enables cross-functional teams to collaborate on data products with a wide range of analytics and ML capabilities, including support for multiple languages (R, Python, SQL and Scala) and popular ML libraries.
  • Easily manage the ML lifecycle: Manage the complete ML lifecycle from model development through deployment with managed MLflow. Centralize models and features in the registry so teams can easily collaborate on highly iterative data science projects and reuse existing work.

Driving value with the Lakehouse for Media & Entertainment

The Lakehouse for M&E builds off learnings from industry innovators to foster collaboration and provide the ability to personalize, monetize and innovate around the consumer and content lifecycle.

Pre-built solution accelerators for media & entertainment

Built on top of Lakehouse for M&E, Databricks and our ecosystem of partners offer packaged solution accelerators to help organizations tackle the most common and high-value use cases in the industry. Popular accelerators include:

  • Multi-Touch Attribution: Measure ad effectiveness and optimize marketing spend with better channel attribution
  • Gamer/User Toxicity: Foster healthier user communities with real-time detection of toxic language and behavior
  • Behavioral Segmentation: Create advanced segments to drive better purchasing predictions based on behaviors
  • Recommendation Engines: Increase conversions and engagement with personalized omnichannel recommendations
  • Video Quality of Experience: Analyze batch and streaming data to ensure a performant content experience for streaming services

A Growing partner ecosystem

Databricks and AWS: Databricks is working with industry-leading cloud, consulting and technology partners to enable best-in-class solutions. We have a long-standing relationship with AWS, helping customers across the media industry ecosystem deliver real-time audience experiences, improve advertiser outcomes and derive more value from their digital media assets. Databricks and AWS have hundreds of joint Lakehouse customers, including Sega, which is delivering the next generation of 1:1 gamer experiences at scale; Discovery, which is focused on frictionless, smarter experiences for viewers around the globe; and Acxiom, which is helping its customers collect and activate personalization anywhere, anytime and on any channel.

Databricks M&E Implementation Partners: Databricks has also partnered with system integrators to deliver scalable industry solutions that help customers more rapidly address common use cases:

  • Cognizant has jointly built a streaming quality of experience solution that enables customers to mitigate video quality issues that drive viewers to churn. Cognizant’s solution pairs fine-grained telemetry data with AI/ML to quickly identify and remedy video quality issues in near real-time.
  • We have partnered with Lovelytics on a sports and entertainment analytics solution that brings streaming data to life. With AI and predictive analytics to predict and forecast performance, the Lovelytics solution enables sports and entertainment organizations to optimize strategy in-game, as well as the fan and live event experience.

Databricks M&E Technology Partners: Our technology partners are critical to success and augment Databricks with industry-specific capability.

  • Labelbox – a leading training data platform for machine learning and the first company Databricks invested in as part of Databricks Ventures – helps media organizations label and derive actionable insights from their unstructured video and image files, which has historically been a massive challenge for media organizations.
  • As a data integration platform, Fivetran helps our media customers connect to the dozens of ad and mar tech data sources in their organization so they can better understand and activate the data coming from various sources in their media ecosystem.

Want to learn more about Lakehouse for Media & Entertainment? Click here for our solutions page. We could not be more excited to launch the Lakehouse for Media & Entertainment as we seek to help media leaders put data, AI and analytics at the very center of their organizations.

--

Try Databricks for free. Get started today.

The post Introducing Lakehouse for Media & Entertainment appeared first on Databricks.

Simplifying Change Data Capture With Databricks Delta Live Tables

This guide will demonstrate how you can leverage Change Data Capture in Delta Live Tables pipelines to identify new records and capture changes made to the dataset in your data lake. Delta Live Tables pipelines enable you to develop scalable, reliable and low latency data pipelines, while performing Change Data Capture in your data lake with minimum required computation resources and seamless out-of-order data handling.

Note: We recommend following the Getting Started with Delta Live Tables which explains creating scalable and reliable pipelines using Delta Live Tables (DLT) and its declarative ETL definitions.

Background on Change Data Capture

Change Data Capture (CDC) is a process that identifies and captures incremental changes (data deletes, inserts and updates) in databases, like tracking customer, order or product status for near-real-time data applications. CDC provides real-time data evolution by processing data in a continuous incremental fashion as new events occur.
Since over 80% of organizations plan on implementing multi-cloud strategies by 2025, choosing the right approach for your business that allows seamless real-time centralization of all data changes in your ETL pipeline across multiple environments is critical.

By capturing CDC events, Databricks users can re-materialize the source table as a Delta table in the Lakehouse and run their analysis on top of it, while being able to combine the data with external systems. The MERGE INTO command in Delta Lake on Databricks enables customers to efficiently upsert and delete records in their data lakes – you can check out our previous deep dive on the topic here. This is a common use case we observe: many Databricks customers leverage Delta Lake to perform it and keep their data lakes up to date with real-time business data.
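
As a quick sketch of that upsert pattern (the table and column names are illustrative; the linked deep dive covers the full syntax):

# Upsert a batch of CDC records into a Delta table with MERGE INTO (sketch;
# table and column names are illustrative)
spark.sql("""
  MERGE INTO customers AS target
  USING customer_updates AS source
  ON target.id = source.id
  WHEN MATCHED AND source.operation = 'DELETE' THEN DELETE
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED AND source.operation != 'DELETE' THEN INSERT *
""")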

While Delta Lake provides a complete solution for real-time CDC synchronization in a data lake, we are now excited to announce the Change Data Capture feature in Delta Live Tables that makes your architecture even simpler, more efficient and scalable. DLT allows users to ingest CDC data seamlessly using SQL and Python.

Earlier CDC solutions with Delta tables used the MERGE INTO operation, which requires manually ordering the data to avoid failures when multiple rows of the source dataset match while attempting to update the same rows of the target Delta table. To handle the out-of-order data, an extra step was required to preprocess the source table using a foreachBatch implementation to eliminate the possibility of multiple matches, retaining only the latest change for each key (see the Change data capture example). The new APPLY CHANGES INTO operation in DLT pipelines automatically and seamlessly handles out-of-order data without any need for manual intervention by data engineers.
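
For context, a minimal sketch of that manual pre-DLT pattern might look like the following (the table names, the id key, and the operation_date ordering column are illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

# Deduplicate each micro-batch so only the latest change per key survives,
# then merge it into the target table (sketch; names are illustrative)
def upsert_to_delta(micro_batch_df, batch_id):
    latest_changes = (
        micro_batch_df
        .withColumn("rn", F.row_number().over(
            Window.partitionBy("id").orderBy(F.col("operation_date").desc())))
        .filter("rn = 1")
        .drop("rn")
    )
    target = DeltaTable.forName(spark, "customers")
    (target.alias("t")
        .merge(latest_changes.alias("s"), "t.id = s.id")
        .whenMatchedDelete(condition="s.operation = 'DELETE'")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll(condition="s.operation != 'DELETE'")
        .execute())

# Wired into a streaming write, e.g.:
# cdc_stream.writeStream.foreachBatch(upsert_to_delta).start()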

CDC with Databricks Delta Live Tables

In this blog, we will demonstrate how to use the APPLY CHANGES INTO command in Delta Live Tables pipelines for a common CDC use case where the CDC data is coming from an external system. A variety of CDC tools are available such as Debezium, Fivetran, Qlik Replicate, Talend, and StreamSets. While specific implementations differ, these tools generally capture and record the history of data changes in logs; downstream applications consume these CDC logs. In our example data is landed in cloud object storage from a CDC tool such as Debezium, Fivetran, etc.

We have data from various CDC tools landing in cloud object storage or a message queue like Apache Kafka. Typically we see CDC used for ingestion into what we refer to as the medallion architecture. A medallion architecture is a data design pattern used to logically organize data in a Lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture. Delta Live Tables allows you to seamlessly apply changes from CDC feeds to tables in your Lakehouse; combining this functionality with the medallion architecture allows incremental changes to easily flow through analytical workloads at scale. Using CDC together with the medallion architecture provides multiple benefits to users, since only changed or added data needs to be processed. Thus, it enables users to cost-effectively keep gold tables up to date with the latest business data.

NOTE: The example here applies to both the SQL and Python versions of CDC, and to one specific way of using the operations; to evaluate variations, please see the official documentation here.

Prerequisites

To get the most out of this guide, you should have a basic familiarity with:

  • SQL or Python
  • Delta Live Tables
  • Developing ETL pipelines and/or working with Big Data systems
  • Databricks interactive notebooks and clusters
  • You must have access to a Databricks Workspace with permissions to create new clusters, run jobs, and save data to a location on external cloud object storage or DBFS.
  • For the pipeline we are creating in this blog, the “Advanced” product edition, which supports enforcement of data quality constraints, needs to be selected.

Enabling CDC in a Delta Live Table pipeline

To have the CDC feature available in a DLT pipeline, you must first enable it in the DLT pipeline settings by adding the configuration for it either at the time of pipeline creation or later when you edit the UI/JSON file under the pipeline settings. Below is an example of enabling CDC for the DLT pipeline: a. at pipeline creation, or b. when editing the settings of an existing pipeline. To create a DLT pipeline, go to Jobs & Workflows in the navigation UI, switch to the Delta Live Tables tab, then click the blue “Create Pipeline” button. Once a pipeline is created, you can access pipeline settings by selecting the created pipeline and choosing the “Settings” option. Settings are available and editable in both the UI and JSON.

To learn more about DLT settings, please see the documentation here.

a.
Adding Configuration to Enable CDC in a Delta Live Table pipeline

b.

"configuration": {
        "pipelines.applyChangesPreviewEnabled": "true"
    },

The Dataset

Here we are consuming realistic-looking CDC data from an external database. In this pipeline, we will use the Faker library to generate the dataset that a CDC tool like Debezium could produce and bring into cloud storage for the initial ingest in Databricks. Using Auto Loader, we incrementally load the messages from cloud object storage and store them in the Bronze table, which holds the raw messages. The Bronze tables are intended for data ingestion and enable quick access to a single source of truth. Next we perform APPLY CHANGES INTO from the cleaned Bronze layer table to propagate the updates downstream to the Silver table. As data flows to Silver tables, it generally becomes more refined and optimized (“just enough”) to provide an enterprise view of all the key business entities. See the diagram below.

A sample CDC flow with a CDC tool, autoloader and Delta Live Table Pipeline

This blog focuses on a simple example that uses a JSON message with four customer fields (name, email, address and id) along with two fields that describe the changed data: operation (which stores the operation code: DELETE, APPEND, UPDATE or CREATE) and operation_date (which stores the date and timestamp at which the record arrived for each operation).

To generate a sample dataset with the above fields, we are using Faker, a Python package that generates fake data. You can find the notebook related to this data generation section here. In this notebook we provide the name and storage location where the generated data is written. We are using the DBFS functionality of Databricks; see the DBFS documentation to learn more about how it works. Then, we use PySpark user-defined functions to generate synthetic values for each field and write the data back to the defined storage location, which we will refer to in other notebooks for accessing the synthetic dataset.
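
A minimal sketch of that generation step might look like the following (the row count and output path are illustrative and chosen to match the "source" setting used below; the precise implementation in the linked notebook may differ):

import random
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
from faker import Faker

# Generate a fake CDC feed with Faker and PySpark UDFs (sketch)
fake = Faker()
fake_firstname = F.udf(fake.first_name, StringType())
fake_lastname = F.udf(fake.last_name, StringType())
fake_email = F.udf(fake.email, StringType())
fake_address = F.udf(fake.address, StringType())
fake_date = F.udf(lambda: fake.date_time_this_year().strftime("%m-%d-%Y %H:%M:%S"), StringType())
fake_operation = F.udf(lambda: random.choice(["APPEND", "UPDATE", "DELETE", "CREATE"]), StringType())

df = (spark.range(10000)
      .withColumn("id", F.col("id").cast("string"))
      .withColumn("firstname", fake_firstname())
      .withColumn("lastname", fake_lastname())
      .withColumn("email", fake_email())
      .withColumn("address", fake_address())
      .withColumn("operation", fake_operation())
      .withColumn("operation_date", fake_date()))

# Write the synthetic CDC messages to the landing path read by the pipeline
df.write.format("json").mode("overwrite").save("/tmp/demo/cdc_raw/customers")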

Ingesting the raw dataset using Auto Loader

According to the medallion architecture paradigm, the bronze layer holds the rawest, least refined data. At this stage we can incrementally read new data using Auto Loader from a location in cloud storage. Here we are adding the path to our generated dataset to the configuration section under pipeline settings, which allows us to load the source path as a variable. Our configuration under pipeline settings now looks like the below:

"configuration": {
      
        "pipelines.applyChangesPreviewEnabled": "true",
         "source": "/tmp/demo/cdc_raw"
    }

Then we load this configuration property in our notebooks.

Let’s take a look at the Bronze table we will ingest: a. in SQL, and b. using Python.

a. SQL

SET spark.source;
CREATE OR REFRESH STREAMING LIVE TABLE customer_bronze
(
address string,
email string,
id string,
firstname string,
lastname string,
operation string,
operation_date string,
_rescued_data string 
)
TBLPROPERTIES ("quality" = "bronze")
COMMENT "New customer data incrementally ingested from cloud object storage landing zone"
AS 
SELECT * 
FROM cloud_files("${source}/customers", "json", map("cloudFiles.inferColumnTypes", "true"));

b. Python

import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *

source = spark.conf.get("source")

@dlt.table(name="customer_bronze",
                  comment = "New customer data incrementally ingested from cloud object storage landing zone",
  table_properties={
    "quality": "bronze"
  }
)
def customer_bronze():
  return (
    spark.readStream.format("cloudFiles") \
      .option("cloudFiles.format", "json") \
      .option("cloudFiles.inferColumnTypes", "true") \
      .load(f"{source}/customers")
  )

The above statements use Auto Loader to create a Streaming Live Table called customer_bronze from JSON files. When using Auto Loader in Delta Live Tables, you do not need to provide any location for the schema or checkpoint, as those locations will be managed automatically by your DLT pipeline.

Auto Loader provides a Structured Streaming source called cloud_files in SQL and cloudFiles in Python, which takes a cloud storage path and format as parameters.
To reduce compute costs, we recommend running the DLT pipeline in Triggered mode as a micro-batch, assuming you do not have very low latency requirements.

Expectations and high-quality data

In the next step, to create a high-quality, diverse, and accessible dataset, we impose quality-check expectation criteria using constraints. Currently, a constraint can retain invalid records, drop them, or fail the update; for more detail see here. All constraints are logged to enable streamlined quality monitoring.

a. SQL

CREATE OR REFRESH TEMPORARY STREAMING LIVE TABLE customer_bronze_clean_v(
  CONSTRAINT valid_id EXPECT (id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_address EXPECT (address IS NOT NULL),
  CONSTRAINT valid_operation EXPECT (operation IS NOT NULL) ON VIOLATION DROP ROW
)
TBLPROPERTIES ("quality" = "silver")
COMMENT "Cleansed bronze customer view (i.e. what will become Silver)"
AS SELECT * 
FROM STREAM(LIVE.customer_bronze);

b. Python

@dlt.view(name="customer_bronze_clean_v",
  comment="Cleansed bronze customer view (i.e. what will become Silver)")

@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
@dlt.expect("valid_address", "address IS NOT NULL")
@dlt.expect_or_drop("valid_operation", "operation IS NOT NULL")

def customer_bronze_clean_v():
  return dlt.read_stream("customer_bronze") \
            .select("address", "email", "id", "firstname", "lastname", "operation", "operation_date", "_rescued_data")

Using APPLY CHANGES INTO statement to propagate changes to downstream target table

Prior to executing the Apply Changes Into query, we must ensure that a target streaming table that will hold the most up-to-date data exists. If it does not exist, we need to create one. The cells below are examples of creating a target streaming table. Note that, at the time of publishing this blog, the target streaming table creation statement is required along with the Apply Changes Into query; both need to be present in the pipeline, otherwise your table creation query will fail.

a. SQL

CREATE OR REFRESH STREAMING LIVE TABLE customer_silver
TBLPROPERTIES ("quality" = "silver")
COMMENT "Clean, merged customers";

b. Python

dlt.create_target_table(name="customer_silver",
  comment="Clean, merged customers",
  table_properties={
    "quality": "silver"
  }
)

Now that we have a target streaming table, we can propagate changes to the downstream target table using the Apply Changes Into query. While CDC feed comes with INSERT, UPDATE and DELETE events, DLT default behavior is to apply INSERT and UPDATE events from any record in the source dataset matching on primary keys, and sequenced by a field which identifies the order of events. More specifically it updates any row in the existing target table that matches the primary key(s) or inserts a new row when a matching record does not exist in the target streaming table. We can use APPLY AS DELETE WHEN in SQL, or its equivalent apply_as_deletes argument in Python to handle DELETE events.

In this example we used “id” as our primary key, which uniquely identifies the customers and allows CDC events to apply to those identified customer records in the target streaming table. Since “operation_date” keeps the logical order of CDC events in the source dataset, we use “SEQUENCE BY operation_date” in SQL, or its equivalent “sequence_by = col(“operation_date”)” in Python, to handle change events that arrive out of order. Keep in mind that the field value we use with SEQUENCE BY (or sequence_by) should be unique among all updates to the same key. In most cases, the sequence by column will be a column with timestamp information.

Finally, we used “COLUMNS * EXCEPT (operation, operation_date, _rescued_data)” in SQL, or its equivalent “except_column_list” = [“operation”, “operation_date”, “_rescued_data”] in Python, to exclude the three columns “operation”, “operation_date”, and “_rescued_data” from the target streaming table. By default, all columns are included in the target streaming table when the “COLUMNS” clause is not specified.

a. SQL

APPLY CHANGES INTO LIVE.customer_silver
FROM stream(LIVE.customer_bronze_clean_v)
  KEYS (id)
  APPLY AS DELETE WHEN operation = "DELETE"
  SEQUENCE BY operation_date 
  COLUMNS * EXCEPT (operation, operation_date, _rescued_data);

b. Python

dlt.apply_changes(
  target = "customer_silver", 
  source = "customer_bronze_clean_v", 
  keys = ["id"], 
  sequence_by = col("operation_date"), 
  apply_as_deletes = expr("operation = 'DELETE'"), 
  except_column_list = ["operation", "operation_date", "_rescued_data"])

To check out the full list of available clauses see here.
Please note that, at the time of publishing this blog, a table that reads from the target of an APPLY CHANGES INTO query or apply_changes function must be a live table, and cannot be a streaming live table.
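
As a minimal illustration of this constraint, a downstream table built on top of the customer_silver target would be declared as a (non-streaming) live table and read with dlt.read rather than dlt.read_stream. The table name “customer_silver_agg” below is hypothetical and not part of the pipeline in this blog:


import dlt
from pyspark.sql.functions import count

# Hypothetical downstream aggregate on top of the customer_silver target.
# It must be a live table (dlt.read), not a streaming live table (dlt.read_stream),
# because it reads from the target of an APPLY CHANGES INTO query.
@dlt.table(name="customer_silver_agg",
  comment="Example count of customers per address built on customer_silver")
def customer_silver_agg():
  return dlt.read("customer_silver").groupBy("address").agg(count("id").alias("customer_count"))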

SQL and Python notebooks are available for reference for this section. Now that we have all the cells ready, let’s create a Pipeline to ingest data from cloud object storage. Open Jobs in a new tab or window in your workspace, and select “Delta Live Tables”.

  1. Select “Create Pipeline” to create a new pipeline
  2. Specify a name such as “Retail CDC Pipeline”
  3. Specify the Notebook Paths that you already created earlier, one for the generated dataset using Faker package, and another path for the ingestion of the generated data in DLT. The second notebook path can refer to the notebook written in SQL, or Python depending on your language of choice.
  4. For the scope of this blog, add pipelines.applyChangesPreviewEnabled as a key in the configuration and set its value to true. Please note you can safely skip this step once this feature reaches General Availability (GA).
  5. To access the data generated in the first notebook, add the dataset path in configuration. Here we stored data in “/tmp/demo/cdc_raw/customers”, so we set “source” to “/tmp/demo/cdc_raw/” to reference “source/customers” in our second notebook.
  6. Specify the Target (which is optional and referring to the target database), where you can query the resulting tables from your pipeline.
  7. Specify the Storage Location in your object storage (which is optional), to access your DLT produced datasets and metadata logs for your pipeline.
  8. Set Pipeline Mode to Triggered. In Triggered mode, DLT pipeline will consume new data in the source all at once, and once the processing is done it will terminate the compute resource automatically. You can toggle between Triggered and Continuous modes when editing your pipeline settings. Setting “continuous”: false in the JSON is equivalent to setting the pipeline to Triggered mode.
  9. For this workload you can disable autoscaling under Autopilot Options and use a cluster with a single worker. For production workloads, we recommend enabling autoscaling and setting the maximum number of workers needed for the cluster.
  10. The pipeline associated with this blog has the following DLT pipeline settings:

    {
         "clusters": [
            {
                "label": "default",
                "num_workers": 1
            }
        ],
        "development": true,
        "continuous": false,
        "edition": "advanced",
        "photon": false,
        "libraries": [
            {
                "notebook": {
    "path":"/Repos/mojgan.mazouchi@databricks.com/Delta-Live-Tables/notebooks/1-CDC_DataGenerator"
                }
            },
            {
                "notebook": {
    "path":"/Repos/mojgan.mazouchi@databricks.com/Delta-Live-Tables/notebooks/2-Retail_DLT_CDC_sql"
                }
            }
        ],
        "name": "CDC_blog",
        "storage": "dbfs:/home/mydir/myDB/dlt_storage",
        "configuration": {
            "source": "/tmp/demo/cdc_raw",
            "pipelines.applyChangesPreviewEnabled": "true"
        },
        "target": "my_database"
    }
    
  11. Select “Start”
  12. Your pipeline is created and running now!

A Sample Delta Live Table pipeline propagating changes to downstream table

DLT Pipeline Lineage Observability, and Data Quality Monitoring

All DLT pipeline logs are stored in the pipeline’s storage location. You can specify your storage location only when you are creating your pipeline. Note that once the pipeline is created, you can no longer modify the storage location.

You can check out our previous deep dive on the topic here. Try this notebook to see pipeline observability and data quality monitoring on the example DLT pipeline associated with this blog.
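
As a minimal sketch (reusing the storage location from the pipeline settings shown earlier), the DLT event log can be read directly as a Delta table stored under the pipeline storage path at system/events; the exact fields inside the JSON details column may vary across DLT releases:


# Read the DLT event log from the pipeline's storage location (path reused from the settings above).
event_log = spark.read.format("delta").load("dbfs:/home/mydir/myDB/dlt_storage/system/events")
event_log.createOrReplaceTempView("dlt_event_log")

# Example: inspect the data quality (expectation) metrics emitted on flow progress events.
display(spark.sql("""
  SELECT timestamp, details:flow_progress.data_quality
  FROM dlt_event_log
  WHERE event_type = 'flow_progress'
"""))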

Conclusion

In this blog, we showed how we made it seamless for users to efficiently implement change data capture (CDC) into their Lakehouse platform with Delta Live Tables (DLT). DLT provides built-in quality controls with deep visibility into pipeline operations, observing pipeline lineage, monitoring schema, and quality checks at each step in the pipeline. DLT supports automatic error handling and best in class auto-scaling capability for streaming workloads, which enables users to have quality data with optimum resources required for their workload.

Data engineers can now easily implement CDC with a new declarative APPLY CHANGES INTO API with DLT in either SQL or Python. This new capability lets your ETL pipelines easily identify changes and apply those changes across tens of thousands of tables with low-latency support.

Ready to get started and try out CDC in Delta Live Tables for yourself?
Please watch this webinar to learn how Delta Live Tables simplifies the complexity of data transformation and ETL, and see our Change data capture with Delta Live Tables document, official github and follow the steps in this video to create your pipeline!

--

Try Databricks for free. Get started today.

The post Simplifying Change Data Capture With Databricks Delta Live Tables appeared first on Databricks.

Democratizing Data for Supply Chain Optimization


This is a guest authored post by Mrunal Saraiya, Sr. Director – Advance Technologies (Data, Intelligent Automation and Advanced Technology Incubation), Johnson & Johnson

 
As a cornerstone, global consumer goods and pharmaceutical provider, Johnson & Johnson serves businesses, patients, doctors, and people around the world, and has for more than 150 years. From life-sustaining medical devices to vaccines, over-the-counter and prescription medications (plus the tools and resources used to create them), we must ensure the availability of everything we bring to market—and guarantee consistency in the quality, preservation, and timely delivery of those various goods to our customers.

How we serve our community with these products and services is core to our business strategy, particularly when it comes to ensuring that items get delivered on time, to the right place, and are sold at a fair price, so that consumers can access and use our products effectively. And while logistical challenges have long existed across market supply chains, streamlining those pathways and optimizing inventory management and costs on a global scale is impossible without data—and lots of it. That data has to be accurate, too, which requires us to have tools in place to distill and interpret an incredibly complex array of information to make that data useful. With that data, we are able to use supply chain analytics to help optimize various aspects of the supply chain, from keeping shelves stocked across our retail partners and ensuring the vaccines that we provide are temperature-controlled and delivered on time to managing global category spend and identifying further costs improvement initiatives.

And the impact of these supply chain optimizations is significant. For example, within our Janssen global supply chain procurement organization, the inability to understand and control spend and pricing can ultimately lead to limited identification of future strategic decisions and initiatives that could further the effectiveness of global procurement. All said, if this problem is not solved, we could miss the opportunity to capture $6MM in upside.

Over the decades, we’ve grown extensively as an organization, both organically and through numerous acquisitions. Historically, our supply chain data was coming to our engineers through fragmented systems with disparate priorities and unique configurations—the result of multiple organizations coming under the Johnson & Johnson umbrella over time, and bringing their proprietary resources with them. Additionally, data was largely being extracted and analyzed manually. Opportunities for speed and scalability were extremely limited, and actionable insights were slow to surface. The disconnection was negatively impacting how we served our customers, and impeding our ability to make strategic decisions about what to do next.

Migrating from Hadoop to a unified approach with Databricks

With this in mind, we embarked on an important journey to democratize our data across the entire organization. The plan was to create a common data layer that would drive higher performance, allow for more versatility, improve decision making, bring scalability to engineering and supply chain operations, and make it easy to modify queries and insights efficiently in real-time. This led us to the Databricks Lakehouse Platform on the Azure cloud.

Ultimately, our goal was to bring all of Johnson & Johnson’s global data together, replacing 35+ global data sources that were created by our fragmented systems with a single view into data that could then be readily available for our data scientists, engineers, analysts, and of course, applications, to contextualize as needed. But rather than continuing to repeat the data activity from previous pipelines, we decided to create a single expression of the data—allowing the data itself to be the provider. From that common layer, insights can be drawn by various users for various use cases across the Johnson & Johnson sphere, enabling us to develop valuable applications that bring true value to the people and businesses we serve, gain deeper insights into data more quickly, synergize the supply chain management process across multiple sectors, and improve inventory management and predictability.

But transforming our data infrastructure into one that can handle an SLA of around 15 minutes for data delivery and accessibility required tackling a couple of immediate challenges. First and foremost was the problem of scalability; our existing Hadoop infrastructure was not able to meet those service-level agreements while supporting data analytics requirements. The effort to push that legacy system to deliver in time and at scale was cost and resource prohibitive. Secondly, we had critical strategic imperatives around supply chain strategy and data optimization that were going unmet due to a lack of scalability. Finally, limited flexibility and growth provided fewer use cases and reduced the accuracy of the data models.

With a Lakehouse approach to data management, we now have a common data layer to feed a myriad of data pipelines that can scale alongside business needs. This has produced significantly more use cases from which to extract actionable insights and value from our data. It has also given us the ability to go beyond a broad business context, allowing us to use predictive analytics to anticipate key trends and needs around optimizing logistics, understand our customers’ therapy needs, and how that impacts our supply chain with greater confidence and agility.

Delivering accurate healthcare solutions at scale

To guarantee the best outcome for our efforts, we engaged with Databricks to expand resources and deploy a highly-effective common data ingestion layer for data analytics and machine learning. With strategic and intentional collaboration between our engineering and AI functions, we were able to fully optimize the system to deliver much greater processing power.

Today, our data flow has been significantly streamlined. Data teams use Photon – the next-generation query engine on Databricks – to enable extremely fast query performance at a low cost for SQL workloads. All data pipelines now feed through Delta Lake, which helps to simplify data transformation in Photon. From there, Databricks SQL provides high-performance data warehousing capabilities, feeding data through optimized connectors to various applications and business intelligence (BI) tools (e.g., PowerBI, Qlik, Synapse, Teradata, Tableau, ThoughtSpot) for our analysts and scientists to consume in near real-time.

To date, we’ve achieved a 45-50% reduction in cost for data engineering workloads and dropped data delivery lag from around 24 hours to under ten minutes.

By being able to run our BI workloads directly off of Delta Lake using Databricks SQL, rather than our legacy data warehouse, we have an improved understanding of consumer and business needs. With these insights, we can focus our efforts on surfacing ongoing opportunities such as forecasting product demand to help our retail partners ensure the right levels of stock are available to their consumers; ensure cost-efficient distribution of drugs around the world, and more. Additionally, because we’re able to analyze directly from Delta Lake with Databricks SQL, we expect to further reduce time and money spent on the demand planning process.

One really interesting solution is our ability to now track patient therapy products throughout the supply chain. Within our cell therapy program, we can now track the therapy journey of our Apheresis patients across 14+ milestones and how 18+ JNJ global subsidiaries and vendor partners — from donor collection to manufacturing and to administration of the final product — are working in tandem to support these patients.

With the cloud-based Databricks Lakehouse Platform in place, we’ve greatly simplified our operational data infrastructure in the Azure cloud, enabling us to consistently meet our SLAs, reduce overall costs, and most importantly better serve our customers and community.

Looking ahead with Databricks

Data’s potential is limitless, and in the data science space, exploration is as much a priority as collection and analysis. As a direct result of our migration from a legacy Hadoop infrastructure to the Databricks Lakehouse Platform, we’ve streamlined our data pathways and removed barriers for users across the business. Today, we’re more able than ever before to transform our innovative explorations into real solutions for people everywhere.

--

Try Databricks for free. Get started today.

The post Democratizing Data for Supply Chain Optimization appeared first on Databricks.

Disaster Recovery Overview, Strategies, and Assessment


When deciding on a Disaster Recovery (DR) strategy that serves the entire firm for most applications and systems, an assessment of priorities, capabilities, limitations, and costs is necessary.

While it is tempting to expand the scope of this conversation to various technologies, vendors, cloud providers, and on-premises systems, we’ll only focus on DR involving Databricks workspaces. DR information specific to cloud infrastructure for AWS, Azure, and GCP is readily available at other sources.

In addition, how DR fits into Business Continuity (BC), and its relation to High Availability (HA), is out of scope for this series. Suffice it to say that solely leveraging HA services is not sufficient for a comprehensive DR solution.

This initial part of the series will focus on determining an appropriate DR strategy and implementation for critical use cases and/or workloads running on Databricks. We will discuss some general DR concepts, but we’d encourage readers to visit the Databricks documentation (AWS | Azure | GCP) for an initial overview of DR terminology, workflow, and high-level steps to implement a solution.

Disaster Recovery planning for a Databricks workspace

Defining Loss Limits

Determining an acceptable Recovery Point Objective (RPO), the maximum targeted period in which data might be lost, and Recovery Time Objective (RTO), the targeted duration of time and service level within which a business process must be restored, is a fundamental step toward implementing a DR strategy for a use case or workload running on Databricks. RTO and RPO should be decided within specific contexts, for example at the use case or workload level, and independently of each other. These define the loss limits during a disaster event, will inform the appropriate DR strategy, and determine how fast the DR implementation should recover from a disaster event.

RPO for Databricks Objects

Databricks objects should be managed with CI/CD and Infrastructure as Code (IaC) tooling, such as Terraform (TF), for replication to a DR site. Databricks Repos (AWS | Azure | GCP) provides git integration that facilitates pulling source code to a single or multiple workspaces from a configured git provider, for example, GitHub.

Databricks REST APIs (AWS | Azure | GCP) can be used to publish versioned objects to the DR site at the end of a CI/CD pipeline, however, this approach has two limitations. First, the REST APIs will not track the target workspace’s state, requiring additional effort to ensure all operations are safe and efficient. Second, an additional framework would be required to orchestrate and execute all of the needed API calls.

Terraform eliminates the need to track state manually by versioning it and then applying required changes to the target workspace, making any necessary changes on behalf of the user in a declarative fashion. A further advantage of TF is the Databricks Terraform Provider, which permits interaction with almost all Databricks and cloud resources needed for a DR solution. Finally, Databricks is an official partner of Hashicorp, and Terraform can support multi-cloud deployments. The Databricks Terraform Provider will be used to demonstrate a DR implementation as part of this series. DB-Sync is a command-line tool for the Databricks Terraform Provider that may be easier to use for managing replication for non-TF users.

The RPO for Databricks objects will be the time difference between the most recent snapshot of an object’s state and the disaster event. System RPO should be determined as the maximum RPO of all objects.

RPO for Databases and Tables

Multiple storage systems can be required for any given workload. Data Sources that Apache Spark accesses through JDBC, such as OLAP and OLTP systems, will have options available for DR offered through the cloud provider. These systems should be considered as in-scope for the DR solution, but will not be discussed in-depth here since each cloud provider has varying approaches to backups and replication. Rather, the focus will be on the logical databases and tables that are created on top of files in Object storage.

Each Databricks workspace uses the Databricks File System (DBFS), an abstraction on top of Object storage. The use of DBFS to store critical, production source code and data assets is not recommended. All users have read and write access to the object storage mounted to DBFS, with the exception of the DBFS root. Furthermore, DBFS Root does not support cloud-native replication services and relies solely on a combination of Delta DEEP CLONE, scheduled Spark Jobs, and the DBFS CLI to export data. Due to these limitations, anything that must be replicated to the DR site should not be stored in DBFS.

Object storage can be mounted to DBFS (AWS | Azure | GCP), which creates a pointer to the external storage. The mount prevents data from being synced locally to DBFS, but the mount will need to be updated as part of DR since the mount point will need to point to different Object storage in the DR site. Mounts can be difficult to manage and could potentially point to the wrong storage, which requires additional automation and validation as part of the DR solution. Accessing Object Storage directly through external tables reduces both complexity and points of failure for DR.

Using Apache Spark, a user can create managed and unmanaged tables (AWS | Azure | GCP). The metastore will manage data for managed tables, and the default storage location for managed tables is `/user/hive/warehouse` in DBFS. If managed tables are in use for a workload that requires DR, the data should be migrated off DBFS into a new database created with the location parameter specified, to avoid the default location.

An unmanaged table is created when the `LOCATION` parameter is specified during the `CREATE TABLE` statement. This will save the table’s data at the specified location, and it will not be deleted if the table is dropped from the metastore. The metastore can still be a required component in DR for unmanaged tables if the source code accesses tables that are defined in the metastore by using SparkSQL or the `table` method ( Python | Scala ) of the SparkSession.
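
As a minimal sketch (the database, table, and bucket names below are placeholders), creating a database with an explicit location and an unmanaged table directly over object storage looks like this:


# Placeholder names and paths for illustration only.
spark.sql("CREATE DATABASE IF NOT EXISTS dr_ready_db LOCATION 's3://my-bucket/dr_ready_db'")

# Unmanaged (external) table: specifying LOCATION keeps the data in object storage,
# so dropping the table from the metastore does not delete the underlying files.
spark.sql("""
  CREATE TABLE IF NOT EXISTS dr_ready_db.orders (order_id STRING, amount DOUBLE)
  USING DELTA
  LOCATION 's3://my-bucket/dr_ready_db/orders'
""")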

Directly using an object store, for example, Amazon S3, allows for the use of Geo-Redundant Storage if required and avoids the concerns associated with DBFS.

Data Replication

For DR, the recommended storage format is Delta Lake. Parquet is easily converted to Delta format with the `CONVERT TO DELTA` command. The ACID guarantees of Delta Lake virtually eliminate data corruption risks during failover to and failback from a DR site. Furthermore, a deep clone should be used to replicate all Delta tables. It provides an incremental update ability to avoid unnecessary data transfer costs and has additional built-in guarantees for data consistency that are not available with az- and region-based Geo-Redundant Replication (GRR) services. A further disadvantage of GRR is that the replication is one-way, creating the need for an additional process when failing back to the primary site from the DR site, whereas deep clones can work in both directions, primary site to the DR site and vice versa.

The diagram below demonstrates the initial shortcoming of using GRR with Delta Tables:

To use GRR and Delta Tables, an additional process would need to be created to certify a complete Delta Table Version (AWS | Azure | GCP) at the DR site.

Comparing the above graphic with using Delta DEEP CLONE to simplify replication for DR, a deep clone ensures that the latest version of the Delta table is replicated in its entirety, guarantees file order, and provides additional control over when the replication happens:

Delta DEEP CLONE will be used to demonstrate replicating Delta Tables from a primary to the secondary site as part of this blog series.
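
As a minimal sketch of that replication step (table names and bucket paths are placeholders), a Parquet table can first be converted to Delta and then deep cloned to the DR site; re-running the same DEEP CLONE statement later performs an incremental sync:


# One-time conversion of an existing Parquet dataset to Delta format (placeholder path).
spark.sql("CONVERT TO DELTA parquet.`s3://primary-bucket/events`")

# Replicate the table to the DR site. Re-running this statement later copies only
# new or changed files since the previous clone (incremental sync).
spark.sql("""
  CREATE OR REPLACE TABLE dr_site_db.events
  DEEP CLONE delta.`s3://primary-bucket/events`
  LOCATION 's3://dr-bucket/events'
""")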

Files that cannot be converted to Delta should rely on GRR. In this case, these files should be stored within a different location than Delta files to avoid conflicts from running both GRR and Delta DEEP CLONE. A process in the DR site will need to be put in place to guarantee complete dataset availability to the business when using GRR; however, this would not be needed when using deep clones.

For data in an object store, RPO will depend on when it was last replicated using Delta Deep Clone or the SLAs provided by the cloud provider in the case of using Geo-Redundant Storage.

Metastore Replication

Several metastore options are available for a Databricks deployment, and we’ll briefly consider each one in turn.

Beginning with the default metastore, a Databricks deployment has an internal Unity Catalog (AWS | Azure | GCP) or Hive (AWS | Azure | GCP) metastore accessible by all clusters and SQL Endpoints to persist table metadata. A custom script will be required to export tables and table ACLs from the internal Hive metastore. This couples RPO to the Databricks workspace, meaning that the RPO for the metadata required for managed tables will be the time difference in hours between when those internal metastore tables and table ACLs were last exported and the disaster event.
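
A minimal, illustrative sketch of such an export (table ACLs would be handled separately, e.g. via SHOW GRANTS) could iterate over the metastore and capture each table's DDL so it can be replayed against the DR site:


# Illustrative only: a production script would also export table ACLs, handle errors,
# and run on a schedule that matches the agreed RPO.
ddl_statements = []
for db_row in spark.sql("SHOW DATABASES").collect():
    db = db_row[0]
    for tbl_row in spark.sql(f"SHOW TABLES IN {db}").collect():
        ddl = spark.sql(f"SHOW CREATE TABLE {db}.{tbl_row.tableName}").first()[0]
        ddl_statements.append((db, tbl_row.tableName, ddl))

# Persist the export to replicated object storage (placeholder path).
(spark.createDataFrame(ddl_statements, ["database", "table", "ddl"])
   .write.mode("overwrite").format("delta").save("s3://dr-bucket/metastore_export"))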

A cluster can connect to an existing, external Apache Hive metastore (AWS | Azure | GCP). The external metastore allows for additional replication options by leveraging cloud provider services for the underlying database instance. For example, leveraging a multi-az database instance or a cloud-native database, such as Amazon Aurora, to replicate the metastore. This option is available for Databricks on AWS, Azure Databricks, and Databricks on GCP. RPO will depend on SLAs provided by the cloud provider services if there is no manual or automated export process.

The ability to use Glue Catalog as a metastore is unique to Databricks deployments on AWS. RPO for the Glue Catalog will depend on a replication utility and/or SLAs provided by AWS.

Recovery Time Objective (RTO) in Databricks

RTO will be measured from the time that the Databricks workspace in the primary site is unavailable to the time that the workspace in the DR site reaches a predetermined level of operation to support critical activities.

Generally speaking and assuming that data is already replicated, this will require source code and dependencies for the workload/use case to be available, a live cluster or SQL Endpoint, and the metastore if accessing data through a database and tables (as opposed to accessing files directly) before RTO can be achieved.

RTO will depend on the DR strategy and implementation that is selected for the workload or use case.

Disaster Recovery Strategies

Disaster Recovery strategies can be broadly broken down into two categories: active/passive and active/active. In an active/passive implementation, the primary site is used for normal operations and remains active. However, the DR (or secondary) site requires pre-planned steps, depending on a specific implementation, to be taken for it to be promoted to primary. Whereas in an active/active strategy, both sites remain fully operational at all times.

Active/Passive Strategies

  • Backup & Restore is largely considered the least efficient in terms of RTO. Backups are created on the primary site and copied to the DR site. For regional failover, infrastructure must be restored as well as performing recovery from data backups. In the case of an object store, data would still need to be replicated from the Primary site.

    In this scenario, all the necessary configurations for infrastructure and Databricks Objects would be available at the DR site but not provisioned. File-based tables may need to be backfilled and other data sources would need to be restored from the most recent backup. Given the nature of Big Data, RTO can be days.

  • In a Pilot Light implementation, data stores and databases are up-to-date based on the defined RPO for each, and they are ready to service workloads. Other infrastructure elements are defined, usually through an Infrastructure as Code (IaC) tool, but not deployed.

    The main difference between Pilot Light and Backup & Restore is the immediate availability of data, including the metastore if reads and writes are not path-based. Some infrastructure, generally that with little or no cost, is provisioned. For a Databricks workspace, this would mean a workspace and the required cloud infrastructure are deployed with required Databricks Objects provisioned (source code, configurations, instance profiles, service principals, etc.), but clusters and SQL Endpoints would not be created.

  • Warm Standby maintains live data stores and databases, in addition to a minimum live deployment. The DR site must be scaled up to handle all production workloads.

    Building off of Pilot Light, Warm Standby would have additional objects deployed. For example, specific clusters or SQL Endpoints may be created, but not started, to reduce RTO for serving data to downstream applications. This can also facilitate continuous testing which can increase confidence in the DR solution and the health of the DR site. Clusters and SQL Endpoints may be turned on periodically to prevent deletion and for testing or even kept turned on in extreme cases that have a very strict RTO.

Active/Active Strategy (Multi-Site)

For a multi-site DR implementation, both the primary and DR sites should be fully deployed and operational. In the event of a disaster event, new requests would simply need to be routed to the DR site instead of the primary. Multi-site offers the most efficient RTO but is also the most costly and complex.

The complexity arises from synchronizing data (i.e., tables and views), user profiles, cluster definitions, job definitions, source code, libraries, init scripts, and any other artifacts between the primary and DR workspaces. Further complicating matters are any hard-coded connection strings, Personal Access Tokens (PATs), and URIs for API calls in various scripts and code.

Determining the Correct DR Strategy

As analytical systems become more important, failures will cause a greater impact on businesses and become more costly, and potential failure points are constantly growing as environments become more complex and interrelated. As a result, performing impact analysis on use cases and/or workloads to determine if DR is necessary and ensuring teams are prepared for the implementation of such has become a critical activity.

There are tradeoffs between each listed strategy. Ultimately, RPO and RTO will inform which one should be selected for workloads and use cases. Then, the damages, financial and non-financial, of a disaster event should be weighed against the cost to implement and maintain any given DR strategy. Based on the estimated figures, a DR strategy should be selected and implemented to ensure the continuation of services in a timely manner following a disaster-caused disruption.

Get started

Determining if a DR solution is required for the applications using the Databricks Lakehouse Platform can prevent potential losses and maintain trust with consumers of the platform. The correct DR strategy and implementation ensure that costs do not exceed potential losses (financial or non-financial) while providing the services needed to resume critical applications and functions in the event of a disaster. The assessment below provides a starting point and guidance for performing an Impact and Preparedness Analysis to determine an appropriate DR strategy and implementation.

Download the DR Impact Assessment

--

Try Databricks for free. Get started today.

The post Disaster Recovery Overview, Strategies, and Assessment appeared first on Databricks.


Cortex Labs is Joining Databricks to Accelerate Model Serving and MLOps


As enterprises grow their investments in data platforms, they increasingly want to go beyond using data for internal analytics and start integrating predictions from machine learning (ML) models to create a competitive advantage for their products and services. For example, financial institutions deploy ML models to detect fraudulent transactions in real-time, and retailers use ML models to personalize product recommendations for each customer.

These mission-critical applications require an MLOps platform that can scale to process millions of predictions per second at low latency and with high availability while providing visibility into how models are performing in production. This becomes even more of a challenge with compute-intensive deep learning models that power natural language processing and computer vision applications.

To accelerate model serving and MLOps on Databricks, we are excited to announce that Cortex Labs, a Bay Area-based MLOps startup, has joined Databricks. Cortex Labs is the maker of Cortex, a popular open-source platform for deploying, managing, and scaling ML models in production. Cortex Labs was backed by leading infrastructure software investors Pitango Venture Capital, Engineering Capital, Uncorrelated Ventures, at.inc/, and Abstraction Capital, as well as angels Jeremey Schneider and Lior Gavish.

Cortex enables engineers and data scientists to deploy ML models in production without worrying about DevOps or cloud infrastructure. Companies from cybersecurity, biotechnology, retail, and other industries use Cortex to scale production ML workloads reliably, securely, and cost-effectively.

Databricks already provides advanced capabilities for developing models, and with the team behind Cortex, Databricks will augment its platform with capabilities to scale and monitor ML workloads in production. We’re thrilled to welcome Co-founders Omer Spillinger and David Eliahu to the Databricks team. Together, we’ll be working to realize our shared vision of an end-to-end, multi-cloud platform that empowers enterprises to deliver machine learning applications to their customers. Stay tuned for more updates!

--

Try Databricks for free. Get started today.

The post Cortex Labs is Joining Databricks to Accelerate Model Serving and MLOps appeared first on Databricks.

Introducing Databricks SQL on Google Cloud – Now in Public Preview


Today we’re pleased to announce the availability of Databricks SQL in public preview on Google Cloud. With this announcement, customers can further adopt the lakehouse architecture by performing data warehousing and business intelligence workloads on Google Cloud by leveraging Databricks SQL’s world record-setting performance for data warehousing workloads using standard SQL.

The tight integration of Databricks on Google Cloud gives organizations the flexibility of running analytics and AI workloads on a simple, open lakehouse platform that combines the best of data warehouses and data lakes.

With Databricks SQL you have all capabilities needed to run data warehousing and analytics workloads on the Databricks Lakehouse Platform with Google Cloud:

  • Instant, elastic serverless compute for low-latency, high-concurrency queries that are typical in analytics workloads. Compute is separated from storage so you can scale with confidence.
  • Optimized and integrated connectors for your BI tools, so you can get value from your data without having to learn new solutions.
  • Simplified administration and data governance using standard SQL, so you can quickly and confidently enable self-serve analytics.
  • Simple user administration with native support for Google Workspace based SSO.
  • A first-class, built-in analytics experience with a SQL query editor, visualizations and interactive dashboards.

With Databricks on Google Cloud you have all the capabilities needed to run data warehousing and analytics workloads on the Databricks Lakehouse Platform

Getting started

Databricks SQL is in public preview for all customers on Google Cloud, enabling them to operate multi-cloud lakehouse architectures with performant query execution. This makes the public preview of Databricks SQL on Google Cloud a win for customers, who can enable their organizations to work seamlessly across data and AI services on Google Cloud. Learn more about Databricks SQL on Google Cloud by joining us at Data Partner Spotlight and our next hands-on Quickstart Lab.

--

Try Databricks for free. Get started today.

The post Introducing Databricks SQL on Google Cloud – Now in Public Preview appeared first on Databricks.

Bring Your Own VPC to Databricks on Google Cloud


Today, we are excited to announce the public preview of customer-managed virtual private cloud for Databricks on Google Cloud. This new capability further enhances the Databricks Lakehouse Platform’s deep integration with Google Cloud’s data and security services. For example, Google Cloud Identity is natively integrated, and compliance with security certifications such as ISO 27001 and SOC 2 helps customers meet GDPR and CCPA requirements.

Virtual Private Cloud

Enterprise customers should begin using customer-managed virtual private cloud (VPC) capabilities for their deployments on Google Cloud. Customer-managed VPCs enable you to comply with a number of internal and external security policies and frameworks, while providing a platform-as-a-service approach to data and AI to combine the ease of use of a managed platform with secure-by-default deployment. Below is a diagram to illustrate the difference between Databricks-managed and customer-managed VPCs:



Conceptual architecture – Databricks-managed vs. customer-managed VPC

Bring your own VPC

To use your own managed VPC:

  1. Create and set up your VPC network
  2. Confirm or add roles on projects for your admin user account
  3. Register your network with Databricks, which creates a network configuration object
  4. Create a Databricks workspace that references your network configuration

Register your network with Databricks, which creates a network configuration object

The first step of setting up a VPC on Google Cloud with Databricks is creating the network configuration object.

Workspace referencing your network configuration

Accessing a Google Cloud VPC from the Databricks UI.

The feature is in public preview today with full production SLAs in Databricks-supported Google Cloud regions. General availability is coming soon.

Next steps

To get started with Databricks, using your own VPC on Google Cloud, begin with these instructions. If you are new, start with a Databricks on Google Cloud trial, attend a Quickstart Lab, and take advantage of this 3-part training series. For any questions, please reach out to us using this contact form.

Try Databricks on GCP for Free!

--

Try Databricks for free. Get started today.

The post Bring Your Own VPC to Databricks on Google Cloud appeared first on Databricks.

How Uplift built CDC and Multiplexing data pipelines with Databricks Delta Live Tables


This blog has been co-developed and co-authored by Ruchira and Joydeep from Uplift; we’d like to thank them for their contributions and thought leadership on adopting the Databricks Lakehouse Platform.

 
Uplift is the leading Buy Now, Pay Later solution that empowers people to get more out of life, one thoughtful purchase at a time. Uplift’s flexible payment option gives shoppers a simple, surprise-free way to buy now, live now, and pay over time.

Uplift’s solution is integrated into the purchase flow of more than 200 merchant partners, with the highest levels of security, privacy and data management. This ensures that customers enjoy frictionless shopping across online, call center and in-person experiences. This massive partner ecosystem creates challenges for their engineering team in both data engineering and analytics. As the company scales exponentially with data being its primary value driver, Uplift requires an extremely scalable solution that minimizes the amount of infrastructure and “janitor code” that it needs to manage.

With hundreds of partners and data sources, Uplift leverages their core data pipeline from its integrations to drive insights and operations such as:

  • Funnel metrics – application rates, approval rates, take-up rates, conversion rates, transaction volume.
  • User metrics – repeat user rates, total active users, new users, churn rates, cross-channel shopping.
  • Partner reporting – funnel and revenue metrics at partner level.
  • Funding – eligibility criteria, metrics, and monitoring for financed assets.
  • Payments – authorization approval rates, retry success rates.
  • Lending – roll rates, delinquency monitoring, recoveries, credit/fraud approval funnels.
  • Customer support – call center statistics, queue monitoring, payment portal activity funnel.

To achieve this, Uplift leveraged the Databricks Lakehouse Platform to construct a robust data integration system that easily ingests and orchestrates hundreds of topics from Kafka and S3 object storage. While each data source is stored separately, new sources are discovered and ingested automatically from the application engineering teams (data producers), and data evolves independently for each data source to be made available to the downstream analytics team.

Prior to standardizing on the lakehouse platform, adding new data sources and communicating changes across teams was manual, error-prone, and time-consuming since each new source required a new data pipeline to be written. Using Delta Live Tables, their system has become scalable, automatically reactive to changes, and configurable, thus making time to insight much faster by reducing the number of notebooks (from 100+ to 2 pipelines) to develop, manage and orchestrate.


For this data integration pipeline, Uplift had the following requirements:

  1. Provide the ability to scalably ingest 100+ topics from Kafka/S3 into the Lakehouse, with Delta Lake being the foundation, and can be utilized by analysts in its raw form in a table format.
  2. Provide a flexible layer that dynamically creates a table for a new Kafka topic that could arrive at any point. This allows for easy new data discovery and exploration.
  3. Automatically update schema changes for each topic as data changes from Kafka.
  4. Provide a downstream layer configurable with explicit table rules such as schema enforcement, data quality expectations, data type mappings, default values, etc. to ensure productized tables are governed properly.
  5. Ensure that the data pipeline can handle SCD Type 1 updates to all explicitly configured tables.
  6. Allow for applications downstream to create aggregate summary statistics and trends.

These requirements serve as a fitting use case for a design pattern called “multiplexing”. Multiplexing is used when a set of independent streams all share the same source. In this example, we have a Kafka message queue and a series of S3 buckets with hundreds of change events, with the raw data being inserted into a single Delta table that we would like to ingest and parse in parallel.

Note, multiplexing is a complex streaming design pattern that has different trade-offs from the typical pattern of creating one-to-one source-to-target streams. If multiplexing is something you are considering but have not yet implemented, it would be helpful to start with this getting streaming in production video, which covers many best practices around basic streaming as well as the trade-offs of implementing this design pattern.

Let’s review two general solutions for this use case that utilize the Medallion Architecture using Delta Lake. This is a foundational framework that underpins both solutions below.


Multiplexing Solutions:

  • Spark Structured Streaming on Databricks, using one-to-many streaming with the foreachBatch method. This solution reads the Bronze stage table and splits the single stream into multiple tables inside each micro-batch.
  • Databricks Delta Live Tables (DLT) is used to create and manage all streams in parallel. This process uses the single input table to dynamically identify all the unique topics in the bronze table and generate independent streams for each without needing to explicitly write code and manage checkpoints for each topic.

*The remainder of this article assumes you have some exposure to Spark Structured Streaming and a basic introduction to Delta Live Tables.

In our example here, Delta Live Tables provides a declarative pipeline that allows us to provide a configuration of all table definitions in a highly flexible architecture managed for us. With one data pipeline, DLT can define, stream, and manage 100s of tables in a configurable pipeline without losing table level flexibility. For example, some downstream tables may need to run once per day while others need to be real-time for analytics. All of this can now be managed in one data pipeline.

Before we dive into the Delta Live Tables (DLT) Solution, it is helpful to point out the existing solution design using Spark Structured Streaming on Databricks.

Solution 1: Multiplexing using Delta + Spark Structured Streaming in Databricks

The architecture for this structured streaming design pattern is shown below:

Multiplexing using Delta + Spark Structured Streaming in Databricks architecture

In a Structured Streaming task, a stream will read multiple topics from Kafka, and then parse out tables in one stream to multiple tables within a foreachBatch statement. The code block below serves as an example for writing to multiple tables in a single stream.


from pyspark.sql.functions import col, lit

df_bronze_stage_1 = spark.readStream.format("json").load()  # source path omitted for brevity

def writeMultipleTables(microBatchDf, BatchId):
  
  df_topic_1 = (microBatchDf
                 .filter(col("topic")== lit("topic_1"))
                  )
  
  df_topic_2 = (microBatchDf
                 .filter(col("topic")== lit("topic_2"))
                  )
  
  df_topic_3 = (microBatchDf
                 .filter(col("topic")== lit("topic_3"))
                  )
  
  df_topic_4 = (microBatchDf
                 .filter(col("topic")== lit("topic_4"))
                  )
  
  df_topic_5 = (microBatchDf
                 .filter(col("topic")== lit("topic_5"))
                  )
  
  ### Apply schemas
  
  ## Look up schema registry, check to see if the events in each event type are equal to the most recently registered schema, Register new schema
  
  ##### Write to sink location (in series within the microBatch)
  df_topic_1.write.format("delta").mode("overwrite").option("path","/data/dlt_blog/bronze_topic_1").saveAsTable("bronze_topic_1")
  df_topic_2.write.format("delta").option("mergeSchema", "true").option("path", "/data/dlt_blog/bronze_topic_2").mode("overwrite").saveAsTable("bronze_topic_2")
  df_topic_3.write.format("delta").mode("overwrite").option("path", "/data/dlt_blog/bronze_topic_3").saveAsTable("bronze_topic_3")
  df_topic_4.write.format("delta").mode("overwrite").option("path", "/data/dlt_blog/bronze_topic_4").saveAsTable("bronze_topic_4")
  df_topic_5.write.format("delta").mode("overwrite").option("path", "/data/dlt_blog/bronze_topic_5").saveAsTable("bronze_topic_5")
  
  return

 ### Using For each batch - microBatchMode
 (df_bronze_stage_1 # This is a readStream data frame
   .writeStream
   .trigger(availableNow=True) # ProcessingTime='30 seconds'
   .option("checkpointLocation", checkpoint_location)
   .foreachBatch(writeMultipleTables)
   .start()
 )

There are a few key design considerations to note in the Spark Structured Streaming solution.

To stream one-to-many tables in structured streaming, we need to use a foreachBatch function, and provide the table writes inside that function for each microBatch (see example above). This is a very powerful design, but it has some limitations:

  1. Scalability: Writing one-to-many tables is easy for a few tables, but not scalable for 100s of tables as this would mean all tables are written in series (since spark code runs in order, each write statement needs to complete before the next starts) by default as shown in the code example above. This will increase the overall job runtime significantly for each table added.
  2. Complexity: The writes are hardcoded, meaning there is no simple way to automatically discover new topics and create tables moving forward of those new topics. Each time a new data source arrives, a code release is required. This is a significant time sink and makes the pipeline brittle. This is possible, but requires significant development effort.
  3. Rigidity: Tables may need to be refreshed at different rates, have different quality expectations, and different pre-processing logic such as partitions or data layout needs. This requires the creation of totally separate jobs to refresh different groups of tables.
  4. Efficiency: These tables can have wildly different data volumes, so if they all use the same streaming cluster, then there will be times where the cluster is not well utilized. Load balancing these streams requires development effort and more creative solutions.

Overall, this solution works well; however, the challenges above can be addressed, and the solution further simplified, with a single DLT pipeline.


Solution 2: Multiplexing + CDC using Databricks Delta Live Tables in Python

To easily satisfy the requirements above (automatically discovering new tables, parallel stream processing in one job, data quality enforcement, schema evolution by table, and perform CDC upserts at the final stage for all tables), we use the Delta Live Tables meta-programming model in Python to declare and build all tables in parallel for each stage.

The architecture for this solution in Delta Live Tables is as follows:

Multiplexing + CDC using Databricks Delta Live Tables in Python architecture

This is accomplished with 1 job made up of 2 tasks:

  1. Task A: A readStream of raw data from all Kafka topics into Bronze Stage 1 into a single Delta Table. Task A then creates a view of the distinct topics that the stream has seen. (You can optionally use a schema registry to explicitly store and use the schemas of each topic payload to parse in the next task, this view could hold that schema registry or you could use any other schema management system). In this example, we simply dynamically infer all schemas from each JSON payload for each topic, and perform data type conversions downstream at the silver stage.
  2. Task B: A single Delta Live Tables pipeline that streams from Bronze Stage 1, uses the view generated in the first tasks as a configuration, and then uses the meta programming model to create Bronze Stage 2 tables for every topic currently in the view each time it is triggered.

    The same DLT pipeline then reads an explicit configuration (a JSON config in this case) to register “productized” tables with more stringent data quality expectations and data type enforcements. In this stage, the pipeline cleans all Bronze Stage 2 tables, and then implements the APPLY CHANGES INTO method for the productized tables to merge updates into the final Silver Stage.

    Finally, Gold Stage aggregates are created from the Silver Stage representing analytics serving tables to be ingested by reports.


Implementation Steps for Multiplexing + CDC in Delta Live Tables

Below are the individual implementation steps for setting up a multiplexing pipeline + CDC in Delta Live Tables:

  1. Raw to Bronze Stage 1 – Code example reading topics from Kafka and saving to a Bronze Stage 1 Delta Table.
  2. Create View of Unique Topics/Events – Creation of the View from Bronze Stage 1.
  3. Fan out Single Bronze Stage 1 to Individual Tables – Bronze Stage 2 code example (meta-programming) from the view.
  4. Bring Bronze Stage 2 Tables to Silver Stage – Code example demonstrating metaprogramming model from the silver config layer along with silver table management configuration example.
  5. Create Gold Aggregates – Code example in Delta Live Tables creating complete Gold Summary Tables.
  6. DLT Pipeline DAG – Test and Run the DLT pipeline from Bronze Stage 1 to Gold.
  7. DLT Pipeline Configuration – Configure the Delta Live Tables pipeline with any parameters, cluster customization, and other configuration changes needed for implementing in production.
  8. Multi-task job Creation – Combined step 1 and step 2-7 (all one DLT pipeline) into a Single Databricks Job, where there are 2 tasks that run in series.

Step 1: Raw to Bronze Stage 1 – Code example reading topics from Kafka and saving to a Bronze Stage 1 Delta Table.


from pyspark.sql.functions import col

startingOffsets = "earliest"

kafka = (spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_bootstrap_servers_plaintext) 
  .option("subscribe", topic )
  .option("startingOffsets", startingOffsets)
  .load()
        )

read_stream = (kafka.select(col("key").cast("string").alias("topic"), col("value").alias("payload"))
              )

(read_stream
 .writeStream
 .format("delta")
 .outputMode("append")
 .option("checkpointLocation", checkpoint_location)
 # .option("path", "<external storage path>") may be added to store the table at an external location
 .toTable("PreBronzeAllTypes")
)

Step 2: Create View of Unique Topics/Events



%sql
CREATE VIEW IF NOT EXISTS dlt_types_config AS
SELECT DISTINCT topic, sub_topic -- Other things such as schema from a registry, or other helpful metadata from Kafka
FROM PreBronzeAllTypes;

Step 3: Fan out Single Bronze Stage 1 to Individual Tables


%python
bronze_tables = spark.read.table("cody_uplift_dlt_blog.dlt_types_config")

## Distinct list is already managed for us via the view definition
topic_list = [[i[0],i[1]] for i in bronze_tables.select(col('topic'), col('sub_topic')).coalesce(1).collect()]

print(topic_list)


import re

def generate_bronze_tables(topic, sub_topic):
  topic_clean = re.sub("/", "_", re.sub("-", "_", topic))
  sub_topic_clean = re.sub("/", "_", re.sub("-", "_", sub_topic))
  
  @dlt.table(
    name=f"bronze_{topic_clean}_{sub_topic_clean}",
    comment=f"Bronze table for topic: {topic_clean}, sub_topic:{sub_topic_clean}"
  )
  
  def create_call_table():
    ## For now this is the beginning of the DAG in DLT
    df = spark.readStream.table('cody_uplift_dlt_blog.PreBronzeAllTypes').filter((col("topic") == topic) & (col("sub_topic") == sub_topic))
    
    ## Pass readStream into any preprocessing functions that return a streaming data frame
    df_flat = _flatten(df, topic, sub_topic)
    
    return df_flat


for topic, sub_topic in topic_list:
  #print(f"Build table for {topic} with event type {sub_topic}")
  generate_bronze_tables(topic, sub_topic)

Step 4: Bring Bronze Stage 2 Tables to Silver Stage

Code example demonstrating metaprogramming model from the silver config layer along with silver table management configuration example.

Define DLT Function to Generate Bronze Stage 2 Transformations and Table Configuration


def generate_bronze_transformed_tables(source_table, trigger_interval, partition_cols, zorder_cols, column_rename_logic = '', drop_column_logic = ''):
  
  @dlt.table(
   name=f"bronze_transformed_{source_table}",
   table_properties={
    "quality": "bronze",
    "pipelines.autoOptimize.managed": "true",
    "pipelines.autoOptimize.zOrderCols": zorder_cols,
    "pipelines.trigger.interval": trigger_interval
  }
  )
  def transform_bronze_tables():
      source_delta = dlt.read_stream(source_table)
      transformed_delta = eval(f"source_delta{column_rename_logic}{drop_column_logic}")
      return transformed_delta
      

Define Function to Generate Silver Tables with CDC in Delta Live Tables


def generate_silver_tables(target_table, source_table, merge_keys, where_condition, trigger_interval, partition_cols, zorder_cols, expect_all_or_drop_dict, column_rename_logic = '', drop_column_logic = ''):


  #### Define DLT Table this way if we want to map columns
  @dlt.view(
  name=f"silver_source_{source_table}")
  @dlt.expect_all_or_drop(expect_all_or_drop_dict)
  def build_source_view():
    #
    source_delta = dlt.read_stream(source_table)
    transformed_delta = eval(f"source_delta{column_rename_logic}{drop_column_logic}")
    return transformed_delta
    #return dlt.read_stream(f"bronze_transformed_{source_table}")

  ### Create the target table definition
  dlt.create_target_table(name=target_table,
  comment= f"Clean, merged {target_table}",
  #partition_cols=["topic"],
  table_properties={
    "quality": "silver",
    "pipelines.autoOptimize.managed": "true",
    "pipelines.autoOptimize.zOrderCols": zorder_cols,
    "pipelines.trigger.interval": trigger_interval
  }
  )
  
  ## Do the merge
  dlt.apply_changes(
    target = target_table,
    source = f"silver_source_{source_table}",
    keys = merge_keys,
    #where = where_condition,#f"{source}.Column) <> col({target}.Column)"
    sequence_by = col("timestamp"),#primary key, auto-incrementing ID of any kind that can be used to identity order of events, or timestamp
    ignore_null_updates = False
  )
  return

Get Silver Table Config and Pass to Merge Function
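
The loop below reads from a silver_tables_config dictionary that is defined elsewhere in the original notebook. A hypothetical sketch of what one entry could look like is shown here, with illustrative table and column names; the keys mirror what the loop accesses.

# Hypothetical sketch of a silver config entry; names and values are illustrative only.
silver_tables_config = {
  "Silver_Finance_Update": {
    "source_table_name": "bronze_transformed_bronze_finance_update",
    "upk": ["event_id"],                                              # unique primary key(s) used as merge keys
    "trigger_interval": "1 minute",
    "partition_columns": ["event_date"],
    "zorder_columns": "event_id, timestamp",
    "renamed_columns": [{"customer_id": ["customerId", "cust_id"]}],  # target column -> columns to coalesce
    "expect_all_or_drop": {"valid_event_id": "event_id IS NOT NULL"}  # DLT expectations
  }
}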


for table, config in silver_tables_config.items():
  ##### Build Transformation Query Logic from a Config File #####
  
  #Desired format for renamed columns
  result_renamed_columns = []
  for renamed_column, coalesced_columns in config.get('renamed_columns')[0].items():
    renamed_col_result = []
    for i in range( 0 , len(coalesced_columns)):
      renamed_col_result.append(f"col('{coalesced_columns[i]}')")
    result_renamed_columns.append(f".withColumn('{renamed_column}', coalesce({','.join(renamed_col_result)}))")
    
  #Drop renamed columns
  result_drop_renamed_columns = []
  for renamed_column, dropped_column in config.get('renamed_columns')[0].items():
    for item in dropped_column:
      result_drop_renamed_columns.append(f".drop(col('{item}'))")
    
    
  #Desired format for pk NULL check
  where_conditions = []
  for item in config.get('upk'):
    where_conditions.append(f"{item} IS NOT NULL")
  
  source_table = config.get("source_table_name")
  upks = config.get("upk")

  ### Table Level Properties
  trigger_interval = config.get("trigger_interval")
  partition_cols = config.get("partition_columns")
  zorder_cols = config.get("zorder_columns")
  column_rename_logic = ''.join(result_renamed_columns)
  drop_column_logic = ''.join(result_drop_renamed_columns)
  expect_all_or_drop_dict = config.get("expect_all_or_drop")
  
  print(f"""Target Table: {table} \n 
  Source Table: {source_table} \n 
  ON: {upks} \n Renamed Columns: {result_renamed_columns} \n 
  Dropping Replaced Columns: {result_drop_renamed_columns} \n 
  With the following WHERE conditions: {where_conditions}.\n 
  Column Rename Logic: {column_rename_logic} \n 
  Drop Column Logic: {drop_column_logic}\n\n""")
    
  ### Do CDC Separate from Transformations
  generate_silver_tables(target_table=table, 
                         source_table=config.get("source_table_name"), 
                         trigger_interval = trigger_interval,
                         partition_cols = partition_cols,
                         zorder_cols = zorder_cols,
                         expect_all_or_drop_dict = expect_all_or_drop_dict,
                         merge_keys = upks,
                         where_condition = where_conditions,
                         column_rename_logic= column_rename_logic,
                         drop_column_logic= drop_column_logic
                         )

Step 5: Create Gold Aggregates

Create Gold Aggregation Tables


@dlt.table(
name='Funnel_Metrics_By_Day',
table_properties={'quality': 'gold'}
)
def getFunnelMetricsByDay():
  
  summary_df = (dlt.read("Silver_Finance_Update").groupBy(date_trunc('day', col("timestamp")).alias("Date")).agg(count(col("timestamp")).alias("DailyFunnelMetrics"))
        )
  
  return summary_df

Step 6: DLT Pipeline DAG – Putting it all together creates the following DLT Pipeline:

DLT Pipeline DAG - Test and Run the DLT pipeline from Bronze Stage 1 to Gold

Step 7: DLT Pipeline Configuration

{
    "id": "c44f3244-b5b6-4308-baff-5c9c1fafd37a",
    "name": "UpliftDLTPipeline",
    "storage": "dbfs:/pipelines/c44f3244-b5b6-4308-baff-5c9c1fafd37a",
    "configuration": {
        "pipelines.applyChangesPreviewEnabled": "true"
    },
    "clusters": [
        {
            "label": "default",
            "autoscale": {
                "min_workers": 1,
                "max_workers": 5
            }
        }
    ],
    "libraries": [
        {
            "notebook": {
                "path": "/Streaming Demos/UpliftDLTWork/DLT - Bronze Layer"
            }
        },
        {
            "notebook": {
                "path": "/Users/DataEngineering/Streaming Demos/UpliftDLTWork/DLT - Silver Layer"
            }
        }
    ],
    "target": "uplift_dlt_blog",
    "continuous": false,
    "development": true
}

This settings configuration is where you set up pipeline-level parameters, cloud configurations such as IAM instance profiles, cluster configurations, and much more. See the DLT documentation for the full list of configurations available.

Step 8: Multi-task job Creation – Combine DLT Pipeline and Preprocessing Step to 1 Job

Multi-task job Creation - Combine DLT Pipeline and preprocessing steps to a single Databricks Job, where there are 2 tasks that run in series.

In Delta Live Tables, we can control all aspects of each table independently via the configurations of the tables without changing the pipeline code. This simplifies pipeline changes, vastly increases scalability with advanced auto-scaling, and improves efficiency due to the parallel generation of tables. Lastly, the entire 100+ table pipeline is all supported in one job that abstracts away all streaming infrastructure to a simple configuration, and manages data quality for all supported tables in the pipeline in a simple UI. Before Delta Live Tables, managing the data quality and lineage for a pipeline like this would be manual and extremely time consuming.

This is a great example of how Delta Live Tables simplifies the data engineering experience while allowing data engineers and analysts (you can also create DLT pipelines entirely in SQL) to build sophisticated pipelines that would have taken hundreds of hours to build and manage in-house.

Ultimately, Delta Live Tables enables Uplift to focus on providing smarter and more effective product offerings for their partners instead of wrangling each data source with thousands of lines of “janitor code”.

--

Try Databricks for free. Get started today.

The post How Uplift built CDC and Multiplexing data pipelines with Databricks Delta Live Tables appeared first on Databricks.

How Wrong Is Your Model?

$
0
0

In this blog, we look at the topic of uncertainty quantification for machine learning and deep learning. By no means is this a new subject, but the introduction of tools such as Tensorflow Probability and Pyro have made it easy to perform probabilistic modeling to streamline uncertainty calculations. Consider the scenario in which we predict the value of an asset like a house, based on a number of features, to drive purchasing decisions. Wouldn’t it be beneficial to know how certain we are of these predicted prices? Tensorflow Probability allows you to use the familiar Tensorflow syntax and methodology but adds the ability to work with distributions. In this introductory post, we leave the priors and the Bayesian treatment behind and opt for a simpler probabilistic treatment to illustrate the basic principles. We use the likelihood principle to illustrate how an uncertainty measure can be obtained along with predicted values by applying them to a deep learning regression problem.

Uncertainty quantification

Uncertainty can be divided into two types:

  1. Epistemic uncertainty
  2. Aleatoric uncertainty

Epistemic uncertainty results from the model lacking information; it can be reduced by providing more data to the model or by increasing the model's representational capacity (i.e., its complexity). This type of uncertainty can potentially be addressed and reduced. Aleatoric uncertainty, on the other hand, stems from the inherent stochasticity in the data-generating process. In stochastic processes, there are a number of parameters and only a subset of these parameters are observable. So, theoretically, if there were a way to measure all these parameters, we would be able to reproduce an event exactly. However, in most real-life scenarios, this is not the case. In this work, we are trying to quantify the epistemic uncertainty, which stems from the lack of knowledge in our network or model parameters.

Problem definition

The goal here is to quantify the uncertainty of predictions. In other words, along with getting the predicted values, a measure of certainty or confidence would also be computed for each predicted value. We are going to illustrate this uncertainty analysis using a regression problem. Here we model the relationship between the independent variables and the dependent variable using a neural network. Instead of the neural network outputting a single predicted value y_pred, the network will now predict the parameters of a distribution. This probability distribution is chosen based on the type of the target or dependent variable. For classification, the MaxLike [https://www.nbi.dk/~petersen/Teaching/Stat2015/Week3/AS2015_1201_Likelihood.pdf] principle tells us that the network weights are updated to maximize the likelihood or probability of seeing the true data class given the model (network + weights). A Normal distribution is a baseline; however, it may not be appropriate for all scenarios. For example, if the target variable represents count data, we would choose a Poisson distribution. For a Normal distribution, the neural network would output two values, the parameters of the distribution y_mean and y_std, for every input data point. We are assuming a parametric distribution in the output or target variable, which may or may not be valid. For more complex modeling, you may want to consider a mixture of Gaussians or a Mixture Density network instead.

Normally, the error of the predicted values is computed using a number of loss functions such as the MSE, cross-entropy, etc. Since we have probabilistic outputs, MSE is not an appropriate way to measure the error. Instead, we choose the likelihood function, or rather the Negative Log-likelihood (NLL) as a baseline loss function. In fact, apart from the differences in interpretation of one being deterministic and the other being probabilistic in nature, it can be shown that cross-entropy and NLL are equivalent [REFERENCE]. To illustrate this, two Normal distributions are plotted in Fig. 1 below with the dotted lines indicating the likelihood as the probability density at two different data points. The narrower distribution is shown in red, while the wider distribution is plotted in blue. The likelihood of the data point given by x=68 is higher for the narrower distribution, while the likelihood for the point given by x=85 is higher for the wider distribution.

Sample uncertainty analysis: normal distributions with differing variances and the probability densities corresponding to two points.

Fig. 1 Likelihood at two different points

Using the MaxLike principle [REFERENCE] and under assumptions of independence of data points, the objective here is to maximize the likelihood of each data point. As a result of the independence assumption, the total likelihood is therefore the product of the individual likelihoods. For numerical stability, we use the log-likelihood as opposed to the likelihood. The NLLs are summed up for each point to obtain the total loss for each iteration.
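
Concretely, assuming the Normal output distribution used throughout this post, the summed NLL could be computed as in the following NumPy sketch (this is separate from the Keras loss defined later and is shown only to make the formula explicit):

import numpy as np

def total_nll(y, mu, sigma):
  # Summed negative log-likelihood of independent points y under Normal(mu, sigma)
  return np.sum(np.log(sigma) + (y - mu) ** 2 / (2 * sigma ** 2) + 0.5 * np.log(2 * np.pi))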

We want to capture the non-linear relationships that may exist between the independent and dependent variables, therefore we use multiple hidden layers with activation functions for both parameters y_mean and y_std. This allows non-monotonic variations for both parameters. One could simplify this in two ways:

  1. Fixed variance: only a single parameter y_mean is estimated
  2. Linear variance: y_std is also estimated but now this is a function of a single hidden layer and no activation function

The examples below show non-linear variance of standard deviation. The first example illustrates how to fit a linear model (linear variation for y_mean) and will be followed by non-linear variation of y_mean to capture more complex phenomena.

What is Tensorflow Probability (TFP)?

Tensorflow Probability is a framework that builds upon TensorFlow and adds the ability to define, manipulate, and perform operations on probability distributions. You can define distributions and sample from them, as shown below.
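
For example, a minimal sketch of defining a Normal distribution and sampling from it looks like this:

import tensorflow_probability as tfp
tfd = tfp.distributions

dist = tfd.Normal(loc=0.0, scale=1.0)   # standard Normal distribution
samples = dist.sample(5)                # draw five samples
log_probs = dist.log_prob(samples)      # log-density evaluated at those samples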

What distributions are available?

Common distributions such as the Bernoulli, Binomial, Normal, Gamma, etc. are available. More information about these distributions can be found here [https://www.tensorflow.org/probability/api_docs/python/tfp/distributions]

Predictions with uncertainty using Tensorflow Probability on synthetic data

In order to illustrate how TFP can be used to quantify the uncertainty of a prediction, we start off with a synthetic one-dimensional dataset. Synthetic data allows us to perform a controlled experiment and the single dimension makes it easy to visualize the uncertainty associated with each data point and prediction.

Synthetic data generation

The goal here is to generate some synthetic data with non-constant variance. This property of the data is referred to as heteroscedasticity. This data is generated in segments and then concatenated together, as shown below.

Fit a linear model with non-constant standard deviation

Some noise is added to the above data, and we generate the target variable ‘y’ from the independent variable ‘x’ and the noise. The relationship between them is:

y=2.7*x+noise

This data is then split into a training set and a validation set to assess performance. The relationship between the dependent and independent variables can be visualized in Fig. 2 for both the training set and the validation set.

# Note: x1 (the first synthetic segment) and the total sample count n_points are
# defined earlier in the original notebook and are assumed here.
np.random.seed(99)
first_part = len(x1)
x11 = np.random.uniform(-1, 1, first_part)
np.random.seed(97)
x12 = np.random.uniform(1, 6, n_points - first_part)
x = np.sort(np.concatenate([x11, x12]))
np.random.seed(4710)
# Heteroscedastic noise: the noise standard deviation scales with x.
# np.abs is used because np.random.normal requires a non-negative scale.
noise = np.random.normal(0, np.abs(x), len(x))
y = 2.7 * x + noise
Sample visualization depicting the relationship between the dependent and independent variables.

Fig. 2 – Generated data

Define the model

The model that we build is a fairly simple one with three dense layers applied to the data and two outputs, corresponding to the mean y_mean and the standard deviation y_std. These parameters are concatenated and passed to the distribution function ‘my_dist’.

In the function ‘my_dist,’ the Normal distribution is parameterized by the mean and scale (standard deviation). The mean is the first index of the two-dimensional variable ‘params.’ The second index is an unconstrained network output, which is passed through a softplus transformation: a standard deviation must always be positive, while the raw output of a neural network layer can be positive or negative, so the transformation constrains it to positive values.

The function ‘NLL’ computes the Negative Log-likelihood (NLL) of the input data given the network parameters, as the name indicates and returns them. This will be the loss function.

Three models are generated:

  1. Model – outputs y_mean and y_std for the output distribution
  2. Model_mean – outputs the mean of the distribution returned from ‘my_dist’
  3. Model_std – outputs the standard deviation of the distribution returned from ‘my_dist’
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

tfd = tfp.distributions

def NLL(y, distr):
  return -distr.log_prob(y)

def my_dist(params):
  # Both parameters are learnable; softplus keeps the scale positive
  return tfd.Normal(loc=params[:, 0:1], scale=1e-3 + tf.math.softplus(0.05 * params[:, 1:2]))

inputs = Input(shape=(1,))
hiddena = Dense(30)(inputs)
hidden1 = Dense(20, activation="relu")(hiddena)
hidden2 = Dense(20, activation="relu")(hidden1)
out1 = Dense(1)(hiddena)  # A: mean, linear in the input
out2 = Dense(1)(hidden2)  # B: raw standard deviation
params = Concatenate()([out1, out2])  # C
dist = tfp.layers.DistributionLambda(my_dist)(params)
model_flex_sd = Model(inputs=inputs, outputs=dist)
model_flex_sd.compile(Adam(learning_rate=0.01), loss=NLL)
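
Training then follows the usual Keras pattern. A hypothetical fit call is sketched below; the epoch and batch-size values, and the x_train/y_train and x_val/y_val split variables, are assumptions rather than the original settings.

# Hypothetical training call; split variable names and hyperparameters are assumed.
history = model_flex_sd.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=2000,
    batch_size=32,
    verbose=0
)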

Evaluating the results

Once the model is trained and the convergence plot is inspected, we can also observe the sum of NLLs for the test data. We will look at this in more detail in the next section. This can be used to tune the model and evaluate the fit, but care should be taken to not perform comparisons across different datasets. The sum of NLLs can be computed as shown below.

model_flex_sd.evaluate(x_test,y_test, verbose=0)
4.0097329248257765

The model fitted on the training and validation data is shown below. A linear model was fit as a result of the training, and the black line obtained from y_mean captures this trend. The variance is indicated by the dotted red lines, which aligns with the variance that was incorporated into the generated data. Finally, this is evaluated on the test data set.


Fig. 3 Predicted mean and standard deviation on the test data

Fit a non-linear model with non-constant standard deviation

Here, the relationship between the dependent and independent variables varies in a non-linear manner due to the squared term, as shown below.

y = 2.7x + x^2 + noise

In order to obtain this nonlinear behavior, we add an activation function (non-linear) to the output of y_mean. Similar to what was done before, we fit the model and plot the predicted mean and standard deviation at each data point for the training, validation and test data points as shown below.

inputs = Input(shape=(1,))
hiddena = Dense(30, activation="relu")(inputs)  # activation added so y_mean can vary non-linearly
hidden1 = Dense(20, activation="relu")(hiddena)
hidden2 = Dense(20, activation="relu")(hidden1)
out1 = Dense(1)(hiddena)  # A: mean
out2 = Dense(1)(hidden2)  # B: raw standard deviation
params = Concatenate()([out1, out2])  # C
dist = tfp.layers.DistributionLambda(my_dist)(params)
# The model is assembled and compiled exactly as before
model_flex_sd = Model(inputs=inputs, outputs=dist)
model_flex_sd.compile(Adam(learning_rate=0.01), loss=NLL)
You will notice that the predicted mean is no longer linear and attempts to capture the non-linearity in the original data.

Fig. 4 Non-linear generated data – training and validation

Sample visualization  predicting mean and standard deviation on the training and validation data.

Fig. 5 Predicted mean and standard deviation on the training and validation data

Fig. 6 Predicted mean and standard deviation on the test data

Quantify uncertainty on high-dimensional real-life data

Unlike the synthetic data generated previously, real-life data tends not to have desirable properties such as unit standard deviation; therefore, preprocessing the data is often a good idea. This is particularly important for techniques whose validity rests on assumptions of Normality in the data distribution. The dataset used here is the Diabetes dataset [REFERENCE]. This is a regression problem with numerical features and targets.

Data preprocessing

There are two transformations applied to the data here.

  1. Standardization
  2. Power transformation or Quantile transformation

The data is standardized and then one of two transforms is applied. The Power transformation can use either the Box-Cox transform [G.E.P. Box and D.R. Cox, "An Analysis of Transformations", Journal of the Royal Statistical Society B, 26, 211-252 (1964).], which assumes that all values are positive, or the Yeo-Johnson transform [I.K. Yeo and R.A. Johnson, "A new family of power transformations to improve normality or symmetry." Biometrika, 87(4), pp.954-959, (2000).], which makes no such assumption about the nature of the data. Both of these transform the data to a more Gaussian-like distribution. In the Quantile transformer, the quantile information of each feature is used to map it to the desired distribution, which here is the Normal distribution.

from sklearn.preprocessing import StandardScaler, PowerTransformer, QuantileTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

def preprocess_pipeline_power(df, target_column):
  scaler = StandardScaler()
  power_transform = PowerTransformer(method='yeo-johnson')
  pipeline_power = Pipeline([('s', scaler), ('p', power_transform)])
  res_power = pipeline_power.fit_transform(df)
  x_train, x_test, y_train, y_test = train_test_split(res_power[:, 0:-1], res_power[:, -1], test_size=0.2, random_state=123)
  return (x_train, x_test, y_train, y_test)

def preprocess_pipeline_quantile(df, target_column):
  scaler = StandardScaler()
  quantile_transform = QuantileTransformer(n_quantiles=100, output_distribution="normal")
  pipeline_quantile = Pipeline([('s', scaler), ('q', quantile_transform)])
  res_quantile = pipeline_quantile.fit_transform(df)
  x_train, x_test, y_train, y_test = train_test_split(res_quantile[:, 0:-1], res_quantile[:, -1], test_size=0.2, random_state=123)
  return (x_train, x_test, y_train, y_test)
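
A hypothetical call is shown below, assuming the features and the target have been assembled into a single DataFrame `df` with the target as the last column (which is what the split inside these functions implies).

# Hypothetical usage; `df` and the target column name are assumptions.
x_train, x_test, y_train, y_test = preprocess_pipeline_quantile(df, target_column="target")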

Model fit and evaluate

def NLL(y, distr):
  return -distr.log_prob(y)

def my_dist(params):
  # Both parameters are learnable; softplus keeps the scale positive
  return tfd.Normal(loc=params[:, 0:1], scale=1e-3 + tf.math.softplus(0.05 * params[:, 1:2]))

def get_model(X):
  if isinstance(X, pd.DataFrame):
    Xlen = len(X.columns)
  else:
    Xlen = np.shape(X)[1]
  input1 = Input(shape=(Xlen,))  # number of input features
  hidden1 = Dense(32, activation='relu', name='dense_1')(input1)
  hidden2 = Dense(8, activation='relu', name='dense_2')(input1)
  out1 = Dense(1, activation='relu', name='out_1')(hidden2)  # out1 is the mean
  out2 = Dense(1, activation='relu', name='out_2')(hidden1)  # out2 is the raw std
  params = Concatenate()([out1, out2])
  dist = tfp.layers.DistributionLambda(my_dist)(params)
  model = Model(inputs=input1, outputs=dist)
  model.compile(Adam(learning_rate=0.001), loss=NLL)
  model_mean = Model(inputs=input1, outputs=dist.mean())
  model_std = Model(inputs=input1, outputs=dist.stddev())
  model.summary()
  return (model, model_mean, model_std)

def fit_model(model, X_data_train, y_data_train, batch_size=128, epochs=1000, validation_split=0.1):
  history = model.fit(X_data_train, y_data_train, batch_size=batch_size, epochs=epochs, validation_split=validation_split)
  return model

def evaluate_model(model, model_mean, model_std, X_data_test, y_data_test):
  y_out_mean = model_mean.predict(X_data_test)
  y_out_std = model_std.predict(X_data_test)
  y_out_mean_vals = y_out_mean.squeeze(axis=1)
  if isinstance(y_data_test, pd.DataFrame):
    y_test_true_vals = y_data_test.values.squeeze(axis=1)
  else:
    y_test_true_vals = y_data_test
  y_out_std_vals = y_out_std.squeeze(axis=1)
  neg_log_prob_array = []
  for predicted, true_val, predicted_std in zip(y_out_mean_vals, y_test_true_vals, y_out_std_vals):
    # NLL of each true test value under the predicted Normal distribution
    neg_log_prob = -1.0 * tfd.Normal(predicted, predicted_std).log_prob(true_val).numpy()
    neg_log_prob_array.append(neg_log_prob)
  # Return after scoring every test point
  return neg_log_prob_array

Evaluating the results

As mentioned before, apart from the convergence plots, you can evaluate model uncertainty based on the performance on the test set using the sum of NLL. This metric gives us a way to compare different models. We can also look at the distribution of the NLLs that are obtained on the test data set to understand how the model has generalized to new data points. Outliers could have contributed to a large NLL, which would be obvious from inspecting the distribution.

model, model_mean, model_std = get_model(X_trans)
model = fit_model(model, X_data_train, y_data_train, epochs=1000)
neg_log_array = evaluate_model(model, model_mean, model_std, X_data_test, y_data_test)

Here, the NLL for each point is accumulated in the array ‘neg_log_array’ and plotted as a histogram. We compare two scenarios: one where the quantile transformation is applied to the data and one where the power transformation is applied. We want most of the density in the histogram to be close to 0, indicating that most of the data points had a low NLL, i.e., the model fit those points well. Fig. 7 illustrates this for the two transformation techniques; the quantile transformation seems to perform marginally better if your goal is to minimize the outliers in the uncertainty of model predictions. This comparison can also be used to perform hyperparameter tuning and select an optimal model.
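
A minimal sketch of plotting this histogram, assuming matplotlib is available, is:

import matplotlib.pyplot as plt

# Plot the distribution of per-point NLLs on the test set
plt.hist(neg_log_array, bins=50)
plt.xlabel("Negative log-likelihood per test point")
plt.ylabel("Count")
plt.show()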


Fig. 7 Histogram of NLL, ideally we want more density towards 0

Conclusion

This post shows how uncertainty quantification can be beneficial in predictive modeling. Additionally, we walked through using Tensorflow Probability to quantify uncertainty from a probabilistic perspective on a deep learning problem. This approach avoids a full Bayesian treatment and tends to be a more approachable introduction to uncertainty estimation.

Try out the examples shown here on Databricks on our ML runtimes!

--

Try Databricks for free. Get started today.

The post How Wrong Is Your Model? appeared first on Databricks.
