
Recap of Databricks Lakehouse Platform Announcements at Data and AI Summit 2022


Data teams have never been more important to the world. Over the past few years we’ve seen many of our customers building a new generation of data and AI applications that are reshaping and transforming every industry with the lakehouse.

The data lakehouse paradigm introduced by Databricks is the future for modern data teams seeking to build solutions that unify analytics, data engineering, machine learning, and streaming workloads across clouds on one simple, open data platform.

Many of our customers, from enterprises to startups across the globe, love and trust Databricks. In fact, half of the Fortune 500 are seeing the lakehouse drive impact. Organizations like John Deere, Amgen, AT&T, Northwestern Mutual and Walgreens are making the move to the lakehouse because of its ability to deliver analytics and machine learning on both structured and unstructured data.

Last month we unveiled innovation across the Databricks Lakehouse Platform to a sold-out crowd at the annual Data + AI Summit. Throughout the conference, we announced several contributions to popular data and AI open source projects as well as new capabilities across workloads.

Open sourcing all of Delta Lake

Delta Lake is the fastest and most advanced multi-engine storage format. We’ve seen incredible success and adoption thanks to the reliability and performance it provides. Today, Delta Lake is the most widely used storage layer in the world, with over 7 million monthly downloads, a 10x increase in just one year.

We announced that Databricks will contribute all features and enhancements it has made to Delta Lake to the Linux Foundation and open source all Delta Lake APIs as part of the Delta Lake 2.0 release.

Delta Lake 2.0 will bring unmatched query performance to all Delta Lake users and enable everyone to build a highly performant data lakehouse on open standards. With this contribution, Databricks customers and the open source community will benefit from the full functionality and enhanced performance of Delta Lake 2.0. The Delta Lake 2.0 Release Candidate is now available and is expected to be fully released later this year. The breadth of the Delta Lake ecosystem makes it flexible and powerful in a wide range of use cases.

Spark from Any Device and Next Generation Streaming Engine

As the leading unified engine for large-scale data analytics, Spark scales seamlessly to handle data sets of all sizes. However, the lack of built-in remote connectivity and the burden of applications being developed and run on the driver node make it harder to meet the requirements of modern data applications. To tackle this, Databricks introduced Spark Connect, a client and server interface for Apache Spark™ based on the DataFrame API that will decouple the client and server for better stability and allow for built-in remote connectivity. With Spark Connect, users can access Spark from any device.
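
To make this concrete, here is a minimal sketch of what working with Spark Connect looks like from PySpark. This is an illustration based on the Spark Connect client that later shipped with Apache Spark, not code from the announcement itself; the endpoint below is a placeholder for your own Spark Connect server.

from pyspark.sql import SparkSession

# Connect to a remote Spark cluster over Spark Connect (endpoint is a placeholder).
spark = SparkSession.builder.remote("sc://spark-connect.example.com:15002").getOrCreate()

# DataFrame operations are built on the thin client and executed on the remote server.
df = spark.range(10).filter("id % 2 = 0")
print(df.collect())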

Data streaming on the lakehouse is one of the fastest-growing workloads within the Databricks Lakehouse Platform and is the future of all data processing. In collaboration with the Spark community, Databricks also announced Project Lightspeed, the next generation of the Spark Structured Streaming engine for data streaming on the lakehouse.

Expanding Data Governance, Security, and Compliance Capabilities

For organizations, governance, security, and compliance are critical because they help guarantee that all data assets are maintained and managed securely across the enterprise and that the company is in compliance with regulatory frameworks. Databricks announced several new capabilities that further expand data governance, security, and compliance capabilities.

  • Unity Catalog will be generally available on AWS and Azure in the coming weeks. It offers a centralized governance solution for all data and AI assets, with built-in search and discovery, automated lineage for all workloads, and the performance and scalability required for a lakehouse on any cloud (see the sketch after this list).
  • Databricks also introduced data lineage for Unity Catalog last month, significantly expanding data governance capabilities on the lakehouse and giving data teams a complete view of the entire data lifecycle. With data lineage, customers gain visibility into where data in their lakehouse came from, who created it and when, how it has been modified over time, how it’s used across data warehousing and data science workloads, and much more.
  • Databricks extended capabilities for customers in highly regulated industries to help them maintain compliance with Payment Card Industry Data Security Standard (PCI-DSS) and Health Insurance Portability and Accountability Act (HIPAA). Databricks extended HIPAA and PCI-DSS compliance features on AWS for multi-tenant E2 architecture deployments, and now also provides HIPAA Compliance features on Google Cloud (both are in public preview).
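
As a hedged illustration of that centralized governance model, access to Unity Catalog assets is granted with standard SQL using three-level (catalog.schema.table) names. The catalog, schema, table, and group names below are placeholders.

# Grant a group access to a table governed by Unity Catalog (names are placeholders).
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `data-analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data-analysts`")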

Safe, open sharing allows data to achieve new value without vendor lock-in

Data sharing has become important in the digital economy as enterprises wish to easily and securely exchange data with their customers, partners, suppliers and internal lines of business to better collaborate and unlock value from that data. To address the limitations of existing data sharing solutions, Databricks developed Delta Sharing, with various contributions from the OSS community, and donated it to the Linux Foundation. We announced Delta Sharing will be generally available in the coming weeks.

Databricks is helping customers share and collaborate with data across organizational boundaries and we also unveiled enhancements to data sharing enabled by Databricks Marketplace and Data Cleanrooms.

  • Databricks Marketplace: Available in the coming months, Databricks Marketplace provides an open marketplace to package and distribute data sets and a host of associated analytics assets like notebooks, sample code and dashboards without vendor lock-in.
  • Data Cleanrooms: Available in the coming months, Data Cleanrooms for the lakehouse will provide a way for companies to safely discover insights together by partnering in analysis without having to share their underlying data.

The Best Data Warehouse is a Lakehouse

Data warehousing is one of the most business-critical workloads for data teams. Databricks SQL (DBSQL) is a serverless data warehouse on the Databricks Lakehouse Platform that lets you run all your SQL and BI applications at scale with up to 12x better price/performance, a unified governance model, open formats and APIs, and your tools of choice – no lock-in. Databricks unveiled new data warehousing capabilities in its platform to enhance analytics workloads further:

  • Databricks SQL Serverless is now available in preview on AWS, providing instant, secure and fully managed elastic compute for improved performance at a lower cost.
  • Photon, the record-setting query engine for lakehouse systems, will be generally available on Databricks Workspaces in the coming weeks, further expanding Photon’s reach across the platform. In the two years since Photon was announced, it has processed exabytes of data, run billions of queries, and delivered benchmark-setting price/performance up to 12x better than traditional cloud data warehouses.
  • Open source connectors for Go, Node.js, and Python make it even simpler to access the lakehouse from operational applications (see the Python sketch after this list), while the Databricks SQL CLI enables developers and analysts to run queries directly from their local computers.
  • Databricks SQL now provides query federation, offering the ability to query remote data sources including PostgreSQL, MySQL, AWS Redshift, and others without the need to first extract and load the data from the source systems.
  • Python UDFs deliver the power of Python right into Databricks SQL! Now analysts can tap into Python functions – from complex transformation logic to machine learning models – that data scientists have already developed and seamlessly use them in their SQL statements.
  • Materialized Views (MVs) accelerate end-user queries and reduce infrastructure costs with efficient, incremental computation. Built on top of Delta Live Tables (DLT), MVs reduce query latency by pre-computing otherwise slow queries and frequently used computations.
  • Primary Key & Foreign Key Constraints provides analysts with a familiar toolkit for advanced data modeling on the lakehouse. DBSQL & BI tools can then leverage this metadata for improved query planning.

Reliable Data Engineering

Tens of millions of production workloads run daily on Databricks. With the Databricks Lakehouse Platform, data engineers have access to an end-to-end data engineering solution for ingesting and transforming batch and streaming data, orchestrating reliable production workflows at scale, and increasing the productivity of data teams with built-in data quality testing and support for software development best practices.

We recently announced general availability on all three clouds of Delta Live Tables (DLT), the first ETL framework to use a simple, declarative approach to building reliable data pipelines. Since its launch earlier this year, Databricks continues to expand DLT with new capabilities. We are excited to announce we are developing Enzyme, a performance optimization purpose-built for ETL workloads. Enzyme efficiently keeps up to date a materialization of the results of a given query stored in a Delta table. It uses a cost model to choose between various techniques, including those used in traditional materialized views, delta-to-delta streaming, and the manual ETL patterns commonly used by our customers. Additionally, DLT offers new enhanced autoscaling, purpose-built to intelligently scale resources with the fluctuations of streaming workloads, and support for Slowly Changing Dimensions (SCD) Type 2 when processing change data capture (CDC) feeds, which tracks every change in source data for both compliance and machine learning experimentation purposes. When dealing with changing data, you often need to update records to keep track of the most recent values; SCD Type 2 applies those updates to a target while preserving the original data (a sketch follows below).
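
Here is a hedged sketch of what SCD Type 2 processing looks like with DLT's APPLY CHANGES (apply_changes) API. Table, key, and column names are placeholders, and the helper for creating the target table has varied slightly across DLT releases.

import dlt

# Target table that will hold the full change history (SCD Type 2).
dlt.create_streaming_live_table("customers_scd2")

dlt.apply_changes(
    target="customers_scd2",
    source="customer_cdc_feed",      # a streaming table of CDC events
    keys=["customer_id"],            # business key used to match records
    sequence_by="event_timestamp",   # ordering column so late-arriving changes apply correctly
    stored_as_scd_type="2",          # keep history: prior versions are end-dated, not overwritten
)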

We also recently announced general availability on all three clouds of Databricks Workflows, the fully managed lakehouse orchestration service for all your teams to build reliable data, analytics and AI workflows on any cloud. Since its launch earlier this year, Databricks continues to expand Databricks Workflows with new capabilities including Git support for Workflows now available in Public Preview, running dbt projects in production, new SQL task type in Jobs, new “Repair and Rerun” capability in Jobs, and context sharing between tasks.

Production Machine Learning at Scale

Databricks Machine Learning on the lakehouse provides end-to-end machine learning capabilities from data ingestion and training to deployment and monitoring, all in one unified experience, creating a consistent view across the ML lifecycle and enabling stronger team collaboration. We continue to innovate across the ML lifecycle to enable you to put models into production faster:

  • MLflow 2.0: As one of the most successful open source machine learning (ML) projects, MLflow has set the standard for ML platforms. The release of MLflow 2.0 introduces MLflow Pipelines to make MLOps simple and get more projects to production. It offers out-of-the-box templates and provides a structured framework that enables teams to automate the handoff from experimentation to production (see the sketch after this list). You can preview this functionality with the latest version of MLflow.
  • Serverless Model Endpoints: Deploy your models on Serverless Model Endpoints for real-time inference in your production applications, without the need to maintain your own infrastructure. Users can customize autoscaling to handle their model’s throughput, and for predictable traffic use cases, teams can save costs by autoscaling all the way down to zero.
  • Model Monitoring: Track the performance of your production models with Model Monitoring. It auto-generates dashboards to help teams view and analyze data and model quality drift. Model Monitoring also provides the underlying analysis and drift tables as Delta tables, so teams can join performance metrics with business value metrics to calculate business impact, as well as create alerts when metrics fall below specified thresholds.
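
For a flavor of MLflow Pipelines (the first item above), here is a hedged sketch using the preview API as it appeared in early MLflow Pipelines releases (later renamed MLflow Recipes); it assumes a project scaffolded from one of the MLflow Pipelines templates with a pipeline.yaml and a local profile.

from mlflow.pipelines import Pipeline

# Run the templated pipeline end to end and inspect the generated step cards.
p = Pipeline(profile="local")
p.run()        # executes the templated steps (e.g. ingest, split, transform, train, evaluate, register)
p.inspect()    # renders summary cards for the most recent run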

Learn more

Modern data teams need innovative data architectures to meet the requirements of the next generation of data and AI applications. The lakehouse paradigm provides a simple, multicloud, and open platform, and it remains our mission to support all our customers who want to do business intelligence, AI, and machine learning in one platform. You can watch all our Data and AI Summit keynotes and breakout sessions on demand to learn more about these announcements. You can also download the Data Team’s Guide to the Databricks Lakehouse Platform for a deeper dive into the Databricks Lakehouse Platform.

--

Try Databricks for free. Get started today.



Key Public Sector Takeaways from Data + AI Summit


This year’s Data + AI Summit was groundbreaking overall, from the quality of keynote speakers to the game-changing product news. One of the most exciting additions was our new hybrid industry tracks with sessions and forums for attendees across six of the largest industries at Databricks, including Public Sector!

In case you missed the live event, I’m excited to share important product announcements and highlights of the industry program. Our sessions, which are now on-demand, feature Databricks employees, customers, and partners sharing their views of the Lakehouse for Public Sector and why it has been a key component for government agencies looking to modernize their data strategy to deliver more insights and support the mission of government.

Public Sector Forum

For our government attendees, the most exciting part of Data + AI Summit 2022 was the Public Sector Forum – a two-hour event that brought together leaders from across all segments of government to hear from peers about their data journey.

In his keynote, Databricks VP of Federal, Howard Levenson, shared an overview of the lakehouse and how it delivers on the promise of both the Federal Data Strategy and the DoD Data Decrees.

In a fireside chat with CDC Chief Data Officer Alan Sim and CDC Chief Architect Rishi Tarar, attendees learned about the agency’s COVID-19 vaccine rollout and the challenges they addressed by providing near real-time insight to the public, hospitals, and state and local agencies. The CDC was also announced as the winner of the 2022 Data Democratization Award for the work they did to support the vaccine rollout, and their work with state and local agencies and medical partners to monitor the spread and treatment of COVID-19.

The forum included an executive panel featuring Fredy Diaz, Analytics Director at the USPS Office of the Inspector General, and Dr. John Scott, Acting Director of Data Management and Analytics at the Veterans Health Administration, who discussed their agencies’ adoption of the lakehouse and the impact it’s had on their mission.

Concluding the session, Cody Ferguson, Data Operations Director at DoD Advana, and Brad Corwin, Chief Data Scientist at Booz Allen Hamilton, shared an in-depth overview of the DoD Advanced Analytics Platform, Advana, and the capabilities it has delivered to the Department of Defense.

Industry Sessions

All sessions are now available on our virtual platform. Here are a few you don’t want to miss:

  • LA County, Department of Human Resources – How the Largest US County is Transforming Hiring with A Modern Data Lakehouse
  • US Air Force – Safeguarding Personnel Data at Enterprise Scale
  • Veterans Affairs – Cloud and Data Science Modernization with Azure Databricks
  • Deloitte – Implementing a Framework for Data Security at a Large Public Sector Agency
  • State of CA, CalHEERS – Data Lake for State Health Exchange Analytics Using Databricks

Databricks Announcements That Will Transform the Public Sector

While much has been written about the innovations shared by Databricks at this year’s Data + AI Summit, I thought I would provide a quick recap of the news that is particularly exciting for our government customers:

Data Management and Engineering

Delta Lake 2.0 – now fully open source.
This announcement is extremely relevant to our Public Sector customers. Both the DoD Data Decrees and the Federal Data Strategy stress the importance of choosing open source solutions for the Public Sector; by taking this step, Databricks further demonstrates its commitment to developing a lakehouse foundation that is secure, open, and interoperable. Government customers can be sure that:

  • Your data is in an open storage format in YOUR object store
  • Your code is managed via CI/CD and lives in YOUR GitHub repo
  • Your applications leverage open source APIs
  • There is no code or data lock-in. We lock you in with value:
    • The infrastructure savings of running your application faster and turning off your cloud compute sooner
    • The productivity gains of leveraging our platform to do your development and production work
    • The mission outcomes that you can unlock, with a very quick time to value

Delta Live Tables introduces enhanced autoscaling. This is going to be a game changer for our Public Sector customers, many of whom have asked for the ability to optimize their cluster utilization to reduce infrastructure costs in an automated way without requiring manual intervention. Enhanced autoscaling means public sector customers can build pipelines to ingest and curate their data faster while doing so in the most cost-effective way, without manual tuning.

The information on Project Lightspeed shared at the conference is incredibly relevant to our public sector customers, who have seen a significant increase in the need to gain insight into streaming data in real time. With use cases spanning every segment of our government, from visa processing and supply chain management to electronic health records and postal delivery, the combined power of Delta Live Tables (DLT) and Structured Streaming holds great potential for the public sector. In addition, the focus on leveraging streaming data insight at petabyte-scale volumes enables government agencies to mitigate cyber threats and meet the requirements laid out in OMB M-21-31. All in all, the ease of use and flexibility of this solution are unmatched and we’re excited to offer it to our Public Sector customers.

Governance and Data Sharing

Delta Sharing is now GA. Delta Sharing is a phenomenal technical solution to enable some amazing outcomes for the government. Intergovernmental data sharing has become more critical than ever, as highlighted by the COVID-19 pandemic most recently. In order to address complex challenges that require the collaboration of multiple Federal agencies, state and local governments, and commercial partners, it is critical that government agencies have a way to securely share data to achieve outcomes that will benefit all constituents.

The announcement of Cleanrooms provides an opportunity for the government as agencies begin to share data more openly. The win is the ability to share data across agencies without sacrificing data ownership and data governance, ultimately leading to better mission outcomes.

Also shared were updates around Unity Catalog, which address the number one goal of many Federal CDOs today – the need for a well-cataloged and governed data platform. In addition, many of our catalog partners will be able to take advantage of Unity’s existing API standards to leverage governance on top of the lakehouse. Because Public Sector customers care particularly about data lineage, they will celebrate having a greater understanding of the data sources that make up reports and tables.

Data Science and Machine Learning

Lastly, we announced MLflow 2.0, which includes MLflow Pipelines, a significant advantage for public sector data teams when they need to operationalize a model. MLflow Pipelines provides a structured framework that enables teams to automate the handoff from exploration to production so that ML engineers no longer have to juggle manual code rewrites and refactoring. MLflow Pipeline templates scaffold pre-defined graphs with user-customizable steps and natively integrate with the rest of MLflow’s model lifecycle management tools. Pipelines also provide helper functions, or “step cards”, to standardize model evaluation and data profiling across projects. The net of this is that a public sector organization can put a model into production significantly faster.

Beyond these featured announcements, there was other exciting news about Databricks Marketplace and Serverless Model Endpoints. I encourage you to check out the Day 1 and Day 2 Keynotes to learn more about our product announcements!

--

Try Databricks for free. Get started today.


Hevo Data and Databricks Partner to Automate Data Integration for the Lakehouse


Businesses today have a wealth of information siloed across databases like MongoDB, PostgreSQL, and MySQL, and SaaS applications such as Salesforce, Zendesk, Intercom, and Google Analytics. Bringing this data into a centralized repository requires a lot of development and maintenance work. Building a custom connector for one such data source requires months of engineering bandwidth and constant maintenance work to avoid data loss or loading errors due to changes in source data or APIs.

Hevo Data, the end-to-end data pipeline platform, has partnered with Databricks to provide an easy and automated way for businesses to integrate their data from multiple SaaS sources and databases into Delta Lake. This partnership will enable Databricks users to break down data silos quickly, eliminate manual, error-prone data integration tasks, and get accurate and reliable data in the lakehouse to support their analytics, reporting, and AI/ML use cases. Hevo’s upcoming integration with Databricks Partner Connect will make it easier for Databricks customers to try Hevo and ingest data quickly. Users will be able to start a seamless Hevo trial experience right from the Databricks product, reducing friction for users to leverage Hevo to onboard data to the Lakehouse.

Hevo provides 150+ pre-built integrations with various data sources such as databases, SaaS applications, cloud storage systems, SDKs, streaming services, and more to simplify the integration, transformations, and processing of disparate data. The platform supports multiple use cases, including data replication, ETL, ELT, and Reverse ETL. With Hevo’s Databricks connector, Delta Lake users can achieve faster, more accurate, and more reliable data integration at scale, helping hydrate the lakehouse and solve more use cases with real-time, accurate, and unified data.

In addition to creating pipelines to load data from all the SaaS sources and databases to Delta Lake, Hevo provides numerous vital features. These include pre-load and post-load transformations, append-only and de-duplication methods for loading data, near real-time replication, and auto schema mapping.

Databricks pipeline dashboard on Hevo

Databricks and Hevo share many commonalities. Here are the top three benefits that highlight our partnership:

  • Scalable – Both platforms are hosted on the cloud and built on a horizontally scalable architecture.
  • Secure – Both platforms are built with enterprise-grade security and are HIPAA, SOC2, and GDPR-ready to ensure that your data is completely protected.
  • Robust – Delta helps build robust pipelines at scale, and Hevo automatically detects any anomaly in the incoming data and notifies you instantly to reduce downtime.

If you are already a Databricks customer, look for the upcoming Hevo integration in Partner Connect. To learn more about Hevo’s Databricks connector, please review the detailed documentation or watch the demo video here. Hevo provides a 14-day free trial, so start building your data pipelines today!

--

Try Databricks for free. Get started today.


Security Best Practices for Delta Sharing


The data lakehouse has enabled us to consolidate our data management architectures, eliminating silos and leveraging one common platform for all use cases. The unification of data warehousing and AI use cases on a single platform is a huge step forward for organizations, but once they’ve taken that step, the next question to consider is “how do we share that data simply and securely no matter which client, tool or platform the recipient is using to access it?” Luckily, the lakehouse has an answer to this question too: data sharing with Delta Sharing.

Delta Sharing

Delta Sharing is the world’s first open protocol for securely sharing data internally and across organizations in real-time, independent of the platform on which the data resides. It’s a key component of the openness of the lakehouse architecture, and a key enabler for organizing our data teams and access patterns in ways that haven’t been possible before, such as data mesh.

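To illustrate how open the protocol is, here is a minimal sketch of a recipient reading shared data with the open source Python connector (pip install delta-sharing); the profile file and table coordinates are placeholders for what a data provider would send you.

import delta_sharing

# The credential (profile) file is supplied by the data provider.
profile = "/path/to/config.share"

# Discover what has been shared with you.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load a shared table as pandas (or use load_as_spark on a cluster).
table_url = profile + "#my_share.my_schema.my_table"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())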

Secure by Design

It’s important to note that Delta Sharing has been built from the ground up with security in mind, allowing you to leverage the following features out of the box whether you use the open source version or its managed equivalent:

  • End-to-end TLS encryption from client to server to storage account
  • Short lived credentials such as pre-signed URLs are used to access the data
  • Easily govern, track, and audit access to your shared data sets via Unity Catalog

The best practices that we’ll share as part of this blog are additive, allowing customers to align the appropriate security controls to their risk profile and the sensitivity of their data.

Security Best Practices

Our best practice recommendations for using Delta Sharing to share sensitive data are as follows:

  1. Assess the open source versus the managed version based on your requirements
  2. Set the appropriate recipient token lifetime for every metastore
  3. Establish a process for rotating credentials
  4. Consider the right level of granularity for Shares, Recipients & Partitions
  5. Configure IP Access Lists
  6. Configure Databricks Audit logging
  7. Configure network restrictions on the Storage Account(s)
  8. Configure logging on the Storage Account(s)

1. Assess the open source versus the managed version based on your requirements

As we have established above, Delta Sharing has been built from the ground up with security top of mind. However, there are advantages to using the managed version:

  • Delta Sharing on Databricks is provided by Unity Catalog, which allows you to provide fine-grained access to any data sets between different sets of users centrally from one place. With the open source version, you would need to separate data sets that have various data access rights amongst several sharing servers, and you would also need to impose access restrictions on those servers and the underlying storage accounts. For ease of deployment, a docker image is provided with the open source version, but it is important to note that scaling deployments across large enterprises will pose a non-trivial overhead on the teams responsible for managing them.
  • Just like the rest of the Databricks Lakehouse Platform, Unity Catalog is provided as a managed service. You don’t need to worry about things like the availability, uptime and maintenance of the service because we worry about that for you.
  • Unity Catalog allows you to configure comprehensive audit logging capabilities out of the box.
  • Data owners will be able to manage shares using SQL syntax (see the sketch after this list). Additionally, REST APIs are available to manage shares. Using familiar SQL syntax simplifies the way we share data, reducing the administrative burden.
  • Using the open source version, you’re responsible for the configuration, infrastructure and management of data sharing but with the managed version all this functionality is available out of the box.
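
As a hedged sketch of that SQL syntax (share, table, partition, and recipient names are placeholders):

# Create a share, add data to it, and grant it to a recipient (all names are placeholders).
spark.sql("CREATE SHARE IF NOT EXISTS sales_share")

# Share a whole table, or only selected partitions to follow least privilege.
spark.sql("ALTER SHARE sales_share ADD TABLE main.sales.transactions")
spark.sql("""
  ALTER SHARE sales_share
  ADD TABLE main.sales.transactions_by_region
  PARTITION (region = 'EMEA')
""")

spark.sql("CREATE RECIPIENT IF NOT EXISTS emea_partner")
spark.sql("GRANT SELECT ON SHARE sales_share TO RECIPIENT emea_partner")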

For these reasons, we recommend assessing both versions and making a decision based on your requirements. If ease of setup and use, out-of-the-box governance and auditing, and outsourced service management are important to you, the managed version will likely be the right choice.

2. Set the appropriate recipient token lifetime for every metastore

When you enable Delta Sharing, you configure the token lifetime for recipient credentials. If you set the token lifetime to 0, recipient tokens never expire.

Setting the appropriate token lifetime is critically important from a regulatory, compliance, and reputational standpoint. Having a token that never expires is a huge risk; therefore, we recommend using short-lived tokens as a best practice. It is far easier to grant a new token to a recipient whose token has expired than it is to investigate the use of a token whose lifetime has been improperly set.

See the documentation (AWS, Azure) for configuring tokens to expire after the appropriate number of seconds, minutes, hours, or days.

3. Establish a process for rotating credentials

There are a number of reasons that you might want to rotate credentials, from the expiry of an existing token, concerns that a credential may have been compromised, or even just that you have modified the token lifetime and want to issue new credentials that respect that expiration time.

To ensure that such requests are fulfilled in a predictable and timely manner, it’s important to establish a process, preferably with an established SLA. This could be integrated well into your IT service management process, with the appropriate action completed by the designated data owner, data steward or DBA for that metastore.

See the documentation (AWS, Azure) for how to rotate credentials. In particular:

  • If you need to rotate a credential immediately, set --existing-token-expire-in-seconds to 0, and the existing token will expire immediately.
  • Databricks recommends the following actions when there are concerns that credentials may have been compromised:
    1. Revoke the recipient’s access to the share.
    2. Rotate the recipient and set --existing-token-expire-in-seconds to 0 so that the existing token expires immediately.
    3. Share the new activation link with the intended recipient over a secure channel.
    4. After the activation URL has been accessed, grant the recipient access to the share again.

4. Consider the right level of granularity for Shares, Recipients & Partitions

In the managed version, each share can contain one or more tables and can be associated with one or more recipients, with fine-grained controls over who can access which data sets. This allows us to provide fine-grained access to multiple data sets in a way that would be much harder to achieve using open source alone. And we can even go one step further, adding only part of a table to a share by providing a partition specification (see the documentation on AWS, Azure).

It’s worth taking advantage of these features by implementing your shares and recipients to follow the principle of least privilege, such that if a recipient credential is compromised, it is associated with the fewest number of data sets or the smallest subset of the data possible.

5. Configure IP Access Lists

By default, all that is required to access your shares is a valid Delta Sharing credential file; therefore, it’s critical to minimize the possibility that credentials may be compromised by implementing network-level limits on where they can be used from.

Configure Delta Sharing IP access lists (see the docs for AWS, Azure) to restrict recipient access to trusted IP addresses, for example, the public IP of your corporate VPN.

Combining IP access lists with the access token considerably reduces the risk of unauthorized access. For someone to access the data in an unauthorized manner, they would need both a copy of your token and access to an authorized network, which is much harder than acquiring the token alone.

6. Configure Databricks Audit Logging

Audit logs are your authoritative record of what’s happening on your Databricks Lakehouse Platform, including all of the activities related to Delta Sharing. As such, we highly recommend that you configure Databricks audit logs for each cloud (see the docs for AWS, Azure) and set up automated pipelines to process those logs and monitor/alert on important events.

Check out our companion blog, Monitoring Your Databricks Lakehouse Platform with Audit Logs, for a deeper dive on this subject, including all the code you need to set up Delta Live Tables pipelines, configure Databricks SQL alerts and run SQL queries to answer important questions like the following (a hedged example query appears after this list):

  • Which of my Delta Shares are the most popular?
  • Which countries are my Delta Shares being accessed from?
  • Are Delta Sharing Recipients being created without IP access list restrictions being applied?
  • Are Delta Sharing Recipients being created with IP access list restrictions which are outside of my trusted IP address range?
  • Are attempts to access my Delta Shares failing IP access list restrictions?
  • Are attempts to access my Delta Shares repeatedly failing authentication?
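
For example, here is a hedged sketch of a query over Delta Sharing audit events, assuming audit logs have already been ingested into a table named audit_logs with the standard serviceName, actionName, and requestParams columns; table, column, and action names may differ depending on how your pipeline parses the logs.

display(spark.sql("""
  SELECT requestParams.share AS share_name,
         count(*)            AS requests
  FROM audit_logs
  WHERE serviceName = 'unityCatalog'
    AND actionName LIKE 'deltaSharing%'
  GROUP BY requestParams.share
  ORDER BY requests DESC
"""))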

7. Configure network restrictions on the storage account(s)

Once a Delta Sharing request has been successfully authenticated by the sharing server, an array of short-lived credentials (pre-signed URLs) is generated and returned to the client. The client then uses these URLs to request the relevant files directly from the cloud provider. This design means that the transfer can happen in parallel at massive bandwidth, without streaming the results through the server. It also means that, from a security perspective, you’re likely to want to implement network restrictions on the storage account similar to those on the Delta Sharing recipient itself – there’s no point in protecting the share at the recipient level if the data itself is hosted in a storage account that can be accessed by anyone, from anywhere.

Azure

On Azure, Databricks recommends using Managed Identities (currently in Public Preview) to access the underlying Storage Account on behalf of Unity Catalog. Customers can then configure Storage firewalls to restrict all other access to the trusted private endpoints, virtual networks or public IP ranges that delta sharing clients may use to access the data. Please reach out to your Databricks representative for more information.

Important Note: Again, it’s important to consider all of the potential use cases when determining what network level restrictions to apply. For example, as well as accessing data via delta sharing, it’s likely that one or more Databricks workspaces will also require access to the data, and therefore you should allow access from the relevant trusted private endpoints, virtual networks or public IP ranges used by those workspaces.

AWS

On AWS, Databricks recommends using S3 bucket policies to restrict access to your S3 buckets. For example, the following Deny statement could be used to restrict access to trusted IP addresses and VPCs (placeholder values are shown in angle brackets).

Important Note: It’s important to consider all of the potential use cases when determining what network level restrictions to apply. For example:

  • When using the managed version, the pre-signed URLs are generated by Unity Catalog, and therefore you will need to allow access from the Databricks Control Plane NAT IP for your region.
  • It’s likely that one or more Databricks workspaces will also require access to the data, and therefore you should allow access from the relevant VPC IDs (if the underlying S3 bucket is in the same region and you’re using VPC endpoints to connect to S3) or from the public IP address that the data plane traffic resolves to (for example, via a NAT gateway).
  • To avoid losing connectivity from within your corporate network, Databricks recommends always allowing access from at least one known and trusted IP address, such as the public IP of your corporate VPN. This is because Deny conditions apply even within the AWS console.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAccessFromUntrustedNetworks",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::<bucket-name>",
                "arn:aws:s3:::<bucket-name>/*"
            ],
            "Condition": {
                "NotIpAddressIfExists": {
                    "aws:SourceIp": ["<trusted-cidr-1>", "<trusted-cidr-2>"]
                },
                "StringNotEqualsIfExists": {
                    "aws:SourceVpc": ["<trusted-vpc-id-1>", "<trusted-vpc-id-2>"]
                }
            }
        }
    ]
}

In addition to network-level restrictions, it is also recommended that you restrict access to the underlying S3 buckets to the IAM role used by Unity Catalog. The reason is that, as we have seen, Unity Catalog provides fine-grained access to your data in a way that is not possible with the coarse-grained permissions provided by AWS IAM/S3. Therefore, if someone were able to access the S3 bucket directly, they might be able to bypass those fine-grained permissions and access more of the data than you had intended.

Important Note: As above, Deny conditions apply even within the AWS console, so it is recommended that you also allow access to an administrator role that a small number of privileged users can use to access the AWS UI/APIs. The policy statement below again uses angle-bracket placeholders for the bucket name and role ARNs.

{
    "Sid": "DenyActionsFromUntrustedPrincipals",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
    ],
    "Condition": {
        "StringNotEqualsIfExists": {
            "aws:PrincipalArn": [
                "<unity-catalog-iam-role-arn>",
                "<admin-iam-role-arn>"
            ]
        }
    }
}

8. Configure logging on the storage account(s)

In addition to enforcing network-level restrictions on the underlying storage account(s), you’re likely going to want to monitor whether anyone is trying to bypass them. As such, Databricks recommends enabling the native logging on those storage accounts (for example, S3 server access logging or AWS CloudTrail data events on AWS, and diagnostic logging on Azure Storage) and monitoring those logs for access attempts that fall outside your expected Delta Sharing and workspace traffic.

Conclusion

The lakehouse has solved most of the data management issues that led to us having fragmented data architectures and access patterns, and severely throttled the time to value an organization could expect to see from its data. Now that data teams have been freed from these problems, open but secure data sharing has become the next frontier.

Delta Sharing is the world’s first open protocol for securely sharing data internally and across organizations in real-time, independent of the platform on which the data resides. And by using Delta Sharing in combination with the best practices outlined above, organizations can easily but safely exchange data with their users, partners and customers at enterprise scale.

Existing data marketplaces have failed to maximize business value for data providers and data consumers, but with Databricks Marketplace you can leverage the Databricks Lakehouse Platform to reach more customers, reduce costs and deliver more value across all of your data products.

If you’re interested in becoming a Data Provider Partner, we’d love to hear from you!

--

Try Databricks for free. Get started today.


Automating ML, scoring, and alerting for Detecting Criminals and Nation States through DNS Analytics


This blog is part two of our DNS Analytics blog, where you learned how to detect a remote access trojan using passive DNS (pDNS) and threat intel. Along the way, you learned how to store and analyze DNS data using Delta, Spark and MLflow. In this blog post, we will show you how easy it is to train your model using Databricks AutoML, use Delta Live Tables to score your DNS logs, and generate Databricks SQL alerts on malicious domains scored by the model, delivered right into your inbox.

The Databricks Lakehouse Platform has come a long way since we last blogged about Detecting Criminals and Nation States through DNS Analytics back in June 2020. We’ve set world records, acquired companies, and launched new products that bring the benefits of the lakehouse architecture to whole new audiences like data analysts and citizen data scientists. The world has changed significantly too. Many of us have been working remotely for the majority of that time, and remote work has put increased dependency on the internet infrastructure. One thing has not changed – our reliance on the DNS protocol for naming and routing on the internet. This has led to Advanced Persistent Threat (APT) groups and cyber criminals leveraging DNS for command and control or beaconing or resolution of attacker domains. This is why academic researchers, industry groups and the federal government advise security teams to collect and analyze DNS events to hunt, detect, investigate and respond to new emerging threats and uncover malicious domains used by attackers to infiltrate networks. But you know, it’s not as easy as it sounds.


Figure 1. The complexity, cost, and limitations of legacy technology make detecting DNS security threats challenging for most enterprise organizations.

Detecting malicious domains with Databricks

Using the notebooks below, you will be able to detect the Agent Tesla RAT. You will train a machine learning model to detect domain generation algorithm (DGA) domains and typosquatting, and perform threat intel enrichment using URLhaus. Along the way you will learn the Databricks concepts of:

  • Data ingestion, enrichment, and ad hoc analytics with ETL
  • Model building using AutoML
  • Live scoring domains using Delta Live Tables
  • Producing Alerts with Databricks SQL alerts

Why use Databricks for this? Because the hardest thing about security analytics isn’t the analytics. You already know that analyzing large scale DNS traffic logs is complicated. Colleagues in the security industry tell us that the challenges fall into three categories:

  • Deployment complexity: DNS server data is everywhere. Cloud, hybrid, and multi-cloud deployments make it challenging to collect the data, have a single data store and run analytics consistently across the entire deployment.
  • Tech limitations: Legacy SIEM and log aggregation solutions can’t scale to cloud data volumes for storage, analytics or ML/AI workloads. Especially, when it comes to joining data like threat intel enrichments.
  • Cost: SIEMs or log aggregation systems charge by volume of data ingested. With so much data SIEM/log licensing and hardware requirements make DNS analytics cost prohibitive. And moving data from one cloud service provider to another is also costly and time consuming. The hardware pre-commit in the cloud or the capex of physical hardware on-prem are all deterrents for security teams.

In order to address these issues, security teams need a real-time data analytics platform that can handle cloud-scale, analyze data wherever it is, natively support streaming and batch analytics and, have collaborative content development capabilities. And… if someone could make this entire system elastic to prevent hardware commits… Now wouldn’t that be cool! We will show how Databricks Lakehouse platform addresses all these challenges in this blog.

Let us start with the high-level steps of the detection process. You can use the notebooks in your own Databricks deployment. Here is the high-level flow in these notebooks:

  • Read passive DNS data from AWS S3 bucket
  • Specify the schema for DNS and load the data into Delta
  • Enrich and prep the DNS data with a DGA detection model and GeoIP Enrichments
  • Build the DGA detection model using AutoML
  • Automate the DNS log scoring with DLT
  • Produce Databricks SQL alerts


Figure 2. High level process showing how Databricks DNS analytics help detect criminal threats using pDNS, URLHaus, dnstwist, and Apache Spark

ETL & ML Prep

In our previous blog post we extensively covered ETL and ML prep for DNS analytics. Each section of the notebook has comments. At the end of running this notebook, you will have a clean silver.dns_training_dataset table that is ready for ML training.


Figure 3. Clean DNS training data set with features and label

Automated ML Training with Databricks AutoML

Machine Learning (ML) is at the heart of innovation across industries, creating new opportunities to add value and reduce cost, and security analytics is no different. At the same time, ML is hard to do, and it takes an enormous amount of skill and time to build and deploy reliable ML models. In the previous blog, we showed how to train and create one type of ML model – the random forest classifier. Imagine if we had to repeat that process for, say, ten different types of ML models in order to find the best model (both type and parameters) – Databricks AutoML lets us automate that process! Databricks AutoML – now generally available (GA) with Databricks Runtime ML 10.4 – automatically trains models on a data set and generates customizable source code, significantly reducing the time-to-value of ML projects. This glass-box approach to automated ML provides a realistic path to production with low to no code, while also giving ML experts a jumpstart by creating baseline models that they can reproduce, tune, and improve. No matter your background in data science, AutoML can help you get to production machine learning quickly. All you need is a training dataset and AutoML does the rest. Let us use the silver.dns_training_dataset that we produced in the previous step to automatically apply machine learning using the AutoML classification notebook.

AutoML automatically distributes hyperparameter tuning trials across the worker nodes of a cluster.

Each model is constructed from open source components and can easily be edited and integrated into your machine learning pipelines. You can use Databricks AutoML for regression, classification, and forecasting problems. It evaluates models based on algorithms from the scikit-learn, xgboost, and LightGBM packages.

We will use the dataset that was prepped using pDNS, URLHaus, DGA, dnstwist, alexa 10K, and dictionary in the ETL & ML Prep step. Each row in the table represents a DNS domain's features and a class label of IoC or legit. The goal is to determine whether a domain is an IoC based on its domain name, domain length, domain entropy, alexa_grams, and word_grams.
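
To make one of those features concrete, domain entropy can be computed with a few lines of Python; this is an illustrative sketch rather than the exact feature engineering code in the ETL & ML prep notebook.

import math
from collections import Counter

def shannon_entropy(domain: str) -> float:
    """Shannon entropy of a string; DGA-generated domains tend to score higher."""
    counts = Counter(domain)
    total = len(domain)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(shannon_entropy("google"))           # low entropy, dictionary-like
print(shannon_entropy("xjw9q2kfz7ylbv3"))  # higher entropy, more DGA-like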

input_df = spark.read.format("delta").load("dbfs:/FileStore/tables/tables/silver/dns_training_dataset")
Figure 4. CMD3 of notebook 2_dns_analytics_automl_classification showing the training data set being loaded into a Spark DataFrame

The following command splits the dataset into training and test sets, using the randomSplit method with the specified weights and seed to create DataFrames storing each of these datasets.

train_df, test_df = input_df.randomSplit([0.99, 0.01], seed=42)
Figure 5. CMD4 of notebook 2_dns_analytics_automl_classification showing the dataset being split into training and test sets

The following command starts an AutoML run. You must provide the column that the model should predict in the target_col argument.
When the run completes, you can follow the link to the best trial notebook to examine the training code. This notebook also includes a feature importance plot.

from databricks import automl
summary = automl.classify(train_df, target_col="class", timeout_minutes=30)
Figure 6. CMD4 of notebook 2_dns_analytics_automl_classification showing the start of an AutoML run

AutoML prepares the data for training, runs data exploration, trials multiple model candidates, and generates a Python notebook with the source code tailored to the provided dataset for each trial run. It also automatically distributes hyperparameter tuning and records all experiment artifacts and results in MLflow. It is ridiculously easy to get started with AutoML, and hundreds of customers are using this tool today to solve a variety of problems.

At the end of running this notebook you will have the “best model” that you can use for inference. It’s that easy to build a model with Databricks AutoML.
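
As a hedged sketch, the summary object returned by automl.classify exposes the best trial's MLflow model, which can be loaded for batch inference; attribute names follow the AutoML Python API, and the label column name matches the training set described above.

import mlflow

# MLflow URI of the best model found by AutoML, e.g. "runs:/<run_id>/model".
model_uri = summary.best_trial.model_path
model = mlflow.pyfunc.load_model(model_uri)

# Score the held-out test set (drop the label before predicting).
pdf = test_df.toPandas()
pdf["predicted_class"] = model.predict(pdf.drop(columns=["class"]))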

Easy & reliable DNS log processing with Delta Live Tables

We’ve learned from our customers that loading, cleaning and scoring DNS logs and turning them into production ML pipelines typically involves a lot of tedious, complicated operational work. Even at a small scale, the majority of a data engineer’s time is spent on tooling and managing infrastructure rather than transformation. We also learned from our customers that observability and governance were extremely difficult to implement and, as a result, often left out of the solution entirely. This led to spending lots of time on undifferentiated tasks and to data that was untrustworthy, not reliable, and costly.

In our previous blog, we showed how to perform the loading and transformation logic in vanilla notebooks – imagine if we can simplify that and have a declarative deployment approach with it. Delta Live Tables (DLT) is the first framework that uses a simple declarative approach to building reliable data pipelines and automatically managing your infrastructure at scale so data analysts and engineers can spend less time on tooling and focus on getting value from data. With DLT, engineers are able to treat their data as code and apply modern software engineering best practices like testing, error handling, monitoring and documentation to deploy reliable pipelines at scale. DLT was built from the ground up to automatically manage your infrastructure and to automate complex and time-consuming activities. DLT automatically scales compute infrastructure by allowing the user to set the minimum and maximum number of instances and let DLT size up the cluster according to cluster utilization. In addition, tasks like orchestration, error handling and recovery are all done automatically — as is performance optimization. With DLT, you can focus on data transformation instead of operations.

And because the ETL pipelines that process security logging will benefit greatly from the reliability, scalability and built-in data quality controls that DLT provides, we’ve taken the ETL pipeline shared as part of our previous blog and converted it to DLT.

This DLT pipeline reads your DNS event logs from cloud object storage into your lakehouse and scores those logs using the model that was trained in the previous section.

import dlt

@dlt.table(
  name="dns_log_analytics",
  table_properties={
    "quality": "bronze",
    "pipelines.autoOptimize.managed": "true",
    "delta.autoOptimize.optimizeWrite": "true",
    "delta.autoOptimize.autoCompact": "true"
  }
)
def dns_logs_scoring():
    # Read the raw DNS logs from cloud storage (the dnslogs path comes from the pipeline configuration)
    df = spark.read.csv(dnslogs)
    df.createOrReplaceTempView("dnslogs")
    # Score each domain with the domain_extract and scoreDNS UDFs defined in the companion notebooks
    df = spark.sql("SELECT _c0 as domain, domain_extract(_c0) as domain_tldextract, scoreDNS(domain_extract(_c0)) as class, current_timestamp() as timestamp FROM dnslogs")
    return df
Figure 7. CMD3 of notebook 3_dns_analytics_logs_scoring_pipeline showing DNS log events being read from cloud storage and scored with the ML model
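
The scoreDNS and domain_extract functions referenced above are UDFs registered by the companion shared_include notebook. As a hedged sketch of the general pattern (not the notebooks' exact code), an MLflow model can be wrapped as a Spark UDF and applied to a DataFrame whose columns match the model's training schema; the feature column names and features_df below are placeholders.

import mlflow
from pyspark.sql import functions as F

# Wrap the AutoML model as a Spark UDF; model_uri is the path copied from the AutoML run.
score_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri, result_type="string")

# features_df is assumed to already contain the model's feature columns
# (in the notebooks, scoreDNS also derives these features from the raw domain).
scored_df = features_df.withColumn(
    "class",
    score_udf(F.col("domain"), F.col("length"), F.col("entropy"),
              F.col("alexa_grams"), F.col("word_grams")),
)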

To get the new DLT pipeline running on your environment, please use the following steps:

  1. Create a new DLT pipeline, linking to the shared_include and 3_dns_analytics_logs_scoring_pipeline notebook (see the docs for AWS, Azure, GCP). You’ll need to enter the following configuration options:
      a. dns.dns_logs: The cloud storage path that you’ve configured for DNS logs that need to be scored. This will usually be a protected storage account which isn’t exposed to your Databricks users.
      b. dns.model_uri: The best model path that was created as part of the ML Training step. This is readily available to copy paste from Cmd 19 out of the notebook 2_dns_analytics_automl_classification.py.
      Your DLT configuration should look something like this:


Figure 8. DLT pipeline configuration example with notebooks and parameters.

  2. Now you should be ready to configure your pipeline to run based on the appropriate schedule and trigger. Once it has run successfully, you should see something like this:


Figure 9. DLT pipeline execution example

At the end of running this DLT pipeline, you will have a dns_logs.dns_log_analytics table with a row for each DNS log entry and a class column indicating whether the domain is scored as an IoC or not.

Easy dashboarding and alerting with Databricks SQL

Now that you’ve ingested, transformed and performed ML-based detections on your DNS logs in the Lakehouse, what can you do with the results next? Databricks SQL is a net-new capability since our previous blog that lets you write and run queries, create dashboards, and set up notifications easily with awesome price/performance. If you navigate to the Data Explorer (see the docs for AWS, Azure) you’ll find the dns_log_analytics table in the target database you specified within the DLT configuration above.

Potential use cases here might be anything from ad-hoc investigations into potential IoCs, to finding out who’s accessing malicious domains from your network infrastructure. You can easily configure Databricks SQL alerts to notify you when a scheduled SQL query returns a hit on one of these events.

  • We will make the queries time bound (i.e., by adding timestamp >= current_date() - 1) to alert on the current date.
  • We will use the query to return a count of IoCs (i.e., by adding a COUNT(*) and an appropriate WHERE clause)
  • Now we can configure an alert to run every day and trigger if the count of IoCs is > 0
  • For more complicated alerting based on conditional logic, consider the use of CASE statements (see the docs for AWS, Azure)

For example, the following SQL query could be used to alert on IoCs:

  
select
  count(*) as ioc_count
from
  dns_logs.dns_log_analytics
where
  class = 'ioc'
  AND timestamp >= current_date() - 1

Figure 10. A simple SQL query to find all the IoC domains found on a given day.


Sample dashboard enabling security analysts to search for a specific domain in a pile of potential IoCs, get a count of potential IoCs seen on a given day and also get a full list of potential IoC domains seen on a given day.

These could be coupled with a custom alert template like the following to give platform administrators enough information to investigate whether the acceptable use policy has been violated:

Hello,
Alert "{{ALERT_NAME}}" changed status to {{ALERT_STATUS}}.
There have been the following unexpected events on the last day:
{{QUERY_RESULT_ROWS}}

Check out our documentation for instructions on how to configure alerts (AWS, Azure), as well as for adding additional alert destinations like Slack or PagerDuty (AWS, Azure).

Conclusion

In this blog post you learned how easy it is to ingest, ETL, prep for ML, train models, and live score DNS logs in your Databricks Lakehouse. You also have an example of detection to hunt for signs of compromise within the DNS events and setup alerts to get notifications.

What’s more, you can even query the Lakehouse via your SIEM tool.

We invite you to log in to your own Databricks account and run these notebooks. Please refer to the docs for detailed instructions on importing the notebook to run.

We look forward to your questions and suggestions. You can reach us at: cybersecurity@databricks.com. Also if you are curious about how Databricks approaches security, please review our Security & Trust Center.

--

Try Databricks for free. Get started today.


Sharing Context Between Tasks in Databricks Workflows


Databricks Workflows is a fully-managed service on Databricks that makes it easy to build and manage complex data and ML pipelines in your lakehouse without the need to operate complex infrastructure.

Sometimes, a task in an ETL or ML pipeline depends on the output of an upstream task. An example would be to evaluate the performance of a machine learning model and then have a task determine whether to retrain the model based on model metrics. Since these are two separate steps, it would be best to have separate tasks perform the work. Previously, accessing information from a previous task required storing this information outside of the job’s context, such as in a Delta table.

Databricks Workflows is introducing a new feature called “Task Values”, a simple API for setting and retrieving small values from tasks. Tasks can now output values that can be referenced in subsequent tasks, making it easier to create more expressive workflows. Looking at the history of a job run also provides more context, by showcasing the values passed by tasks at the DAG and task levels. Task values can be set and retrieved through the Databricks Utilities API.
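As an illustrative sketch using the task values utility in dbutils (available by default in a Python notebook task; the task name, key, and threshold below are hypothetical), an upstream task can publish a metric and a downstream task can read it:

# In the "evaluate_model" task: publish a small metric for downstream tasks.
accuracy = 0.92
dbutils.jobs.taskValues.set(key="model_accuracy", value=accuracy)

# In a downstream task: read the value emitted by "evaluate_model".
# debugValue is returned when the notebook runs interactively, outside a job run.
accuracy = dbutils.jobs.taskValues.get(
    taskKey="evaluate_model",
    key="model_accuracy",
    default=0.0,
    debugValue=0.0,
)

if accuracy < 0.9:
    print("Model under threshold; trigger retraining")

Since task values are meant for small values like metrics, flags, or identifiers, larger outputs should still be exchanged through tables or files.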

The history of the run shows that the “evaluate_model” task has emitted a value

When clicking on the task, you can see the values emitted by the task

Task values are now generally available. We would love for you to try out this new functionality and tell us how we can improve orchestration even further!

--

Try Databricks for free. Get started today.

The post Sharing Context Between Tasks in Databricks Workflows appeared first on Databricks.

How Tata Steel is Shifting Global Manufacturing and Production Toward Sustainability


Struggling to overcome the management and cost demands of a legacy data system

Tata Steel products are in almost everything, from household appliances and automobiles, to consumer packaging and industrial equipment. As a fully integrated steel operation, Tata mines, manufactures, and markets the finished products, leveraging data insights from our seven worldwide production sites and across business functions for transparency, coordination, and stability. We have a variety of ways we are leveraging our data and analytics in both the manufacturing and commercial areas of our business with the goals of enhancing sustainability across our operations, reducing overall costs, and streamlining demand planning.

A couple of years ago, Tata started our journey to become a data-driven company. Most use cases were ad hoc analysis focused rather than end-to-end solutions, and we were in the beginning stages of digitalization. We had plans to leverage our data across different facets of the business including:

  • Streamlining supply chain management and logistics;
  • Enabling capacity and demand forecasting;
  • Initiating payload optimization;
  • Supporting, measuring, and guiding Tata toward accomplishing our environmental initiatives.

However, achieving transformation across multiple operations was challenging to get off the ground. Initially, we were using a legacy data product that mimicked a mini data lake, without all the benefits of democratization and scale that we needed. We had limited internal knowledge of how to maximize the infrastructure and considered most of the tools to be user-unfriendly given the makeup of our data team. Issues around cluster availability and access across multiple users caused frequent outages, which frustrated users and led to costly downtime. Additionally, we had to provide our own UI on top of the infrastructure and analytics engine, causing us to incur significant overhead for productionization of use cases as well as integration with our Azure cloud. Too many infrastructure management demands were adding to overhead and forcing IT talent to spend inordinate amounts of time patching, updating, and maintaining tools, wasting effort on infrastructure issues rather than solving business problems. This inability to take action on our data inhibited our ability to move our target use cases into production.

We needed a fully managed, user-friendly data platform that could not only unify all our data in one place but also enable and empower teams to take advantage of our data once the barriers to access were removed. Once we learned about the Lakehouse architecture, we realized the value it offered beyond standard cloud data warehouses, which tend to lack the openness, flexibility, and machine learning support of data lakes. Most importantly, taking a unified approach delivered by the Lakehouse would unlock the promise of our data across more teams within the organization, fueling innovation and better decision-making.

Data enablement in the lakehouse allows for smarter business decisions

About a year and a half ago, Tata deployed the Databricks Lakehouse Platform on Azure. The migration was smooth because we had internal Azure knowledge, and the software instantly solved a lot of issues, specifically from an infrastructure and data management perspective. All of a sudden, data accessibility was enhanced for both administrators and users. Our cluster issues were eliminated, and we finally had ease of scalability, allowing us to explore our data in ways not possible before. Without the low-level infrastructure maintenance tasks, we became more cost-efficient and lowered overhead. At the same time, we were finally able to leverage machine learning (ML) to better meet business goals through innovation. Altogether, everything in Databricks was UI-based, enabling us to shift from being IT-driven to business-value driven thanks to the simplicity and ease of use of the Databricks Lakehouse Platform.

Now equipped with a centralized and unified lakehouse platform, Tata has about 20 to 30 different use cases in production. Using Databricks Lakehouse components such as MLflow and Delta tables, Tata teams across the board are utilizing ML without the problems we struggled with in the past. Demand forecasting and supply chain management, which were previously estimated with rough data, are now streamlined based on customer need, existing and future supply, mode of transportation, workflow capacity, and inventory management. These insights have allowed Tata to better meet customer expectations and improve overall satisfaction by better understanding customer requirements, and they empower us to produce products when needed or utilize existing inventory to avoid waste and expensive rush transport for last-minute deliveries.

On the production side, use cases include predicting finish dates of our orders, dynamic recipe control to ensure steel criteria meet customer standards prior to production completion, and predictive motor maintenance and repair planning to avoid downtime and interruptions at plants. In addition to these use cases focused on meeting customer expectations, we are also heavily invested in environmental sustainability. For example, through payload optimization during freight transportation, we are able to improve our carbon-neutral production and reduce CO2 emissions. With Databricks, Tata is able to move closer to those goals by reducing the dust and odor emissions that occur during production, and increasing freight payload so that products are only transported in fully-utilized trucks.

The value Tata is experiencing goes beyond machine learning. We are also able to serve data-driven insights to different teams and stakeholders across the organization. With our data centralized in the Lakehouse, we are able to easily feed data to dashboards used to make better decisions around supply chain workflows, demand forecasting, capacity planning, and more. On the scale that Tata operates, these small changes contribute significantly to reducing our footprint and striving toward sustainability.

Structured data sharing ensures a better tomorrow, every day

Now that structured data sharing is enabled through Databricks and the various Databricks components Tata uses, we’re seeing hard results we can trust. Today, our demand forecasting model is performing 30% more accurately than before Databricks. Our payload optimization use case is delivering 4-8% cost savings through better transportation planning and allocation of transportation space.

From a user adoption standpoint, we have created forums to share knowledge and help teams in functional departments with their Databricks journey. In turn, this decreased project turnaround and helped teams gain deeper data insight to make smarter decisions throughout Tata. With everything originating from Databricks, we know that teams are using accurate numbers, collaborating across teams, and participating in knowledge sharing, which makes master data management easier, more trustworthy, and more logical. Today, we have over 50 machine learning use cases in production across our commercial business and manufacturing, including logistics planning, payload optimization, production quality management, predictive maintenance, and more.

Additionally, Databricks components like MLflow solve the traditional issues that often prevent non-IT users from successfully implementing data-based use cases. Now, less experienced users can kick off their own projects and easily get benchmarks with AutoML, monitor for data quality across various sources with MLflow, and use Delta Tables with MLflow for traceability between versions.

Overall, Databricks helps us plan for the future because it allows us to focus on what really matters. We can see big-picture sustainability progress and small-picture use case applications without getting lost in the minutiae of technical management. Instead, we can scale data ingestion, use cases, and user adoption for more impact throughout the organization. With our partnership with Databricks, we are quickly moving towards sustainable manufacturing with award-winning use cases such as Zero-carbon logistics and are well on our way to becoming the leading data-driven steel company of the future.

--

Try Databricks for free. Get started today.

The post How Tata Steel is Shifting Global Manufacturing and Production Toward Sustainability appeared first on Databricks.

Data + AI Summit Recap for Media & Entertainment Teams


Now that Data + AI Summit is officially wrapped, we wanted to spend a minute recapping some of the top news, content, and updates – and what those mean for data teams in Media & Entertainment.

Here’s what we announced:

Security, Governance & Sharing

Introducing Data Cleanrooms for the Lakehouse 

We are excited to announce data cleanrooms for the Lakehouse, allowing businesses to easily collaborate with their customers and partners on any cloud in a privacy-safe way. Participants in the data cleanrooms can share and join their existing data and run complex workloads in any language – Python, R, SQL, Java, and Scala – on that data while maintaining data privacy. Data cleanrooms open a broad array of use cases across industries. In the media industry, advertisers and marketers can deliver more targeted ads, with broader reach, better segmentation, and greater ad effectiveness transparency while safeguarding data privacy.

Introducing Databricks Marketplace

Databricks Marketplace is an open marketplace for exchanging data products such as data sets, notebooks, dashboards, and machine learning models. To accelerate insights, data consumers can discover, evaluate, and access more data products from third-party vendors than ever before.

What’s new with Databricks Unity Catalog

With the general availability of Unity Catalog, everything customers love about Unity Catalog – fine-grained access controls, lineage, integrated governance, auditing, and the ability to easily and confidently share data across business units – is now available to every customer on the platform.

Platform Updates

Delta Lake is going fully open source

Media teams have been asking for more open sourcing of Delta Lake for a long time, which is why we’re so excited to share that we’re open sourcing ALL of Delta Lake with the upcoming Delta Lake 2.0 release, starting with the most requested features from the community. Delta Lake is the fastest, most popular, and most advanced open table storage format. The remaining features will be gradually open sourced in the coming months. This means that features previously available only to Databricks customers will be available to the entire Delta Lake community.

In addition, this change will allow for better collaboration across the industry, increased performance, and access to previously proprietary features like Change Data Feed and Z-Ordering, which help lower costs and drive faster insights. You can read more about optimizing performance with file management here.

Delta Live Tables Announces New Capabilities and Performance Optimizations

Delta Live Tables (DLT) has grown to power production ETL use cases at over 1,000 leading companies – from startups to enterprises – all over the world since its inception. Project Enzyme is a new optimization layer for Delta Live Tables that speeds up ETL processing, and the release also brings new enterprise capabilities and UX improvements.

Enhanced Autoscaling optimizes cluster utilization by automatically allocating cluster resources based on workload volume, with minimal impact on the data processing latency of your pipelines, reducing usage and cost for customers.

Project Lightspeed: Faster and Simpler Stream Processing With Apache Spark

As media companies shift to direct-to-consumer models and the advertising ecosystem demands real-time insights, streaming data is core to many Media & Entertainment use cases. Project Lightspeed makes streaming data a first-class citizen on the Databricks Lakehouse Platform, further cementing Databricks as an industry leader in performance and price for streaming data use cases.

This was our first major streaming-focused announcement, although streaming has always been a large and successful part of our business, where we continually improve performance to achieve higher throughput, lower latency, and lower cost. The announcement includes improved ecosystem support for connectors, enhanced functionality for processing data with new operators and APIs, and simplified deployment, operations, monitoring, and troubleshooting.

Data Science & Machine Learning

Introducing MLflow Pipelines with MLflow 2.0

MLflow Pipelines enables data scientists to create production-grade ML pipelines that combine modular ML code with software engineering best practices to make model development and deployment fast and scalable. In practice, this means that code for a recommendation engine, or an anomaly detection algorithm, can be swiftly moved from exploration to production without costly rewrites or refactoring.

Serverless Model Endpoints

Serverless Model Endpoints improve upon existing Databricks-hosted model serving by offering horizontal scaling to thousands of QPS, potential cost savings through auto-scaling, and operational metrics for monitoring runtime performance. Ultimately, this means Databricks-hosted models are suitable for production use at scale. With this addition, your data science teams can now spend more time on business use cases and less time on building and managing Kubernetes infrastructure to serve ML models.

Media & Entertainment Industry sessions

And in case you missed it, there were some incredible Media & Entertainment sessions in which teams discussed the business benefits, cost, productivity savings, and advanced analytics they’re now able to realize with Databricks. Here are a few to highlight:

  • LaLiga: How technical and tactical football analysis is improved through data [DETAILS | WATCH NOW]
  • HuuugeGames: Real-time cost reduction monitoring and alerting [DETAILS]

--

Try Databricks for free. Get started today.

The post Data + AI Summit Recap for Media & Entertainment Teams appeared first on Databricks.


Key Takeaways for Financial Services at Data + AI Summit 2022


Data + AI Summit, the largest gathering of the open source data and AI community, returned as a hybrid event at the Moscone Center from June 27-30. With incredible breakout sessions by global financial services institutions (FSIs) like Northwestern Mutual, Nasdaq, and HSBC to mainstage keynotes from Intuit and Coinbase and live interactive demos from Databricks solution architects and partners like Avanade, Deloitte, and Confluent, we heard about innovation with data and AI in ways previously unimagined in financial services.

Financial Services Forum

For our financial services attendees, the most exciting part of Data + AI Summit was the Financial Services Forum – a two-hour event that brought together leaders from across global brands in banking, insurance, capital markets and fintech to share innovative data and AI use cases from their data journeys. With the theme of “The Future of Financial Services Is Open With Data and AI at Its Core,” Junta Nakai, Global Head of Financial Services and Sustainability Leader at Databricks, spoke of the need for FSIs to have a short and clear path to “AI in action” to preserve margins, find new revenue streams, and shorten development timelines, given today’s prolonged inflationary environment. He also gave an overview of the Databricks Lakehouse for Financial Services, a single platform that brings together all data and analytics workloads to power transformative innovations in modern financial services institutions.

In their opening keynote, TD Bank’s Paul Wellman (Executive Product Owner, Data as a Service) and Upal Hossain (AVP, Data as a Service), shared their data transformation journey and accelerated transition to the cloud, migrating over 100 million files and ~8,000 ETL jobs at petabyte scale with Delta Lake and the Azure cloud.

Later, Geping Chen, Head of Data Engineering from GEICO, talked about personalization and the use of Telematics IoT in auto insurance as the biggest upcoming trend in the financial services industry – among other exciting industry topics.

Attendees also learned best practices for achieving business outcomes with data + AI regarding people, process, and technology from Jack Berkowitz, Chief Data Officer, ADP; Jeff Parkinson, VP, Core Data Engineering, Northwestern Mutual; Ken Priyadarshi, AI Leader, EY Global; Gary Jones, Chief Data Engineer and Mona Soni, Chief Technology Officer, S&P Global Sustainable1; Arup Nanda, Managing Director, CTO Enterprise Cloud Data Ecosystem, JP Morgan; Christopher Darringer, Lead Engineer and Shraddha Shah, Data Engineer at Point 72 Asset Management.

Financial Services Breakout Sessions and Demos

Check out these financial services breakout sessions to hear from our customers about the business benefits, cost and productivity savings, and advanced analytics they’re now able to realize with Databricks:

5 Key Announcements That Will Transform the Financial Services Industry

1. Integrated data governance with Databricks Unity Catalog (GA expected in coming weeks after DAIS).

We announced the upcoming GA of Unity Catalog (UC), which allows customers to enable fine-grained access controls on data and meet their privacy requirements. Unity Catalog is the catalog and governance layer for Databricks Lakehouse and offers a range of capabilities, including:

  • The best way to secure access to data in Databricks across all compute/language types
  • The best way to allow for secure data sharing, powered by Delta Sharing on Databricks
  • A single source of truth for data and access control across Databricks workspaces
  • Easy to control with SQL based grants to give people/groups/principals access to data
  • API-driven to help with easy automation and workflow processes

For financial institutions, UC provides the ability to centralize catalog management at the account level (i.e., across multiple clouds). Unity Catalog also makes it easy to automate discovery and lineage – automated lineage is something even the biggest players in the cataloging space still struggle with.

2. MLflow 2.0 – Personalization gets a boost from model serving and MLflow Pipelines, making model development and deployment fast and scalable.

MLflow Pipelines enables data scientists to create production-grade ML pipelines that combine modular ML code with software engineering best practices to make model development and deployment fast and scalable. The new features around model monitoring will be impactful for FSIs. It’s common for financial institutions to have a significant number of models, especially considering the global scale of their retail and institutional businesses, and it becomes impossible to do proper model drift monitoring without an automated framework.

MLflow Pipelines will help improve model governance frameworks because FSIs can now apply CI/CD practices around constructing and managing ML model infrastructure setup.

3. Delta Lake will be fully open source.

Delta Lake 2.0 lowers the barrier to entry for adopting a Lakehouse architecture. As organizations think about on-prem or Hadoop migrations to the Databricks Lakehouse, they can use a consistent foundation to make the transition simpler. Moving to the Lakehouse has never been easier, and for workloads that are not yet on Databricks, organizations can demonstrate total cost of ownership (TCO) savings. Other benefits include:

  • With Delta Lake 2.0, users can now reap the benefits of better performance (3.5x better overall performance compared to other solutions) to save on computation costs
  • Delta Lake 2.0 offers an unrivaled level of maturity and proven real-world performance (663% increase in contributor strength over the past 3 years)
  • Delta Lake 2.0 will open source all APIs, including OPTIMIZE and ZORDER – FSIs are no longer forced to sacrifice performance or choose alternatives like Apache Iceberg due to limited functionality
  • Anyone can now achieve a simplified architecture with Delta’s all-encompassing framework – no need to leverage third-party services for features like data sharing

4. Next generation streaming–Project Lightspeed–is a game-changer for Financial Services, leveraging fresh data for insight generation.

This was one of the major streaming announcements Databricks has made, although streaming has always been a large and successful part of our business, where we continually improve performance to achieve higher throughput, lower latency, and lower cost. The announcement includes improved ecosystem support for connectors, enhanced functionality for processing data with new operators and APIs, and simplified deployment, operations, monitoring, and troubleshooting.

For example, two major themes in Financial Services that are driving business today are personalization and regulatory reporting. The first category includes use cases from personalized insurance pricing to next-best-action. Regulatory reporting may include trade reporting or clearing and settlement. The key to unlocking the use cases above is the ability to stream data sources and process them in near real-time. Any FSI looking to make advances on these use cases will require these cutting-edge, native streaming capabilities.

5. Databricks Marketplace helps data consumers turn data into insights more quickly and supports data providers’ growth as they distribute and monetize data assets.

Databricks Marketplace is an open marketplace for exchanging data assets such as datasets, notebooks, dashboards, and machine learning models. To accelerate insights, data consumers can discover, evaluate, and access more data products from third-party vendors than ever before.

Financial Services institutions can accelerate projects to monetize data and build alternative streams of revenue (e.g., monetizing unique datasets and models), reaching a broad audience more seamlessly. The Databricks Marketplace sets the stage for FSIs to accelerate treating data as an asset on the balance sheet.

Beyond these featured announcements, there were other exciting announcements like Databricks Serverless Model Endpoints, Project Enzyme, Delta Sharing and Data Cleanrooms. We encourage you to check out the Day 1 and Day 2 keynotes to learn more about our product announcements.

--

Try Databricks for free. Get started today.

The post Key Takeaways for Financial Services at Data + AI Summit 2022 appeared first on Databricks.

Treating Data and AI as a Product Delivers Accelerated Return on Capital


The outsized benefits of data and AI to the Manufacturing sector have been thoroughly documented. As a recent McKinsey study reported, the Manufacturing segment is projected to deliver $700B-$1,200B in value through data and AI in cost savings, productivity gains, and new revenue sources. For example, data-led manufacturing use cases powered by data and AI reduce stock replenishment forecasting error by 20-50%, increase total factory productivity by 50%, or lower scrap rates by 30%.

It shouldn’t be a surprise that the largest customers using the Databricks Manufacturing Lakehouse outperformed the overall market by over 200% over the last two years. What drove this success? These digitally-mature Lakehouse practitioners had:

  • more agile supply chains and profitable operations enabled by prescriptive and advanced analytical solutions that foresaw operational issues caused by COVID-19 disrupted supply chains.
  • advanced prescriptive analytics that promote uptime with prescriptive maintenance and supply chain integration.
  • new sources of revenue in this uncertain time.

Data + AI Summit 2022 featured several of these industry winners at the Manufacturing Industry Forum. These experts shared their experiences of how data and AI are transforming their businesses and delivering a stronger return on invested capital (ROIC). We’d like to highlight some of their insights shared during the event.

Manufacturing Industry Forum Keynote

Muthu Sabarethinam, Vice President, Enterprise Analytics & IT at Honeywell, kicked off the session with his keynote: The Future of Digital Transformation in Manufacturing. Part of his talk focused on how to approach a digital transformation project; in his own words: “start first with data contextualization in the digital transformation process,” meaning start by leveraging IT and OT data convergence to bring all relevant data in context to the users.

Citing that only 30% of projects are productionalized and escape POC Purgatory, he explored the use of AI to create data of value, offering the insight that AI has the potential to streamline data cleaning, mapping, and deduping. In his own words: “Use AI to create data, not data to create AI.”

He further explored this point by providing an example of how contextual information was leveraged to “fill in the gaps” in master data during Honeywell’s consolidation of fifty SAP systems to ten, which involved using AI to map, cleanse and dedupe data and led to significant reductions in effort. Using these techniques, Honeywell boosted its digital implementation success ratio to nearly 80%.

Key insights delivered for accelerating AI adoption and monetization:

  • Build your AI engine first, then feed other use cases.
  • Deliver persona-led data to your users.
  • Productize the offering, allowing products to change behavior through application-based services that overcome adoption challenges of immature offerings.

In summary, a key insight was, “don’t wait for the data to be there, use AI to create it”.

Muthu Sabarethinam (Honeywell), Aimee DeGrauwe (John Deere), Peter Conrardy (Collins Aerospace), Shiv Trisal (Databricks)

Manufacturing Industry Panel Discussion

Muthu Sabarethinam; Aimee DeGrauwe, Digital Product Manager of John Deere; and Peter Conrardy, Executive Director, Data and Digital Systems of Collins Aerospace, formed a panel hosted by Shiv Trisal (a Brickster of only three weeks) that discussed three timely topics in data and AI:

Data & AI investment in a challenging economic backdrop
The panel discussed how businesses are accelerating their use of data and AI amidst all the supply chain and economic uncertainty. Mr. Conrardy’s perspective: even in uncertain times, access to data is a constant, leading to initiatives that help gain more value from data. Ms. DeGrauwe echoed Peter’s perspective with: “we are seeking now to drive more AI into their connected products and double down on investment in infrastructure and workforce.” Shiv Trisal summarized the conversation with, “speed, move faster, commit to the vision and don’t wait, we have to do this”.

Data & AI driving sustainability outcomes
The panel members all agreed that sustainability is not a fad in manufacturing; rather, the basic principles of operational excellence and energy conservation are simply good business tactics. Ms. DeGrauwe commented, “our customers are intrinsically linked to the land” and “the [customer] desire to be environmentally sound has driven technologies like Deere’s See and Spray product, using machine vision as a foundational technology, to selectively identify and apply herbicide to weeds, reducing herbicide use by 75%”. “Deere is supporting sustainability by no longer managing operations at the farm level or field level but by moving down to the granular plant level, to do what plants need and no more”.

Mr. Sabarethinam looked at sustainability through a slightly different lens, providing insights into Honeywell’s organization, explaining that “it gives a sense of purpose” to the organization’s employees and that Honeywell’s products enable connected households and businesses, energy reduction, and fugitive emission capture – all of which are core tenets of sustainability.

Mr. Trisal summed the conversation up with his insight that we could miss a larger opportunity if we only thought about sustainability in the context of point solutions; we should also consider the effect on the organization and how sustainability percolates value from direct customers to their customers.

Measuring success of data & AI strategies

This topic explored a number of areas. Mr. Sabarethinam shared that a successful organization elevates the conversation to the senior levels, driving and managing it through measured financial data and analytics-driven measurements of hard, documented savings. Mr. Conrardy shared that data and analytics projects need to be treated like a product, where the customer and financial outcomes are deeply embedded in the project planning and execution. He pointed out that successful projects are typically funded by a department or business segment, as other business segments do not have “any skin in the game” to ensure success; a successful project is not done for free and has established metrics that are confirmed to ultimately deliver hard financial results to the business. Ms. DeGrauwe got an unexpected laugh when speaking about one of the challenges the John Deere team faces when teaching the organization what machine learning is and how it will benefit the business: a colleague said, “we’ll know success when they stop saying ‘just put it in the ML’,” as if ML were a special department, product, or mystical black box.

The Future

The panel finished the discussion by filling in this blank: “I could achieve 10x more value if I could solve for ______”. Mr. Conrardy suggested that solving for the Edge in the aviation segment is where he would concentrate, and humorously proposed outfitting the entire aircraft fleet with sensors at zero cost in zero time. Ms. DeGrauwe suggested that it all comes back to the data and the AI it produces: accessing good, clean data at reasonable cost in a repeatable fashion across a variety of disparate legacy systems will drive the advanced use cases that deliver outsized value. Mr. Sabarethinam reinforced his earlier comments that contextualizing data and delivering it to the right persona at the right time delivers outsized benefits.

Clearly, Ms. DeGrauwe, Mr. Conrardy, and Mr. Sabarethinam have deep industry experience and see a bright future for Manufacturing by leveraging data and AI. Their collective insights should help both those who are digitally mature and those just starting out in their digital transformation journeys achieve a measurably accelerated return on capital and improve their success ratio of digital projects by preventing them from falling into POC Purgatory. Each company is currently leveraging the Databricks Lakehouse Platform to run business-critical use cases, from predictive maintenance embedded in John Deere’s Expert Alerts, to seamless passenger journeys, to connected operating systems for buildings, plants, and energy management.

For more information on Databricks and these exciting product announcements, click here. Below are several manufacturing-centric Breakout Sessions from the Data + AI Summit:

Breakout Sessions
Why a Data Lakehouse is Critical During the Manufacturing Apocalypse – Corning
Predicting and Preventing Machine Downtime with AI and Expert Alerts – John Deere
How to Implement a Semantic Layer for Your Lakehouse – AtScale
Applied Predictive Maintenance in Aviation: Without Sensor Data – FedEx Express
Smart Manufacturing: Real-time Process Optimization with Databricks – Tredence

The Manufacturing Industry Forum

--

Try Databricks for free. Get started today.

The post Treating Data and AI as a Product Delivers Accelerated Return on Capital appeared first on Databricks.

How to Build a Marketing Analytics Solution Using Fivetran and dbt on the Databricks Lakehouse


Marketing teams use many different platforms to drive marketing and sales campaigns, which can generate a significant volume of valuable but disconnected data. Bringing all of this data together can help drive a large return on investment, as shown by Publicis Groupe, who were able to increase campaign revenue by as much as 50%.

The Databricks Lakehouse, which unifies data warehousing and AI use cases on a single platform, is the ideal place to build a marketing analytics solution: we maintain a single source of truth, and unlock AI/ML use cases. We also leverage two Databricks partner solutions, Fivetran and dbt, to unlock a wide range of marketing analytics use cases including churn and lifetime value analysis, customer segmentation, and ad effectiveness.

Fivetran and dbt can read and write to Delta Lake using a Databricks cluster or Databricks SQL warehouse

Fivetran allows you to easily ingest data from 50+ marketing platforms into Delta Lake without the need for building and maintaining complex pipelines. If any of the marketing platforms’ APIs change or break, Fivetran will take care of updating and fixing the integrations so your marketing data keeps flowing in.

dbt is a popular open source framework that lets lakehouse users build data pipelines using simple SQL. Everything is organized within directories, as plain text, making version control, deployment, and testability simple. Once the data is ingested into Delta Lake, we use dbt to transform, test and document the data. The transformed marketing analytics data mart built on top of the ingested data is then ready to be used to help drive new marketing campaigns and initiatives.

Both Fivetran and dbt are a part of Databricks Partner Connect, a one-stop portal to discover and securely connect data, analytics and AI tools directly within the Databricks platform. In just a few clicks you can configure and connect these tools (and many more) directly from within your Databricks workspace.

How to build a marketing analytics solution

In this hands-on demo, we will show how to ingest Marketo and Salesforce data into Databricks using Fivetran and then use dbt to transform, test, and document your marketing analytics data model.

All the code for the demo is available on Github in the workflows-examples repository.

dbt lineage graph showing data sources and models

The final dbt model lineage graph will look like this. The Fivetran source tables are in green on the left and the final marketing analytics models are on the right. By selecting a model, you can see the corresponding dependencies with the different models highlighted in purple.

Data ingestion using Fivetran

Fivetran has many marketing analytics data source connectors

Create new Salesforce and Marketo connections in Fivetran to start ingesting the marketing data into Delta Lake. When creating the connections Fivetran will also automatically create and manage a schema for each data source in Delta Lake. We will later use dbt to transform, clean and aggregate this data.

Define a destination schema in Delta Lake for the Salesforce data source

For the demo, name the schemas that will be created in Delta Lake marketing_salesforce and marketing_marketo. If the schemas do not exist, Fivetran will create them as part of the initial ingestion load.

Select which data source objects to synchronize as Delta Lake tables

You can then choose which objects to sync to Delta Lake, where each object will be saved as individual tables. Fivetran also makes it simple to manage and view what columns are being synchronized for each table:

Fivetran monitoring dashboard to monitor monthly active rows synchronized

Additionally, Fivetran provides a monitoring dashboard to analyze how many monthly active rows of data are synchronized daily and monthly for each table, among other useful statistics and logs.

Data modeling using dbt

Now that all the marketing data is in Delta Lake, you can use dbt to create your data model by following these steps.

Setup dbt project locally and connect to Databricks SQL

Set up your local dbt development environment in your chosen IDE by following the set-up instructions for dbt Core and dbt-databricks.

Scaffold a new dbt project and connect to a Databricks SQL warehouse using dbt init, which will ask for the following information.

$ dbt init
Enter a name for your project (letters, digits, underscore): 
Which database would you like to use?
[1] databricks
[2] spark

Enter a number: 1
host (yourorg.databricks.com): 
http_path (HTTP Path): 
token (dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX): 
schema (default schema that dbt will build objects in): 
threads (1 or more) [1]: 
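Your answers are written to the profiles.yml file that dbt uses to connect. As a rough sketch (the profile name, schema, and connection values below are placeholders; confirm the exact fields against the dbt-databricks documentation for your version), the resulting profile could look something like this:

marketing_analytics_demo:
  target: dev
  outputs:
    dev:
      type: databricks
      schema: mkt_analytics          # default schema dbt builds objects in
      host: yourorg.databricks.com
      http_path: /sql/1.0/warehouses/xxxxxxxxxxxxxxxx
      token: dapiXXXXXXXXXXXXXXXXXXXX
      threads: 4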

Once you have configured the profile you can test the connection using:

$ dbt debug

Install Fivetran dbt model packages for staging

The first step in using the Marketo and Salesforce data is to create the tables as sources for our model. Luckily, Fivetran has made this easy to get up and running with their pre-built Fivetran dbt model packages. For this demo, let’s make use of the marketo_source and salesforce_source packages.

To install the packages just add a packages.yml file to the root of your dbt project and add the marketo-source, salesforce-source and the fivetran-utils packages:

packages:
  - package: dbt-labs/spark_utils
    version: 0.3.0
  - package: fivetran/marketo_source
    version: [">=0.7.0", "<0.8.0"]
  - package: fivetran/salesforce_source
    version: [">=0.4.0", "<0.5.0"]

To download and use the packages run

 $ dbt deps

You should now see the Fivetran packages installed in the packages folder.

Update dbt_project.yml for Fivetran dbt models

There are a few configs in the dbt_project.yml file that you need to modify to make sure the Fivetran packages work correctly for Databricks.

The dbt_project.yml file can be found in the root folder of your dbt project.

spark_utils overriding dbt_utils macros

The Fivetran dbt models make use of macros in the dbt_utils package, but some of these macros need to be modified to work with Databricks, which is easily done using the spark_utils package.

It works by providing shims for certain dbt_utils macros. You enable this using the dispatch config in the dbt_project.yml file; with this in place, dbt will first search the spark_utils package when resolving macros from the dbt_utils namespace.

dispatch:
 - macro_namespace: dbt_utils
   search_order: ['spark_utils', 'dbt_utils']

Variables for the marketo_source and salesforce_source schemas

The Fivetran packages require you to define the catalog (referred to as the database in dbt) and the schema where the data lands when it is ingested by Fivetran.

Add these variables to the dbt_project.yml file with the correct catalog and schema names. The default catalog is hive_metastore, which will be used if the *_database variables are left blank. The schema names will be what you defined when creating the connections in Fivetran.

vars:
 marketo_source:
   marketo_database: # leave blank to use the default hive_metastore catalog
   marketo_schema: marketing_marketo
 salesforce_source:
   salesforce_database: # leave blank to use the default hive_metastore catalog
   salesforce_schema: marketing_salesforce

Target schema for Fivetran staging models

To avoid the staging tables created by the Fivetran source models landing in the default target schema, it can be useful to define a separate staging schema.

In the dbt_project.yml file, add the staging schema name; it will then be suffixed to the default target schema name.

models:
 marketo_source:
   +schema: your_staging_name # leave blank to use the default target_schema
 salesforce_source:
   +schema: your_staging_name # leave blank to use the default target_schema

Based on the above, if your target schema defined in profiles.yml is mkt_analytics, the schema used for marketo_source and salesforce_source tables will be mkt_analytics_your_staging_name.

Disable missing tables

At this stage you can run the Fivetran model packages to test that they work correctly.

$ dbt run --select marketo_source
$ dbt run --select salesforce_source

If any of the models fail due to missing tables (because you chose not to sync those tables in Fivetran), you can disable those models by updating the dbt_project.yml file.

For example, if the email bounced and email template history tables are missing from the Marketo source schema, you can disable the models for those tables by adding the following under the models config:

models:
 marketo_source:
   +schema: your_staging_name 
   tmp:
     stg_marketo__activity_email_bounced_tmp:
       +enabled: false
     stg_marketo__email_template_history_tmp:
       +enabled: false
   stg_marketo__activity_email_bounced:
     +enabled: false
   stg_marketo__email_template_history:
     +enabled: false

Developing the marketing analytics models

dbt lineage graph showing the star schema and aggregate tables data model

Now that the Fivetran packages have taken care of creating and testing the staging models, you can begin to develop the data models for your marketing analytics use cases: a star schema data model along with materialized aggregate tables.

For example, for the first marketing analytics dashboard, you may want to see how engaged certain companies and sales regions are by the number of email campaigns they have opened and clicked.

To do so, you can join Salesforce and Marketo tables using the Salesforce user email, Salesforce account_id and Marketo lead_id.

The models will be structured under the mart folder in the following way.

marketing_analytics_demo
|-- dbt_project.yml
|-- packages.yml
|-- models
      |-- mart
             |-- core
             |-- intermediate
             |-- marketing_analytics

You can view the code for all the models on Github in the /models/mart directory. Below, we describe what is in each folder, along with an example.

Core models

The core models are the facts and dimensions tables that will be used by all downstream models to build upon.

The dbt SQL code for the dim_user model:

with salesforce_users as (
   select
       account_id,
       email
   from {{ ref('stg_salesforce__user') }}
   where email is not null and account_id is not null
),
marketo_users as (
   select
       lead_id,
       email
   from {{ ref('stg_marketo__lead') }}
),
joined as (
   select
     lead_id,
     account_id
   from salesforce_users
     left join marketo_users
     on salesforce_users.email = marketo_users.email
)

select * from joined

You can also add documentation and tests for the models using a yaml file in the folder.

Two simple tests have been added in the core.yml file:

version: 2

models:
 - name: dim_account
   description: "The Account Dimension Table"
   columns:
     - name: account_id
       description: "Primary key"
       tests:
         - not_null
 - name: dim_user
   description: "The User Dimension Table"
   columns:
     - name: lead_id
       description: "Primary key"
       tests:
         - not_null

Intermediate models

Some of the final downstream models may rely on the same calculated metrics, so to avoid repeating SQL you can create intermediate models that can be reused.

The dbt SQL code for int_email_open_clicks_joined model:

with opens as (
	select * 
	from {{ ref('fct_email_opens') }} 
), clicks as (
	select * 
	from {{ ref('fct_email_clicks') }} 
), opens_clicks_joined as (

    select 
      o.lead_id as lead_id,
      o.campaign_id as campaign_id,
      o.email_send_id as email_send_id,
      o.activity_timestamp as open_ts,
      c.activity_timestamp as click_ts
    from opens as o 
      left join clicks as c 
      on o.email_send_id = c.email_send_id
      and o.lead_id = c.lead_id

)

select * from opens_clicks_joined

Marketing Analytics models

These are the final marketing analytics models that will be used to power the dashboards and reports used by marketing and sales teams.

The dbt SQL code for country_email_engagement model:

with accounts as (
	select 
        account_id,
        billing_country
	from {{ ref('dim_account') }}
), users as (
	select 
        lead_id,
        account_id
	from {{ ref('dim_user') }} 
), opens_clicks_joined as (

    select * from {{ ref('int_email_open_clicks_joined') }} 

), joined as (

	select * 
	from users as u
	left join accounts as a
	on u.account_id = a.account_id
	left join opens_clicks_joined as oc
	on u.lead_id = oc.lead_id

)

select 
	billing_country as country,
	count(open_ts) as opens,
	count(click_ts) as clicks,
	count(click_ts) / count(open_ts) as click_ratio
from joined
group by country

Run and test dbt models

Now that your models are ready, you can run all of them using

$ dbt run

And then run the tests using

$ dbt test

View the dbt docs and lineage graph

dbt lineage graph for the marketing analytics model

Once your models have run successfully, you can generate the docs and lineage graph using

$ dbt docs generate

To then view them locally run

$ dbt docs serve

Deploying dbt models to production

Once you have developed and tested your dbt models locally, you have multiple options for deploying them into production, one of which is the new dbt task type in Databricks Workflows (private preview).

Your dbt project should be managed and version controlled in a Git repository. You can create a dbt task in your Databricks Workflows job pointing to the Git repository.

Using a dbt task type in Databricks Workflows to orchestrate dbt

As you are using packages in your dbt project, the first command should be dbt deps, followed by dbt run for the first task and then dbt test for the next task.

Databricks Workflows job with two dependent dbt tasks

You can then run the workflow immediately using Run Now, or set the dbt project to run on a specified schedule.

Viewing the dbt logs for each dbt run

For each run, you can see the logs for each dbt command, helping you debug and fix any issues.

Powering your marketing analytics with Fivetran and dbt

As shown here, using Fivetran and dbt alongside the Databricks Lakehouse Platform allows you to build a powerful marketing analytics solution that is easy to set up and manage, and flexible enough to suit your data modeling requirements.

To build your own solution, visit the documentation for integrating Fivetran and dbt with Databricks, and reuse the marketing_analytics_demo project example to get started quickly.

The dbt task type in Databricks Workflows is in private preview. To try the dbt task type, please reach out to your Databricks account executive.

--

Try Databricks for free. Get started today.

The post How to Build a Marketing Analytics Solution Using Fivetran and dbt on the Databricks Lakehouse appeared first on Databricks.

Announcing the Preview of Serverless Compute for Databricks SQL on Azure Databricks


We are excited to announce the preview of Serverless compute for Databricks SQL (DBSQL) on Azure Databricks. DBSQL Serverless makes it easy to get started with data warehousing on the lakehouse. Serverless compute for DBSQL frees up time, lowers costs, and enables you to focus on delivering the most value to your business rather than managing infrastructure. In this blog post, we will go over the benefits of DBSQL Serverless and show how you can integrate with popular business intelligence tools such as Microsoft Power BI to get powerful analytics and insights from your data.

Serverless compute for DBSQL helps address challenges customers face with cluster startup time, capacity management, and infrastructure costs:

  • Instant and elastic: Serverless compute brings a truly elastic environment that’s instantly available and scales with your needs. You’ll benefit from simple usage-based pricing, without worrying about idle time charges. Imagine no longer needing to wait for clusters to become available to run queries or overprovisioning resources to handle spikes in usage. Databricks SQL Serverless dynamically grows and shrinks resources to handle whatever workload you throw at it.
  • Eliminate management overhead: Serverless transforms DBSQL into a fully managed service, eliminating the burden of capacity management, patching, upgrading, and performance optimization of the cluster. You only need to focus on your data and the insights it holds. Additionally, the simplified pricing model means there’s only one bill to track and only one place to check costs.
  • Lower infrastructure costs: Under the covers, the serverless compute platform uses machine learning algorithms to provision and scale compute resources right when you need them. This enables substantial cost savings without the need to manually shut down clusters.

Using Databricks SQL Serverless with Power BI

Once your administrator enables Serverless for your Azure Databricks workspace, you will see the Serverless option when creating a SQL warehouse.

This short video shows how you can create a Serverless SQL warehouse and connect it to Power BI. The seamless integration enables you to use Databricks SQL and Power BI to analyze, visualize and derive insights from your data instantly without worrying about managing your infrastructure.
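Power BI is just one consumer; any tool or script that can reach a Databricks SQL warehouse can reuse the same Server Hostname and HTTP Path from the warehouse’s connection details. As a minimal sketch for sanity-checking a new Serverless SQL warehouse from Python (the hostname, HTTP path, and token below are placeholders), the open source databricks-sql-connector can be used:

from databricks import sql  # pip install databricks-sql-connector

# Placeholder connection details: copy the Server Hostname and HTTP Path from
# the SQL warehouse's connection details page, and use a personal access token.
with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abcdef0123456789",
    access_token="dapiXXXXXXXXXXXXXXXX",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT current_date() AS today")
        print(cursor.fetchall())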

Get Started

Serverless compute for Databricks SQL will be rolling out over the next few days to Azure East US 2, Azure East US, Azure West Europe. Please submit a request to start using DBSQL Serverless on Azure. For instructions on connecting Power BI with Databricks SQL warehouse, visit the Power BI documentation page.

--

Try Databricks for free. Get started today.

The post Announcing the Preview of Serverless Compute for Databricks SQL on Azure Databricks appeared first on Databricks.

Announcing Photon Engine General Availability on the Databricks Lakehouse Platform


We are pleased to announce that Photon, the record-setting next-generation query engine for lakehouse systems, is now generally available on Databricks across all major cloud platforms. Photon, built from the ground up by the original creators of Apache Spark™ and fully compatible with modern Spark workloads, delivers fast performance with lower TCO on cloud hardware for all data use cases.

Since its launch two years ago, Photon has processed exabytes of data, run billions of queries, delivered benchmark-setting price/performance at up to 12x better than traditional cloud data warehouses, and received a prestigious award.

While the initial focus of Photon was on SQL to enable data warehousing workloads on your existing data lakes, we have expanded the coverage of languages (e.g. Python, Scala, Java, and R) and workloads (e.g. data engineering, analytics, and data science) to reflect modern DataFrame and SparkSQL workloads.

As a result, customers like AT&T have seen dramatic infrastructure cost savings and speed-ups on Photon not only via Databricks SQL Warehouse – but also for data ingestion, ETL, streaming, and interactive queries on the traditional Databricks Workspaces:

  • Up to 80% TCO cost savings (30% on average) with Photon over traditional Databricks Runtime (Apache SparkTM), and up to 85% reduction in VM compute hours (50% on average)
  • Up to 5x lower latency for ⅕ of the compute using Delta Live Tables with Photon
  • 3-8x faster queries on interactive SQL workloads

Furthermore, in a recent survey of 400 preview customers, 90% reported faster query execution in the workspace, and 87% said they can get more work done due to the increase in performance, allowing them to iterate and deliver business value faster.

Radical speed on the Databricks Lakehouse Platform with Photon

What’s new in Photon with GA?

While Photon GA has many amazing features, we’d like to emphasize the following:

  • Fast and Robust Sort: Using vectorized sort in Photon, customers have seen 3-20x performance gain during preview, which is significantly faster than in Apache Spark™.
  • Accelerated Window Functions: Functions that perform calculations across a set of table rows, for use cases such as aggregations, moving averages, or deduplication, have been reported to speed up 2-3x during preview.
  • Accelerated Structured Streaming: Photon now supports stateless Structured Streaming workloads. During the preview, customers who’ve had streaming jobs reported a 5x decrease in cost.

Getting Started

Follow our docs to get started with Photon, and watch our Data + AI Summit talk to dive in!
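If you provision clusters through the REST API rather than the UI, Photon can be requested at cluster creation time. The following is a minimal sketch only; the runtime_engine field and every value below (workspace URL, token, runtime version, node type) are placeholders that should be checked against the Clusters API documentation for your workspace:

import requests

# Placeholders: replace with your workspace URL and a personal access token.
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXX"

cluster_spec = {
    "cluster_name": "photon-demo",
    "spark_version": "11.3.x-scala2.12",   # pick a Photon-capable runtime version
    "node_type_id": "Standard_DS3_v2",      # cloud-specific instance type
    "num_workers": 2,
    "runtime_engine": "PHOTON",             # request the Photon engine
    "autotermination_minutes": 30,
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json()["cluster_id"])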

--

Try Databricks for free. Get started today.

The post Announcing Photon Engine General Availability on the Databricks Lakehouse Platform appeared first on Databricks.

Monte Carlo and Databricks Partner to Help Companies Build More Reliable Data Lakehouses


This is a collaborative post between Monte Carlo and Databricks. We thank Matt Sulkis, Head of Partnerships, Monte Carlo, for his contributions.

 
As companies increasingly leverage data-driven insights to innovate and maintain their competitive edge, it’s essential that this data is accurate and reliable. With Monte Carlo and Databricks’ partnership, teams can trust their data through end-to-end data observability across their lakehouse environments.

Has your CTO ever told you that the numbers in a report you showed her looked way off?

Has a data scientist ever pinged you when a critical Spark job failed to run?

What about a rise in a field’s null rate that went unnoticed for days or weeks until it caused a significant error in an ML model downstream?

You’re not alone if you answered yes to any of these questions. Data downtime, in other words, periods of time when data is missing, inaccurate, or otherwise erroneous, is an all-too-familiar reality for even the best data teams. It costs millions of dollars in wasted revenue and up to 50 percent of a data engineering team’s time that could be spent building data products and ML models that move the needle for the business.

To help companies accelerate the adoption of more reliable data products, Monte Carlo and Databricks are excited to announce our partnership, bringing end-to-end data observability and data quality automation tools to the data lakehouse. Data engineering and analytics teams that depend on Databricks to derive critical insights about their business and build ML models can now leverage the power of automated data observability and monitoring to prevent bad data from affecting downstream consumers.

Achieving reliable Databricks pipelines with data observability

With our new partnership and updated integration, Monte Carlo provides full, end-to-end coverage across data lake and lakehouse environments powered by Databricks.

Over the past few years Databricks has established the lakehouse category, revolutionizing how organizations store and process their data at an unprecedented scale across nearly infinite use cases. Cloud data lakes like Delta Lake have gotten so powerful (and popular) that according to Mordor Intelligence, the data lake market is expected to grow from $3.74 billion in 2020 to $17.60 billion by 2026, a compound annual growth rate of nearly 30%.

Monte Carlo itself is built on the Databricks Lakehouse Platform, enabling our data and engineering teams to build and train our anomaly detection models at unprecedented speed and scale. Building on top of Databricks has allowed us to focus on our core value of improving observability and quality of data for our customers while leveraging the automation, infrastructure management, and analytics scaling tools of the lakehouse. This makes our resources more efficient and better able to serve our customers’ data quality needs. As our business grows, we are confident it will scale with Databricks and enhance the value of our core offering.

Now, with Monte Carlo and Databricks’ partnership, data teams can ensure that these investments are leveraging reliable, accurate data at each stage of the pipeline.

“As data pipelines become increasingly complex and companies ingest more and more data, often from third-party sources, it’s paramount that this data is reliable,” said Barr Moses, co-founder and CEO of Monte Carlo. “Monte Carlo is excited to partner with Databricks to help companies trust their data through end-to-end data observability across their lakehouse.”

With Monte Carlo, data teams get complete Databricks Lakehouse Platform coverage no matter the metastore.

Coupled with our new Databricks Unity Catalog and Delta Lake integrations, this partnership will make it easier for organizations to take full advantage of Monte Carlo’s data quality monitoring, alerting, and root cause analysis functionalities. Simultaneously, Monte Carlo customers will benefit from Databricks’ speed, scale, and flexibility. With Databricks, analytics or machine learning tasks that previously took hours or even days to complete can now be delivered in minutes, making it faster and more scalable to build impactful data products for the business.

Here’s how teams on Databricks and Monte Carlo can benefit from our strategic partnership:

  • Achieve end-to-end data observability across your Databricks Lakehouse Platform without writing code. Get full, automated coverage across your data pipelines with a low-code implementation process. Access out-of-the-box visibility into data Freshness, Volume, Distribution, Schema, and Lineage by plugging Monte Carlo into your lakehouse.
  • Know when data breaks, as soon as it happens. Monte Carlo continuously monitors your Databricks assets and proactively alerts stakeholders to data issues. Monte Carlo’s machine learning-first approach gives data teams complete coverage for freshness, volume, and schema changes, while opt-in distribution monitors and business-context-specific checks layered on top ensure you’re covered at each stage of your data pipeline.
  • Find the root cause of data quality issues fast. Pre-built machine learning-based monitoring and anomaly detection save time and resources, giving teams a single pane of glass to investigate and resolve data issues. By bringing all information and context for your pipelines into one place, teams spend less time firefighting data issues and more time innovating for the business.
  • Immediately understand the business impact of bad data. With end-to-end Spark lineage built on Unity Catalog, covering your pipelines from the point data enters Databricks (or further upstream!) down to the business intelligence layer, data teams can triage and assess the business impact of their data issues, reducing risk and improving productivity throughout the organization.
  • Prevent data downtime. Give your teams complete visibility into your Databricks pipelines and how they impact downstream reports and dashboards to make more informed development decisions. With Monte Carlo, teams can better manage breaking changes to ELTs, Spark models, and BI assets by knowing what is impacted and who to notify.

In addition to supporting existing mutual customers, Monte Carlo provides end-to-end, automated coverage for teams migrating from their legacy stacks to Databricks Lakehouse Platform. Moreover, Monte Carlo’s security-first approach to data observability ensures that data never leaves your Databricks Lakehouse Platform.

Monte Carlo can automatically monitor and alert for data schema, volume, freshness, and distribution anomalies within the Databricks Lakehouse Platform.

What our mutual customers have to say

Monte Carlo and Databricks customers like ThredUp, a leading online consignment marketplace, and Ibotta, a global cashback and rewards app, are excited to leverage the new Delta Lake and Unity Catalog integrations to improve data reliability at scale across their lakehouse environments.

ThredUp’s data engineering teams leverage Monte Carlo’s capabilities to know where and how their data breaks in real-time. The solution has enabled ThredUp to immediately identify bad data before it affects the business, saving them time and resources on manually firefighting data downtime.

“With Monte Carlo, my team is better positioned to understand the impact of a detected data issue and decide on the next steps like stakeholder communication and resource prioritization. Monte Carlo’s end-to-end lineage helps the team draw these connections between critical data tables and the Looker reports, dashboards, and KPIs the company relies on to make business decisions,” said Satish Rane, Head of Data Engineering, ThredUp. “I’m excited to leverage Monte Carlo’s data observability for our Databricks environment.”

At Ibotta, Head of Data, Jeff Hepburn and his team rely on Monte Carlo to deliver end-to-end visibility into the health of their data pipelines, starting with ingestion in Databricks down to the business intelligence layer.

“Data-driven decision making is a huge priority for Ibotta, but our analytics are only as reliable as the data that informs them. With Monte Carlo, my team has the tools to detect and resolve data incidents before they affect downstream stakeholders, and their end-to-end lineage helps us understand the inner workings of our data ecosystem so that if issues arise, we know where and how to fix them,” said Jeff Hepburn, Head of Data, Ibotta. “I’m excited to leverage Monte Carlo’s data observability with Databricks.”

Pioneering the future of data observability for data lakes

These updates enable teams to leverage Databricks for data engineering, data science, and machine learning use cases to prevent data downtime at scale.

When it comes to ensuring data reliability on the lakehouse, Monte Carlo and Databricks are better together. For more details on how to execute these integrations, see our documentation.

--

Try Databricks for free. Get started today.

The post Monte Carlo and Databricks Partner to Help Companies Build More Reliable Data Lakehouses appeared first on Databricks.

How to Build a Marketing Analytics Solution Using Fivetran and dbt on the Databricks Lakehouse

Marketing teams use many different platforms to drive marketing and sales campaigns, which can generate a significant volume of valuable but disconnected data. Bringing all of this data together can drive a large return on investment, as shown by Publicis Groupe, which was able to increase campaign revenue by as much as 50%.

The Databricks Lakehouse, which unifies data warehousing and AI use cases on a single platform, is the ideal place to build a marketing analytics solution: we maintain a single source of truth and unlock AI/ML use cases. We also leverage two Databricks partner solutions, Fivetran and dbt, to unlock a wide range of marketing analytics use cases including churn and lifetime value analysis, customer segmentation, and ad effectiveness.

Fivetran and dbt can read and write to Delta Lake using a Databricks cluster or Databricks SQL warehouse

Fivetran allows you to easily ingest data from 50+ marketing platforms into Delta Lake without the need for building and maintaining complex pipelines. If any of the marketing platforms’ APIs change or break, Fivetran will take care of updating and fixing the integrations so your marketing data keeps flowing in.

dbt is a popular open source framework that lets lakehouse users build data pipelines using simple SQL. Everything is organized within directories, as plain text, making version control, deployment, and testability simple. Once the data is ingested into Delta Lake, we use dbt to transform, test and document the data. The transformed marketing analytics data mart built on top of the ingested data is then ready to be used to help drive new marketing campaigns and initiatives.

Both Fivetran and dbt are a part of Databricks Partner Connect, a one-stop portal to discover and securely connect data, analytics and AI tools directly within the Databricks platform. In just a few clicks you can configure and connect these tools (and many more) directly from within your Databricks workspace.

How to build a marketing analytics solution

In this hands-on demo, we will show how to ingest Marketo and Salesforce data into Databricks using Fivetran and then use dbt to transform, test, and document your marketing analytics data model.

All the code for the demo is available on Github in the workflows-examples repository.

dbt lineage graph showing data sources and models

The final dbt model lineage graph is shown above. The Fivetran source tables are in green on the left and the final marketing analytics models are on the right. By selecting a model, you can see its dependencies, with the related models highlighted in purple.

Data ingestion using Fivetran

Fivetran has many marketing analytics data source connectors

Create new Salesforce and Marketo connections in Fivetran to start ingesting the marketing data into Delta Lake. When creating the connections Fivetran will also automatically create and manage a schema for each data source in Delta Lake. We will later use dbt to transform, clean and aggregate this data.

Define a destination schema in Delta Lake for the Salesforce data source

For this demo, name the schemas that will be created in Delta Lake marketing_salesforce and marketing_marketo. If the schemas do not exist, Fivetran will create them as part of the initial ingestion load.

Select which data source objects to synchronize as Delta Lake tables

You can then choose which objects to sync to Delta Lake, where each object will be saved as an individual table. Fivetran also makes it simple to manage and view which columns are being synchronized for each table:

Fivetran monitoring dashboard to monitor monthly active rows synchronized

Additionally, Fivetran provides a monitoring dashboard to analyze how many monthly active rows of data are synchronized daily and monthly for each table, among other useful statistics and logs.

Data modeling using dbt

Now that all the marketing data is in Delta Lake, you can use dbt to create your data model by following these steps.

Setup dbt project locally and connect to Databricks SQL

Set up your local dbt development environment in your chosen IDE by following the set-up instructions for dbt Core and dbt-databricks.
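If you are working in a fresh Python environment, installing the dbt-databricks adapter will also pull in dbt Core as a dependency, so a single install command is typically enough:

$ pip install dbt-databricks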

Scaffold a new dbt project and connect to a Databricks SQL warehouse using dbt init, which will ask for the following information:

$ dbt init
Enter a name for your project (letters, digits, underscore): 
Which database would you like to use?
[1] databricks
[2] spark

Enter a number: 1
host (yourorg.databricks.com): 
http_path (HTTP Path): 
token (dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX): 
schema (default schema that dbt will build objects in): 
threads (1 or more) [1]: 
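
Behind the scenes, dbt init writes these answers to a profiles.yml file (by default under ~/.dbt). A minimal profile for this project might look like the following; the project name, schema, and placeholder host, HTTP path, and token are illustrative, so substitute the connection details from your own SQL warehouse.

marketing_analytics_demo:
  target: dev
  outputs:
    dev:
      type: databricks
      host: yourorg.databricks.com
      http_path: /sql/1.0/endpoints/xxxxxxxxxxxxxxxx
      token: dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      schema: mkt_analytics
      threads: 4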

Once you have configured the profile, you can test the connection using:

$ dbt debug

Install Fivetran dbt model packages for staging

The first step in using the Marketo and Salesforce data is to create the tables as sources for our model. Luckily, Fivetran makes this easy with its pre-built Fivetran dbt model packages. For this demo, let’s make use of the marketo_source and salesforce_source packages.

To install the packages, add a packages.yml file to the root of your dbt project listing the marketo_source, salesforce_source, and spark_utils packages:

packages:
  - package: dbt-labs/spark_utils
    version: 0.3.0
  - package: fivetran/marketo_source
    version: [">=0.7.0", "<0.8.0"]
  - package: fivetran/salesforce_source
    version: [">=0.4.0", "<0.5.0"]

To download and install the packages, run:

 $ dbt deps

You should now see the Fivetran packages installed in the packages folder.

Update dbt_project.yml for Fivetran dbt models

There are a few configs in the dbt_project.yml file that you need to modify to make sure the Fivetran packages work correctly for Databricks.

The dbt_project.yml file can be found in the root folder of your dbt project.

spark_utils overriding dbt_utils macros

The Fivetran dbt models make use of macros in the dbt_utils package, but some of these macros need to be modified to work with Databricks, which is easily done using the spark_utils package.

It works by providing shims for certain dbt_utils macros. You enable these shims using the dispatch config in the dbt_project.yml file, so that dbt will first search for macros in the spark_utils package when resolving macros from the dbt_utils namespace.

dispatch:
 - macro_namespace: dbt_utils
   search_order: ['spark_utils', 'dbt_utils']

Variables for the marketo_source and salesforce_source schemas

The Fivetran packages require you to define the catalog (referred to as database in dbt) and the schema where the data lands when it is ingested by Fivetran.

Add these variables to the dbt_project.yml file with the correct catalog and schema names. The default catalog is hive_metastore, which will be used if the database variables are left blank. The schema names will be those you defined when creating the connections in Fivetran.

vars:
 marketo_source:
   marketo_database: # leave blank to use the default hive_metastore catalog
   marketo_schema: marketing_marketo
 salesforce_source:
   salesforce_database: # leave blank to use the default hive_metastore catalog
   salesforce_schema: marketing_salesforce

Target schema for Fivetran staging models

To avoid the staging tables created by the Fivetran source models being placed in the default target schema, it can be useful to define a separate staging schema.

In the dbt_project.yml file, add the staging schema name; it will then be suffixed to the default schema name.

models:
 marketo_source:
   +schema: your_staging_name # leave blank to use the default target_schema
 salesforce_source:
   +schema: your_staging_name # leave blank to use the default target_schema

Based on the above, if your target schema defined in profiles.yml is mkt_analytics, the schema used for marketo_source and salesforce_source tables will be mkt_analytics_your_staging_name.

Disable missing tables

At this stage you can run the Fivetran model packages to test that they work correctly.

dbt run --select marketo_source
dbt run --select salesforce_source

If any of the models fail due to missing tables (because you chose not to sync those tables in Fivetran), you can disable those models by updating the dbt_project.yml file.

For example, if the email bounced and email template tables are missing from the Marketo source schema, you can disable the models for those tables by adding the following under the models config:

models:
 marketo_source:
   +schema: your_staging_name 
   tmp:
     stg_marketo__activity_email_bounced_tmp:
       +enabled: false
     stg_marketo__email_template_history_tmp:
       +enabled: false
   stg_marketo__activity_email_bounced:
     +enabled: false
   stg_marketo__email_template_history:
     +enabled: false

Developing the marketing analytics models

dbt lineage graph showing the star schema and aggregate tables data model

Now that the Fivetran packages have taken care of creating and testing the staging models, you can begin to develop the data models for your marketing analytics use cases: a star schema data model along with materialized aggregate tables.

For example, for the first marketing analytics dashboard, you may want to see how engaged certain companies and sales regions are by the number of email campaigns they have opened and clicked.

To do so, you can join Salesforce and Marketo tables using the Salesforce user email, Salesforce account_id and Marketo lead_id.

The models will be structured under the mart folder in the following way.

marketing_analytics_demo
|-- dbt_project.yml
|-- packages.yml
|-- models
      |-- mart
             |-- core
             |-- intermediate
             |-- marketing_analytics

You can view the code for all the models on GitHub in the /models/mart directory. The sections below describe what is in each folder, along with an example.

Core models

The core models are the facts and dimensions tables that will be used by all downstream models to build upon.

The dbt SQL code for the dim_user model:

with salesforce_users as (
   select
       account_id,
       email
   from {{ ref('stg_salesforce__user') }}
   where email is not null and account_id is not null
),
marketo_users as (
   select
       lead_id,
       email
   from {{ ref('stg_marketo__lead') }}
),
joined as (
   select
     lead_id,
     account_id
   from salesforce_users
     left join marketo_users
     on salesforce_users.email = marketo_users.email
)

select * from joined

You can also add documentation and tests for the models using a yaml file in the folder.

Two simple tests have been added in the core.yml file:

version: 2

models:
 - name: dim_account
   description: "The Account Dimension Table"
   columns:
     - name: account_id
       description: "Primary key"
       tests:
         - not_null
 - name: dim_user
   description: "The User Dimension Table"
   columns:
     - name: lead_id
       description: "Primary key"
       tests:
         - not_null
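
While iterating on these models, you can also run just the core tests rather than the whole project:

$ dbt test --select dim_account dim_user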

Intermediate models

Some of the final downstream models may rely on the same calculated metrics, so to avoid repeating SQL you can create intermediate models that can be reused.

The dbt SQL code for the int_email_open_clicks_joined model:

with opens as (
	select * 
	from {{ ref('fct_email_opens') }} 
), clicks as (
	select * 
	from {{ ref('fct_email_clicks') }} 
), opens_clicks_joined as (

    select 
      o.lead_id as lead_id,
      o.campaign_id as campaign_id,
      o.email_send_id as email_send_id,
      o.activity_timestamp as open_ts,
      c.activity_timestamp as click_ts
    from opens as o 
      left join clicks as c 
      on o.email_send_id = c.email_send_id
      and o.lead_id = c.lead_id

)

select * from opens_clicks_joined

Marketing Analytics models

These are the final marketing analytics models that will be used to power the dashboards and reports used by marketing and sales teams.

The dbt SQL code for the country_email_engagement model:

with accounts as (
	select 
        account_id,
        billing_country
	from {{ ref('dim_account') }}
), users as (
	select 
        lead_id,
        account_id
	from {{ ref('dim_user') }} 
), opens_clicks_joined as (

    select * from {{ ref('int_email_open_clicks_joined') }} 

), joined as (

	select * 
	from users as u
	left join accounts as a
	on u.account_id = a.account_id
	left join opens_clicks_joined as oc
	on u.lead_id = oc.lead_id

)

select 
	billing_country as country,
	count(open_ts) as opens,
	count(click_ts) as clicks,
	count(click_ts) / count(open_ts) as click_ratio
from joined
group by country
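
One small refinement worth considering (not part of the original model) is guarding the ratio for countries with no recorded opens, so the division explicitly yields NULL rather than relying on the engine's divide-by-zero behavior:

count(click_ts) / nullif(count(open_ts), 0) as click_ratio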

Run and test dbt models

Now that your models are ready, you can run them all using:

$ dbt run

Then run the tests using:

$ dbt test

View the dbt docs and lineage graph

dbt lineage graph for the marketing analytics model

Once your models have run successfully, you can generate the docs and lineage graph using:

$ dbt docs generate

To view them locally, run:

$ dbt docs serve

Deploying dbt models to production

Once you have developed and tested your dbt models locally, you have multiple options for deploying them into production, one of which is the new dbt task type in Databricks Workflows (private preview).

Your dbt project should be managed and version controlled in a Git repository. You can create a dbt task in your Databricks Workflows job pointing to the Git repository.

Using a dbt task type in Databricks Workflows to orchestrate dbt

As you are using packages in your dbt project, the first command should be dbt deps, followed by dbt run for the first task and dbt test for the second task, as sketched below.
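
Concretely, one reasonable split of commands across the two dependent tasks looks like this (repeating dbt deps in the second task is an assumption on our part, since each task may start from a fresh environment):

First task (build):
  dbt deps
  dbt run

Second task (test):
  dbt deps
  dbt test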

Databricks Workflows job with two dependent dbt tasks

You can then run the workflow immediately using Run Now, or set up a schedule so the dbt project runs at the times you specify.

Viewing the dbt logs for each dbt run

For each run you can see the logs for each dbt command, helping you debug and fix any issues.

Powering your marketing analytics with Fivetran and dbt

As shown here, using Fivetran and dbt alongside the Databricks Lakehouse Platform allows you to build a powerful marketing analytics solution that is easy to set up and manage, and flexible enough to suit any of your data modeling requirements.

To build your own solution, visit the documentation for integrating Fivetran and dbt with Databricks and reuse the marketing_analytics_demo project example to get started quickly.

The dbt task type in Databricks Workflows is in private preview. To try the dbt task type, please reach out to your Databricks account executive.

--

Try Databricks for free. Get started today.

The post How to Build a Marketing Analytics Solution Using Fivetran and dbt on the Databricks Lakehouse appeared first on Databricks.


Announcing the Preview of Serverless Compute for Databricks SQL on Azure Databricks

We are excited to announce the preview of Serverless compute for Databricks SQL (DBSQL) on Azure Databricks. DBSQL Serverless makes it easy to get started with data warehousing on the lakehouse. Serverless compute for DBSQL frees up time, lowers costs, and enables you to focus on delivering the most value to your business rather than managing infrastructure. In this blog post, we will go over the benefits of DBSQL Serverless and show how you can integrate with popular business intelligence tools such as Microsoft Power BI to get powerful analytics and insights from your data.

Serverless compute for DBSQL helps address challenges customers face with cluster startup time, capacity management, and infrastructure costs:

  • Instant and elastic: Serverless compute brings a truly elastic environment that’s instantly available and scales with your needs. You’ll benefit from simple usage-based pricing, without worrying about idle time charges. Imagine no longer needing to wait for clusters to become available to run queries or overprovisioning resources to handle spikes in usage. Databricks SQL Serverless dynamically grows and shrinks resources to handle whatever workload you throw at it.
  • Eliminate management overhead: Serverless transforms DBSQL into a fully managed service, eliminating the burden of capacity management, patching, upgrading, and performance optimization of the cluster. You only need to focus on your data and the insights it holds. Additionally, the simplified pricing model means there’s only one bill to track and only one place to check costs.
  • Lower infrastructure costs: Under the covers, the serverless compute platform uses machine learning algorithms to provision and scale compute resources right when you need them. This enables substantial cost savings without the need to manually shut down clusters.

Using Databricks SQL Serverless with Power BI

Once your administrator enables Serverless for your Azure Databricks workspace, you will see the Serverless option when creating a SQL warehouse.

This short video shows how you can create a Serverless SQL warehouse and connect it to Power BI. The seamless integration enables you to use Databricks SQL and Power BI to analyze, visualize and derive insights from your data instantly without worrying about managing your infrastructure.

Get Started

Serverless compute for Databricks SQL will be rolling out over the next few days to Azure East US 2, Azure East US, and Azure West Europe. Please submit a request to start using DBSQL Serverless on Azure. For instructions on connecting Power BI with a Databricks SQL warehouse, visit the Power BI documentation page.

--

Try Databricks for free. Get started today.

The post Announcing the Preview of Serverless Compute for Databricks SQL on Azure Databricks appeared first on Databricks.

New Solution Accelerator: Customer Entity Resolution

Check out our new Customer Entity Resolution Solution Accelerator for more details and to download the notebooks.

A growing number of customers now expect personalized interactions as part of their shopping experience. Whether browsing in-app, receiving offers via electronic mail or being pursued by online advertisements, more and more people expect the brands with which they interact to recognize their individual needs and preferences and to tailor the engagement accordingly. In fact, 76% of consumers are more likely to consider buying from a brand that personalizes. And as organizations pursue omnichannel excellence, these same high expectations are extending into the in-store experience through digitally-assisted employee interactions, offers of specialized in-person services and more. In an age of shopper choice, more and more, retailers are getting the message that personalized engagement is becoming fundamental to attracting and retaining customer spend.

The key to getting personalized interactions right is deriving actionable insights from every bit of information that can be gathered about a customer. First-party data generated through sales transactions, website browsing, product ratings and surveys, customer surveys and support center calls, third-party data purchased from data aggregators and online trackers, and even zero-party data provided by customers themselves come together to form a 360-degree view of the customer. While conversations about Customer-360 platforms tend to focus on the volume and variety of data with which the organization must work and the range of data science use cases often applied to them, the reality is a Customer-360 view cannot be achieved without establishing a common customer identity, linking together customer records across the disparate datasets.

Matching Customer Records Is Challenging

On the surface, the idea of determining a common customer identity across systems seems pretty straightforward. But between different data sources with different data types, it is rare that a unique identifier is available to support record linking. Instead, most data sources have their own identifiers which are translated into basic name and address information to support cross-dataset record matching. Putting aside the challenge that customer attributes, and therefore data, may change over time, automated matching on names and addresses can be incredibly challenging due to non-standard formats and common data interpretation and entry errors.

Take for instance the name of one of our authors: Bryan. This name has been recorded in various systems as Bryan, Brian, Ryan, Byron and even Brain. If Bryan lives at 123 Main Street, he might find this address entered as 123 Main Street, 123 Main St or 123 Main across various systems, all of which are perfectly valid even if inconsistent.

To a human interpreter, records with common variations of a customer’s name and generally accepted variations of an address are pretty easy to match. But to match the millions of customer identities most retail organizations are confronted with, we need to lean on software to automate the process. Most first attempts tend to capture human knowledge of known variations in rules and patterns to match those records, but this often leads to an unmanageable and sometimes unpredictable web of software logic. To avoid this, more and more organizations facing the challenge of matching customers based on variable attributes find themselves turning to machine learning.

Machine Learning Provides a Scalable Approach

In a machine learning (ML) approach to entity resolution, text attributes like name, address, phone number, etc. are translated into numerical representations that can be used to quantify the degree of similarity between any two attribute values. Models are then trained to weigh the relative importance of each of these scores in determining if a pair of records is a match.

For example, slight differences between the spelling of a first name may be given less importance if a perfect match between something like a phone number is found. In some ways, this approach mirrors the natural tendencies humans use when examining records, while being far more scalable and consistent when applied across a large dataset.
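
As a toy illustration of this idea (not Zingg's actual implementation), the Python sketch below scores a pair of records on name and phone similarity using only the standard library and combines the scores with hand-set weights; in a real workflow, those weights are what the trained model learns from labeled pairs.

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Return a 0.0-1.0 similarity score for two strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    # Weighted combination of per-attribute similarities.
    # The weights here are illustrative; a trained model learns them.
    name_sim = similarity(rec_a["name"], rec_b["name"])
    phone_sim = similarity(rec_a["phone"], rec_b["phone"])
    return 0.4 * name_sim + 0.6 * phone_sim

a = {"name": "Bryan Smith", "phone": "555-0100"}
b = {"name": "Brian Smith", "phone": "555-0100"}
print(match_score(a, b))  # close to 1.0: exact phone match, near-identical names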

That said, our ability to train such a model depends on our access to accurately labeled training data, i.e. pairs of records reviewed by experts and labeled as either a match or not a match. Ultimately, this is data we know is correct that our model can learn from. In the early phase of most ML-based approaches to entity resolution, a relatively small subset of pairs likely to be a match for each other is assembled, annotated and fed to the model algorithm. It’s a time-consuming exercise, but if done right, the model learns to reflect the judgements of the human reviewers.

With a trained model in-hand, our next challenge is to efficiently locate the record pairs worth comparing. A simplistic approach to record comparison would be to compare each record to every other one in the dataset. While straightforward, this brute-force approach results in an explosion of comparisons that quickly gets computationally out of hand.

A more intelligent approach is to recognize that similar records will have similar numerical scores assigned to their attributes. By limiting comparisons to just those records within a given distance (based on differences in these scores) from one another, we can rapidly locate just the worthwhile comparisons, i.e. candidate pairs. Again, this closely mirrors human intuition as we’d quickly eliminate two records from a detailed comparison if these records had first names of Thomas and William or addresses in completely different states or provinces.
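
A simplified way to picture this pruning (again, not Zingg's learned approach, which derives its blocking functions from the data) is key-based blocking: only records sharing a cheap-to-compute key are ever compared in detail.

from collections import defaultdict
from itertools import combinations

def blocking_key(rec: dict) -> str:
    # Hypothetical key: first three letters of the surname plus postal code.
    return rec["surname"][:3].lower() + "|" + rec["postal_code"]

def candidate_pairs(records):
    # Yield only pairs that share a blocking key,
    # instead of all n*(n-1)/2 possible pairs.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)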

Bringing these two elements of our approach together, we now have a means to quickly identify record pairs worth comparing and a means to score each pair for the likelihood of a match. These scores are presented as probabilities between 0.0 and 1.0 which capture the model’s confidence that two records represent the same individual. On the extreme ends of the probability ranges, we can often define thresholds above or below which we simply accept the model’s judgment and move on. But in the middle, we are left with a (hopefully small) set of pairs for which human expertise is once again needed to make a final judgment call.

Zingg Simplifies ML-Based Entity Resolution

The field of entity resolution is full of techniques, variations on these techniques and evolving best practices which researchers have found work well to identify quality matches on different datasets. Instead of maintaining the expertise required to apply the latest academic knowledge to challenges such as customer identity resolution, many organizations rely on libraries encapsulating this knowledge to build their applications and workflows.

One such library is Zingg, an open source library bringing together the latest ML-based approaches to intelligent candidate pair generation and pair-scoring. Oriented towards the construction of custom workflows, Zingg presents these capabilities within the context of commonly employed steps such as training data label assignment, model training, dataset deduplication and (cross-dataset) record matching.

Built as a native Apache Spark application, Zingg scales well to apply these techniques to enterprise-sized datasets. Organizations can then use Zingg in combination with platforms such as Databricks to provide the backend to human-in-the-middle workflow applications that automate the bulk of the entity resolution work and present data experts with a more manageable set of edge case pairs to interpret. As an active-learning solution, models can be retrained to take advantage of this additional human input to improve future predictions and further reduce the number of cases requiring expert review.

Interested in seeing how this works? Then, please be sure to check out the Databricks customer entity resolution solution accelerator. In this accelerator, we show how customer entity resolution best practices can be applied leveraging Zingg and Databricks to deduplicate records representing 5-million individuals. By following the step-by-step instructions provided, users can learn how the building blocks provided by these technologies can be assembled to enable their own enterprise-scaled customer entity resolution workflow applications.

--

Try Databricks for free. Get started today.

The post New Solution Accelerator: Customer Entity Resolution appeared first on Databricks.

Optimizing Order Picking to Increase Omnichannel Profitability with Databricks

Check out our new Order Picking Optimization Solution Accelerator for more details and to download the notebooks.

Demand for buy-online pickup in-store (BOPIS), curbside and same-day home delivery is forcing retailers to use local stores as rapid fulfillment centers. Caught off-guard in the early days of the pandemic, many retailers scrambled to introduce and expand the availability of these services, using existing store inventories and infrastructure to deliver goods in a timely manner. As shoppers return to stores, requests for these services are unabated, and recent surveys show expectations for still more and faster options will only increase in the years to come. This is leaving retailers asking how best to deliver these capabilities in the longer term.

The core challenge most retailers are facing today is not how to deliver goods to customers in a timely manner, but how to do so while retaining profitability. It is estimated that margins are reduced by 3 to 8 percentage points on each order placed online for rapid fulfillment. The cost of sending a worker to store shelves to pick the items for each order is the primary culprit, and with the cost of labor only rising (and customers expressing little interest in paying a premium for what are increasingly seen as baseline services), retailers are feeling squeezed.

Concepts such as automated warehouses and dark stores optimized for picking efficiency have been proposed as solutions. However, the upfront capital investment required along with questions about the viability of such models in all but the largest markets have caused many to focus their attention on continued use of existing store footprints. In fact, Walmart, the world’s largest retailer, recently announced its commitment to this direction though with some in-store changes intended to improve the efficiency of their efforts.

The Store Layout Is Purposefully Inefficient

In the fulfillment models proposed by Walmart and many others, the existing store footprint is a core component of a rapid fulfillment strategy. In the most simplistic of these models, workers traverse the store layout, picking items for online orders which are then packaged and shipped from the counter or a backroom. In more sophisticated models, high demand items are organized in a backroom fulfillment area, limiting the need to send workers on to the store floor where picking productivity drops.

The decline in picking productivity on the store floor is by design. In a traditional retail scenario, the retailer exploits the free labor provided by the customer to increase time in-store. By sending the customer from one end of the store to the other in order to pick the items frequently needed during a visit, the retailer increases the shopper’s exposure to the goods and services available. In doing so, the retailer increases the probability that an additional purchase will be made.

For workers tasked with picking orders on behalf of customers, impulse purchases are simply not an option, and long traversal times only add to the cost of fulfillment. As one analyst notes, “the killer of productivity in a store environment is travel distance.” The store design decisions that maximize the potential of the in-person shopper are at odds with those responsible for omnichannel fulfillment.

Shoppers Know This, But Pickers May Not

Most shoppers recognize the inefficiency inherent in most store layouts. Frugal shoppers will typically carry a list of items to purchase and often optimize the sorting of items on the list to minimize back and forth between departments and aisles. Knowledge of product placement as well as the special handling needs of certain items ensure a more efficient passage through the store and minimize the potential for repeat journeys to replace items damaged in transit.

But this knowledge, built through years of experience and familiarity with the items being purchased, may not be available to a picker who is often a gig worker picking orders for others as part of an occasional side-hustle. For these workers, the list of items to pick may offer no clues as to optimal sequencing, leaving the worker to traverse the store picking the items in the order presented.

Optimizing Picking Sequences Can Help

In a recent paper titled The Buy-Online-Pick-Up-in-Store Retailing Model: Optimization Strategies for In-Store Picking and Packing, Pietri et al. examined the efficiency of several picking sequence optimizations for a real grocery store with a layout as shown in Figure 1.

Figure 1. The layout of a store, divided into fifteen distinct zones, from which orders will be picked.

Using historical orders, the authors altered the picking sequence of items with various goals in mind such as minimizing total traversal time and minimizing product damage. They compared these to the default sort order provided to pickers which was based on the order in which items were originally added to the online cart. Their goal was not to identify one best approach for all retail scenarios but instead to provide a framework for the evaluation of different approaches that others could emulate as they seek ways to improve picking efficiency.

With this goal in mind, we’ve recreated portions of their work using the 3.3 million orders in the Instacart dataset mapped to the provided store layout, as the proprietary order history used by the paper’s authors is unavailable to us. While the historical datasets differ, we found the relative impact of different sequencing approaches on picking times to closely mirror the authors’ findings (Figure 2).

Figure 2. The average picking time (seconds) associated with orders leveraging various optimization strategies.

Databricks Can Make Optimization More Efficient

In the evaluation of optimization strategies, it is common practice to apply various algorithms to a historical dataset. Using prior configurations and scenarios, the effects of optimization strategies can be assessed before being applied in the real world. Such evaluations can help organizations avoid unexpected outcomes and assess the impact of small variations in approach, but they can be quite time-consuming to perform.

But by parallelizing the work, the days or even weeks often spent evaluating an approach can be reduced to hours or even minutes. The key is to identify discrete, independent units of work within the larger evaluation set and then to leverage technology to distribute these across a large, computational infrastructure.

In the picking optimization explored above, each order represents such a unit of work, as the sequencing of the items in one order has no impact on the sequencing of any others. At the extreme end of things, we might execute optimizations on all 3.3 million orders simultaneously to perform our work incredibly quickly. More typically, we might provision a smaller number of resources and distribute subsets of the larger set to each computational node, allowing us to balance the cost of provisioning infrastructure with the time needed to perform our analysis.

The power of Databricks in this scenario is that it makes the provisioning of resources in the cloud very simple. By loading our historical orders to a Spark dataframe, they are instantly distributed across the provisioned resources. If we provision more or fewer resources, the dataframe rebalances itself with no additional effort on our part.

The trick is then applying the optimization logic to each order. Using a pandas user-defined function (UDF), we are able to apply open source libraries and custom logic to each order in an efficient manner. Results are returned to the dataframe and can then be persisted and analyzed further. To see how this was done in the analysis referenced above, or to implement it at your organization, check out our solution accelerator for Optimized Order Picking.
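
A minimal sketch of that pattern is shown below; the table name, column names, and the optimize_order logic are hypothetical stand-ins rather than the accelerator's actual code, but the shape of the call is the same: group by order, hand each group to a pandas function, and let Spark distribute the work.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def optimize_order(pdf: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical per-order logic: sort this order's items by store zone
    # as a stand-in for a real traversal-time optimization.
    pdf = pdf.sort_values("zone").reset_index(drop=True)
    pdf["pick_sequence"] = range(1, len(pdf) + 1)
    return pdf

orders = spark.read.table("instacart.order_items")  # hypothetical source table
optimized = orders.groupBy("order_id").applyInPandas(
    optimize_order,
    schema="order_id long, item_id long, zone int, pick_sequence int",
)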

--

Try Databricks for free. Get started today.

The post Optimizing Order Picking to Increase Omnichannel Profitability with Databricks appeared first on Databricks.
