
Defining the Future of Data & AI: Announcing the Finalists for the 2022 Databricks Data Team Disruptor Award


The annual Databricks Data Team Awards recognize data teams who are harnessing the power of data and AI to deliver solutions for some of the world’s toughest problems.

Nearly 250 teams were nominated across six categories from all industries, regions, and companies – all with impressive stories about the work they are doing with data and AI. As we lead up to Data and AI Summit, we will be showcasing the finalists in each of the categories over the coming days.

The Data Team Disruptor Award celebrates the data teams who are using data and AI to disrupt an industry and challenge the status quo, deploying cutting-edge use cases that others will soon adopt.

Meet the five finalists for the Data Team Disruptor Award category:

Grammarly
Grammarly has changed the world of digital writing, helping 30 million people write more clearly and effectively every day. The Data Platform team at Grammarly made it their mission to create a data ecosystem that would strengthen the engineering and analytical excellence embodied by the product. No compromises. With a small but mighty team, they were able to successfully migrate to the Databricks Lakehouse architecture. Grammarly now makes 5 billion daily events available for analytics in under 15 minutes. Engineering teams have a tailored and centralized platform to ensure product and feature releases bring joy and value to users. Efficiencies are realized for both engineering and analytics without the risk of compromising the high data security and compliance standards at Grammarly. With the Databricks Lakehouse in place, Grammarly has been able to rapidly establish a genuinely data-driven culture, empowering all teams to make more intelligent decisions for the business autonomously.

Ophelos
Ophelos is using Databricks Lakehouse Platform to power its AI and machine learning efforts to disrupt the traditionally antiquated and hostile debt collection industry and turn it into one that’s compassionate, flexible, automated and preventative, via the Ophelos Debt Resolution Platform. The company created OLIVE (Ophelos Linguistic Identification of Vulnerability), a cutting-edge natural language processing (NLP) model that predicts the likelihood that a customer is vulnerable and identifies the possible causes. Ophelos is also addressing customer service efficiency and customer experience through the Ophelos Decision Engine, an ML-powered solution that automatically calculates the long-term effects of each action, and then creates bespoke communication strategies for each individual customer. All of this data is collected anonymously in a real-time analytics dashboard to ensure businesses truly understand their customers and how they can help.

PicPay
PicPay is a Brazilian technology company that facilitates the payments of more than 30 million active users, who transacted more than 91 billion reais in 2021. But the company doesn’t want to stop there: it aims to serve its clients’ entire financial lives in one app. Its features include a digital wallet, P2P payments, a financial marketplace, e-commerce, social features and much more. To manage the complexity and scale of its data volumes and real-time processing, PicPay’s data team uses the Databricks Lakehouse Platform to process and unify large volumes of data, including transaction success rates, transaction types, fraudulent activities, and much more. This makes it easier for teams across the organization to use ML to improve customer engagement and margins, analyze and automate rebate incentives, segment customers based on usage patterns, and predict how customers will use rewards, enabling more targeted programs. Now, the team can expand its capabilities into areas such as transportation and games and provide thousands of Brazilians with a single platform for all their needs.

Pumpjack Dataworks
As Sports & Entertainment businesses rapidly transform into Direct-to-Consumer markets, a new world of opportunity is opening up for rights holders to redefine the commercial value of their business by gaining a better understanding of their fanbase. But it’s not always easy for organizations to turn their myriad fan touchpoints and engagement into a monetizable asset. Pumpjack Dataworks is changing the game by building an analytics platform on top of Databricks Lakehouse, leveraging key features such as Delta Sharing and Unity Catalog, to provide clients with a scalable, unified view of all their data to help create a better fan experience, drive new sponsorship and OTT opportunities, and securely exchange data between business partners to realize new revenue streams. Currently, the company is providing data solutions for Major League Rugby, Real Madrid, Inter Miami CF, Dallas Mavericks, and others throughout the world. Fan data is undervalued, and the product solutions powered through Pumpjack and Databricks empower clients to seize control of their data and grow its asset value.

Rivian
While electric cars in the consumer market have been gaining in popularity in recent years, they still make up a small percentage of vehicles on the road. Rivian Automotive sees them as key to a more sustainable future. The company is redefining the driving experience and optimizing vehicle performance for a safer and greener world. With the Databricks Lakehouse Platform, Rivian’s data team is building a next-generation platform for software-defined vehicles that uses advanced analytics, BI dashboards, and ML to gain a deep understanding of vehicle performance in the real world. For instance, they can now access vehicle data on charging efficiency, vehicle dynamics and airbag activity to help provide predictive diagnostics and inform the development of software updates. With these broad insights, they can identify and solve potential issues related to reliability, automated driving functionality, and battery management. As a result, the company can innovate faster, reduce costs, and ultimately, deliver a better and more sustainable driving experience to customers.

Check out the award finalists in the other five categories and come raise a glass and celebrate these amazing data teams during an award ceremony at the Data and AI Summit on June 29.

--

Try Databricks for free. Get started today.



What’s New With Databricks Notebooks at the Data & AI Summit


At Databricks, we are continually evolving the development experience to accelerate the path from data to insights. Today we’re excited to announce further improvements to the Databricks Notebook ahead of the Data + AI Summit happening June 27-30 (register here!). Join us to see how we’re making the development experience even better with simplified compute, deeper connections between SQL and Python, embracing the Jupyter ecosystem, new low-code exploratory data analysis tools, and better access tracking with audit logs.

Integrated Compute Management in the Notebook
We’ve rebuilt the Notebook’s UI for choosing compute resources from the ground up, designing it so you spend less time finding the right resource and more time on your work. Your most recently used compute resources are now at your fingertips, and when you need a new resource, you can create it with only a few clicks without ever having to leave the Notebook.

Grab and go compute makes it easier to focus on the task at hand

Easily take your exploration from SQL to Python
Python and SQL are the two most popular languages in Databricks notebooks, and users frequently load data with SQL before diving deeper with Python. You can now use Python to explore SQL cell results in Python notebooks. You’ll be able to retrieve the results of SQL cells as Python dataframes without the need to manually convert between the two languages.
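A minimal sketch of the flow, assuming the implicit _sqldf handle this feature exposes and a hypothetical SQL cell that ran SELECT * FROM samples.nyctaxi.trips LIMIT 1000 just before it:

# The previous SQL cell's result set is available to Python as a Spark DataFrame
trips = _sqldf  # implicit variable populated from the last SQL cell (assumed name)

# Continue the exploration in Python, e.g. a quick aggregation
display(trips.groupBy("pickup_zip").count())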

Turn your SQL results into Python dataframes automatically and save the hassle of writing extra code

Introducing Ipywidgets Support to Python Notebooks
With ipywidgets support (currently in public preview) on top of the IPython kernel, you can make your notebooks interactive with rich controls and over 30 distinct UI elements. This brings the power of the Jupyter ecosystem to Databricks notebooks.
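As a quick illustration of the kind of interactivity this enables, here is a sketch using the standard ipywidgets pattern against a hypothetical sales.orders table:

import ipywidgets as widgets

# Re-run a small aggregation whenever the slider value changes (table name is hypothetical)
@widgets.interact(min_amount=widgets.IntSlider(min=0, max=1000, step=50, value=100))
def count_orders(min_amount):
    n = spark.table("sales.orders").where(f"amount >= {min_amount}").count()
    print(f"{n} orders with amount >= {min_amount}")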


Add great visuals and interactivity to your notebook with over 30 flexible widgets

UI-Based Data Exploration with Bamboolib
Prepare, transform, visualize, and explore your data using a simple, user-friendly interface! With Bamboolib, an extendable GUI integrated into your notebook, you save time with no-code data exploration but have access to the generated code needed to replicate and customize the results. This makes it easier for citizen data scientists and domain experts to manipulate their data with Python.
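A minimal sketch of the usage pattern, assuming bamboolib is installed on the cluster (for example via %pip install bamboolib) and following the standard pattern where displaying a pandas DataFrame after importing the package opens the widget:

import bamboolib as bam
import pandas as pd

# A small example DataFrame; in practice this could come from spark.table(...).toPandas()
df = pd.DataFrame({"country": ["US", "DE", "BR"], "revenue": [120, 80, 95]})

# Rendering the DataFrame in a notebook cell opens the bamboolib UI,
# which also generates the equivalent pandas code for each transformation you apply
df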


Enjoy the ease of low-code data exploration through a user friendly yet powerful UI

Improved Compliance with Notebook Cell Execution Audit Logs
All notebook access and user revisions are associated with user identities, allowing security administrators to easily audit actions performed on your workspace. We’ve taken transparency a step further by logging execution of individual cells within Databricks notebooks. This improves access controls and identity management by allowing administrators to always know who did what, when, and where.

Improve your audit capabilities by tracking down to the cell execution level within the Databricks notebook

Learn more:

--

Try Databricks for free. Get started today.


Prescriptive Guidance for Implementing a Data Vault Model on the Databricks Lakehouse Platform


There are many different data models that you can use when designing an analytical system, such as industry-specific domain models, Kimball, Inmon, and Data Vault methodologies. Depending on your unique requirements, you can use these different modeling techniques when designing a lakehouse. They all have their strengths, and each can be a good fit in different use cases.

Ultimately, a data model is nothing more than a construct that defines tables and the one-to-one, one-to-many, and many-to-many relationships between them. Data platforms must provide best practices for physicalizing the data model to enable easier information retrieval and better performance.

In a previous article, we covered Five Simple Steps for Implementing a Star Schema in Databricks With Delta Lake. In this article, we aim to explain what a Data Vault is, how to implement it within the Bronze/Silver/Gold layer and how to get the best performance of Data Vault with Databricks Lakehouse Platform.

Data Vault modeling, defined

The goal of Data Vault modeling is to adapt to fast-changing business requirements and, by design, support faster, more agile development of data warehouses. A Data Vault is well suited to the lakehouse methodology: the hub, link and satellite design keeps the data model granular and easily extensible, so model and ETL changes are straightforward to implement.

Let’s understand a few building blocks for a Data Vault. In general, a Data Vault model has three types of entities:

  • Hubs — A Hub represents a core business entity, like customers, products, orders, etc. Analysts will use the natural/business keys to get information about a Hub. The primary key of Hub tables is usually derived by a combination of business concept ID, load date, and other metadata information.
  • Links — Links represent the relationships between Hub entities and contain only the join keys, with no descriptive attributes. They are similar to a factless fact table in a dimensional model.
  • Satellites — Satellite tables hold the descriptive attributes of the entities in Hubs or Links. They are similar to a normalized version of a dimension table. For example, a customer hub can have many satellite tables, such as customer geographical attributes, customer credit score, customer loyalty tiers, etc.

One of the major advantages of the Data Vault methodology is that existing ETL jobs need significantly less refactoring when the data model changes. Data Vault is a “write-optimized” modeling style that supports agile development approaches, making it a great fit for data lakes and the lakehouse approach.

A diagram shows how data vault modeling works, with hubs, links, and satellites connected to one another.

How Data Vault fits in a Lakehouse

Let’s see how some of our customers are using Data Vault Modeling in a Databricks Lakehouse architecture:

Data Vault Architecture on the Lakehouse

Considerations for implementing a Data Vault Model in Databricks Lakehouse

  • Data Vault modeling recommends using a hash of business keys as the primary keys. Databricks supports hash, md5, and SHA functions out of the box to generate these keys (see the sketch after this list).
  • Data Vault layers include the concept of a landing zone (and sometimes a staging zone). Both of these physical layers fit naturally into the Bronze layer of the data lakehouse. If data arrives in the landing zone in formats such as Avro, CSV, Parquet, XML or JSON, it is converted to Delta-formatted tables in the staging zone so that the subsequent ETL can be highly performant.
  • Raw Vault is created from the landing or staging zone. Data is modeled as Hubs, Links and Satellite tables in the Raw Data Vault. Additional “business” ETL rules are not typically applied while loading the Raw Data Vault.
  • All the ETL business rules, data quality rules, cleansing and conforming rules are applied between Raw and Business Vault. Business Vault tables can be organized by data domains – which serve as an enterprise “central repository” of standardized cleansed data. Data stewards and SMEs own the governance, data quality and business rules around their areas of the Business Vault.
  • Query-helper tables such as Point-in-Time (PIT) and Bridge tables are created for the presentation layer on top of the Business Vault. PIT tables bolster query performance because some satellites and hubs are pre-joined and WHERE conditions provide “point in time” filtering. Bridge tables pre-join hubs or entities to provide flattened, dimension-table-like views of entities. Delta Live Tables behave like materialized views and can be used to create Point-in-Time tables as well as Bridge tables in the Gold/Presentation layer on top of the Business Data Vault.
  • As business processes change and adapt, the Data Vault model can be easily extended without the massive refactoring that dimensional models often require. Additional hubs (subject areas) can be easily added to links (pure join tables), and additional satellites (e.g., customer segmentations) can be added to a hub (customer) with minimal changes.
  • Also loading a dimensional model Data Warehouse in Gold layer becomes easier for the following reasons:
    • Hubs make key management easier (natural keys from hubs can be converted to surrogate keys via Identity columns).
    • Satellites make loading dimensions easier because they contain all the attributes.
    • Links make loading fact tables quite straightforward because they contain all the relationships.
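As a hedged, minimal sketch of the first and last points above (hash-based hub keys and identity-column surrogate keys), using hypothetical table and column names:

from pyspark.sql import functions as F

# Build a customer hub: hash the business key and add load metadata (names are hypothetical)
raw_customers = spark.table("staging.customers")

hub_customer = (raw_customers
    .select(
        F.sha2(F.col("customer_id").cast("string"), 256).alias("customer_hash_key"),
        F.col("customer_id").alias("customer_business_key"),
        F.current_timestamp().alias("load_ts"),
        F.lit("crm").alias("record_source"))
    .dropDuplicates(["customer_hash_key"]))

hub_customer.write.mode("append").saveAsTable("silver.hub_customer")

# In the Gold layer, an identity column can generate surrogate keys for a dimension
# (identity columns require a recent Databricks Runtime)
spark.sql("""
  CREATE TABLE IF NOT EXISTS gold.dim_customer (
    customer_sk BIGINT GENERATED ALWAYS AS IDENTITY,
    customer_business_key STRING,
    customer_name STRING
  ) USING DELTA
""")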

Tips to get best performance out of a Data Vault Model in Databricks Lakehouse

  • Use Delta Formatted tables for Raw Vault, Business Vault and Gold layer tables.
  • Make sure to use OPTIMIZE and Z-order indexes on all join keys of Hubs, Links and Satellites (a short tuning sketch follows this list).
  • Do not over-partition the tables, especially the smaller satellite tables. Use Bloom filter indexing on date columns, current-flag columns and frequently filtered predicate columns to ensure the best performance, especially if you need additional indexes beyond Z-ordering.
  • Delta Live Tables (Materialized Views) makes creating and managing PIT tables very easy.
  • Reduce the optimize.maxFileSize to a lower number, such as 32-64MB vs. the default of 1 GB. By creating smaller files, you can benefit from file pruning and minimize the I/O retrieving the data you need to join.
  • A Data Vault model has comparatively more joins, so use the latest Databricks Runtime (DBR) version, which has Adaptive Query Execution enabled by default so that the best join strategy is chosen automatically. Use join hints only if necessary, for advanced performance tuning.
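The sketch below illustrates a few of these tips together. Table and column names are hypothetical, and the Bloom filter DDL and the optimize.maxFileSize setting are Databricks-specific, so treat this as an assumption to verify against the current documentation:

# Target smaller files (~64 MB) so joins benefit from file pruning
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 64 * 1024 * 1024)

# Z-order the hub/link/satellite tables on their join keys
spark.sql("OPTIMIZE silver.hub_customer ZORDER BY (customer_hash_key)")
spark.sql("OPTIMIZE silver.link_customer_order ZORDER BY (customer_hash_key, order_hash_key)")

# Add a Bloom filter index on a frequently filtered column of a satellite table
spark.sql("""
  CREATE BLOOMFILTER INDEX ON TABLE silver.sat_customer_details
  FOR COLUMNS (load_date OPTIONS (fpp = 0.1, numItems = 50000000))
""")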

Learn more about Data Vault modeling at Data Vault Alliance.

Get started on building your Data Vault in the Lakehouse

Try Databricks free for 14 days.

--

Try Databricks for free. Get started today.


Data Warehousing Modeling Techniques and Their Implementation on the Databricks Lakehouse Platform


The lakehouse is a new data platform paradigm that combines the best features of data lakes and data warehouses. It is designed as a large-scale enterprise-level data platform that can house many use cases and data products. It can serve as a single unified enterprise data repository for all of your:

  • data domains,
  • real-time streaming use cases,
  • data marts,
  • disparate data warehouses,
  • data science feature stores and data science sandboxes, and
  • departmental self-service analytics sandboxes.

Given the variety of the use cases — different data organizing principles and modeling techniques may apply to different projects on a lakehouse. Technically, the Databricks Lakehouse Platform can support many different data modeling styles. In this article, we aim to explain the implementation of the Bronze/Silver/Gold data organizing principles of the lakehouse and how different data modeling techniques fit in each layer.

What is a Data Vault?

Compared to the Kimball and Inmon methods, a Data Vault is a more recent data modeling design pattern used to build data warehouses for enterprise-scale analytics.

Data Vaults organize data into three different types: hubs, links, and satellites. Hubs represent core business entities, links represent relationships between hubs, and satellites store attributes about hubs or links.

Data Vault focuses on agile data warehouse development where scalability, data integration/ETL and development speed are important. Most customers have a landing zone, a Vault zone and a data mart zone, which correspond to the Databricks organizational paradigm of Bronze, Silver and Gold layers. The Data Vault modeling style of hub, link and satellite tables typically fits well in the Silver layer of the Databricks Lakehouse.

Learn more about Data Vault modeling at Data Vault Alliance.

A diagram showing how Data Vault modeling works, with hubs, links, and satellites connecting to one another.

What is Dimensional Modeling?

Dimensional modeling is a bottom-up approach to designing data warehouses in order to optimize them for analytics. Dimensional models are used to denormalize business data into dimensions (like time and product) and facts (like transactions in amounts and quantities), and different subject areas are connected via conformed dimensions to navigate to different fact tables.

The most common form of dimensional modeling is the star schema. A star schema is a multi-dimensional data model used to organize data so that it is easy to understand and analyze, and very easy and intuitive to run reports on. Kimball-style star schemas or dimensional models are pretty much the gold standard for the presentation layer in data warehouses and data marts, and even semantic and reporting layers. The star schema design is optimized for querying large data sets.
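For example, a typical query against such a schema joins a central fact table to its dimensions and aggregates the measures. The sketch below uses hypothetical table and column names:

# Monthly sales by store region from a star schema in the Gold layer
sales_by_region = spark.sql("""
  SELECT d.calendar_month, s.region, SUM(f.sales_amount) AS total_sales
  FROM gold.fact_sales f
  JOIN gold.dim_date d ON f.date_key = d.date_key
  JOIN gold.dim_store s ON f.store_key = s.store_key
  GROUP BY d.calendar_month, s.region
""")
display(sales_by_region)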

A star schema example

Both normalized Data Vault (write-optimized) and denormalized dimensional models (read-optimized) data modeling styles have a place in the Databricks Lakehouse. The Data Vault’s hubs and satellites in the Silver layer are used to load the dimensions in the star schema, and the Data Vault’s link tables become the key driving tables to load the fact tables in the dimension model. Learn more about dimensional modeling from the Kimball Group.

Data organization principles in each layer of the Lakehouse

A modern lakehouse is an all-encompassing enterprise-level data platform. It is highly scalable and performant for all kinds of different use cases such as ETL, BI, data science and streaming that may require different data modeling approaches. Let’s see how a typical lakehouse is organized:

A diagram showing characteristics of the Bronze, Silver, and Gold layers of the Data Lakehouse Architecture.

Bronze layer — the Landing Zone

The Bronze layer is where we land all the data from source systems. The table structures in this layer correspond to the source system table structures “as-is,” aside from optional metadata columns that can be added to capture the load date/time, process ID, etc. The focus in this layer is on change data capture (CDC), and the ability to provide an historical archive of source data (cold storage), data lineage, auditability, and reprocessing if needed — without rereading the data from the source system.

In most cases, it’s a good idea to keep the data in the Bronze layer in Delta format, so that subsequent reads from the Bronze layer for ETL are performant — and so that you can do updates in Bronze to write CDC changes. Sometimes, when data arrives in JSON or XML formats, we do see customers landing it in the original source data format and then staging it by converting it to Delta format. So sometimes, we see customers manifest the logical Bronze layer as a physical landing zone and staging zone.

Storing raw data in its original source format in a landing zone also helps with consistency when you ingest data via tools that don’t support Delta as a native sink, or when source systems dump data directly onto object stores. This pattern also aligns well with the Auto Loader ingestion framework: sources land raw files in the landing zone, and Databricks Auto Loader then converts the data into the Delta-formatted staging layer.
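A minimal Auto Loader sketch of this landing-to-Bronze pattern, with hypothetical paths and table names:

# Incrementally pick up raw JSON files dropped in the landing zone
raw_orders = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/orders")
    .load("/mnt/landing/orders"))

# Write them out as a Delta table in the Bronze/staging layer
(raw_orders.writeStream
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/orders")
    .trigger(once=True)  # or run continuously by omitting the trigger
    .toTable("bronze.orders_staging"))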

Silver layer — the Enterprise Central Repository

In the Silver layer of the Lakehouse, the data from the Bronze layer is matched, merged, conformed and cleaned (“just-enough”) so that the Silver layer can provide an “enterprise view” of all its key business entities, concepts and transactions. This is akin to an Enterprise Operational Data Store (ODS) or a Central Repository or Data domains of a Data Mesh (e.g. master customers, products, non-duplicated transactions and cross-reference tables). This enterprise view brings the data from different sources together, and enables self-service analytics for ad-hoc reporting, advanced analytics and ML. It also serves as a source for departmental analysts, data engineers and data scientists to further create data projects and analysis to answer business problems via enterprise and departmental data projects in the Gold layer.

In the Lakehouse data engineering paradigm, the Extract-Load-Transform (ELT) methodology is typically followed, rather than traditional Extract-Transform-Load (ETL). The ELT approach means only minimal or “just-enough” transformations and data cleansing rules are applied while loading the Silver layer. All the “enterprise level” rules are applied in the Silver layer, whereas project-specific transformation rules are applied in the Gold layer. Speed and agility to ingest and deliver data into the Lakehouse are prioritized here.

From a data modeling perspective, the Silver layer contains more 3rd-Normal-Form-like data models. Data-Vault-like, write-performant data architectures and data models can be used in this layer. If using a Data Vault methodology, both the Raw Data Vault and the Business Vault fit in the logical Silver layer of the lakehouse — and the Point-In-Time (PIT) presentation views or materialized views are presented in the Gold layer.

Gold layer — the Presentation Layer

In the Gold layer, multiple data marts or warehouses can be built as per dimensional modeling/Kimball methodology. As discussed earlier, the Gold layer is for reporting and uses more denormalized and read-optimized data models with fewer joins compared to the Silver layer. Sometimes tables in the Gold Layer can be completely denormalized, typically if the data scientists want it that way to feed their algorithms for feature engineering.

ETL and data quality rules that are “project-specific” are applied when transforming data from the Silver layer to Gold layer. Final presentation layers such as data warehouses, data marts or data products like customer analytics, product/quality analytics, inventory analytics, customer segmentation, product recommendations, marketing/sales analytics etc. are delivered in this layer. Kimball style star-schema based data models or Inmon style Data marts fit in this Gold Layer of the Lakehouse. Data Science Laboratories and Departmental Sandboxes for self-service analytics also belong in the Gold Layer.

The Lakehouse Data Organization Paradigm

To summarize, data is curated as it moves through the different layers of a Lakehouse.

  • The Bronze layer uses the data models of source systems. If data is landed in raw formats, it is converted to Delta Lake format within this layer.
  • The Silver layer for the first time brings the data from different sources together and conforms it to create an enterprise view of the data — typically using more normalized, write-optimized data models that are 3rd-Normal-Form-like or Data-Vault-like.
  • The Gold layer is the presentation layer, with more denormalized or flattened data models than the Silver layer, typically using Kimball-style dimensional models or star schemas. The Gold layer also houses departmental and data science sandboxes to enable self-service analytics and data science across the enterprise. Providing these sandboxes with their own separate compute clusters prevents business teams from creating their own copies of data outside of the Lakehouse.

This Lakehouse data organization approach is meant to break down data silos, bring teams together, and empower them to do ETL, streaming, BI and AI on one platform with proper governance. Central data teams should be the enablers of innovation in the organization, speeding up the onboarding of new self-service users as well as the development of many data projects in parallel, rather than letting the data modeling process become the bottleneck. Databricks Unity Catalog provides search & discovery, governance and lineage on the Lakehouse to ensure good data governance.

Build your Data Vaults and star schema data warehouses with Databricks SQL today.

How data is curated as it moves through the various Lakehouse layers.

Further reading:

--

Try Databricks for free. Get started today.


Azure Databricks Guide to Data + AI Summit 2022 Featuring Akamai and AT&T


This is a collaborative post from Databricks and Microsoft Azure. We thank Rajeev Jain, Senior Product Marketing Manager at Microsoft, for his contributions.

 

Data + AI Summit 2022: Register now to join this in-person and virtual event June 27-30 and learn from the global data community.

Microsoft is a Platinum Sponsor of Data + AI Summit 2022, the world’s largest gathering of the data and analytics community. Join us for breakout sessions, customer keynotes, in-person networking, and more!

At Data + AI Summit, Databricks and Microsoft customers will take the stage across several sessions to share how they achieved business results using the Azure Databricks Lakehouse. Attendees will have the opportunity to hear from data leaders from Akamai and engineering and sales leaders from Microsoft.

The sessions below are a guide for everyone interested in Azure Databricks and they span a range of topics — from scaling business operations for enterprise-wide analytics to building a complete analytics and AI solution built on the lakehouse architecture. If you have questions about Azure Databricks or service integrations, connect with Azure Databricks Solutions Architects at Data + AI Summit at the Microsoft Azure booth on the Expo floor.

Azure Databricks Customer Breakout Sessions

Pushing the limits of scale and performance for enterprise-wide analytics: A fire-side chat with Akamai | Hagai Attias, Senior Software Architect (Akamai) and Arindam Chatterjee, Principal General Manager (Microsoft) | 6/28 @ 4:00 PM PST

With the world’s most distributed compute platform — from cloud to edge — Akamai makes it easy for businesses to develop and run applications, while keeping experiences closer to users and threats farther away.

So when its legacy Hadoop-like infrastructure began reaching its capacity limits, Akamai needed to scale while keeping its global operations running uninterrupted, and it partnered with Microsoft and Databricks to migrate to Azure Databricks.

Learn more


How AT&T Data Science Team Solved an Insurmountable Big Data Challenge on Databricks with Two Different Approaches using Photon and RAPIDS Accelerator for Apache Spark | Hao Zhu, Senior Manager (NVIDIA) and Chris Vo, Principal Member of Tech Staff (AT&T) | 6/28 @ 4:45 PM PST

Data-driven personalization is an insurmountable challenge for AT&T’s data science team because of the size of its datasets and the complexity of its data engineering. These data preparation tasks often take hours or days to complete, and some fail to complete at all, affecting productivity. In this session, the AT&T Data Science team will talk about how RAPIDS Accelerator for Apache Spark and the Photon runtime on Databricks can be leveraged to process these extremely large datasets, resulting in improved content recommendation, classification, etc., while reducing infrastructure costs. The team will discuss the design of experiments on different Azure Databricks runtimes, first with NVIDIA T4 GPU instances and then with Databricks’ Photon runtime, and will compare speedups and costs to the regular Databricks Runtime Apache Spark environment.

Learn more


Improving Apache Spark Structured Streaming Application Processing Time by Configurations, Code Optimizations, and Custom Data Source | Nir Dror, Principal Performance Engineer (Akamai) and Kineret Raviv, Principal Software Developer (Akamai) | 6/28 @ 5:30 PM PST

In this session, we’ll go over several use cases and describe the process of improving our Spark Structured Streaming application’s micro-batch time from ~55 to ~30 seconds in several steps.
Our app processes ~700 MB/s of compressed data, has very strict KPIs, and uses several technologies and frameworks, including Spark 3.1, Kafka, Azure Blob Storage, AKS and Java 11.

We’ll share our work and experience in those areas, and go over a few tips for creating better Spark Structured Streaming applications.

Learn more

Azure Databricks Breakout Sessions


Your fastest path to Lakehouse and beyond | Nate Shea-han, Director Specialist, Global Black Belt Team (Microsoft) | 6/29 @ 11:30 AM PST

Azure Databricks is an easy, open, and collaborative service for data, analytics & AI use cases, enabled by Lakehouse architecture. Join this session to discover how you can get the most out of your Azure investments by combining the best of Azure Synapse Analytics, Azure Databricks and Power BI for building a complete analytics & AI solution based on lakehouse architecture.

Learn more


Your AI strategy is only as robust as your data estate | A fire-side chat with Accenture, Avanade, and Microsoft | 6/28 @ 11:00 AM PST

Participants: Paul Barrett, CTO – MD (Accenture) | Tripti Sethi, North America Data and AI Lead (Avanade) | Lindsey Allen, General Manager – Azure Databricks and Applied AI (Microsoft)

Learn more


We also invite you to visit the Microsoft booth on the Expo floor, where you’ll get to talk 1:1 with the Azure data engineering team about how to address your toughest analytics challenges with Azure.

Register now to join this free virtual event and join the data and AI community. Learn how companies are successfully building their Lakehouse architecture with Azure Databricks to create a simple, open and collaborative data platform. Get started using Databricks with $200 in Azure credits and a free trial.

--

Try Databricks for free. Get started today.


Everything You Need to Know About Data + AI Summit 2022


Data + AI Summit 2022, the global event for the data community, takes place in San Francisco and virtually in just a few days, June 27-30! This year’s Summit is truly a “can’t miss” event – with 240+ technical sessions, keynotes, Meetups and more, whether you attend in person at Moscone South, San Francisco, or join us virtually, for free.

(Psst: It’s not too late to register! Get all the details here).

Data + AI Summit, formerly known as Spark + AI Summit, will bring together tens of thousands of data practitioners, leaders and visionaries from more than 160 countries to explore the latest in Lakehouse, open source, AI/ML and more. Attendees can also participate in conference training and brand new certification courses.

With Summit only a few days away, here’s a rundown of what you need to know:

Our theme: Destination Lakehouse

Today’s technology decisions are at the intersection of data and AI. There is a growing demand for new approaches to analytics in the cloud that embrace open, simple methods and architectures. The rise of the data lakehouse paradigm means every organization now has a new destination for data. This year’s Destination Lakehouse theme highlights the building blocks of the modern data stack and focuses on product innovations that enable data stewards to mobilize data assets quickly for better, more informed decisions.

Choose your experience

Choose how you want to explore the Lakehouse. Data + AI Summit returns to San Francisco for the first time in three years. Data enthusiasts can still tune in virtually by registering for the free, immersive online experience. No matter how you join, get ready for 4 days of technical deep dives, keynotes from data leaders and visionaries, socializing with peers, and more.

We have an incredible lineup of keynotes from industry thought leaders, such as Databricks Co-founders Ali Ghodsi, Matei Zaharia and Reynold Xin, as well as keynotes from visionaries like Tarika Barrett, CEO of Girls Who Code, Peter Norvig, a pioneer in AI and best-selling textbook author,  Zhamak Dehghani, the creator of Data Mesh, Tristan Handy, CEO and Co-founder of dbt Labs, and many others.

Here’s a rundown of what to expect:

Get the most from your on-site experience

The in-person Summit experience takes place in San Francisco! The best way to navigate the event is to download the official Data + AI Summit mobile app for iOS and Android — just visit the Apple App Store or Google Play Store, search “Data + AI Summit,” and install it on your device. Use your registration credentials to sign in.

Designed to help you get the best from the event, the app lets you explore the full lineup of keynotes, technical sessions, training and networking opportunities. Be sure to switch on real-time notifications so you don’t miss a session or event.

Build your agenda
Easily build your agenda via our conference app for the in-person event. Our agenda is jam-packed with rich technical content, product deep dives, luminary keynotes and more. Simply go to Full Agenda under the Agenda tab, and click ❤️ next to the session you want to add. You can also add each session to your personal calendar. Make sure to take the time to play with the agenda filters by track and topic, and explore the speakers’ pages to build your viewing experience straight from home.

Sample personalized agenda available to attendees of Data + AI Summit

Get the most from your virtual experience

Our virtual event platform launches Friday, June 24 for all virtual attendees; the platform will also be available for in-person attendees after the event. After July 1, you will be able to access all breakouts and keynotes on the platform for two weeks until July 15. From there, you can still access content on-demand through our Summit website.

Your one-stop-shop dashboard
As you enter the virtual conference, you will be welcomed by your personal dashboard — a home for everything you need to know about Summit. The dashboard presents the most useful links to navigate the program agenda, featured sessions, and interactive attendee experience. Make sure to keep an eye on your inbox for notifications, so you don’t miss any program updates.

But it’s not just to keep you organized! This platform will also feature a variety of fun activities, such as a photo booth and “Summit Quest” game to earn prizes, as well as networking opportunities with other attendees, job boards and more.

The virtual event offers live-streamed keynotes, training, and breakout sessions, as well as additional content from the in-person event that will be made available on demand within 24 hours after the session time. Live-streamed sessions will appear in the “Full Agenda,” and you can access all other sessions from the On Demand library.

Sample personalized dashboard available to attendees of Data + AI Summit

Build your agenda

Easily build and update your daily agenda via the dashboard. To add sessions to your personalized agenda, go to the agenda tab and select Full Agenda, and then click on the ❤️ next to the session you want to add. You can also add each session to your personal calendar. All live-streamed sessions from San Francisco can be found on the Full Agenda page of the platform. Don’t forget to also check out the On-Demand Library for hundreds more sessions that will be made available throughout the week! We encourage all attendees to take advantage of the filters to find the content most relevant to them.

Networking & community opportunities for in-person attendees

Data + AI Summit is just as much about the community as it is the technology. That’s why we have a variety of Meetups, networking opportunities and other community-focused events planned for both on-site and virtual attendees.

From after-hour parties to open source-focused Meetups, we’ve got a lot planned for folks in San Francisco.

Dev Hub + Expo

We welcome you to our Dev Hub + Expo, where you can interact one-on-one with data professionals and explore the latest technologies. Visit the Databricks Booth to dive into open source technologies such as  Delta Lake, Apache Spark™, PyTorch and MLflow (and of course learn all about Lakehouse). We’re also excited to host a variety of sponsors, who are tech innovators in the Data + AI Community. Visit their sponsor booths in person or check out their virtual booths online.

At the Dev Hub + Expo, you will be able to connect with 80+ tech innovators, participate in a variety of Summit-related games, listen to our Databrew podcast, and network with your peers. Our in-person event will feature some extra perks for attendees, such as the developer lounge, industry lounges, Community Cove and Welcome Expo Party.

Sample Dev Hub and Expo available to attendees of Data + AI Summit

 Sample personalized Summit Quest leaderboard available to attendees of Data + AI Summit

Community Meetups & Events 

In-person attendees will have access to our Meetups at Moscone Center, San Francisco. These special events are free to attend on a first-come, first-served basis. Here’s a glimpse at the schedule:

June 27:  Meetup: The War in Ukraine: Challenges in Documenting War Crimes and Russian False Flags

June 29: Delta Lake birthday party – meet and greet with contributors and committers

We’re also excited to host 6 Ask Me Anything sessions (AMAs) across a variety of topics. Leaders in the space (contributors, data engineers, developers and more) will share everything you need to know in these rapid-fire Q&As with some of the biggest names in the industry, including Matei Zaharia (Co-founder & Chief Technologist at Databricks; original creator of Apache Spark and MLflow) and Reynold Xin (Co-founder & Chief Architect at Databricks; top contributor to Apache Spark). AMA topics include Apache Spark, Delta Lake, MLOps, the pandas API, Streaming, Lakehouse best practices…and more.

Bring on the fun!

It has been a while since we have all come together in person – and we are making the most of it! Our attendees in San Francisco will have many opportunities to connect, and it all starts off with “flaring” your badge to connect with like-minded people.

Throughout the conference, we will also have Summit Quest (a gamified way to participate in Summit and win prizes), specialized networking events, and two fun-filled evening parties: our opening night reception at the Dev Hub + Expo and our Datatorium Summit party at the Exploratorium, where you can enjoy food, drinks, a DJ, interactive exhibits and The Spazmatics.

Women in Data + AI 

Please join us for a panel discussion to hear from female leaders joining us from ThoughtSpot, Google and VMware. Following the discussion, there will be an interactive Q&A session with the speakers. These leaders in Data and AI will also provide actionable items for how to get involved with organizations supporting the education of women in tech and how to support and mentor in your local community.

Networking & community opportunities for virtual attendees

Our goal is to make Data + AI Summit 2022 our best conference for all attendees, which is why we have a lot planned for attendees joining us remotely, too! Here’s how you can engage with peers and have fun (without even having to change out of your pajamas).

Job Board

Browse the latest job opportunities in the world of data and AI. Check out the job board at the Dev Hub + Expo to learn more.

Win prizes with Summit Quest 

Collect points throughout the conference by checking into sessions, visiting sponsors and completing surveys. The top 50 spots on the leaderboard will win Summit merchandise and other cool gifts!

Databricks Experience (DBX)

DBX is a curated set of content and sessions, where you will learn more about the amazing innovation happening at Databricks. Designed to help you make the most of your time at Summit, DBX offers quick access to the subjects most relevant to your role. Some key features of DBX include:

  • Product deep dives covering the Databricks Lakehouse Platform with sessions on Delta Lake, Unity Catalog, Databricks SQL, Delta Live Tables, MLflow and more.
  • Training and certification on everything Databricks, from Apache Spark programming to building ETL pipelines to managing machine learning models.
  • News on upcoming Databricks products, features and innovations.

Whether you join us from San Francisco or from your couch, our goal is to bring people together to learn and connect in a virtual environment. There is so much more that we can share, but now it’s your turn to discover what Data + AI Summit has to offer. If you haven’t yet registered, it’s not too late. As always, the virtual pass is completely free. Join us for all the action at Data + AI Summit — we look forward to seeing you there!

To get even more details on what to expect, view our Know Before You Go deck.

--

Try Databricks for free. Get started today.


Databricks on Google Cloud Guide to Data + AI Summit 2022


This is a collaborative post from Databricks and Google Cloud. We thank Eddie White, Data and Analytics, Google Cloud for his contributions.

 

Data + AI Summit 2022: Register now to join this in-person and virtual event June 27-30 and learn from the global data community.

Google Cloud is a Platinum Sponsor of Data + AI Summit 2022, the world’s largest gathering of the data and analytics community. Join this event and learn from Google Cloud technology leaders about how customers are successfully leveraging the Databricks Lakehouse Platform for their businesses, bringing together data, AI and analytics on one common platform.

At Data + AI Summit, Databricks and Google Cloud customers and leaders will present sessions to help you see how they achieved business results using the Databricks on Google Cloud Lakehouse. Attendees will have the opportunity to hear from Bruno Aziza, Head of Data & Analytics at Google Cloud on Tuesday, June 28 and Anagha Khanolkar, a Google Cloud customer engineer on Wednesday, June 29. Also, be sure to catch the virtual session with Ivan Nardini and Deb Lee, Customer Engineers, Smart Analytics at Google Cloud.

The sessions below are a guide for everyone interested in Databricks on Google Cloud and span a range of topics — from innovative ways customers are applying data today to a look at the future of data and emerging trends. If you have questions about Databricks on Google Cloud or service integrations, connect with Databricks on Google Cloud Solutions Architects at Data + AI Summit at the Google Cloud booth on the Expo floor.


Databricks on Google Cloud Breakout Sessions

The Future of Data – What’s Next with Google Cloud | Bruno Aziza, Head of Data & Analytics, (Google Cloud) | 6/28 @ 10:45 AM PST

Join Bruno Aziza, Head of Data & Analytics at Google Cloud, for an in-depth look at where he sees the future of data heading and at emerging trends. He will also cover Google Cloud’s data analytics practice, including insights into the Data Cloud Alliance, BigLake, and our strategic partnership with Databricks.

Learn more


The Future is Open – a Look at Google Cloud’s Open Data Ecosystem | Anagha Khanolkar, Customer Engineer, Analytics (Google Cloud) | 6/29 @ 2:05 PM PST

Join Siddhartha Agarwal, Senior Director, SaaS Partnerships & Co-Innovation at Google Cloud and Anagha Khanolkar, Cloud Customer Engineer, Advanced Analytics at Google Cloud for a deep dive into some of the innovative ways we’re seeing data applied with our customers today. This session will cover some exciting use cases and a deep dive into the Databricks and Google Cloud technology stack, deployments, and results companies are seeing in the market.

Learn more


Databricks on Google Cloud Virtual Sessions

Accelerating MLOps Using Databricks and Vertex AI on GCP | Ivan Nardini, Customer Engineer, Smart Analytics (Google Cloud) and Deb Lee, Customer Engineer, Smart Analytics (Google Cloud) | Virtual

In this session, attendees will learn how to serve real-time prediction models built and trained on Databricks using Vertex AI. We will highlight the business benefits realized by implementing MLOps on an open source platform: accelerating the model development and deployment lifecycle while decreasing the time to make data-driven business decisions.

Learn more


Register now to join this free virtual event and join the data and AI community. Learn how companies are successfully building their Lakehouse architecture with Databricks on Google Cloud to create a simple, open and collaborative data platform. Get started using Databricks on Google Cloud with a 14-day free trial.

--

Try Databricks for free. Get started today.


Software Engineering Best Practices With Databricks Notebooks


Notebooks are a popular way to start working with data quickly without configuring a complicated environment. Notebook authors can quickly go from interactive analysis to sharing a collaborative workflow, mixing explanatory text with code. Often, notebooks that begin as exploration evolve into production artifacts. For example,

  1. A report that runs regularly based on newer data and evolving business logic.
  2. An ETL pipeline that needs to run on a regular schedule, or continuously.
  3. A machine learning model that must be re-trained when new data arrives.

Perhaps surprisingly, many Databricks customers find that with small adjustments, notebooks can be packaged into production assets, and integrated with best practices such as code review, testing, modularity, continuous integration, and versioned deployment.

To Re-Write, or Productionize?

After completing exploratory analysis, conventional wisdom is to re-write notebook code in a separate, structured codebase, using a traditional IDE. After all, a production codebase can be integrated with CI systems, build tools, and unit testing infrastructure. This approach works best when data is mostly static and you do not expect major changes over time. However, the more common case is that your production asset needs to be modified, debugged, or extended frequently in response to changing data. This often entails exploration back in a notebook. Better still would be to skip the back-and-forth.

Directly productionizing a notebook has several advantages compared with re-writing. Specifically:

  1. Test your data and your code together. Unit testing verifies business logic, but what about errors in data? Testing directly in notebooks simplifies checking business logic alongside data representative of production, including runtime checks related to data format and distributions.
  2. A much tighter debugging loop when things go wrong. Did your ETL job fail last night? A typical cause is unexpected input data, such as corrupt records, unexpected data skew, or missing data. Debugging a production job often requires debugging production data. If that production job is a notebook, it’s easy to re-run some or all of your ETL job, while being able to drop into interactive analysis directly over the production data causing problems.
  3. Faster evolution of your business logic. Want to try a new algorithm or statistical approach to an ML problem? If exploration and deployment are split between separate codebases, any small changes require prototyping in one and productionizing in another, with care taken to ensure logic is replicated properly. If your ML job is a notebook, you can simply tweak the algorithm, run a parallel copy of your training job, and move to production with the same notebook.

“But notebooks aren’t well suited to testing, modularity, and CI!” – you might say. Not so fast! In this article, we outline how to incorporate such software engineering best practices with Databricks Notebooks. We’ll show you how to work with version control, modularize code, apply unit and integration tests, and implement continuous integration / continuous delivery (CI/CD). We’ll also provide a demonstration through an example repo and walkthrough. With modest effort, exploratory notebooks can be adjusted into production artifacts without rewrites, accelerating debugging and deployment of data-driven software.

Version Control and Collaboration

A cornerstone of production engineering is to have a robust version control and code review process. In order to manage the process of updating, releasing, or rolling back changes to code over time, Databricks Repos makes integrating with many of the most popular Git providers simple. It also provides a clean UI to perform typical Git operations like commit, pull, and merge. An existing notebook, along with any accessory code (like python utilities), can easily be added to a Databricks repo for source control integration.

Managing version control in Databricks Repos

Having integrated version control means you can collaborate with other developers through Git, all within the Databricks workspace. For programmatic access, the Databricks Repos API allows you to integrate Repos into your automated pipelines, so you’re never locked into only using a UI.

Modularity

When a project moves past its early prototype stage, it is time to refactor the code into modules that are easier to share, test, and maintain. With support for arbitrary files and a new File Editor, Databricks Repos enable the development of modular, testable code alongside notebooks. In Python projects, modules defined in .py files can be directly imported into the Databricks Notebook:

Importing custom Python modules in Databricks Notebooks

Developers can also use the %autoreload magic command to ensure that any updates to modules in .py files are immediately available in Databricks Notebooks, creating a tighter development loop on Databricks. For R scripts in Databricks Repos, the latest changes can be loaded into a notebook using the source() function.
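A small sketch of that inner loop, assuming an IPython-based Databricks Runtime and a hypothetical utils/cleaning.py module checked into the same repo:

# Enable the IPython autoreload extension so edits to .py files are picked up automatically
%load_ext autoreload
%autoreload 2

# Import a function from a module in the repo (module and function names are hypothetical)
from utils.cleaning import standardize_columns

df_clean = standardize_columns(spark.table("bronze.raw_events"))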

Code that is factored into separate Python or R modules can also be edited offline in your favorite IDE. This is particularly useful as codebases become larger.

Databricks Repos encourages collaboration through the development of shared modules and libraries instead of a brittle process involving copying code between notebooks.

Unit and Integration Testing

When collaborating with other developers, how do you ensure that changes to code work as expected? This is achieved through testing each independent unit of logic in your code (unit tests), as well as the entire workflow with its chain of dependencies (integration tests). Failures of these types of test suites can be used to catch problems in the code before they affect other developers or jobs running in production.

To unit test notebooks using Databricks, we can leverage typical Python testing frameworks like pytest to write tests in a Python file. Here is a simple example of unit tests with mock datasets for a basic ETL workflow:

Python file with pytest fixtures and assertions
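A hedged sketch of what such a test file might look like, assuming a hypothetical add_total_column transform under test:

# test_etl.py
import pytest
from pyspark.sql import SparkSession

# Hypothetical function under test: adds a `total` column = price * quantity
from etl.transforms import add_total_column


@pytest.fixture(scope="session")
def spark():
    # Reuses the active session on Databricks; builds a local one elsewhere
    return SparkSession.builder.getOrCreate()


@pytest.fixture
def mock_orders(spark):
    # A tiny mock dataset standing in for production data
    return spark.createDataFrame(
        [(1, 10.0, 2), (2, 5.0, 3)],
        ["order_id", "price", "quantity"],
    )


def test_add_total_column(mock_orders):
    result = add_total_column(mock_orders)
    totals = {row["order_id"]: row["total"] for row in result.collect()}
    assert totals == {1: 20.0, 2: 15.0}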

We can invoke these tests interactively from a Databricks Notebook (or the Databricks web terminal) and check for any failures:

Invoking pytest in Databricks Notebooks
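One hedged way to do this from a notebook cell is to call pytest programmatically against the repo's test directory (the path shown is hypothetical):

import sys
import pytest

# Avoid stale bytecode when tests change between runs
sys.dont_write_bytecode = True

# Run the repo's tests and fail the notebook if any test fails
retcode = pytest.main(["-v", "/Workspace/Repos/me@example.com/my-project/tests"])
assert retcode == 0, "Some tests failed"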

When testing our entire notebook, we want to execute without affecting production data or other assets – in other words, a dry run. One simple way to control this behavior is to structure the notebook to only run as production when specific parameters are passed to it. On Databricks, we can parameterize notebooks with Databricks widgets:

# get parameter
is_prod = dbutils.widgets.get("is_prod")

# only write table in production mode
if is_prod == "true":
    df.write.mode("overwrite").saveAsTable("production_table")

The same results can be achieved by running integration tests in workspaces that don’t have access to production assets. Either way, Databricks supports both unit and integration tests, setting your project up for success as your notebooks evolve and the effects of changes become cumbersome to check by hand.

Continuous Integration / Continuous Deployment

To catch errors early and often, a best practice is for developers to frequently commit code back to the main branch of their repository. There, popular CI/CD platforms like GitHub Actions and Azure DevOps Pipelines make it easy to run tests against these changes before a pull request is merged. To better support this standard practice, Databricks has released two new GitHub Actions: run-notebook to trigger the run of a Databricks Notebook, and upload-dbfs-temp to move build artifacts like Python .whl files to DBFS where they can be installed on clusters. These actions can be combined into flexible multi-step processes to accommodate the CI/CD strategy of your organization.

In addition, Databricks Workflows are now capable of referencing Git branches, tags, or commits:

Job configured to run against main branch

This simplifies continuous integration by allowing tests to run against the latest pull request. It also simplifies continuous deployment: instead of taking an additional step to push the latest code changes to Databricks, jobs can be configured to pull the latest release from version control.

Conclusion

In this post we have introduced concepts that can elevate your use of the Databricks Notebook by applying software engineering best practices. We covered version control, modularizing code, testing, and CI/CD on the Databricks Lakehouse platform. To learn more about these topics, be sure to check out the example repo and accompanying walkthrough.



The Emergence of the Composable Customer Data Platform


This is a collaborative post between Databricks, Hightouch, and Snowplow. We thank Martin Lepka (Head of Industry Solutions at Snowplow) and Alec Haase (Product Evangelist at Hightouch) for their contributions.

 

There is no denying that one of the greatest assets to the modern digital organization is first-party customer data. The rapid rise of the privacy-centric consumer has led to a monumental shift away from third-party tracking methods. Organizations are now scrambling to implement a data infrastructure that, leveraging first-party data, can enable the personalized experiences that customers expect with every interaction.

Companies that want to engage customers effectively must build actionable intelligence on top of their first-party customer data. Actionable intelligence means improving their customer relationships, building consumer trust to capture rich first-party data, and utilizing that data to build intelligence that can be activated to continuously optimize the customer experience.

Historically, building a rich, behavioral data set of customer interest and intent was difficult. Using that data to make meaningful inferences and predictions about individual customers at scale was even more challenging. And activating that intelligence across a myriad of customer touchpoints and marketing channels seemed almost unachievable.

Many companies turn to customer data platforms (CDPs) to help overcome these significant challenges. While cloud CDPs offer an off-the-shelf solution to collecting, cleaning, and activating customer data, adopting organizations have long struggled with their rigid data models, long onboarding times, and data redundancy across analytics and marketing tools. The CDP Institute’s latest survey found that only “58% of companies with a deployed CDP say it delivers significant value” – leaving much to be desired from these packaged CDP solutions.

Thanks to the rise of the modern data stack and the emergence of products like Snowplow, Databricks, and Hightouch that offer best-in-breed alternatives to each component of an off-the-shelf CDP, the Composable CDP has become a category-leading approach for modern organizations to manage their first-party data strategies.

In this post, we will discuss off-the-shelf CDPs, the rise of the Composable CDP, and how the latter can add transformational value to any modern organization.

What is a Customer Data Platform?

The definition of Customer Data Platforms has evolved numerous times since their inception in 2013. Gartner currently defines a CDP as a “software application that supports marketing and customer experience use cases by unifying a company’s customer data from marketing and other channels. CDPs optimize the timing and targeting of messages, offers and customer engagement activities, and enable the analysis of individual-level customer behavior over time.”

The components of standard CDP offerings can be classified into the following categories:

  • Data Collection: CDPs are designed to collect customer events from a number of different sources (onsite, mobile applications and server-side) and append these activities to the customer profile. These events typically contain metadata to provide detailed context about the customer’s specific digital interactions. Event collection is typically designed to support marketing use cases such as marketing automation.
  • Data Storage and Modeling: CDPs provide a proprietary repository of data that aggregates and manages different sources of customer data collected from most of the business’s SaaS and internal applications. The unified database is a 360-degree view of each customer and a central source of truth for the business. Most CDPs have out-of-the-box identity stitching functionality and tools to create custom traits on user profiles.
  • Data Activation: CDPs offer the ability to build audience segments leveraging the data available in the platform. Thanks to a wide array of pre-built integrations, these audiences and other customer data points can then be synced to and from various marketing channels.

Evolutions of CDPs

The term CDP was first introduced to the market by David Raab, a marketing technology consultant and industry analyst, in an April 2013 piece titled “I’ve Discovered a New Class of System: the Customer Data Platform. Causata Is An Example.”

Tag management platforms were amongst the first to adopt this early definition of CDPs. In 2012, Google launched Google Tag Manager, a free tool allowing brands to manage their client-side web tracking. With a behemoth like Google releasing a free tool, household names in the tag management space no longer held the same value. They had to pivot their technology, focusing on data collection, a consolidated customer profile and activation through their integrations.

CDPs were at the time seen as a solution to help brands build towards a single view of their customer and activate the data either through integrations with other tools or natively within their own tool’s platform.

So why aren’t off-the-shelf CDPs the solution for every business? A root cause for many of the challenges that organizations face with their CDP implementations is that off-the-shelf solutions often promote the idea of sending data directly to the CDP. As a result, data engineers become frustrated with having to use data engineering tools that are native to the CDP, analysts and domain experts become concerned about having to manage audience segments in multiple places, and data scientists question how they will be able to use the value derived from the CDP for adjacent use cases, such as content optimization.

Compounding these issues, the CDP becomes a siloed copy of an organization’s critical asset – customer data. The solution to these challenges is to think of the CDP as an extension of your broader data management strategy. In other words, extend the capabilities of your existing lakehouse to support additional use cases, instead of sending copies of your data to multiple places.

Introducing the Composable CDP

A Composable CDP consists of the same components as its off-the-shelf counterparts: Data Collection, Data Storage and Modeling, and Data Activation. By implementing a best-in-class product at each layer of the Composable CDP, organizations can achieve a far more extensible CDP solution that can solve problems well beyond the common use cases of off-the-shelf CDPs. Understanding each of these components allows teams to make the most informed architecture decisions when implementing their own Composable CDP.

Composable CDP with Snowplow, Databricks, and Hightouch

Behavioral Data Creation (Snowplow)

Behavioral data creation is the underlying foundation of the Composable CDP, providing a platform to power personalized digital experiences for your customers. Snowplow’s Behavioral Data Platform empowers data teams to manage end-to-end behavioral data creation, delivering BI- and AI-ready data that is well-structured, reliable, consistent, accurate, explainable, and compliant, directly to your lakehouse.

To build a true single customer view, you need to create data from across your digital platforms. With Snowplow, data teams can define their own version-controlled custom events and entity schemas as part of Snowplow’s Universal Data Language. Each event can also be enriched with an unlimited number of entities and properties, creating data bespoke to your business and providing unlimited opportunities for activation.

Add an unlimited number of entities and properties to enrich your event data

Accurate and compliant identification of users is the cornerstone of any CDP. With first-party user identifiers included with each event and in-stream privacy tooling, including PII pseudonymization, you have a complete, and compliant view of every customer interaction.

With data generated and enriched, behavioral data must be modeled with activation in mind. With Snowplow’s private deployment model and native connector to Databricks, your unified event stream lands in Delta Lake in real time and is modeled at the interaction, session, and user level, creating a single view of your customers’ behavior without data wrangling – ready for identity stitching with other data sources.

Storage and Modeling (Databricks)

Creating and maintaining a single view of the customer delivers a tremendous amount of value for organizations big and small. Whether it’s used by a marketing team to facilitate cross-sell/upsell opportunities or a product team to personalize the user experience, the value of this asset can and should be realized across all organizational boundaries. This can be achieved by using the Databricks Lakehouse Platform as the storage and modeling layer of your Composable CDP. The benefit of this approach is three-fold.

First, the Databricks Lakehouse Platform natively supports any type of data, whether it’s batch or streaming, structured or unstructured. This means having a common way to work with all of your data, whether it’s clickstream data streaming in from Snowplow, marketing data that is updated in batches from Fivetran, unstructured text data from a customer service tool, such as Zendesk, or otherwise.

Second, all of this data that you’re managing in your lakehouse can be used directly for BI and ML/AI. For example, using Databricks SQL, return on ad spend for marketing campaigns can be easily analyzed using your BI tool of choice. Likewise, with MLflow and AutoML natively integrated into Databricks, data scientists can easily train and productionize models, such as propensity to churn, and then use the output of those models for activation via Hightouch.
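
As a rough sketch of the kind of analysis this enables (the table and column names below are hypothetical), return on ad spend can be computed directly over lakehouse tables from a notebook:

# Assumes a Databricks notebook, where `spark` and `display` are available
roas_df = spark.sql("""
    WITH spend AS (
        SELECT campaign_id, SUM(spend_usd) AS total_spend
        FROM marketing.ad_spend
        GROUP BY campaign_id
    ),
    revenue AS (
        SELECT campaign_id, SUM(revenue_usd) AS total_revenue
        FROM sales.attributed_revenue
        GROUP BY campaign_id
    )
    SELECT s.campaign_id,
           r.total_revenue / s.total_spend AS return_on_ad_spend
    FROM spend s
    JOIN revenue r ON s.campaign_id = r.campaign_id
""")
display(roas_df)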

Lastly, because the Databricks Lakehouse Platform is simple, open, and collaborative, the single view of the customer can be managed and governed in a consistent and scalable way, for present and future use cases alike.

Data Activation (Hightouch)

Data Activation is the final piece of the Composable CDP. All business teams, from sales and marketing, to support and customer success, need relevant, accurate, and near real-time customer data to add critical context inside the software they already use.

Leveraging a technology coined “Reverse ETL,” Data Activation platforms stream data from the data lakehouse to any business application. Whether you’re enhancing communications with customers via CRM, optimizing ad spend with audience targeting, or personalizing email/SMS campaigns, Hightouch makes your data actionable – no scripts or APIs required.

Define the data you want to sync with SQL or a no-code audience builder

Thanks to Hightouch’s easy to use audience and trait builders, marketing teams are able to centrally manage all of their cross-channel personalization and targeting efforts within the platform. Teams can connect Databricks as a Hightouch source, and in minutes, take action on their lakehouse data in any of their various downstream marketing tools.

“Data Activation is the method of unlocking the knowledge stored within your lakehouse, and making it actionable by your business users in the end tools that they use every day. In doing so, Data Activation helps bring data people toward the center of the business, directly tying their work to business outcomes.”
– Tejas Manohar, Co-founder of Hightouch

Benefits of a Composable CDP

A Composable CDP built by harnessing the power of best-in-class tooling offers four key benefits over an off-the-shelf CDP:

Better Data Governance

In today’s privacy-conscious world and with ever-evolving data legislation, taking ownership and having full control of your customer data is paramount. Rather than having an off-the-shelf CDP manage all of your customer data, a Composable CDP provides you with full transparency, assurance, and auditability at each step of your customer data architecture.

Controlling what personally identifiable information is collected, how data is stored and modeled, and what data is shared with your marketing partners ensures that you can comply with GDPR, CCPA, and future legislation.

Better Results With Better Data Quality

Advanced personalization and segmentation of your campaigns rely on a consistent source of well-structured, reliable, accurate, explainable, and compliant behavioral data describing what customers are doing minute-by-minute. With a Composable CDP, you can determine the events and entities that match your business and decide how your data is modeled for activation.

Although behavioral data can be exported from an off-the-shelf CDP, in reality those data models were never intended to be used outside the platform. CDP data exports arrive in irregular table structures that require complex joins and transformations before the data can be activated.

With a Composable CDP, data science teams can directly leverage the behavioral data in your lakehouse, along with Databricks’ enormous data processing capability, to build AI models specific to your data, product, or business goal instead of relying on the black box models offered by off-the-shelf CDPs. With greater model accuracy, you can create additional opportunities and revenue from your campaigns.

Future Proof and Modular by Design

Composable CDPs are future-proof by design, allowing you to avoid the vendor lock-in and one-size-fits-all approach associated with off-the-shelf CDPs. With every element in a Composable CDP modular, you can choose the best-in-class collection, storage, modeling, and activation tools that fit the requirements of each of your teams. As the requirements of the business evolve, you can continue to invest on top of your Composable CDP as opposed to implementing a new stack from scratch, which carries high risk and cost to the business.

With a modular design, you also have the flexibility to determine your approach to identity resolution to ensure your team is able to deliver accurate and compliant marketing campaigns. Your business has complete control over how and when to stitch together user identities – leveraging every customer data point available.

Single Source of Truth Across Marketing and Other Teams

Instead of adding another data silo to the tech stack, teams can do more with the single source of truth they already have, their lakehouse. With the lakehouse as the single source of truth for the Composable CDP, all teams have access to the most comprehensive customer profiles and insights from across the business and can activate it through Hightouch with an easy-to-use UI and workflow.

The single source of truth also has applications beyond marketing; it can power other use cases ranging from internal reporting to product analytics.

Getting Started

As we’ve seen, implementing an off-the-shelf CDP can be challenging, especially at an enterprise level. Because CDPs revolve around their customer database, product engineering teams must implement data collection by tracking user traits and events across various websites, backend services, and apps via CDP APIs and SDKs. Implementation can often take 3-6 months before marketing efforts can even begin.

The Composable CDP allows you to solve the most important problem in front of you incrementally, enabling you to choose the best solution and components for your business. You can educate yourself throughout the process, and you can swap out specific components down the line when your needs change or when a particular tool isn’t “cutting it.”

Thanks to the partnerships of Snowplow, Databricks and Hightouch, getting started with a Composable CDP has never been easier. Snowplow’s new Databricks Loader allows you to load your behavioral data directly into Delta Lake on Databricks. To get started with Snowplow and Databricks, contact us or deploy Snowplow Open Source. And with the launch of Hightouch on Databricks Partner Connect, Databricks customers can establish a secure Hightouch integration in just a few clicks.

Further instructions on how to get started with each of the market-leading Composable CDP tools can be found below. Additionally, you can attend Data + AI Summit 2022 in-person in San Francisco, June 27-30, and talk to experts at booth 323 (Hightouch) or 832 (Snowplow).



Inspiring Innovation With Data & AI: Announcing the Finalists for the 2022 Databricks Data Team Visionary Award


The annual Databricks Data Team Awards recognize data teams who are harnessing the power of data and AI to deliver solutions for some of the world’s toughest problems.

Nearly 250 teams were nominated across six categories from all industries, regions, and companies – all with impressive stories about the work they are doing with data and AI. As we lead up to Data and AI Summit, we have been showcasing the finalists in each category.

The Data Team Visionary Award recognizes the leaders who embody innovation and impact in data and AI — producing amazing results within their organization and inspiring the global data community.

Meet the four finalists for the Data Team Visionary Award category:

American Airlines – Poonam Mohan, Vice President – Information Technology
Poonam Mohan, Vice President – Information Technology at American Airlines, has set a charter for her team to always strive to deliver the right data at the right time to the business and customers, enabling faster decision-making that benefits millions of travelers. To achieve this goal, she has enabled the data team to migrate the on-premises data warehouse and big data platform to the cloud, scale up the data science community at American, and adopt cloud technologies like the Databricks Lakehouse Platform. By unifying diverse data (flight search, booking, ticketing and check-in data), the data team has enabled ML use cases that improve daily operations. With the help of real-time booking and check-in data, team members were able to solve various business problems and also identify business opportunities. Under her leadership, the data team at American has achieved many milestones, including assembling the DataOps team, leveraging the Lakehouse for data ingestion and ML, and establishing the DATA Academy for self-paced learning for its developer and user community.

Shell – Dan Jeavons, VP Computational Science & Digital Innovation & IT CTO
Dan Jeavons, VP Computational Science and Digital Innovation at Shell, is driving the firm’s digital innovation agenda. Dan has taken on Shell’s challenge to achieve net-zero by 2050 and a 50% reduction in scope 1&2 emissions by 2030 and is using digital technologies like Databricks Lakehouse to help make it happen. The Lakehouse is powering a number of use cases – from renewable asset management to predictive maintenance to providing retail customers with rewards and incentives to reduce their carbon footprint – giving Shell some early wins, with data insights derived from CO2 monitoring systems reducing emissions across a variety of use cases. The Lakehouse architecture has now brought together data from across Shell’s businesses – covering everything from LNG trains to emerging energy & chemicals parks to wind farms and solar parks. This consistent data layer now holds in excess of 2.9 trillion rows of curated data and supports the development of a variety of solutions, from self-service dashboards and digital tools to advanced AI solutions. To accelerate the development of digital solutions, Dan has also brought his leadership to the industry at large, collaborating with other visionaries to create an open ecosystem for AI-based decarbonization solutions that all energy providers can collectively benefit from.

Warner Bros. Discovery – Duan Peng, SVP of Global Data & AI
Duan Peng, the SVP of Global Data & AI at Warner Bros. Discovery, is leading the charge in leveraging data and AI to create next-level viewing experiences for HBO Max and its global audience. Duan and her team built their direct-to-consumer streaming service from the ground up with the goal of delivering the best possible customer experience through personalized content and recommendations. Through her efforts, the use of data to continually improve the HBO Max service has seen an incredible amount of growth. Now, with Databricks Lakehouse, Duan is working on upgrading HBO Max’s data and machine learning infrastructure in order to effectively scale the platform’s capabilities from the US market to a global one — her team recently launched HBO Max in 39 countries across Latin America, and is working on rolling it out to Europe and Asia Pacific next. This is just the beginning of Duan’s vision for what her team can accomplish, as she works towards bringing HBO Max’s high-quality streaming entertainment to the rest of the world.

Walgreens Boots Alliance – Luigi Guadagno, GVP, Pharmacy Healthcare Technology Platform
Luigi Guadagno, GVP, Pharmacy Healthcare Technology Platform, at Walgreens Boots Alliance (WBA) is the platform leader who has championed the company’s efforts to personalize pharmacy care using data and AI. Luigi, and the Walgreens data team, have guided the company through its cloud-first, data architecture modernization, so that they can now deliver real-time healthcare insights to pharmacists on the front lines, and ensure the right medications are always on shelves when patients need them across 9,000 locations. Core to WBA and the Pharmacy Healthcare Technology Platform team’s vision is the Databricks Lakehouse Platform, which ingests around 200,000 transactions per second while enabling all its users (e.g., engineers, data scientists, power users, and more) to understand, analyze and create insights at an unparalleled scale and speed. WBA can now activate its data to better understand how best to engage and communicate with patients, from sharing vital information about medications to providing guidance on their journey towards a healthier life. Under Luigi’s leadership, the WBA data team has saved millions of dollars from more efficient infrastructure, higher productivity and value from use cases including inventory optimization and Covid vaccination reporting.

Check out the award finalists in the other five categories and come raise a glass and celebrate these amazing data teams during an award ceremony at the Data and AI Summit on June 29.


Announcing the Winners for the 2022 Databricks Data Team Awards


The annual Databricks Data Team Awards recognize data teams who are harnessing the power of data and AI to deliver solutions for some of the world’s toughest problems.

Nearly 250 teams were nominated across six categories from all industries, regions, and companies – all with impressive stories about the work they are doing with data and AI. We are proud to recognize and celebrate 29 finalists who are driving innovation, transforming the way organizations operate, driving CoEs to deliver data across the business, contributing to open source projects, leading the way as visionaries, and using data to make a positive impact on the world. It was hard to choose just six winners!

This year’s Data Team Award winners include Centers for Disease Control and Prevention, Karius, Ophelos, T-Mobile, Toyota, and Walgreens. Hear how these organizations are using data in very different and unique ways to do incredible work.

Congratulations to the 2022 Data Team Award winners:

Data Team Transformation Award: Toyota

Toyota’s mission is to “continuously strive to transform the very nature of movement”. For the company to deliver on that promise, it is moving forward with an aggressive commitment to sustainability – not only by shifting its focus to electrified vehicles, but also by changing how those vehicles are manufactured to achieve carbon neutrality by 2035. The data team at Toyota is using Databricks Lakehouse to help power this change, moving away from its legacy on-premises data warehouse to a cloud-based unified platform for data and AI. With the Lakehouse ingesting and standardizing petabytes of data, Toyota is able to use machine learning and advanced analytics to analyze trillions of batch and real-time records to optimize manufacturing processes, predict energy demand, improve the utilization of renewable energy sources, and identify opportunities to further reduce its carbon footprint. With the Lakehouse, Toyota is tackling the challenge of decarbonization head-on, while also solving a range of other problems – from supply chain and revenue forecasting to quality assurance – to make sure that the company never stops moving while surpassing customers’ expectations.

Finalists: Compass, H&R Block, Providence, Samsung Electronics

Data Team Democratization Award: Centers for Disease Control and Prevention

The Centers for Disease Control and Prevention (CDC) has been on the frontlines guiding communities, governments, and healthcare workers in response to the COVID-19 pandemic. Throughout this time, data and AI have played a critical role in delivering fast insights across the U.S. and helping to save lives. The Databricks Lakehouse has empowered the CDC to democratize data at massive scale — ingesting high volumes of all kinds of data on CDC’s Enterprise Analytics and Visualization (EDAV) platform. The lakehouse paradigm was implemented at the CDC for COVID-19 vaccine data coming in from states and federal agencies (at a pace of 5+ million new records per day) and for sharing vaccination and mortality rate metrics with cities, states, the White House, and the general public so that they can make more informed decisions at the local, regional, and national level. These decisions included when to reopen businesses, enforce mask mandates, close schools, and more. Through the democratization of data and its unification with analytics, the CDC has been able to deliver on many more use cases to inform people within the US of current health situations and provide the government and general public with the actionable insights needed to ensure the highest levels of health within the US.

Finalists: Conde Nast, Corning, Sam’s Club, The Gap

Data Team for Good Award: Karius

Karius has developed a liquid biopsy test for infectious diseases, using innovations across chemistry, data, and AI, to non-invasively detect over 1,000 pathogens from a single blood sample. The Karius Test, offered to hundreds of hospitals across the country, can help decrease the time and effort it takes clinicians to accurately diagnose an infection, without the need for an invasive diagnostic procedure or the application of slower, less-effective methods like a blood culture. To go beyond the diagnosis of an infection in a single patient, Karius is leveraging Databricks Lakehouse to unlock the promise of a new data type — microbial cell-free DNA — with AI to “see” patterns across infections, expanding from a few pathogens to the wider microbial landscape. The new capability allows Karius to identify novel biomarkers connecting microbes to opportunities across human health and disease. Furthermore, the organization has super-charged its biomarker discovery platform by developing a de-identified clinicogenomics database, which connects Karius molecular data to clinical data, empowering scientists and physicians to better interpret the patterns. Karius is now looking to apply its new data and AI capabilities beyond infectious disease, including opportunities across oncology, autoimmune disease, and response to therapy.

Finalists: Cognoa, National Heavy Vehicle Regulator, Regeneron Genetics Center, US DoD Chief Data and Artificial Intelligence Office, ADVANA Program

Data Team Disruptor Award: Ophelos

Ophelos is using Databricks Lakehouse Platform to power its AI and machine learning efforts to disrupt the traditionally antiquated and hostile debt collection industry and turn it into one that’s compassionate, flexible, automated and preventative, via the Ophelos Debt Resolution Platform. The company created OLIVE (Ophelos Linguistic Identification of Vulnerability), a cutting-edge natural language processing (NLP) model that predicts the likelihood that a customer is vulnerable and identifies the possible causes. Ophelos is also addressing customer service efficiency and customer experience through the Ophelos Decision Engine, an ML-powered solution that automatically calculates the long-term effects of each action, and then creates bespoke communication strategies for each individual customer. All of this data is collected anonymously in a real-time analytics dashboard to ensure businesses truly understand their customers and how they can help.

Finalists: Grammarly, PicPay, Pumpjack Dataworks, Rivian Automotive

Data Team OSS Award: T-Mobile

T-Mobile’s mission is to build the nation’s best 5G network while reducing customer pain points every day. To meet the Un-carrier’s aggressive build plans and customer-focused goals, they embarked on a digital transformation — relying on their data to optimize back-office business processes, streamline network builds, mitigate fraud, and improve the overall experience for the enterprise’s business teams. At the heart of their data strategy is the lakehouse architecture and Delta Lake — democratizing access to data for BI and ML workloads at the speed of business. As valuable members of the Delta Lake community, they have been pushing the boundaries of Delta Lake to solve their toughest data problems – from optimizing their procurement and supply chain process, ensuring billions of dollars of cell-site equipment is at the right place at the right time, to streamlining internal initiatives that better engage customers, save money, and drive revenue.

Finalists: Apple, Back Market, Samba TV, Scribd

Data Team Visionary Award: Walgreens Boots Alliance – Luigi Guadagno

Luigi Guadagno, GVP, Pharmacy Healthcare Technology Platform, at Walgreens Boots Alliance (WBA) is the platform leader who has championed the company’s efforts to personalize pharmacy care using data and AI. Luigi, and the Walgreens data team, have guided the company through its cloud-first, data architecture modernization, so that they can now deliver real-time healthcare insights to pharmacists on the front lines, and ensure the right medications are always on shelves when patients need them across 9,000 locations. Core to WBA and the Pharmacy Healthcare Technology Platform team’s vision is the Databricks Lakehouse Platform, which ingests around 200,000 transactions per second while enabling all its users (e.g., engineers, data scientists, power users, and more) to understand, analyze and create insights at an unparalleled scale and speed. WBA can now activate its data to better understand how best to engage and communicate with patients, from sharing vital information about medications to providing guidance on their journey towards a healthier life. Under Luigi’s leadership, the WBA data team has saved millions of dollars from more efficient infrastructure, higher productivity and value from use cases including inventory optimization and Covid vaccination reporting.

Finalists: American Airlines – Poonam Mohan, Shell – Dan Jeavons, Warner Bros. Discovery – Duan Peng

Let’s raise a glass

Check out the award finalists in the other five categories and come raise a glass and celebrate these amazing data teams during an award ceremony at the Data and AI Summit on June 29 at 5:30 p.m. at the Expo Stage.


Databricks’ 2022 Global Partner Awards


Databricks has a partner ecosystem with over 600 partners globally that are critical to building and delivering the best data and AI solutions in the world for our joint customers. We are proud of this collaboration and know it’s the result of the mutual commitment and investment that spans ongoing training, solution development, field programs, and workshops for customers.

The 2022 Global Partner Awards recognize Databricks’ partners for their exceptional accomplishments and joint collaboration with Databricks, as they brought deep industry expertise, technology skills and impactful solutions to customers all over the world. In addition to awarding partners for the overall impact they have on the ecosystem, we asked them to share their proudest customer success moments with us across innovation, transformation, and industry engagement. We received an overwhelming response from our partners that shows their unwavering commitment to our customers’ success.

Presented at this year’s Partner Summit on June 27th, the winners were recognized across 19 categories. These partners demonstrated the importance of being multi-cloud, creating repeatable industry solutions, and leveraging the functionality of the Databricks platform to power the future of data and AI.

Consulting and System Integrator Partner Winners

Global Partner of the Year: Accenture and Avanade
AMER Partner of the Year: Wipro
APJ Partner of the Year: Celebal
EMEA Partner of the Year: Avanade
LATAM Partner of the Year: BlueShift
Transformation Partner of the Year: Deloitte
Innovation Partner of the Year: Lovelytics
Financial Services Partner of the Year: Accenture
Communications, Media and Entertainment Partner of the Year: Slalom
Manufacturing Partner of the Year: Capgemini
Public Sector Partner of the Year: Deloitte
Retail CPG Partner of the Year: Tredence
Health and Life Sciences Partner of the Year: Accenture

Technology Partner Winners

ISV: BI Partner of the Year: Tableau
ISV: Data Ingestion Partner of the Year: Fivetran
ISV: Data Transformation Partner of the Year: dbt Labs
ISV: Data Governance Partner of the Year: Collibra
ISV: ML/AI Partner of the Year: Tecton
Data Sharing Partner of the Year: Safegraph

Partner Champions

Partner Champions are the top technical evangelists in our partner community, which has grown leaps and bounds this year. This year, three Partner Champions were recognized for their excellent evangelism of the Databricks Lakehouse Platform, through both customer implementations and community support. They are driving Databricks adoption for our largest customers, standardizing on lakehouse across their practices’ architectures, and developing strong Centers of Excellence.

  • Paul Huynh, Solutions Architect Data and Artificial Intelligence, Avanade
  • Panagiotis Gouskos, Senior Manager, Capgemini
  • Prakash Trivedi, Senior Manager, Accenture

Thank you to the entire community of our 600+ partners! We look forward to even more engagement, collaboration and growth together in the new year. To learn more about the Databricks partner ecosystem, click here.


Databricks SQL Serverless Now Available on AWS


We are excited to announce the availability of serverless compute for Databricks SQL (DBSQL) in Public Preview on AWS today at the Data + AI Summit! DBSQL Serverless makes it easy to get started with data warehousing on the lakehouse. Serverless frees up time, lowers costs, and allows you to focus on delivering the most value to your business rather than managing infrastructure.

Databricks SQL Serverless for improved performance at lower cost

Databricks SQL Serverless helps address challenges customers face with compute, management, and infrastructure costs:

  • Instant and elastic: Serverless compute brings a truly elastic, always-on environment that’s instantly available and scales with your needs. You’ll benefit from simple usage-based pricing, without worrying about idle time charges. Imagine no longer needing to wait for clusters to become available to run queries or overprovisioning resources to handle spikes in usage. Databricks SQL Serverless dynamically grows and shrinks resources to handle whatever workload you throw at it.
  • Eliminate management overhead: Serverless transforms DBSQL into a fully managed service, eliminating the burden of capacity management, patching, upgrading, and performance optimization of the cluster. You only need to focus on your data and the insights it holds. Additionally, the simplified pricing model means there’s only one bill to track and only one place to check and attribute costs.
  • Lower infrastructure cost: Under the covers, the serverless compute platform uses machine learning algorithms to provision and scale compute resources right when you need them. This enables substantial cost savings without the need to manually shut down clusters. Customers such as Scribd have found that adopting serverless allowed them to increase utilization of their SQL warehouses and significantly reduce infrastructure cost.

“We rely on Databricks SQL to power the business intelligence tools used by our analysts. Databricks SQL Serverless allows us to use the power of Databricks SQL while being much more efficient with our infrastructure. Checking the serverless box resulted in a 3x reduction in our infrastructure costs, which is why we’re integrating Databricks SQL Serverless into more and more data pipelines.”
– R Tyler Croy, Director of Platform Engineering, Scribd

In fact, in our internal tests we found Databricks SQL Serverless to have the best price-performance compared to traditional data warehouses.

Source – 2022 Cloud Data Warehouse Benchmark Report; Databricks research

With the serverless platform, we support enterprise-grade security features such as private network connectivity for blob storage and customer-managed keys for encrypting data at rest, which allow you to bring your sensitive production workloads while maintaining your organization’s governance controls.

Getting Started

If you are on AWS, ask your admin to enable Serverless from the account console and create serverless SQL warehouses (formerly known as endpoints). You can also convert an existing SQL warehouse to leverage serverless compute by simply toggling the serverless option in the warehouse settings page. To learn more visit the Serverless compute documentation page.
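
For teams that prefer automation over the UI, a serverless warehouse can also be created programmatically. The sketch below assumes the SQL Warehouses REST API and a personal access token; check the API documentation for the authoritative endpoint and field names:

import os
import requests

# Assumes serverless has been enabled for the workspace in the account console
host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/api/2.0/sql/warehouses",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "serverless-bi-warehouse",   # hypothetical warehouse name
        "cluster_size": "Small",
        "auto_stop_mins": 10,
        "enable_serverless_compute": True,   # the serverless toggle
    },
)
resp.raise_for_status()
print(resp.json())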

If you are an Azure customer, please submit your request and we will onboard you as soon as Databricks SQL Serverless for Azure Databricks becomes available.


Enabling serverless compute at account console


Creating and managing serverless warehouses on Databricks SQL


Project Lightspeed: Faster and Simpler Stream Processing With Apache Spark


Streaming data is a critical area of computing today. It is the basis for making quick decisions on the enormous amounts of incoming data that systems generate, whether web postings, sales feeds, or sensor data. Processing streaming data is also technically challenging, with requirements that are far different from, and more complicated to meet than, those of event-driven applications and batch processing.

To meet these stream processing needs, Structured Streaming was introduced in Apache Spark™ 2.0. Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. The user can express the logic using SQL or the Dataset/DataFrame API, and the engine takes care of running the pipeline incrementally and continuously, updating the final result as streaming data continues to arrive. Structured Streaming has been the mainstay for several years and is widely adopted across thousands of organizations, processing more than 1 PB of data (compressed) per day on the Databricks platform alone.
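
As a minimal sketch of what this looks like in practice (the input path and output table below are hypothetical), the same DataFrame operations used in batch drive an incremental pipeline:

from pyspark.sql import functions as F

# Incrementally read JSON events as they land in cloud storage
events = (
    spark.readStream
    .format("json")
    .schema("user_id STRING, action STRING, ts TIMESTAMP")
    .load("/data/events/")
)

# The same DataFrame logic you would write for a batch job
clicks_per_user = (
    events.filter(F.col("action") == "click")
          .groupBy("user_id")
          .count()
)

# Continuously update a Delta table as new data arrives
query = (
    clicks_per_user.writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/chk/clicks_per_user")
    .toTable("clicks_per_user")
)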

As the adoption accelerated and the diversity of applications moving into streaming increased, new requirements emerged. We are starting a new initiative codenamed Project Lightspeed to meet these requirements, which will take Spark Structured Streaming to the next generation. The requirements addressed by Lightspeed are bucketed into four distinct categories:

  • Improving the latency and ensuring it is predictable
  • Enhancing functionality for processing data with new operators and APIs
  • Improving ecosystem support for connectors
  • Simplifying deployment, operations, monitoring and troubleshooting

In this blog, we will discuss the growth of Spark Structured Streaming and its key benefits. Then we will outline an overview of the proposed new features and functionality in Project Lightspeed.

Growth of Spark Structured Streaming

Spark Structured Streaming has been widely adopted since the early days of streaming because of its ease of use, performance, large ecosystem, and developer communities. The majority of streaming workloads we saw were customers migrating their batch workloads to take advantage of the lower latency, fault tolerance, and support for incremental processing that streaming offers. We have seen tremendous adoption from streaming customers for both open source Spark and Databricks. The graph below shows the weekly number of streaming jobs on Databricks over the past three years, which has grown from thousands to 4+ million and is still accelerating.

Growth of Spark Structured Streaming

Advantages of Spark Structured Streaming

Several properties of Structured Streaming have made it popular for thousands of streaming applications today.

  • Unification – The foremost advantage of Structured Streaming is that it uses the same API as batch processing in Spark DataFrames, making the transition to real-time processing from batch much simpler. Users can simply write a DataFrame computation using Python, SQL, or Spark’s other supported languages and ask the engine to run it as an incremental streaming application. The computation will then run incrementally as new data arrives, and recover automatically from failures with exactly-once semantics, while running through the same engine implementation as a batch computation and thus giving consistent results. Such sharing reduces complexity, eliminates the possibility of divergence between batch and streaming workloads, and lowers the cost of operations (consolidation of infrastructure is a key benefit of Lakehouse). Additionally, many of Spark’s other built-in libraries can be called in a streaming context, including ML libraries.
  • Fault Tolerance & Recovery – Structured Streaming checkpoints state automatically during processing. When a failure occurs, it automatically recovers from the previous state. The failure recovery is very fast since it is restricted to failed tasks as opposed to restarting the entire streaming pipeline in other systems. Furthermore, fault tolerance using replayable sources and idempotent sinks enables end-to-end exactly-once semantics.
  • Performance – Structured Streaming provides very high throughput with seconds of latency at a lower cost, taking full advantage of the performance optimizations in the Spark SQL engine. The system can also adjust itself based on the resources provided, trading off cost, throughput, and latency, and it supports dynamic scaling of a running cluster. This is in contrast to systems that require upfront commitment of resources.
  • Flexible Operations – The ability to apply arbitrary logic and operations on the output of a streaming query using foreachBatch makes it possible to perform upserts, write to multiple sinks, and interact with external data sources (see the sketch after this list). Over 40% of our users on Databricks take advantage of this feature.
  • Stateful Processing – Support for stateful aggregations and joins along with watermarks for bounded state and late order processing. In addition, arbitrary stateful operations with [flat]mapGroupsWithState backed by a RocksDB state store are provided for efficient and fault-tolerant state management (as of Spark 3.2).
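
A minimal sketch of the foreachBatch pattern referenced above (the source table, target table, and merge key are hypothetical):

from delta.tables import DeltaTable

def upsert_to_delta(micro_batch_df, batch_id):
    # Merge each micro-batch into the target Delta table (upsert semantics)
    target = DeltaTable.forName(micro_batch_df.sparkSession, "silver.users")
    (target.alias("t")
           .merge(micro_batch_df.alias("s"), "t.user_id = s.user_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream.table("bronze.user_updates")
      .writeStream
      .foreachBatch(upsert_to_delta)
      .option("checkpointLocation", "/chk/user_upserts")
      .start())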

Project Lightspeed

With the significant growing interest in streaming in enterprises and making Spark Structured Streaming the de facto standard across a wide variety of applications, Project Lightspeed will be heavily investing in improving the following areas:

Predictable Low Latency

Apache Spark Structured Streaming provides a balanced performance across multiple dimensions – throughput, latency and cost. As Structured Streaming grew and is used in new applications, we are profiling our customer workloads to guide improvements in tail latency by up to 2x. Towards meeting this goal, some of the initiatives we will be undertaking are as follows:

  • Offset Management – Our customer workload profiling and performance experiments indicate that offset management operations consume up to 30-50% of pipeline time. These operations can be made asynchronous and run at a configurable cadence, thereby reducing the latency.
  • Asynchronous Checkpointing – The current checkpointing mechanism synchronously writes to object storage after processing each group of records, which contributes substantially to latency. Overlapping the execution of the next group of records with the checkpoint write for the previous group could improve this by as much as 25%.
  • State Checkpointing Frequency – Spark Structured Streaming checkpoints state after each group of records has been processed, which adds to end-to-end latency. If this is made tunable so that state is checkpointed only every Nth group, latency can be reduced further depending on the choice of N.

Enhanced Functionality for Processing Data / Events

Spark Structured Streaming already has rich functionality for expressing predominant sets of use cases. As enterprises extend streaming into new use cases, additional functionality is needed to express them concisely. Project Lightspeed is advancing the functionality in the following areas:

  • Multiple Stateful Operators – Currently, Structured Streaming supports only one stateful operator per streaming job. However, some use cases require multiple state operators in a job such as:
    • Chained time window aggregation (e.g. 5 mins tumble window aggregation followed by 1 hour tumble window aggregation)
    • Chained stream-stream outer equality join (e.g. A left outer join B left outer join C)
    • Stream-stream time interval join followed by time window aggregation
    Project Lightspeed will add support for this capability with consistent semantics.
  • Advanced Windowing – Spark Structured Streaming provides basic windowing that addresses most use cases. Advanced windowing will augment this functionality with simple, easy to use, and intuitive API to support arbitrary groups of window elements, define generic processing logic over the window, ability to describe when to trigger the processing logic and the option to evict window elements before or after the processing logic is applied.
  • State Management – Stateful support is currently provided through predefined aggregators and joins, along with specialized APIs for directly accessing and manipulating state. New functionality in Lightspeed will support evolving the state schema as the processing logic changes, as well as the ability to query state externally.
  • Asynchronous I/O – Often, in ETL, there is a need to join a stream with external databases and microservices. Project Lightspeed will introduce a new API that manages connections to external systems, batches requests for efficiency, and handles them asynchronously.
  • Python API Parity – While the Python API is popular, it still lacks primitives for stateful processing. Lightspeed will add a powerful yet simple API for storing and manipulating state. Furthermore, Lightspeed will provide tighter integrations with popular Python data processing packages such as pandas to make developers’ lives easier.
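
For context, here is a minimal sketch of the single stateful, watermarked window aggregation that Structured Streaming supports today and that the items above will extend; the source table and column names are hypothetical.

    from pyspark.sql.functions import window, col

    # Hypothetical streaming source with an event-time column named "event_time".
    events = spark.readStream.table("events_bronze")

    # A single stateful operator: 5-minute tumbling window counts per device,
    # with a 10-minute watermark to bound state and handle late data.
    counts = (events
        .withWatermark("event_time", "10 minutes")
        .groupBy(window(col("event_time"), "5 minutes"), col("device_id"))
        .count())

    (counts.writeStream
        .outputMode("append")
        .option("checkpointLocation", "/tmp/checkpoints/windowed_counts")
        .toTable("windowed_counts"))

Chaining a second windowed aggregation onto counts is exactly the kind of multi-operator pipeline that Project Lightspeed intends to make possible.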

Connectors and Ecosystem

Connectors make it easier to use the Spark Structured Streaming engine to process data from, and write processed data to, various messaging buses like Apache Kafka and storage systems like Delta Lake. As part of Project Lightspeed, we will work on the following:

  • New Connectors – We will add new connectors working with partners (for example, Google Pub/Sub, Amazon DynamoDB) to enable developers to easily use the Spark Structured Streaming engine with additional messaging buses and storage systems they prefer.
  • Connector Enhancement – We will enable new functionality and improve performance in existing connectors. Some examples include AWS IAM auth support in the Apache Kafka connector and enhanced fan-out support in the Amazon Kinesis connector. A minimal example of reading from the existing Kafka connector is sketched after this list.
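
For reference, a minimal sketch of the existing Kafka source, assuming a hypothetical broker address, topic, and target table:

    # Read a stream from Apache Kafka and persist it to a Delta table.
    kafka_stream = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")   # hypothetical broker
        .option("subscribe", "clickstream")                   # hypothetical topic
        .option("startingOffsets", "latest")
        .load())

    parsed = kafka_stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

    (parsed.writeStream
        .option("checkpointLocation", "/tmp/checkpoints/clickstream_raw")
        .toTable("clickstream_raw"))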

Operations and Troubleshooting

Structured Streaming jobs run continuously until explicitly terminated. Because of this always-on nature, it is necessary to have the appropriate tools and metrics to monitor, debug and alert when certain thresholds are exceeded. Towards satisfying these goals, Project Lightspeed will improve the following:

  • Observability – Currently, collecting and visualizing the metrics that Structured Streaming pipelines generate requires custom code. We will unify the metric collection mechanism and provide capabilities to export metrics to different systems and formats. Furthermore, based on customer input, we will add additional metrics for troubleshooting.
  • Debuggability – We will provide capabilities to visualize pipelines, how their operators are grouped and mapped into tasks, and which executors those tasks run on. Furthermore, we will implement the ability to drill down to specific executors and browse their logs and metrics.

What’s Next

In this blog, we discussed the advantages of Spark Structured Streaming and how they have contributed to its widespread growth and adoption. We also introduced Project Lightspeed, which advances Spark Structured Streaming into the real-time era as more and more new use cases and workloads migrate to streaming.

In subsequent blogs, we will expand on the individual categories of improvements: Spark Structured Streaming performance across multiple dimensions, enhanced functionality, operations, and ecosystem support.

Project Lightspeed will roll out incrementally through close collaboration with the community. We expect most of the features to be delivered by early next year.

--

Try Databricks for free. Get started today.

The post Project Lightspeed: Faster and Simpler Stream Processing With Apache Spark appeared first on Databricks.


Introducing Data Cleanrooms for the Lakehouse


We are excited to announce data cleanrooms for the Lakehouse, allowing businesses to easily collaborate with their customers and partners on any cloud in a privacy-safe way. Participants in the data cleanrooms can share and join their existing data, and run complex workloads in any language – Python, R, SQL, Java, and Scala – on the data while maintaining data privacy.

With the demand for external data greater than ever, organizations are looking for ways to securely exchange their data and consume external data to foster data-driven innovation. Historically, organizations have leveraged data sharing solutions to share data with their partners and relied on mutual trust to preserve data privacy. But organizations relinquish control over their data once it is shared, and they have little to no visibility into how it is consumed by partners across various platforms, which opens the door to data misuse and privacy breaches. With stringent data privacy regulations, it is imperative for organizations to have control and visibility into how their sensitive data is consumed. Organizations therefore need a secure, controlled and private way to collaborate on data, and this is where data cleanrooms come into the picture.

This blog will discuss data cleanrooms, the demand for data cleanrooms, and our vision for a scalable data cleanroom on Databricks Lakehouse Platform.

What is a Data Cleanroom and why does it matter for your business?

A data cleanroom provides a secure, governed and privacy-safe environment, in which multiple participants can join their first-party data and perform analysis on the data, without the risk of exposing their data to other participants. Participants have full control of their data and can decide which participants can perform what analysis on their data without exposing any sensitive data such as Personally identifiable information (PII).

Data cleanrooms open up a broad array of use cases across industries. For example, consumer packaged goods (CPG) companies can measure sales uplift by joining their first-party advertising data with point-of-sale (POS) transaction data from their retail partners. In the media industry, advertisers and marketers can deliver more targeted ads, with broader reach, better segmentation, and greater ad effectiveness transparency, while safeguarding data privacy. Financial services companies can collaborate across the value chain to establish proactive fraud detection or anti-money laundering strategies. In fact, IDC predicts that by 2024, 65% of G2000 enterprises will form data-sharing partnerships with external stakeholders via data cleanrooms to increase interdependence while safeguarding data privacy.

Privacy-safe data cleanroom

Let’s look at some of the compelling reasons driving the demand for cleanrooms:
  • Rapidly changing security, compliance, and privacy landscape: Stringent data privacy regulations such as the GDPR and CCPA, along with sweeping changes in third-party measurement, have transformed how organizations collect, use and share data, particularly for advertising and marketing use cases. For example, Apple’s App Tracking Transparency (ATT) framework gives users of Apple devices the freedom and flexibility to easily opt out of app tracking, and Google plans to phase out support for third-party cookies in Chrome by late 2023. As these privacy laws and practices evolve, the demand for data cleanrooms is likely to rise as the industry moves to new, PII-based identifiers such as UID 2.0, and organizations look for new ways to join data with their partners in a privacy-centric way to achieve their business objectives in a cookie-less reality.
  • Collaboration in a fragmented data ecosystem: Today, consumers have more options than ever before for where, when and how they engage with content. As a result, the digital footprint of consumers is fragmented across different platforms, requiring companies to collaborate with their partners to create a unified view of their customers’ needs and requirements. To facilitate collaboration across organizations, cleanrooms provide a secure and private way to combine their data with other data to unlock new insights or capabilities.
  • New ways to monetize data: Most organizations either already have, or are looking to develop, monetization strategies for their existing data or IP. With today’s privacy laws, companies will look for every possible advantage to monetize their data without the risk of breaking privacy rules. This creates an opportunity for data vendors and publishers to make their data available for analytics without granting direct access to the underlying data.

Existing data cleanroom solutions come with big drawbacks

As organizations explore various cleanroom solutions, they run into some glaring shortcomings in the existing offerings, which prevent them from realizing the full potential of cleanrooms and meeting their business requirements.

Data movement and replication: Existing data cleanroom vendors require participants to move their data into the vendor’s platform, which results in platform lock-in and added data storage costs. Additionally, it is time consuming for participants to prepare the data in a standardized format before performing any analysis on the aggregated data. Furthermore, participants have to replicate their data across clouds and regions to collaborate with participants on other clouds and regions, resulting in operational and cost overhead.

Restricted to SQL: Existing cleanroom solutions don’t provide much flexibility to run arbitrary workloads and analyses and are often restricted to simple SQL statements. While SQL is powerful, and absolutely needed for cleanrooms, there are times when you require complex computations such as machine learning, integration with APIs, or other analysis workloads where SQL just won’t cut it.

Hard to scale: Most existing cleanroom solutions are tied to a single vendor and cannot scale collaboration beyond two participants at a time. For example, an advertiser might want a detailed view of ad performance across different platforms, which requires analysis of aggregated data from multiple data publishers. With collaboration limited to just two participants, organizations get partial insights on one cleanroom platform and end up moving their data to another cleanroom vendor, incurring the operational overhead of manually collating partial insights.

Deploy a scalable and flexible data cleanroom solution with the Databricks Lakehouse Platform

Databricks Lakehouse Platform provides a comprehensive set of tools to build, serve and deploy a scalable and flexible data cleanroom based on your data privacy and governance requirements.

  • Secure data sharing with no replication: With Delta Sharing, cleanroom participants can securely share data from their data lakes with other participants without any data replication across clouds or regions. Your data stays with you and is not locked into any platform, and participants can centrally audit and monitor the usage of their data (a minimal consumer-side sketch follows this list).
  • Full support to run arbitrary workloads and languages: The Databricks Lakehouse Platform gives cleanroom participants the flexibility to run any complex computation, such as machine learning or data workloads, in any language – SQL, R, Scala, Java, Python – on the data.
  • Easily scalable with a guided onboarding experience: Cleanrooms on the Databricks Lakehouse Platform easily scale to multiple participants on any cloud or region. It is easy to get started and to guide participants through common use cases using predefined templates (e.g., jobs, workflows, dashboards), reducing time to insights.
  • Privacy-safe with fine-grained access controls: With Unity Catalog, you can enable fine-grained access controls on the data and meet your privacy requirements. Integrated governance allows participants to retain full control over which queries or jobs can be executed on their data, all queries and jobs are executed on Databricks-hosted trusted compute, and participants never get access to the raw data of other participants, ensuring data privacy. Participants can also leverage open source or third-party differential privacy frameworks, making your cleanroom future-proof.
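
For illustration, here is a minimal sketch of how a data recipient could read a shared table through the open source delta-sharing Python client; the profile file path and the share, schema, and table names are hypothetical.

    import delta_sharing

    # Credential file that the data provider shares with the recipient (hypothetical path).
    profile = "/dbfs/FileStore/shares/retail.share"

    # "share.schema.table" coordinates are hypothetical.
    table_url = profile + "#retail_share.pos.daily_sales"

    # Load the shared table as a pandas DataFrame without copying the provider's data.
    df = delta_sharing.load_as_pandas(table_url)
    print(df.head())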

To learn more about data cleanrooms on Databricks Lakehouse, please reach out to your Databricks account representatives.

--

Try Databricks for free. Get started today.

The post Introducing Data Cleanrooms for the Lakehouse appeared first on Databricks.

Introducing Databricks Marketplace


We’re pleased to announce Databricks Marketplace, an open marketplace for exchanging data products such as datasets, notebooks, dashboards, and machine learning models. To accelerate insights, data consumers can discover, evaluate, and access more data products from third-party vendors than ever before. Providers can now commercialize new offerings and shorten sales cycles by providing value-added services on top of their data. Databricks Marketplace is powered by Delta Sharing, allowing consumers to access data products without having to be on the Databricks platform. This open approach allows data providers to broaden their addressable market without forcing consumers into vendor lock-in.

Databricks Marketplace

This blog will discuss the key limitations of existing data marketplaces and our vision for an open marketplace on the Databricks Lakehouse Platform.

Existing data marketplaces fail to maximize business value for data providers and data consumers

The demand for third-party data to power data-driven innovation is greater than ever, and data marketplaces act as a bridge between data providers and data consumers, facilitating the discovery and delivery of datasets. However, as organizations continue leveraging more third-party data, the value these platforms provide has not kept up with the needs of either providers or consumers.

Challenges for data consumers

Data consumers value ease of data discovery and frictionless data evaluation from a data marketplace.

However, existing data marketplaces that provide only datasets miss one of the key considerations for data consumers: the context around the data. In most current data marketplaces, consumers receive a brief overview of the datasets and maybe a few sample queries. This often leads to frustration, as consumers have to spend time understanding the data model and going back and forth with the data provider’s support teams before they can determine whether it is the right fit for their analytic needs.

Additionally, most current marketplaces operate as walled gardens. Data exchange can only be done on their closed platforms, and sometimes only within their proprietary data formats. There are limited options to access the data seamlessly from third-party tools or platforms, and data consumers are forced onto the platform, which creates lock-in.

Challenges for data providers

From a data provider’s perspective, two important measures of success are increased sales and lower operational cost. However, most data marketplaces fall short on both measures.

With existing data marketplaces, data providers can only package and distribute datasets, and most marketplaces limit providers to a brief write-up or out-of-context query examples to augment their dataset product profiles. Data consumers end up incurring significant effort and downstream cost to evaluate these datasets. This results in cumbersome onboarding, unnecessarily long sales cycles, and eventually lost revenue opportunities.

Additionally, many data marketplaces require data providers to load data into a proprietary format, use the marketplace’s compute, and replicate data into the different clouds and regions in which their customers operate. This quickly increases compute costs and operational burden as more and more moving parts are added to the system to maintain parity across cloud providers and regions. As the number of datasets and their volume grow, data providers must weigh these costs and trade-offs, and some may be left with the decision to deprioritize potentially valuable datasets as the cost to commercialize them grows.

Unlock business value with Databricks Marketplace

The vision behind Databricks Marketplace is to address these problems and help consumers and providers achieve their business objectives.

Benefits for Data Consumers

Faster time to insights

With Databricks Marketplace, data consumers can get access not only to datasets but also to other data assets, including dashboards, notebooks, and ML models. This gives data consumers an easy way to evaluate data and accelerate time to insights. For example, data consumers can leverage a starter notebook to do exploratory data analysis, or a machine learning model that helps predict future rankings of the dataset. Before requesting access to the data, Databricks-hosted dashboards enable customers to explore the data live without any additional cost. All of this speeds up the evaluation, acquisition, and analysis cycle and helps consumers get more value from the data.

An open marketplace

Powered by Delta Sharing, Databricks Marketplace allows data consumers to seamlessly access the data products without the need to be on the Databricks platform. There is no lock-in, and it provides consumers options to maximize the data value from the tools of their choice.

Benefits for Data Providers

Distribute and monetize a wide array of data products

With Databricks Marketplace, providers can market and distribute not only datasets, but also other data products such as notebooks, dashboards, and models that are essential to help consumers realize the full value of a dataset.

Let’s say a provider is selling Environmental, Social and Governance (ESG) data. The provider can package a notebook along with the data to show how the data can be used for NLP analysis, a dashboard that visualizes the worst-polluting companies, and a model that shows how the shared ESG data can provide recommendations on when a company’s ESG ranking will change. With existing data marketplaces, there is no easy way for providers to share all of these highly valuable assets.

Broaden the reach of the data products

With Databricks Marketplace, data providers can expand their addressable market beyond the consumers who are on the Databricks Platform. This helps data providers increase the revenue potential of their data products.

No replication of data products

Databricks Marketplace allows data providers to share their data products without having to move or replicate them from their own cloud storage. This allows providers to deliver data products to other clouds, tools, and platforms from a single source. Providers may still replicate data products if desired, but it is a choice rather than something they are forced to do at additional cost.

What Databricks Partners are saying:

“Databricks Marketplace is a compelling platform for us. We like the fact that it is open and provides us a way to reach existing and new types of personas for our data offerings. We see the platform as a key enabler to accelerate value with our data offerings to our customers”
– Chris Anderson, CTO Intellectual Property Solutions, LexisNexis

“Customers need solutions, not only raw data. Being able to package raw data along with the code and analytics on top of it is how we see customers consuming raw data in the future”
– Ross Epstein, VP New Projects, Safegraph

“Facteus is extremely excited to be part of the inception of the Databricks Marketplace. A marketplace built on their Delta Share protocol is a huge step forward in democratizing and simplifying data access.”
– Jonathan Chin, Co-Founder Head of Data and Growth, Facteus

“With more than 1.2B non-identified patient records, IQVIA has unparalleled healthcare data and is focused on advancing innovation for a healthier world. We are looking forward to the upcoming launch of Databricks’ Delta Sharing Marketplace to enable seamless data sharing with our customers, which will accelerate time to insights and value across the ecosystem.”
– Avinob Roy, VP & GM Product Management, IQVIA

SIGN UP TO BE A DATA PROVIDER

--

Try Databricks for free. Get started today.

The post Introducing Databricks Marketplace appeared first on Databricks.

What’s New With Databricks Unity Catalog at the Data & AI Summit 2022


Today we are excited to announce that Unity Catalog, a unified governance solution for all data assets on the Lakehouse, will be generally available on AWS and Azure in the upcoming weeks. Currently, you can apply for a public preview or reach out to a member of your Databricks account team.

Sign-up for Public Preview

In a previous blog, we set out our vision for a governed lakehouse and how Unity Catalog can help customers simplify governance at scale. This blog will explore the most recent updates to Unity Catalog and our growing partner ecosystem.

What’s new with Unity Catalog for Data and AI Summit?

Automated Data Lineage for all workloads

Unity Catalog now automatically tracks data lineage across queries executed in any language. Data lineage is captured down to the table and column level, and lineage is also tracked for key assets such as notebooks, dashboards and jobs. Lineage opens up several use cases, including assessing the impact that changes to tables will have on your data consumers and auto-generating documentation that consumers can use to understand data in the lakehouse. For more information, see our recent blog post.

Built-in Data Search and Discovery

Unity Catalog now includes a built-in search capability. Once data is registered in Unity Catalog, end users can easily search across metadata fields including table names, column names, and comments to find the data they need for their analysis. This search capability automatically leverages the governance model put in place by Unity Catalog. Users will only see search results for data they have access to, which serves as a productivity boost for the user, and a critical control for data administrators who want to ensure that sensitive data is protected.


Search and Discovery in Unity Catalog

Simplified access controls with privilege inheritance

Unity Catalog offers a simple model to control access to data via a UI or SQL. We have now extended this model to allow data admins to set up access to thousands of tables with a single click or SQL statement. This is achieved through a privilege inheritance model that allows admins to set access policies on whole catalogs or schemas of objects. For example, executing the following SQL statement will give the ml_team group read access to all current tables and views in the main catalog, and any that are created in the future.

GRANT SELECT ON CATALOG main TO ml_team

This also serves as a way to set safe access defaults on catalogs and schemas. A common pattern is to give a team a schema to store its data. An admin can now set a policy on that schema so that, by default, all team members can read objects created by others.
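
As a sketch of that pattern (the catalog, schema, and group names are hypothetical, and additional privileges such as usage on the catalog and schema may be required depending on your setup), an admin could run something like the following from a notebook attached to a Unity Catalog-enabled cluster:

    # By default, let all members of the (hypothetical) analytics_team group read any
    # object created in the (hypothetical) main.reporting schema, now or in the future.
    spark.sql("GRANT SELECT ON SCHEMA main.reporting TO `analytics_team`")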

Information Schema

Information schemas have been a fundamental asset of database systems for decades. They offer a pre-defined set of views that describe the objects within the database – for example, what tables have been created, when, by whom, and what access levels have been granted on each, amongst other things. This metadata is often leveraged by users not only to understand what data is available in the system, but also to automate report generation on topics such as access levels per table. Unity Catalog brings the concept of the information schema to the lakehouse. Each catalog you create in Unity Catalog arrives with a pre-defined schema called information_schema, which contains a set of views that describe the catalog. This can be queried from DBSQL or the notebook environment (a minimal example follows below).
Information Schema in Unity Catalog
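
For example, here is a minimal sketch of listing recently created tables in a catalog from a notebook; the catalog name is hypothetical, and the selected columns follow the standard information schema views, so exact field names may vary.

    # List tables registered in the (hypothetical) "main" catalog, with owner and creation time.
    tables = spark.sql("""
        SELECT table_schema, table_name, table_owner, created
        FROM main.information_schema.tables
        ORDER BY created DESC
    """)
    tables.show(truncate=False)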

Azure Managed Identities in Unity Catalog

We are excited that Unity Catalog now supports using an Azure Managed Identity to access both managed storage and external storage in a Unity Catalog metastore. Managed Identities are a Microsoft Azure construct that provides an identity for applications to use when connecting to resources that support Azure Active Directory (AAD). Until now, Unity Catalog relied on Service Principals as the identity used to access data in Azure Data Lake Storage (ADLS). Managed Identities have two major benefits over Service Principals for this use case. First, Managed Identities do not require maintaining credentials or rotating secrets. Second, they offer a way to connect to ADLS that is protected by a storage firewall.

Upgrade your Hive Metastore to Unity Catalog

Unity Catalog now offers a seamless upgrade experience from your existing Hive metastore so you can take advantage of all the new features described above. Users can select thousands of tables to upgrade at once within a purpose-built user interface. The upgrade tool works by copying table metadata from existing Hive metastores to a Unity Catalog metastore. It also automatically resolves DBFS mount points used in table definitions, so that data can be securely accessed across your entire Databricks account. For those who prefer code over UIs, we also make the SQL syntax (‘CREATE TABLE LIKE…’) available for running against a Databricks cluster or SQL warehouse.


Upgrade Hive Metastore

Better together with our governance and catalog partners

In addition to all the features and capabilities you’ve read about, we also have a healthy and vibrant ecosystem of partners who are joining us in supporting Unity Catalog with their products. The ecosystem is growing every day.

Privacera
“Privacera integrates with Unity Catalog by leveraging the new APIs built by the Databricks team and through a policy translation layer built by Privacera. The integration is transparent to data consumers and IT administrators and supports the same fine-grained access control functionality that is supported in Privacera integration with legacy Databricks High Concurrency clusters.” –Don Bosco Durai

Don Bosco Durai
Don Bosco Durai is the co-founder and CTO of Privacera. Bosco is also the creator of the ASF Open Source project Apache Ranger and a thought leader in the security and governance space.

Immuta
With Unity Catalog, physical data policy enforcement is native to Databricks, less invasive to data consumers, and no longer tied to plugins specifically built for different Spark runtimes – enforcement done correctly. Meanwhile, Immuta continues to solve management challenges by providing active data monitoring, metadata discovery/centralization, scalable policy orchestration (table-, row-, column-, and cell-level controls) to include leveraging Unity Catalog’s lineage features to simplify policy enforcement, and compliance reporting/alerting. – Steve Touw

Steve Touw
Steve Touw is the co-founder and CTO of Immuta. He has a long history of designing large-scale geo-temporal analytics across the U.S. intelligence community – including some of the very first Hadoop analytics and frameworks to manage complex multi-tenant data policy controls.

Alation
Alation and Databricks help organizations to gain data intelligence, eliminate silos, and promote governance capabilities to drive digital transformation projects. Alation enables organizations to nurture data as an asset – helping to enhance data discovery, aid understanding, promote trust and ensure compliance with relevant policies. Leveraging the data captured by the Unity metastore, Alation will enhance our existing integration with Databricks by easily including metadata from multiple workspaces. Together Databricks and Alation will ultimately provide catalog, lineage and policy management and enforcement for the Lakehouse. Alation is thrilled to partner with Databricks and looking forward to working jointly to enable data scientists, engineers, and analysts to quickly turn data into business insights. – Ibrahim “Ibby” Rahmani

Ibrahim Rahmani
Ibrahim “Ibby” Rahmani is Director of Product Marketing at Alation

Collibra
Many of Collibra’s most strategic customers have found great value from the power of Databricks. This has been the focus of our technical integration with Unity Catalog. Collibra’s enterprise catalog brings value to business and governance personas and, thus, we think that Unity Catalog’s tactical platform focus is a perfect pairing. There are also benefits at the metadata ingestion level because there is no longer a need to have a Databricks cluster running to pull metadata. We feel that lineage, direct from a platform API like Unity Catalog, is better quality and easier to update over time as processing changes. –Vaughn Micciche

Vaughn Micciche
Vaughn Micciche is the Technical Partnership Director at Collibra

Atlan
Atlan connects to Databricks Unity Catalog’s API to extract all relevant metadata powering discovery, governance, and insights inside Atlan. This integration allows Atlan to generate lineage for tables, views, and columns for all the jobs and languages that run on Databricks. By pairing this with metadata extracted from other tools in the data stack (e.g. BI, transformation, ELT), Atlan can create true end-to-end lineage. Thanks to Unity Catalog’s simplified delivery system, which sends complete lineage through its API, this entire experience is near instantaneous with drastically reduced compute and cost for customers. This allows Databricks customers to holistically understand the flow of their data, gain deeper insight into the data populating their models, run RCA exercises, and even power programmatic governance at scale with Atlan’s metadata activation engine. –Amit Prabhu

Amit Prabhu
Amit Prabhu is a Software Architect at Atlan leading the Orchestration team

Getting Started with Unity Catalog on AWS and Azure

Sign-up for Public Preview

Visit the Unity Catalog documentation [AWS, Azure] to learn more.

--

Try Databricks for free. Get started today.

The post What’s New With Databricks Unity Catalog at the Data & AI Summit 2022 appeared first on Databricks.

Connect From Anywhere to Databricks SQL


Today we are thrilled to announce a full lineup of open source connectors for Go, Node.js, and Python, as well as a new CLI that makes it simple for developers to connect to Databricks SQL from any application of their choice. Along the same theme of empowering developers, we have also published the official Databricks JDBC driver on the Maven Central repository, making it possible to use it in your build system and confidently package it with your applications.

Databricks SQL connectors: connect from anywhere and build data apps powered by your lakehouse

Since its GA earlier this year, the Databricks SQL Connector for Python has seen tremendous adoption from our developer community, averaging over 1 million downloads a month. We are excited to announce that the connector is now completely open source.
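
For reference, here is a minimal sketch of querying a SQL warehouse with the Python connector; the hostname, HTTP path, and token placeholders follow the same pattern as the examples below and must be replaced with your own values.

    from databricks import sql

    # replace these values
    with sql.connect(server_hostname="********.databricks.com",
                     http_path="/sql/1.0/endpoints/*******",
                     access_token="dapi***********") as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT * FROM samples.nyctaxi.trips LIMIT 10")
            for row in cursor.fetchall():
                print(row)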

We would like to thank the contributors to the open source projects that provided the basis for our new Databricks SQL connectors. We invite the community to join us on GitHub and collaborate on the future of data connectivity.

Databricks SQL Go Driver

Go is a popular open source language commonly used for building reliable cloud and network services and web applications. Our open source driver implements the idiomatic database/sql standard for database access.

Here’s a quick example of how to submit SQL queries to Databricks from Go:

    package main
    
    import (
        "database/sql"
        "log"
           "fmt"
    
        _ "github.com/databricks/databricks-sql-go"
    )
    
    
    // replace these values
    const (
        token = "dapi***********"
        hostname = "********.databricks.com"
        path = "/sql/1.0/endpoints/*******"
    )
    
    func main() {
        dsn := fmt.Sprintf("databricks://:%s@%s%s", token, hostname, path)
        db, err := sql.Open("databricks", dsn)
        if err != nil {
            log.Fatalf("Could not connect to %s: %s", dsn, err)
        }
        defer db.Close()
    
        db.Query("CREATE TABLE example (id INT, text VARCHAR(20))")
        db.Query("INSERT INTO example VALUES (1, \"Hello\"), (2, \"World\")")
    
        rows, err := db.Query("SELECT * FROM example")
        if err != nil {
            log.Fatal(err)
        }
        for rows.Next() {
            var text string
            var id int
            if err := rows.Scan(&id, &text); err != nil {
                log.Fatal(err)
            }
    
            fmt.Printf("%d %s\n", id, text)
        }
    
    }

Output:

    1 Hello
    2 World

You can find additional examples in the examples folder of the repo. We are looking forward to the community’s contributions and feedback on GitHub.

Databricks SQL Node.js Driver

Node.js is very popular for building services in JavaScript and TypeScript. The native Node.js driver, written entirely in TypeScript with minimal external dependencies, supports the async/await pattern for idiomatic, non-blocking operations. It can be installed using NPM (Node.js 14+):

$ npm i @databricks/sql

Here is a quick example to create a table, insert data, and query data:

    const { DBSQLClient } = require('@databricks/sql');
    
    
    // replace these values
    const host = '********.databricks.com';
    const path = '/sql/1.0/endpoints/*******';
    const token = 'dapi***********';
    
    async function execute(session, statement) {
      const utils = DBSQLClient.utils;
      const operation = await session.executeStatement(statement, { runAsync: true });
      await utils.waitUntilReady(operation);
      await utils.fetchAll(operation);
      await operation.close();
      return utils.getResult(operation).getValue();
    }
    
    const client = new DBSQLClient();
    
    client.connect({ host, path, token }).then(async client => {
      const session = await client.openSession();
    
      await execute(session, 'CREATE TABLE example (id INT, text VARCHAR(20))');
    
      await execute(session, 'INSERT INTO example VALUES (1, "Hello"), (2, "World")');
    
      const result = await execute(session, 'SELECT * FROM example');
      console.table(result);
    
      await session.close();
      client.close();
    }).catch(error => {
      console.log(error);
    });

Output:

    ┌────┬─────────┐
    │ id │  text   │
    ├────┼─────────┤
    │ 1  │ 'Hello' │
    │ 2  │ 'World' │
    └────┴─────────┘

The driver also provides direct APIs to get table metadata such as getColumns. You can find more samples in the repo. We are looking forward to the Node.js community’s feedback.

Databricks SQL CLI

Databricks SQL CLI is a new command line interface (CLI) for issuing SQL queries and performing all SQL operations. As it is built on the popular open source DBCLI package, it supports auto-completion and syntax highlighting. The CLI supports both interactive querying and the ability to run SQL files. You can install it using pip (Python 3.7+).

python3 -m pip install databricks-sql-cli

To connect, you can provide the hostname, HTTP path, and personal access token (PAT) as command line arguments as shown below, by setting environment variables, or by writing them into the [credentials] section of the config file.

$ dbsqlcli --hostname '********.databricks.com' --http-path '/sql/1.0/endpoints/*******' --access-token 'dapi***********'

You can now run dbsqlcli from your terminal, with a query string or .sql file.

$ dbsqlcli -e 'SELECT * FROM samples.nyctaxi.trips LIMIT 10'
$ dbsqlcli -e query.sql
$ dbsqlcli -e query.sql > output.csv

Use --help or check the repo for more documentation and examples.

Databricks JDBC Driver on Maven

Java and JVM developers use JDBC as a standard API for accessing databases. The Databricks JDBC driver is now available on the Maven Central repository, letting you use it in your build system and CI/CD runs. To include it in your Java project, add the following dependency to your application’s pom.xml:

    <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>databricks-jdbc</artifactId>
      <version>2.6.25-1</version>
    </dependency>

Here is some sample code to query data using the JDBC driver:

    import java.sql.*;

    public class Main {
        public static void main(String[] args) throws Exception {
            // Open a connection

            // replace the values below
            String token = "dapi*****";
            String url = "jdbc:databricks://********.cloud.databricks.com:443/default;" +
                    "transportMode=http;ssl=1;AuthMech=3;httpPath=sql/protocolv1/o/*****;" +
                    "UID=token;" +
                    "PWD=" + token;

            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM samples.nyctaxi.trips")) {
                // Extract data from result set
                while (rs.next()) {
                    // Retrieve by column name
                    System.out.print("ID: " + rs.getString("col_name"));
                }
            }
        }
    }

Connect to the Lakehouse from Anywhere

With these additions, Databricks SQL now has native connectivity from Python, Go, Node.js, the CLI, and ODBC/JDBC, as well as a new SQL Execution REST API that is in Private Preview. We have exciting features on the roadmap, including additional authentication schemes, support for Unity Catalog, support for SQLAlchemy, and performance improvements. We can’t wait to see all the great data applications that our partner and developer communities will build with Databricks SQL.

The best data warehouse is a Lakehouse. We are excited to enable everybody to connect to the lakehouse from anywhere! Please try out the connectors, and we would love to hear your feedback and suggestions on what to build next! (Contact us on GitHub and the Databricks Community.)

Join the conversation in the Databricks Community where data-obsessed peers are chatting about Data + AI Summit 2022 announcements and updates. Learn. Network. Celebrate.

--

Try Databricks for free. Get started today.

The post Connect From Anywhere to Databricks SQL appeared first on Databricks.

Top 5 Workflows Announcements at Data + AI Summit


The Data and AI Summit was chock-full of announcements for the Databricks Lakehouse platform. Among these announcements were several exciting enhancements to Databricks Workflows, the fully-managed orchestration service that is deeply integrated with the Databricks Lakehouse Platform. With these new capabilities, Workflows enables data engineers, data scientists and analysts to build reliable data, analytics, and ML workflows on any cloud without needing to manage complex infrastructure.

Build Reliable Production Data and ML Pipelines With Git Support

We use Git to version control all of our code, so why not version control data and ML pipelines? With Git support in Databricks Workflows, you can use a remote Git reference as the source for tasks that make up a Databricks Workflow. This eliminates the risk of accidental edits to production code, removes the overhead of maintaining a production copy of the code in Databricks and keeping it updated, and improves reproducibility as each job run is linked to a commit hash. Git support for Workflows is available in Public Preview and works with a wide range of Databricks supported Git providers including GitHub, Gitlab, Bitbucket, Azure DevOps and AWS CodeCommit.

Git Support for Workflows

Please check out this blog post to learn more.
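
As a rough sketch (not an exact API reference), a job whose notebook task is sourced from a remote Git branch can be created with a payload along these lines against the Jobs API; the repository URL, branch, notebook path, cluster ID, and workspace placeholders are hypothetical:

    import requests

    payload = {
        "name": "nightly-etl-from-git",
        "git_source": {
            "git_url": "https://github.com/acme/etl-pipelines",   # hypothetical repo
            "git_provider": "gitHub",
            "git_branch": "main",
        },
        "tasks": [{
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "notebooks/ingest", "source": "GIT"},
            "existing_cluster_id": "1234-567890-abcde123",         # hypothetical cluster
        }],
    }

    resp = requests.post(
        "https://********.databricks.com/api/2.1/jobs/create",
        headers={"Authorization": "Bearer dapi***********"},
        json=payload,
    )
    print(resp.json())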

Run dbt projects in production

dbt is a popular open source tool that lets you build data pipelines using simple SQL. Everything is organized within directories as plain text, making version control, deployment, and testability simple. We announced the new dbt-databricks adapter last year, which brings simple setup and Photon-accelerated SQL to Databricks users. dbt users can now run their projects in production on Databricks using the new dbt task type in Jobs, benefiting from a highly-reliable orchestrator that offers an excellent API and semantics for production workloads.

dbt task type for Jobs

Please contact your Databricks representative to enroll in the private preview.

Orchestrate even more of the lakehouse with SQL tasks

Real-world data and ML pipelines consist of many different types of tasks working together. With the addition of the SQL task type in Jobs, you can now orchestrate even more of the lakehouse. For example, you can trigger a notebook to ingest data, run a Delta Live Tables pipeline to transform the data, and then use the SQL task type to schedule a query and refresh a dashboard.

SQL task type for Jobs

Please contact your Databricks representative to enroll in the private preview.

Save Time and Money on Data and ML Workflows With “Repair and Rerun”

To support real-world data and machine learning use cases, organizations create sophisticated workflows with numerous tasks and dependencies, ranging from data ingestion and ETL to ML model training and serving. Each of these tasks must be completed in the correct order, so when an important task in a workflow fails, it affects all downstream tasks. The new “Repair and Rerun” capability in Jobs addresses this issue by allowing you to rerun only the failed tasks, saving you time and money.

Repair and Rerun for Jobs

“Repair and Rerun” is Generally Available and you can learn about it in this blog post.
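
As a rough sketch (not an exact API reference), repairing a failed run by rerunning only specific tasks can be driven through the Jobs API; the workspace placeholder, run ID, and task keys are hypothetical:

    import requests

    # Rerun only the failed tasks of an existing job run.
    resp = requests.post(
        "https://********.databricks.com/api/2.1/jobs/runs/repair",
        headers={"Authorization": "Bearer dapi***********"},
        json={"run_id": 123456, "rerun_tasks": ["transform", "publish_dashboard"]},
    )
    print(resp.json())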

Easily share context between tasks

A task may sometimes depend on the results of an upstream task. For instance, if a model statistic (such as the F1 score) falls below a predetermined threshold, you may want to retrain the model. Previously, in order to access data from an upstream task, it was necessary to store it somewhere outside the job’s context, such as in a Delta table.

The Task Values API now allows tasks to set values that can be retrieved by subsequent tasks. To facilitate debugging, the Jobs UI displays values specified by tasks.

Task Values
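
For illustration, here is a minimal sketch of the pattern using dbutils.jobs.taskValues inside notebook tasks; the task key, value key, and threshold are hypothetical:

    # In the upstream "evaluate" task: record the model's F1 score for downstream tasks.
    dbutils.jobs.taskValues.set(key="f1_score", value=0.82)

    # In a downstream task: read the value and decide whether to retrain.
    f1 = dbutils.jobs.taskValues.get(taskKey="evaluate", key="f1_score", default=0.0, debugValue=0.9)
    if f1 < 0.85:
        print("F1 below threshold, triggering retraining...")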

Learn more:

Workflows Demo

Try Databricks Workflows today

Join the conversation in the Databricks Community where data-obsessed peers are chatting about Data + AI Summit 2022 announcements and updates. Learn. Network. Celebrate.

--

Try Databricks for free. Get started today.

The post Top 5 Workflows Announcements at Data + AI Summit appeared first on Databricks.
