
New Features to Accelerate the Path to Production With the Next Generation Data Science Workspace


Today, at the Data + AI Summit Europe 2020, we shared some exciting updates on the next generation Data Science Workspace – a collaborative environment for modern data teams – originally unveiled at Spark + AI Summit 2020.

The power of data and artificial intelligence is already disrupting many industries, yet we’ve only scratched the surface of its potential, and data teams still face many challenges on a daily basis. This is partially due to the lack of smooth hand-offs among the variety of roles involved at different stages of the lifecycle. In addition, the rapid innovation in the field means that much of the open-source software and tooling available is not yet production ready. And finally, traditional software engineering and DevOps tools built for production were designed when software wasn’t as closely tied to data as it is today, and they haven’t caught up with emerging practices in data science and machine learning.

The next-generation Data Science Workspace navigates these challenges to provide an open and unified collaborative experience for modern data teams by bringing together notebook environments with best-of-breed developer workflows for Git-based collaboration and reproducibility. This, in turn, is fully automated with low-friction CI/CD pipelines from experimentation to production deployment.

In this blog, we’ll cover the new features added to the Next Generation Data Science Workspace to continue to increase data science productivity and simplify hand-offs for data teams, specifically Table of Contents, TensorBoard, and Dark Mode as well as Staging of Changes, Diff View, and Enterprise Security features for Git-based Projects.

Introducing Embedded TensorBoard, Table of Contents, and Dark Mode for Databricks Notebooks

Data scientists and engineers alike already love Databricks notebooks for their collaborative and productivity features including support for multiple programming languages (Python, Scala, SQL, or R), commenting, co-authoring, co-presence, built-in visualizations, and the ability to track experiments along with the corresponding parameters, metrics, and code from a specific notebook version.

For example, Experiment Tracking on Databricks allows you to quickly see all of the runs that were logged using MLflow from within your notebook, with the click of a button. One common use case is to identify the best machine learning model by sorting by a metric. Now you can easily find the best run with the lowest loss. Because of the tight integration with the notebook, you can also go back to the exact version of the code that created this run, allowing you to reproduce it in the future as shown below.
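For illustration, a run like the hypothetical one below (the parameter and metric names are placeholders, not from the original example) is the kind of entry that populates that list of runs:

%python
import mlflow

# Each run logged this way appears in the notebook's experiment sidebar,
# where it can be sorted by metrics such as "loss".
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)
    mlflow.log_metric("loss", 0.42)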

New TensorBoard support

TensorFlow and Keras are widely popular deep learning frameworks, and chances are that you’d like to use the TensorFlow backend for Keras. Now you can embed TensorBoard directly into your notebook, so you can monitor your training progress right in context.
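A rough sketch of how this might look (the log directory is a placeholder, and the inline rendering shown relies on the standard TensorBoard notebook magics rather than any Databricks-specific API):

%python
import tensorflow as tf

log_dir = "/tmp/tensorboard_logs"  # placeholder log directory
tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)

# Pass the callback to your Keras training run, e.g.:
# model.fit(x_train, y_train, epochs=5, callbacks=[tb_callback])

# In a separate cell, render TensorBoard inline with the standard notebook magics:
# %load_ext tensorboard
# %tensorboard --logdir /tmp/tensorboard_logs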


New Table-of-Contents

Because Notebooks can become quite long, the new Table-of-Contents feature uses Markdown headings to allow you to easily navigate through a notebook. The notebook example below trains a Keras model, and you can see how you can now quickly navigate to that section.

Introducing Dark Mode

Finally, we’re very excited to introduce Dark Mode on Databricks. We know a lot of our users are as excited about this feature as we are. Simply go to User Settings > Notebook Settings and turn on “Dark Mode for notebook editor” to flip the switch.

Introducing Dark Mode on Databricks Data Science Workspace.

Many people think that such a collaborative platform can only be useful for exploration and experimentation, since it has proven difficult to combine the ease of use and collaborative features of notebooks with the rigor of large-scale production deployments. For managing code versioning, CI/CD, and production-grade deployments, the industry already relies on best practices for robust code management in complex settings, and those practices are Git-based.

Therefore, Git-based Databricks Projects fully integrate with the Git ecosystem to bring those best practices to data engineering and data science, where reproducibility is becoming more and more important.

Introducing nbformat support, diff view, and enterprise security features for Git-based Databricks Projects

Databricks Projects allow data scientists to use Git to carry out their work on Databricks, where they can access all of their data sets and use best of breed open-source tools in a secure and scalable environment. Today, we discussed new features coming soon to help data teams rapidly and confidently move experiments to production including native support for an open and standard notebook format (nbformat) to facilitate collaboration and interoperability, new diff view to easily compare code when merging branches, and enterprise security features to safeguard your intellectual property.

Native nbformat support for Databricks notebooks

Databricks notebooks can already be exported to the ipynb file format, and one goal of this work is to support a rich, open data format that retains more metadata than plain source-file exports.

Therefore, we have extended the nbformat that underlies ipynb to retain some of the metadata from Databricks notebooks, so all of your work gets saved and checked in. Now, for example, Notebook Dashboards can be stored within the ipynb file.

By using ipynb, we wanted to ensure that you can benefit from the ecosystem around this open format. For example, most git providers can render ipynb files and give you a great code review experience.

Databricks Projects allow data scientists to easily use Git to carry out their work on Databricks.

Staging and visual diff view

Once you’re done updating your code, you can stage your files in the Git dialog, where we’ve also added a visual diffing feature that lets you preview pending changes.

This makes it much easier to decide which changes to check in, which to revert, or which still need some more work.

New Enterprise Security Features

To securely manage Git providers and what gets checked-in or checked-out, we’ve included new security features to help you protect your intellectual property.

Now admins can configure allow lists that specify which git providers and repositories users have access to. The most common use case for this is to avoid code making its way to public repositories.

Another common pitfall is unprotected secrets checked into your code. Databricks Projects will detect those before you actually commit your code, helping you avoid exposing tokens, keys, and other credentials.

Next Steps

To learn more, you can watch today’s keynote: Introducing the Next-Generation Data Science Workspace.

We have worked hard with many customers to design these new experiences, and are very excited to bring all these innovations to public preview in the next couple of months. Sign-up here for future updates and be notified of public preview!

--

Try Databricks for free. Get started today.

The post New Features to Accelerate the Path to Production With the Next Generation Data Science Workspace appeared first on Databricks.


MLflow Model Registry on Databricks Simplifies MLOps With CI/CD Features


MLflow helps organizations manage the ML lifecycle through the ability to track experiment metrics, parameters, and artifacts, as well as deploy models to batch or real-time serving systems. The MLflow Model Registry provides a central repository to manage the model deployment lifecycle, acting as the hub between experimentation and deployment.

A critical part of MLOps, or ML lifecycle management, is continuous integration and deployment (CI/CD). In this post, we introduce new features in the Model Registry on Databricks [AWS] [Azure] to facilitate the CI/CD process, including tags and comments which are now enabled for all customers, and the upcoming webhooks feature currently in private preview.

Today at the Data + AI Summit, we announced the general availability of Managed MLflow Model Registry on Databricks, and showcased the new features in this post. You can read more about the enterprise features of the managed solution in our previous post on MLflow Model Registry on Databricks.

Annotating Models and Model Versions with Tags

Registered models and model versions support key-value pair tags, which can encode a wide variety of information. For example, a user may mark a model with the deployment mode (e.g., batch or real-time), and a deployment pipeline could add tags indicating in which regions a model is deployed. And with the newly added ability to search and query by tags, it’s now easy to filter by these attributes so you can identify the models that are important to your task.

Tags can be added, edited, and removed from the model and model version pages, as well as through the MLflow API.
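For example, here is a minimal sketch of tagging through the MLflow client API (the model name, version, and tag keys are placeholders):

%python
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "churn-model"  # placeholder registered model name

# Tag the registered model with its deployment mode
client.set_registered_model_tag(model_name, "deployment_mode", "batch")

# Tag a specific model version, e.g. with the region a deployment pipeline pushed it to
client.set_model_version_tag(model_name, "1", "deployed_region", "us-west-2")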

Interacting with tags on a Registered Model page via Databricks’ managed MLflow.

Adding Comments to Model Versions

With the latest release of the Model Registry, your teams now have the ability to write free-form comments about model versions. Deployment processes often trigger in-depth discussions among ML engineers: whether to productionize a model, examine any cause of failures, ascertain model accuracies, reevaluate metrics, parameters, schemas, etc. Through comments, you can capture these discussions during a model’s deployment process, in a central location.

Moreover, as organizations look to automate their deployment processes, information about a deployed model can be spread out across various platforms. With comments, external CI/CD pipelines can post information like test results, error messages, and other notifications directly back into the model registry. Also, in conjunction with webhooks, you can set up your CI/CD pipelines to be triggered by specific comments.

Comments can be created and modified from the UI or from a REST API interface, which will be published shortly.

Notifications via Webhooks

Webhooks are a common mechanism to invoke an action via an HTTP request upon the occurrence of an event. Model Registry webhooks facilitate the CI/CD process by providing a push mechanism to run a test or deployment pipeline and send notifications through the platform of your choice. Model Registry webhooks can be triggered upon events such as the creation of new model versions, the addition of new comments, and the transition of model version stages.

For example, organizations can use webhooks to automatically run tests when a new model version is created and report back results. When a user creates a transition request to move the model to production, a webhook tied to a messaging service like Slack could automatically notify members of the MLOps team. After the transition is approved, another webhook could automatically trigger deployment pipelines.
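To make the push model concrete, here is a minimal sketch of a receiving service (this is not a Databricks API; the endpoint path, payload field names, and Slack webhook URL are all assumptions for illustration) that forwards a registry event to a messaging channel:

import requests
from flask import Flask, request

app = Flask(__name__)
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

@app.route("/registry-events", methods=["POST"])
def handle_registry_event():
    # Field names below are assumptions; the actual payload format is defined by
    # the Model Registry webhooks feature, currently in private preview.
    payload = request.get_json(force=True) or {}
    text = "Model Registry event: {} on model {}".format(
        payload.get("event", "unknown"), payload.get("model_name", "unknown"))
    requests.post(SLACK_WEBHOOK_URL, json={"text": text})  # notify the MLOps channel
    return "", 200

if __name__ == "__main__":
    app.run(port=8080)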

The feature is currently in private preview. Look for an in-depth guide to using webhooks as a central piece to CI/CD integration coming soon.

Monitoring Events via Audit Logs

An important part of MLOps is the ability to monitor and audit issues in production. Audit logs (or diagnostic logs) on Databricks [AWS] [Azure] provide administrators a centralized way to understand and govern activities on the platform. If your workspace has audit logging enabled, model registry events, including those around comments and webhooks [AWS] [Azure], will be logged automatically.

Get Started with the Model Registry

To see the features in action, you can watch today’s keynote: Taking Machine Learning to Production with New Features in MLflow.

You can read more about MLflow Model Registry and how to use it on AWS or Azure. Or you can try an example notebook [AWS] [Azure].

If you are new to MLflow, read the open source MLflow quickstart. For production use cases, read about Managed MLflow on Databricks and get started on using the MLflow Model Registry.

--

Try Databricks for free. Get started today.

The post MLflow Model Registry on Databricks Simplifies MLOps With CI/CD Features appeared first on Databricks.

How Scribd Uses Delta Lake to Enable the World’s Largest Digital Library


Scribd uses Delta Lake to enable the world’s largest digital library. Watch this discussion with QP Hou, Senior Engineer at Scribd and an Airflow committer, and R Tyler Croy, Director of Platform Engineering at Scribd, to learn how they transitioned from legacy on-premises infrastructure to AWS and how they utilized, implemented, and optimized Delta tables and the Delta transaction log. Please note, this session ran live in October, and below are the questions and answers that were raised at the end of the meetup.

Watch the discussion

Q&A

The questions and answers below have been lightly edited for brevity; you can listen to the entire conversation in the video above.

How do you optimize and then manage the file sizes in your cloud? For example, when you have a lot of files going into your S3 buckets, that can potentially increase costs. So how do you optimize all of this? How do you improve performance?

One of the big reasons we chose Delta Lake was that we wanted to use streaming tables to work with our streaming workloads. As you can imagine, when you are writing from a streaming application, you are basically creating a lot of small files, and all of those small files will cause a big performance issue for you. Luckily, Delta Lake comes with the OPTIMIZE command that you can use to compact those small files into larger ones. From a user’s point of view, it transparently speeds up query retrieval. You just have to run the OPTIMIZE command, and everything else is taken care of for you by Delta Lake.

From the writer’s point of view, they don’t really care about optimization. The client(s) just write whatever data they want to the table, and you can do concurrent writes as well. While the readers do have to care about the small file problem, the writers do not. And running OPTIMIZE is safe to do because Delta Lake itself has MVCC, so it’s safe to optimize and concurrently write into the same table at the same time.
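For reference, compaction is a single command run from a notebook (the table name below is a placeholder):

%python
# "events" is a placeholder Delta table name; OPTIMIZE compacts many small
# streamed files into fewer, larger ones without blocking concurrent writers.
spark.sql("OPTIMIZE events")

# Optionally co-locate a frequently filtered column while compacting:
# spark.sql("OPTIMIZE events ZORDER BY (event_date)")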

How has streaming unlocked value for your data workloads, and how have your users responded to this type of architecture?

When I started doing streaming at Scribd, real-time data processing was that pie-in-the-sky sort of moonshot initiative in comparison to the way that most of our data customers have traditionally consumed data.

They were used to nightly runs, such that if anything went wrong, they might get their data two days later. But what if they wanted to look at A/B test results for a deployment that went out at 9:00 AM today? Using the traditional batch flow, they would be waiting until tomorrow morning or, worst case, until Saturday morning. But with streaming, the goal is to analyze data as soon as it is created and get it to the people who want to use it.

And there’s a couple of really interesting use cases that started to come out of the woodwork once we started incorporating streaming more into the platform – with one big one that was totally unexpected around our ad hoc queries. For starters, we enabled all of these people to use Databricks notebooks to run these queries. Because we’re streaming data to a Delta Table,  from the user’s perspective it just looks like any other table.  If you wanna pull streaming data into your ad hoc workloads and you don’t have Delta you might be teaching users how to connect to Kafka topics or pulling it into some other intermediate store that they’re going to query.  But for our users, it’s simply a Delta Table that is populated by stream versus a Delta Table that’s updated via the nightly batch. It’s fundamentally the same interface except one is obviously refreshed a lot more frequently.  And so users, in a lot of cases without even realizing it, started to get faster results because their tables were actually being streamed into as opposed to written from a nightly batch.

This was when some people started to recognize that they had that superpower, and they were over the moon excited. I think the fastest time from data generated to something available in the platform that I’ve seen is about nine seconds. And that’s nine seconds from the event being created by a production web application to it being available in a Databricks notebook. When you show somebody who’s used to a worst-case scenario of 48 hours for their data that it now takes nine seconds, it’s like showing a spaceship to someone from the 1700s. They almost can’t comprehend the tremendous amount of change that they just encountered and get to benefit from.

How has Databricks helped your engineering team deliver?

The biggest benefit we get is a productivity boost; nowadays I think everyone agrees that engineering time is way more expensive than whatever other resources you might be buying. So being able to save developer time is the biggest win for us.

The other thing is being able to leverage the latest technologies that are standard in the industry. We’re able to use the latest version of Apache Spark™, and I have to say that Databricks has done a really good job of optimizing Spark. Not all of those optimizations are available in open source, so when we’re using the Databricks platform we get all of the optimizations we need to get the job done a lot faster.

Back in the old days, engineers had to compete for development machines. This is no longer the case, as we can now collaborate on notebooks – this is a huge win! By being able to run your development workflows in the cloud, you can scale to whatever kind of machines you need to get your work done. If you need the work completed faster, you just add more machines and it gets done faster! I have to reiterate that all of our engineers really love the notebook interface that Databricks provides. I think that was also one of the main reasons we chose Databricks from the beginning – we really loved the collaborative experience.

Can you tell us a little about what you are working on to allow Scribd to make it easier for readers to consume the written word?

Recommendations are probably one of the most important parts of our future; what originally attracted me to Scribd as a company is that the business relies on the data platform. The future success of Scribd is really, really intertwined with how well we can build out, scale, and mature our recommendation engines, our search models, and our ability to process content and get it back to users in ways they will find compelling and interesting. Because data is core to our content (the audiobooks, books, documents, etc.), it is core to what makes Scribd valuable and successful. There’s a very short line between making a better data platform and enabling that recommendations engineer to do better at whatever they do, which is immediately more success for the company. And so for us, recommendations and search are so crucial to the business that our work on the data platform directly impacts that very key functionality, which is really, really exciting, but it also means that we’ve got to do things right!

Just to cycle back a little bit to the technical side of things, I want to mention how Delta Lake enabled us to build better recommendation systems. As Tyler mentioned earlier, we have these daily batch pipelines that run every day. And as you can imagine, if a user clicks on something or expresses an intent that they liked this type of content, what if they get no new recommendations after that? That’s not a good user experience.

With Delta Lake, we now stream that user intent into our data system and into our machine learning pipelines. Because we can react to user requests in real time or near real time, we can provide much better and fresher recommendations to our users. I think this is proof that having the right technology unlocks possibilities for engineering teams and prompts them to build products that were not even possible before. So I think that was a big thing we got from using Delta Lake as well.

Watch the discussion here: https://youtu.be/QF180xOo0Gc

Learn more about how Scribd switched to Databricks on AWS and Delta Lake: https://databricks.com/customers/scribd

--

Try Databricks for free. Get started today.

The post How Scribd Uses Delta Lake to Enable the World’s Largest Digital Library appeared first on Databricks.

Delta vs. Lambda: Why Simplicity Trumps Complexity for Data Pipelines


“Everything should be as simple as it can be, but not simpler”
Albert Einstein

Generally, a simple data architecture is preferable to a complex one. Code complexity increases points of failure, requires more compute to run jobs, adds latency, and increases the need for support. As a result, data pipeline performance degrades over time, increasing costs while decreasing productivity as your data engineers spend more time troubleshooting and downstream users wait longer for data refreshes.

Complexity was perceived as a necessary evil for the automated data pipelines feeding business reporting, SQL analytics, and data science because the traditional approach for bringing together batch and streaming data required a lambda architecture. While a lambda architecture can handle large volumes of batch and streaming data, it increases complexity by requiring different code bases for batch and streaming, along with its tendency to cause data loss and corruption. In response to these data reliability issues, the traditional data pipeline architecture adds even more complexity by adding steps like validation, reprocessing for job failures, and manual update & merge.

While you can fine-tune the cost or performance of individual services, you cannot make significant (orders of magnitude) improvements in cost or performance for the total job in this architecture.

Typical Lambda data pipeline architecture requiring additional functions like validation, reprocessing, and updating & merging, adding latency, cost, and points of failure.


However, the Delta Architecture on Databricks takes a completely different approach to ingesting, processing, storing, and managing data, focused on simplicity. All the processing and enrichment of data from Bronze (raw data) to Silver (filtered) to Gold (fully ready to be used by analytics, reporting, and data science) happens within Delta Lake, requiring fewer data hops.

Lambda is complicated, requiring more to set up and maintain, whereas batch + streaming just work on Delta tables right out of the box. Once you’ve built a Bronze table for your raw data and converted existing tables to Delta Lake format, you’ve already solved the data engineer’s first dilemma: combining batch and streaming data. From there, data flows into Silver tables, where it is cleaned and filtered (e.g., via schema enforcement). By the time it reaches our Gold tables it receives final purification and stringent testing to make it ready for consumption for creating reports, business analytics, or ML algorithms. You can learn more about simplifying lambda architectures in our virtual session, Beyond Lambda: Introducing Delta Architecture.
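A minimal sketch of that first hop (the paths, schema, and checkpoint location are placeholders): a streaming job and a batch backfill can both write to the same Bronze Delta table, and downstream Silver and Gold jobs read it like any other table.

%python
bronze_path = "/delta/events_bronze"                    # placeholder table location
event_schema = "event_id STRING, ts TIMESTAMP, body STRING"

# Streaming ingest into the Bronze table
(spark.readStream.format("json")
    .schema(event_schema)
    .load("/landing/events/")                           # placeholder landing zone
    .writeStream.format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/events_bronze")
    .start(bronze_path))

# A batch backfill appends to the very same table
(spark.read.format("json")
    .schema(event_schema)
    .load("/landing/backfill/")
    .write.format("delta").mode("append").save(bronze_path))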


The simplicity of the Delta Architecture on Databricks from ingest to downstream use. This simplicity is what lowers cost while increasing the reliability of automated data pipelines.

These are the advantages that the simplified Delta Architecture brings for these automated data pipelines:

  1. Lower costs to run your jobs reliably: By reducing 1) the number of data hops, 2) the amount of time to complete a job, 3) the number of job fails, and 4) the cluster spin-up time, the simplicity of the Delta architecture cuts the total cost of ETL data pipelines. While we certainly run our own benchmarks, the best benchmark is your data running your queries. To understand how to evaluate benchmark tests for automated data pipelines, read our case study for how Germany’s #1 weather portal, wetter.com, evaluated different pipeline architectures, or  reach out to sales@databricks.com to get your own custom analysis.
  2. Single source of truth for all downstream users: In order to have data in a useful state for reporting and analytics, enterprises will often take the raw data from their data lake and then copy and process a small subset into a data warehouse for downstream consumption. These multiple copies of the data create versioning and consistency issues that can make it difficult to trust the correctness and freshness of your data. Databricks however serves as a unified data service, providing a single source of consumption feeding downstream users directly or through your preferred data warehousing service. As new use cases are tested and rolled out for the data, instead of having to build new, specialized ETL pipelines, you can simply query from the same Silver or Gold tables.
  3. Less code to maintain: In order to ensure all data has been ingested and processed correctly, the traditional data pipeline architecture needs additional data validation and reprocessing functions. Lambda architecture isn’t transactional, so if your data pipeline write job fails halfway through, you have to manually figure out what happened, fix it, and deal with partial writes or corrupted data. With Delta on Databricks, however, you ensure data reliability with ACID transactions and data quality guarantees. As a result, you end up with a more stable architecture, making troubleshooting much easier and more automated. When Gousto rebuilt their ETL pipelines on Databricks, they noted that, “a good side effect was the reduction in our codebase complexity. We went from 565 to 317 lines of Python code. From 252 lines of YML configuration to only 23 lines. We also don’t have a dependency on Airflow anymore to create clusters or submit jobs, making it easier to manage.”
  4. Merge new data sources with ease: While we have seen an increase in alternative data sources (e.g., IoT or geospatial), the traditional way of building pipelines makes them highly rigid. Layering in new data sources to, for example, better understand how new digital media ads impact foot traffic to brick & mortar locations typically means several weeks or months of re-engineering. Delta Lake’s schema evolution makes merging new data sources (or handling changes in the formats of existing data sources) simple, as sketched below.
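As a sketch of that schema evolution (the table path and the extra column are hypothetical), appending a source that carries a new column only requires opting in to a schema merge:

%python
# Hypothetical: the incoming source adds a "foot_traffic" column that the
# existing Delta table does not have yet.
new_source_df = spark.createDataFrame(
    [("store_042", "2020-11-01", 183)],
    ["location_id", "date", "foot_traffic"])

(new_source_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # evolve the table schema to include the new column
    .save("/delta/ad_impressions"))  # placeholder table path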

In the end, what the simplicity of Delta Architecture means for developers is less time spent stitching technology together and more time actually using it.

To see how Delta can help simplify your data engineering, drop us a line at sales@databricks.com.

--

Try Databricks for free. Get started today.

The post Delta vs. Lambda: Why Simplicity Trumps Complexity for Data Pipelines appeared first on Databricks.

Enforcing Column-level Encryption and Avoiding Data Duplication With PII


This is a guest post by Keyuri Shah, lead software engineer, and Fred Kimball, software engineer, Northwestern Mutual.

 
Protecting PII (personally identifiable information) is very important, as the number of data breaches and records with sensitive information exposed is trending upward every day. To avoid becoming the next victim and to protect users from identity theft and fraud, we need to incorporate multiple layers of data and information security.

As we use the Databricks platform, we need to make sure we are only allowing the right people access to sensitive information. Using a combination of Fernet encryption libraries, user-defined functions (UDFs), and Databricks secrets, Northwestern Mutual has developed a process to encrypt PII information and allow only those with a business need to decrypt it, with no additional steps needed by the data reader.

The need for protecting PII

Managing any amount of customer data these days almost certainly requires protecting PII. This is a large risk for organizations of all sizes as cases such as the Capital One data breach resulted in millions of sensitive customer records being stolen due to a simple configuration mistake. While encryption of the storage device and column-masking at the table level are effective security measures, unauthorized internal access to this sensitive data still poses a major threat. Therefore, we need a solution that restricts a normal user with file or table access from retrieving sensitive information within Databricks.

However, we also need those with a business need to read sensitive information to be able to do so. We don’t want there to be a difference in how each type of user reads the table. Both normal and decrypted reads should happen on the same Delta Lake object to simplify query construction for data analysis and report construction.

Building the process to enforce Column-level Encryption

Given these security requirements, we sought to create a process that would be secure, unobtrusive, and easy to manage. The diagram below provides a high-level overview of the components required for this process.

Process for Databricks Delta Lake to enforce column-level encryption and secure PII data.

Writing protected PII with Fernet

The first step in this process is to protect the data by encrypting it. One possible solution is the Fernet Python library. Fernet uses symmetric encryption, which is built with several standard cryptographic primitives. This library is used within an encryption UDF that will enable us to encrypt any given column in a dataframe. To store the encryption key, we use Databricks Secrets with access controls in place to only allow our data ingestion process to access it. Once the data is written to our Delta Lake tables, PII columns holding values such as social security number, phone number, credit card number, and other identifiers will be impossible for an unauthorized user to read.
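A simplified sketch of this write path follows (the secret scope, key name, column names, and table path are placeholders, not Northwestern Mutual’s actual configuration):

%python
from cryptography.fernet import Fernet
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Retrieve the encryption key from a Databricks secret scope (placeholder scope/key names)
encryption_key = dbutils.secrets.get(scope="pii", key="fernet-key")

def encrypt_value(plaintext, key):
    if plaintext is None:
        return None
    return Fernet(key).encrypt(plaintext.encode()).decode()

encrypt_udf = udf(lambda value: encrypt_value(value, encryption_key), StringType())

# Placeholder dataframe standing in for the ingested records
df = spark.createDataFrame([("cust-001", "123-45-6789")], ["customer_id", "ssn"])

# Encrypt the sensitive column before writing to the Delta Lake table
df_encrypted = df.withColumn("ssn", encrypt_udf(col("ssn")))
df_encrypted.write.format("delta").mode("append").save("/delta/customers")  # placeholder path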

Reading the protected data from a view with custom UDF

Once we have the sensitive data written and protected, we need a way for privileged users to read the sensitive data. The first thing that needs to be done is to create a permanent UDF to add to the Hive instance running on Databricks. In order for a UDF to be permanent, it must be written in Scala. Fortunately, Fernet also has a Scala implementation that we can leverage for our decrypted reads. This UDF also accesses the same secret we used in the encrypted write to perform the decryption, and, in this case, it is added to the Spark configuration of the cluster. This requires us to add cluster access controls for privileged and non-privileged users to control their access to the key. Once the UDF is created, we can use it within our view definitions for privileged users to see the decrypted data.

Currently, we have two view objects for a single dataset, one each for privileged and non-privileged users. The view for non-privileged users does not have the UDF, so they will see PII values as encrypted values. The other view for privileged users does have the UDF, so they can see the decrypted values in plain text for their business needs. Access to these views is also controlled by the table access controls provided by Databricks.
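A sketch of what such a pair of views could look like (the table, view, and UDF names are placeholders; the permanent decryption UDF from the previous section is assumed to be registered as decrypt_pii):

%python
# Non-privileged view: PII columns are returned exactly as stored (encrypted)
spark.sql("""
  CREATE VIEW IF NOT EXISTS customers_restricted AS
  SELECT customer_id, ssn, phone_number FROM customers_delta
""")

# Privileged view: the permanent UDF decrypts PII at read time
spark.sql("""
  CREATE VIEW IF NOT EXISTS customers_privileged AS
  SELECT customer_id,
         decrypt_pii(ssn) AS ssn,
         decrypt_pii(phone_number) AS phone_number
  FROM customers_delta
""")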

In the near future, we want to leverage a new Databricks feature called dynamic view functions. These dynamic view functions will allow us to use only one view and easily return either the encrypted or decrypted values based on the Databricks group they are a member of. This will reduce the amount of objects we are creating in our Delta Lake and simplify our table access control rules.

Either implementation allows the users to do their development or analysis without worrying about whether or not they need to decrypt values read from the view and only allows access to those with a business need.

Advantages of this method of column-level encryption

In summary, the advantages of using this process are:

  • Encryption can be performed using existing Python or Scala libraries
  • Sensitive PII data has an additional layer of security when stored in Delta Lake
  • The same Delta Lake object is used by users with all levels of access to said object
  • Analysts are unobstructed whether or not they are authorized to read PII

For an example of what this may look like, the following notebook may provide some guidance:

Additional resources:

Fernet Libraries

Create Permanent UDF

Dynamic View Functions

--

Try Databricks for free. Get started today.

The post Enforcing Column-level Encryption and Avoiding Data Duplication With PII appeared first on Databricks.

ACID Transactions on Data Lakes


As part of our Data + AI Online Meetup, we’ve explored topics ranging from genomics (with guests from Regeneron) to machine learning pipelines and GPU-accelerated ML to Tableau performance optimization.  One key topic area has been an exploration of the Lakehouse.

The rise of the Lakehouse architectural pattern is built upon tech innovations enabling the data lake to support ACID transactions and other features of traditional data warehouse workloads.

The Getting Started with Delta Lake tech talk series takes you through the technology foundation of Delta Lake (Apache Spark™), building highly scalable data pipelines, tackling merged streaming + batch workloads, powering data science with Delta Lake and MLflow, and even goes behind the scenes with Delta Lake engineers to understand the origins.


Making Apache Spark Better with Delta Lake

Apache Spark is the dominant processing framework for big data. Delta Lake adds reliability to Spark so your analytics and machine learning initiatives have ready access to quality, reliable data stored in low-cost cloud object stores such as AWS S3, Azure Storage, and Google Cloud Storage. In this session, you’ll learn about using Delta Lake to enhance data reliability for your data lakes.

Simplify and Scale Data Engineering Pipelines

A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion (“Bronze” tables), transformation/feature engineering (“Silver” tables), and aggregate tables/machine learning training or prediction (“Gold” tables). Combined, we refer to these tables as a “multi-hop” architecture. It allows data engineers to build a pipeline that begins with raw data as a “single source of truth” from which everything flows. In this session, you’ll learn about the data engineering pipeline architecture, data engineering pipeline scenarios and best practices, how Delta Lake enhances data engineering pipelines, and how easy adopting Delta Lake is for building your data engineering pipelines.

Beyond Lambda: Introducing Delta Architecture

Lambda architecture is a popular technique where records are processed by a batch system and streaming system in parallel. The results are then combined during query time to provide a complete answer. With the advent of Delta Lake, we are seeing a lot of our customers adopting a simple continuous data flow model to process data as it arrives. We call this architecture the “Delta Architecture.” In this session, we cover the major bottlenecks for adopting a continuous data flow model and how the Delta Architecture solves those problems.

Getting Data Ready for Data Science with Delta Lake and MLflow

When it comes to planning for data science initiatives, one must take a holistic view of the entire data analytics realm. Data engineering is a key enabler of data science that helps furnish reliable, quality data in a timely fashion.  In this session, you will learn about the data science lifecycle, key tenets of modern data engineering, how Delta Lake can help make reliable data ready for analytics, how easy it is to adopt Delta Lake to power your data lake, and how to incorporate Delta Lake within your data infrastructure to enable Data Science.

Behind the Scenes: Genesis of Delta Lake

Developer Advocate Denny Lee interviews Burak Yavuz, Software Engineer at Databricks, to learn about the Delta Lake team’s decision making process and why they designed, architected, and implemented the architecture that it is today. In this session, you’ll learn about technical challenges that the team faced, how those challenges were solved, and what their plans are for the future.

Get Started

Get Started filling your Delta Lake today by watching this complete series.

What’s Next?

If you want to expand your knowledge of Delta Lake, watch our Diving into Delta Lake tech talk series. Guided by the Delta Lake engineering team (including Burak Yavuz, Andrea Neumann, and Tathagata “TD” Das) and Developer Advocate Denny Lee, you will learn about the internal implementation of Delta Lake.

If you want to hear about future online meetups, join our Data + AI Online Meetup on meetup.com.

Diving into Delta Lake
Immerse yourself in the internals of Delta Lake, a popular open source technology for more reliable data lakes.

The post ACID Transactions on Data Lakes appeared first on Databricks.

Azure Databricks Achieves FedRAMP High Authorization on Microsoft Azure Government (MAG)


We are excited to announce that Azure Databricks is now Federal Risk and Authorization Management Program (FedRAMP) authorized at the High Impact level, enabling new data and AI use cases across public sector on the dedicated Microsoft Azure Government (MAG) cloud.

Azure Databricks is trusted by federal, state and local government agencies, such as the U.S. Department of Veterans Affairs (VA), Centers for Medicare and Medicaid Services (CMS), Department of Transportation (DOT), and DC Water, for their critical data and AI needs. Databricks maintains the highest level of data security by incorporating industry-leading best practices into our security program. The FedRAMP High authorization provides customers the assurance that Azure Databricks meets U.S. Government security and compliance requirements to support their sensitive analytics and data science use cases.

FedRAMP has seen rapid adoption since it was introduced in 2011 by the Office of Management and Budget (OMB) to help accelerate adoption of secure cloud computing services. FedRAMP defines three primary classifications of data handled by local, state and federal agencies – Low, Moderate, and High Impact levels. Azure Databricks meets the FedRAMP requirements for the highest authorization level. FedRAMP High is a gold standard among public sector, enterprise and industry vertical organizations who are modernizing their approach to information security and privacy. FedRAMP High authorization validates Azure Databricks security controls and monitoring for NIST 800-53 at the high impact level.

“Azure Databricks helps customers address security and compliance requirements for regulated public sector use cases, such as immunization, chronic disease prevention, transportation, weather, and financial and economic risk analytics,” said David Cook, Chief Information Security Officer at Databricks. “The FedRAMP High authorization validates Azure Databricks security controls and monitoring for NIST 800-53 at the high impact level. We are pleased to demonstrate our commitment to security and compliance with the FedRAMP High authorization on Microsoft Azure Government.”

FedRAMP High authorization enables government agencies to analyze sensitive data such as insurance statements, financial records and healthcare claims to improve processing times, lower operating costs, and reduce claims fraud. For example, government agencies and their vendors can analyze large geospatial datasets from GPS satellites, cell towers, ships and autonomous platforms for marine mammal and fish population assessments, highway construction, disaster relief, and population health.

US Government Certification        Azure Databricks on Azure Government (MAG)
CJIS                               X
CNSSI 1253                         X
DFARS                              X
DoD DISA SRG Level 2               X
DoE 10 CFR Part 810                X
EAR                                X
FedRAMP High                       X
IRS 1075                           X
ITAR                               X
MARS-E (US)                        X
NERC                               X
NIST Cybersecurity Framework       X
NIST SP 800-171                    X

View the Azure Databricks FedRAMP High authorization assessment and other security compliance documentation

You can view and download the Azure Databricks FedRAMP High authorization and related authorizations by visiting the Microsoft Trust Center. You can view and download details on all Microsoft Azure services, including Azure Databricks at the Microsoft Azure compliance offerings documentation and view the list of Azure services by FedRAMP and DoD CC SRG audit scope and directly on the FedRAMP Marketplace. Learn more about FedRAMP by viewing the Microsoft FedRAMP documentation.

As always, we welcome your feedback and questions and commit to helping customers achieve and maintain the highest standard of security and compliance. Please feel free to reach out to the team through Microsoft Azure Support.

Learn more about this announcement by attending the Azure Databricks Government Forum and other Azure Databricks events and follow us on Twitter, LinkedIn, and Facebook for more Azure Databricks security and compliance news, customer highlights, and new feature announcements.

--

Try Databricks for free. Get started today.

The post Azure Databricks Achieves FedRAMP High Authorization on Microsoft Azure Government (MAG) appeared first on Databricks.

Establishing Your Career Path: Lessons Brought to You by Databricks’ Women in Sales


Women in Sales (WIS) is a global employee networking group (ERG) at Databricks dedicated to helping women accelerate their careers in sales. On October 13th, 2020, WIS hosted Heather Akuiyibo, VP of Commercial and Mid Market Sales for North America, Shelby Ferson, Sr. Commercial Sales Manager for Australia & New Zealand, and Jerry Weitzman, SVP of Enterprise Sales for the Americas, for a fireside chat. During the chat, the leaders shared how they leveraged the concepts of ownership, sponsorship and action to accelerate their careers and provided guidance on how everyone can apply the same approach to their own professional pursuits.

See below for a synopsis of the topics discussed by the panel and watch the video recording that follows to experience the entire program.

  • Ownership: Your destiny is in your own hands, and you must take it upon yourself to seek out the guidance necessary to make the career decisions that work best for YOU. If there is an area that you are interested in, or a mentor/sponsor has identified as an opportunity to develop, take ownership and put yourself in a position to grow your skillset with respect to it. Use your organizational chart as a resource, and reach out to a potential sponsor or mentor through an informal Slack message or even a more formal email. Be creative and use social settings, such as a virtual coffee or after work “happy hour” to build these connections. As companies develop more of a remote presence, you will need to find new ways to recreate the traditional “in the hallway” chats.
  • Sponsorship: Having one or more people who are invested in your professional growth can be the key to developing and unlocking new career opportunities. Building these sponsor relationships, both within and outside your department, across your organization is vital to your growth. Developing strong relationships with your sponsors, through conversations and other activities, will give you the champions and knowledge needed to gain access to future opportunities. As Shelby suggested in her comments, which you can listen to in full below, you need to identify and cultivate your “Go Team”–which ideally would consist of one person, who you know is invested in your success and that you can count on, from every team with which you interact.
  • Action: Take accountability and execute your next move. Breaking down goals into small actionable items can help you maintain focus and prevent you from burning out while moving your career forward. Start by mastering your current role and become an expert at everything it involves. Once you’ve mastered your current role, you can then build upon that success with complete confidence in your abilities and knowledge as you progress in your field or even cross-functionally. To master your craft, learn your strengths and weaknesses. Seek out constructive feedback from everyone you work with, to learn how and where you can improve the most. The sooner you get this honest appraisal, the faster you will be able to work towards filling the gaps and reaching the mastery needed to advance to the next level.

Leveraging the three-pronged approach of ownership, sponsorship and action can help everyone looking to accelerate and advance their career in the right direction. The WIS mission is to help women working in sales find and accelerate their career paths at Databricks and within the industry through open dialogue, access to internal and external resources, and networking events. We understand that building a supportive team around women in the workforce moves the individual as well as the collective forward and drives better results for companies, as reinforced by McKinsey’s study.

Watch the recording

To learn more about opportunities at Databricks check out our careers page.

--

Try Databricks for free. Get started today.

The post Establishing Your Career Path: Lessons Brought to You by Databricks’ Women in Sales appeared first on Databricks.


Simplify Access to Delta Lake Tables on Databricks From Serverless Amazon Redshift Spectrum


This post is a collaboration between Databricks and Amazon Web Services (AWS), with contributions by Naseer Ahmed, senior partner architect, Databricks, and guest author Igor Alekseev, partner solutions architect, AWS. For more information on Databricks integrations with AWS services, visit https://databricks.com/aws/

 
Amazon Redshift recently announced support for Delta Lake tables. In this blog post, we’ll explore the options to access Delta Lake tables from Spectrum, implementation details, pros and cons of each of these options, along with the preferred recommendation.

A popular data ingestion/publishing architecture includes landing data in an S3 bucket, performing ETL in Apache Spark, and publishing the “gold” dataset to another S3 bucket for further consumption (these could be frequently or infrequently accessed datasets). In this architecture, Redshift is a popular way for customers to consume data. Often, users have to create a copy of the Delta Lake table to make it consumable from Amazon Redshift. This approach doesn’t scale and unnecessarily increases costs. This blog’s primary motivation is to explain how to reduce these frictions when publishing data by leveraging the newly announced Amazon Redshift Spectrum support for Delta Lake tables.

Amazon Redshift Spectrum integration with Delta

Amazon Redshift Spectrum relies on Delta Lake manifests to read data from Delta Lake tables.  A manifest file contains a list of all files comprising data in your table.  In the case of a partitioned table, there’s a manifest per partition.  The manifest file(s) need to be generated before executing a query in Amazon Redshift Spectrum.  Creating external tables for data managed in Delta Lake documentation explains how the manifest is used by Amazon Redshift Spectrum.  Note, this is similar to how Delta Lake tables can be read with AWS Athena and Presto.

Here’s an example of a manifest file content:

s3://bucketname/stock_quotes_partitioned/core2/Symbol=AACG/part-00000-XXXXX-044a-44bc-9d78-48a28f2f6cfe.c000.snappy.parquet
s3://bucketname/stock_quotes_partitioned/core2/Symbol=AAKG/part-00000-XXXXX-044a-44bc-9d78-48a28f2f6cfe.c001.snappy.parquet

Steps to Access Delta on Amazon Redshift Spectrum

Next we will describe the steps to access Delta Lake tables from Amazon Redshift Spectrum. This will include options for adding partitions, making changes to your Delta Lake tables and seamlessly accessing them via Amazon Redshift Spectrum.

Steps to access Databricks Delta tables from Amazon Redshift Spectrum.

Step 1: Create an AWS Glue DB and connect Amazon Redshift external schema to it

Enable the following settings on the cluster to make the AWS Glue Catalog the default metastore.

Create the Glue database:

%sql
CREATE DATABASE IF NOT EXISTS clicks_west_ext;
USE clicks_west_ext;

This will set up a schema for external tables in Amazon Redshift Spectrum.

%sql
CREATE EXTERNAL SCHEMA IF NOT EXISTS clicks_pq_west_ext
FROM DATA CATALOG
DATABASE 'clicks_west_ext'
IAM_ROLE 'arn:aws:iam::xxxxxxx:role/xxxx-redshift-s3'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

Step 2: Generate Manifest

You can add the statement below to your data pipeline pointing to a Delta Lake table location.

%sql
GENERATE symlink_format_manifest FOR TABLE delta.`<path-to-delta-table>`

Note, the generated manifest file(s) represent a snapshot of the data in the table at a point in time. The manifest files need to be kept up-to-date. There are two approaches here. One is to run the statement above whenever your pipeline runs; this will update the manifest, keeping the table up-to-date. The main disadvantage of this approach is that the data can become stale when the table gets updated outside of the data pipeline.

The preferred approach is to turn on delta.compatibility.symlinkFormatManifest.enabled setting for your Delta Lake table.  This will enable the automatic mode, i.e. any updates to the Delta Lake table will result in updates to the manifest files.  Use this command to turn on the setting.

%sql
ALTER TABLE delta.`<path-to-delta-table>` SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)

This will keep your manifest file(s) up-to-date ensuring data consistency.

Step 3: Create an external table directly from Databricks Notebook using the Manifest

When creating your external table, make sure your data contains data types compatible with Amazon Redshift. Note, we didn’t need to use the keyword EXTERNAL when creating the table in the code example below; it’ll be visible to Amazon Redshift via the AWS Glue Catalog.

%sql 
CREATE TABLE if not exists gluedbname.redshiftdeltatable (SpotDate string, Exchange string, Currency string, OpenPrice double, HighPrice double, LowPrice double, LastPrice double, Volume int, SplitRatio string, CashDividend string, DividendCurrency string)
PARTITIONED BY( Symbol string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucketname/stock_quotes_partitioned/core42/_symlink_format_manifest' 

Step 4: Options to Add/Delete partitions 

If you have an unpartitioned table, skip this step.  Otherwise, let’s discuss how to handle a partitioned table, especially what happens when a new partition is created.

Delta Engine will automatically create new partition(s) in Delta Lake tables when data for that partition arrives. Before the data can be queried in Amazon Redshift Spectrum, the new partition(s) will need to be added to the AWS Glue Catalog pointing to the manifest files for the newly created partitions.

There are three options to achieve this:

  1. Add partition(s) using Databricks AWS Glue Data Catalog Client (Hive-Delta API),
  2. Add partition(s) via Amazon Redshift Data APIs using boto3/CLI,
  3. MSCK repair.

Below, we are going to discuss each option in more detail.

Option 1: Using Hive-Delta API commands (preferred way)

Using this option in our notebook we will execute a SQL ALTER TABLE command to add a partition.

%sql
ALTER TABLE gluedbname.redshiftdeltatable ADD IF NOT EXISTS PARTITION (Symbol='AATG') LOCATION 's3://bucketname/stock_quotes_partitioned/core7/_symlink_format_manifest/Symbol=AATG';

Note: here we added the partition manually, but it can be done programmatically. The code sample below contains the function for that.  Also, see the full notebook at the end of the post.

%python
def add_partitions(partitions, tablename):
  for row in partitions.rdd.collect():
      sql =f"ALTER TABLE {tablename} ADD IF NOT EXISTS PARTITION (Symbol=\'{row['Symbol']}\') LOCATION \'{core_location}/_symlink_format_manifest/Symbol={row['Symbol']}\'"
      print(sql)
      spark.sql(sql)
  
df1 = ingestDF.select('Symbol').distinct()
df2 = targetDF.select('Symbol').distinct()
newPartitions = df1.subtract(df2)
      
add_partitions(newPartitions, table)

Option 2: Using Amazon Redshift Data API

Amazon Redshift recently announced availability of Data APIs.  These APIs can be used for executing queries.  Note that these APIs are asynchronous.  If your data pipeline needs to block until the partition is created you will need to code a loop periodically checking the status of the SQL DDL statement.

Option 2.1 CLI

We can use the Redshift Data API right within the Databricks notebook. As a prerequisite, we will need to add awscli from PyPI. Then we can use execute-statement to create a partition. Once executed, we can use the describe-statement command to verify the DDL’s success. Note that the get-statement-result command will return no results, since we are executing a DDL statement here.

Option 2.2 Redshift Data API (boto3 interface)

Amazon Redshift also offers boto3 interface. Similarly, in order to add/delete partitions you will be using an asynchronous API to add partitions and need to code loop/wait/check if you need to block until the partitions are added.
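A rough sketch of that flow with boto3 (the cluster, database, user, schema, and partition values are placeholders):

%python
import time
import boto3

client = boto3.client("redshift-data")  # assumes AWS credentials are available to the cluster

resp = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder
    Database="dev",                            # placeholder
    DbUser="awsuser",                          # placeholder
    Sql="ALTER TABLE spectrum_schema.redshiftdeltatable "
        "ADD IF NOT EXISTS PARTITION (Symbol='AATG') "
        "LOCATION 's3://bucketname/stock_quotes_partitioned/core7/_symlink_format_manifest/Symbol=AATG'")

# The API is asynchronous, so poll until the DDL completes
while True:
    status = client.describe_statement(Id=resp["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(2)
print(status)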

Option 3. Using MSCK Repair 

An alternative approach to add partitions is using Databricks Spark SQL

%sql
MSCK REPAIR TABLE gluedbname.redshiftdeltatable

It’s a single command to execute, and you don’t need to explicitly specify the partitions.  There will be a data scan of the entire file system.  This might be a problem for tables with large numbers of partitions or files. However, it will work for small tables and can still be a viable solution.

Step 5: Querying the data

Once the external table and its partitions are registered in the AWS Glue Catalog, the Delta Lake table can be queried from Amazon Redshift Spectrum through the external schema created in Step 1, just like any other Redshift Spectrum table.

Conclusion

In this blog we have shown how easy it is to access Delta Lake tables from Amazon Redshift Spectrum using the recently announced Amazon Redshift support for Delta Lake.  By making simple changes to your pipeline you can now seamlessly publish Delta Lake tables to Amazon Redshift Spectrum.  You can also programmatically discover partitions and add them to the AWS Glue catalog right within the Databricks notebook.

Try this notebook with a sample data pipeline, ingesting data, merging it and then query the Delta Lake table directly from Amazon Redshift Spectrum.

--

Try Databricks for free. Get started today.

The post Simplify Access to Delta Lake Tables on Databricks From Serverless Amazon Redshift Spectrum appeared first on Databricks.

Azure Databricks Now Generally Available in Azure Government


We are excited to announce that Azure Databricks is now generally available (GA) in Microsoft’s Azure Government (MAG) region, opening up new data and AI use cases for federal agencies, state and local governments, public universities, and government contractors to enable faster decisions, more accurate predictions, and unified and collaborative data analytics. More than a dozen federal agencies are building cloud data lakes and are looking to use Delta Lake for reliability. Azure Databricks is FedRAMP High authorized in the Microsoft Azure GovCloud regions. Additionally, DoD Impact Level 5 authorization is in progress and expected in the coming months.

Proven analytics and AI at scale

Azure Databricks is trusted by organizations such as the U.S. Department of Veterans Affairs (VA), Centers for Medicare and Medicaid Services (CMS), Department of Transportation (DOT), DC Water, Unilever, Daimler, Credit Suisse, Starbucks, AstraZeneca, McKesson, ExxonMobil, and H&R Block for mission-critical data and AI use cases. Databricks maintains the highest level of data security by incorporating industry leading best practices into our cloud computing security program. Azure Government general availability provides customers the assurance that Azure Databricks is designed to meet United States Government security and compliance requirements to support sensitive analytics and data science use cases. Azure Government is a gold standard among public sector organizations and their partners who are modernizing their approach to information security and privacy.

Enabling government agencies and partners to accelerate innovation on Azure Government

“At Veterans Affairs, we included Azure Databricks as part of our Microsoft Azure Authority to Operate (ATO),” said Joseph Fourcade, Lead Cyber Security Analyst, U.S. Department of Veterans Affairs Enterprise Cloud Solutions Office (ECSO). “When Databricks received FedRAMP High approval, we were able to move quickly to inherit that same Azure ATO and approve Azure Databricks for production workloads. Timing couldn’t have been better, as we have been working with a number of VA customers implementing Databricks for critical programs.”

“Numerous federal agencies are looking to build cloud data lakes and leverage Delta Lake for a complete and consistent view of all their data,” said Kevin Davis, VP, Public Sector at Databricks. “The power of data and AI are being used to dramatically enhance public services, lower costs and improve quality of life for citizens. Using Azure Databricks, government agencies have aggregated hundreds of data sources to improve citizen outreach, automated processing of hourly utility infrastructure IoT data to enable predictive maintenance, deployed machine learning models to predict patient needs, and built dashboards to predict transportation needs and optimize logistics. FedRAMP High authorization for Azure Databricks further enables federal agencies to analyze all of their data for improved decision making and more accurate predictions.”

High Impact data are frequently stored and processed in emergency services systems, financial systems, department of defense and healthcare systems. Azure Databricks enables government agencies and their contractors to analyze public records data such as tax history, financial records, welfare and healthcare claims to improve processing times, reduce operating costs, and reduce claims fraud. In addition, government agencies and contractors are taking a data-driven approach to environmental, social and governance (ESG) performance and are analyzing large geospatial datasets from GPS satellites, cell towers, ships and autonomous platforms for marine mammal and fish population assessments, highway construction, disaster relief, and population health. State and local governments who utilize federal data also depend on Azure Government to ensure they meet these high standards of security and compliance.

Learn more about Azure Government and Azure Databricks

To learn more about Azure Databricks and Azure Government, visit the Azure Government website, see the full list of Azure services available in Azure Government, compare Azure Government and global Azure, and read Microsoft’s Azure Government documentation here.

Get started with Azure Databricks by joining the Azure Databricks Government Forum and future Azure Databricks events. Continue your learning with this free, 3-part training series. Learn more about Azure Databricks security best practices by reading this blog post.

As always, we welcome your feedback and questions and commit to helping customers achieve and maintain the highest standard of security and compliance. Please feel free to reach out to the team through Azure Support.

Follow us on Twitter, LinkedIn, and Facebook for more Azure Databricks security and compliance news, customer highlights, and new feature announcements.

--

Try Databricks for free. Get started today.

The post Azure Databricks Now Generally Available in Azure Government appeared first on Databricks.

The Analytics Evolution With Azure Databricks, Azure Synapse and Power BI


Let’s face it, the landscape of different analytics services and products is complicated and constantly evolving. The Databricks and Microsoft partnership that created Azure Databricks began 4 years ago, and in that time Azure Databricks has evolved along with other Azure services like Azure Synapse. What remains constant is a great story from Databricks and Microsoft working together to enable joint customers like Unilever, Daimler and GSK to build their analytics on Azure with the best of both. It all starts with a common vision for an analytics platform.

Get your data in one place

There is a universal goal within analytics teams to establish a common data source that serves every type of analytics from one place. This eliminates the primary source of frustration and complexity for analytics, namely separate silos of data. To build that common data source, cloud storage is the most compelling option, offering unmatched performance, scale and value. If you take away nothing else from this post, remember that getting all your data into a data lake built on cloud storage like Azure Data Lake Storage (ADLS) is the best first step in your analytics journey. And there are plenty of great options, for example Azure Data Factory, to sync or move all your data directly into ADLS.

The next important thing to remember is that data lakes built on cloud storage do not natively provide all the database-like features that are commonly needed for analytics. Historically this caused a lot of pain for teams implementing a data lake using data formats like Parquet, but in the last several years we have seen innovations with transaction logs and related features (e.g. indexing) for data lakes. Delta Lake is the best example, originally created by Databricks and now an open-source project managed by the Linux Foundation. To ensure data is ready for analytics, Delta Lake provides transaction support and data quality capabilities to curate data, enforce schema and ensure reliable data. The majority of data processed with Azure Databricks is already in Delta Lake; customers like Starbucks, Grab, Mars Petcare and Cerner are more examples of companies using Delta Lake to create a foundation for their data platform.

Use Azure Databricks, Azure Synapse and Power BI together

The combination of ADLS with Delta Lake is at the heart of Databricks and Microsoft’s shared vision for analytics on Azure. Key analytics services like Databricks, Synapse and Power BI are primed and ready to tap into this data in one place, making it easy to address the spectrum of analytics scenarios across BI, data science and data engineering. Azure Databricks provides the best environment for empowering data engineers and data scientists with a productive, collaborative platform and code-first data pipelines. Azure Synapse provides high performance data warehousing for low-latency, high-concurrency BI, integrated with no-code / low-code development. Both have services for analysts to perform analytics using the most common syntax for data – SQL – directly on the lake, giving users on Azure a lot to cheer about.

These services on Azure also integrate with each other to form a mesh of interconnected analytics. Azure Databricks has a built-in and highly optimized connector to Synapse that today is the most popular service connector across all of Databricks. This is no surprise as many customers like Marks & Spencer and Rockwell Automation have used Azure Databricks and Synapse together to modernize their analytics platform into the cloud for high-performance and scalability. Power BI is already part of Synapse Studio, and the new Power BI connector to Azure Databricks makes it easier and more performant to deliver great BI visualizations and reports through the same Power BI service. The combination of these services operating together on the same underlying data lake make Azure a great place for analytics.

What makes Azure Databricks special

Delivering a cloud analytics platform is hard. The historical complexities of developing analytics software already existed, and now that is married with the subtleties and differences of architecting for a cloud-scale solution. To peek under the hood on what it takes, see what Databricks co-founder and chief technologist Matei Zaharia presented on developing large-scale cloud software and the lessons learned.

What quickly becomes apparent is how much depends on great engineering collaboration with the underlying cloud infrastructure and services. This is amplified for Azure Databricks that operates at cloud scale, spinning up millions of VM hours every day and processing Exabytes of data each month. That amount of processing driven by Azure Databricks leverages the underlying Azure services for compute, storage and networking, and it would be impossible to achieve great performance without serious joint engineering work that gets into details like compute resource request protocols and network throttling.

This is a big part of what makes Azure Databricks special. As a first-party service from Microsoft, the Databricks and Azure engineering teams work together all the time, constantly enhancing the performance and scalability across dozens of dimensions, and monitoring the fleet of environments while providing mission critical support for any issues. We jointly plan new features and releases on Azure, for example we recently hosted an exclusive public preview of the new Photon engine first on Azure. This collaboration has been underway for 4 years now, with hundreds of thousands of hours put into making Databricks run really well specifically on Azure!

The big picture

Beyond the specifics for any one service or technology, there are a few tenets that stand out. First, put data into one place with data lake services on cloud storage as the best foundation. Second, make that data open and accessible to the analytics services in the ecosystem to address any use case. When new features or services become available, as always happens, this architecture is flexible and future-ready to feed data wherever it needs to go. Databricks and Microsoft have worked together for years to make analytics on Azure a compelling platform for any organization by following these tenets and constantly innovating to provide simple, effective analytics services for Azure customers!

Learn More About Azure Databricks!

--

Try Databricks for free. Get started today.

The post The Analytics Evolution With Azure Databricks, Azure Synapse and Power BI appeared first on Databricks.

How Retina Uses Databricks Container Services to Improve Efficiency and Reduce Costs


This is a guest community post authored by Brad Ito, CTO of Retina.ai, with contributions by Databricks Customer Success Engineer Vini Jaiswal.

Retina is the customer intelligence partner that empowers businesses to maximize customer-level profitability. We help our clients boost revenue with the most accurate lifetime value metrics. Our forward-looking, proprietary models predict customer lifetime value at or before the first transaction.

To build and deliver these models, Retina data scientists and engineers use Databricks. We recently started to use Databricks Container Services to reduce costs and increase efficiency. Docker containers give us 3x faster cluster spin-ups and unify our dependency management. In this post, we’ll share our process and some public docker images we’ve made so that you can do this too.

Context: Docker

We use Docker to manage our local data science environments and control dependencies at the binary level, for truly reproducible data science. At Retina, we use both R and Python, along with an ever-evolving mix of public and proprietary packages. While we use several tools to lock down our dependencies, including MRAN snapshots for R, conda for Python packages with clean dependencies, and pip-tools for Python packages with less-clean dependencies, we find that everything works most reliably when packaged in Docker containers.

Speaking of Docker containers, most tutorials lead users down the wrong path. Those tutorials usually start from a public official image, then add in dependencies and install the application to be executed all in a single image. Instead, if you are trying to develop and deploy code (for interpreted languages), like Retina models, you need at least 3 images:

  1. Base image: We use a base image to lock down system and package dependencies and leverage whatever tools are available to try to make that reproducible.
  2. Development image: We use a development image that adds development tools and use that when writing new code by running that image linked to local source code.
  3. Optimized deployment image: And finally, when we want to run our code in other environments, we create an optimized deployment image that combines our code with the base image to create a fully executable container.

Docker Retina.ai model Image requirements for developing and deploying code.

For Retina’s specific use case, our base image starts with Ubuntu and adds in R and Python dependencies, including some of our own packages. We then have two development images. The first leverages code from the rocker project to install RStudio for doing data science in R. The second leverages code from Jupyter Docker Stacks to install Jupyter Lab for Python. When we run the development images, we use a bash script that injects any needed environment variables and mounts the local source code in a way that is accessible from the installed IDE.

Context: Databricks

Retina has been using Databricks for several years to manage client data for machine learning models. It allows R and Python to get the best of both worlds, while leveraging Spark for its big data capabilities. However, Retina depends on several custom packages that require time-consuming compilation upon installation, meaning new clusters are slow to spin up.

This slow cluster spin-up has had a cascading effect on costs. For interactive clusters, Retina had to over-provision instances to avoid long delays while the cluster auto-scales. For non-interactive jobs, we’ve had to custom-tune dependencies and pay for EC2 instances to repeatedly compile the same C code for frequently run jobs.

While we were still able to run the various workloads we needed to run, we knew that we were being inefficient in compute costs and maintenance hours.

Context: Databricks Container Services

Databricks allows every data persona to work on one unified platform and build their applications on it. With this mission in mind, data engineers and developers can build complex applications using their own custom golden images. Databricks Container Services (DCS) allows users to specify bespoke libraries or custom images that are used in conjunction with Databricks Runtime, leveraging the distributed processing of Apache Spark while still benefiting from the optimizations of Databricks Runtime. With this solution, data engineers and developers can build highly customized execution environments, tightly secure applications, and prepackage an environment that is repeatable and immediately available when a cluster is created.

The golden image addresses these use cases:

  • Organizations can install required software agents on the VMs that are used by their spark applications.
  • A deterministic environment in a container that a cluster is built from for production workloads.
  • Data Scientists can use custom Machine Learning libraries provisioned by DCS enabled clusters for their model experiments or exploratory analysis.
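
For illustration, a DCS-enabled cluster can be created through the Databricks Clusters API by passing a docker_image in the cluster spec. This is a hypothetical sketch: the workspace URL, access token, image name, and node type below are placeholders, not Retina's actual configuration.

    import requests

    payload = {
        "cluster_name": "dcs-custom-image",
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
        # Custom image hosted on Docker Hub or a private registry
        "docker_image": {
            "url": "retina/databricks-standard:latest",
            # "basic_auth": {"username": "...", "password": "..."},  # for private registries
        },
    }

    response = requests.post(
        "https://<databricks-instance>/api/2.0/clusters/create",
        headers={"Authorization": "Bearer <personal-access-token>"},
        json=payload,
    )
    response.raise_for_status()
    print(response.json())  # returns the new cluster_id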

Figure. How Databricks Container Services works with Databricks

Solution to address Retina’s pain points

The Databricks Container Services feature lets you build custom Docker containers to create new clusters. Retina built a hierarchy of custom containers in-house to address many of the pain points above.

We do pre-compilation of packages in the container. Instead of recompiling the same code over and over, we just load a container and have our packages already pre-installed. The result is a 3x speedup in the startup times for new clusters.

We also install most of the packages we need in “standard” docker containers that run both R and Python and have most of the dependencies we use frequently. These are the same packages installed in our local development containers, so we get a common reproducible environment for running code. For special cases, we have the ability to create new containers that are optimized for those additional needs.

Container hierarchy minimal image requirements for Retina.ai-Databricks builds.

The above diagram shows our container hierarchy.

We made our own minimal image retina/databricks-minimal that builds in the base requirements to run the Databricks Runtime 6.x, along with Scala, Python and R support in notebooks. It also incorporates some Docker optimizations to reduce the number of layers and the overall image size.

From there, we make our own “standard” image which adds in various packages, both public and private, which we use in both our automated databricks jobs and interactive clusters. To show how some of this works, we made a public retina/databricks-standard image which has the same package dependencies as the 6.x Databricks runtime.

You can see our source code for this in the repo docker-retina-databricks. We welcome collaborators to help improve and optimize containers on Databricks.

Wrapping Up

By leveraging containers, and the Databricks integration, Retina achieved significant savings in both cost and time. For Retina, this means faster spin-up times for more efficient usage of cloud resources, and better leveraging of the auto-scaling. It also means a seamless experience for our data scientists as they can focus on building models and not on troubleshooting dependencies.

You can build your Docker base from scratch. Your Docker image must meet these requirements:

  • JDK 8u191 as Java on the system PATH
  • bash
  • iproute2 (ubuntu iproute)
  • coreutils (ubuntu coreutils, alpine coreutils)
  • procps (ubuntu procps, alpine procps)
  • sudo (ubuntu sudo, alpine sudo)
  • Ubuntu or Alpine Linux

Or, you can use the minimal image built by Databricks at https://github.com/databricks/containers

A step-by-step guide is available for you to get started with Databricks Container Services.

--

Try Databricks for free. Get started today.

The post How Retina Uses Databricks Container Services to Improve Efficiency and Reduce Costs appeared first on Databricks.

See Databricks at re:Invent and Demystify Your Data


Databricks, founded by the original creators of Apache Spark™ and Delta Lake, is thrilled to be a Platinum sponsor at AWS re:Invent 2020, where you can see how we simplify data engineering, analytics and ML with a unified platform. This year, we are bringing the magic and intrigue to your living room. We’ve got a surprise in store – a private online show with David Blaine, the master illusionist. He makes the extraordinary look simple — just like Databricks. Sign up now to secure your spot! The show will take place on Monday, December 14 at 5:00 PM PST.

It’s all part of the magic you’ll find at this year’s Databricks re:Invent experience. Join us so you can:
  • Reserve your spot for the live performance with David Blaine
  • Play our game and win a Databricks re:Invent 2020 T-Shirt
  • Attend our session “Stop Struggling with Analytics on the Data Lake” on December 17
  • Get your questions answered by meeting 1:1 with our industry data and AI experts

In our session, “Stop Struggling with Analytics on the Data Lake,” you will hear from Denis Dubeau, Partner Solution Architect, about how Comcast and Digital Turbine are simplifying data engineering, analytics and ML with the Databricks Unified Data Analytics Platform. The session will air in three different time zones on December 17. Learn how to use Databricks, with Delta Lake, to make the data in your S3 data lake reliable with higher performance, so it can support all your analytics across data science, machine learning and BI/reporting. Companies like Comcast have used this approach to reduce costs by $9 million and improve model deployment from weeks to minutes!

Visit our re:Invent webpage or log into the re:Invent Databricks booth to check out the four demos available at the conference. Ask our data and AI experts live at the booth about these demos or your own projects.

Delta Lake and AWS Glue – see how Delta Lake, as a managed service on Databricks, is integrated with AWS Glue as a metadata store.
SQL Analytics – see the new SQL Analytics service Databricks announced at Data + AI Summit, enabling your SQL analysts to analyze your entire data lake.
AWS Quickstarts: Databricks – Databricks is available as an AWS Quickstarts, which provides a pre-configured guide to get you up and running with Databricks in no time.
MLflow and AWS SageMaker – see how MLflow and AWS SageMaker are integrated to provide a seamless way to manage your machine learning operations and distribute your models.

To Learn More:

  • Sign up now for our post-re:Invent webinar taking place on January 14 at 10:00 AM PST.
  • Get six hours of free training using Databricks on AWS
  • Talk to an expert: Contact us to get answers to questions you might have as you start your first project or to learn more about available training.

You can visit us at databricks.com/reinvent for more information. We look forward to seeing you there!

Follow us on Twitter, LinkedIn, and Facebook for more Databricks news, customer highlights, and new feature announcements.

--

Try Databricks for free. Get started today.

The post See Databricks at re:Invent and Demystify Your Data appeared first on Databricks.

Azure Databricks Now Generally Available in the Azure China Region


We are excited to announce that Azure Databricks is now generally available in Microsoft’s Azure China region, enabling new data and AI use cases with fast, reliable and scalable data processing, analytics, data science and machine learning on the cloud. With availability across more than 30 Azure regions, global organizations appreciate the consistency, ease of use and collaboration enabled by Azure Databricks.

Helping customers and partners scale with global availability

Organizations need a consistent set of cloud services across their global operations, and customers looking to migrate on-premises big data workloads from a data center to the cloud frequently need a local Microsoft Azure region to meet data residency and data sovereignty requirements. Azure China is a sovereign cloud in mainland China that meets a high bar of security and compliance requirements. Azure Databricks helps customers deploy and scale batch and streaming data processing, simplify analytics and data science, and implement machine learning in a way that is consistent and collaborative.

Now, with Azure Databricks general availability in Azure China, organizations that operate in China can leverage the same enterprise-grade service and collaborative notebooks in the region as well. From retail recommendations and financial services risk analysis to improved diagnostics in healthcare and life sciences, Azure Databricks enables data teams of all sizes and across industries to innovate more quickly and more collaboratively on the cloud.

“Azure Databricks enables our data engineering and data science teams to deliver results even faster than ever. Scalable data processing and machine learning enable our teams to quickly adapt to shifts in consumer demand and buying behavior,” says Kevin Zeng, Greater China IT CTO at Procter & Gamble. “The availability of Azure Databricks in China enables our global teams to provide a consistent experience for our customers in the region.”

You can learn more about Azure Databricks general availability in the Azure China region by visiting the Azure Products by Region page. Learn more about what you need to consider before moving your workloads to the Azure China region with Microsoft’s Azure China checklist, and for questions, please reach out to the team through Azure Support.

Get started with Azure Databricks by attending a live event and this free, 3-part training series.

--

Try Databricks for free. Get started today.

The post Azure Databricks Now Generally Available in the Azure China Region appeared first on Databricks.

Databricks Is Named a Visionary in the 2020 Gartner Magic Quadrant for Cloud Database Management Systems (DBMS)


Last week, Gartner published the Magic Quadrant (MQ) for Cloud Database Management Systems, where Databricks was recognized as a Visionary in the market.1 This was the first time Databricks was included in a database-related Gartner Magic Quadrant. We believe this is due in large part to our investment in Delta Lake and its ability to enable data warehousing workloads on data lakes. Combined with our position as a Leader in the 2020 Magic Quadrant for Data Science and Machine Learning Platforms2 announced earlier this year, Databricks is one of only a few vendors to be included in both MQ reports and the only one to achieve it through a unified platform due to our focus on lakehouse architecture.

Gartner evaluated 16 vendors for their completeness of vision and ability to execute.
We are confident the following attributes contributed to our company’s success:

  • Our simple platform combines the best attributes of data lakes and data warehouses to enable lakehouse architecture
  • Our unique ability to unify all of your data types and data workloads across all industries
  • Our dedication to innovation, data portability, and customer success that’s rooted in technology

One Simple Platform for All Your Data

Databricks’ continued growth has been rooted in the pursuit of lakehouse architecture, which is enabled by a new system design that implements similar data structures and data management features to those in a data warehouse directly on the flexible, low-cost storage used for cloud data lakes. The architecture is what you would get if you had to redesign data warehouses in the modern world, now that cheap and highly reliable storage (in the form of object stores) is available.

By marrying the advantages of both legacy architectures, and leaving behind many of the drawbacks, customers can run both traditional analytics and data science / ML workloads on the same platform. This approach substantially reduces the complex data operations necessary to constantly move data between the data lake and downstream data warehouses. It also has the added benefit of eliminating the inherent data silos that get created so that data teams can operate off of one source of truth. All in all, organizations can increase their velocity and lower their costs by moving towards lakehouse architecture.

Unification of all data types and data workloads

Because of the architectural benefits of a lakehouse, structured, semi-structured, and unstructured data can now coexist as first-class citizens. This is important because the individual roles within data teams are becoming increasingly intertwined.

The biggest advantage of Databricks’ Unified Data Analytics Platform is its ability to run data processing and machine learning workloads at scale and all in one place. Most recently, we significantly extended our data management and analytics capabilities with the announcement of SQL Analytics at the Data+AI Summit Europe 2020. SQL Analytics provides Databricks customers with a first-class experience for performing BI and SQL workloads directly on the data lake. The service provides a dedicated SQL-native workspace, built-in connectors to let analysts query data lakes with the BI tools they already use, innovations in query performance that deliver fast results on larger and fresher data sets than analysts traditionally have access to, and new governance and administration capabilities. Altogether, we can deliver up to 9x better price/performance for analytics workloads than traditional cloud data warehouses.

Additionally with Databricks, data teams can build reliable data pipelines with Delta Lake, which adds reliability and performance to existing data lakes. Data scientists can explore data and build models in one place with collaborative notebooks, track and manage experiments and models across the lifecycle with MLflow, and benefit from built-in and optimized ML environments (including the most common ML frameworks).

Rooted in Open Source

Databricks is the founder of many successful projects, starting with the creation of Apache Spark, a unified analytics engine for large-scale data processing.

Since then, we’ve innovated with Delta Lake as the foundation for the vision of the lakehouse. Delta Lake has brought reliability, performance, governance, and quality to data lakes, which is necessary to enable analytics on the data lake. Thousands of organizations have since adopted Delta Lake to provide an open standard for how they store their data, eliminating the long-term challenges that come with proprietary data formats.

We also created MLflow, an open source machine learning platform to let teams reliably build and productionize ML applications. Since then, we have been humbled and excited by its adoption across the data science community. With more than 2.5 million monthly downloads, 200 contributors from 100 organizations, and 4x year-on-year growth, MLflow has become the most widely used open source ML platform, demonstrating the benefits of an open platform to manage ML development that works across diverse ML libraries, languages, and cloud and on-premise environments. Today, it forms the foundation of our machine learning workflow capabilities to help ensure that customers have access to the most open and flexible set of tools possible.

Overall, with Databricks, customers can make better, faster use of data to drive innovation with one simple, open platform for analytics, data science, and ML that brings together teams, processes and technologies.

GET STARTED WITH DATABRICKS!

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

1Gartner, “Magic Quadrant for Cloud Database Management Systems,” written by Donald Feinberg, Merv Adrian, Rick Greenwald, Adam Ronthal, Henry Cook, November 23, 2020.
2Gartner “Magic Quadrant for Data Science and Machine Learning,” written by Peter Krensky, Pieter den Hamer, Erick Brethenoux, Jim Hare, Carlie Idoine, Alexander Linden, Svetlana Sicular, Farhan Choudhary, February 11, 2020.

--

Try Databricks for free. Get started today.

The post Databricks Is Named a Visionary in the 2020 Gartner Magic Quadrant for Cloud Database Management Systems (DBMS) appeared first on Databricks.


Learn How Disney+ Built Their Streaming Data Analytics Platform With Databricks and AWS to Improve the Customer Experience


Martin Zapletal, Software Engineering Director at Disney+, is presenting at re:Invent 2020 with the session How Disney+ uses fast data ubiquity to improve the customer experience (must be registered to watch but registration is free!).

In this breakout session, Martin will showcase Disney+’s architecture using Databricks on AWS for processing and analyzing millions of real-time streaming events.

Abstract:

Disney+ uses Amazon Kinesis to drive real-time actions like providing title recommendations for customers, sending events across microservices, and delivering logs for operational analytics to improve the customer experience. In this session, you learn how Disney+ built real-time data-driven capabilities on a unified streaming platform. This platform ingests billions of events per hour in Amazon Kinesis Data Streams, processes and analyzes that data in Amazon Kinesis Data Analytics for Apache Flink, and uses Amazon Kinesis Data Firehose to deliver data to destinations without servers or code. Hear how these services helped Disney+ scale its viewing experience to tens of millions of customers with the required quality and reliability.

Disney+ on Databricks at AWS re:Invent 2020

You can also check out the Databricks Quality of Service blog/notebook based on a similar architecture if you want to see how to process streaming and batch data at scale for video/audio streaming services. This solution demonstrates how to process playback events and quickly identify, flag, and remediate audience experience issues.

See Databricks at AWS re:Invent 2020!

--

Try Databricks for free. Get started today.

The post Learn How Disney+ Built Their Streaming Data Analytics Platform With Databricks and AWS to Improve the Customer Experience appeared first on Databricks.

Python Autocomplete Improvements for Databricks Notebooks


At Databricks, we strive to provide a world-class development experience for data scientists and engineers, and new features are constantly being added to our notebooks to improve our users’ productivity.  We are especially excited about the latest of these features: a new autocomplete experience for Python notebooks (powered by the Jedi library) and new docstring code hints.  We are launching these features with Databricks Runtime 7.4 (DBR 7.4), so you can take advantage of this experience in Python notebooks that run on clusters with DBR 7.4 or later.

You activate the new autocomplete functionality by pressing the Tab key. Once you do so, the system examines the input at the cursor’s position to show you candidates for the completion of your code and those candidates’ type information based on your notebook’s current state. To get additional help on a completed name, press the Shift+Tab key to open a docstring code hint.

We are also launching a new version of the Koalas library (version 1.4.0) with support for these new autocomplete and docstring features, which comes pre-packaged with DBR 7.5.  The Koalas library is a drop-in replacement for the popular pandas Python library in data science; it uses Apache Spark’s big data processing capabilities on the backend while providing pandas’ familiar API interface to the user.

Python autocomplete using static code analysis from the Jedi library

Databricks notebooks run Python code using the IPython REPL, an interactive Python interpreter. The IPython 6.0 REPL introduced the Jedi library for code completion, which is the standard for Python autocomplete functionality in Jupyter notebooks. The Jedi library enables significant improvements over our prior autocomplete implementation by running static code analysis to make suggestions. With static code analysis, object names, their types, and function arguments can be resolved without running a cell (command).

Koalas DataFrame with a chained command shows autocomplete results with type and function arguments

Autocomplete results are available in the Koalas library

Python docstring functionality activated by the Shift+Tab key

In addition to the new autocomplete, DBR 7.4 includes docstring hints activated by the Shift+Tab keyboard shortcut. Docstring hints are read from documentation strings formatted according to PEP 257, which are inlined as part of the source code. The docstrings contain the same information as the help() function on a resolved object name. Objects are loaded into the Python REPL by running a notebook cell.
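
For illustration, here is a hypothetical function documented in the PEP 257 style; once its cell has been run, pressing Shift+Tab on the name displays the docstring below as a hint.

    def normalize(values):
        """Scale a list of numbers so that they sum to 1.

        The one-line summary above and these details are what the
        Shift+Tab docstring hint displays for the resolved name.
        """
        total = sum(values)
        return [value / total for value in values]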

Koalas library docstring shown for the Koalas DataFrames’ apply function

Example of a Koalas library docstring

Koalas: a drop-in replacement for the pandas library

Databricks ships the Koalas Python library as a drop-in replacement for the pandas library, a popular library in data science. Koalas takes advantage of PySpark’s DataFrame API for processing big data on Apache Spark while keeping the API compatible with pandas; see also Koalas: Easy Transition from pandas to Apache Spark and the Koalas documentation. Databricks released the new Koalas library version 1.4.0 with enhanced autocomplete and docstrings to improve your development and refactoring of code in Databricks notebooks.
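
As a small illustration of the drop-in nature of the API (with made-up data, not an example from the Koalas docs), a pandas-style workflow translates almost verbatim:

    import databricks.koalas as ks

    # Create a Koalas DataFrame just as you would a pandas one;
    # the work is executed on Apache Spark under the hood.
    kdf = ks.DataFrame({"product": ["a", "b", "a"], "sales": [10, 20, 5]})
    totals = kdf.groupby("product").sum()
    print(totals.to_pandas())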

Enhanced type annotations for Koalas

In Koalas 1.4.0,  we added return type annotations to major Koalas objects, including  DataFrame, Series, Index, etc. These return type annotations help autocomplete infer the actual data type for precise and reliable suggestions, which will help you use the Koalas library as you’re writing code.

With the full coverage of return type annotations, the Koalas library has better autocomplete support than the pandas library. Due to technical constraints in the pandas library, pandas doesn’t autocomplete in some cases, such as the example below.

Pandas library is unable to get autocomplete results after an index operator

 Unable to get autocomplete results in pandas

Koalas library is able to get autocomplete results after an index operator

Autocompletion results are available in Koalas

Koalas docstrings in the notebook

As part of Koalas 1.4.0, we have added a rich body of docstrings to the Koalas code so developers can quickly digest the Koalas APIs.  Since these APIs are designed and implemented to run in a distributed environment, there can be subtle differences between the Koalas APIs and the corresponding pandas APIs.  With the new docstring hints feature, you can easily inspect these differences by pressing the Shift+Tab key to access the docstring rather than reading the source code or searching the documentation.

Start using the improved autocomplete

To get the best experience with the new autocomplete and docstring features, attach your notebook to a DBR 7.4 cluster to enable them. Create a new cell at the top of your notebook to import all your libraries and execute that cell first. Once libraries are imported, autocomplete suggestions are available for the entire notebook. Then, press the Tab key for autocomplete or Shift+Tab for docstrings and function parameters as you write your code.

If you don’t plan on running a notebook cell (for example, to do scratch work), then it’s best to keep the import statements and code in the same cell.

To get the latest Koalas autocomplete and docstrings, install the Koalas library 1.4.0 on a DBR 7.4 cluster. The Koalas library is also packaged with the DBR 7.5 release.

Read more

--

Try Databricks for free. Get started today.

The post Python Autocomplete Improvements for Databricks Notebooks appeared first on Databricks.

Handling Late Arriving Dimensions Using a Reconciliation Pattern


This is a guest community post authored by Chaitanya Chandurkar, Senior Software Engineer in the Analytics and Reporting team at McGraw Hill Education. Special thanks to MHE Analytics team members Nick Afshartous, Principal Engineer; Kapil Shrivastava, Engineering Manager; and Steve Stalzer, VP of Engineering / Analytics and Data Science, for their contributions.

Processing facts and dimensions is the core of data engineering. Fact and dimension tables appear in what is commonly known as a star schema, whose purpose is to keep tables denormalized enough to write simpler and faster SQL queries. In a data warehouse, a dimension is an entity that represents an individual, non-overlapping data element, while facts are the behavioral data produced as a result of an action on or by a dimension. A fact table is surrounded by one or more dimension tables, as it holds references to the dimensions’ natural or surrogate keys.

Late-arriving transactions (facts) aren’t as troublesome as late-arriving dimensions. To ensure the accuracy of the data, dimensions are usually processed first, as they need to be looked up while processing the facts for enrichment or validation. In this blog post, we are going to look at a few use cases of late-arriving dimensions and potential solutions to handle them in Apache Spark pipelines.

Architecture and Constraints

The architecture and constraints of a typical analytics and reporting hybrid ETL structure.

Figure 1 – The Architecture and Constraints

To give a little bit of context: the Analytics & Reporting team at McGraw Hill provides data processing and reporting services that operate downstream from the application. This service has a hybrid ETL structure. Some facts and dimensions generated by our customer-facing applications are consumed in near real-time, transformed, and stored in Delta tables. A few dimensions that are not streamed yet are ingested by batch ETLs from different parts of the system. Some facts that are streamed have a reference to keys of the dimension table. In some cases, it’s possible that the fact being processed does not have a corresponding dimension entry yet because it’s waiting on the ETL’s next run or because of problems upstream.

Even if the other dimensions were streamed, it would still be possible for corresponding facts to arrive in close proximity. Think of an automated test generating synthetic data in a lower environment. Another example that is often seen is organizational migration. Private schools sometimes migrate under a district. This entity migration triggers a wave of slowly changing dimensions, and the facts streamed afterward should use the updated dimensions. In such cases, when attempting to join facts and dimensions, it’s possible that the join will fail due to late-arriving dimensions.

The margin of error here can be reduced by scheduling ETL jobs efficiently to ensure new dimensions are processed before processing the new facts. At McGraw Hill, this is not an option because significant delays do occasionally occur in our source systems.

Potential Solutions

There are a few solutions that can be incorporated depending on the use cases and constraints enforced by the infrastructure.

Process now and hydrate later

In this approach, even if the dimension lookup fails, facts are written into the fact tables with default values in the dimension columns, and a hydration process periodically fills in the missing dimension data in the fact table. You can filter out data with missing or default dimension keys while querying that table to ensure that you aren’t returning bad data. The caveat of this approach is that it does not work where those dimension keys are foreign keys in fact tables. Another limitation is that if facts are written to multiple destinations, the hydration process has to update the missing dimension columns in all those destinations for the sake of data consistency.

Early detection of late-arriving dimensions

We can detect the early-arriving facts instead (facts for which the corresponding dimension has not arrived yet), put them on hold, and retry them periodically until they are either processed or exhaust their retries. This ensures data quality and consistency in the target tables. At McGraw Hill, we have many such streaming pipelines that read facts from Kafka, look up multiple dimension tables, and write to multiple destinations. To handle such late-arriving dimensions, we built an internal framework that easily plugs into the streaming pipelines. The framework is built around a common pattern that all streaming pipelines use (a minimal sketch follows the figure below):

  1. Read data from Kafka.
  2. Transform the facts and join them to the dimension tables.
  3. If the dimension has not arrived yet, flag the fact record as `retryable`.
  4. Write these retryable records in a separate delta table called `streaming_pipeline_errors`.
  5. MERGE valid records into the target delta table.

Figure 2 – The Common Streaming Pipeline Pattern
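
A minimal PySpark sketch of steps 2-5 is shown below. The table names, the `user_id`/`user_key` columns, and the `event_id` merge key are hypothetical, and `spark` is the session ambient in a Databricks notebook; this is not the actual MHE framework code.

    from pyspark.sql import functions as F
    from delta.tables import DeltaTable

    def process_micro_batch(facts_df, batch_id):
        users = spark.read.table("users")                        # dimension table
        joined = facts_df.join(users, on="user_id", how="left")

        # Early-arriving facts (no matching dimension row yet) are retryable
        flagged = joined.withColumn("is_retryable", F.col("user_key").isNull())

        (flagged.filter("is_retryable")
            .write.format("delta").mode("append")
            .saveAsTable("streaming_pipeline_errors"))

        good = flagged.filter(~F.col("is_retryable")).drop("is_retryable")
        (DeltaTable.forName(spark, "target_fact_table").alias("t")
            .merge(good.alias("s"), "t.event_id = s.event_id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())

    # Attached to the fact stream with:
    # facts_stream.writeStream.foreachBatch(process_micro_batch).start()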

Records are flagged as “not_retryable” (is_retryable = false) if there is a schema validation failure (there is no point retrying such events). Now, how do we reprocess fact data from the `streaming_pipeline_errors` table, given a few limitations of the infrastructure:

  1. We could not put this data back on Kafka because such duplicate events are like noise to other consumers.
  2. We cannot have another instance of the job running purely in “reconciliation” mode (re-processing data only from error tables), as Delta does not support concurrent MERGE operations on the same Delta table.
  3. We could stop the regular job and run only the “reconciliation” version of it, but it could get complicated to orchestrate that with streaming jobs, as they run in continuous mode.

A mechanism was needed to process retryable data along with the new data without having to send it back to Kafka. Spark allows you to UNION two or more data frames as long as they have the same schema, so retryable records can be unioned with the new data from Kafka and processed all together. That’s how the reconciliation pattern was designed.
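
A sketch of the union step is shown below. It assumes the Kafka stream has already been parsed into the same schema as the fact columns of the reconciliation table; the table path, the retry limit, and `parsed_kafka_events` are placeholders.

    from pyspark.sql import functions as F

    retryable = (spark.readStream.format("delta")
        .load("/delta/reconciliation")
        .filter("status = 'NOT_RESOLVED' AND retry_count < 5")   # max-retry guard
        .drop("status", "retry_count")
        .withColumn("retry_event", F.lit(True)))

    new_events = parsed_kafka_events.withColumn("retry_event", F.lit(False))

    # UNION works as long as both DataFrames share the same schema
    unioned = new_events.unionByName(retryable)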

The Reconciliation Pattern

The reconciliation pattern uses a 2-step process to prepare the data to be reconciled.

  1. Write unjoined records to the streaming_pipeline_errors table.
  2. Put a process in place that consolidates multiple failed retries for the same event into a new single fact row with more metadata about the retries.

Using a scheduled batch process for Step 2 automatically controls the frequency of retries through its schedule.


Figure 3 – The Reconciliation Pattern
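
A hypothetical sketch of the Step-2 batch job is shown below, assuming an `event_id` key and that the error and reconciliation tables share a schema that includes the `retry_count` and `status` columns.

    from pyspark.sql import functions as F
    from delta.tables import DeltaTable

    new_errors = (spark.read.table("streaming_pipeline_errors")
        .filter("status = 'NOT_RESOLVED'")
        .dropDuplicates(["event_id"]))            # consolidate repeated failures

    (DeltaTable.forName(spark, "reconciliation").alias("r")
        .merge(new_errors.alias("e"), "r.event_id = e.event_id")
        .whenMatchedUpdate(set={"retry_count": F.expr("r.retry_count + 1"),
                                "status": F.lit("NOT_RESOLVED")})
        .whenNotMatchedInsertAll()
        .execute())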

This is how the flow looks:

  1. Read new events from Kafka.
  2. Read retryable events from the reconciliation table.
  3. UNION Kafka and retryable events.
  4. Transform/join this unioned data with the users dimension table.
  5. If there’s any early arriving fact (dimension has not arrived yet), mark it as retryable.
  6. Write retryable records to the `streaming_pipeline_errors` table.
  7. MERGE all the good records into the target delta table.
  8. A scheduled batch job dedupes new error records and MERGE into a reconciliation table with updated retry count and status.
  9. Data written to the reconciliation table is picked up by the streaming pipeline in the next trigger.

In the streaming pipeline, data read from the reconciliation table is flagged as `retry_event`. All failed retries are written back to the `streaming_pipeline_errors` table with status = ‘NOT_RESOLVED’. When the reconciliation job MERGEs this data into the reconciliation table, it increments the retry count for such failed retries. If, on a later retry, the data joins with the dimension table, we write it to the target table and also write an updated status to the `streaming_pipeline_errors` table with `status = RESOLVED`, indicating that this event was processed successfully so that it is not injected back into the stream in the next trigger.

Since the reconciliation table has the `retry_count`, each pipeline can control how many retries are allowed by filtering out reconciliation records that exceed the configured number of retries. If a certain event exhausts the max retry count, the reconciliation job updates its status to DEAD and the event is not retried anymore.

Example

Refer to this Databricks notebook for sample code. This is a miniature example of our logins pipeline that computes usage stats of our consumer-facing applications. Here, login events are processed in near real-time and joined with the users dimension table, which is updated by another ETL that is scheduled as a batch job.

Logins data from Kafka unioned with retryable data from reconciliation table:

Logins data from Kafka unioned with retryable data from an example reconciliation table.

Users data from delta table:

User data from a sample delta table

Let’s isolate a particular event #29. Looking at `streaming_pipeline_errors` logs, it can be seen that this event was retried two times before it was successfully joined with corresponding dimensions.

Sample streaming pipeline error log

When these error logs are consolidated to a single row in reconciliation table:

Sample error logs consolidated to a single row in a reconciliation table.

A spike in records processed here indicates that more and more retryable records were getting accumulated faster than they were getting resolved. It indicates a potential issue in a batch job that loads dimensions. Once retries are exhausted or all events are processed, this spike will reduce. The best way to optimize this is to partition the reconciliation table on the `status` column so that you are only reading unresolved records.

Example of a spike in processing records after reconciliation job is run.

Conclusion

This reconciliation pattern becomes easy to plug into existing pipelines once a tiny boilerplate framework is added on top of it. It works with both streaming and batch ETLs. In streaming mode, it relies on the checkpoint state to read new data from the reconciliation table, whereas in batch mode it has to rely on the latest timestamp to keep track of new data. The best way of doing this is to use streams with Trigger.Once.

This automated reconciliation saves a lot of manual effort. You can also add alerting and monitoring on top of it. The volume of retryable data can become a performance overhead as it grows over time. It’s better to periodically clean those tables to get rid of older unwanted error logs and keep their size minimal.

Try the Notebook

--

Try Databricks for free. Get started today.

The post Handling Late Arriving Dimensions Using a Reconciliation Pattern appeared first on Databricks.

Top Questions from Our Lakehouse Event


We recently held a virtual event, featuring CEO Ali Ghodsi, that showcased the vision of Lakehouse architecture and how Databricks helps customers make it a reality. Lakehouse is a data platform architecture that implements similar data structures and data management features to those in a data warehouse directly on the low-cost, flexible storage used for cloud data lakes. This new, simplified architecture allows traditional analytics, data science, and machine learning to co-exist on the same platform, removes data silos, and enables a single source of truth for organizations.

In the event, we shared our ideas around the Lakehouse and how to implement it, highlighted examples of customers who have transformed their data landscape, and demoed our new SQL Analytics service that completes the vision of the Lakehouse. Most exciting, though, was the incredible engagement we had from the audience. Today, we wanted to share the most popular audience questions from hundreds of interesting and valuable questions we received. For those who were unable to attend, feel free to take a look at the event on demand here.

Q&A from the virtual event

What is Delta Lake and what does it have to do with a Lakehouse?
Delta Lake is an open format storage layer that delivers reliability, security and performance on your data lake — for both streaming and batch operations. Delta Lake eliminates data silos by providing a single home for structured, semi-structured, and unstructured data to make analytics simple and accessible across the enterprise. Ultimately, Delta Lake is the foundation and enabler of a cost-effective, highly scalable Lakehouse architecture.

Why is it called Delta Lake?
There are two main reasons for the name Delta Lake. The first reason is that Delta Lake keeps track of the changes or “deltas” to your data. The second is that Delta Lake acts like a “delta” (where a river splits and spreads out into several branches before entering the sea) to filter data from your data lake.

Can I create a Delta Lake table on Databricks and query it with open-source Spark?
Yes. In order to do this, you would install open-source Spark and Delta Lake; both are open source. Delta Engine, which is only available on Databricks, makes Delta faster than the open-source engine, with full support. Read this blog for more information.
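
As a rough sketch (the package version and table path are placeholders), a Delta table written on Databricks can be read from open-source Spark 3.0 with the Delta Lake package configured:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("read-delta-with-oss-spark")
        .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate())

    df = spark.read.format("delta").load("/path/to/delta-table")
    df.show()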

What file format is used for Delta Lake?
The file format used for Delta Lake is called Delta: the table data is stored as Parquet files, and the transaction log is stored as JSON.

A Delta Lake table is a directory on a cloud object store or file system that holds data objects with the table contents and a log of transaction operations (with occasional checkpoints). Learn more here.

What is SQL Analytics?
SQL Analytics provides a new, dedicated workspace for data analysts that uses a familiar SQL-based environment to query Delta Lake tables on data lakes. Because SQL Analytics is a completely separate workspace, data analysts can work directly within the Databricks platform without the distraction of notebook-based data science tools (although we find data scientists really like working with the SQL editor too). However, since the data analysts and data scientists are both working from the same data source, the overall infrastructure is greatly simplified and a single source of truth is maintained.

SQL Analytics enables you to:

  • integrate with the BI tools, like Tableau and Microsoft Power BI, you use today to query your most complete and recent data in your data lake
  • complement existing BI tools with a SQL-native interface that allows data analysts and data scientists to query data lake data directly within Databricks
  • share query insights through rich visualizations and drag-and-drop dashboards with automatic alerting for important changes in your data
  • bring reliability, quality, scale, security, and performance to your data lake to support traditional analytics workloads using your most recent and complete data

Where can I learn more about SQL performance on Delta Lake?
To learn more about SQL Analytics, Delta, and the Lakehouse architecture (including performance), check out this two-part free training. In the training, we explore the evolution of data management and the Lakehouse. We explain how this model enables teams to work in a unified system that provides highly performant streaming, data science, machine learning and BI capabilities powered by a greatly simplified single source of truth.

In the hands-on portion of these sessions, you’ll learn how to use SQL Analytics, an integrated SQL editing and dashboarding tool. Explore how to easily query your data to build dashboards and share them across your organization. And find out how SQL Analytics enables granular visibility into how data is being used and accessed at any time across an entire Lakehouse infrastructure.

Is SQL Analytics available?
SQL Analytics is available in preview today. Existing customers can reach out to their account team to gain access. Additionally, you can request access via the SQL Analytics product page.

This is just a small sample of the amazing engagement we received from all of you during this event. Thank you for joining us and helping us move the Lakehouse architecture from vision to reality. If you haven’t had a chance to check out the event you can view it here.

WATCH THE EVENT!

--

Try Databricks for free. Get started today.

The post Top Questions from Our Lakehouse Event appeared first on Databricks.

A Step-by-step Guide for Debugging Memory Leaks in Spark Applications


This is a guest authored post by Shivansh Srivastava, software engineer, Disney Streaming Services. It was originally published on Medium.com

Just a bit of context

We at Disney Streaming Services use Apache Spark across the business and Spark Structured Streaming to develop our pipelines. These applications run on the Databricks Runtime (DBR) environment, which is quite user-friendly.

One of our Structured Streaming jobs uses flatMapGroupsWithState, where it accumulates state and performs grouping operations as per our business logic. This job kept crashing approximately every 3 days, sometimes even sooner, and the whole application then got restarted because of the retry functionality provided by the DBR environment. If this had been a normal batch job this would have been acceptable, but in our case we had a structured streaming job and a low-latency SLA to meet. This is the tale of our fight with the OutOfMemory exception (OOM) and how we tackled the whole thing.

Below is the 10-step approach we as a department took to solving the problem:

Step 1: Check Driver logs. What’s causing the problem?

If a problem occurs resulting in the failure of the job, then the driver logs (which can be directly found on the Spark UI) will describe why the last retry of the task failed.

If a task fails more than four (4) times (if spark.task.maxFailures = 4), then the reason for the last failure will be reported in the driver log, detailing why the whole job failed.

In our case, it showed that the executor died and got disassociated. Hence the next step was to find out why.

Step 2: Check Executor Logs. Why are they failing?

In our executor logs, generally accessible via SSH, we saw that the executors were failing with OOM errors.

We encountered two types of OOM errors:

  1. java.lang.OutOfMemoryError: GC Overhead limit exceeded
  2. java.lang.OutOfMemoryError: Java heap space.

Note: A Java heap space OOM can occur if the system doesn't have enough memory for the data it needs to process. In some cases, choosing a bigger instance type, such as i3.4xlarge (16 vCPUs, 122 GiB of memory), can solve the problem.

Another possible solution is to tune the job so that it only ingests what it can process: enough memory must be available to handle the amount of data processed in a single micro-batch.
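As an illustration of that kind of tuning, most streaming sources expose an option to cap how much data a single micro-batch pulls in. The sketch below uses a Kafka source with maxOffsetsPerTrigger (our job read from Kinesis, whose connector exposes its own rate-limiting options); the broker and topic names are placeholders.


    // Cap the number of records per micro-batch so each batch fits in executor memory.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
      .option("subscribe", "events")                      // placeholder topic
      .option("maxOffsetsPerTrigger", "200000")           // upper bound on records per micro-batch
      .load()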

Step 3: Check Garbage Collector Activity

We saw from our logs that the Garbage Collector (GC) was taking too much time, and sometimes it failed with a GC Overhead limit exceeded error while trying to perform a full garbage collection.
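To make GC activity visible in the executor logs in the first place, GC logging can be enabled through the executor JVM options. A sketch with Java 8-era flags (on Databricks these would normally go into the cluster's Spark config rather than a SparkConf built in code):


    import org.apache.spark.SparkConf

    // Print GC events, durations, and timestamps into the executor logs (Java 8 flags).
    // Note: extraJavaOptions is a single string, so combine these with any other
    // flags you need, such as the -XX:+UseG1GC option shown below.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps")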

According to Spark documentation, G1GC can solve problems in some cases where garbage collection is a bottleneck. We enabled G1GC using the following configuration:


    spark.executor.extraJavaOptions: -XX:+UseG1GC

Thankfully, this tweak improved a number of things:

  1. Periodic GC speed improved.
  2. Full GC was still too slow for our liking, but the cycle of full GC became less frequent.
  3. GC Overhead limit exceeded exceptions disappeared.

However, we still had the Java heap space OOM errors to solve. Our next step was to look at our cluster health to see if we could get any clues.

Step 4: Check your Cluster health

Databricks clusters provide support for Ganglia, a scalable distributed monitoring system for high-performance computing systems such as clusters and grids.

Our Ganglia graphs looked something like this:
Cluster memory screenshot from Ganglia

Worker memory screenshot from Ganglia

The graphs tell us that the cluster memory was stable for a while, started growing, kept on growing, and then fell off the edge. What does that mean?

  1. This was a stateful job, so maybe we were not clearing out the state over time.
  2. A memory leak could have occurred.

Step 5: Check your Streaming Metrics

Looking at our streaming metrics helped us eliminate candidate culprits for the cluster memory issue. Streaming metrics, emitted by Spark, provide information about every batch processed.

It looks something like this:

Note: These are not our real metrics. It’s just an example.

    
    {
        "id" : "abe526d3-1127-4805-83e6-9c477240e36b",
        "runId" : "d4fec928-4703-4d74-bb9d-233fb9d45208",
        "name" : "display_query_114",
        "timestamp" : "2020-04-23T09:28:18.886Z",
        "batchId" : 36,
        "numInputRows" : 561682,
        "inputRowsPerSecond" : 25167.219284882158,
        "processedRowsPerSecond" : 19806.12856588737,
        "durationMs" : {
            "addBatch" : 26638,
            "getBatch" : 173,
            "getOffset" : 196,
            "queryPlanning" : 400,
            "triggerExecution" : 28359,
            "walCommit" : 247
        },
        "eventTime" : {
            "avg" : "2020-04-23T08:33:03.664Z",
            "max" : "2020-04-23T08:34:58.911Z",
            "min" : "2020-04-23T08:00:34.814Z",
            "watermark" : "2020-04-23T08:33:42.664Z"
        },
        "stateOperators" : [ {
            "numRowsTotal" : 1079,
            "numRowsUpdated" : 894,
            "memoryUsedBytes" : 485575,
            "customMetrics" : {
                "loadedMapCacheHitCount" : 14400,
                "loadedMapCacheMissCount" : 0,
                "stateOnCurrentVersionSizeBytes" : 284151
            }
        } ]
    }

Plotting stateOperators.numRowsTotal against event time showed that it was stable over time, which eliminated the possibility that the OOM was caused by state being retained.
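One way to collect these numbers for plotting is a StreamingQueryListener registered on the SparkSession. The sketch below simply prints the state-store metrics for each batch; the println stands in for whatever metrics sink you actually use.


    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    // Log state-store size for every micro-batch so it can be plotted over time.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = {}
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        p.stateOperators.foreach { op =>
          println(s"batch=${p.batchId} numRowsTotal=${op.numRowsTotal} " +
                  s"stateMemoryBytes=${op.memoryUsedBytes}")
        }
      }
    })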

The conclusion: a memory leak had occurred, and we needed to find it. To do so, we enabled heap dumps to see what was occupying so much memory.

Step 6: Enable HeapDumpOnOutOfMemory

To get a heap dump on OOM, the following option can be enabled in the Spark Cluster configuration on the executor side:


    spark.executor.extraJavaOptions: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dbfs/heapDumps

Additionally, a path can be provided where the heap dumps will be saved. We use this configuration because the path is accessible from the Databricks platform. You can also access these files by SSH-ing into the workers and downloading them with tools like rsync.

Step 7: Take Periodic Heap dumps

Taking periodic heap dumps allows multiple dumps to be compared against the OOM heap dumps. We took a heap dump every 12 hours from the same executor, so that once an executor went into OOM we would have at least two dumps available. In our case, executors took at least 24 hours to go into OOM.

Steps to take a periodic heap dump:

  1. SSH into the worker.
  2. Get the PID of the Java process using top.
  3. Take the heap dump: jmap -dump:format=b,file=pbs_worker.hprof <pid>
  4. Set the correct permissions on the heap dump file:
    sudo chmod 444 pbs_worker.hprof
  5. Download the file to your local machine:
    rsync -chavzP --stats
    ubuntu@<worker_ip_address>:/home/ubuntu/pbs_worker.hprof .

Step 8: Analyze Heap Dumps

Heap dump analysis can be performed with tools like YourKit or Eclipse MAT.

In our case, the heap dumps were large, in the range of 40 GB or more, which made them difficult to analyze. There is a workaround that indexes the large files so they can then be analyzed.

Step 9: Find where it is leaking memory by looking at Object Explorer

YourKit provides inspection of hprof files. If the problem is obvious, it will be shown in the inspection section. In our case, the problem was not obvious.

Looking at our heap histogram, we saw many HashMap$Node instances, but based on our business logic we didn't deem that information too concerning at first.

HeapHistogram screenshot from Spark UI

When we looked at the classes and packages section in YourKit, we found the same results, as we had expected.

Heap dump analysis screenshot from YourKit

What took us by surprise was HashMap$Node[16384] growing across the periodic heap dump files. Looking inside HashMap$Node[16384] revealed that these HashMaps were related not to our business logic but to the AWS SDK.

Screenshot from YourKit

A quick Google search and some code analysis gave us our answer: we were not closing the connection correctly. The same issue has also been discussed in the aws-sdk GitHub issues.

Step 10: Fix the memory leak

By analyzing the heap dump, we were able to pinpoint the location of the problem. When connecting to Kinesis, we created a new Kinesis client for every partition when the connection was opened (the general idea was adapted from Databricks' Kinesis documentation):


    class KinesisSink extends ForeachWriter[SinkInput] {
        private var kinesisClient: KinesisClient = _

        override def open(partitionId: Long, version: Long): Boolean = {
            val httpClient = ApacheHttpClient
                .builder()
                .build()
            kinesisClient = KinesisClient
                .builder()
                .region(Region.of(region))
                .httpClient(httpClient)
                .build()
            true
        }

        override def process(value: SinkInput): Unit = {
            // process stuff
        }

        override def close(errorOrNull: Throwable): Unit = {
            kinesisClient.close()
        }
    }

But when closing the connection, we were closing only the KinesisClient:


override def close(errorOrNull: Throwable): Unit = {
    kinesisClient.close()
}

The Apache HTTP client was not being closed. This resulted in an ever-increasing number of HTTP clients being created and TCP connections being opened on the system, causing the issue discussed here. The aws-sdk documentation states:


    This provider creates a thread in the background to periodically update credentials. If this provider is no longer needed, …

We were able to prove this out with the following script:


import $ivy.`software.amazon.awssdk:apache-client:2.13.37`

// causes OOM
(1 to 1e6.toInt).foreach { _ =>
    software.amazon.awssdk.http.apache.ApacheHttpClient.builder.build()
}

// doesn't cause OOM
(1 to 1e6.toInt).foreach { _ =>
    software.amazon.awssdk.http.apache.ApacheHttpClient.builder.build().close()
}

So the fix was to close the HTTP client as well, which means the sink has to keep a reference to it:


override def close(errorOrNull: Throwable): Unit = {
    kinesisClient.close()
    httpClient.close()
}
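Putting it together, a sketch of the corrected sink keeps the HTTP client as a field so that close() can release both clients. Type names follow the snippets above; SinkInput and the region value are stand-ins for the real record type and configuration.


    import org.apache.spark.sql.ForeachWriter
    import software.amazon.awssdk.http.SdkHttpClient
    import software.amazon.awssdk.http.apache.ApacheHttpClient
    import software.amazon.awssdk.regions.Region
    import software.amazon.awssdk.services.kinesis.KinesisClient

    class KinesisSink(region: String) extends ForeachWriter[SinkInput] {
        // Both clients are fields so they can be cleaned up in close().
        private var httpClient: SdkHttpClient = _
        private var kinesisClient: KinesisClient = _

        override def open(partitionId: Long, version: Long): Boolean = {
            httpClient = ApacheHttpClient.builder().build()
            kinesisClient = KinesisClient
                .builder()
                .region(Region.of(region))
                .httpClient(httpClient)
                .build()
            true
        }

        override def process(value: SinkInput): Unit = {
            // write the record to Kinesis here
        }

        override def close(errorOrNull: Throwable): Unit = {
            kinesisClient.close()
            httpClient.close()   // the missing call that leaked HTTP clients and TCP connections
        }
    }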

Conclusion

What we’ve seen in this post is an example of how to diagnose a memory leak happening in a Spark application. If I faced this issue again, I would attach a JVM profiler to the executor and try to debug it from there.

From this investigation, we got a better understanding of how Spark structured streaming is working internally, and how we can tune it to our advantage. Some lessons learned that are worth remembering:

  1. Memory leaks can happen, but there are a number of things you can do to investigate them.
  2. We need better tooling to read large hprof files.
  3. If you open a connection, when you are done, always close it.

--

Try Databricks for free. Get started today.

The post A Step-by-step Guide for Debugging Memory Leaks in Spark Applications appeared first on Databricks.
