
100 Years of Horror Films: An Analysis Using Databricks SQL


When it comes to the history of film, perhaps no genre says more about us as humans than horror, which taps into our biggest phobias and uncertainties about the world. With such a huge range – from gruesome to symbolic to comedically terrible – we thought it’d be interesting to analyze IMDb data on horror films from each decade and see what insights we’d discover. More specifically, we wanted to know things like: How has the popularity of certain subgenres shifted over time? How have the most popular horror films influenced the genre as a whole?

This blog post will walk through how we did just that using Databricks SQL and data from IMDb, the world’s most popular and authoritative source for movie, TV and celebrity content data. We thought this would be a fun way (especially with Halloween around the corner) to show just how easy it is to use Databricks SQL to immediately start querying data and creating visuals to draw quick insight.

Why Databricks SQL?

Databricks SQL is a service that allows users to easily perform BI and SQL directly on their data lake for reliable, lightning-fast analytics. Normally on a data warehouse, this would require data teams to integrate a BI tool and then spend hours setting up the data pipelines and processing the data via ETL. With Databricks SQL, because we’re able to query directly from a lakehouse, once we downloaded our data from IMDb (see below), we were able to start querying almost immediately and creating visuals within 30 minutes – all within a single platform.

For our analysis, we used a data set that included over 30,000 horror films from IMDb; we chose this sample data set since it’s easily accessible and available to developers. IMDb is an ideal source for any film analysis, as it includes hundreds of millions of searchable data items – including over 8 million movie, TV and entertainment titles. IMDb also leverages AWS Data Exchange, which makes it easy to find, subscribe to and use third-party data in the cloud, to provide essential metadata for every movie, TV and OTT series, and video game title in its catalog (scroll to the end of this blog for more info on IMDb as a data source).

Horror trope trends by decade

The first question we wanted to answer: when looking at the films by decade, are there any observable trends in specific tropes (e.g., particular monsters or themes)? To do this, we calculated the term frequency of every word that appeared in every title, and used those frequencies as a foundation to identify commonly used “horror terms” and group them together. We identified the main tropes as these (a small sketch of the term-frequency step follows the list):

  • Vampire
  • Ghost
  • Halloween
  • Children’s Toys
  • Possession
  • Zombie
  • Witch
  • Monster
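
To give a flavor of the term-frequency step, here is a minimal sketch rather than our exact query; the table name horror_films and the title column primaryTitle are assumptions made for illustration:

```python
# Hypothetical table/column names: horror_films(primaryTitle, startYear)
term_freq = spark.sql("""
  SELECT lower(word) AS term, COUNT(*) AS frequency
  FROM (
    SELECT explode(split(primaryTitle, ' ')) AS word
    FROM horror_films
  ) words
  GROUP BY lower(word)
  ORDER BY frequency DESC
""")
display(term_freq)
```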

A simple word cloud gives us a high-level overview of the canon – apparently ghost films have always been a popular choice for filmmakers!

A textual analysis of horror film titles from the last 100 years reaffirms the enduring popularity of the ghost, vampire, possession and zombie tropes within the horror movie genre.

Powered by IMDb

Let’s look at this more granularly. Our approach was simple: we took the tropes listed above and created an ontology to classify which movies are associated with each trope. For example, to identify movies in the ghost category, we included variations of ghost, poltergeist, spirit, phantom and haunting, all of which were easy to pick out of the term-frequency list. Here’s what the ghost trope’s final set looked like:

Ghost
ghost
GHOST
Ghost-Cat
Ghost,
Ghost:
Ghost’s
Ghostbusters
Ghostbusters:
Ghosted
Ghostface
Ghosthunters
Ghosting
Ghostly
Ghostman
Ghosts
Poltergeist
Phantom
Phantoms
Spirit
spirit
Spirited
Spirits
Souls
Soul’s
Soul
soul
Haunted
HAUNTED
haunted
Haunted:
Haunter
Haunting
Hauntings
Haunts

Since we wanted to see how these different themes trended over time, we used the ontology to classify which movies belong to which trope. We then calculated and visualized the distribution of movies belonging to each trope by decade. The results were pretty interesting!
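
For illustration, here is a hedged sketch of what that classification and decade roll-up could look like in Databricks SQL; the table name horror_films, the columns primaryTitle and startYear, and the abbreviated keyword lists are assumptions, not our full ontology:

```python
trope_by_decade = spark.sql("""
  SELECT
    FLOOR(startYear / 10) * 10 AS decade,
    CASE
      WHEN lower(primaryTitle) RLIKE 'ghost|poltergeist|phantom|spirit|haunt' THEN 'Ghost'
      WHEN lower(primaryTitle) RLIKE 'vampire|dracula'                        THEN 'Vampire'
      WHEN lower(primaryTitle) RLIKE 'zombie|undead'                          THEN 'Zombie'
      WHEN lower(primaryTitle) RLIKE 'witch'                                  THEN 'Witch'
      ELSE 'Other'
    END AS trope,
    COUNT(*) AS films
  FROM horror_films
  GROUP BY 1, 2
  ORDER BY decade, trope
""")
display(trope_by_decade)  # visualized as a stacked bar chart by decade
```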

Hundred-year popularity analysis of common themes within the horror movie genre.

Powered by IMDb

Our insights

As you can see, the early 20th century was pretty limited in terms of tropes and also contained the largest share of vampire films in our data set. Interestingly, Dracula, probably the most famous vampire work, was published in 1897, so there is a potential connection between that novel and the early prevalence of vampire films.

Another interesting point is the spike of possession films starting in the 70s. Again, this makes sense when looking at the horror film canon, as The Exorcist, arguably one of the most influential horror films ever made, premiered in 1973.

And finally, our data set shows a huge spike in monster films, which quickly tapered off in the second half of the 20th century. This aligns with the canon timeline, as popular and influential monster films such as Godzilla (1954) and Creature from the Black Lagoon (1954) premiered in the 1950s, following the earlier King Kong (1933); it would be interesting to do a deeper analysis to see why this eventually trended downwards.

Zombie films gained momentum in the ’80s following Dawn of the Dead (1978), a huge commercial success, and the subgenre made an even bigger comeback in the early 2000s, when the heavy-hitter zombie movies hit the scene: 28 Days Later (2002), Resident Evil (2002) and the first “rom zom com,” Shaun of the Dead (2004). This “copycat” effect is definitely worth exploring further, and in a deeper analysis, we would like to look at the revenue and profitability of all these films.

Conclusion

While this blog post is meant to showcase the power of data analytics through a fun use case (and it gave us a good excuse to geek out on movies), more than that, it shows how simple it is to take a relatively large metadata set and start generating fast insights with SQL and visualizations. Often media companies are sitting on all sorts of data but aren’t sure how to derive value from it. We wanted to demonstrate how an analyst familiar with SQL but not more complex data science languages can start exploring these data sets to create interesting audience experiences. At Databricks, we’re all about making things simple for data practitioners of all titles and levels.

To dive into more entertainment use cases, check out our Media & Entertainment Solution Accelerators.


More about IMDb

With hundreds of millions of searchable data items, including over 8 million movie, TV and entertainment titles, more than 11 million cast and crew members and over 12 million images, IMDb is the world’s most popular and authoritative source for movie, TV and celebrity content, and has a combined web and mobile audience of more than 200 million monthly visitors.

IMDb enhances the entertainment experience by empowering fans and professionals around the world with cast and crew listings for every movie, TV series and video game, lifetime box office grosses from Box Office Mojo, proprietary film and TV user ratings from IMDb’s global audience of over 200 million fans, and much more.

IMDb licenses information from its vast and authoritative database to third-party businesses, including film studios, television networks, streaming services and cable companies, as well as airlines, electronics manufacturers, non-profit organizations and software developers. These businesses rely on the IMDb database to improve their own customers’ experience, power investment decisions, shape sentiment analysis, inform content acquisition strategies, and much more. Learn more at developer.imdb.com.



Moneyball 2.0: Real-time Decision Making With MLB’s Statcast Data


In 2002, the Oakland Athletics used data analysis and quantitative modeling to identify undervalued players and build a competitive lineup on a limited budget. The book Moneyball, written by Michael Lewis, highlighted the A’s ‘02 season and gave an inside glimpse into how unique the team’s strategic data modeling was for its time. Fast forward 20 years: the use of data science and quantitative modeling is now common practice among all sports franchises and plays a critical role in scouting, roster construction, game-day operations and season planning.

In 2015, Major League Baseball (MLB) introduced Statcast, a set of cameras and radar systems installed in all 30 MLB stadiums. Statcast generates up to seven terabytes of data during a game, capturing every imaginable data point and metric related to pitching, hitting, running and fielding, which the system collects and organizes for consumption. This explosion of data has created opportunities to analyze the game in real time, and with the application of machine learning, teams are now able to make decisions that influence the outcome of the game, pitch by pitch. It’s been 20 seasons since the A’s first brought data modeling to baseball. Here’s an inside look at how professional baseball teams use technologies like Databricks to create the modern-day Moneyball and gain the competitive advantages that data teams provide to coaches and players on the field.

Figure 1: Position and scope of Hawkeye cameras at a baseball stadium

Figure 2: Numbers represent events during a play captured by Statcast

Figure 3: Sample of data collected by Statcast

Background

Data teams need to be faster than ever to provide analytics to coaches and players so they can make decisions as the game unfolds. The decisions made from real-time analytics can dramatically change the outcome of a game and a team’s season. One of the more memorable examples was Game 6 of the 2020 World Series. The Tampa Bay Rays were leading the Los Angeles Dodgers 1-0 in the sixth inning when Rays pitcher Blake Snell was pulled from the mound while pitching arguably one of the best games of his career, a decision manager Kevin Cash said was made with insights from the team’s data analytics. The Rays went on to lose the game and the World Series. Hindsight is always 20/20, but it goes to show how impactful data has become to the game. Coaching staffs task their data teams with assisting them in making critical decisions, for example: should a pitcher throw another inning, or should a substitution be made to avoid a potential injury? Does a player have a greater probability of success stealing from first to second base, or from second to third?

I have had the opportunity to work with many MLB franchises and discuss their priorities and challenges related to data analytics. Typically, I hear three recurring themes that their data teams focus on, and that have the most value in setting their team up for success on the field:

  1. Speed: Since every MLB team has access to the Statcast data during a game, one way to create a competitive advantage is to ingest and process the data faster than your opponent. The average time between pitches is 23 seconds, and that window is the benchmark within which Statcast data must be ingested and processed for coaches to make decisions that can impact the outcome of the game.
  2. Real-Time Analytics: Another competitive advantage for teams is generating insights from their machine learning models in real time. An example is knowing when to substitute a fatigued pitcher, where a model interprets pitcher movement and data points created from the pitch itself and is able to forecast deterioration of performance pitch by pitch.
  3. Ease of Use: Analytics teams run into problems ingesting the volumes of data Statcast produces when running data pipelines on their local computers. This gets even more complicated when trying to scale their pipelines to capture minor league data and integrate with other technologies. Teams want a collaborative, scalable analytics platform that automates high-performance data ingestion, creating the ability to impact in-game decision-making.

Baseball teams using Databricks have developed solutions for these priorities and several others. They have shaped what the modern-day version of Moneyball looks like. What follows is their successful framework explained in an easy-to-understand way.

Getting the data

When a pitcher throws a baseball, Hawkeye cameras collect the data and save it to an application that teams can access using an application programming interface (API) owned by MLB. You can think of an API as an intermediate connection between two computers for exchanging information. The way this works is: a user sends a request to an API, the API confirms that the user has permission to access the data, and it then sends back the requested data for the user to consume. To use a restaurant analogy: a customer tells a waiter what they want to eat, the waiter informs the kitchen, and the waiter serves the food to the customer. The waiter in this scenario is the API.

Figure 4: Example of how an API works using a restaurant analogy.

This simple method of retrieving data is called a “batch” style of data collection and processing, where data is gathered and processed once. As noted earlier, however, new data is typically available through the API every 23 seconds (the average time between pitches). This means data teams need to make continuous requests to the API in a method known as “streaming,” where data is continuously collected and processed. Just as a waiter can quickly become overworked fulfilling customers’ needs, making continuous API requests for data creates some challenges in data pipelines. With assistance from these data teams, however, we have created code to accommodate continuously collecting Statcast data during a game. You can see an example of the code using a test API below.

Figure 5: Interacting with an API to retrieve and save data.

from pathlib import Path
import json
import time

import requests

class sports_api:
    def __init__(self, endpoint, api_key):
        self.endpoint = endpoint
        self.api_key = api_key
        self.connection = self.endpoint + self.api_key

    def fetch_payload(self, request_1, request_2, adls_path):
        # Build the request URL and fetch the JSON payload from the API
        url = f"{self.connection}&series_id={request_1}{request_2}-99.M"
        r = requests.get(url)
        json_data = r.json()

        # Save the payload to cloud storage with a timestamped file name
        now = time.strftime("%Y%m%d-%H%M%S")
        file_name = f"json_data_out_{now}"
        file_path = Path("dbfs:/") / Path(adls_path) / Path(file_name)
        dbutils.fs.put(str(file_path), json.dumps(json_data), True)
        return str(file_path)

This code decouples getting the data from the API from transforming it into usable information; in the past, we have seen coupling those two steps cause latency in data pipelines. Using this code, the Statcast data is saved as a file to cloud storage automatically and efficiently. The next step is to ingest it for processing.

Automatically load data with Autoloader

As pitch and play data is continuously saved to cloud storage, it can be ingested automatically using a Databricks feature called Autoloader. Autoloader scans files in the cloud storage location where they are saved and loads the data into Databricks, where data teams begin to transform it for their analytics. Autoloader is easy to use and reliable, and it scales to larger volumes of data equally well in batch and streaming scenarios. The Python code below shows how to use Autoloader for streaming data.

Figure 6: Set up of Autoloader to stream data

# The format, schema, paths and trigger below are placeholders to fill in for your environment
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .schema(input_schema) \
    .load(input_path)

df.writeStream.format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(once=True) \
    .start(output_path)

One challenge in this process is working with the file format in which the Statcast data is saved, a format called JSON. We are typically privileged to work with data that is already in a structured format, such as the CSV file type, where data is organized in columns and rows. The JSON format organizes data into arrays and, despite its wide use and adoption, I still find it difficult to work with, especially at large sizes. Here’s a comparison of data saved in a CSV format and a JSON format.

Figure 7: Comparison of CSV and JSON formats

It should be obvious which of these two formats data teams prefer to work with. The goal then is to load Statcast data in the JSON format and transform it into the friendlier CSV format. To do this, we can use the semi-structured data support available in Databricks, where basic syntax allows us to extract and transform the nested data you see in the JSON format to the structured CSV style format. Combining the functionality of Autoloader and the simplicity of semi-structured data support creates a powerful data ingestion method that makes the transformation of JSON data easy.

Using Databricks’ semi-structured data support with Autoloader

# schema_location and source_path are placeholders for your schema-tracking path and source directory
spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", schema_location) \
    .load(source_path) \
    .selectExpr(
        "*",
        "tags:page.name",    # extracts {"tags":{"page":{"name":...}}}
        "tags:page.id::int", # extracts {"tags":{"page":{"id":...}}} and casts to int
        "tags:eventType"     # extracts {"tags":{"eventType":...}}
    )

As the data is loaded in, we save it to a Delta table to start working with it further. Delta Lake is an open format storage layer that brings reliability, security and performance to a data lake for both streaming and batch processing, and it is the foundation of a cost-effective, highly scalable data platform. Semi-structured support with Delta allows you to retain some of the nested data if needed: the syntax is flexible enough to keep nested data objects as a column within a Delta table without the need to flatten out all of the JSON data. Baseball analytics teams use Delta to version Statcast data and enforce the specific requirements of their analytics while organizing it in a friendly, structured format.

Autoloader writing data to a Delta table as a stream

# Define the schema and the input, checkpoint, and output paths.
read_schema = ("id int, " +
"firstName string, " +
"middleName string, " +
"lastName string, " +
"gender string, " +
"birthDate timestamp, " +
"ssn string, " +
"salary int")
json_read_path = '/FileStore/streaming-uploads/people-10m'
checkpoint_path = '/mnt/delta/people-10m/checkpoints'
save_path = '/mnt/delta/people-10m'

people_stream = (spark \
.readStream \
.schema(read_schema) \
.option('maxFilesPerTrigger', 1) \
.option('multiline', True) \
.json(json_read_path))

people_stream.writeStream \
.format('delta') \
.outputMode('append') \
.option('checkpointLocation', checkpoint_path) \
.start(save_path)

With Autoloader continuously streaming in data after each pitch, semi-structured data support transforming it into a consumable format, and Delta Lake organizing it for use, data teams are now ready to build analytics that give their team a competitive edge on the field.

Machine learning for insights

Recall the Rays pulling Blake Snell from the mound during the World Series: that decision came from insights coaches saw in their predictive models. Statistical analysis of Snell’s historical Statcast data, provided by Billy Heylen of sportingnews.com, indicated that Snell had not pitched more than six innings since July 2019, that he had a lower probability of striking out a batter when facing them for the third time in a game, and that he was being relieved by teammate Nick Anderson, whose own pitch data suggested he was one of the strongest closers in MLB, with a 0.55 earned run average (ERA) and 0.49 walks and hits per inning pitched (WHIP) over the 19 regular-season games he pitched in 2020. Predictive models analyze data like this in real time and provide supporting evidence and recommendations coaches use to make critical decisions.

Machine learning models are relatively easy to build and use, but data teams often struggle to implement them into streaming use cases. Add in the complexity of how models are managed and stored and machine learning can quickly become out of reach. Fortunately, data teams use MLflow to manage their machine learning models and implement them into their data pipelines. MLflow is an open-source platform for managing the end-to-end machine learning lifecycle and includes support for tracking predictive results, a model registry for centralizing models that are in use and others in development, and a serving capability for using models in data pipelines.

Figure 10: MLflow overview

To apply machine learning algorithms and models to real-time use cases, data teams use the model registry: a registered model can be loaded to read data sitting in a Delta table and create predictions that are then used during the game. Here’s an example of how to use a machine learning model while data is automatically loaded with Autoloader:

Getting a machine learning model from the registry and using it with Autoloader

import mlflow

# Get the model from the model registry
model = mlflow.spark.load_model(
    model_uri=f"models:/{model_name}/Production")

# Read data from the bronze table as a stream
# (a Delta table supplies its own schema, so none is specified here)
events = spark.readStream \
    .format("delta") \
    .table("baseball_stream_bronze")

# Pass the stream through the model
model_output = model.transform(events)

# Write the scored stream to the silver Delta table
model_output.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/tmp/baseball/") \
    .toTable("default.baseball_stream_silver")

The outputs a machine learning model creates can then be displayed in a data visualization or dashboard and used as printouts or shared on a tablet during a game. MLB franchises working on Databricks are developing fascinating use cases that are being used during games throughout the season. Predictive models are proprietary to the individual teams, but here’s an actual use case running on Databricks that demonstrates the power of real-time analytics in baseball.

Bringing it all together with spin ratios and sticky stuff

The MLB introduced a new rule for the 2021 season meant to discourage pitchers’ use of “sticky stuff,” a substance hidden in mitts, belts or hats that, when applied to a baseball, can dramatically increase the spin ratio of a pitch, making it difficult for batters to hit. Under the rule, pitchers discovered using sticky stuff are suspended for 10 games. Coaches on opposing teams can request that an umpire check for the substance if they suspect a pitcher is using it during a game. Spin ratio is a data point captured by the Hawkeye cameras, and with real-time analytics and machine learning, teams are now able to make justified requests to umpires in the hope of catching a pitcher using the material.

Figure 12: Illustration of how spin affects a pitch

Figure 13: Trending spin rate of fastballs per season and after rule introduction on June 3, 2021

Following the same framework outlined above, we ingest Statcast data pitch by pitch and maintain a dashboard that tracks the spin ratio of the ball for every pitcher during every MLB game. Machine learning models send predictions to the dashboard that flag outliers against historical data and the pitcher’s performance in the active game, alerting coaches when values fall outside the ranges anticipated by the model. With Autoloader, Delta Lake and MLflow, all data ingestion and analytics happen in real time.
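
As a rough illustration of what such outlier flagging could look like (a minimal sketch, not the proprietary model; the table and column names such as baseball_stream_silver, pitcher_id and spin_rate are hypothetical), a simple z-score check against each pitcher's historical averages can flag pitches for review:

```python
from pyspark.sql import functions as F

# Hypothetical historical per-pitcher baselines (mean and stddev of spin rate)
baselines = (spark.table("default.baseball_stream_silver")
    .groupBy("pitcher_id")
    .agg(F.avg("spin_rate").alias("avg_spin"),
         F.stddev("spin_rate").alias("std_spin")))

# Stream of newly scored pitches, joined to the static baselines
flagged = (spark.readStream.format("delta").table("default.baseball_stream_silver")
    .join(baselines, "pitcher_id")
    .withColumn("z_score", (F.col("spin_rate") - F.col("avg_spin")) / F.col("std_spin"))
    .withColumn("outlier", F.col("z_score") > F.lit(3)))  # flag pitches 3+ std devs above normal

# Write the flags to a table backing the dashboard
(flagged.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/baseball/spin_flags/")
    .toTable("default.spin_rate_flags"))
```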

Figure 14: Dashboard for MLB in-game “sticky stuff” detection in real time

Technologies like Statcast and Databricks have brought real-time analytics to sports and changed the paradigm of what it means to be a data-driven team. As data volumes continue to grow, having the right architecture in place to capture real-time insights will be critical to staying one step ahead of the competition. Real-time architectures will be increasingly important as teams acquire and develop players, plan for the season and develop an analytically enhanced approach to their franchise. Ask about our Solution Accelerator with Databricks partner Lovelytics, which provides sports teams with all the resources they need to quickly create use cases like the ones described in this blog.


Now Generally Available: Simplify Data and Machine Learning Pipelines With Jobs Orchestration


We are excited to announce the general availability of Jobs orchestration, a new capability that lets Databricks customers easily build data and machine learning pipelines consisting of multiple, dependent tasks.

Today, data pipelines are frequently defined as a sequence of dependent tasks to simplify some of their complexity. But they still demand heavy lifting from data teams and specialized tools to develop, manage, monitor and reliably run. These tools are typically separate from the actual data or machine learning tasks, and this lack of integration leads to fragmented efforts across the enterprise and frequent context switching for users.

With today’s launch, orchestrating pipelines has become substantially easier. Orchestrating multi-step Jobs makes it simple to define data and ML pipelines using interdependent, modular tasks consisting of notebooks, Python scripts and JARs. Data engineers can easily create and manage multi-step pipelines that transform and refine data, and train machine learning algorithms, all within the familiar workspace of Databricks, saving teams immense time and effort.

Example workflow for Databricks ‘Jobs orchestration,’ demonstrating how Databricks simplifies the creation and management of multi-step data and ML pipelines.

In the example above, a Job consisting of multiple tasks uses two tasks to ingest data: Clicks_Ingest and Orders_Ingest. This ingested data is then aggregated and filtered in the “Match” task, from which new machine learning features are generated (Build_Features), persisted (Persist_Features), and used to train new models (Train).
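
For a sense of what such a job definition looks like when submitted through the Jobs API 2.1, here is a hedged sketch; the workspace URL, token, notebook paths and cluster ID are placeholders, and only three of the tasks from the example are shown:

```python
import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder token

job_spec = {
    "name": "clicks_orders_pipeline",
    "tasks": [
        {"task_key": "Clicks_Ingest",
         "notebook_task": {"notebook_path": "/Pipelines/clicks_ingest"},
         "existing_cluster_id": "<cluster-id>"},
        {"task_key": "Orders_Ingest",
         "notebook_task": {"notebook_path": "/Pipelines/orders_ingest"},
         "existing_cluster_id": "<cluster-id>"},
        {"task_key": "Match",
         "depends_on": [{"task_key": "Clicks_Ingest"}, {"task_key": "Orders_Ingest"}],
         "notebook_task": {"notebook_path": "/Pipelines/match"},
         "existing_cluster_id": "<cluster-id>"},
    ],
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec)
print(resp.json())  # returns the new job_id on success
```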

We are deeply grateful to the hundreds of customers who provided feedback during a successful public preview of Jobs orchestration with multiple tasks. Based on their input, we have added further improvements: a streamlined debug workflow, information panels that provide an overview of the job at all times, and a new version 2.1 of the Jobs API (AWS|Azure|GCP) with support for new orchestration features.

“Jobs orchestration is amazing, much better than an orchestration notebook. Each of our jobs now has multiple tasks, and it turned out to be easier to implement than I thought. I can’t imagine implementing such a data pipeline without Databricks.” – Omar Doma, Data Engineering Manager at BatchService

Get started with the new Jobs orchestration today by enabling it yourself for your workspace (AWS|Azure|GCP); otherwise, auto-enablement will occur over the course of the following months.

In the coming months, we will make it possible to reuse the same cluster among multiple tasks in a job and to repair failed job runs without requiring a full rerun. We are also looking forward to launching features that will make it possible to integrate with your existing orchestration tools.

Enable your workspace


Databricks Sets Official Data Warehousing Performance Record


Today, we are proud to announce that Databricks SQL has set a new world record in 100TB TPC-DS, the gold standard performance benchmark for data warehousing. Databricks SQL outperformed the previous record by 2.2x. Unlike most other benchmark news, this result has been formally audited and reviewed by the TPC council.

These results were corroborated by research from Barcelona Supercomputing Center, which frequently runs TPC-DS on popular data warehouses. Their latest research benchmarked Databricks and Snowflake, and found that Databricks was 2.7x faster and 12x better in terms of price performance. This result validated the thesis that data warehouses such as Snowflake become prohibitively expensive as data size increases in production.

Databricks has been rapidly developing full blown data warehousing capabilities directly on data lakes, bringing the best of both worlds in one data architecture dubbed the data lakehouse. We announced our full suite of data warehousing capabilities as Databricks SQL in November 2020. The open question since then has been whether an open architecture based on a lakehouse can provide the performance, speed, and cost of the classic data warehouses. This result proves beyond any doubt that this is possible and achievable by the lakehouse architecture.

Rather than just sharing the results, we would like to take this opportunity to share with you the story of how we accomplished this level of performance and the effort that went into it. But we’ll start with the results:

TPC-DS World Record

Databricks SQL delivered 32,941,245 QphDS @ 100TB. This beats the previous world record held by Alibaba’s custom built system, which achieved 14,861,137 QphDS @ 100TB, by 2.2x. (Alibaba had an impressive system supporting the world’s largest e-commerce platform). Not only did Databricks SQL significantly beat the previous record, it did so by lowering the total cost of the system by 10% (based on published listed pricing without any discounts).

It’s perfectly normal if you don’t know what the unit QphDS means. (We don’t either without looking at the formula.) QphDS is the primary metric for TPC-DS, which represents the performance of a combination of workloads, including (1) loading the data set, (2) processing a sequence of queries (power test), (3) processing several concurrent query streams (throughput test), and (4) running data maintenance functions that insert and delete data.

The aforementioned conclusion is further supported by the research team at Barcelona Supercomputing Center (BSC) that recently ran a different benchmark comparing Databricks SQL and Snowflake, and found that Databricks SQL was 2.7x faster than a similarly sized Snowflake setup. They benchmarked Databricks using two different modes: on-demand and spot (underlying machines backed by spot instances with lower reliability but also lower cost). Databricks was 7.4x cheaper than Snowflake in on-demand mode, and 12x in spot.

Chart 1: Elapsed time for test derived from TPC-DS 100TB Power Run, by Barcelona Supercomputing Center.


Chart 2: Price/Performance for test derived from TPC-DS 100TB Power Run, by Barcelona Supercomputing Center.

What is TPC-DS?

TPC-DS is a data warehousing benchmark defined by the Transaction Processing Performance Council (TPC). TPC is a non-profit organization started by the database community in the late 80s, focusing on creating benchmarks that emulate real-world scenarios and, as a result, can be used objectively to measure database systems’ performance. TPC has had a profound impact in the field of databases, with decade-long “benchmarking wars” between established vendors like Oracle, Microsoft, and IBM that have pushed the field forward.

The “DS” in TPC-DS stands for “decision support.” It includes 99 queries of varying complexity, from very simple aggregations to complex pattern mining. It is a relatively new (work started in mid 2000s) benchmark to reflect the growing complexity of analytics. In the last decade or so, TPC-DS has become the de facto standard data warehousing benchmark, adopted by virtually all vendors.

However, due to its complexity, many data warehouse systems, even the ones built by the most established vendors, have tweaked the official benchmark so their own systems would perform well. (Some common tweaks include removing certain SQL features such as rollups or changing data distribution to remove skew). This is one of the reasons why there have been very few submissions to the official TPC-DS benchmark, despite more than 4 million pages on the Internet about TPC-DS. The tweaks also ostensibly explain why most vendors seem to beat all other vendors according to their own benchmarks.

How did we do it?

As mentioned earlier, there have been open questions whether it’s possible for Databricks SQL to outperform data warehouses in SQL performance. Most of the challenges can be distilled into the following four issues:

  1. Data warehouses leverage proprietary data formats and, as a result, can evolve them quickly, whereas Databricks (based on the Lakehouse) relies on open formats (such as Apache Parquet and Delta Lake) that don’t change as quickly. As a result, enterprise data warehouses (EDWs) would have an inherent advantage.
  2. Great SQL performance requires an MPP (massively parallel processing) architecture, and Databricks and Apache Spark were presumed not to be MPP.
  3. The classic tradeoff between throughput and latency implies that a system can be great for either large queries (throughput focused) or small queries (latency focused), but not both. Since Databricks focused on large queries, the assumption was that it must perform poorly on small queries.
  4. Even if all of this were possible, conventional wisdom says it takes a decade or longer to build a data warehouse system, so there is no way progress could be made so quickly.

In the rest of the blog post, we will discuss them one by one.

Proprietary vs open data formats

One of the key tenets of the Lakehouse architecture is the open storage format. “Open” not only avoids vendor lock-in but also enables an ecosystem of tools to be developed independent of the vendor. One of the major benefits of open formats is standardization. As a result of this standardization, most of the enterprise data is sitting in open data lakes and Apache Parquet has become the de facto standard for storing data. By bringing data warehouse-grade performance to open formats, we hope to minimize data movement and simplify the data architecture for BI and AI workloads.

An obvious criticism of “open” is that open formats are hard to change and, as a result, hard to improve. Although this argument makes sense in theory, it is not accurate in practice.

First, it is definitely possible for open formats to evolve. Parquet, the most popular open format for large data storage, has gone through multiple iterations of improvements. One of the main motivations for us introducing Delta Lake was to introduce additional capabilities that were difficult to do at the Parquet layer. Delta Lake brought additional indexing and statistics to Parquet.

The second point is a more nuanced one that requires some understanding of the systems’ architecture. For most queries, it is not the underlying data format but the intermediate caching format that determines the scan performance. Virtually all modern cloud systems rely on object stores for storage, and local SSDs/memory for caching. Databricks does that too. It turns out a well architected data system can read from the cache for most queries. Databricks SQL exploits this opportunity aggressively, transcoding the data into a more efficient format for NVMe SSDs on the fly during caching.

MPP architecture

A common misconception is that data warehouses employ an MPP architecture that is great for SQL performance, while Databricks does not. MPP architecture refers to the ability to leverage multiple nodes to process a single query. This is exactly how Databricks SQL is architected. It is not based on Apache Spark, but rather on Photon, a complete rewrite of the execution engine, built from scratch in C++ for modern SIMD hardware, that performs heavy parallel query processing. Photon is thus an MPP engine.

Throughput vs latency trade off

Throughput vs latency is the classic tradeoff in computer systems, meaning that a system cannot get high throughput and low latency simultaneously. If a design favors throughput (e.g. by batching data), it would have to sacrifice latency. In the context of data systems, this means a system cannot process large queries and small queries efficiently at the same time.

We won’t deny that this tradeoff exists. In fact, we often discuss it in our technical design docs. However, current state-of-the-art systems, including our own and all the popular warehouses, are far from the optimal frontier on both the throughput and latency fronts.

Consequently, it is entirely possible to come up with a new design and implementation that simultaneously improves both its throughput and latency. That is exactly how we’ve built almost all our key enabling technologies in the last two years: Photon, Delta Lake, and many other cutting-edge technologies have improved the performance of both large and small queries, pushing the frontier to a new performance record.

Time and focus

Finally, conventional wisdom is that it would take at least a decade or so for a database system to mature. Given Databricks’ recent focus on Lakehouse (to support SQL workloads), it would take additional effort for SQL to be performant. This is valid, but let us explain how we did it much faster than one might expect.

First and foremost, this investment didn’t start just a year or two ago. Since the inception of Databricks, we have been investing in various foundational technologies to support SQL workloads that would also benefit AI workloads on Databricks. This includes a full blown cost-based query optimizer, a native vectorized execution engine, and various capabilities like window functions. The vast majority of workloads on Databricks run through these thanks to Spark’s DataFrame API, which maps into its SQL engine, so these components have had years of testing and optimization. What we haven’t done as much was to emphasize SQL workloads. The positioning change towards Lakehouse is a recent one, driven by our customers’ desire to simplify their data architectures.

Second, the SaaS model has accelerated software development cycles. In the past, most vendors had yearly release cycles and then another multi-year cycle for customers to install and adopt the software. In SaaS, our engineering team can come up with a new design, implement it, and release it to a subset of customers in a matter of days. This shortened development cycle enabled teams to get feedback quickly and innovate faster.

Third, Databricks could bring significantly more focus both in terms of leadership bandwidth and capital to this problem. Past attempts at building a new data warehouse system were done either by startups or a new team within a large company. There has never been a database startup as well funded as Databricks (over $3.5B raised) to attract the talent needed to build this. A new effort within a large company would be just yet another effort, and wouldn’t have the leadership’s full attention.

We had a unique situation here: we focused initially on establishing our business not on data warehousing, but on related fields (data science and AI) that shared a lot of the common technological problems. This initial success then enabled us to fund the most aggressive SQL team build-out in history; in a short period of time, we’ve assembled a team with extensive data warehouse background, a feat that would take many other companies around a decade. Among them are lead engineers and designers of some of the most successful data systems, including Amazon Redshift; Google’s BigQuery, F1 (Google’s internal data warehouse system), and Procella (YouTube’s internal data warehouse system); Oracle; IBM DB2; and Microsoft SQL Server.

To summarize, it takes multiple years to build out great SQL performance. Not only did we accelerate this by leveraging our unique circumstances, but we also started years ago, even though we didn’t use a megaphone to advertise the plan.

Real-world customer workloads

We are excited to see these benchmark results validated by our customers. Over 5,000 global organizations have been leveraging the Databricks Lakehouse Platform to solve some of the world’s toughest problems. For example:

  • Bread Finance is a technology-driven payments platform with big data use cases such as financial reporting, fraud detection, credit risk, loss estimation and a full-funnel recommendation engine. On the Databricks Lakehouse Platform, they are able to move from nightly batch jobs to near real time ingestion, and reduce data processing time by 90%. Moreover, the data platform can scale to 140x the volume of data at only 1.5x the cost.
  • Shell is using our lakehouse platform to enable hundreds of data analysts execute rapid queries on petabyte scale datasets using standard BI tools, which they consider to be a “game changer.”
  • Regeneron is accelerating drug target identification, providing faster insights to computational biologists by reducing the time it takes to run queries on their entire dataset from 30 minutes down to 3 seconds – a 600x improvement.

Summary

Databricks SQL, built on top of the Lakehouse architecture, is the fastest data warehouse in the market and provides the best price/performance. Now you can get great performance on all your data at low latency as soon as new data is ingested without having to export to a different system.

This is a testament to the Lakehouse vision, to bring world-class data warehousing performance to data lakes. Of course, we didn’t build just a data warehouse. The Lakehouse architecture provides the ability to cover all data workloads, from warehousing to data science and machine learning.

But we are not done yet. We have assembled the best team on the market, and they are working hard to deliver the next performance breakthrough. In addition to performance, we are also working on a myriad of improvements on ease-of-use and governance. Expect more news from us in the coming year.


Turning 2 Trillion Data Points of Traffic Intelligence into Critical Business Insights


This is a guest authored post by Stephanie Mak, Senior Data Engineer, formerly at Intelematics.

 
This blog post shares my experience contributing to the open source community with Bricklayer, a project I started during my time at Intelematics. Bricklayer is a utility, built on the Databricks Lakehouse Platform, for data engineers whose job is to farm jobs and build map layers and other structures with geospatial data. For teams dealing with copious amounts of geospatial data, Bricklayer makes it easier to manipulate and visualize that data and to parallelize batch jobs programmatically. Through this open source project, teams working with geospatial data can increase their productivity by spending less time writing code for these common functionalities.

Background

Founded in Melbourne in 2001, Intelematics has spent two decades developing a deep understanding of the intelligent transport landscape and its impact on customers. Over the years, a total data footprint of almost 2 trillion data points of traffic intelligence has been accumulated from smart sensors, vehicle probes in both commercial and private fleets, and a range of IoT devices. To make sense of this enormous amount of data, the INSIGHT traffic intelligence platform was created, which helps users with fleet GPS tracking and management and provides access to a single source of truth for planning, managing and assessing projects that require comprehensive and reliable traffic data and insights.

With the help of Databricks, the INSIGHT traffic intelligence platform is able to process 12 billion road traffic data points in under 30 seconds. It also provides a detailed picture of Australia’s road and movement network to help solve complex road and traffic problems and uncover new opportunities.

The INSIGHT team initially started our open source project, Bricklayer, as a way to monitor internal productivity and to address some of the pain points experienced when performing geospatial analysis and multi-processing. We were able to solve the inefficiencies in our workflow, which entailed switching between our online data retrieval tool (Databricks) and offline geospatial visualization tool (QGIS), and simplify performing arbitrary parallelization in pipelines. We then decided to join the open source community to help shape the foundation of the big data processing ecosystem.

GIS data transformation

The need for spatial analysis in Databricks Workspace
Spatial analysis and manipulation of geographical information were traditionally done using QGIS, a desktop application running locally or on a server, with features like support for multiple vector overlays and immediate visualization of geospatial query and geoprocessing results.

Our data transformation pipelines are implemented in a Databricks workspace. To support the fast iteration of data asset development, which progresses through analysis, design, implementation, validation and downstream consumption, spatial analysis in a local environment is inefficient, and with the amount of data we have, impossible.

Building a map layer to display in a Databricks notebook
We decided to use folium to render geometry on map tilesets such as OpenStreetMap and display the result in a Databricks notebook.

Sample map layer displayed in Databricks notebook

For an example of what this may look like, this public map demo notebook may provide some guidance on usage.
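
As a rough illustration of the approach (not Bricklayer's actual API; the coordinates and GeoJSON below are made-up placeholders), a folium map can be rendered inline in a Databricks notebook via displayHTML:

```python
import folium

# Hypothetical example: render a GeoJSON layer over OpenStreetMap tiles
m = folium.Map(location=[-37.81, 144.96], zoom_start=12, tiles="OpenStreetMap")

road_segments_geojson = {  # placeholder geometry
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"name": "sample segment"},
        "geometry": {"type": "LineString",
                     "coordinates": [[144.95, -37.81], [144.97, -37.82]]},
    }],
}
folium.GeoJson(road_segments_geojson, name="road segments").add_to(m)

# Display the folium map inline in the notebook
displayHTML(m._repr_html_())
```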

Scaling and parallel batch processing

The need for parallel batch processing
Geospatial transformation is computationally intensive. With the amount of data we have (96 timeslots per day, for several years, across hundreds of thousands of road segments), it would be impossible to load all of it into memory even with the largest instance size in Databricks. Using a “divide and conquer” approach, data can be chopped along the time dimension into evenly distributed batches and processed in parallel. Since this workload runs in Python, it was not possible to parallelize with threads, and it was not trivial to use Apache Spark™ parallelization.

Job spawning using Jobs API
Databricks allows you to spawn jobs inline using the `dbutils.notebook.run` command; however, running the command is a blocking call, so you are unable to start jobs concurrently. By leveraging the Databricks REST API 2.0, Bricklayer can spawn multiple jobs at the same time to address the parallelization problem. We wrap some common use cases for dealing with jobs, such as creating new jobs on a new job cluster, monitoring job status, and terminating the batch.

To trigger multiple jobs:

```python
from bricklayer.api import DBSApi

for x in range(3):
    job = DBSApi().create_job('./dummy_job')
```

To retrieve specific jobs and terminate them:

```python
from bricklayer.api import DBSApi

for job in DBSApi().list_jobs(job_name='dummy_job'):
    print(job.job_id)
    job.stop()
```

What’s next

Databricks accelerates innovation by unifying the workflows of data science, data engineering and business in a single platform. With Bricklayer, Intelematics’ mission is to create more seamless integrations that make engineers’ lives easier.

We plan to continuously improve error messages to make them more useful and informative, auto-generate tables in multiple formats based on schemas provided in Avro and Swagger/OpenAPI, and validate the catalog according to the schema. For the latest details, please visit the roadmap.


About Intelematics

Founded in Melbourne in 2001, Intelematics has spent two decades developing a deep understanding of the intelligent transport landscape and what it can mean for our customers. We believe that the ever-increasing abundance of data and the desire for connectivity will fundamentally change the way we live our lives over the coming decades. We work with clients to harness the power of technology and data to drive smarter decisions and innovate for the benefit of their customers.
 
 


Building the Next Generation Visualization Tools at Databricks


This post is a part of our blog series on our frontend work. You can see the previous one on “Simplifying Data + AI, One Line of TypeScript at a Time.”

After years of working on data visualization tools, I recently joined Databricks as a founding member of the visualization team, which aims to develop high-performance visual analytics capabilities for Databricks products. In this post, I’m sharing why I am super excited to build the next-generation visualization tools at Databricks.

Mission alignment: simplify data and AI

I joined Databricks because my passion aligns with the company’s mission to simplify data and AI.

For context, I did my PhD at the UW Interactive Data Lab to research new visualization tools that make data more accessible (like the lab did by creating D3.js). After my PhD, I joined Apple’s AI/ML group as their first visualization research scientist and co-founded the Machine Intelligence Visualization team to build better visualization tools for machine learning at Apple. Over the years, I co-authored many open-source projects that aimed to simplify data visualization and AI, including Vega-Lite, Voyager, and the Tensorflow Graph Visualization.
 


Vega-Lite lets users easily build interactive visualizations with a concise and intuitive JSON API.

 
Similar to how Apache Spark™ helps people run distributed computations with just a few lines of Python or SQL, Vega-Lite helps users build interactive charts by writing a dozen lines of code (instead of hundreds in D3.js). Vega-Lite’s JSON format also enables the open-source communities to build wrapper APIs in other languages such as Altair in Python. As a result, people can easily create interactive charts in these languages as well.
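
For a flavor of how concise this can be, here is a minimal, hedged example using the Altair wrapper mentioned above (the data values are made up); under the hood, Altair compiles this to a Vega-Lite JSON spec:

```python
import altair as alt
import pandas as pd

# Made-up sample data
source = pd.DataFrame({
    "category": ["a", "b", "c"],
    "value": [28, 55, 43],
})

# An interactive bar chart in a handful of lines
chart = (alt.Chart(source)
    .mark_bar()
    .encode(x="category:N", y="value:Q", tooltip=["category", "value"])
    .interactive())

chart  # in a notebook, this renders the chart inline
```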

Voyager is a graphical interface that leverages chart recommendations for data exploration.

Besides simplifying code for visualization, I also built a tool for visualizing data without writing code. The Voyager system leverages chart recommendations to help people quickly explore data in a graphical user interface (GUI). As a research project, Voyager received a lot of traction including integration with JupyterLab. However, building a production-quality GUI tool and integrating it with data science environments require significant resources beyond what a small research team could have. Thus, I had been hoping for an opportunity to take some of these research ideas to the next level.

So when I heard that Databricks was assembling a team to develop new visualization tools on top of their powerful Lakehouse platform, I jumped at the opportunity.

Databricks: Unique opportunity for visualization tool builders

Databricks offers a unique opportunity for building next-generation visualization tools for many reasons:

First, Databricks is where data at scale lives. One of the hardest problems visualization tools need to overcome in gaining adoption is integrating with data sources. Over 5,000 global organizations are using the Databricks Lakehouse Platform for data engineering, machine learning and analytics. Every day, the platform processes exabytes of data over millions of machines. We can build tools that impact data analysts, data engineers, and data scientists on this platform, where the data is readily available.

Second, the company has a strong open-source culture. Databricks was co-founded by the original authors of Apache Spark and has since built many leading open-source projects including Delta Lake and MLflow. At Databricks, we have the opportunity to both build products that impact customers and contribute to open-source communities.

Third, future visualization tools should be integrated into data, analytics, and machine learning workflows, so people can easily leverage the power of visualizations. As a unified platform for all of these workflows, Databricks is the perfect place to build these integrations.

Last but not least, since visualization is a relatively new area for Databricks, we have the flexibility to innovate a new class of visualization tools without being restricted by decades of legacy.

The Databricks Lakehouse Platform provides a unified environment for data, analytics, and machine learning work. Visualization can be an integral part of these different activities.

Visualization tools as an integral part of a unified platform

There are many exciting challenges and advantages for building visualization tools as an integrated part of a unified platform for data, analytics, and AI. Here are a few highlights.

Bridging coding and graphical user interfaces

As we consider different groups of data workers, which include both programmers and non-programmers, one exciting challenge is to design tools that can benefit from the best of both graphical and coding interfaces. Specifically, existing visualization GUI tools provide ease-of-use and accessibility to non-programmers, but are often built as monolithic standalone tools and thus are not integrated with data science coding environments like notebooks. On the other hand, charting APIs are natural for usage in notebooks and for integration with other engineering tools such as version control and continuous integration. However, they lack the same ease-of-use and interactivity provided by GUI tools.

We think the future of visualization tools will be GUI components that are well integrated with coding environments and the data ecosystems. Prior to joining Databricks, my colleagues and I explored this idea in our mage project and published a paper about it at UIST’20. I am also very excited that Databricks recently acquired 8080 Labs, the creator of Bamboolib, a popular Python library that introduces extendable GUIs to enable low-code analysis in Jupyter notebooks. We have a great opportunity to better bridge the gap between coding and graphical interfaces on the Databricks Lakehouse platform.

Bamboolib introduces extendable GUIs that can export code in Jupyter Notebooks.

Consistent experience for different data activities

By integrating visualization tools into a unified data platform, users can leverage the same set of features and get consistent experiences across different activities. We are currently integrating visualization capabilities from Databricks SQL across the Lakehouse platform.

With this integration, users may use our tools to profile and clean their data during ETL. They may then use the same tools for their analyses or modeling. They can also reuse the same charts from their analyses in their reports and dashboards, or use similar tools to create new charts. As we enhance our features, our work can benefit all of these use cases.

We can also leverage other tools on the platform to improve the user experience of visualization tools. For example, as users perform data modeling in our data catalog, visualization tools can leverage the resulting metadata (such as data types or relationships between columns) to provide better defaults and make recommendations for our users.

Scalable visualization tools

As the amount of data grows rapidly, it is critical that future visualization tools also scale. Databricks is arguably the best place to build visualization tools at scale because the company is well-known for the scalability of its platform. We have an opportunity to leverage Databricks’ powerful systems on the platform. For example, we are building a new visualization aggregation feature in Databricks SQL that can aggregate data either in the browser or in the backend, depending on the data size. More importantly, we can also collaborate with our world-class backend engineers and influence the design of the platform to better support new use cases such as ad hoc data analytics and streaming visualizations.

You can help us build the future of data experience!

I’m super excited about what we are building at Databricks. We are starting with a small but talented team, with world-class engineers, designers, and product managers that have designed leading data analysis and visualization tools. However, we are just getting started. There are still a lot of exciting things to build at Databricks and you can help us revolutionize how people work with data.

JOIN OUR TEAM!

The post Building the Next Generation Visualization Tools at Databricks appeared first on Databricks.

Summer 2021 Databricks Internship – Their Work and Their Impact!


With COVID precautions still in place, the 2021 Databricks Software Engineering Summer internship was conducted virtually with members of the intern class joining us from their home offices located throughout the world. As always, this year’s interns collaborated on a number of impactful projects, covering a wide range of data and computer science disciplines, from software development and algorithm design to cloud infrastructure and pipeline architecture and management. This blog features nine of this year’s team members, who share their internship experience and details of their intern projects.

Prithvi Kannan – MLflow Experiment Actions

MLflow Experiments are used extensively by data scientists to track and monitor ML training processes collaboratively. Prithvi's internship focused on making MLflow Experiments (part of MLflow, the open source platform for managing the end-to-end machine learning lifecycle) first-class objects on Databricks. To execute on this goal, he built an Experiments Permission API that enabled customers to programmatically view and update permissions on Experiments. Next, he worked on giving users the ability to perform Managed Actions on an experiment from within the experiment observatory and experiment page. To deliver these features, Prithvi worked through an end-to-end product development process at Databricks, wearing many hats: he collaborated with PMs on product requirements, UX designers to nail the user interface designs, Tech Leads for engineering design decisions, and finally tech writers to polish product release statements before rollout. Ultimately, we were excited to make permissions and sharing easier on Databricks ML!

Databricks software engineering intern Prithvi Kannan focused on making MLflow Experiments first-class objects, building an API for viewing and updating experiment permissions programmatically.


Kritin Singhal – Migrating Auth Session Caches to use Highly Available Distributed Redis Clusters

Kritin worked on the Applications Infrastructure team to improve and scale the Databricks authentication service to handle 1 million requests/second. The auth service used a local cache in each Kubernetes pod, which resulted in many duplicated calls to the database. Kritin onboarded Redis Clusters for the team and deployed a Redis Cluster Coordinator that automated cluster creation and failovers, leading to a highly available caching infrastructure with horizontal scaling and high throughput. He then migrated the local caches to an optimized caching flow using Redis and the Guava local cache, enabling the team to expand and scale the authentication service pods horizontally and reducing traffic to the database. This improvement will help Databricks scale its top-most authentication layer as usage grows!


Tianyi Zhang – Search Index Live Updater

Tianyi worked with the Data Discovery team to design and implement the live updater, a new service for bringing Elasticsearch indices up-to-date by rewinding and replaying Change Data Capture (CDC) events from input streams with high throughput. It is a major component of the universal search platform, which allows users to search and discover across all the entities within their Databricks workspace – down to the metadata, including notebooks, models, experiments, repos and files they own or create. It is designed to operate in both Azure and AWS and is extensible to other newly introduced search entities in the future. The new search infrastructure will unify existing search experiences with varying capabilities and significantly improve users' experience finding the right data assets.

Databricks intern Tianyi Zhang worked with the Data Discovery team to design and implement the live updater, a new service for bringing Elasticsearch indices up-to-date.


Steven Chen – Auto-generating Batch Inference Notebooks

This summer, Steven got to work on the ML Inference team to improve the customer user journey for using models in the MLflow model registry. Models are often used to generate predictions regularly on batches of data or predict real-time data by setting up a REST API. Steven’s project focused on building a new feature that auto-generates a batch inference notebook that recreates the training environment and generates and saves predictions on an input table. In this role, Steven had a chance to work on one of Databricks’ open source projects,  MLflow,  to log additional dependencies and metadata necessary for the model environment, which was released during his internship. Finally, he connected the feature to the MLflow UI to allow for one-click batch inference notebook generation and real-time serving.
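Conceptually, such a generated notebook wraps the registered model as a Spark UDF and applies it to the input table. The sketch below is a hedged illustration of that pattern, not the generated code itself; the model URI and table names are placeholders.

```python
import mlflow.pyfunc

# Load the registered model as a Spark UDF (model URI and table names are placeholders).
predict = mlflow.pyfunc.spark_udf(spark, model_uri="models:/my_model/Production")

# Score every row of the input table and persist the predictions to a Delta table.
input_df = spark.table("input_table")
scored_df = input_df.withColumn("prediction", predict(*input_df.columns))
scored_df.write.format("delta").mode("overwrite").saveAsTable("batch_predictions")
```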

Steven Chen worked with the ML Inference team to improve the customer user journey for using models in the MLflow model registry.


Leon Bi — Databricks Repos CRUD API / Ipynb File Support

Leon worked with the Repos team to help implement a CRUD API for the Databricks Repos feature. Databricks Repos provides repository-level integration with Git providers by syncing code in a Databricks workspace with a remote Git repository and lets users use Git functionality such as cloning a remote repo, managing branches, pushing and pulling changes, and visually comparing differences upon commit. With the new CRUD API, users can automate the lifecycle management of their repos by hitting REST endpoints. Other API features include endpoints to manage user access levels for each Repo and cursor pagination for the list. Finally, Leon implemented Ipynb file support in Repos. Ipynb (Jupyter) notebooks are the industry standard for interactive data science notebooks, but this file format was not previously supported in Repos. Now, customers have the ability to import, edit and run their existing Ipynb notebooks within Repos.

Leon Bi worked with the Repos team to help implement a CRUD API for the Databricks Repos feature.


James Wei – SCIM API Token At Accounts Level

James worked with the Identity & Access Management team to design and build scoped API tokens that enable user provisioning in Databricks accounts. Accounts are a higher-level abstraction that customers can use to manage multiple workspaces. Previously, user provisioning had to be done on a per-workspace basis, account-level tokens were nonexistent, and workspace access tokens could be used to access almost any API. James rebuilt the Databricks token infrastructure to support accounts and introduced a new scoped token authenticator that validates the defined scopes for a token against the authorization policy of an endpoint. He then created a scoped token that can sync enterprise users and groups between an identity provider and Databricks. As a result, user provisioning moved from a multi-step process per workspace to a one-time process per account. This project played a critical role in simplifying the customer journey when adding users to Databricks, and will undoubtedly help drive usage across the Databricks platform!

Databricks intern James Wei worked with the IAM team to design and build scoped API tokens that can be used to enable user provisioning in Databricks accounts.


Brian Wu – Stateless Sessions Thrift Protocol

Brian worked with the SQL Gateway team to simplify and improve the consistency of session state updates in Databricks SQL, which lets customers operate a multi-cloud lakehouse architecture at a better price/performance. SQL commands run within the context of a SQL session, but multiple commands in the same session can run across different Apache Spark clusters. Brian's contributions modified the protocol used to communicate with these clusters, allowing them to be stateless for sessions while ensuring all commands run with the latest session state. He also implemented command-scoped session state updates, so concurrent commands in the same session executing on the same cluster don't interfere with each other. His project improves consistency, reduces code complexity and allows the engineering team to simplify the process of adding new features to Databricks SQL.

Databricks intern Brian Wu worked with the SQL Gateway team to simplify and improve the consistency of Databricks SQL session state updates.


Regina Wang – SQL Autocomplete

Over the course of the summer, Regina worked on improving the SQL code editing experience for our main notebook product. To this end, she first added context-aware results for SQL autocomplete: when making autocomplete recommendations, the entire query is taken into account. This gives users more relevant autocomplete results and more seamless integration of our notebooks with Redash's SQL parser. In addition, Regina added the ability to display SQL autocomplete and syntax highlighting within Python notebooks, a high-demand feature that enables customers to easily sift through databases and tables while coding in Python. When fully rolled out, these features will improve developer productivity for the roughly 65% of customers who use our main Python notebook.


Steve Jiang – Migration from SQS/AQS to Auto Loader

Steve worked with the Ingestion team to build scalable ingestion from Google Cloud Storage via the Auto Loader streaming source. This approach utilizes the GCP Pub/Sub messaging service and event notifications to process and stream changes from a GCS bucket. This form of ingestion is significantly more scalable than the alternative, which is based on repeated directory listings to ingest new files, and will make Auto Loader a much more feasible option for customers with large amounts of data stored on GCP. He also worked on supporting schema evolution for Avro files by extending Databricks' schema inference functionality to merge multiple Avro schemas into one unified schema, and by building rescued data support for Avro files. The new schema inference allows streams to run continuously while accommodating new fields that may pop up in the data schema over time. Rescued data captures any data that would otherwise be lost due to mismatched schemas in Avro files whose schemas differ from what was expected.

Databricks intern Steve Jiang worked with the Ingestion team to build scalable ingestion from Google Cloud Storage via the Auto Loader streaming source.

We want to extend a big “thank you” to all of our Summer 2021 engineering interns, who helped develop some important features and updates that will be widely used by our customers.

Want to help make an impact? Our Software Engineering Internship and New Grad (2022 Start) roles are open! We’re always looking for eager engineers that are excited to grow, learn and contribute to building an innovative company together!

--

Try Databricks for free. Get started today.

The post Summer 2021 Databricks Internship – Their Work and Their Impact! appeared first on Databricks.

Eliminating the DeWitt Clause for Database Benchmarking


At Databricks, we often use the phrase “the future is open” to refer to technology; it reflects our belief that open data architecture will win out and subsume proprietary ones. “Open” isn’t just about code. It’s about how we as an industry operate and foster debate. Today, many companies in tech have tried to control the narrative on their products’ performance through a legal maneuver called the DeWitt Clause, which prevents comparative benchmarking. We think this practice is bad for customers and bad for innovation, and it’s time for it to go. That’s why we are removing the DeWitt Clause from our service terms, and calling upon the rest of the industry to follow.

What is the DeWitt Clause?

According to Wikipedia, “the original DeWitt Clause was established by Oracle at the behest of Larry Ellison. Ellison was displeased with a benchmark study done by David DeWitt in 1982, …, which showed that Oracle’s system had poor performance.”

Professor David DeWitt is a well-known database researcher who pioneered many of the technologies modern-day database systems depend on. Early on in his career, DeWitt spent a lot of time benchmarking different commercial database systems to understand and push forward the state-of-the-art. He published papers that demonstrated the strengths and weaknesses of such systems, including Oracle.

Triggered by this research, Oracle created the DeWitt Clause, a new provision that prohibits people (researchers, scientists, or competitors) from publishing any benchmarks of Oracle’s database systems.

Over time, the DeWitt Clause became a standard feature of most database vendors’ license agreements. It’s a primary reason you often see benchmarks comparing anonymous systems, sometimes referred to as DBMS-X, in research papers and why many benchmarks are completely absent.

Almost 40 years have passed since the introduction of the original DeWitt Clause – 40 years of DeWitt’s name, the name of a tech pioneer who was pro benchmarking, being synonymous with preventing its use.

An open era calls for open terms: introducing the “DeWitt Embrace Clause”

Like many vendors – from classic ones like Oracle to newer ones like Snowflake – we too included a DeWitt Clause in the past, as it was the industry standard.

But standard isn’t good enough. We owe our users more.

We believe our users should have access to transparent benchmarks to decide which products are best for them. We also don’t think we should get angry at benchmarks in which we do poorly, as long as they were conducted in a transparent, fair fashion. We view those as opportunities to improve our product.

As a result, we have removed the DeWitt clause from our service terms.

But getting rid of our DeWitt clause isn’t enough; if you want to be able to see real, competitive benchmarks, you need to demand that no one relies on a DeWitt clause to stifle competition.

That’s why we’re also introducing what we refer to as a “DeWitt Embrace Clause”: if a competitor or vendor benchmarks Databricks or instructs a third party to do so, this new provision invalidates the vendor’s own DeWitt Clause if there’s any, to allow us to benchmark them and explicitly name them in the benchmark. We are not alone in this. For example, Azure and AWS have adopted similar clauses.

Today we call on the rest of the industry to follow in DeWitt’s footsteps.

We believe companies should win or lose because their products are better or worse. Because their engineers innovate. Not because their lawyers stagnate.

The future is open.

--

Try Databricks for free. Get started today.

The post Eliminating the DeWitt Clause for Database Benchmarking appeared first on Databricks.


Announcing Databricks Engineering Fellowship


We are excited to announce a new program called the Databricks Engineering Fellowship to recognize new graduates in the field of computer science with exceptional academic achievements or extracurricular impact. This fellowship supports new graduates joining Databricks in all areas relevant to Databricks engineering, covering computer systems, databases, frontend development, infrastructure, machine learning, security, and more.

There are two important aspects of this fellowship program:

Each recipient will receive $100,000 in sign-on financial support upon joining.
Each recipient will also be paired with a senior engineer who is a world-class expert in their respective domain as a career and technical mentor.

How do I apply?

Simply apply through the normal university recruiting process. There is no separate application process. All new graduate candidates (including Bachelors, Masters, and PhDs) receiving a job offer from Databricks will be reviewed automatically by the Fellowship Award Committee and will be notified should additional information be required. For candidates with academic research experience, we encourage recommendation letters from the supervising professors. Applications are reviewed on an ongoing basis.

Who qualifies?

You must pass our new grad interview process and receive an offer, and join Databricks engineering full-time after graduation for two or more years. We look for candidates with exceptional academic achievement, research experience, and/or extracurricular impact.

We are starting with candidates only in the United States this year, but will expand to other countries in the future.

--

Try Databricks for free. Get started today.

The post Announcing Databricks Engineering Fellowship appeared first on Databricks.

What to Expect at Data + AI World Tour


This year, we’re doing things a bit differently. In this still very virtual world, we wanted to find a way to bring the energy of Data + AI Summit and the power of lakehouse to a whole new audience. That’s why we’re thrilled to introduce the community to Data + AI World Tour, which brings (virtual) stops in 4 countries and 3 regions this month. With content, customers and speakers tailored to each region, the Data + AI World Tour will dive into why lakehouse is quickly becoming the global standard for data architecture.

This brings us to the theme of the event: Discover Lakehouse. For our audience, this means that we’re showcasing the power of lakehouse, both at the most foundational and more advanced levels.

Whether you’re a longtime Summit attendee, Databricks customer or a data novice, this event is designed to have something for you. You can get all the details here, but these are three things you can expect when attending Data + AI World Tour:

Speakers, sessions and breakouts tailored to your region

Lakehouse is for everyone…but every region has its own unique challenges.

Data + AI World Tour will deliver rich industry insights alongside breakout sessions and customer stories tailored to each specific region. In addition to hearing from industry-wide thought leaders and Databricks executives, including co-founder and CEO Ali Ghodsi, who will share his vision for the data lakehouse, there will be multiple breakout sessions that offer training on a variety of disciplines and how they work together in a lakehouse. These include:

Data Engineering on Databricks

Providing reliable, timely data isn’t easy, and without trusted data, it’s impossible to support ML, data science or analytics initiatives. Discover how Databricks simplifies data engineering — from data ingestion to developing and maintaining an end-to-end ETL lifecycle to scheduling workflows — and how that makes the lakehouse architecture a reality.

Lightning-Fast Analytics on the Lakehouse

In this session, learn the inner workings of the lakehouse and how it can power lightning-fast analytics with Databricks SQL for quick and accurate decision-making from your data lake. You will get to see how a first-class SQL development experience, backed by the Photon query engine, achieves state-of-the-art performance for all query types.

Databricks for Data Science and Machine Learning

Operationalizing data science and machine learning can be difficult with a miscellany of data sources, ML tools and workflows. In this session, we show you how to bring together data engineers, data scientists and lines of business to collaborate on an open platform and operationalize the full ML lifecycle at scale.

Data-driven organizations share their data challenges and success stories

Our goal is to deliver content that is most relevant to you and your organization’s data challenges and needs. For each of our virtual destination stops, we’ll be featuring regional customers to share the details of their lakehouse journey – from the initial drivers to the lessons learned along the way. They’ll also uncover the specific use cases they’ve tapped into and their future vision for their lakehouse architecture.

We have tons of great data experts joining us, including Kate Hopkins, VP of Data Platform at AT&T, Zulfikar Lazuardi Maulana, Lead Data Scientist at Grab, Dr. Marina Paush, Head of Data Science at Viessmann, and Didier Pellegrin, VP of Analytics and AI at Schneider Electric.

Free training for all participants — with live support

We’ll also offer all registered participants an opportunity to take part in a free follow-along training session on Databricks Lakehouse Foundations. In this workshop, you’ll discover the foundational concepts for using data at scale and how to build successful data teams. You’ll also learn how the Databricks Lakehouse Platform can help you to streamline workflows to more efficiently make use of your data.

Conclusion

This is just a glimpse of what attendees can expect at Data + AI World Tour! Attendees will also have the opportunity to engage with other professionals at Meetups and interact with Databricks experts.

As organizations across the globe spearhead their planning and strategies for next year, it’s the perfect time for data professionals to explore the next-generation of data platforms and how to leverage them for business success. Register for free now!

--

Try Databricks for free. Get started today.

The post What to Expect at Data + AI World Tour appeared first on Databricks.

10 Powerful Features to Simplify Semi-structured Data Management in the Databricks Lakehouse



Ingesting and querying semi-structured JSON data can be tedious and time-consuming, but Auto Loader and Delta Lake make it easy. JSON data is very flexible, which makes it powerful, but also difficult to ingest and query.

The biggest challenges include:

  • It’s a tedious and fragile process to define a schema of the JSON file being ingested.
  • The schema can change over time, and you need to be able to handle those changes automatically.
  • Software does not always pick the correct schema for your data, and you may need to hint at the correct format. For example, the number 32 could be interpreted as either an integer or a long.
  • Often data engineers have no control of upstream data sources generating the semi-structured data. For example, the column name may be upper or lower case but denotes the same column, or the data type sometimes changes, and you may not want to completely rewrite the already ingested data in Delta Lake.
  • You may not want to do the upfront work of flattening out JSON documents and extracting every single column, and doing so may make the data very hard to use.
  • Querying semi-structured data in SQL is hard. You need to be able to query this data in a manner that is easy to understand.

In this blog and the accompanying notebook (Databricks Runtime 9.1 and above), we will show what built-in features make working with JSON simple at scale in the Databricks Lakehouse. Below is an incremental ETL architecture. The left-hand side represents continuous and scheduled ingest, and we will discuss how to do both types of ingest with Auto Loader. After the JSON file is ingested into a bronze Delta Lake table, we will discuss the features that make it easy to query complex and semi-structured data types that are common in JSON data.

With Auto Loader and Delta Lake, you can easily ingest and query complex JSON

In the accompanying notebook, we used sales order data to demonstrate how to easily ingest JSON. The nested JSON sales order datasets get complex very quickly.

Hassle-free JSON ingestion with Auto Loader

Auto Loader provides Python and Scala interfaces to ingest new data from a folder location in object storage (S3, ADLS, GCS) into a Delta Lake table. Auto Loader makes ingestion easy and hassle-free by enabling data ingestion into Delta Lake tables directly from object storage in either a continuous or scheduled way.

Before discussing the general features of Auto Loader, let’s dig into the features that make ingesting the JSON extremely easy. Below is an example of how to ingest very complex JSON data.

# Incrementally ingest JSON from the landing zone with Auto Loader, inferring
# column types and applying schema hints for the problematic columns.
df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.schemaLocation", schemaLocation) \
  .option("cloudFiles.format", "json") \
  .option("cloudFiles.inferColumnTypes", "true") \
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
  .option("cloudFiles.schemaHints", schemaHints) \
  .load(landingZoneLocation)

Flexibility and ease of defining the schema: In the code above, we use two features of Auto Loader to easily define the schema while giving guardrails for problematic data. The two useful features are cloudFiles.inferColumnTypes (Powerful Feature No. 1 – InferColumnTypes) and cloudFiles.schemaHints (Powerful Feature No. 2 – Schema Hints). Let's take a closer look at the definitions:

  • cloudFiles.inferColumnTypes is the option to turn on/off the mechanism to infer data types, for example, strings, integers, longs and floats, during the schema inference process. The default value for cloudFiles.inferColumnTypes is false because, in most cases, it is better to have the top-level columns be strings for schema evolution robustness and avoid issues such as numeric type mismatches (integers, longs, floats) during the schema evolution process.
  • cloudFiles.schemaHints is the option to specify desired data types to some of the columns, aka “schemaHints”, during the schema inference process. Schema hints are used only if you do not provide a schema to Auto Loader. You can use schema hints whether cloudFiles.inferColumnTypes is enabled or disabled. More details can be found here.

In this use case (notebook), we actually set cloudFiles.inferColumnTypes to true since we want the columns and the complex data types to be inferred, instead of Auto Loader’s default inferred data type of string. Inferring most columns will give the fidelity of this complex JSON and provide flexibility for querying later. In addition, while inferring the column types is very convenient, we also know there are problematic columns ingested. This is where cloudFiles.schemaHints comes into play, working together with cloudFiles.inferColumnTypes. The combination of the two options allows for inferring most columns’ complex data types while specifying the desired data type (string) for only two of the columns.
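For reference, the schemaLocation, landingZoneLocation and schemaHints variables used in the snippet above would look roughly like the following; the paths are hypothetical and the exact hints live in the accompanying notebook.

```python
# Illustrative values only; the real paths and hints are defined in the notebook.
landingZoneLocation = "s3://my-bucket/sales-orders/landing/"  # hypothetical input path
schemaLocation = "s3://my-bucket/sales-orders/_schemas/"      # where Auto Loader tracks inferred schemas

# Hint only the two problematic columns to be ingested as strings;
# every other column keeps its inferred (possibly complex) type.
schemaHints = "ordered_products.element.promotion_info STRING, clicked_items STRING"
```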

Rows of data for the column ordered_products

Let’s take a closer look at the two problematic columns. From the semi-structured JSON data we use in the notebook, we have identified two columns of problematic data: “ordered_products.element.promotion_info” and “clicked_items”. Hence, we hint that they should come in as strings (see data snippets for one of the columns above: “ordered_products.element.promotion_info”). For these columns, we can easily query the semi-structured JSON in SQL, which we will discuss later. You can see that one of the hints is on a nested column inside an array, which makes this feature really functional on complex schemas!

Handling schema changes over time makes the ingest and data more resilient: Like schema inference, schema evolution (Powerful Feature No. 3) is simple to implement with Auto Loader. All you have to do is set cloudFiles.schemaLocation, which saves the schema to that location in the object storage, and then schema evolution can be accommodated over time. To clarify, schema evolution is when the schema of the ingested data changes and the schema of the Delta Lake table changes accordingly.

For example, in the accompanying notebook, an extra column named fulfillment_days is added to the data ingested by Auto Loader. This column is identified by Auto Loader and applied automatically to the Delta Lake table. Per the documentation, you can change the schema evolution mode to your liking. Here is a quick overview of the supported modes for Auto Loader’s option cloudFiles.schemaEvolutionMode:

  • addNewColumns: The default mode when a schema is not provided to Auto Loader. New columns are added to the schema. Existing columns do not evolve data types.
  • failOnNewColumns: If Auto Loader detects a new column, the stream will fail. It will not restart unless the provided schema is updated, or the offending data file is removed.
  • rescue: The stream runs with the very first inferred or provided schema. Any data type changes or new columns are automatically saved in the rescued data column as _rescued_data in your stream’s schema. In this mode, your stream will not fail due to schema changes.
  • none: The default mode when a schema is provided to Auto Loader. It does not evolve the schema. New columns are ignored, and data is not rescued unless the rescued data column is provided separately as an option.

The example above (also in the notebook) does not include a schema, hence we use the default option .option(“cloudFiles.schemaEvolutionMode”, “addNewColumns”) on readStream to accommodate schema evolution.

Capture bad data in an extra column, so nothing is lost: The rescued data column (Powerful Feature No. 4) is where all unparsed data is kept, which ensures that you never lose data during ETL. If data doesn’t adhere to the current schema and can’t go into its required column, the data won’t be lost with the rescued data column. In this use case (notebook), we did not use this option. To turn on this option, you can specify the following: .option(“cloudFiles.schemaEvolutionMode”, “rescue”). Please see more information here.
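Although the notebook does not use rescue mode, here is a minimal sketch of what it looks like; the variable names follow the earlier examples and are assumptions rather than code from the notebook.

```python
# Run Auto Loader in "rescue" mode: rows that do not match the schema are kept
# in the _rescued_data column instead of failing the stream or being dropped.
rescued_df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .option("cloudFiles.schemaLocation", schemaLocation) \
  .option("cloudFiles.schemaEvolutionMode", "rescue") \
  .load(landingZoneLocation)

# Later, unparsed data can be inspected with a query such as:
#   SELECT _rescued_data FROM bronze_table WHERE _rescued_data IS NOT NULL
```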

Now that we have explored the Auto Loader features that make it great for JSON data and tackled challenges mentioned at the beginning, let’s look at some of the features that make it hassle-free for all ingest:

df.writeStream \
  .format("delta") \
  .trigger(once=True) \
  .option("mergeSchema", "true") \
  .option("checkpointLocation", bronzeCheckPointLocation) \
  .start(bronzeTableLocation)

Continuous vs. scheduled ingest: While Auto Loader is an Apache Spark™ Structured Streaming source, it does not have to run continuously. You can use the trigger once option (Powerful Feature No. 5) to turn it into a scheduled job that turns itself off when all files have been ingested. This comes in handy when you don't need continuously running ingest. It also lets you shorten the schedule's cadence over time and eventually move to continuously running ingest without changing the code.

How to handle state: State is the information needed to start up where the ingestion process left off if the process is stopped. For example, with Auto Loader, the state would include the set of files already ingested. Checkpoints save the state if the ETL is stopped at any point, whether on purpose or due to failure. By leveraging checkpoints, Auto Loader can run continuously and also be a part of a periodic or scheduled job. In the example above, the checkpoint is saved in the option checkpointLocation (Powerful Feature No. 6 – Checkpoints). If the Auto Loader is terminated and then restarted, it will use the checkpoint to return to its latest state and will not reprocess files that have already been processed.

Querying semi-structured and complex structured data

Now that we have our JSON data in a Delta Lake table, let’s explore the powerful ways you can query semi-structured and complex structured data (Let’s tackle the last challenge).

Until this point, we have used Auto Loader to write a Delta Table to a particular location. We can access this table by location in SQL, but for readability, we point an external table to the location using the following SQL code.

CREATE TABLE autoloaderBronzeTable
LOCATION '${c.bronzeTablePath}';

Easily access top level and nested data in semi-structured JSON columns using syntax for casting values:

SELECT fulfillment_days, fulfillment_days:picking,
  fulfillment_days:packing::double, fulfillment_days:shipping.days
FROM autoloaderBronzeTable
WHERE fulfillment_days IS NOT NULL

When ingesting data, you may need to keep it in a JSON string, and some data may not be in the correct data type. In those cases, the syntax in the above example makes querying parts of the semi-structured data simple and easy to read. To double-click on this example, let's look at data in the column fulfillment_days, which is a JSON string column:

Rows of data for the column fulfillment_days

  • Accessing top-level columns: Use a single colon (:) to access the top level of a JSON string column (Powerful Feature No. 7 – Extract JSON Columns). For example, fulfillment_days:picking returns the value 0.32 for the first row above.
  • Accessing nested fields: Use the dot notation to access nested fields (Powerful Feature No. 8 – Dot Notation). For example, fulfillment_days:shipping.days returns the value 3.7 for the first row above.
  • Casting values: Use a double colon (::) and followed by the data type to cast values (Powerful Feature No. 9 – Cast Values). For example, fulfillment_days:packing::double returns the double data type value 1.99 for the string value of packing for the first row above.

Extracting values from semi-structured arrays even when the data is ill-formed:

SELECT *, reduce(all_click_count_array, 0, (acc, value) -> acc + value) as sum
FROM (
 SELECT order_number, clicked_items:[*][1] as all_click_counts,
   from_json(clicked_items:[*][1], 'ARRAY<STRING>')::ARRAY<INT> as all_click_count_array
 FROM autoloaderBronzeTable
)

Unfortunately, not all data comes to us in a usable structure. For example, the column clicked_items is a confusing array of arrays in which the count comes in as a string. Below is a snippet of the data in the column clicked_items:

clicked_items

  • Extracting values from arrays (Powerful Feature No. 10): Use an asterisk (*) to extract all values in a JSON array string. For specific array indices, use a 0-based value. For example, clicked_items:[*][1] returns the string value of ["54","85"].
  • Casting complex array values: After extracting the correct values from the array of arrays, we can use from_json and ::ARRAY<INT> to cast the array into a format that can be summed using reduce. In the end, the first row returns the summed value of 139 (54 + 85). It's pretty amazing how easily we can sum values from ill-formed JSON in SQL!

Aggregations in SQL with complex structured data:

Accessing complex structured data, as well as moving between structured and semi-structured data, has been available for quite some time in Databricks.

SELECT order_date, ordered_products_explode.name  as product_name,
 SUM(ordered_products_explode.qty) as quantity
FROM (
 SELECT DATE(from_unixtime(order_datetime)) as order_date,
   EXPLODE(ordered_products) as ordered_products_explode
 FROM autoloaderBronzeTable
 WHERE DATE(from_unixtime(order_datetime)) is not null
 )
GROUP BY order_date, ordered_products_explode.name
ORDER BY order_date, ordered_products_explode.name

Earlier in the blog, we set option(“cloudFiles.inferColumnTypes”, “true”) in Auto Loader so that the complex data type for the column ordered_products would be inferred when reading the JSON. In the SQL query above, we explored how to access and aggregate data from that complex structured data. Below is an example of one row of the column ordered_products; we want to find the quantity of each product sold on a daily basis. As you can see, both the product and the quantity are nested in an array.

ordered_products

  • Accessing array elements as rows: Use explode on the ordered_products column so that each element is its own row, as seen below.

    ordered_products_explode

  • Accessing nested fields: Use the dot notation to access nested fields in the same manner as semi-structured JSON. For example, ordered_products_explode.qty returns the value 1 for the first row above. We can then group and sum the quantities by date and the product name.

Additional resources: We have covered many topics on querying structured and semi-structured JSON data; you can find more information in the Databricks documentation.

Conclusion

At Databricks, we strive to make the impossible possible and the hard easy. Auto Loader makes ingesting complex JSON use cases at scale easy and possible. The SQL syntax for semi-structured and complex data makes manipulating data easy. Now that you know how to ingest and query complex JSON with Auto Loader and SQL with these 10 powerful features, we can’t wait to see what you build with them.

Try the notebook
(Databricks Runtime 9.1 and above)

The post 10 Powerful Features to Simplify Semi-structured Data Management in the Databricks Lakehouse appeared first on Databricks.

Why Scale Matters in Modern Financial Compliance


Let’s talk regulation. While not the sexiest topic for banks to deal with, working with regulations and compliance are critical to financial institutions’ success. On average 10% of bank revenue is spent on compliance program costs and represents the largest cost for most financial organizations. Additionally, with the rising tide of regulations since the 2008 global financial crisis, financial service institutions (FSIs) and their chief compliance officers are struggling to keep pace with new regulations like the Fundamental Review of the Trading Book (FRTB), 2023 and Comprehensive Capital Analysis and Review (CCAR). These regulations, along with many others, call for better data management and risk assessment.

FRTB is a new regulatory compliance mandate that goes live January 2023. FRTB will force banks around the world to raise their capital reserves and data management practices to make them better prepared to withstand market downturns. To comply with these new measures, banks will need to aggregate data from many disparate sources to build FRTB reports and calculate capital charges, which will prove to be especially challenging for large banks with multiple front-office systems. Banks will also need to evaluate their market risk and capital sensitivity, which must be computed and integrated with the FRTB aggregation element.

IDG reports that additional computational and historical data storage capacity is required to process unprecedented volumes of disparate data and accommodate real-time data ingestion. In fact, estimates from Cloudera cite that FRTB will require a 24x boost in historical data storage and a 30x increase in computational capacity.

Therefore, a key FRTB challenge for financial institutions is the need to overhaul market risk infrastructure technology to dramatically boost scalability and performance. Banks that get it right may save millions of dollars from being tied up in capital reserve requirements. Data analytics at scale is a major pillar for banks in the rising tide of regulation. This blog discusses the need for scale and how the lakehouse provides a modern architecture for data-driven compliance in financial institutions.

Computing for modern compliance

To meet modern compliance requirements, FSIs need to report on growing volumes of data stretching years into the past. Risk calculations that were run weekly or daily must now be run several times per day, and in many cases, in real-time as new data comes in. Additionally, regulations like FRTB require risk teams to scale simulations for thousands if not millions of scenarios in parallel. The volume of data, reporting frequency, and scale of calculations require massive compute power that far outstrips the capabilities of legacy on-premises analytics platforms. As a result, compliance risk teams are unable to analyze all their data nor provide timely calculations to regulators.

Additionally, advanced data analytics is playing an increasingly important role in risk-related use cases like AML, KYC, and fraud prevention. These use cases rely on anomaly detection through massive datasets to find a needle in a haystack. Machine learning (ML) enables risk teams to be more effective by reducing false positives and moving beyond rules-based detection. Unfortunately, traditional data warehouses lack the ML capabilities needed to deliver on these needs. Nor can they scale for the billions of transactions that need to be analyzed to power these predictions. Bolt-on solutions for advanced analytics require data to be copied across platforms, leading to data inconsistencies and slow time to insights.

A modern data architecture

With the advent of FRTB and other regulations, data and compliance teams will find themselves considering a modern architecture when looking to take a data-driven approach to risk and compliance. What will be important is a platform built in the cloud that provides institutions with the elastic scale they need to analyze massive volumes of data for risk and compliance purposes. A modern system that can process petabytes of batch and streaming data in near real time is needed, which may not always be possible on a traditional data lake or warehouse. Teams need to scale simulations for millions of scenarios across their portfolios to help mitigate risk. Intraday and real-time reporting on controls for CCAR, FRTB and other regulations becomes possible.

Fraud and AML detection is a big component of regulatory compliance that involves anomaly detection. As mentioned earlier, anomaly detection identifies malicious activity hidden in mass transaction data. For anomaly detection at scale, where voluminous datasets are ingested and processed, FIs need to perform advanced analytics and AI-driven monitoring. This allows FSIs looking at thousands or even billions of transactions to detect anomalies, new and unknown patterns, and threats.

With advanced analytics, FIs can also correlate isolated signals from threats and, therefore, reduce false positives while improving the quality of alerts so they can focus on relevant, high-risk fraud, AML, KYC and compliance cases. Additionally, data and AI allow teams to automate repetitive compliance tasks and augment intelligence for investigations on massive and changing datasets, focusing on high-risk cases to better predict risky events and drive agility within the compliance team.

Risk and compliance teams need an architecture that cuts through all the complexities of ingesting and processing millions of data points to implement anomaly detection at scale — this lends itself well to fraud prevention. This enables teams to move from rules to machine learning to respond fast and reduce operational costs associated with fraud.

Delta Lake and scale

We discussed an architecture that resembles a Lakehouse paradigm. What many modern FIs are using is Delta Lake, an open-source data management layer that simplifies all aspects of data management for ML. Delta Lake ingests and processes data with reliability and performance at scale, giving the Lakehouse the ability to scale rapidly to, in principle, unlimited data sets. The lakehouse and Delta engine together provide a robust data foundation for ETL and advanced analytics for creating compliance applications in an elastic computing environment. Delta Lake provides advanced analytics in addition to data ETL, enabling ML and AI on the platform. Scalable analytics and AI power compliance systems to detect and learn new patterns, helping streamline compliance alert systems to near-perfection and addressing the issue of false positives. An AI system can automate repetitive tasks and can be engineered to detect anomalies and patterns that you're not looking for, achieving more accuracy and predicting threats before they occur. For example, it can prevent two analysts from investigating the same two alerts that are part of the same threat (contextualizing incidents and correlating isolated signals) to reduce the amount of work and improve detection.

Financial institutions are increasingly reporting that current data systems for compliance cannot perform advanced analytics in a live setting that requires scale. The Lakehouse architecture can help simplify and build scalable risk and compliance solutions within a highly regulated environment. FINRA uses the Lakehouse platform to deter misconduct by enforcing rules and detecting and preventing wrongdoing in the U.S. capital markets. With the Lakehouse, FINRA can quickly iterate on ML models and scale detection efforts to hundreds of billions of market events per day on a unified platform.

Learn more about how to modernize compliance on our Smarter risk and compliance with data and AI hub.

--

Try Databricks for free. Get started today.

The post Why Scale Matters in Modern Financial Compliance appeared first on Databricks.

Snowflake Claims Similar Price/Performance to Databricks, but Not So Fast!


On Nov 2, 2021, we announced that we set the official world record for the fastest data warehouse with our Databricks SQL lakehouse platform. These results were audited and reported by the official Transaction Processing Performance Council (TPC) in a 37-page document available online at tpc.org. We also shared a third-party benchmark by the Barcelona Supercomputing Center (BSC) outlining that Databricks SQL is significantly faster and more cost effective than Snowflake.

A lot has happened since then: many congratulations, some questions, and some sour grapes. We take this opportunity to reiterate that we stand by our blog post and the results: Databricks SQL provides superior performance and price performance over Snowflake, even on data warehousing workloads (TPC-DS).

Snowflake’s response: “lacking integrity”?

Snowflake responded 10 days after our publication (last Friday) claiming that our results were “lacking integrity.” They then presented their own benchmarks, claiming that their offering has roughly the same performance and price at $267 as Databricks SQL at $242. At face value, this ignores the fact that they are comparing the price of their cheapest offering with that of our most expensive SQL offering. (Note that Snowflake’s “Business Critical” tier is 2x the cost of the cheapest tier.) They also gloss over the fact that Databricks can use spot instances, which most customers use, and bring the price down to $146. But none of this is the focus of this post.

The gist of Snowflake’s claim is that they ran the same benchmarks as BSC and found that they could run the whole benchmark in 3,760 seconds vs 8,397 seconds that BSC measured. They even urged readers to sign up for an account and try it out for themselves. After all, the TPC-DS dataset comes with Snowflake out of the box and they even have a tutorial on how to run it. So it should be easy to verify the results. We did exactly that.

First, we want to commend Snowflake for following our lead and removing the DeWitt clause, which had prohibited competitors from benchmarking their platform. Thanks to this, we were able to get a trial account and verify the basis for claims of “lacking integrity”.

Reproducing TPC-DS on Snowflake

We logged into Snowflake and ran Tutorial 4 for TPC-DS. The results in fact closely matched what they claimed at 4,025 seconds, indeed much faster than the 8,397 seconds in the BSC benchmark. But what unfolded next is much more interesting.

While performing the benchmarks, we noticed that the Snowflake pre-baked TPC-DS dataset had been recreated two days after our benchmark results were announced. An important part of the official benchmark is to verify the creation of the dataset. So, instead of using Snowflake’s pre-baked dataset, we uploaded an official TPC-DS dataset and used identical schema as Snowflake uses on its pre-baked dataset (including the same clustering column sets), on identical cluster size (4XL). We then ran and timed the POWER test three times. The first cold run took 10,085 secs, and the fastest of the three runs took 7,276 seconds. Just to recap, we loaded the official TPC-DS dataset into Snowflake, timed how long it takes to run the power test, and it took 1.9x longer (best of 3) than what Snowflake reported in their blog.

These results can easily be verified by anyone. Get a Snowflake account, use the official TPC-DS scripts to generate a 100 TB data warehouse. Ingest those files into Snowflake. Then run a few POWER runs and measure the time for yourself. We bet the results will be closer to 7000 seconds, or even higher numbers if you don’t use their clustering columns (see next section). You can also just run the POWER test on the dataset they ship with Snowflake. Those results will likely be closer to the time they reported in their blog.

Why official TPC-DS

Why is there such a big discrepancy between running TPC-DS on the pre-baked dataset in Snowflake vs loading the official dataset into Snowflake? We don't exactly know. But how you lay out your data significantly impacts TPC-DS and, in general, all workloads. In most systems, clustering or partitioning the data for a specific workload (e.g., sorting by the combination of fields used in a query) can improve performance for that workload, but such optimizations come with additional cost. That time and cost need to be included in the benchmark results.

It is for this reason that the official benchmark requires you to report the time it takes to load the data into the data warehouse so that they can correctly account for any time and cost the system takes to optimize the layout. This time can be substantially more than the POWER test queries for some storage schemes. The official benchmark also includes data updates and maintenance, just like real-world datasets and workloads (how often do you query a dataset that never changes?). This is all done to prevent the following scenario: a system spends massive resources optimizing a static dataset offline for an exact set of immutable workloads, and then can run those workloads super quickly.

In addition, the official benchmark requires reproducibility. That’s why you can find all the code to reproduce our record in the submission.

This brings us to our final point. We agree with Snowflake that benchmarks can quickly devolve into industry players “adding configuration knobs, special settings, and very specific optimizations that would improve a benchmark”. Everyone looks really good in their own benchmarks. So instead of taking any one vendor’s word on how good they are, we challenge Snowflake to participate in the official TPC benchmark.

Customer-obsessed benchmarking

When we decided to participate in this benchmark, we set a constraint for our engineering team that they should only use commonly applied optimizations done by virtually all our customers, unlike past entries. They were not allowed to apply any optimizations that would require deep understanding of the dataset or queries (as done in the Snowflake pre-baked dataset, with additional clustering columns). This matches real world workloads and what most customers would like to see (a system that achieves great performance without tuning).

If you read our submission in detail, you can find the reproducible steps that match how a typical customer would like to manage their data. Minimizing the effort to get productive with a new dataset was one of our top design goals for Databricks SQL.

Conclusion

A final word from us at Databricks. As co-founders, we care deeply about delivering the best value to our customers, and the software we build to solve their business needs. Benchmark results that don’t resonate with our understanding of the world can lead to an emotional or visceral reaction. We try to not let that get the best of us. We will seek the truth, and publish end-to-end results that are verifiable. We therefore won’t accuse Snowflake of lacking integrity in the results they published in their blog. We only ask them to verify their results with the official TPC council.

Our primary motivation to participate in the official TPC data warehousing benchmark was not to prove which data warehouse is faster or cheaper. Rather, we believe that every enterprise should be able to become data driven the way the FAANG companies are. Those companies don't build on data warehouses. They instead have a much simpler data strategy: store all data (structured, text, video, audio) in open formats and use a single copy towards all kinds of analytics, be it data science, machine learning, real-time analytics, or classic business intelligence and data warehousing. They don't do everything in just SQL. But rather, SQL is one of the key tools in their arsenal, together with Python, R, and a slew of other tools in the open-source ecosystem that leverage their data. We call this paradigm the Data Lakehouse.

The Data Lakehouse, unlike Data Warehouses, has native support for Data Science, Machine Learning, and real-time streaming. But it also has native support for SQL and BI. Our goal was to dispel the myth that the Data Lakehouse cannot have best-in-class price and performance. Rather than making our own benchmarks, we sought the truth and participated in the official TPC benchmark.

We are therefore very happy that the Data Lakehouse paradigm provides superior performance and price over data warehouses, even on classic data warehousing workloads (TPC-DS). This will benefit enterprises who no longer need to maintain multiple data lakes, data warehouses, and streaming systems to manage all of their data. This simple architecture enables them to redeploy their resources toward solving the business needs and problems that they face every day.

--

Try Databricks for free. Get started today.

The post Snowflake Claims Similar Price/Performance to Databricks, but Not So Fast! appeared first on Databricks.

Evolution of the SQL language at Databricks: ANSI standard by default and easier migrations from data warehouses


Today, we are excited to announce that Databricks SQL will use the ANSI standard SQL dialect by default. This follows the announcement earlier this month about Databricks SQL’s record-setting performance and marks a major milestone in our quest to support open standards. This blog post discusses how this update makes it easier to migrate your data warehousing workloads to Databricks lakehouse platform. Moreover, we are happy to announce improvements in our SQL support that make it easier to query JSON and perform common tasks more easily.

Migrate easily to Databricks SQL

We believe Databricks SQL is the best place for data warehousing workloads, and it should be easy to migrate to it. Practically, this means changing as little of your SQL code as possible. We do this by switching out the default SQL dialect from Spark SQL to Standard SQL, augmenting it to add compatibility with existing data warehouses, and adding quality control for your SQL queries.

Standard SQL we can all agree on

With the SQL standard, there are no surprises in behavior or unfamiliar syntax to look up and learn.

String concatenation is such a common operation that the SQL standard designers gave it its own operator. The double-pipe operator is simpler than having to perform a concat() function call:

SELECT
  o_orderstatus || ' ' || o_shippriority as order_info
FROM
  orders;

The FILTER clause, which has been in the SQL standard since 2003, limits rows that are evaluated during an aggregation. Most data warehouses require a complex CASE expression nested within the aggregation instead:

SELECT
  COUNT(DISTINCT o_orderkey) as order_volume,
  COUNT(DISTINCT o_orderkey) FILTER (WHERE o_totalprice > 100.0) as big_orders -- using rows that pass the predicate
FROM orders;

SQL user-defined functions (UDFs) make it easy to extend and modularize business logic without having to learn a new programming language:

CREATE FUNCTION inch_to_cm(inches DOUBLE)
RETURNS DOUBLE RETURN 2.54 * inches;

SELECT inch_to_cm(5); -- returns 12.70

Compatibility with other data warehouses

During migrations, it is common to port hundreds or even thousands of queries to Databricks SQL. Most of the SQL you have in your existing data warehouse can be dropped in and will just work on Databricks SQL. To make this process simpler for customers, we continue to add SQL features that remove the need to rewrite queries.

For example, a new QUALIFY clause to simplify filtering window functions makes it easier to migrate from Teradata. The following query finds the five highest-spending customers in each day:

SELECT
  o_orderdate,
  o_custkey,
  RANK() OVER (PARTITION BY o_orderdate ORDER BY SUM(o_totalprice) DESC) AS rank
FROM orders
GROUP BY o_orderdate, o_custkey
QUALIFY rank <= 5; -- applies after the window function

We will continue to increase compatibility features in the coming months. If you want us to add a particular SQL feature, don’t hesitate to reach out.

Quality control for SQL

With the adoption of the ANSI SQL dialect, Databricks SQL now proactively alerts analysts to problematic queries. These queries are uncommon, but they are best caught early so you can keep your lakehouse fresh and full of high-quality data. Below is a selection of such changes (see our documentation for a full list), with a short example after the list.

  • Invalid input values when casting a STRING to an INTEGER
  • Arithmetic operations that cause an overflow
  • Division by zero
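
As a quick illustration (a simplified sketch, not a query drawn from a specific workload), an invalid cast now surfaces an error right away instead of silently returning NULL, while functions such as try_cast remain available when a best-effort conversion is what you actually want:

SELECT CAST('hello' AS INTEGER);      -- raises an error under the ANSI dialect
SELECT try_cast('hello' AS INTEGER);  -- returns NULL when the value cannot be converted
SELECT 10 / 0;                        -- division by zero now raises an error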

Easily and efficiently query and transform JSON

If you are an analyst or data engineer, chances are you have worked with semi-structured data in the form of JSON. Databricks SQL natively supports ingesting, storing and efficiently querying JSON. With this release, we are happy to announce improvements that make it easier than ever for analysts to query JSON.

Let’s take a look at an example of how easy it is to query JSON in a modern manner. In the query below, the raw column contains a blob of JSON. As demonstrated, we can query and easily extract nested fields and items from an array while performing a type conversion:

SELECT
  raw:customer.full_name,     -- nested field
  raw:customer.addresses[0],  -- array
  raw:customer.age::integer   -- type cast
FROM customer_data;

With Databricks SQL you can easily run these queries without sacrificing performance or having to extract the columns out of JSON into separate tables. This is just one way in which we are excited to make life easier for analysts.

Simple, elegant SQL for common tasks

We have also spent time doing spring cleaning on our SQL support to make other common tasks easier. There are too many new features to cover in a blog post, but here are some favorites.

Case-insensitive string comparisons are now easier:

SELECT
  *
FROM
  orders
WHERE
  o_orderpriority ILIKE '%urgent'; -- case insensitive string comparison

Shared WINDOW frames save you from having to repeat a WINDOW clause. Consider the following example where we reuse the win WINDOW frame to calculate statistics over a table:

SELECT
  o_totalprice                         AS price,
  round(avg(o_totalprice) OVER win, 1) AS avg_price,
  min(o_totalprice) OVER win           AS min_price,
  max(o_totalprice) OVER win           AS max_price,
  count(1) OVER win                    AS order_count
FROM orders
-- this is a shared WINDOW frame
WINDOW win AS (ORDER BY o_orderdate ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING);

Multi-value INSERTs make it easy to insert multiple values into a table without having to use the UNION keyword, which is common in most other data warehouses:

CREATE TABLE employees
(name STRING, dept STRING, salary INT, age INT);

-- this is a multi-valued INSERT
INSERT INTO employees
VALUES ('Lisa', 'Sales', 10000, 35),
       ('Evan', 'Sales', 32000, 38),
       ('Fred', 'Engineering', 21000, 28);

Lambda functions are parameterized expressions that can be passed to certain SQL functions to control their behavior. The example below passes a lambda to the transform function, concatenating the index and value of each element of an array (arrays themselves are an example of structured types in Databricks SQL).

-- this query returns ["0: a","1: b","2: c"]
SELECT
  transform(
    array('a','b','c'),
    (x, i) -> i::string || ': ' || x -- this is a lambda function
  );

Update data easily with standard SQL

Data is rarely static, and it is common to update a table based on changes in another table. We are making it easy for users to deduplicate data in tables, create slowly-changing data and more with a modern, standard SQL syntax.

Let’s take a look at how easy it is to update a customers table, merging in new data as it arrives:

MERGE INTO customers    -- target table
USING customer_updates  -- source table with updates
ON customers.customer_id = customer_updates.customer_id
WHEN MATCHED THEN
  UPDATE SET customers.address = customer_updates.address

Needless to say, you do not sacrifice performance with this capability as table updates are blazing fast. You can find out more about the ability to update, merge and delete data in tables here.

Taking it for a spin

We understand language dialect changes can be disruptive. To facilitate the rollout, we are happy to announce a new feature, channels, to help customers safely preview upcoming changes.

When you create or edit a SQL endpoint, you can now choose a channel. The “current” channel contains generally available features, while the “preview” channel contains upcoming features like the ANSI SQL dialect.

To test out the ANSI SQL dialect, click SQL Endpoints in the left navigation menu, click on an endpoint and change its channel. Changing the channel will restart the endpoint, and you can always revert this change later. You can now test your queries and dashboards on this endpoint.

You can also test the ANSI SQL dialect by using the SET command, which enables it just for the current session:

SET ANSI_MODE = true; -- only use this setting for testing

SELECT CAST('a' AS INTEGER);

Please note that we do NOT recommend setting ANSI_MODE to false in production. The parameter will be removed in the future, so you should only change it temporarily while testing and updating existing queries.

The future of SQL at Databricks is open, inclusive and fast

Databricks SQL already set the world record in performance, and with these changes, it is standards compliant. We are excited about this milestone, as it is key in dramatically improving usability and simplifying workload migration from data warehouses over to the lakehouse platform.

Please learn more about the changes included in the ANSI SQL dialect. Note that the ANSI dialect is not yet enabled by default for existing or new clusters in the Databricks Data Science and Engineering workspace. We are working on that next.

--

Try Databricks for free. Get started today.

The post Evolution of the SQL language at Databricks: ANSI standard by default and easier migrations from data warehouses appeared first on Databricks.

Databricks’ Open Source Genomics Toolkit Outperforms Leading Tools


Genomic technologies are driving the creation of new therapeutics, from RNA vaccines to gene editing and diagnostics. Progress in these areas motivated us to build Glow, an open-source toolkit for genomics machine learning and data analytics. The toolkit is natively built on Apache Spark™, the leading engine for big data processing, enabling population-scale genomics.

The project started as an industry collaboration between Databricks and the Regeneron Genetics Center. The goal is to advance research by building the next generation of genomics data analysis tools for the community. We took inspiration from bioinformatics libraries such as Hail, Plink and bedtools, married with best-in-class techniques for large-scale data processing. Glow is now 10x more computationally efficient than industry-leading tools for genetic association studies.

The vision for Glow and genomic analysis at scale

The primary bottleneck slowing the growth in genomics is the complexity of data management and analytics. Our goal is to make it simple for data engineers and data scientists who are not trained in bioinformatics to contribute to genomics data processing in distributed cloud computing environments. Easing this bottleneck will in turn drive up the demand for more sequencing data in a positive feedback loop.
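
As a rough sketch of what this looks like in practice (the VCF path below is a placeholder, and the exact setup can vary by Glow version), a data engineer registers Glow on an existing Spark session and then reads variant data like any other DataFrame:

import glow

# enable Glow's functions on the existing Spark session
spark = glow.register(spark)

# read VCF files into a Spark DataFrame (the path is a placeholder)
variants = spark.read.format("vcf").load("/path/to/vcfs/*.vcf.gz")
variants.select("contigName", "start", "referenceAllele", "alternateAlleles").show(5)

From there, the familiar Spark DataFrame APIs apply to variant data just as they do to any other dataset on Databricks.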

When to use Glow

Glow’s domain of applicability is the aggregation and mining of genetic variant data, particularly for analyses that are run iteratively many times or that take more than a few hours to complete, such as:

  1. Annotation pipelines
  2. Genetic association studies
  3. GPU-based deep learning algorithms
  4. Transforming data into and out of bioinformatics tools.

As an example, Glow includes a distributed implementation of the Regenie method. You can run Regenie on a single node, which is recommended for academic scientists. But for industrial applications, Glow is the world’s most cost effective and scalable method of running thousands of association tests. Let’s walk through how this works.

Benchmarking Glow against Hail

We focused on genetic association studies for benchmarks because they are the most computationally intensive steps in any analytics pipeline. Glow is >10x more performant for Firth regression relative to Hail without trading off accuracy (Figure 1). We were able to achieve this performance because we apply an approximate method first, restricting the full method to variants with a suggestive association with disease (P < 0.05). Firth regression is the most powerful method on biobank data because it reduces bias from small counts of individuals with rare diseases. Further benchmarks can be found on the Glow documentation.

Databricks SQL dashboard showing Glow and Hail benchmarks on a simulated dataset.

Figure 1: Databricks SQL dashboard showing Glow and Hail benchmarks on a simulated dataset of 500k samples, and 250k variants (1% of UK Biobank scale) run across a 768 core cluster with 48 memory-optimized virtual machines. We used Glow v1.1.0 and Hail v0.2.76. Relative runtimes are shown. To reproduce these benchmarks, please download the notebooks from the Glow Github repository and use the associated docker containers to set up the environment.

Glow on the Databricks Lakehouse Platform

We had a small team of engineers working on a tight schedule to develop Glow. So how were we able to catch up with the world’s leading biomedical research institute, the brain power behind Hail? We did it by developing Glow on the Databricks Lakehouse Platform in collaboration with industry partners. Databricks provides infrastructure that makes you productive with genomics data analytics. For example, you can use Databricks Jobs to build complex pipelines with multiple dependencies (Figure 2).

Furthermore, Databricks is a secure platform trusted by both Fortune 100 and healthcare organizations with their most sensitive data, adhering to principles of data governance (FAIR), security and compliance (HIPAA and GDPR).

Glow on the Databricks Lakehouse Platform

Figure 2: Glow on the Databricks Lakehouse Platform

What lies in store for the future?

Glow is now at a v1 level of maturity, and we are looking to the community to help build and extend it. There are lots of exciting things in store.

Genomics datasets are so large that batch processing with Apache Spark can hit capacity limits of certain cloud regions. This problem will be solved by the open Delta Lake format, which unifies batch and stream processing. By leveraging streaming, Delta Lake enables incremental processing of new samples or variants, with edge cases quarantined for further analysis. Combining Glow with Delta Lake will solve the “n+1 problem” in genomics.

A further problem in genomics research is data explosion. There are over 50 copies of the Cancer Genome Atlas on Amazon Web Services alone. The solution proposed today is a walled garden, managing datasets inside genomics domain platforms. This solves data duplication, but then locks data into platforms.

This friction will be eased through Delta Sharing, an open protocol for secure real-time exchange of large datasets, which will enable secure data sharing between organizations, clouds and domain platforms. Unity Catalog will then make it easy to discover, audit and govern these data assets.

We’re just at the beginning of the industrialization of genomics data analytics. To learn more, please see the Glow documentation, tech talks on YouTube, and workshops.

--

Try Databricks for free. Get started today.

The post Databricks’ Open Source Genomics Toolkit Outperforms Leading Tools appeared first on Databricks.


Accenture and Databricks Lakehouse Accelerate Digital Transformation


This is a collaborative post from Accenture and Databricks. We thank Matt Arellano, Managing Director, Global Data & AI Ecosystem Lead — Accenture, for his contributions.

 
To keep pace with the competition and address customer demands, companies are looking to quickly bring new capabilities to the market, boost innovation and scale more efficiently. Customers set out to achieve these results by leveraging the cloud to set a strong foundation for their digital transformation and deliver greater value at speed and scale. But today, most businesses have significant investments and data already stored on their on-premises systems. Currently, only 20% of businesses are in the cloud; moving the other 80% rapidly and cost-effectively is a big change that requires bold new solutions and services.

For a company to successfully move to a cloud provider, such as Amazon AWS, Microsoft Azure or Google Cloud Platform, the first thing it needs to address is the current state of its data, which is difficult to access, maintain and manage when it is siloed and fragmented across legacy systems. Therefore, IT needs a unified cloud data architecture that supports various data types and is easily manageable, efficient and future-proof.

Having a streamlined data platform brings together everything needed to innovate — data, people, partners, processes and technology. To facilitate innovation, organizations can leverage the Databricks Lakehouse Platform, supported by an open architecture that combines the best data management features of data lakes and data warehouses on low cost storage. By implementing a lakehouse architecture, organizations are able to derive value from their data lake quickly and easily by sharing data securely, analyzing data at scale and applying machine-learning models effectively.

Accenture offerings built on Lakehouse

Through our partnership with Accenture, we’ve helped many businesses architect and migrate to the data lakehouse. Organizations such as Navy Federal Credit Union and Nationwide have benefitted from the Databricks Lakehouse Platform with Accenture’s unparalleled expertise, services and accelerators. Many of our joint customers have been able to innovate faster, break down data silos with agile and adaptive processes, and enable data-driven decision making to solve real-world problems.

Watch this video to see how Databricks and Accenture are partnering to help clients leverage insights from data and optimize their business models.

We continue to heavily invest and partner together on training and development in order to help businesses innovate toward their future and become disruptive. This means a dedicated innovation center staffed by Accenture Databricks experts who are working every day to create reusable assets, accelerators, and solutions to ensure a faster time to market. For example, Accenture and Databricks have partnered together to help clients operationalize ML at scale and adopt AI throughout their business at an innovation rate 3x faster than the typical product life cycle. While every solution is already fine-tuned for a specific industry and function, each can be quickly tailored to solve unique client challenges.

Together, we are just scratching the surface of the data platform evolution and are excited to continue to partner with Accenture to help businesses innovate toward their future.

Learn more

The Databricks and Accenture partnership combines Accenture’s AI expertise, data services and IP with the Databricks Lakehouse Platform. Our scalable, modular solutions minimize time to market and maximize business impact. To learn how we can help you meet your business goals and achieve a faster time to value, please email us at cloud.data.team@accenture.com.

Additional resources

Please email us at cloud.data.team@accenture.com

--

Try Databricks for free. Get started today.

The post Accenture and Databricks Lakehouse Accelerate Digital Transformation appeared first on Databricks.

Now Generally Available: Introducing Databricks Partner Connect to Discover and Connect Popular Data and AI Tools to the Lakehouse


Databricks is thrilled to announce Partner Connect, a one-stop portal for customers to quickly discover a broad set of validated data, analytics, and AI tools and easily integrate them with their Databricks lakehouse across multiple cloud providers. Additionally, Partner Connect transforms how technology partners integrate with Databricks by providing deep integrations that easily reach thousands of customers natively within the platform.

With this announcement, Databricks establishes a complete ecosystem that connects customers to native partner solutions that significantly extend the capabilities of lakehouse architecture. Initially, Databricks Partner Connect includes integrations with Fivetran, Microsoft Power BI, Tableau, Prophecy, Rivery, and Labelbox with Airbyte, Blitzz, dbt Labs, and many more to come in the months ahead.

Partner Connect, a one-stop portal for customers to quickly discover a broad set of validated data, analytics, and AI tools and easily integrate them with their Databricks lakehouse across multiple cloud providers.

The need to unify the entire data ecosystem

Enterprises want to drive complexity out of their data infrastructure and adopt more open technologies to take better advantage of analytics and AI. Data lakehouse architecture has put thousands of customers on this path for their analytics and AI workloads. But, the data ecosystem is vast, and no one vendor can accomplish everything. Every enterprise has a multitude of tools and data sources that need to be connected, secured, and governed to allow every user within an organization to find, use, and share data-driven insights. Stitching everything together has historically been a burden on the customer and partners, making it very complicated and expensive to execute at any scale.

Partner Connect solves this challenge by making it easy for customers to integrate data, analytics, and AI tools directly within their Databricks lakehouse. In just a few clicks, Partner Connect will automatically configure resources such as clusters, tokens, and connection files for customers to connect with data ingestion, prep and transformation, and BI and ML tools.

Not only does Partner Connect allow customers to integrate the data tools they already use, but it also enables them to discover new, pre-validated solutions from Databricks partners that complement their expanding analytics needs.

Partner Connect allows customers to integrate the data tools they already use, but it also enables them to discover new, pre-validated solutions from Databricks partners that complement their expanding analytics needs.

How does it work?

Partner Connect is designed with simplicity in mind, starting with easy access via the left navigation bar in Databricks. Let’s take the example of ingestion. Partner Connect offers pre-built integrations with partners such as Fivetran and Rivery. Resources such as clusters, SQL endpoints and security tokens are automatically created, and connection details are sent to Fivetran/Rivery to facilitate creating a trial account for customers.

Customers can finish signing up for a trial account on the partner’s website or directly log in if they already used Partner Connect to create a trial account. Once they log in, they will see that Databricks is already configured as a destination in the partner portal and ready to be used.

Customers can finish signing up for a trial account on the partner’s website or directly log in if they already used Partner Connect to create a trial account.

Ingestion partners such as Fivetran and Rivery unlock access to hundreds of data sources, including custom connectors to enterprise-wide source systems such as databases and SaaS applications (e.g., Salesforce, NetSuite, SAP, Marketo, Facebook, Google Analytics, and many others), that would otherwise take months of engineering resources to bring into the lakehouse.

Let’s take another example of BI integrations within Partner Connect. To connect Databricks SQL endpoints or interactive clusters, users can simply select Tableau or Power BI and download the connection file. Once they click on this connection file, the BI solution is automatically launched, and Databricks is configured as a destination. These BI integrations in Partner Connect give data analysts a self-service way to connect Databricks to their familiar BI solutions, reducing their dependence on data engineers and admins. Organizations can leverage existing investments in BI tools, so data analysts can continue to use the tools they have already purchased to visualize data in their lakehouse.

Example BI integration within Partner Connect

Databricks partners

The vision of the data lakehouse is to serve all the ways a customer might want to derive value from data, which makes Databricks partners a critical part of every customer’s lakehouse. By building and publishing innovative and tightly integrated solutions that customers can easily connect to their lakehouse, partners become instantly discoverable and accessible to customers.

Databricks partners who want to tap into the opportunity to reach new customers – and grow their existing customers – are developing deep, first-class integrations with Databricks using direct APIs and dedicated support. Partner Connect underscores Databricks’ partner-first approach with an incentives program that drives business to partners and rewards Databricks sellers for delivering new and existing customer opportunities to grow the partner business. For more information, read our blog Build Your Business on Databricks with Partner Connect.

Getting started

Partner Connect is now available for Databricks customers at no additional cost. To learn more about using Partner Connect as a customer, click here.
If you are an analytics solution company and would like your product on Partner Connect, please visit the Partner Connect registration page.

--

Try Databricks for free. Get started today.

The post Now Generally Available: Introducing Databricks Partner Connect to Discover and Connect Popular Data and AI Tools to the Lakehouse appeared first on Databricks.

Build Your Business on Databricks With Partner Connect


At Databricks we believe that to create the ultimate customer experience, we must leverage the work of more than just our employees and create a platform others can extend. To see the importance of this, think of the apps on your phone. Were they all made by Apple or Google? How much less valuable would your phone be if that’s all you had?

That’s why today we are announcing the launch of Partner Connect, which brings together the best data, analytics, and AI tools in one place for our customers to discover and integrate with. We’ve designed Partner Connect to meet the needs of both customers and partners, because that’s the only way we can create a virtuous cycle that will continue to grow and generate value for both groups.

Partner Connect is not just a list of logos, it’s a deep integration into Databricks that is directly visible and accessible to Databricks customers.

Partner Connect is not just a list of logos, it’s a deep integration into Databricks that is directly visible and accessible to Databricks customers, and this is what makes it so valuable. By making their products available in Partner Connect, our partners can expect three key benefits that will help them build their businesses.

New leads

With thousands of existing customers using the Databricks Lakehouse Platform and more joining every day, our partners in Partner Connect can expect significantly more inbound connections. Whether that means new customers or increased consumption from your existing customers, it’s a win either way.

Deep integration

Partner Connect creates a seamless experience for Databricks customers to create a new free trial account of our partners’ products AND automatically connect that account to their Databricks workspace. That means Databricks customers can find your product, create an account in your system, and play with your product with their Databricks Lakehouse already connected. How does it work? Partner Connect was built to invoke our partners’ APIs to establish connections, create accounts and pass connection details back to Databricks. The hardest parts of onboarding have been automated.

Example BI integration within Partner Connect

Customers can finish signing up for a trial account on the partner’s website or directly log in if they already used Partner Connect to create a trial account.

Visibility & confidence

We want our sales teams and our partners to work and co-sell together. Putting your product in Partner Connect serves as a clear signal to the market that your product’s connection to Databricks is built on a deep, quality integration. That means customer champions, our sales teams, and our partners can recommend it with full confidence, and that makes all the difference.

We look forward to working with you! If you would like to discuss adding your product to Partner Connect, please visit the Partner Connect registration page.

--

Try Databricks for free. Get started today.

The post Build Your Business on Databricks With Partner Connect appeared first on Databricks.

Ray on Databricks


Ray is an open-source project first developed at RISELab that makes it simple to scale any compute-intensive Python workload. With a rich set of libraries and integrations built on a flexible distributed execution framework, Ray brings new use cases and simplifies the development of custom distributed Python functions that would normally be complicated to create.

Running Ray on top of an Apache Spark™ cluster makes it possible to distribute the internal code of PySpark UDFs, as well as Python code that previously could only run on the driver node. It also adds the ability to use Ray’s scalable reinforcement learning library, RLlib, out of the box. These abilities allow for a wide array of new applications.

Why is another distributed framework needed on top of Spark?

There are two ways to think about how to distribute a function across a cluster. The first is where parts of a dataset are split up and a function acts on each part and collects the results. This is called data parallelism, which is the most common form in big data, and the best example is Apache Spark. Modern data parallelism frameworks typically expose DataFrame functions and are not meant for building the low-level internals of distributed operations, such as hand-crafted functions outside of UDFs (user-defined functions).

Data parallelism is the most common way to distribute tasks across a cluster. Here,  parts of the dataset are split up and a function acts on each part and collects the results.

Figure 1: Data Parallelism

Another form of distributing functions is when the data set is small, but the operations are complicated enough that simply running the same function on different partitions doesn’t solve the problem. This is known as task parallelism or logical parallelism and describes when many functions can be run concurrently and are set up in complicated pipelines using parameter servers and schedulers to coordinate dependencies. This type of parallelism is mostly found in HPC (High Performance Computing) or custom distributed jobs that aren’t possible with DataFrame operations. Often, these frameworks are meant for designing distributed functions from scratch. Examples include physics simulations, financial trading algorithms and advanced mathematical computations.

Task parallelism is another way to distribute tasks across a cluster and is typically reserved for more complex use cases. Here, many tasks can be run concurrently within a complicated pipeline.

Figure 2: Task Parallelism

However, many task-parallel and traditional HPC libraries are written for C++ instead of Python workloads (which is required in many data science pipelines) and don’t generalize enough to accommodate custom job requirements such as advanced design patterns. They may also be made for hardware optimization of multi-core CPU architectures, such as improving the performance of linear algebra operations on a single machine, instead of distributing functions across a cluster. Such hardware libraries could also be created for specialized hardware instead of commodity cloud hardware. The main difficulty with the majority of task parallel libraries is the level of complexity required to create dependencies between tasks and the amount of development time. To overcome these challenges, many open-source Python libraries have been developed that combine the simplicity of Python with the ability to scale custom tasks.

One of the best recent examples of task or logical parallelism in Python is Ray. Its simplicity, low-latency distributed scheduling and ability to quickly create very complicated dependencies between distributed functions solve the issues of generality, scalability and complexity. See A Gentle Introduction to Ray for more details.

A Simple Introduction to Ray Architecture

Ray Architecture

Figure 3: Ray Architecture

An important distinction of Ray’s architecture is that there are two levels of abstraction for how to schedule jobs. Ray treats the local system as a cluster, where separate processes, or Raylets, function like a node in the typical big data terminology. There is also a global scheduler, which can treat the separate machines as nodes. This allows for efficient scaling from the single node or laptop level for development all the way up to the massive scale of cloud computing. As each node has its own local scheduler that can also communicate with the global scheduler, a task can be sent from any node to the rest of the cluster. This feature lets the developer create remote tasks that can trigger other remote tasks and bring many design patterns of object-oriented programming to distributed systems, which is vital for a library designed for creating distributed applications from scratch. There is also a node that manages the global control store, which keeps track of tasks, functions, events, and other system-level metadata.

Data flow diagram between worker nodes and the GCS

Figure 4: Data flow diagram between worker nodes and the GCS

The object store in Ray is a distributed object store built on Apache Arrow that manages the shared functions, objects and tasks used by the cluster. One of the most important aspects of Ray is that its object store is in-memory with a hierarchy of memory management for either evicting or persisting objects (in Ray v1.2+) that cause a memory spill. This high-speed in-memory system allows for high performance communication at large scale, but requires that the instances have large amounts of memory to avoid memory spills.

Take the following simple example of a remote task that calls another remote task within the function. The program’s dependencies are represented by the task graph and the physical execution shows how the object store holds common variables and results while functions are executed on separate worker nodes.

Example of the relation of the driver and worker nodes and object store in application

Figure 5: Example of the relation of the driver and worker nodes and object store in application.
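
A minimal sketch of this pattern (illustrative only, not the exact program behind Figure 5) places a shared input into the object store once and has one remote task fan work out to other remote tasks:

import ray

ray.init()  # start or connect to a Ray runtime

@ray.remote
def square(x):
    return x * x

@ray.remote
def sum_of_squares(values):
    # a remote task that launches other remote tasks
    futures = [square.remote(v) for v in values]
    return sum(ray.get(futures))

values_ref = ray.put([1, 2, 3, 4])  # store the input once in the shared object store
print(ray.get(sum_of_squares.remote(values_ref)))  # prints 30; Ray resolves the object reference automatically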

Remote class objects (called remote actors in Ray) allow for parameter servers and more sophisticated design patterns such as nested trees of actors or functions. Using this simple API and architecture, complicated distributed tasks can be designed quickly without the need to create the underlying infrastructure. Examples of many design patterns can be found here.

@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

    def get_counter(self):
        return self.value

counter_actor = Counter.remote()

For more details on the underlying architecture, see the Ray 1.0 Architecture whitepaper.

Starting Ray on a Databricks Cluster

Note: The official Ray documentation describes Spark integration via the RayDP project, which runs Spark on Ray. This post takes the opposite approach, “Ray on Spark,” because a Databricks cluster starts as a managed Spark cluster rather than being able to initialize as a Ray cluster. Ray is also not officially supported by Databricks.

Some custom setup is needed before Ray can run on a Databricks cluster. An init script is a shell script that runs during startup of each cluster node before the Apache Spark driver or worker JVM starts. Instructions on how to configure an init script can be found here.

Run the following cell in a Databricks notebook to create the init script:

%python

kernel_gateway_init = """
#!/bin/bash

#RAY PORT
RAY_PORT=9339
REDIS_PASS="d4t4bricks"

# install ray
/databricks/python/bin/pip install ray

# Install additional ray libraries
/databricks/python/bin/pip install ray[debug,dashboard,tune,rllib,serve]

# If starting on the Spark driver node, initialize the Ray head node
# If starting on the Spark worker node, connect to the head Ray node
if [ ! -z $DB_IS_DRIVER ] && [ $DB_IS_DRIVER = TRUE ] ; then
  echo "Starting the head node"
  ray start  --min-worker-port=20000 --max-worker-port=25000 --temp-dir="/tmp/ray" --head --port=$RAY_PORT --redis-password="$REDIS_PASS"  --include-dashboard=false
else
  sleep 40
  echo "Starting the non-head node - connecting to $DB_DRIVER_IP:$RAY_PORT"
  ray start  --min-worker-port=20000 --max-worker-port=25000 --temp-dir="/tmp/ray" --address="$DB_DRIVER_IP:$RAY_PORT" --redis-password="$REDIS_PASS"
fi
""" 
# Change 'username' to your Databricks username in DBFS
# Example: username = "stephen.offer@databricks.com"
username = ""
dbutils.fs.put("dbfs:/Users/{0}/init/ray.sh".format(username), kernel_gateway_init, True)

Configure the cluster to run the init script that the notebook created on startup of the cluster. The advanced options, if using the cluster UI, should look like this:

Advanced cluster configuration example

Figure 6: Advanced cluster configuration example

Distributing Python UDFs

User-defined functions (UDFs) can be difficult to optimize since the internals of the function still run serially. There are options for optimizing Spark UDFs, such as Pandas UDFs, which use Apache Arrow to transfer data and Pandas to work with it, improving UDF performance. These options provide hardware-level optimization, but Ray can be used for logical optimization to drastically reduce the runtime of complicated Python tasks that would not normally be distributable. An example of distributing an ML model within a UDF to achieve a 2x performance improvement is included in the attached notebook.
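
As a rough sketch of the pattern (the column name and scoring logic here are purely illustrative, not the example from the notebook), a pandas UDF can fan its rows out to Ray remote tasks running across the cluster:

import ray
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def ray_score(col: pd.Series) -> pd.Series:
    # connect to the Ray cluster started by the init script
    # (depending on the Ray version, the Redis password from the init script may also be required)
    ray.init(address="auto", ignore_reinit_error=True)

    @ray.remote
    def score(value):
        # stand-in for an expensive model call or other complex Python logic
        return float(value) * 2.0

    futures = [score.remote(v) for v in col]
    return pd.Series(ray.get(futures))

# hypothetical usage on a DataFrame with a numeric "feature" column:
# display(df.withColumn("scored", ray_score("feature")))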

Reinforcement Learning

Example diagram of how Ray can be used for reinforcement learning

Figure 7: Diagram of Reinforcement Learning

An important and growing application of machine learning is reinforcement learning, in which an ML agent learns to take actions in an environment to maximize a reward function. Its applications range from autonomous driving to power consumption optimization to state-of-the-art gameplay. Reinforcement learning is the third major category of machine learning, along with unsupervised and supervised learning.

The challenges of creating reinforcement learning applications include the need for creating a learning environment or simulation in which the agent can train, the complexity of scaling, and the lack of open source standards. Each application requires an environment, which often is custom made and created through historical records or physics simulations that can provide the result of every action the agent can perform. Such simulation environment examples include OpenAI Gym (environments ranging from classic Atari games to robotics), CARLA (the open-source driving simulator), or Tensor Trade (for training stock market trading algorithms).

For these simulations to scale, they cannot simply run on partitions of a dataset. Some simulations will complete before others, and each must communicate its copy of the machine learning model’s weights back to a central server for model consolidation in the simplest form of distributed model training. This is therefore an issue of task parallelism: the challenge is not big data, but running many simultaneous computations of high complexity. The last issue to mention is the lack of open-source standards in reinforcement learning libraries. Whereas deep learning and traditional machine learning have had more time to establish standards and libraries that bridge the differences between frameworks (such as MLflow), reinforcement learning is at an earlier stage of development, does not yet have a well-established standard set of model libraries, and implementations can vary widely. This adds development time when switching between algorithms or frameworks.

To solve these problems, Ray comes with a reinforcement learning library named RLlib for high scalability and a unified API. It can run OpenAI Gym and user-defined environments, can train on a very wide variety of algorithms and supports TensorFlow and PyTorch for the underlying neural networks. Combining RLlib with Databricks allows for the benefits of highly scalable streaming and data integration with Delta Lake along with the high performance of state-of-the-art reinforcement learning models.

RLlib uses Tune, a Ray library for scalable hyperparameter tuning that runs variations of the models to find the best one. In this code example, it runs a PPO (Proximal Policy Optimization) agent on an OpenAI Gym’s CartPole environment and performs a grid search on three options for the learning rate. What is going on under the hood is that the Ray process on the Spark nodes is running simulations of the environment and sending back the batches to a central training Ray process that trains the model on these batches. It then sends the model to the rollout workers to collect more training data. While the trainer process can use GPUs to speed up training, by setting “num_gpus” to 0, it will train on less expensive CPU nodes.

Ray Tune, Ray’s library for scalable hyperparameter tuning, coordinating distributed training of a Proximal Policy Optimization (PPO) agent.

Figure 8: PPO Architecture

from ray import tune

tune.run(
    "PPO",
    stop={"episode_reward_mean": 200},
    config={
        "env": "CartPole-v0",
        "num_gpus": 0,
        "num_workers": 3,
        "lr": tune.grid_search([0.01, 0.001, 0.0001]),
    },
)

Applications of reinforcement learning broadly consist of scenarios where a simulation can run, a cost function can be established, and the problem is complicated enough that hard-coded logical rules or simpler heuristic models cannot be applied. The most famous cases of reinforcement learning are typically research-oriented, with an emphasis on gameplay such as AlphaGo, superhuman-level Atari agents, or simulated autonomous driving, but there are many real-world business use cases. Examples of recent applications are robotic manipulation control for factories, power consumption optimization, and even marketing and advertising recommendations.

Get started

Integrating Ray with Spark expands the possible applications of the Databricks Lakehouse Platform by enabling scalable task parallelism as well as reinforcement learning. The integration combines the reliability, security, distributed-compute performance and wide array of partner integrations of Delta Lake with Ray’s universal distributed-compute framework, adding new streaming, ML and big data workloads.

Try the Notebook

--

Try Databricks for free. Get started today.

The post Ray on Databricks appeared first on Databricks.

Building Analytics on the Lakehouse Using Tableau With Databricks Partner Connect


This is a guest authored post by Madeleine Corneli, Sr. Product Manager, Tableau

 
On November 18, Databricks announced Partner Connect, an ecosystem of pre-integrated partners that allows customers to discover and connect data, analytics and AI tools to their lakehouse. Tableau is excited to be among a set of launch partners to be featured in Partner Connect, helping users visualize all the data in their data lakehouse.

“For every data-driven organization, a robust data visualization and analytics solution, like Tableau, ensures that people can easily access, analyze and understand the data that’s driving their business forward. Databricks Partner Connect eliminates the complexity of connecting Tableau to a customers’ lakehouse to uncover data-driven insights even faster. We’re excited to be partnered even more closely with Tableau to bring this speed and agility to our customers together,” said Adam Conway, SVP of Products at Databricks.

Databricks Partner Connect: Analytics for your Lakehouse

Tableau on Databricks Partner Connect helps customers get to insights faster, improving the time to value for big data and data science investments. Within seconds, users can move seamlessly from the Databricks UI to Tableau Desktop to stay in the flow of their analysis.

The promise of the lakehouse is to bring all the data to every user in the tools they know and love. Using Tableau with Databricks helps unlock the full value of your lakehouse by allowing you to create a comprehensive picture of your organization’s data through visual analytics. With comprehensive analytics and faster insights, customers like Wehkamp use Tableau and Databricks together to make data-driven decisions and eliminate bottlenecks between analysts and data stewards.

Tableau in Partner Connect allows your users to easily access and analyze relevant data in a secure and governed environment. Tableau offers numerous ways to explore your data visually, as well as smart analytics features like Ask Data and Explain Data. Read more here to learn how Tableau and Databricks work together to bring value to your organization.

Example BI integration within Partner Connect

Tableau on Databricks Partner Connect streamlines the data connection process by:

  • Simplifying the user journey
  • Keeping users in the flow of their analysis
  • Programmatically creating a Tableau data source ready for immediate analysis

How to Launch Tableau via Databricks Partner Connect

  1. In Databricks Partner Connect, select Tableau under BI and visualization.

    In Databricks Partner Connect, select Tableau under BI and visualization

  2. Select your compute endpoint and download the connection file.

  3. Once you open the connection file, enter your credentials to connect to the Databricks cluster from Tableau Desktop.

  4. Start your analysis.

    Or, if you are more of a visual learner, check out our quick demo below:

    To learn more about Partner Connect, click here.

--

Try Databricks for free. Get started today.

The post Building Analytics on the Lakehouse Using Tableau With Databricks Partner Connect appeared first on Databricks.
