
Top Three Data Sharing Use Cases With Delta Sharing


Data sharing has become an essential component to drive business value as companies of all sizes look to securely exchange data with their customers, suppliers and partners. According to a recent Gartner survey, organizations that promote data sharing will outperform their peers on most business value metrics.

There are various challenges with the existing data sharing solutions that limit the data sharing within or across organizations and fail to realize the true value of data. Over the last 30 years, data sharing solutions have come in two forms: homegrown solutions or third-party commercial solutions. With homegrown solutions, data sharing has been built on legacy technologies such as SFTP and REST APIs, which have become difficult to manage, maintain or scale with new data requirements. Alternatively, commercial data sharing solutions only allow you to share data with others leveraging the same platform, which limits the data sharing and can be costly.

These challenges have led us, at Databricks, to rethink the future of data sharing as open. During the Data + AI Summit 2021, we announced Delta Sharing, the world’s first open protocol for secure and scalable real-time data sharing. Our vision behind Delta Sharing is to build a data-sharing solution that simplifies secure live data sharing across organizations, independent of the platform on which the data resides or is consumed. With Delta Sharing, organizations can easily share existing large-scale datasets based on the Apache Parquet and Delta Lake formats without moving data and empower data teams with the flexibility to query, visualize and enrich shared data with their tools of choice.

Delta Sharing ecosystem

Since the private preview launch, we have seen tremendous engagement from customers across industries to collaborate and develop a data-sharing solution fit for purpose and open to all. Customers have already shared petabytes of data using Delta Sharing. The Delta Sharing partner ecosystem has also grown since the announcement, with commercial and open source clients such as Power BI, pandas, and Apache Spark™ offering built-in Delta Sharing connectors, and many more to be released soon.

Through our customer conversations, we have identified three common use cases: data commercialization, data sharing with external partners and customers, and line of business data sharing. In this blog post, we explore each one of the top use cases and share some of the insights we are hearing from our customers.

Use case 1: Data commercialization

Customer example: A financial data provider was interested in reducing operational inefficiencies with their legacy data delivery channels and making it easier for the end customers to seamlessly access large new datasets.

Challenges

The data provider recently launched new textual datasets that were large in size, with terabytes of data being produced regularly. Providing quick and easy access to these large datasets has been a persistent challenge for the data provider as the datasets were difficult to ingest in bulk for the data recipients. With the current solution, the provider had to replicate data to external SFTP servers, which had many potential points of failure and increased latency.

On the recipient side, ingesting and managing this data was not easy due to its size and scale. Data recipients had to set up infrastructure for ingestion, which further required approvals from IT and database administrators, resulting in delays that could take weeks if not longer to complete before the end consumer could start using the data.

How Delta Sharing helps

With Delta Sharing, the data provider can now share large datasets in a seamless manner and overcome the scalability issues with the SFTP servers. These large terabyte sized textual datasets which had to be extracted in batches to SFTP can now be accessed in real time through Delta Sharing. The provider now can simply grant and manage access to the data recipients instead of replicating the data, thereby reducing complexity and latency. With the improved scalability, the data provider is seeing a significant increase in customer adoption as the data consumers have access to live data instead of having to pull the datasets on a regular basis.

Use case 2: Data sharing with external partners/customers

Customer example: A large retailer needed to easily share product data (e.g., cereal SKU sales) with partners without being on the same data sharing or cloud computing platform as them. The retailer wanted to create partitioned datasets based on SKUs for partners to easily access the relevant data in real time.

Challenges

The retailer was utilizing homegrown SFTP and APIs to share data with partners, which had become unmanageable. This solution required a considerable amount of development resources to maintain and operate. The retailer looked at other data sharing solutions, but these solutions required their partners to be on the same platform, which is not feasible for all parties due to cost considerations and operational overhead of replicating data across different regions.

How Delta Sharing helps

Delta Sharing was an exciting proposition for the retailer to manage and share data efficiently across cloud platforms without the need to replicate the data across regions. The retailer found it easy to manage, create and audit data shares for their 100+ partners through Delta Sharing. For each partner, the retailer can easily create partitions and share the data securely without the need to be on the same data platform. In addition to making the management of the shares easy, Delta Sharing also minimizes the cost, as the data provider only incurs data egress cost from the underlying cloud provider and does not have to pay for any compute charges for data sharing.

Use case 3: Internal data sharing with line of business

Customer example: A manufacturer wants data scientists across its 15+ divisions and subsidiaries to have access to permissioned data to build predictive models. The manufacturer wants to do this with strong governance, controls, and auditing capabilities because of data sensitivity.

Challenges

The manufacturer has many data lake deployments, making it difficult for teams across the organization to access the data securely and efficiently. Managing all this data across the organization is done in a bespoke manner with no strong controls over entitlements and governance. Additionally, many of these datasets are petabytes in size, raising concerns about the ability to share this data at scale. Management was hesitant about sharing data without the proper data access controls and governance. As a result, the manufacturer was missing unique opportunities to unlock value and enable more unique insights for the data science teams.

How Delta Sharing helps

With Delta Sharing, the manufacturer now has the ability to govern and share data across distinct internal entities without having to move data. Delta Sharing lets the manufacturer grant, track, and audit access to shared data from a single point of enforcement. Without having to move these large datasets, the manufacturer doesn’t have to worry about managing different services to replicate the data. Delta Sharing enabled the manufacturer to securely share data much quicker than they expected, allowing for immediate benefits as the end-users could begin working with unique datasets that were previously siloed. The manufacturer is also excited to utilize the built-in Delta Sharing connector with PowerBI, which is their tool of choice for data visualization.

Getting started with Delta Sharing

Delta Sharing makes it simple to share data with other organizations regardless of which data platforms they use. We are thrilled to offer the first open and secure solution, free of proprietary lock-in, that helps data teams easily share data and manage privacy, security and compliance across organizations.

To try Delta Sharing on Databricks, reach out to your Databricks account executive or sign up to get early access. For many of our customers, governance is top of mind when sharing data. Delta Sharing is natively integrated with Unity Catalog, which enables customers to add fine-grained governance and security controls, making it easy and safe to share data internally or externally. Once you have enabled Unity Catalog in your Databricks account, try out the quick start notebooks below to get started with Delta Sharing on Databricks:

  1. Creating a share and granting access to a data recipient
  2. Connecting to a share and accessing the data

 

To try the open source Delta Sharing release, follow the instructions at delta.io/sharing.
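As a rough illustration, reading a share with the open source Python connector might look like the following (the profile file path, share and table names below are placeholders):

import delta_sharing

# Profile file downloaded from the data provider (placeholder path)
profile = "/dbfs/tmp/open-datasets.share"

# Discover the tables the provider has shared with us
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load a shared table into a pandas DataFrame
# (table URL format: <profile-path>#<share>.<schema>.<table>)
df = delta_sharing.load_as_pandas(f"{profile}#my_share.default.my_table")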

Interested in participating in the Delta Sharing open source project?

We’d love to get your feedback on the Delta Sharing project and ideas or contributions for new features. Get involved with the Delta Sharing community by following the instructions here.

--

Try Databricks for free. Get started today.



Improving Drug Safety With Adverse Event Detection Using NLP


The World Health Organization defines pharmacovigilance as “the science and activities relating to the detection, assessment, understanding and prevention of adverse effects or any other medicine/vaccine-related problem.” In other words, drug safety.

Pharmacovigilance: drug safety monitoring in the real-world

While all medicines and vaccines undergo rigorous testing for safety and efficacy in clinical trials, certain side effects may only emerge once these products are used by a larger and more diverse patient population, including people with other concurrent diseases.

To support ongoing drug safety, biopharmaceutical manufacturers must report adverse drug events (ADEs) to regulatory agencies, such as the Food and Drug Administration (FDA) in the United States and the European Medicines Agency (EMA) in the EU. Adverse drug reactions or events are medical problems that occur during treatment with a drug or therapy. Of note, ADEs do not necessarily have a causal relationship with the treatment. But in aggregate, the proactive reporting of adverse events is a key part of the signal detection system used to ensure drug safety.

Adverse event detection requires the right data foundation

Monitoring patient safety is becoming more complex as more data is collected. In fact, less than 5% of ADEs are reported via official channels and the vast majority are captured in free-text channels: emails and phone calls to patient support centers, social media posts, sales conversations between clinicians and pharma sales reps, online patient forums, and so on.

Robust drug safety monitoring requires manufacturers, pharmaceutical companies and drug safety groups to monitor and analyze unstructured medical text from a variety of jargons, formats, channels and languages. To do this effectively, organizations need a modern, scalable data and AI platform that can provide scientifically rigorous, near real-time insights.

The path forward begins with the Databricks Lakehouse, a modern data platform that combines the best elements of a data warehouse with the low-cost, flexibility and scale of a cloud data lake. This new, simplified architecture enables healthcare providers and life sciences organizations to bring together all their data—structured (like diagnoses and procedure codes found in EMRs), semi-structured (like clinical notes) and unstructured (like images)— into a single, high-performance platform for both traditional analytics and data science.

The Databricks and John Snow Labs architecture for analyzing unstructured healthcare text data using NLP tools.

Building on these capabilities, Databricks has partnered with John Snow Labs, the leader in healthcare natural language processing (NLP), to provide a robust set of NLP tools tailored for healthcare text. This is critical, as much of the data used for adverse event detection is text-based. You can learn more about our partnership with John Snow Labs in our previous blog, Applying Natural Language Processing to Health Text at Scale.

Solution accelerator for adverse drug event detection

To help organizations monitor drug safety issues, Databricks and John Snow Labs built a solution accelerator notebook for ADE using NLP. As demonstrated in our previous blog, by leveraging the Databricks Lakehouse Platform, we can use pre-trained NLP models to extract highly-specialized structures from unstructured text and build powerful analytics and dashboards for different personas. In this solution accelerator, we show how to use pre-trained models to process conversational text, extract adverse events and drug information and build a Lakehouse for pharmacovigilance that powers various downstream use cases.

The Databricks and John Snow Labs end-to-end workflow for extracting adverse drug events from unstructured text for pharmacovigilance.

The solution accelerator follows 4 basic steps:

  1. Ingest unstructured medical text at scale.
  2. Use pre-trained NLP models to extract useful information such as adverse events (e.g., renal damage), drug names and timing of the events in near real-time.
  3. Correlate adverse events with drug entities to establish a relationship.
  4. Measure frequency of events to determine significance.

Below is a brief summary of the workflow contained within the notebook.

Overview of the adverse drug event detection workflow

Starting with raw text data, we use a corpus of 20,000 texts with known ADE status (4,200 texts containing ADE) and apply a pre-trained BioBERT model to detect ADE status, assessing the specificity and sensitivity of the model against the ground truth and its confidence in each assignment. In addition, we extract ADE status and drug entities from the conversational texts by using a combination of the ner_ade_clinical and ner_posology models.
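For illustration, below is a rough sketch of how such a pipeline could be assembled with Spark NLP for Healthcare; exact class and model names can vary by library version, and conversational_texts_df is a placeholder DataFrame with a text column.

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel  # licensed John Snow Labs package

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")

# ADE entities
ade_ner = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ade_ner")
ade_chunks = NerConverter().setInputCols(["sentence", "token", "ade_ner"]).setOutputCol("ade_chunk")

# Drug and posology entities
drug_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("drug_ner")
drug_chunks = NerConverter().setInputCols(["sentence", "token", "drug_ner"]).setOutputCol("drug_chunk")

pipeline = Pipeline(stages=[document, sentence, token, embeddings,
                            ade_ner, ade_chunks, drug_ner, drug_chunks])
results = pipeline.fit(conversational_texts_df).transform(conversational_texts_df)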


By simply adding a stage in the pipeline, we can detect the assertion status of the ADE (present, absent, occurred in the past, etc.).


To infer the relationship status of an ADE with a clinical entity, we use a pre-trained model (re_ade_clinical), which detects the relationships between a clinical entity (in this case drug) and the inferred ADE.


The sparknlp_display library can render these relations on the raw text, along with their linguistic relationships and dependencies.


After the ADE and drug entity data has been processed and correlated, we can build powerful dashboards to monitor the frequency of ADE and drug entity pairs in real time.
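As a minimal sketch, assuming the extracted chunks have been flattened into ade and drug columns (the DataFrame, table and column names here are hypothetical), the aggregation behind such a dashboard might look like:

from pyspark.sql import functions as F

ade_drug_counts = (
    ade_events_df
    .groupBy("ade", "drug")
    .agg(F.count("*").alias("n_mentions"))
    .orderBy(F.desc("n_mentions"))
)

# Persist as a gold Delta table for dashboards to query
ade_drug_counts.write.format("delta").mode("overwrite").saveAsTable("pharmacovigilance.ade_drug_counts")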


Get started analyzing adverse drug events with NLP on Databricks

With this solution accelerator, Databricks and John Snow Labs make it easy to analyze large volumes of text data to help with real-time drug signal detection and safety monitoring. To use this solution accelerator, you can preview the notebooks online and import them directly into your Databricks account. The notebooks include guidance for installing the related John Snow Labs NLP libraries and license keys.

You can also visit our industry pages to learn more about our Healthcare and Life Sciences solutions.

--

Try Databricks for free. Get started today.


Understanding New Years Trends: A Simple, Unified Pipeline on the Databricks Lakehouse


For many people, the start of a new year marks the perfect time to make a change. That’s why, despite the rather polarizing nature, New Year’s resolutions remain an important tradition for kickstarting a personal goal.

Oftentimes, they’re not terribly creative – improve fitness, adopt a hobby, go someplace new. But over the past two years, as we collectively handle a global pandemic, many of us have experienced a shift in mindset on what’s important or what success means. We’ve seen this shift in all sorts of ways — The Great Resignation, definitions of wealth, new norms for socializing and more.

With this in mind and the onset of 2022, a few of us at Databricks thought it would be interesting to examine how post-pandemic life has impacted New Year’s resolutions, which are essentially snapshots into the most popular goals and trends. To do this, we used Databricks and the Twitter API to perform keyword search based on a pre-trained collection of word vectors provided by GloVe — and the results were pretty interesting.

This blog post will walk through how exactly we went about executing this use case leveraging Databricks, the Twitter API and easily-accessible open source tools. Then, we’ll share the findings of our analysis, which we think truly reflect the changing nature of the times. Let’s dive in!

Why Databricks?

First, let’s give a brief intro to Databricks and why it made this use case so simple to execute.

To perform this use case, we needed to aggregate the relevant data set from Twitter, process and prepare it for our keyword search, classify it, and then store the results in a place where the data can be queried and visualized in meaningful ways. Databricks offers us all of these capabilities out of the box with our Lakehouse platform, which combines the reliability, performance and governance of data warehouses with the openness and flexibility of data lakes. Not a single external system needed to be set up.

Databricks facilitates easy implementation of the Lakehouse architecture and Delta Lake as a managed service, allowing data practitioners to take advantage of the cost-effective and highly-scalable nature of cloud object storage while also enabling performant queries and visualizations to be built on top of the stored data without requiring it to be converted to a proprietary format or funneled into a traditional data warehouse. This means that an entire data team (data scientists, analysts, and data engineers) can execute this use case end-to-end with the tools they are most comfortable with and within an open, collaborative environment.

How we did it


 

Data Ingestion & Processing

The first step was in many ways the hardest: determining what data set we’d use to capture global New Year’s resolutions. We determined Twitter was the best option since it’s a conversational platform with a global user base, is easily searchable and comes with a Developer API. Since the goal was to compare pre and post-pandemic goals, we needed a historical data set. We used a public domain-licensed historical dataset from 2015 that provided over 5,000 New Year’s resolution-related tweets.

For comparison’s sake, we then aggregated a data set of relevant tweets from this year using the Twitter API. First, we built a Notebook to ingest tweets and build our data set. We collected tweets based on selected phrases – #NewYearsResolutions and associated hashtags and keywords – between the dates of 12/17/2021 and 1/2/2022. We ended up with quite a large sample of tweets, so we randomly sampled approximately 10,000 of them to be more in line with the size of our historic data set.

To accelerate the ingestion step, we used Tweepy, a Python library that makes it easy to interact with the Twitter API. As an aside, since Databricks notebooks allow for mixing languages, it was very easy to run a shell command to install the needed Python libraries into our environment and then write the rest of the code in Python. We did some cleanup of the text by removing things like URLs, punctuation, and hashtags.
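A minimal sketch of that ingestion step is shown below; the bearer token, query and variable names are placeholders, and the exact search endpoint depends on your API access level (the full-archive search needed for historical date ranges requires elevated access).

import re
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

query = "#NewYearsResolution OR #NewYearsResolutions -is:retweet lang:en"
response = client.search_recent_tweets(query=query, tweet_fields=["created_at"], max_results=100)

def clean(text):
    # Strip URLs, mentions, hashtags and punctuation before classification
    text = re.sub(r"http\S+|@\S+|#\S+", "", text)
    return re.sub(r"[^\w\s]", "", text).lower().strip()

rows = [(t.id, t.created_at, clean(t.text)) for t in (response.data or [])]
tweets_df = spark.createDataFrame(rows, ["tweet_id", "created_at", "text"])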

With our data prepared, and once again with the help of magic commands to mix languages, we inserted a SQL statement into our notebook to MERGE the data from our Apache Spark™ Dataframe into our bronze Delta Table. With every pull of the Twitter API, there were some tweets duplicated across multiple batches; the MERGE operation allows us to only push new tweets into our table, avoiding duplication.
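A minimal sketch of that MERGE, assuming a hypothetical resolutions.tweets_bronze Delta table keyed on tweet_id:

tweets_df.createOrReplaceTempView("tweets_batch")

# Only insert tweets that were not seen in a previous API pull
spark.sql("""
  MERGE INTO resolutions.tweets_bronze AS target
  USING tweets_batch AS source
  ON target.tweet_id = source.tweet_id
  WHEN NOT MATCHED THEN INSERT *
""")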

Classifying & analyzing tweets

For this project, we followed a simplified version of a medallion architecture. In our case, we landed the pre-processed tweets into our bronze table via the MERGE, ran them through our classifier, and then used another MERGE to insert the results into our gold table. This highlights again how MERGE makes it really easy to push high volumes of unique records through a pipeline on top of Delta Lake without having to write complex logic for deduplication.

For the actual tweet classification, we used the pre-trained GloVe vectors (downloaded via Gensim) to construct relevant categories and keywords for classifying each resolution. One really nice thing about the GloVe vectors is that they were trained on over 2 billion tweets worth of data from Twitter. This solved the challenge of us not having enough training data upfront to build our own vectors.

After some discussion, we came up with these categories* as common New Year’s Resolutions themes:

  • exercise
  • learn something new
  • finance
  • eco-friendly
  • outdoors
  • travel
  • healthy diet
  • reading
  • self-care
  • quit smoking

* We also had an “other” category for all tweets that didn’t fit into the topics above. We ended up not using the other category for our analysis since a large portion of these tweets consisted of ads, sarcastic or funny comments, trolling, and other irrelevant messages

We came up with a few seed keywords for each category, and then GloVe provided additional keywords that were most relevant to each, giving us a basis to do our classification.
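A rough sketch of that keyword expansion using Gensim's downloader, with an illustrative subset of seed words:

import gensim.downloader as api

# GloVe vectors pre-trained on roughly 2 billion tweets
glove = api.load("glove-twitter-100")

seeds = {
    "exercise": ["gym", "workout", "run"],
    "healthy diet": ["diet", "vegan", "sugar"],
    "reading": ["book", "read", "novel"],
}

# Expand each category with the terms GloVe considers most similar to its seeds
keywords = {
    cat: set(words) | {w for w, _ in glove.most_similar(positive=words, topn=25)}
    for cat, words in seeds.items()
}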

Now that we had each category seeded with a large number of keywords, we ran each tweet through our classifier to determine the dominant category. We did this by counting the number of keywords from each category that appeared in each tweet: whichever category had the largest number of matched keywords is how we classified that tweet.
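Continuing the sketch above, the keyword-count classifier could be expressed as a simple UDF (column names are illustrative):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(StringType())
def classify(text):
    # Count keyword hits per category; the highest count wins, otherwise "other"
    tokens = set(text.split())
    scores = {cat: len(tokens & kws) for cat, kws in keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

classified_df = tweets_df.withColumn("category", classify(F.col("text")))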

We executed this process for both the 2015 and 2022 data sets. Using Databricks, we wrote these into a gold Delta Table and were able to quickly develop visualizations in Databricks SQL. This was the final product that was the basis for our analysis, which we’ll dive into below:

A glimpse into the post-pandemic mindset

While the 2015 dataset included human-labeled topics, we executed the above process for both the 2015 and 2022 data sets to classify all of the tweets according to our selected categories in order to get a consistent view.

Now that our data science was complete, it was just a matter of using this data and visualizations to actually extract insights. We performed our analysis and were pretty surprised just how different the two years’ resolutions were. Here’s a summary of our findings:

A growing interest in physical health

“Eating better” and “exercising more” are some of the most stereotypical New Year’s resolutions. But when we compare 2015 and 2022, it’s clear that there’s a more meaningful shift at play.

In 2015, self-care – usually used to describe overall wellbeing with an emphasis on physical and mental behaviors, mindfulness, etc – was the most common New Year’s resolution. This theme still remains strong in 2022, as it was the second most popular resolution.

However, a stark contrast is the increased focus on physical health goals. Pre-pandemic, healthy-diet wasn’t hugely top of mind, accounting for only 12.5% of tweets. Healthy diet nearly doubled in 2022, making it the top resolution on Twitter. This dramatic change makes total sense given the context of the time. For many of us, the pandemic has pushed ideas of health risks and ailments to the top of our minds. While it might not be directly related to COVID-19, it’s not surprising to see people set goals around adopting an overall healthier lifestyle and eating habits.

Less desire to learn

Another noticeable difference between the two years is in the learn something new category, which can really describe anything from picking up a new hobby to acquiring a skill set to just expanding overall knowledge. As you can see, 2015 showed a huge interest in learning something new and ranked #2 in popularity. However, in 2022, that number shrank from 13% to less than 9%, bumping it down to #5.

Like a healthy diet, it’s possible to view this shift as a response to the past two years. In that timeframe, people have had to spend significantly more time at home, often apart from friends and most loved ones. Naturally, without the typical avenues of entertainment and going out, many of us had ample time to explore new avenues and hobbies. But two years in, it’s not surprising to see that people are rather fatigued of ‘learning’ or perhaps have already reached these goals and are ready to commit to something different, such as behaviors to improve health.

Some things never change

It’s important to note that while a lot has changed, a lot has also stayed the same in terms of what people care about and their personal motivations.

One stable New Year’s resolution was reading. While it’s great to see reading as high on the list both years, this was a little surprising given that learning-new experienced such a dip in 2022. However, its ability to remain a top priority in 2022 could be explained by fatigue around connecting online (e.g., Zoom meetings and happy hours) and more time spent online or on streaming services. With this in mind, it seems practical that a lot of people are ready to take breaks from the Internet and explore a different avenue of entertainment.

Another constant that was exciting to see was the consistent focus on self-care. While, as mentioned above, it lost its spot as the #1 resolution, there wasn’t a big change between 2015 and 2022 (22.3% and 19.5%, respectively). Considering the stresses and unknowns since 2019, all we can say is we’re happy to see that people are still prioritizing taking care of their own needs and health.

Conclusion

These are just some of our insights from comparing 2015 and 2022 New Year’s resolutions, but they do suggest a growing shift in our personal goals and interests. Even more so, this use case shows how Databricks’ Lakehouse truly is a unified platform. Every teammate involved was able to execute every aspect of this use case on Databricks, and do it quickly and collaboratively.

New to Lakehouse? Check out this blog post from our co-founders for an overview of the architecture and how it can be leveraged across data teams.

--

Try Databricks for free. Get started today.


Hunting Anomalous Connections and Infrastructure With TLS Certificates


According to Sophos, 46% of all malware now uses Transport Layer Security (TLS) to conceal its communication channels, a number that has doubled in the last year alone. Malware such as LockBit ransomware, AgentTesla and the Bladabindi remote access tool (RAT) has recently been observed using TLS for PowerShell-based droppers, for accessing Pastebin to retrieve code, and more.

In this blog, we will walk through how security teams can ingest x509 certificates (found in the TLS handshake) into Delta Lake from AWS S3 storage, enrich them, and perform threat hunting techniques on them.

TLS is the de facto standard for securing web applications, and forms part of the overall trust hierarchy within a public key infrastructure (PKI) solution.

During the initial connection from a client to server, the TLS protocol performs a two-phase handshake, whereby the web server proves its identity to the client by way of information held in the x509 certificate. Following this, both parties agree on a number of algorithms, then generate and exchange symmetric keys, which are subsequently used to transmit encrypted data.

For all practical purposes, x509 certificates are totally unique and can be identified using hashing algorithms (commonly SHA1, SHA256 and MD5) to produce fingerprints. The nature of hashing makes these fingerprints great threat indicators, and they are commonly used in threat intelligence feeds to represent objects. Since the information within certificates is used for cryptographic key material (agreement, exchange, creation, etc.), the certificates themselves are encoded but not encrypted and can therefore be read.
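For example, a fingerprint is simply a hash of the DER-encoded certificate; a quick sketch using the cryptography library (the certificate path is a placeholder):

import hashlib
from cryptography import x509
from cryptography.hazmat.primitives import serialization

# Load a certificate captured off the wire (placeholder path)
with open("/tmp/captured_cert.pem", "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())

der = cert.public_bytes(serialization.Encoding.DER)
sha1_fingerprint = hashlib.sha1(der).hexdigest()
sha256_fingerprint = hashlib.sha256(der).hexdigest()
md5_fingerprint = hashlib.md5(der).hexdigest()

print(cert.subject.rfc4514_string(), cert.issuer.rfc4514_string(), sha1_fingerprint)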

Capturing, storing and analyzing network traffic is a challenging task. However, landing it in cheap cloud object storage, processing it at scale with Databricks and only keeping the interesting bits could be a valuable workflow for security analysts and threat hunters. If we can identify suspicious connections, we have an opportunity to create indicators of compromise (IOCs) and have our SIEM security tools help prevent further malicious activity downstream.

About the data sets

We are using x509 data collected from a network scan, and alongside it, we will use the Cisco Umbrella top 1 million list and the SSL blacklist produced by abuse.ch as lookups.

One of the best places within an enterprise network to get hold of certificate data is off the wire using packet capture techniques. Zeek, TCPDump and Wireshark are all good examples.

If you are not aware of the cyber threat hunting tool SSL Blacklist, it is run by abuse.ch with the goal of detecting malicious SSL connections. The Cisco Umbrella top 1 million lists the most popular DNS lookups on the planet, as seen by Cisco. We will use this to demonstrate filtering and lookup techniques. If you or your hunt team want to follow along with the notebook and data, you can import the accompanying notebook.

Source: https://sslbl.abuse.ch

Ingesting the data sets

For simplicity, if you are following along at home, we will be using Delta Lake batch capability to ingest the data from an AWS S3 bucket into a bronze table, then refine and enrich it into a silver table (medallion architecture). In real-world applications, however, you can upgrade your experience by using Structured Streaming!

We’ll focus on the blacklist and umbrella files first, followed by x509 certificate data.

# Alexa-Top1m
rawTop1mDF = read_batch(spark, top1m_file, format='csv', schema=alexa_schema)

# Write to Bronze Table
alexaTop1mBronzeWriter = create_batch_writer(spark=spark, df=rawTop1mDF, mode='overwrite')
alexaTop1mBronzeWriter.saveAsTable(databaseName + ".alexaTop1m_bronze")

# Make Transformations to Top1m
bronzeTop1mDF = spark.table(databaseName + ".alexaTop1m_bronze")
bronzeTop1mDF = bronzeTop1mDF.filter(~bronzeTop1mDF.alexa_top_host.rlike('localhost')).drop("RecordNumber")
display(bronzeTop1mDF)

# Write to Silver Table
alexaTop1mSilverWriter = create_batch_writer(spark=spark, df=bronzeTop1mDF, mode='overwrite')
alexaTop1mSilverWriter.saveAsTable(databaseName + ".alexaTop1m_silver")

The above code snippet reads the CSV file from an S3 bucket, writes it directly to the bronze table unaltered, then reads the bronze Delta table, makes transformations and writes the result to the silver table. That's the top 1 million data ready for use!

Resulting top 1 million dataframe

Next, we follow the same format for the SSL blacklist data:

# SSLBlacklist
rawBlackListDF = read_batch(spark, blacklist_file, format='csv')
# Rename the CSV header column (the existing and new column names here are assumed)
rawBlackListDF = rawBlackListDF.withColumnRenamed('# Listingdate', 'Listingdate')

# Write to Bronze Table
sslBlBronzeWriter = create_batch_writer(spark=spark, df=rawBlackListDF, mode='overwrite')
sslBlBronzeWriter.saveAsTable(databaseName + ".sslBlacklist_bronze")

# Make Transformations to the SSLBlacklist
bronzeBlackListDF = spark.table(databaseName + ".sslBlackList_bronze")
# Prefix every column with 'sslbl_' so blacklist fields are easy to identify later
bronzeBlackListDF = bronzeBlackListDF.select(*(col(x).alias('sslbl_' + x) for x in bronzeBlackListDF.columns))

# Write to Silver Table
BlackListSilverWriter = create_batch_writer(spark=spark, df=bronzeBlackListDF, mode='overwrite')
BlackListSilverWriter.saveAsTable(databaseName + ".sslBlackList_silver")

The process is the same as for the top 1 million file, with the result presented below. Our transformation simply prefixes all columns with 'sslbl_' so they are easily identified later.

Resulting SSL blacklist dataframe

Next we ingest the x509 certificate data using exactly the same methodologies. Here’s how that dataframe looks after ingestion and transformation into the silver table. 

Sample dataframe with x509 certificate data generated as part of the Databricks cyber threat hunting workflow.

X509 certificates are complex and there are many fields available. Some of the most interesting for our initial purposes are:

  • subject, issuer, common_name, valid to/from fields, dest_ip, dest_port, rdns

Analyze the data

Looking for certificates of interest can be done in many ways. We’ll begin by looking for distinct values in the issuer field.

Example of how the Databricks cyber threat hunting solution can be used to identify potential threats and vulnerabilities within TLS certificates.

If you are new to PySpark, it is a Python API for Apache Spark. The above search makes use of collect_set, countDistinct, agg, and groupBy; you can read more about those in the linked documentation.
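For reference, a search along those lines might look roughly like this (column names here are assumed from the fields shown in the dataframe above):

from pyspark.sql import functions as F

issuer_summary = (
    silverX509DF
    .groupBy("issuer")
    .agg(F.countDistinct("sha1_fingerprint").alias("unique_certs"),
         F.collect_set("dest_ip").alias("dest_ips"))
    .orderBy(F.desc("unique_certs"))
)

display(issuer_summary)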

A hypothesis we have is that when certificates are either temporary, self-signed or otherwise not used for genuine purposes, the issuer field tends to have limited detail. Let’s create a search looking at the length of that field.

Sample search that looks at the length of the issuer field, which is part of the cyber threat hunting techniques used by the Databricks solution.

withColumn adds a new column after evaluating the given expression.
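A sketch of that search, again with the column names assumed above:

from pyspark.sql import functions as F

short_issuers = (
    silverX509DF
    .withColumn("issuer_length", F.length(F.col("issuer")))
    .select("issuer_length", "subject", "issuer", "common_name", "sha1_fingerprint")
    .orderBy("issuer_length")
)

display(short_issuers)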

The top entry has the shortest length and has unique subject and issuer fields. This is a good candidate for some OSINT!

Sample Google search displaying a number of hits to a TLS certificate believed to be used on malicious websites.

A Google search shows a number of hits suggesting this certificate is, or has been, used on malicious websites. This hash is a great candidate to pivot from and explore further in our network.

Let's now use our SSL blacklist table to correlate with known malicious hashes.

# SSLBlacklist
isSSLBlackListedDF = silverX509DF.select(
    "sslbl_Listingreason", "common_name", "country", "dest_ip", "rdns", "issuer",
    "sha1_fingerprint", "not_valid_before", "not_valid_after"
).filter(silverX509DF.sslbl_SHA1 != 'null')

display(isSSLBlackListedDF)

Sample search result displaying the movements of adversary infrastructure over time, used as part of the Databricks cyber threat hunting methodology.

This search raises some interesting findings. The top four entries show hits for a number of different malware families' command and control infrastructure. We also see the same sha1 fingerprint being used for ransomware command and control, using different IP addresses and DNS names. There could be a number of reasons for this, but the observation would be that adversary infrastructure is moving around over time. First-seen/last-seen analysis should be done using the threat data's listing date, along with other techniques such as passive DNS lookups, to further understand this and gain more situational awareness. New information discovered here should also be used to pivot back into an organization to look for any other signs of communication with any of these hosts or IP addresses.

Finally, a great technique for hunting command and control communication is to use a Shannon entropy calculation to look for randomized strings.

import math

from pyspark.sql.functions import col, length, udf
from pyspark.sql.types import DoubleType


def entropy(string):
    """Calculates the Shannon entropy of a string"""
    try:
        # get probability of chars in string
        prob = [float(string.count(c)) / len(string) for c in dict.fromkeys(list(string))]

        # calculate the entropy
        entropy = -sum([p * math.log(p) / math.log(2.0) for p in prob])
    except Exception as e:
        print(e)
        entropy = -1.0

    return entropy


# Return a numeric type so that ordering and the threshold filter below behave numerically
entropy_udf = udf(entropy, DoubleType())

entropyDF = (
    silverX509DF.where(length(col("subject")) < 15)
    .select("common_name", "subject", "issuer", "subject_alternative_names", "sha1_fingerprint")
    .withColumn("entropy_score", entropy_udf(col("common_name")))
    .orderBy(col("entropy_score").desc())
    .where(col("entropy_score") > 1.5)
)

display(entropyDF)

Sample search result identifying potentially malicious fingerprints, generated as part of Databricks’ cyber threat hunting solution.

As the saying goes, ‘the internet is a bad place’ and ‘math is a bad mistress’! Our initial search included all certificates, which produces a lot of noise due to the nature of the field's content. Experimenting further, we learned from our earlier search and focused only on those with a subject length of less than fifteen characters, surfacing only the highest-entropy entries of that data set. The resulting nine entries can be manually googled, or further automation could be applied. The top entry in this scenario is of interest, as it appears to be used as part of the Cobalt Strike exploit kit.

Sample Google search of a suspicious fingerprint, demonstrating how Databricks’ cyber threat hunting techniques can be applied to real-world situations.

Further work

This walkthrough has demonstrated some techniques we can use to identify suspicious or malicious traffic using simple, unique properties of x509 certificates. Further exploration using machine learning techniques may also provide benefits.

Conclusion

Analyzing certificates for unusual properties, or against known threat data, can identify infrastructure known to host malicious software. It can be used as an initial pivot point to gather further information that can be used to search for signs of compromise. However, since the certificate identifies a host and not the content it serves, it cannot provide high-confidence alerts alone.

Before being eligible for operationalization in a security operations center (SOC), the initial indicators need to be triaged further. Data from other internal and external sources, such as firewalls, passive DNS, VirusTotal, WHOIS and process creation events from endpoints, should be used.

Let us know at cybersecurity@databricks.com how you think processing TLS/x509 data either in an enterprise or more passively on the internet can be used to track adversaries and their infrastructure. If you are not already a Databricks customer, feel free to spin up a community edition.

Download the notebook.

--

Try Databricks for free. Get started today.


Top Four Characteristics of Successful Data and AI-driven Companies


At Databricks, we have had the opportunity to help thousands of organizations modernize their data architectures to be cloud-first and extract value from their data at scale with analytics and AI. Over the past few years, we’ve been fortunate to engage directly with customers across industries and regions about their data-driven aspirations – and the roadblocks that slow down their ability to get there. While challenges vary greatly among industries and even individual organizations, we have developed a rich understanding of the top four habits of data and AI-driven organizations.

Before diving into the habits, let’s take a quick look at how organizations have approached enabling data strategies. First, data teams have made technology decisions over time that propel a way of thinking that is based around technology stacks: data warehousing, data engineering, streaming, real-time data science, and machine learning. The problem is that’s not how business units think. They think about use cases, the decision-making process, and business problems (e.g., customer 360, personalization, fraud detection, etc.). As a result, enabling use cases becomes a complex stitching exercise across technology stacks. These pain points aren’t just anecdotal. In a recent survey conducted by Databricks and MIT Technology Review, 87% of surveyed organizations struggle to succeed with their data strategy; it often comes back to their approach of focusing on a ‘technology stack.’ Second, there continues to be ample support within IT teams to custom-build solutions rather than buying off-the-shelf offerings. This is not to say there aren’t valid scenarios where custom-built solutions are the right choice, but in many cases technology vendors have managed to solve the majority of the common, low-changing use cases, enabling teams to focus on more value-added initiatives indexed on creating value for the business faster. Lastly, from a people perspective, organizations have been well-intentioned in their strategies tying technology to business outcomes but have failed because the corporate culture around data hasn’t been addressed – in fact, in the 2022 Data and AI Leadership Executive Survey, 91.9% of respondents identify culture as the greatest challenge to becoming data-driven organizations.

Luckily, these challenges are solvable – but they require a different approach. We’re currently in a “data renaissance” where enterprises realize that to execute on novel data and AI use cases, the legacy model of siloed technology stacks needs to give way to a unified approach. In other words, it’s not about just data analytics or just ML – it’s about building a full enterprise-wide data, analytics, and AI platform. They also recognize that they need to empower their data teams with more turnkey solutions in order to focus on creating business value and not building tech stacks. Organizations also realize that the strategy can’t be some top-down authoritarian initiative but needs to be supported with training to improve data literacy and capabilities that make data ubiquitous and part of everyday life. Ultimately, every organization is trying to figure out how to achieve all this while keeping things simple. So how can you get there? These are the top habits we’ve identified among successful data and AI-driven organizations.

1. Embrace an AI future

When we first started on the Databricks journey, we often discussed how high-quality data is critical for analytics, but even more so for AI, and that the latter, especially for data-driven decision making, will power the future. Over time, as use cases like personalization, forecasting, vaccine discovery, and churn analysis have accelerated and advanced with AI, people have become more comfortable with the fact that the future is in AI. The habits are shifting from just asking “what happened?” to focusing on why, generating high-confidence predictions, and ultimately influencing future outcomes and business decisions. Around the world, we see organizations like Rolls-Royce, ABN AMRO, Shell, Regeneron, Comcast, and HSBC using data for advanced analytics and AI to deliver new capabilities or drastically enhance existing ones. And we see this across every vertical. In fact, Duan Peng, SVP of Data and AI at WarnerMedia, believes “The impact of AI has only started. In the coming years, we’ll see a massive increase in how AI is used to reimagine customer experiences.”

2. Understand that the future is open

There’s an interesting statistic from MIT that states 50% of data and technology leaders say that if given the redo button, they would embrace more open standards and open formats in their data architectures – in other words, optionality. The challenge to this approach is that many data practitioners and leaders associate “open” strictly with open source – and primarily within the context of the on-prem world (i.e., Apache Hadoop). But oftentimes, you’ve got an open source engine, and it was just about how do you get services and support around it.

In our conversations with CIOs and CDOs about what open means to them, it comes down to three core tenets. First, for their existing solution, what is the cost of portability? It’s great that you threw some code in a GitHub repo somewhere. That’s not what they care about. What they really care about is, if it comes down to it, the viability of moving off the platform from both a capability and cost standpoint. Next, how well do these capabilities allow for plugging into a rich ecosystem, whether it’s homegrown or leveraging other vendors’ products? Third, what is the learning curve for internal practitioners when onboarding? How quickly can they get up to speed?

Every organization is under increasing pressure to fly the plane while it’s being upgraded, but as we get to the point where there are multiple options on how to fly and upgrade, that open nature allows optionality for the future. The optionality enabled by an embrace of open standards and formats is becoming a critical component organizations are increasingly prioritizing in their strategies.

3. Be multi-cloud ready

There are three types of data and AI-driven organizations: those who are already multi-cloud, those who are becoming multi-cloud, and those that are on the fence about multi-cloud. In fact, Accenture went on to predict multi-cloud as their number 4 cloud trend for 2021 and beyond. There are many drivers for a multi-cloud approach, such as the ability to deliver new capabilities with cloud-specific best-of-breed tools, mergers and acquisitions, and requirements of doing business like regulations, customer cloud-specific demands, etc. But one of the biggest drivers is economic leverage. As cloud adoption grows and data grows, for many, spending on cloud infrastructure will be one of the largest line items. As organizations think about a multi-cloud architecture, two things roll up to the top as requirements. First, the end-user experience needs to be the same. Data leaders don’t want end-users to think about how to manage data, run analytics or build models separately across cloud providers. Second, in this pursuit of consistency, they don’t want some watered-down capability either. There’s a lot of investments that cloud providers are making in their infrastructure. And successful organizations recognize the need to ensure that, as they operate on each cloud, they deeply integrate with them across performance, capabilities, security, and billing. This is pretty hard to get right.

4. Simplify the data architecture

Productivity and efficiency are critical. Ultimately any modernization efforts are aimed at simplifying architectures as a means to increase productivity, which has a domino effect on organizations’ ability to get to new insights, build data products and deliver innovations faster. Organizations want their data teams to focus on solving business problems and creating new opportunities, not just managing infrastructure or reporting the news. To give you an example, Google published “Hidden Technical Debt in Machine Learning Systems,” outlining the tax associated with building ML products. Ultimately the findings concluded that data teams spend more time on everything else from data curation, management, and pipelines than the actual ML code, which is what ultimately will move the business forward.

This begs the question: how can data teams automate as much as possible and spend more time on the things that will move the needle? Many organizations have engineers who love to build everything. But the questions you want to ask are: is building everything yourself the right approach? How do you focus on your core strength and competitive advantage? The fundamental needs of any organization aren’t that unique; in fact, many are on the same journey, and third-party solutions are becoming incredibly effective at automating turnkey tasks. Ask yourself: how much is it worth to lower your overall TCO and be able to move faster? Or as Habsah Nordin, Head of Enterprise Data at PETRONAS, puts it, “It’s not about how sophisticated your technology stack is. The focus should be: Will it help create the most value from the data you have?”

Why are so many struggling if it is this simple?

The answer: 30+ years of fragmented, polarizing legacy tech stacks that just keep getting bigger and more complex. The figure below is a simplified picture of the reality for many. In fact, only 13% of organizations are actually succeeding with their data strategies, and those that do owe it to their focus on getting the foundations of sound data management and architecture right.

Data infrastructure is too complicated, with most organizations landing all of the data first and foremost in a data lake, but then to make it usable they have to build four separate siloed stacks.

Most organizations land all of the data first and foremost in a data lake, but to make it usable, they have to build four separate siloed stacks. The red dots represent customer and product data that must be copied and moved around these different systems. The root of this complexity comes from the fact that there are two separate approaches that are at odds with each other.

On one hand, you have data lakes, which are open, and on the other hand you have data warehouses that are proprietary. They’re not really compatible. One is primarily based on Python and Java. The other is primarily based on SQL, and these two worlds don’t mix and match very well. You also have incompatible security and governance models, so you have to manage security, governance and control for files in the data lake and for tables and columns in the data warehouse. They are like two magnets that, instead of coming together, repel each other, making it nearly impossible for the broader population of organizations to build on the four habits outlined above. Even the original architect of the data warehouse, Bill Inmon, recognizes that the current state of affairs outlined in the image above is not what’s going to unlock the next decade-plus of innovation.

The great convergence of lakes and warehouses

As organizations think about their approach, there are only two paths, a lake-first or warehouse-first. Let’s first explore data warehouses, which have been available for decades. They’re fantastic for business analytics and rearview mirror data analysis that focuses on reporting the news. But they’re not good for advanced analytics capabilities, and working with them gets quite complex when data teams are forced to move all of that data into a data lake just to drive new use cases. Furthermore, data warehouses tend to be costly as you try to scale. Data lakes, where the majority of the world’s data resides today, helped solve many of these challenges. Over the years, a bunch of great water analogies emerged around data lakes like data streams, data rivers, data reservoirs that support ML and AI natively. But they don’t do a good job supporting some of those core business intelligence (BI) use cases, and they are missing the data quality and data governance pieces that data warehouses encompass. Data lakes too often become data swamps. As a result, we are now seeing the convergence between lakes and warehouses with the rise of the data lakehouse architecture.

Lakehouse combines the best of both data warehouses and data lakes with a lake-first approach (see FAQs). If your data is already in the lake, why migrate it out and confine it to a data warehouse…and then be miserable when trying to execute on both AI and analytics use cases? Instead, with lakehouse architecture, organizations can begin building on the four habits seen in successful data and AI-driven organizations. By having a lake-first approach that unlocks all the organizational data, and is supported by easy-to-use, automated, and auditable tooling, an AI future becomes increasingly possible. Organizations also gain optionality that was once thought to be a fairy tale. True lakehouse architecture is built on open standards and formats (and the Databricks platform is built on Delta Lake, MLflow, and Apache Spark™), which empowers organizations with the ability to take advantage of the widest range of existing and future technology, as well as access to a vast pool of talent. This optionality also extends to being multi-cloud ready by not only gaining leverage but also ensuring a consistent experience for your users with one data platform regardless of which data resides with which cloud provider. Lastly, simplicity: it goes without saying that if you can reduce the complexity of two uniquely distinct tech stacks that are fundamentally built for separate outcomes, a simplified tech landscape becomes absolutely possible.

What you should hopefully take away is that you’re not alone on this journey, and there are great examples of organizations across every industry and geo that are making progress on simplifying their data, analytics, and AI platforms in order to become data-driven innovation hubs. Check out the Enabling Data and AI at Scale strategy guide to learn more about best practices for building data-driven organizations, as well as the latest on the 2021 Gartner Magic Quadrants (MQs), where Databricks is the only cloud-native vendor to be named a leader in both the Cloud Database Management Systems and the Data Science and Machine Learning Platforms MQs.

--

Try Databricks for free. Get started today.


My First 90 Days as a Databricks Engineering Leader


I recently joined Databricks as Site Lead for the Seattle office and lead engineering for the Partner Platform team. This career move builds off of my 18+ years in the data and platforms space. I started my career at Microsoft on the Excel team and most recently was VP of Engineering & Product for the Data Management team at Tableau. From working on pivot tables to visual data prep, I have enjoyed helping customers make better decisions through data.

So, why did I join Databricks? As we ramp up and expand our presence in the Seattle area, I spend a lot of time talking to future Bricksters (candidates) – this is one of the questions I hear the most. The short answer – Databricks has the right ingredients to truly democratize data and AI. Databricks has a strong history of innovation and execution, has attracted top talent in the field and has the customer reach to realize this vision. There is a lot of excitement and positive momentum, so I thought I’d explain how I approached my decision to join Databricks and reflect on the experience of my first 90 days here.

Inspiring mission to simplify and democratize data

Databricks is on a mission to simplify and democratize data and AI, helping data teams solve the hardest enterprise data challenges. With the Databricks Lakehouse Platform, customers can unify all their data, analytics and AI workloads on one simple platform. This is so powerful! In my short time here, I’ve learned about incredible customer stories like Regeneron and Shell, and the impact these customers are having on the world is truly inspirational.

With Databricks, Regeneron has accelerated drug target identification, increasing overall productivity and getting new drugs out faster and cheaper. If this pandemic has taught us anything, it’s that rapid medical innovation is vital, so helping companies like Regeneron process data faster can ultimately save lives. Shell is using Databricks to help deliver innovative energy solutions for a cleaner world. As a mom of two young boys, protecting our environment for future generations is so important.

These customer stories are not relegated to a specific industry or sector. We have the opportunity to help customers across various industries – from life sciences to non-profits, finance to technology and everything in between. Working at a company where you believe in the mission and get to be part of solving some of the larger problems affecting society is inspiring and unifying.

Collaborative, humble-smart co-workers

Working to solve challenging problems is way more fun and effective when you like who you work with. This brings me to one of the best parts of Databricks: the people! All of my colleagues and teammates are really knowledgeable about their areas and passionate about our customers and the opportunity for impact. It’s been really exciting to work with top talent across our various business functions. Our senior leaders truly care; many of them are co-founders and have shepherded the company from its infancy to a period of rapid growth. Whether it’s my manager Reynold (one of the co-founders of Databricks and the top contributor to Apache Spark), Vinod, our SVP of Engineering, or my engineering peers, including:

  • Lei, engineering leader for the visualization team
  • Michalis, engineering leader for the query processing and storage team
  • Sridhar, engineering leader for the SQL Gateway team
  • Arik, founder of Redash and architect
  • Shant, architect for Databricks SQL

This is a talented team I enjoy working with and learning from every day. I often tell my colleagues and teams that I spend more of my awake hours with my team than my two kids, so I want to make sure I work with collaborative, friendly, and smart colleagues.

Culture and speed of innovation

A great mission, great people and a proven culture of innovation are the secret sauce to success. Databricks has a history of innovation. It was started by the original creators of Apache Spark, but it has not stopped there. The company is constantly looking for ways to out-innovate itself. In a short time, the company has created MLflow, Delta Lake, Delta Sharing, Unity Catalog, Databricks SQL…the list goes on. Our teams are solving hard problems every day. Whether it’s launching and monitoring millions of virtual machines per day that process exabytes of data or building user experiences to simplify the complexity of data science, there’s opportunity here to sink your teeth into solving hard technical problems across the software stack.

In the 90 days I’ve been here, we launched the Seattle site, launched Partner Connect, set the official data warehousing performance record, announced GA of Databricks SQL and were named a leader in the Gartner Magic Quadrant for DBMS and DSML, making Databricks the only cloud-native vendor to achieve this. Wow, it’s been so humbling to be part of these incredible achievements. If this is what the first three months look like, I am eager to see what we accomplish as a team in 2022!

In my time here, I’ve also been impressed with our strong engineering culture: well thought-through design docs, investment in dev tooling and productivity, and strong end-to-end ownership across teams. Of course, fast growth doesn’t come without growing pains. For instance, our onboarding needs to constantly be revamped because we’re growing so quickly! What worked 6 months ago doesn’t work for an engineering team that has now more than doubled in size. Even with a great document culture, there’s still tribal knowledge, and it takes time to figure out who knows what. There’s so much opportunity to be part of growing the team, the product and the culture on this rocket ship.

We are hiring!

And yes – we’re hiring =) Our Seattle site is very strategic to our growth and we have core efforts being developed out of Seattle. You can read more in our blog post here.

Despite the challenges of COVID, we are developing an exciting culture. Before the holidays, our Seattle Bricksters got together to build gingerbread houses and enjoyed boba with members of the senior leadership team. Last month we had our first industry networking event in Bellevue, hosting over 50 engineering leaders who got to learn more about Databricks from our CTO Matei and SVP of engineering Vinod. Databricks and our leadership is committed to growing our presence locally.

I can’t wait to look back on this blog post of my first 90 days and see what else we will achieve down the road!

--

Try Databricks for free. Get started today.

The post My First 90 Days as a Databricks Engineering Leader appeared first on Databricks.

Delta Sharing Release 0.3.0


We are excited for the release of Delta Sharing 0.3.0, which introduces several key improvements and bug fixes, including the following features:

  • Delta Sharing is now available for Azure Blob Storage and Azure Data Lake Gen2: You can now share Delta Tables on Azure Blob Storage and Azure Data Lake Gen2 (#56, #59).
  • Token expiration time: An optional expirationTime field has been added to the Delta Sharing profile to specify a token expiration time (#77); a sample profile is shown after this list.
  • Query limit parameters: The Python Connector now accepts an optional limit parameter to allow fetching a subset of rows when using the load_as_pandas function (#76). Similarly, users can also send a limitHint parameter when submitting a sharing query using the Apache Spark™ Connector (#55).
  • Improved API to list all tables in a share: A new API has been added for listing all tables in a share that supports pagination (#63, #66, #67, #88).
  • Automatic Refresh of Pre-signed URLs: A new cache has been added to the Apache Spark driver that automatically refreshes pre-signed file URLs for long-running queries (#69).
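
For reference, a recipient profile is a small JSON file per the Delta Sharing protocol; a minimal sketch that includes the new optional field might look like the following (the endpoint and token values are placeholders):

{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<token>",
  "expirationTime": "2021-11-12T00:12:29.0Z"
}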

In this blog post, we will go through some of the great improvements in this release.

Delta Sharing on Azure Blob Storage and Azure Data Lake Gen2

Azure Blob Storage has proven to be a cost-effective solution for storing Delta Tables in the Azure cloud. New to this release, you can now share Delta Tables stored on Azure Blob Storage and Azure Data Lake Gen2 in the reference implementation of Delta Sharing Server.

With Delta Sharing 0.3.0, you can now share Delta tables stored on Azure Blob Storage and Azure Data Lake Gen2.

Delta Sharing on Azure Blob Storage example

Sharing Delta Tables on Azure Blob Storage is easier than ever! For example, to share a Delta Table called classics in an Azure Blob container called movie_recommendations, you can simply update the Delta Sharing profile with the location of the Delta table on Azure Blob Storage:

delta-sharing-profile.yaml

# Config shares/schemas/tables to share
shares:
- name: "my_share"
 schemas:
 - name: "movies"
   tables:
   - name: "classics"
     location: "wasbs://movie_recommendations@delta_sharing.blob.core.windows.net/delta/classics"

Delta Sharing on Azure Data Lake Storage Gen2 example

For those who would prefer to leverage the built-in hierarchical directory structure and fine-grained access controls, you can share Delta Tables on Azure Data Lake Storage Gen2 as well. Simply update the Delta Sharing profile with the location of your Delta table on Azure Data Lake Storage Gen2, and the Delta Sharing server will automatically process the data for a Delta Sharing query:

delta-sharing-profile.yaml

# Config shares/schemas/tables to share
shares:
- name: "my_share"
 schemas:
 - name: "movies"
   tables:
   - name: "comedy_heaven"
     location: "abfss://movie_recommendations@delta_sharing.dfs.core.windows.net/delta/comedy_heaven"

Query limit parameters

Sometimes it might be helpful to explore just a few records in a shared dataset. Rather than loading the entire dataset into memory from blob storage, you can now add a limit hint in your Delta Sharing queries. The query limit will be pushed down and sent to the Delta Sharing server as a limit hint.

For example, to load a shared Delta Table as a Pandas DataFrame and limit the number of rows to 100, you can now add the limit as a parameter to the load_as_pandas() function call:

import delta_sharing

from IPython.display import display

profile_file = "~/wgirten/delta-sharing-profile.yaml" 

client = delta_sharing.SharingClient(profile_file)
 
table_url = profile_file + "#my_share.movies.comedy_heaven"

# Add a query limit to limit amount of data to only 100 rows
sample_pdf = delta_sharing.load_as_pandas(table_url, limit=100)

display(sample_pdf)

Similarly, if the Apache Spark Connector finds a LIMIT clause in your Spark SQL query, it will try to push down the limit to the server to request less data:


-- Create a new table, specifying the location to the share as a table path
CREATE TABLE my_comedy_movies 
USING deltaSharing 
LOCATION '~/wgirten/delta-sharing-profile.yaml#my_share.movies.comedy_heaven';

-- Display the first 100 rows by passing a limit hint in the query
SELECT * FROM my_comedy_movies LIMIT 100;

Improved API for listing all tables

Included in this release is a new and improved API for listing all the tables under all schemas in a share. The new API supports pagination similar to other APIs.

For example, to list all the tables in the Delta share my_share, you can simply send a GET request to the /shares/{share_name}/all-tables endpoint on the sharing server.

curl http://localhost/shares/my_share/all-tables -H "Authorization: Bearer "
{
  "items": [
    {
      "share": "my_share",
      "schema": "movies",
      "name": "classics"
    },
    {
      "share": "my_share",
      "schema": "movies",
      "name": "comedy_heaven"
    }
  ],
  "nextPageToken": "..."
}

Automatic refresh of pre-signed URLs

When reading a Delta Sharing table, the Delta Sharing server automatically generates the pre-signed file URLs for a Delta Table. However, for long-running queries, the pre-signed file URLs may expire before the sharing client has a chance to read the files. This release adds a pre-signed URL cache in the Spark driver, which automatically refreshes pre-signed file URLs inside of a background thread. Tasks running in Spark executors communicate to the Spark driver to fetch the latest pre-signed file URLs.

What’s next

We are already gearing up for our next release of Delta Sharing. One of the major features we are currently working on is Google Cloud Storage support. You can track all the upcoming releases and planned features in the GitHub milestones.



Credits
We’d like to extend a special thanks for the contributions to this release to Denny Lee, Felix Cheung, Lin Zhou, Matei Zaharia, Shixiong Zhu, Will Girten, Xiaotong Sun, Yuhong Chen, kohei-tosshy, and William Chau.

--

Try Databricks for free. Get started today.

The post Delta Sharing Release 0.3.0 appeared first on Databricks.

Beyond LDA: State-of-the-art Topic Models With BigARTM


Introduction

This post follows up on the series of posts in Topic Modeling for text analytics. Previously, we looked at the LDA (Latent Dirichlet Allocation) topic modeling library available within MLlib in PySpark. While LDA is a very capable tool, here we look at a more scalable and state-of-the-art technique called BigARTM. LDA is based on a two-level Bayesian generative model that assumes a Dirichlet distribution for the topic and word distributions. BigARTM (BigARTM GitHub and https://bigartm.org) is an open source project based on Additive Regularization of Topic Models (ARTM), which is a non-Bayesian regularized model and aims to simplify the topic inference problem. BigARTM is motivated by the premise that the Dirichlet prior assumptions conflict with the notion of sparsity in our document topics, and that trying to account for this sparsity leads to overly-complex models. Here, we will illustrate the basic principles behind BigARTM and how to apply it to the Daily Kos dataset.

Why BigARTM over LDA?

As mentioned above, BigARTM is a probabilistic non-Bayesian approach, as opposed to the Bayesian LDA approach. According to Konstantin Vorontsov’s and Anna Potapenko’s paper on additive regularization, the assumptions of a Dirichlet prior in LDA do not align with the real-life sparsity of topic distributions in a document. BigARTM does not attempt to build a fully generative model of text, unlike LDA; instead, it chooses to optimize certain criteria using regularizers. These regularizers do not require any probabilistic interpretation. As a result, the formulation of multi-objective topic models is easier with BigARTM.

Overview of BigARTM

Problem statement

We are trying to learn a set of topics from a corpus of documents. The topics would consist of a set of words that make semantic sense. The goal here is that the topics would summarize the set of documents. In this regard, let us summarize the terminology used in the BigARTM paper:

D = the collection of texts; each document ‘d’ is an element of D, and each document is a collection of ‘nd’ words (w1, w2, …, w_nd)

W = the vocabulary (the set of distinct words)

T = a topic; a document ‘d’ is supposed to be made up of a number of topics

We sample from the probability space spanned by words (W), documents (D) and topics (T). The words and documents are observed, but topics are latent variables.

The term ‘ndw’ refers to the number of times the word ‘w’ appears in the document ‘d’.

There is an assumption of conditional independence that each topic generates the words independent of the document. This gives us

p(w|t) = p(w|t,d)

The problem can be summarized by the following equation:
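
p(w|d) = Σ_{t ∈ T} p(w|t) p(t|d)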

What we are really trying to infer is the probabilities within the summation term, i.e., the mixture of topics in a document, p(t|d), and the mixture of words in a topic, p(w|t). Each document can be considered to be a mixture of domain-specific topics and background topics. Background topics are those that show up in every document and have a rather uniform per-document distribution of words. Domain-specific topics tend to be sparse, however.

Stochastic factorization

Through stochastic matrix factorization, we infer the probability product terms in the equation above. The product terms are now represented as matrices. Keep in mind that this process results in non-unique solutions as a result of the factorization; hence, the learned topics would vary depending on the initialization used for the solutions.

We create a data matrix F = [fwd] of dimension W×D, where each element fwd is the count of word ‘w’ in document ‘d’ normalized by the number of words in document ‘d’. The matrix F can be stochastically decomposed into two matrices Φ and θ so that

F ≈ [Φ] [θ]

[Φ] corresponds to the matrix of word probabilities for topics, W×T

[θ] corresponds to the matrix of topic probabilities for the documents, T×D

All three matrices are stochastic and their columns are given by:

[Φ]t, which represents the words in a topic, and

[θ]d, which represents the topics in a document, respectively.

The number of topics is usually far smaller than the number of documents or the number of words.

LDA

In LDA, the matrices Φ and θ have columns [Φ]t and [θ]d that are assumed to be drawn from Dirichlet distributions with hyperparameters β and α respectively.

β = [βw], a hyperparameter vector with one entry per word

α = [αt], a hyperparameter vector with one entry per topic

Likelihood and additive regularization

The log-likelihood we would like to maximize to obtain the solution is given by the equations below. This is the same as the objective function in Probabilistic Latent Semantic Analysis (PLSA) and will be the starting point for BigARTM.
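
L(Φ,θ) = ln Π_{d ∈ D} Π_{w ∈ d} p(w|d)^n_dw = Σ_{d ∈ D} Σ_{w ∈ d} n_dw ln Σ_{t ∈ T} p(w|t) p(t|d) → max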

We are maximizing the log of the product of the joint probability of every word in each document here. Applying Bayes’ theorem results in the summation terms seen on the right side of the equation above. Now, for BigARTM, we add ‘r’ regularizer terms, which are the regularizer coefficients τi multiplied by functions of Φ and θ:
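
R(Φ,θ) = Σ_{i=1..r} τi Ri(Φ,θ)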

where Ri is a regularizer function that can take a few different forms depending on the type of regularization we seek to incorporate. The two common types are:
  1. Smoothing regularization
  2. Sparsing regularization

In both cases, we use the KL divergence as the regularizer function. We can combine these two regularizers to meet a variety of objectives. Some of the other types of regularization techniques are decorrelation regularization and coherence regularization (see http://machinelearning.ru/wiki/images/4/47/Voron14mlj.pdf, eq. 34 and eq. 40). The final objective function then becomes the following:

L(Φ,θ) + R(Φ,θ) → max

Smoothing regularization

Smoothing regularization is applied to smooth out background topics so that they have a uniform distribution relative to the domain-specific topics. For smoothing regularization, we

  1. Minimize the KL divergence between the columns [Φ]t and a fixed distribution β
  2. Minimize the KL divergence between the columns [θ]d and a fixed distribution α
  3. Sum the two terms from (1) and (2) to get the regularizer term

We want to minimize the KL divergence here to make our topic and word distributions as close as possible to the desired α and β distributions, respectively.

Sparsing strategy for fewer topics

To get fewer topics we employ the sparsing strategy. This helps us to pick out domain-specific topic words as opposed to the background topic words. For sparsing regularization, we want to:

  1. Maximize the KL divergence between the columns [Φ]t and a uniform distribution
  2. Maximize the KL divergence between the columns [θ]d and a uniform distribution
  3. Sum the two terms from (1) and (2) to get the regularizer term

We are seeking to obtain word and topic distributions with minimum entropy (or less uncertainty) by maximizing the KL divergence from a uniform distribution, which has the highest possible entropy (highest uncertainty). This gives us ‘peakier’ distributions for our topics and words.

Model quality

The ARTM model quality is assessed using the following measures:

  1. Perplexity: This is inversely proportional to the likelihood of the data given the model. The smaller the perplexity, the better the model; however, a perplexity value of around 10 has been experimentally shown to give realistic documents.
  2. Sparsity: This measures the percentage of elements that are zero in the Φ and θ matrices.
  3. Ratio of background words: A high ratio of background words indicates model degradation and is a good stopping criterion. This could be due to too much sparsing or elimination of topics.
  4. Coherence: This is used to measure the interpretability of a model. A topic is considered coherent if its most frequent words tend to appear together in the documents. Coherence is calculated using Pointwise Mutual Information (PMI). The coherence of a topic is measured as follows:
    i. Get the ‘k’ most probable words for the topic (usually set to 10)
    ii. Compute the Pointwise Mutual Information (PMI) for all pairs of words from the word list in step (i)
    iii. Compute the average of all the PMIs
  5. Kernel size, purity and contrast: A kernel is defined as the subset of words in a topic that separates that topic from the others, i.e., Wt = {w: p(t|w) > δ}, where δ is typically set to about 0.25. The kernel size is set to be between 20 and 200. Now the terms purity and contrast are defined as:
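
Purity(t) = Σ_{w ∈ Wt} p(w|t)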

which is the sum of the probabilities of all the words in the kernel for a topic; contrast is defined analogously as the average of p(t|w) over the words in the kernel.

For a topic model, higher values are better for both purity and contrast.

Using the BigARTM library

Data files

The BigARTM library is available from the BigARTM website and the package can be installed via pip. Download the example data files and unzip them as shown below. The dataset we are going to use here is the Daily Kos dataset.

wget https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt.gz

wget https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt

gunzip docword.kos.txt.gz

LDA

We will start off by looking at their implementation of LDA, which requires fewer parameters and hence acts as a good baseline. Use the ‘fit_offline’ method for smaller datasets and ‘fit_online’ for larger datasets. You can set the number of passes through the collection or the number of passes through a single document.

import artm

batch_vectorizer = artm.BatchVectorizer(data_path='.', data_format='bow_uci',
                                        collection_name='kos', target_folder='kos_batches')

lda = artm.LDA(num_topics=15, alpha=0.01, beta=0.001, cache_theta=True,
               num_document_passes=5, dictionary=batch_vectorizer.dictionary)

lda.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)

top_tokens = lda.get_top_tokens(num_tokens=10)
for i, token_list in enumerate(top_tokens):
    print('Topic #{0}: {1}'.format(i, token_list))

Topic #0: ['bush', 'party', 'tax', 'president', 'campaign', 'political', 'state', 'court', 'republican', 'states']

Topic #1: ['iraq', 'war', 'military', 'troops', 'iraqi', 'killed', 'soldiers', 'people', 'forces', 'general']

Topic #2: ['november', 'poll', 'governor', 'house', 'electoral', 'account', 'senate', 'republicans', 'polls', 'contact']

Topic #3: ['senate', 'republican', 'campaign', 'republicans', 'race', 'carson', 'gop', 'democratic', 'debate', 'oklahoma']

Topic #4: ['election', 'bush', 'specter', 'general', 'toomey', 'time', 'vote', 'campaign', 'people', 'john']

Topic #5: ['kerry', 'dean', 'edwards', 'clark', 'primary', 'democratic', 'lieberman', 'gephardt', 'john', 'iowa']

Topic #6: ['race', 'state', 'democrats', 'democratic', 'party', 'candidates', 'ballot', 'nader', 'candidate', 'district']

Topic #7: ['administration', 'bush', 'president', 'house', 'years', 'commission', 'republicans', 'jobs', 'white', 'bill']

Topic #8: ['dean', 'campaign', 'democratic', 'media', 'iowa', 'states', 'union', 'national', 'unions', 'party']

Topic #9: ['house', 'republican', 'million', 'delay', 'money', 'elections', 'committee', 'gop', 'democrats', 'republicans']

Topic #10: ['november', 'vote', 'voting', 'kerry', 'senate', 'republicans', 'house', 'polls', 'poll', 'account']

Topic #11: ['iraq', 'bush', 'war', 'administration', 'president', 'american', 'saddam', 'iraqi', 'intelligence', 'united']

Topic #12: ['bush', 'kerry', 'poll', 'polls', 'percent', 'voters', 'general', 'results', 'numbers', 'polling']

Topic #13: ['time', 'house', 'bush', 'media', 'herseth', 'people', 'john', 'political', 'white', 'election']

Topic #14: ['bush', 'kerry', 'general', 'state', 'percent', 'john', 'states', 'george', 'bushs', 'voters']

You can extract and inspect the Φ and θ matrices, as shown below.

phi = lda.phi_   # size is number of words in vocab x number of topics

theta = lda.get_theta() # number of rows correspond to the number of topics

print(phi)
topic_0       topic_1  ...      topic_13      topic_14

sawyer        3.505303e-08  3.119175e-08  ...  4.008706e-08  3.906855e-08

harts         3.315658e-08  3.104253e-08  ...  3.624531e-08  8.052595e-06

amdt          3.238032e-08  3.085947e-08  ...  4.258088e-08  3.873533e-08

zimbabwe      3.627813e-08  2.476152e-04  ...  3.621078e-08  4.420800e-08

lindauer      3.455608e-08  4.200092e-08  ...  3.988175e-08  3.874783e-08

...                    ...           ...  ...           ...           ...

history       1.298618e-03  4.766201e-04  ...  1.258537e-04  5.760234e-04

figures       3.393254e-05  4.901363e-04  ...  2.569120e-04  2.455046e-04

consistently  4.986248e-08  1.593209e-05  ...  2.500701e-05  2.794474e-04

section       7.890978e-05  3.725445e-05  ...  2.141521e-05  4.838135e-05

loan          2.032371e-06  9.697820e-06  ...  6.084746e-06  4.030099e-08

print(theta)
             1001      1002      1003  ...      2998      2999      3000

topic_0   0.000319  0.060401  0.002734  ...  0.000268  0.034590  0.000489

topic_1   0.001116  0.000816  0.142522  ...  0.179341  0.000151  0.000695

topic_2   0.000156  0.406933  0.023827  ...  0.000146  0.000069  0.000234

topic_3   0.015035  0.002509  0.016867  ...  0.000654  0.000404  0.000501

topic_4   0.001536  0.000192  0.021191  ...  0.001168  0.000120  0.001811

topic_5   0.000767  0.016542  0.000229  ...  0.000913  0.000219  0.000681

topic_6   0.000237  0.004138  0.000271  ...  0.012912  0.027950  0.001180

topic_7   0.015031  0.071737  0.001280  ...  0.153725  0.000137  0.000306

topic_8   0.009610  0.000498  0.020969  ...  0.000346  0.000183  0.000508

topic_9   0.009874  0.000374  0.000575  ...  0.297471  0.073094  0.000716

topic_10  0.000188  0.157790  0.000665  ...  0.000184  0.000067  0.000317

topic_11  0.720288  0.108728  0.687716  ...  0.193028  0.000128  0.000472

topic_12  0.216338  0.000635  0.003797  ...  0.049071  0.392064  0.382058

topic_13  0.008848  0.158345  0.007836  ...  0.000502  0.000988  0.002460

topic_14  0.000655  0.010362  0.069522  ...  0.110271  0.469837  0.607572

ARTM

This API provides the full functionality of ARTM; however, with this flexibility comes the need to manually specify metrics and parameters.

# Reuse the dictionary from the batch vectorizer when configuring and initializing the model
dictionary = batch_vectorizer.dictionary

model_artm = artm.ARTM(num_topics=15, cache_theta=True,
                       scores=[artm.PerplexityScore(name='PerplexityScore', dictionary=dictionary)],
                       regularizers=[artm.SmoothSparseThetaRegularizer(name='SparseTheta', tau=-0.15)])

model_artm.scores.add(artm.SparsityPhiScore(name='SparsityPhiScore'))
model_artm.scores.add(artm.TopicKernelScore(name='TopicKernelScore', probability_mass_threshold=0.3))
model_artm.scores.add(artm.TopTokensScore(name='TopTokensScore', num_tokens=6))

model_artm.regularizers.add(artm.SmoothSparsePhiRegularizer(name='SparsePhi', tau=-0.1))
model_artm.regularizers.add(artm.DecorrelatorPhiRegularizer(name='DecorrelatorPhi', tau=1.5e+5))

model_artm.num_document_passes = 1
model_artm.initialize(dictionary=dictionary)
model_artm.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15)

There are a number of metrics available, depending on what was specified during the initialization phase. You can extract any of the metrics using the following syntax.
model_artm.scores
[PerplexityScore, SparsityPhiScore, TopicKernelScore, TopTokensScore]

model_artm.score_tracker['PerplexityScore'].value

[6873.0439453125, 2589.998779296875, 2684.09814453125, 2577.944580078125, 2601.897216796875,
 2550.20263671875, 2531.996826171875, 2475.255126953125, 2410.30078125, 2319.930908203125,
 2221.423583984375, 2126.115478515625, 2051.827880859375, 1995.424560546875, 1950.71484375]
 

You can use the model_artm.get_phi() and model_artm.get_theta() methods to get the Φ and θ matrices, respectively. You can also extract the top tokens in each topic for the corpus of documents.

for topic_name in model_artm.topic_names:

    print(topic_name + ': ',model_artm.score_tracker['TopTokensScore'].last_tokens[topic_name])

topic_0:  ['party', 'state', 'campaign', 'tax', 'political', 'republican']

topic_1:  ['war', 'troops', 'military', 'iraq', 'people', 'officials']

topic_2:  ['governor', 'polls', 'electoral', 'labor', 'november', 'ticket']

topic_3:  ['democratic', 'race', 'republican', 'gop', 'campaign', 'money']

topic_4:  ['election', 'general', 'john', 'running', 'country', 'national']

topic_5:  ['edwards', 'dean', 'john', 'clark', 'iowa', 'lieberman']

topic_6:  ['percent', 'race', 'ballot', 'nader', 'state', 'party']

topic_7:  ['house', 'bill', 'administration', 'republicans', 'years', 'senate']

topic_8:  ['dean', 'campaign', 'states', 'national', 'clark', 'union']

topic_9:  ['delay', 'committee', 'republican', 'million', 'district', 'gop']

topic_10:  ['november', 'poll', 'vote', 'kerry', 'republicans', 'senate']

topic_11:  ['iraq', 'war', 'american', 'administration', 'iraqi', 'security']

topic_12:  ['bush', 'kerry', 'bushs', 'voters', 'president', 'poll']

topic_13:  ['war', 'time', 'house', 'political', 'democrats', 'herseth']

topic_14:  ['state', 'percent', 'democrats', 'people', 'candidates', 'general']

Conclusion

LDA tends to be the starting point for topic modeling for many use cases. In this post, BigARTM was introduced as a state-of-the-art alternative. The basic principles behind BigARTM were illustrated along with the usage of the library. I would encourage you to try out BigARTM and see if it is a good fit for your needs!

Please try the attached notebook.

--

Try Databricks for free. Get started today.

The post Beyond LDA: State-of-the-art Topic Models With BigARTM appeared first on Databricks.


Why We Invested in Labelbox: Streamline Unstructured Data Workflows in a Lakehouse


Last month, Databricks announced the creation of Databricks Ventures, a strategic investment vehicle to foster the next generation of innovation and technology harnessing the power of data and AI. We launched with the Lakehouse Fund, inspired by the growing adoption of the lakehouse architecture, which will support early and growth-stage companies extending the lakehouse ecosystem or powered by lakehouse. That’s why today, I’m thrilled to share Databricks Ventures’ first announced investment: Labelbox.

Labelbox is a leading training data platform for machine learning applications. Rather than requiring companies to build their own expensive and incomplete homegrown tools, Labelbox created a collaborative training data platform that acts as a command center for data scientists to collaborate with dispersed annotation teams.

Together, Databricks and Labelbox deliver an ideal environment for unstructured data workflows. Users can simply take unstructured data (images, video, text, geospatial and more) from their data lake, annotate it with Labelbox, and then perform data science in Databricks.

Earlier this year, Labelbox launched a connector to Databricks so customers can use the Labelbox training data platform to quickly produce structured data from unstructured data, and train AI on unstructured data in the Databricks Lakehouse. Labelbox is also a launch partner for Databricks Partner Connect, which offers customers an even easier way to configure and integrate Databricks with Labelbox. We have been impressed by the Labelbox team and the company’s momentum since we first started working with them. Investing in Labelbox is a natural next step and solidifies our shared commitment to delivering streamlined, powerful capabilities for joint customers to manage unstructured data workflows. Databricks Ventures is excited to support Labelbox and our rapidly growing number of joint customers even more closely in the future.

Check out the Labelbox connector for Databricks.

--

Try Databricks for free. Get started today.

The post Why We Invested in Labelbox: Streamline Unstructured Data Workflows in a Lakehouse appeared first on Databricks.

Taming JavaScript Exceptions With Databricks


This post is a part of our blog series on our frontend work. You can see the previous posts on “Simplifying Data + AI, One Line of TypeScript at a Time” and “Building the Next Generation Visualization Tools at Databricks.”

At Databricks, we take the quality of our customer experience very seriously. As such, we track many metrics for product reliability. One metric we focus on is the percentage of sessions that see no JavaScript (JS) exceptions. Our goal is to keep this happy case above 99.9%, but historically, these issues have been tracked manually, which for many reasons wasn’t sufficient for keeping errors at bay.

Image: A JS exception in the wild on Databricks

In the past, we used Sentry to aggregate and categorize a variety of exceptions, including those from JS. Sentry both ingests the errors and, on the front end, aggregates sourcemaps to decode minified stack traces.

Image: An example Sentry issue reported by the Databricks UI

Using Databricks to track JS exceptions

While considering how we could better automate our exception tracking and, thus, decrease the number of issues being shipped out, we looked into extending Sentry. Unfortunately, we found that the effort required was high. As we looked into what Sentry was solving for our use case, we realized that Databricks’ products could largely accomplish the same tasks, with an easier path for extensibility.

Image: Diagram of our JS exception pipeline

First, Databricks is more than a data platform; it’s essentially a general-purpose computing and app infrastructure that sits on top of your data. This lets you create an ETL where you ingest all kinds of information and apply programmatic transformations, all from within the web product.

And once you’ve constructed that ETL, you can use the results to build dynamic dashboards, connect to third-party APIs or anything else. Databricks even has GUIs to orchestrate pipelines of tasks and handles alerting when anything fails.

With that in mind, our challenge was to build an internal, maintainable pipeline for our JS exceptions, with the goal of automatically creating tickets whenever we detected issues in staging or production.

Moving from Sentry to Databricks

Aggregating into Delta

The first step in constructing our ETL was to find our source of truth. This was our usage_logs table, which contains a wide variety of different logs and metrics for customer interactions with the product. Every JS exception was stored here with the minified stack traces.

We started by building a Databricks Notebook to process our usage_logs. This table is gigantic and difficult to optimize, so querying it for exceptions can take thirty minutes or more. So, we aggregated the data we wanted into a standalone Delta Table, which enabled us to query and slice the data (approximately a year’s worth of exceptions) in seconds.
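
As a minimal sketch of this kind of aggregation, run from a notebook where spark is predefined (the usage_logs column names below are assumptions for illustration, not the real schema):

from pyspark.sql import functions as F

# Slice the JS exception events out of the large usage_logs table.
# The event type and column names here are illustrative placeholders.
js_exceptions = (
    spark.table("usage_logs")
    .where(F.col("eventType") == "jsException")
    .select("timestamp", "sessionId", "browser", "stackTrace")
)

# Persist the slimmed-down data as a standalone Delta table for fast querying.
(
    js_exceptions.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("js_exceptions_delta")
)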

Data enrichment: stack trace decoding

Critically, we needed to find a way to decode the minified stack traces in our usage_logs as a part of the ETL. This would let us know what file and line caused a given issue and take further steps to enrich the exception based on that knowledge.

An example minified stack, decoded as part of the Databricks ETL process, to enable JS error catching and handling.

Image: An example minified stack, with only some indication of where the problem was happening.

The first step here was to store our sourcemaps in an AWS S3 bucket as a part of our build. Databricks helpfully gives you the ability to mount S3 buckets into your workspace’s file system, which makes those sourcemaps easily accessible to our code.
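
For example, mounting such a bucket could look like the following (the bucket name and mount point are placeholders; dbutils is available in Databricks notebooks):

# Mount the sourcemap bucket into the workspace file system so notebooks can read the files.
dbutils.fs.mount("s3a://my-build-artifacts/sourcemaps", "/mnt/sourcemaps")

# Sanity check that the sourcemaps are visible.
display(dbutils.fs.ls("/mnt/sourcemaps"))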

Once we had the sourcemaps in S3, we had the ability to decode the stack traces on Databricks. This was done entirely in Databricks Notebooks, which have the ability to install Python libraries via pip. We installed the sourcemap package to handle the decode, then built a small Python script to evaluate a given stacktrace and fetch the relevant sourcemaps from the file system.

Image: An outline of how we decode stack traces within the Databricks product

Once we had that, we wrapped the script in a UDF so that we could run it directly from SQL queries in our notebooks! This gave us the ability to decode the stack trace and return the file that caused the error, the line and context of source code, and the decoded stack itself, all of which were saved in separate columns.
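
A simplified sketch of what such a UDF could look like, using the open source sourcemap package; the frame format, sourcemap file layout and mount path are assumptions for illustration, not the actual Databricks implementation:

import re
import sourcemap
from pyspark.sql.types import StringType

def decode_frame(minified_frame):
    # Expect frames like "https://.../app.min.js:10:2345" and pull out the file, line and column.
    match = re.search(r"([\w.-]+\.js):(\d+):(\d+)", minified_frame or "")
    if not match:
        return minified_frame
    js_file, line, column = match.group(1), int(match.group(2)), int(match.group(3))
    # Sourcemaps are assumed to sit next to the bundles under a mounted path.
    with open(f"/dbfs/mnt/sourcemaps/{js_file}.map") as f:
        index = sourcemap.load(f)
    token = index.lookup(line=line - 1, column=column - 1)  # lookup is zero-based
    return f"{token.src}:{token.src_line + 1}:{token.src_col + 1}"

# Register the function so it can be called directly from SQL in a notebook
# (spark is predefined in Databricks notebooks).
spark.udf.register("decode_frame", decode_frame, StringType())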

Code ownership

Once we decoded the stack traces, we had high confidence on which file was responsible for each error and could use that to determine which team owned the issue. To do this, we used Github’s API to crawl the repository, find the nearest OWNERS file and map the owning team to a JIRA component.

We built this into another UDF and added it to our aggregator, so when an exception came in, it was pre-triaged to the correct team!

Databricks SQL dashboards

To gain visibility into what was going on in the product, we used Databricks SQL to build dashboards for high-level metrics. This helped us visualize trends and captured the fine-grain issues happening in the current release.

Image: A high-level dashboard for JS exceptions in the Databricks product

We also built dashboards for analyzing particular issues, which show error frequency, variations of the error and more. This, in effect, replaces Sentry’s UI, and we can augment it to provide whichever data is the most relevant to our company.

Image: Detailed dashboard for an individual JS exception in Databricks SQL

Ticketing

Once we had our ETL built and populated, we looked at the incident frequency in staging and production relative to the number of Databricks users in those environments. We decided that it made sense to automatically raise a JIRA ticket anytime an exception occurred in staging, while in production, we set the threshold at ten distinct sessions during a release.

This immediately raised dozens of tickets. The majority were in some way or another known but were all low enough impact that the team hadn’t tackled them. In aggregate, however, dozens of small tickets were greatly regressing our experience. Around this time, we calculated that 20% of sessions saw at least one error!

With all the data we could pull and enrich, our engineers were able to effectively jump right into a fix rather than wading through different services and logs to get the information they needed to act. As a result, we quickly burned down a large portion of our issues and got back above our 99.9% error-free goal.

Image: The current evolution of our exception tickets, with decoded stack traces and code context

Task orchestration with Jobs

When executing our pipeline, we have one notebook that handles the ETL and another that compares the state of the delta table to JIRA and opens any necessary issues. Running these requires some orchestration, but luckily, Databricks Jobs makes it easy to handle this.

Image: A job pipeline on Databricks

With Jobs, we can run those notebooks for staging and production in sequence. This is very easy to set up in the web GUI to handle routing of failures to our team’s alert inbox.

Final thoughts

Overall, the products we’ve been building at Databricks are incredibly powerful and give us the capability to build bespoke tracking and analytics for anything we’re working on. We’re using processes like these to monitor frontend performance, keep track of React component usage, manage dashboards for code migrations and much more.

Projects like this one present us with an opportunity to use our products as a customer would, to feel their pain and joy and to give other teams the feedback they need to make Databricks even better.

If working on a platform like this sounds interesting, we’re hiring! There’s an incredible variety of frontend work being done and being planned, and we could use your help. Come and join us!

--

Try Databricks for free. Get started today.

The post Taming JavaScript Exceptions With Databricks appeared first on Databricks.

Hunters and Databricks Ventures Partner for Advanced Security on the Lakehouse


Modern security teams must quickly detect, investigate and respond to threats to minimize their impact and better mitigate the risk to the organization. With the growth of modern IT infrastructure, organizations must process more data than ever before, across their entire environment. That requires an underlying data platform that can handle massive, diverse datasets at scale, with first-class streaming and machine learning capabilities at a predictable cost.

This is where the Databricks Lakehouse Platform, when combined with an industry-leading security operations center (SOC) platform like Hunters, becomes a key enabler of modern security use cases. The integrated solution transforms the visibility of a customer’s SOC into security events – on a unified, cloud-native platform across all data streams from the entire IT and security environment.

Hunters’ cloud-native SOC Platform is becoming the choice of modern security teams in the Fortune 500. Hunters’ customers have been looking to apply security analytics where the data resides – which is frequently in the customer’s data lake.

That’s why Databricks Ventures invested in Hunters’ Series C funding round to build a deeper partnership and tighter integration between Hunters and the Databricks Lakehouse. Modern security teams are increasingly turning to Hunters to solve security problems at scale, where legacy SIEM vendors or other data storage solutions are unable to keep up. Databricks has been very impressed with the Hunters team and the company’s traction among some of the most demanding enterprise customers. This deeper partnership will allow joint customers to gain a holistic picture into their security posture by combining both structured and unstructured datasets in Delta Lake, which will house a myriad of datasets from endpoint logs to firewalls to operational systems data. These combined Delta data sets can then be leveraged for analytical and machine learning workloads with a security lens.

Databricks is excited to be partnering closely with Hunters and enabling their SOC platform – and all Hunters and Databricks customers – to fully leverage the power of the Databricks Lakehouse platform. We look forward to additional announcements later this calendar year.

--

Try Databricks for free. Get started today.

The post Hunters and Databricks Ventures Partner for Advanced Security on the Lakehouse appeared first on Databricks.

Building Data Applications on the Lakehouse With the Databricks SQL Connector for Python


We are excited to announce General Availability of the Databricks SQL Connector for Python. This follows the recent General Availability of Databricks SQL on Amazon Web Services and Azure. Python developers can now build data applications on the lakehouse, benefiting from record-setting performance for analytics on all their data.

The native Python connector offers simple installation and a Python DB API 2.0 compatible interface that makes it easy to query data. It also automatically converts between Databricks SQL and Python data types, removing the need for boilerplate code.

In this blog post, we will run through some examples of connecting to Databricks and running queries against a sample dataset.

Simple installation from PyPI

With this native Python connector, there’s no need to download and install ODBC/JDBC drivers. Installation is through pip, which means you can include this connector in your application and use it for CI/CD as well:

pip install databricks-sql-connector

Installation requires Python 3.7+

Query tables and views

The connector works with SQL endpoints as well as All Purpose Clusters. In this example, we show you how to connect to and run a query on a SQL endpoint. To establish a connection, we import the connector and pass in connection and authentication information. You can authenticate using a Databricks personal access token (PAT) or a Microsoft Azure active directory (AAD) token.

The following example retrieves a list of trips from the NYC taxi sample dataset and prints the trip distance to the console. cursor.description contains metadata about the result set in the DB-API 2.0 format. cursor.fetchall() fetches all the remaining rows as a Python list.


from databricks import sql

# The with syntax will take care of closing your cursors and connections
with sql.connect(server_hostname="", http_path="", access_token="") as conn:
  with conn.cursor() as cursor:
    cursor.execute("SELECT * FROM samples.nyctaxi.trips WHERE trip_distance < %(distance)s LIMIT 2", {"distance": 10})

    # The description is in the format (col_name, col_type, …) as per DB-API 2.0
    print(f"Description: {cursor.description}")
    print("Results:")
    for row in cursor.fetchall():
      print(row.trip_distance)

Output (edited for brevity):



Description: [('tpep_pickup_datetime', 'timestamp', …), ('tpep_dropoff_datetime', 'timestamp', …), ('trip_distance', 'double', …), …]

Results:
5.35
6.5
5.8
9.0
11.3
…

Note: when using parameterized queries, you should carefully sanitize your input to prevent SQL injection attacks.

Insert data into tables

The connector also lets you run INSERT statements, which is useful for inserting small amounts of data (e.g. thousands of rows) generated by your Python app into tables:


cursor.execute("CREATE TABLE IF NOT EXISTS squares (x int, x_squared int)")

squares = [(i, i * i) for i in range(100)]
values = ",".join([f"({x}, {y})" for (x, y) in squares])
cursor.execute(f"INSERT INTO squares VALUES {values}")

cursor.execute("SELECT * FROM squares")
print(cursor.fetchmany(3))

Output:

[Row(x=0, x_squared=0), Row(x=1, x_squared=1), Row(x=2, x_squared=4)]

To bulk load large amounts of data (e.g. millions of rows), we recommend first uploading the data to cloud storage and then executing the COPY INTO command.
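
As a rough sketch, reusing the cursor from the examples above, a bulk load could look like this (the storage path is a placeholder and assumes the SQL endpoint can read from it):

# Bulk load CSV files that were previously uploaded to cloud storage into the squares table.
cursor.execute("""
    COPY INTO squares
    FROM 's3://my-bucket/staging/squares/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
""")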

Query metadata about tables and views

As well as executing SQL queries, the connector makes it easy to see metadata about your catalogs, databases, tables and columns. The following example will retrieve metadata information about columns from a sample table:


cursor.columns(schema_name="default", table_name="squares")

for row in cursor.fetchall():
  print(row.COLUMN_NAME)

Output (edited for brevity):


x
x_squared

A bright future for Python app developers on the lakehouse

We would like to thank the contributors to Dropbox’s PyHive connector, which provided the basis for early versions of the Databricks SQL Connector for Python. In the coming months, we plan to open-source the Databricks SQL Connector for Python and begin welcoming contributions from the community.

We are excited about what our customers will build with the Databricks SQL connector for Python. In upcoming releases, we are looking forward to adding support for additional authentication schemes, multi-catalog metadata and SQLAlchemy. Please try out the connector, and give us feedback. We would love to hear from you on what you would like us to support.

--

Try Databricks for free. Get started today.

The post Building Data Applications on the Lakehouse With the Databricks SQL Connector for Python appeared first on Databricks.

Creating a Faster TAR Extractor


Tarballs are used industry-wide for packaging and distributing files, and this is no different at Databricks. Every day we launch millions of VMs across multiple cloud providers. One of the first steps on every one of these VMs is extracting a fairly sizable tar.lz4 file containing a specific Apache Spark™ runtime. As part of an effort to help bring down bootstrap times, we wanted to see what could be done to help speed up the process of extracting this large tarball.

Existing methods

Right now, the most common method for extracting tarballs is to invoke some command (e.g. curl, wget, or even their browser) to download the raw tarball locally, and then use tar to extract the contents to their final location on disk. There are two general methods that exist right now to improve upon this.

Piping the download directly to tar

Tar uses a sequential file format, which means that extraction always starts at the beginning of the file and makes its way towards the end. A side effect of this is that you don’t need the entire file present to begin extraction. Indeed tar can take in “-“ as the input file and it will read from standard input. Couple this with a downloader dumping to standard output (wget -O -) and you can effectively start untarring the file in parallel as the rest is still being downloaded. If both the download phase and the extraction phase take approximately the same time, this can theoretically halve the total time needed.

Parallel download

Single stream downloaders often don’t maximize the full bandwidth of a machine due to bottlenecks in the I/O path (e.g., bandwidth caps set per download stream from the download source). Existing tools like aria2c help mitigate this by downloading with parallel streams from one or more sources to the same file on disk. This can offer significant speedups, both by utilizing multiple download streams and by writing them in parallel to disk.

What fastar does different

Parallel downloads + piping

The first goal of fastar was to combine the benefits of piping downloads directly to tar with the increased speed of parallel download streams. Unfortunately aria2c is designed for writing directly to disk. It doesn’t have the necessary synchronization mechanisms needed for converting the multiple download streams to a single logical stream for standard output.

Fastar employs a group of worker threads that are all responsible for downloading their own slices of the overall file. Similar to other parallel downloaders, it takes advantage of the HTTP RANGE header to make sure each worker only downloads the chunk it’s responsible for. The main difference is that these workers make use of golang channels and a shared io.Writer object to synchronize and merge the different download streams. This allows for multiple workers to be constantly pulling data in parallel while the eventual consumer only sees a sequential, in-order stream of bytes.

Assuming 4 worker threads (this number is user configurable), the high-level logic is as follows:

  1. Kick off threads (T1 – T4), which start downloading chunks in parallel starting at the beginning of the file. T1 starts immediately writing to stdout while threads T2-T4 save to in-memory buffers until it’s their turn.

  2. Once T1 is finished writing its current chunk to stdout, it signals T2 that it is T2’s turn and starts downloading the next chunk it is responsible for (right after T4’s current chunk). T2 then starts writing the data saved in its buffer to stdout while the rest of the threads continue with their downloads. This process continues for the whole file.
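
As an illustration only (fastar itself is written in Go and uses goroutines, channels and a shared io.Writer), the ordered hand-off described above can be sketched in a few lines of Python:

import sys
import threading
from urllib.request import Request, urlopen

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB slices for this example

def fetch_range(url, start, end):
    # HTTP Range request so each worker only downloads its own slice.
    req = Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urlopen(req) as resp:
        return resp.read()

def parallel_stream(url, total_size, out=sys.stdout.buffer, workers=4):
    lock = threading.Lock()
    buffered = {}      # chunks downloaded but not yet written
    next_to_write = 0  # index of the chunk whose turn it is

    def worker(first_chunk):
        nonlocal next_to_write
        chunk_id = first_chunk
        while chunk_id * CHUNK_SIZE < total_size:
            start = chunk_id * CHUNK_SIZE
            end = min(start + CHUNK_SIZE, total_size) - 1
            data = fetch_range(url, start, end)
            with lock:
                buffered[chunk_id] = data
                # Flush every buffered chunk that is next in line, keeping stdout in order.
                while next_to_write in buffered:
                    out.write(buffered.pop(next_to_write))
                    next_to_write += 1
            chunk_id += workers  # this worker's next slice is `workers` chunks ahead

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()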

Multithreaded tar extraction

The other big area for improvement was the actual extraction of files to disk by tar itself. As alluded to earlier, one of the reasons aria2c is such a fast file downloader is that it writes to disk with multiple streams. Keeping a high queue depth when writing ensures that the disk always has work to do and isn’t sitting idle waiting for the next command. It also allows the disk to constantly rearrange write operations to maximize throughput. This is especially important when, for example, untarring many small files. The built-in tar command is single threaded, extracting all files from the archive in a single hot loop.

To get around this, fastar also utilizes multiple threads for writing individual extracted files to disk. For each file in the stream, fastar copies the file data to a buffer, which is then passed to a thread to write to disk in the background. Some file types need to be handled differently here for correctness. Folders are written synchronously to ensure they exist before any subfiles are written to disk. Another special case is hard links: since, unlike symlinks, they require the file they point to to exist, we need to take care to synchronize file creation around them.

Quality of life features

Fastar also includes a few features to improve ease of use:

  • S3 hosted download support. Fastar also supports downloading from S3 buckets using the s3://bucket/key format
  • Compression support. Fastar internally handles decompression of gzip and lz4 compressed tarballs. It can even automatically infer which compression scheme is used by sniffing the first few bytes for a magic number.

Performance numbers

To test locally, we used an lz4 compressed tarball of a container filesystem (2.6GB compressed, 4.3GB uncompressed). This was hosted on a local HTTP server serving from an in-memory file system. The tarball was then downloaded and extracted to the same in-memory file system. This should represent a theoretical best case scenario as we aren’t IO bound with the memory backed file system.

Fastar is 3 times faster than wget and tar.

For production impact, the following shows the speed difference when extracting one of the largest images we support on a live cluster (7.6GB compressed, 16.1GB uncompressed). Pre-fastar, we used aria2c on Azure and boto3 on AWS to download the image before extracting it with tar.

Fastar provides a 2x speedup on AWS

From the tests above, fastar can offer significant speed improvements, both in synthetic and real-world benchmarks. In synthetic workloads we achieve a nearly 3x improvement over naively calling wget && tar and double the performance compared to using the already fast aria2c && tar. Finally in production workloads we see a 1.3x improvement in Azure Databricks and over a 2x improvement in Databricks on AWS.

Interested in working on problems like this? Consider applying to Databricks!

Also let us know if you would find this tool useful and we can look into open sourcing it!

--

Try Databricks for free. Get started today.

The post Creating a Faster TAR Extractor appeared first on Databricks.

Investing in TickSmith: Enabling an E-Commerce Data Experience With Open Data Exchange


We are excited to announce Databricks Ventures’ investment in TickSmith, a leading SaaS platform that simplifies the online data shopping experience. The investment through the Lakehouse Fund, created to support early and growth-stage companies extending the lakehouse ecosystem, aligns with Databricks’ vision of supporting an open data economy. Together, this investment and partnership enable customers with the capabilities to package, commercialize, and share data products seamlessly.

TickSmith’s platform gives data buyers an easy, consumer-like experience while purchasing data, and enables data providers with the necessary tools to create a web store, package and commercialize their data. To augment their core capabilities, TickSmith was one of the first partners to implement Delta Sharing, an open and secure sharing protocol to enhance their distribution and solve some of the biggest challenges their customers were facing around data sharing.

The challenges that TickSmith heard from customers with existing data sharing solutions echoes what our own customers say. Traditional data sharing solutions limit data access to users on proprietary platforms, while homegrown solutions (SFTP and API) are difficult to manage, maintain and scale. This reduced the reach and full monetization potential for their customers. These core challenges come at a time when data sharing is critical as enterprises look to securely exchange data with their customers, suppliers and partners. This ultimately has led us, at Databricks, to rethink the future of data sharing and enable companies with better and more open solutions.

Now with Delta Sharing powering the TickSmith platform, data providers can easily create shares, manage recipients and distribute their offerings in an integrated environment. Data providers can send data across platforms in an open, secure and scalable way without the challenges mentioned before. Coupled with TickSmith’s native functionality for web store and data package creation, this helps organizations both big and small to monetize, exchange, and realize the full value of their data without the need for costly resources, large infrastructure, and data replication. Together, TickSmith and Delta Sharing are helping move execution of sharing data from IT to the end business users.

From a data consumer perspective, in addition to easy discovery and add-to-cart purchase provided by TickSmith, they have seamless access to the most up-to-date data they are provisioned for. Data consumers will receive credentials that will immediately grant them access to the data for them to analyze. As the data is updated the consumer will access live, ready-to-query data reducing delays and increasing efficiency.

We have already seen strong interest and traction with our joint offering, and the strategic investment deepens our relationship. TickSmith and Databricks plan to continue improving our joint solution for both data providers and consumers. Data providers will be able to set up a Databricks environment for the consumer to try out a sample dataset and provide quick-start solutions to accelerate the consumer journey. On the data consumer side, we plan to tighten integrations with the Databricks platform for consumers to be able to quickly view updated datasets in their own Databricks environment.

Learn more about Ticksmith with Delta Sharing and see a demo of the capabilities below.

Data Provider Demo

Data Consumer Demo

--

Try Databricks for free. Get started today.

The post Investing in TickSmith: Enabling an E-Commerce Data Experience With Open Data Exchange appeared first on Databricks.

Orchestrating Databricks Workloads on AWS With Managed Workflows for Apache Airflow


In this blog, we explore how to leverage Databricks’ powerful Jobs API with Amazon Managed Workflows for Apache Airflow (MWAA) and integrate with CloudWatch to monitor Directed Acyclic Graphs (DAGs) with Databricks-based tasks. Additionally, we will show how to create alerts based on DAG performance metrics.

Before we get into the how-to section of this guidance, let’s quickly review what Databricks job orchestration and Amazon Managed Workflows for Apache Airflow (MWAA) are.

Databricks orchestration and alerting

Job orchestration in Databricks is a fully integrated feature. Customers can use the Jobs API or UI to create and manage jobs, and use features such as email alerts for monitoring. With this powerful API-driven approach, Databricks jobs can orchestrate anything that has an API (e.g., pulling data from a CRM). Databricks orchestration supports both single-task and multi-task jobs, as well as the newly added Delta Live Tables jobs.

Amazon Managed Airflow

Amazon Managed Workflows for Apache Airflow (MWAA) is a managed orchestration service for Apache Airflow. MWAA manages the open-source Apache Airflow platform on the customers’ behalf with the security, availability, and scalability of AWS. MWAA gives customers additional benefits of easy integration with AWS Services and a variety of third-party services via pre-existing plugins, allowing customers to create complex data processing pipelines.

High-Level architecture diagram

We will create a simple DAG that launches a Databricks Cluster and executes a notebook. MWAA monitors the execution. Note: we have a simple job definition, but MWAA can orchestrate a variety of complex workloads.

High-Level architecture diagram for creating a simple DAG that launches a Databricks Cluster and executes a notebook.

Setting up the environment

The blog assumes you have access to a Databricks workspace. If you don’t, sign up for free here and configure a Databricks cluster. Additionally, create an API token, which will be used to configure the connection in MWAA.

Databricks users can create an Amazon Managed Workflows for Apache Airflow (MWAA) environment directly from their dashboard.

To create an MWAA environment follow these instructions.

How to create a Databricks connection

The first step is to configure the Databricks connection in MWAA.

Establishing a connection between MWAA and the Databricks workspace.
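
As a rough sketch of what that connection holds (field names follow the Apache Airflow Databricks provider; the workspace URL and token below are placeholders), the databricks_default connection referenced by the example DAG is typically edited under Admin > Connections in the Airflow UI:

Conn Id:   databricks_default
Conn Type: Databricks
Host:      https://<your-workspace>.cloud.databricks.com
Extra:     {"token": "<personal-access-token>"}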

Example DAG

Next, upload your DAG into the S3 bucket folder you specified when creating the MWAA environment. Your DAG will automatically appear in the MWAA UI.

Example Airflow DAG

Here’s an example of an Airflow DAG, which creates configuration for a new Databricks jobs cluster, Databricks notebook task, and submits the notebook task for execution in Databricks.

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator, DatabricksRunNowOperator
from datetime import datetime, timedelta 

#Define params for Submit Run Operator
new_cluster = {
    'spark_version': '7.3.x-scala2.12',
    'num_workers': 2,
    'node_type_id': 'i3.xlarge',
     "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::XXXXXXX:instance-profile/databricks-data-role"
    }
}

notebook_task = {
    'notebook_path': '/Users/xxxxx@XXXXX.com/test',
}

#Define params for Run Now Operator (only needed when triggering an existing job with
#DatabricksRunNowOperator; not used by the DatabricksSubmitRunOperator example below)
notebook_params = {
    "Variable":5
}

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=2)
}

with DAG('databricks_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    catchup=False,
    default_args=default_args
    ) as dag:

    opr_submit_run = DatabricksSubmitRunOperator(
        task_id='submit_run',
        databricks_conn_id='databricks_default',
        new_cluster=new_cluster,
        notebook_task=notebook_task
    )
    opr_submit_run

Get the latest version of the file from the GitHub link.

Trigger the DAG in MWAA.

Triggering the Airflow DAG via the MWAA UI.

Once triggered you can see the job cluster on the Databricks cluster UI page.

Once an Airflow DAG is triggered, the respective job cluster is displayed on the Databricks cluster UI page.

Troubleshooting

Amazon MWAA uses Amazon CloudWatch for all Airflow logs. These are helpful for troubleshooting DAG failures.

Amazon MWAA uses Amazon CloudWatch for all Airflow logs.

CloudWatch metrics and alerts

Next, we create a metric to monitor the successful completion of the DAG. Amazon MWAA supports many metrics.

Databricks creates a metric to monitor the successful completion of the Airflow DAG.

We use TaskInstanceFailures to create an alarm.

Databricks uses TaskInstanceFailures to create alarms once an Airflow DAG has run to, for example, notify if any failures are recorded over a specific period of time.

For the threshold we select zero (i.e., we want to be notified when there are any failures over a period of one hour).

Lastly, we select an Email notification as the action.

Databricks’ UI makes it easy to configure the notification action, e.g., email, for issues uncovered by the Airflow DAG run.
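
The same alarm can also be created programmatically. The sketch below uses boto3; the AmazonMWAA namespace, the Environment dimension name and the SNS topic ARN are assumptions to check against your own CloudWatch console before use.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Assumptions: the "AmazonMWAA" namespace, the Environment dimension and the SNS topic ARN
# are placeholders -- verify the exact metric names and dimensions in your CloudWatch console.
cloudwatch.put_metric_alarm(
    AlarmName="DatabricksDAGFailure",
    Namespace="AmazonMWAA",
    MetricName="TaskInstanceFailures",
    Dimensions=[{"Name": "Environment", "Value": "<your-mwaa-environment>"}],
    Statistic="Sum",
    Period=3600,               # one hour
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:<account-id>:<topic>"],  # e.g., an SNS topic that emails you
)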

Here’s an example of the Cloudwatch Email notification generated when the DAG fails.

You are receiving this email because your Amazon CloudWatch Alarm “DatabricksDAGFailure” in the US East (N. Virginia) region has entered the ALARM state, because “Threshold Crossed…”

Example of the Cloudwatch alert generated when the DAG fails.

Conclusion

In this blog, we showed how to create an Airflow DAG that configures a new Databricks jobs cluster and a Databricks notebook task, and then submits the notebook task for execution in Databricks. We leveraged MWAA’s out-of-the-box integration with CloudWatch to monitor our example workflow and receive notifications when there are failures.

What’s next

Code Repo
MWAA-DATABRICKS Sample DAG Code

--

Try Databricks for free. Get started today.

The post Orchestrating Databricks Workloads on AWS With Managed Workflows for Apache Airflow appeared first on Databricks.


The Ubiquity of Delta Standalone: Java, Scala, Hive, Presto, Trino, Power BI, and More!


We are excited for the release of Delta Connectors 0.3.0, which introduces support for writing Delta tables. The key features in this release are:

Delta Standalone

  • Write functionality – This release introduces new APIs to support creating and writing Delta tables without Apache Spark™. External processing engines can write Parquet data files and then use the APIs to commit the files to the Delta table atomically. Following the Delta Transaction Log Protocol, the implementation uses optimistic concurrency control to manage multiple writers, automatically generates checkpoint files, and manages log and checkpoint cleanup according to the protocol. The main Java class exposed is OptimisticTransaction, which is accessed via DeltaLog.startTransaction().
    • OptimisticTransaction.markFilesAsRead(readPredicates) must be used to read all metadata during the transaction (and not the DeltaLog). It is used to detect concurrent updates and determine whether logical conflicts between this transaction and previously-committed transactions can be resolved.
    • OptimisticTransaction.commit(actions, operation, engineInfo) is used to commit changes to the table. If a conflicting transaction has been committed first (see above), an exception is thrown; otherwise, the table version that was committed is returned.
    • Idempotent writes can be implemented using OptimisticTransaction.txnVersion(appId) to check for version increases committed by the same application.
    • Each commit must specify the Operation being performed by the transaction.
    • Transactional guarantees for concurrent writes on Microsoft Azure and Amazon S3. This release includes custom extensions to support concurrent writes on Azure and S3 storage systems, which on their own do not have the necessary atomicity and durability guarantees. Please note that transactional guarantees are only provided for concurrent writes on S3 from a single cluster.
  • Memory-optimized iterator implementation for reading files in a snapshot: DeltaScan introduces an iterator implementation for reading the AddFiles in a snapshot with support for partition pruning. It can be accessed via Snapshot.scan() or Snapshot.scan(predicate), the latter of which filters files based on the predicate and any partition columns in the file metadata. This API significantly reduces the memory footprint when reading the files in a snapshot and instantiating a DeltaLog (due to internal utilization).
  • Partition filtering for metadata reads and conflict detection in writes: This release introduces a simple expression framework for partition pruning in metadata queries. When reading files in a snapshot, filter the returned AddFiles on partition columns by passing a predicate into Snapshot.scan(predicate). When updating a table during a transaction, specify which partitions were read by passing a readPredicate into OptimisticTransaction.markFilesAsRead(readPredicate) to detect logical conflicts and avoid transaction conflicts when possible.
  • Miscellaneous updates:
    • DeltaLog.getChanges() exposes an incremental metadata changes API. VersionLog wraps the version number and the list of actions in that version.
    • ParquetSchemaConverter converts a StructType schema to a Parquet schema.
    • Fix #197 for RowRecord so that values in partition columns can be read.
    • Miscellaneous bug fixes.

Delta Connectors

  • Hive 3 support for the Hive Connector
  • Microsoft PowerBI connector for reading Delta tables natively: Read Delta tables directly from PowerBI from any supported storage system without running a Spark cluster. Features include online/scheduled refresh in the PowerBI service, support for Delta Lake time travel (e.g., VERSION AS OF), and partition elimination using the partition schema of the Delta table. For more details see the dedicated README.md.
    What is Delta Standalone?

    The Delta Standalone project in Delta connectors, formerly known as Delta Standalone Reader (DSR), is a JVM library that can be used to read and write Delta Lake tables. Unlike Delta Lake Core, this project does not use Spark to read or write tables and has only a few transitive dependencies. It can be used by any application that cannot use a Spark cluster (read more: How to Natively Query Your Delta Lake with Scala, Java, and Python).

    The project allows developers to build a Delta connector for an external processing engine following the Delta protocol without using a manifest file. The reader component ensures developers can read the set of Parquet files associated with the requested Delta table version. As part of Delta Standalone 0.3.0, the reader includes a memory-optimized, lazy iterator implementation for DeltaScan.getFiles (PR #194). The following code sample lists the Parquet files of a snapshot so that an external engine can read them in a distributed manner; Delta Standalone (as of 0.3.0) includes Snapshot::scan(filter)::getFiles, which supports partition pruning and an optimized internal iterator implementation.

    import org.apache.hadoop.conf.Configuration;

    import io.delta.standalone.DeltaLog;
    import io.delta.standalone.DeltaScan;
    import io.delta.standalone.Snapshot;
    import io.delta.standalone.actions.AddFile;
    import io.delta.standalone.data.CloseableIterator;
    import io.delta.standalone.expressions.And;
    import io.delta.standalone.expressions.EqualTo;
    import io.delta.standalone.expressions.Literal;
    import io.delta.standalone.types.StructType;

    DeltaLog log = DeltaLog.forTable(new Configuration(), "$TABLE_PATH$");
    Snapshot latestSnapshot = log.update();
    StructType schema = latestSnapshot.getMetadata().getSchema();
    DeltaScan scan = latestSnapshot.scan(
        new And(
            new And(
                new EqualTo(schema.column("year"), Literal.of(2021)),
                new EqualTo(schema.column("month"), Literal.of(11))),
            new EqualTo(schema.column("customer"), Literal.of("XYZ"))
        )
    );

    CloseableIterator<AddFile> iter = scan.getFiles();
    
    try {
        while (iter.hasNext()) {
            AddFile addFile = iter.next();
    
            // Zappy engine to handle reading data in `addFile.getPath()` and apply any `scan.getResidualPredicate()`
        }
    } finally {
        iter.close();
    }
    

    As well, Delta Standalone 0.3.0 includes a new writer component that allows developers to generate parquet files themselves and add these files to a Delta table atomically, with support for idempotent writes (read more: Delta Standalone Writer design document). The following code snippet shows how to commit to the transaction log to add the new files and remove the old incorrect files after writing Parquet files to storage.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;

    import io.delta.standalone.Operation;
    import io.delta.standalone.OptimisticTransaction;
    import io.delta.standalone.actions.Action;
    import io.delta.standalone.actions.AddFile;
    import io.delta.standalone.actions.RemoveFile;
    import io.delta.standalone.exceptions.DeltaConcurrentModificationException;

    // Start a transaction on the DeltaLog from the previous snippet
    OptimisticTransaction txn = log.startTransaction();

    List<RemoveFile> removeOldFiles = existingFiles.stream()
        .map(path -> addFileMap.get(path).remove())
        .collect(Collectors.toList());

    List<AddFile> addNewFiles = newDataFiles.getNewFiles()
        .map(file ->
            new AddFile(
                file.getPath(),
                file.getPartitionValues(),
                file.getSize(),
                System.currentTimeMillis(),
                true, // isDataChange
                null, // stats
                null  // tags
            )
        ).collect(Collectors.toList());

    List<Action> totalCommitFiles = new ArrayList<>();
    totalCommitFiles.addAll(removeOldFiles);
    totalCommitFiles.addAll(addNewFiles);

    // Zippy is in reference to a generic engine

    try {
        txn.commit(totalCommitFiles, new Operation(Operation.Name.UPDATE), "Zippy/1.0.0");
    } catch (DeltaConcurrentModificationException e) {
        // handle exception here
    }
    

    Hive 3 using Delta Standalone

    Delta Standalone 0.3.0 supports Hive 2 and 3 allowing Hive to natively read a Delta table. The following is an example of how to create a Hive external table to access your Delta table.

    CREATE EXTERNAL TABLE deltaTable(col1 INT, col2 STRING)
    STORED BY 'io.delta.hive.DeltaStorageHandler'
    LOCATION '/delta/table/path'
    

    For more details on how to set up Hive, please refer to Delta Connectors > Hive Connector. It is important to note this connector only supports Apache Hive; it does not support Apache Spark or Presto.

    Reading Delta Lake from PrestoDB

    As demonstrated in PrestoCon 2021 session Delta Lake Connector for Presto, the recently merged Presto/Delta connector utilizes the Delta Standalone project to natively read the Delta transaction log without the need of a manifest file. The memory-optimized, lazy iterator included in Delta Standalone 0.3.0 allows PrestoDB to efficiently iterate through the Delta transaction log metadata and avoids OOM issues when reading large Delta tables.

    With the Presto/Delta connector, in addition to querying your Delta tables natively with Presto, you can use the @ syntax to perform time travel queries and query previous versions of your Delta table by version or timestamp. The following code sample is querying earlier versions of the same NYCTaxi 2019 dataset using version.

    -- Version 1 of the s3://…/nyctaxi_2019_part table
    WITH nyctaxi_2019_part AS (
      SELECT * FROM deltas3."$path$"."s3://…/nyctaxi_2019_part@v1")
    SELECT COUNT(1) FROM nyctaxi_2019_part;
    
    -- output
    59354546
    
    
    -- Version 5 of the s3://…/nyctaxi_2019_part table
    WITH nyctaxi_2019_part AS (
      SELECT * FROM deltas3."$path$"."s3://…/nyctaxi_2019_part@v5")
    SELECT COUNT(1) FROM nyctaxi_2019_part;
    
    -- output
    78959576
    

    With this connector, you can both specify the table from your metastore and query the Delta table directly from the file path, using the deltas3."$path$"."s3://…" syntax.

    For more information about the PrestoDB/Delta connector:

    Note, we are currently working with the Trino (here’s the current branch that contains the Trino 359 Delta Lake reader) and Athena communities to provide native Delta Lake connectivity.

    Reading Delta Lake from Power BI Natively

    We also wanted to give a shout-out to Gerhard Brueckl (github: gbrueckl) for continuing to improve Power BI connectivity to Delta Lake. As part of Delta Connectors 0.3.0, the Power BI connector includes online/scheduled refresh in the PowerBI service, support for Delta Lake time travel, and partition elimination using the partition schema of the Delta table.

     Reading Delta Lake Tables natively in PowerBI

    Source: Reading Delta Lake Tables natively in PowerBI

    For more information, refer to Reading Delta Lake Tables natively in PowerBI or check out the code-base.

    Discussion

    We are really excited about the rapid adoption of Delta Lake by the data engineering and data science community. If you’re interested in learning more about Delta Standalone or any of these Delta connectors, check out the following resources:


    Credits
    We want to thank the following contributors for updates, doc changes, and contributions in Delta Standalone 0.3.0: Alex, Allison Portis, Denny Lee, Gerhard Brueckl, Pawel Kubit, Scott Sandre, Shixiong Zhu, Wang Wei, Yann Byron, Yuhong Chen, and gurunath.

    --

    Try Databricks for free. Get started today.

    The post The Ubiquity of Delta Standalone: Java, Scala, Hive, Presto, Trino, Power BI, and More! appeared first on Databricks.

    Make Your Data Lakehouse Run, Faster With Delta Lake 1.1


    Delta Lake 1.1 improves performance for merge operations, adds the support for generated columns and improves nested field resolution

    With tremendous contributions from the open-source community, the Delta Lake community recently announced the release of Delta Lake 1.1.0 on Apache Spark™ 3.2. Similar to Apache Spark, the Delta Lake community has released Maven artifacts for both Scala 2.12 and Scala 2.13, as well as a PyPI package (delta-spark).

    This release includes notable improvements around the MERGE operation and nested field resolution, as well as support for generated columns in a merge operation, Python type annotations, and arbitrary expressions in ‘replaceWhere’, among others. It is important that Delta Lake keeps up to date with the innovation in Apache Spark; this means you can take advantage of increased performance in Delta Lake using the features available in Spark 3.2.0.

    This post will go over the major changes and notable features in the new 1.1.0 release. Check out the project’s Github repository for details.

    Want to get started with Delta Lake right away instead? Learn more about what is Delta Lake and use this guide to build lakehouses with Delta Lake.

    Key features of Delta Lake 1.1.0

    • Performance improvements in MERGE operation
      On partitioned tables, MERGE operations will automatically repartition the output data before writing to files. This ensures better performance out-of-the-box for both the MERGE operation as well as subsequent read operations.
    • Support for passing Hadoop configurations via DataFrameReader/Writer options – You can now set Hadoop FileSystem configurations (e.g., access credentials) via DataFrameReader/Writer options. Previously, the only way to pass such configurations was to set a Spark session configuration, which applied the same value to all reads and writes. Now you can set them to different values for each read and write; see the documentation for more details and the short sketch after this list.
    • Support for arbitrary expressions in replaceWhere DataFrameWriter option – Instead of expressions only on partition columns, you can now use arbitrary expressions in the replaceWhere DataFrameWriter option. That is you can replace arbitrary data in a table directly with DataFrame writes. See the documentation for more details.
    • Improvements to nested field resolution and schema evolution in MERGE operation on an array of structs – When applying the MERGE operation on a target table having a column typed as an array of nested structs, the nested columns between the source and target data are now resolved by name and not by position in the struct. This ensures structs in arrays have a consistent behavior with structs outside arrays. When automatic schema evolution is enabled for MERGE, nested columns in structs in arrays will follow the same evolution rules (e.g., column added if no column by the same name exists in the table) as columns in structs outside arrays. See the documentation for more details.
    • Support for Generated Columns in MERGE operation – You can now apply MERGE operations on tables having Generated Columns.
    • Fix for rare data corruption issue on GCS – Experimental GCS support released in Delta Lake 1.0 has a rare bug that can lead to Delta tables being unreadable due to partially written transaction log files. This issue has now been fixed (1, 2).
    • Fix for the incorrect return object in Python DeltaTable.convertToDelta() – This existing API now returns the correct Python object of type delta.tables.DeltaTable instead of an incorrectly-typed, and therefore unusable object.
    • Python type annotations – We have added Python type annotations which improve auto-completion performance in editors which support type hints. Optionally, you can enable static checking through mypy or built-in tools (for example Pycharm tools).

    Other Notable features in the Delta Lake 1.1.0 release are as follows:

    1. Removed support to read tables with certain special characters in the partition column name. See the migration guide for details.
    2. Support for “delta.`path`” in DeltaTable.forName() for consistency with other APIs
    3. Improvements to DeltaTableBuilder API introduced in Delta 1.0.0
      • Fix for bug that prevented the passing of multiple partition columns in Python DeltaTableBuilder.partitionBy.
      • Throw error when the column data type is not specified.
    4. Improved support for MERGE/UPDATE/DELETE on temp views.
    5. Support for setting user metadata in the commit information when creating or replacing tables.
    6. Fix for an incorrect analysis exception in MERGE with multiple INSERT and UPDATE clauses and automatic schema evolution enabled.
    7. Fix for incorrect handling of special characters (e.g. spaces) in paths by MERGE/UPDATE/DELETE operations.
    8. Fix for Vacuum parallel mode from being affected by the Adaptive Query Execution enabled by default in Apache Spark 3.2.
    9. Fix for earliest valid time travel version.
    10. Fix for Hadoop configurations not being used to write checkpoints.
    11. Multiple fixes (1, 2, 3) to Delta Constraints.

    In the next section, let’s dive deeper into the most notable features of this release.

    Better performance out-of-the-box for MERGE operation

    Comparison chart of Delta Lake 1.0.0 merge operations before the flag was enabled and afterward.

    • The above graph shows the significant reduction in execution time from 19.66 minutes (before) to 7.6 minutes (after) the feature flag was enabled.
    • Notice the difference in stages in the DAG visualization below for both the queries before and after. There is an additional stage for AQE ShuffleRead after the SortMergeJoin.

    Figure: DAG for the delta merge query with repartitionBeforeWrite disabled.


    Figure: DAG for the delta merge query with repartitionBeforeWrite enabled.

    Let’s take a look at the example now:
    The datasets used for this example, customers1 and customers2, have 200,000 rows and 11 columns with information about customers and sales. A merge table, customers_merge, with 45,000 rows was used to perform a merge operation on the former tables. The full script and results for the merge example are available here.

    To ensure that the feature is disabled, you can run the following command:

    spark.sql("SET spark.databricks.delta.merge.repartitionBeforeWrite.enabled = false")

    CODE:

    
    from delta.tables import *
    deltaTable = DeltaTable.forPath(spark, "/temp/data/customers1")
    mergeDF = spark.read.format("delta").load("/temp/data/customers_merge")
    deltaTable.alias("customers1").merge(mergeDF.alias("c_merge"),"customers1.customer_sk = c_merge.customer_sk").whenNotMatchedInsertAll().execute()
    
    

    Results:
    Note: The full operation took 19.66 minutes with the feature flag disabled. You can refer to the full result for the details of the query.

    For partitioned tables, the merge can produce a much larger number of small files than the number of shuffle partitions. This is because every shuffle task can write multiple files in multiple partitions, and can become a performance bottleneck. To enable faster merge operation on our partitioned table, let’s enable repartitionBeforeWrite using the code snippet below.

    Enable the flag and run the merge again.

    spark.sql("SET spark.databricks.delta.merge.repartitionBeforeWrite.enabled = true")
    

    This will allow the MERGE operation to automatically repartition the output data of partitioned tables before writing to files. In many cases, it helps to repartition the output data by the table’s partition columns before writing it. This ensures better performance out-of-the-box for both the MERGE operation and subsequent read operations. Let’s now run the merge operation on our table customers2.

    
    from delta.tables import *
    deltaTable = DeltaTable.forPath(spark, "/temp/data/customers2")
    mergeDF = spark.read.format("delta").load("/temp/data/customers_merge")
    deltaTable.alias("customers2").merge(mergeDF.alias("c_merge"),"customers2.customer_sk = c_merge.customer_sk").whenNotMatchedInsertAll().execute()
    

    Note: After enabling the feature “repartitionBeforeWrite”, the merge query took 7.68 minutes. You can refer to this full result for the details of the query.

    Tip: Organizations working around the GDPR and CCPA use case can highly appreciate this feature as it provides a cost-effective way to do fast point updates and deletes without rearchitecting your entire data lake.

    Support for arbitrary expressions in replaceWhere DataFrameWriter option

    To atomically replace all the data in a table, you can use overwrite mode:

    INSERT OVERWRITE TABLE default.customer_t10 SELECT * FROM customer_t1

    With Delta Lake 1.1.0 and above, you can also selectively overwrite only the data that matches an arbitrary expression using DataFrames. The following command atomically replaces records with birth years 1924 and 1925 in the target table, which is partitioned by c_birth_year, with the data in customer_t1:

    
    input = spark.read.table("delta.`/usr/local/delta/customer_t1`")
    
    input.write.format("delta") \
      .mode("overwrite") \
      .option("overwriteSchema", "true") \
      .partitionBy("c_birth_year") \
      .option("replaceWhere", "c_birth_year >= '1924' AND c_birth_year <= '1925'") \
      .saveAsTable("customer_t10")
    

    This query will run successfully and produce output like the following:
    Sample query output produced by Delta Lake 1.1.0.
    However, for Delta Lake releases before 1.1.0, the same query would result in the following error:

    You can try it by disabling the replaceWhere flag.

    Python Type Annotations

    Python type annotations improve auto-completion performance in editors that support type hints. Optionally, you can enable static checking through mypy or built-in tools (for example, PyCharm). Here is a video from the original author of the PR, Maciej Szymkiewicz, describing the changes in the behavior of Python within Delta Lake 1.1.

    We hope you got to see some cool Delta Lake features through this blog post. We are excited to find out where you are using these features; if you have any feedback or examples of your work, please share them with the community.

    Summary

    The lakehouse has become a new norm for organizations wanting to build data platforms and architecture, thanks in large part to Delta Lake, which has allowed more than 5,000 organizations to build successful production lakehouse platforms for their data and artificial intelligence applications. With the exponential increase in data, it is important to process volumes of data faster and reliably. With Delta Lake, developers can make their lakehouses run much faster with the improvements in version 1.1 and keep pace with innovation.

    Interested in the open-source Delta Lake?
    Visit the Delta Lake online hub to learn more. You can join the Delta Lake community via Slack and Google Group, track all the upcoming releases and planned features in GitHub milestones, and try out managed Delta Lake on Databricks with a free account.


    Credits
    We want to thank the following contributors for updates, doc changes, and contributions in Delta Lake 1.1.0: Abhishek Somani, Adam Binford, Alex Jing, Alexandre Lopes, Allison Portis, Bogdan Raducanu, Bart Samwel, Burak Yavuz, David Lewis, Eunjin Song, ericfchang, Feng Zhu, Flavio Cruz, Florian Valeye, Fred Liu, gurunath, Guy Khazma, Jacek Laskowski, Jackie Zhang, Jarred Parrett, JassAbidi, Jose Torres, Junlin Zeng, Junyong Lee, KamCheung Ting, Karen Feng, Lars Kroll, Li Zhang, Linhong Liu, Liwen Sun, Maciej, Max Gekk, Meng Tong, Prakhar Jain, Pranav Anand, Rahul Mahadev, Ryan Johnson, Sabir Akhadov, Scott Sandre, Shixiong Zhu, Shuting Zhang, Tathagata Das, Terry Kim, Tom Lynch, Vijayan Prabhakaran, Vítor Mussa, Wenchen Fan, Yaohua Zhao, Yijia Cui, YuXuan Tay, Yuchen Huo, Yuhong Chen, Yuming Wang, Yuyuan Tang, and Zach Schuermann.

    --

    Try Databricks for free. Get started today.

    The post Make Your Data Lakehouse Run, Faster With Delta Lake 1.1 appeared first on Databricks.

    Streamline MLOps With MLflow Model Registry Webhooks


    As machine learning becomes more widely adopted, businesses need to deploy models at speed and scale to achieve maximum value. Today, we are announcing MLflow Model Registry Webhooks, making it easier to automate your model lifecycle by integrating it with the CI/CD platforms of your choice.

    Model Registry Webhooks enable you to register callbacks that are triggered by Model Registry events, such as creating a new model version, adding a new comment, or transitioning the model stage. You can use these callbacks to invoke automation scripts to implement MLOps on Databricks. For example, you can trigger CI builds when a new model version is created or notify your team members through Slack each time a model transition to production is requested. By automating your ML workflow, you can improve developer productivity, accelerate model deployment and create more value for your end-users and organization.

    MLflow Model Registry Webhooks are now available in public preview for all Databricks customers.
    Databricks Model Registry Webhooks enable you to invoke automation scripts to implement MLOps on Databricks.

    Webhooks simplify integrations with MLflow Model Registry

    The MLflow Model Registry provides a central repository to manage the model deployment lifecycle. Today, ML teams manage their models in the Model Registry manually. However, as teams grow and cover more ML use cases, the number of models continues to increase, making it inefficient and impractical to operate these models manually. Many teams automate the model deployment lifecycle by building an ad-hoc service that frequently polls the Model Registry to look for changes. Model Registry Webhooks simplify this automation by sending real-time notifications when events happen in the Model Registry. Webhooks can be configured to trigger a workflow in a CI/CD platform or a pre-defined Databricks job.

    MLOps use cases with Webhooks

    With webhooks, you can automate your machine learning workflow by setting up integrations with the MLflow Model Registry. For example, you can use webhooks to perform the following integrations:

    • Trigger a CI workflow to validate your model when a new version of the model is created
    • Notify your team of the pending request through a messaging app when a model has received a stage transition request
    • Invoke a workflow to evaluate model fairness and bias when a model transition to production is requested
    • Trigger a deployment pipeline to automatically deploy your model when a tag is created.

    By automating your model deployment lifecycle, you can improve model quality, reduce rework, and ensure that each ML team member focuses on what they do best. Some of the most advanced users of the MLflow Model Registry are already using webhooks to manage millions of ML models.
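
    As a rough sketch of how such a webhook is registered (the endpoint path, payload fields, event names, and the workspace URL/token below are assumptions to check against the Model Registry Webhooks documentation for your cloud), a CI trigger for new model versions might look like this:

    import requests

    # Placeholders -- replace with your workspace URL and personal access token
    DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
    TOKEN = "<personal-access-token>"

    # Register a webhook that fires when a new version of "my-model" is created
    # and calls back into a CI system endpoint.
    payload = {
        "model_name": "my-model",
        "events": ["MODEL_VERSION_CREATED"],
        "http_url_spec": {"url": "https://ci.example.com/hooks/validate-model"},
    }

    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/mlflow/registry-webhooks/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload,
    )
    resp.raise_for_status()
    print(resp.json())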

    Get started with the MLflow Model Registry Webhooks

    Ready to get started or try it out for yourself? You can read more about MLflow Model Registry Webhooks and how to use them in our documentation at AWS, Azure, and GCP.

    --

    Try Databricks for free. Get started today.

    The post Streamline MLOps With MLflow Model Registry Webhooks appeared first on Databricks.

    Scaling SHAP Calculations With PySpark and Pandas UDF


    Motivation

    With the proliferation of applications of Machine Learning (ML) and especially Deep Learning (DL) models in decision making, it is becoming more crucial to see through the black box and justify key business decisions based on such models’ outputs. For example, if an ML model rejects a customer’s loan request or assigns a credit risk in peer-to-peer lending to a certain customer, giving business stakeholders an explanation of why this decision was made could be a powerful tool in encouraging the adoption of the models. In many cases, interpretable ML is not just a business requirement but a regulatory requirement to understand why a certain decision or option was given to a customer. SHapley Additive exPlanations (SHAP) is an important tool one can leverage towards explainable AI and to help establish trust in the outcome of ML models and neural networks in solving business problems.

    SHAP is a state-of-the-art framework for model explanation based on Game Theory. The approach involves finding a linear relationship between features in a model and the model output for each data point in your dataset. Using this framework, you can interpret your model’s output globally or locally. Global interpretability helps you understand how much each feature contributes to the outcomes positively or negatively. On the other hand, local interpretability helps you understand the effect of each feature for any given observation.

    The most common SHAP implementations adopted widely in the data science community run on single-node machines, meaning that they run all the computations on a single core, regardless of how many cores are available. Therefore, they do not take advantage of distributed computation capabilities and are bounded by the limitations of a single core.

    In this post, we will demonstrate a simple way to parallelize SHAP value calculations across several machines, specifically for local interpretability. We will then explain how this solution scales with the growing number of rows and columns in the dataset. Finally, we will highlight some of our findings on what works and what to avoid when parallelizing SHAP calculations with Spark.

    Single-node SHAP

    To realize explainability, SHAP turns a model into an Explainer; individual model predictions are then explained by applying the Explainer to them. There are several implementations of SHAP value calculations in different programming languages including a popular one in Python. With this implementation, to get explanations for each observation, you can apply an explainer appropriate for your model. The following code snippet illustrates how to apply a TreeExplainer to a Random Forest Classifier.

    import shap
    explainer = shap.TreeExplainer(clf)
    shap_values = explainer.shap_values(df)
    

    This method works well for small data volumes, but when it comes to explaining an ML model’s output for millions of records, it does not scale well due to the single-node nature of the implementation. For example, the visualization in figure 1 below shows the growth in execution time of a SHAP value calculation on a single node machine (4 cores and 30.5 GB of memory) for an increasing number of records. The machine ran out of memory for data shapes bigger than 1M rows and 50 columns, therefore, those values are missing in the figure. As you can see, the execution time grows almost linearly with the number of records, which is not sustainable in real-life scenarios. Waiting, for example, 10 hours to understand why a machine learning model has made a model prediction is neither efficient nor acceptable in many business settings.

    Single-node SHAP Calculation Execution Time

    Figure 1: Single-node SHAP Calculation Execution Time

    One way you may look to solve this problem is by using approximate calculation. You can set the approximate argument to True in the shap_values method. That way, the lower splits in the tree will have higher weights, and there is no guarantee that the SHAP values are consistent with the exact calculation. This will speed up the calculations, but you might end up with an inaccurate explanation of your model output. Furthermore, the approximate argument is only available in TreeExplainer.
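
    For example, the approximate mode would be enabled like this (a one-line sketch, reusing the explainer and df from the earlier snippet):

    # Faster but potentially less consistent SHAP values (TreeExplainer only)
    shap_values_approx = explainer.shap_values(df, approximate=True)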

    An alternative approach would be to take advantage of a distributed processing framework such as Apache Spark™ to parallelize the application of the Explainer across multiple cores.

    Scaling SHAP calculations with PySpark

    To distribute SHAP calculations, we are working with this Python implementation and Pandas UDFs in PySpark. We are using the kddcup99 dataset to build a network intrusion detector, a predictive model capable of distinguishing between bad connections, called intrusions or attacks, and good normal connections. This dataset is known to be flawed for intrusion detection purposes. However, in this post, we are purely focusing on SHAP value calculations and not the semantics of the underlying ML model.

    The two models we built for our experiments are simple Random Forest classifiers trained on datasets with 10 and 50 features to show scalability of the solution over different column sizes. Please note that the original dataset has less than 50 columns, and we have replicated some of these columns to reach our desired volume of data. The data volumes we have experimented with range from 4MB to 1.85GB.

    Before we dive into the code, let’s provide a quick overview of how Spark Dataframes and UDFs work. Spark Dataframes are distributed (by rows) across a cluster, each grouping of rows is called a partition and each partition (by default) can be operated on by 1 core. This is how Spark fundamentally achieves parallel processing. Pandas UDFs are a natural choice, as pandas can easily feed into SHAP and is performant. A pandas UDF, sometimes known as a vectorized UDF, gives us better performance over Python UDFs by using Apache Arrow to optimize the transfer of data.

    The code snippet below demonstrates how to parallelize applying an Explainer with a Pandas UDF in PySpark. We define a pandas UDF called calculate_shap and then pass this function to mapInPandas. This method is then used to apply the parallelized method to the PySpark dataframe. We will use this UDF to run our SHAP performance tests.

    from typing import Iterator

    import numpy as np
    import pandas as pd
    from pyspark.sql.types import FloatType, StructField, StructType

    def calculate_shap(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        # Each element of the iterator is a pandas DataFrame holding one partition's rows
        for X in iterator:
            yield pd.DataFrame(
                explainer.shap_values(np.array(X), check_additivity=False)[0],
                columns=columns_for_shap_calculation,
            )

    # One float column of SHAP values per feature
    return_schema = StructType()
    for feature in columns_for_shap_calculation:
        return_schema = return_schema.add(StructField(feature, FloatType()))

    shap_values = df.mapInPandas(calculate_shap, schema=return_schema)
    

    Figure 2 compares the execution time of 1M rows and 10 columns on a single-node machine vs clusters of sizes 2, 4, 8, 16, 32, and 64 respectively. The underlying machines for all clusters are similar (4 cores and 30.5 GB of memory). One interesting observation is that the parallelized code takes advantage of all the cores across the nodes in the cluster. Therefore, even using a cluster of size 2 improves performance almost 5 fold.

    Single-node vs Parallel SHAP Calculation Execution Time (1M rows, 10 columns)

    Figure 2: Single-node vs Parallel SHAP Calculation Execution Time (1M rows, 10 columns)

    Scaling with growing data size

    Due to how SHAP is implemented, additional features have a greater impact on performance than additional rows. Now we know that SHAP values can be calculated faster using Spark and Pandas UDF. Next we will look at how SHAP performs with additional features/columns.

    Intuitively growing data size means more calculations to crunch through for the SHAP algorithm. Figure 3 illustrates SHAP values execution times on a 16-node cluster for different numbers of rows and columns. You can see that scaling the rows increases the execution time almost directly proportional, that is, doubling the row count almost doubles execution time. Scaling the number of columns has a proportional relationship with the execution time; adding one column increases the execution time by almost 80%.

    These observations (Figure 2 and Figure 3) led us to conclude that the more data you have, the more you can scale your computation horizontally (adding more worker nodes) to keep the execution time reasonable.

    16-node Parallel SHAP Calculation Execution Time for Different Row and Column Counts

    Figure 3: 16-node Parallel SHAP Calculation Execution Time for Different Row and Column Counts

    When to consider parallelization?

    The questions we wanted to answer are: When is parallelization worth it? When should one start using PySpark to parallelize SHAP calculations, even knowing that Spark adds some overhead of its own? We set up an experiment to measure the effect of doubling cluster size on improving SHAP calculation execution time. The aim of the experiment is to figure out what size of data justifies throwing more horizontal resources (i.e., adding more worker nodes) at the problem.

    We ran the SHAP calculations for 10 columns of data and for row counts of 10, 100, 1000, and so forth up to 10M. For each row count, we measured the SHAP calculation execution time 4 times for cluster sizes of 2, 4, 32, and 64. The execution time ratio is the ratio of execution time of SHAP value calculation on the bigger cluster sizes (4 and 64) over running the same calculation on a cluster size with half the number of nodes (2 and 32 respectively).

    Figure 4 illustrates the result of this experiment. Here are the key takeaways:

      • For small row counts, doubling cluster sizes does not improve execution time and, in some cases, worsens it due to the overhead added by Spark task management (hence Execution Time Ratio > 1).
      • As we increase the number of rows, doubling the cluster size gets more effective. For 10M rows of data, doubling the cluster size almost halves the execution time.
      • For all row counts, doubling the cluster size from 2 to 4 is more effective than doubling from 32 to 64 (notice the gap between the blue and orange lines). As your cluster size grows, so does the overhead of adding more nodes: when the data size per partition becomes too small, creating a separate task to process that small amount of data adds more overhead than benefit, compared to using a more optimal data/partition size.
    The Effect of Doubling Cluster Size on Execution Time for Different Data Volumes

    Figure 4: The Effect of Doubling Cluster Size on Execution Time for Different Data Volumes

    Gotchas

    Repartitioning

    As mentioned above, Spark implements parallelism through the notion of partitions; data is partitioned into chunks of rows and each partition is processed by a single core by default. When data is initially read by Apache Spark it may not necessarily create partitions that are optimal for the computation that you want to run on your cluster. In particular, for calculating SHAP values, we can potentially get better performance by repartitioning our dataset.

    It is important to strike a balance: partitions should be small enough to spread the work across the cluster, but not so small that the overhead of creating them outweighs the benefits of parallelizing the calculations.
    For our performance tests, we decided to make use of all the cores in the cluster using the following code:

    # sc is the SparkContext; defaultParallelism typically equals the total number of cores in the cluster
    df = df.repartition(sc.defaultParallelism)
    

    For even bigger volumes of data you may want to set the number of partitions to 2 or 3 times the number of cores. The key is to experiment with it and find out the best partitioning strategy for your data.

    Use of display()

    If you are working in a Databricks notebook, you may want to avoid using the display() function when benchmarking execution times. Using display() may not show you how long a full transformation takes; it has an implicit row limit, which is injected into the query, and, depending on the operation you want to measure (e.g., writing to a file), there is additional overhead in gathering results back to the driver. Our execution times were measured using Spark’s write method with the "noop" format.
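
    For reference, here is a minimal sketch of the kind of benchmark write we mean; the "noop" format (available in Spark 3.x) forces the full computation without persisting any output:

    import time

    start = time.time()
    # Materialize the SHAP values without writing them anywhere
    shap_values.write.format("noop").mode("overwrite").save()
    print(f"Elapsed: {time.time() - start:.1f}s")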

    Conclusion

    In this blog post, we introduced a solution to speed up SHAP calculations by parallelizing it with PySpark and Pandas UDFs. We then evaluated the performance of the solution on increasing volumes of data, different machine types and changing configurations. Here are the key takeaways:

        • Single-node SHAP calculation grows linearly with the number of rows and columns.
        • Parallelizing SHAP calculations with PySpark improves the performance by running computation on all CPUs across your cluster.
        • Increasing cluster size is more effective when you have bigger data volumes. For small data, this method is not effective.

    Future work

    Scaling Vertically – The purpose of the blog post was to show how scaling horizontally with large datasets can improve the performance of calculating SHAP values. We started on the premise that each node in our cluster had 4 cores, 30.5 GB. In the future, it would be interesting to test the performance of scaling vertically as well as horizontally; for example, comparing performance between a cluster of 4 nodes (4 cores, 30.5GB each) with a cluster of 2 nodes (8 cores, 61GB each).

    Serialize/Deserialize – As mentioned, one of the core reasons to use Pandas UDFs over Python UDFs is that Pandas UDFs uses Apache Arrow to improve the serialization/deserialization of data between the JVM and python process. There could be some potential optimizations when converting Spark data partitions to Arrow record batches, experimenting with the Arrow batch size could lead to further performance gains.

    Comparison with distributed SHAP implementations – It would be interesting to compare the results of our solution to distributed implementations of SHAP, such as Shparkley. In conducting such a comparative study, it would be important to make sure the outputs of both solutions are comparable in the first place.

    --

    Try Databricks for free. Get started today.

    The post Scaling SHAP Calculations With PySpark and Pandas UDF appeared first on Databricks.

    Google Datastream Integration With Delta Lake for Change Data Capture


    This is a collaborative post between the data teams at Badal, Google and Databricks. We thank Eugene Miretsky, Partner, and Steven Deutscher-Kobayashi, Senior Data Engineer, of Badal, and Etai Margolin, Product Manager, Google, for their contributions.

    Operational databases capture business transactions that are critical to understanding the current state of the business. Having real-time insights into how your business is performing enables your data teams to quickly make business decisions in response to market conditions.

    Databricks provides a managed cloud platform to analyze data collected from source systems, including operational databases, in real time. With the Databricks Lakehouse Platform, you can store all of your data in a secure and open lakehouse architecture that combines the best of data warehouses and data lakes to unify all of your analytics and AI workloads. Today, we’re excited to share our partner Badal.io’s release of their Google Datastream Delta Lake connector, which enables Change Data Capture (CDC) for MySQL and Oracle relational databases. CDC is a software-based process that identifies and tracks changes to data in a source data management system, such as a relational database (RDBMS). CDC can provide a real-time view of data activity by processing data continuously as new database events occur.

    Why log-based CDC

    Log-based CDC is an alternative approach to traditional batch data ingestion. It reads the database’s native transaction log (sometimes called redo or binary log) and provides real-time or near-real-time replication of data by streaming the changes continuously to the destination as events occur.

    CDC presents the following benefits:

    • Simplified ingestion: Batch ingestion typically requires intimate knowledge of the source data model to handle incremental uploads and deletes; data engineers need to work with domain experts to configure the ingestion for each table. CDC decreases both the time and cost of ingesting new datasets.
    • Real-time data: CDC streams changes with seconds or minutes latency, enabling a variety of real-time use cases, such as near real-time dashboards, database replication and real-time analytics.
    • Minimal disruption to production workloads: While regular batch ingestion utilizes database resources to query data, CDC reads changes from the database’s redo or archive log, resulting in minimal consumption of resources.
    • Event-based architecture: Microservices can subscribe to changes in the database in the form of events. The microservices can then build their own views, caches and indexes while maintaining data consistency.

    Why Datastream

    Google Cloud Datastream is an easy-to-use CDC and replication service that allows you to synchronize data across heterogeneous databases, storage systems and applications reliably and with minimal latency.

    The benefits of Datastream include:

    • Serverless, so there are no resources to provision or manage, and the service automatically scales up and down as needed.
    • Easy to use setup and monitoring experiences that achieve super fast time-to-value
    • Secure, with private connectivity options and the security you expect from Google Cloud, with no impact to source databases.
    • Accurate and reliable with transparent status reporting and robust processing flexibility in the face of data and schema changes.
    • Data written to the destination is normalized into a unified-type schema. This means that downstream consumers are almost entirely source-agnostic, making it a simple solution that is easily scalable to support a wide range of different sources.

     Datastream is a serverless and easy-to-use Change Data Capture (CDC) and replication service

    Connector design

    Badal.io and Databricks collaborated on writing a Datastream connector for Delta Lake.

    Architecture

    Datastream writes change log records to Google Cloud Storage (GCS) in either Avro or JSON format. The datastream-delta connector uses Spark Structured Streaming to read these files as they arrive and streams them into a Delta Lake table.

    Delta Lake CDC architecture, whereby Datastream writes change log records to Google Cloud Storage (GCS) in either Avro or JSON format.
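
    To give a feel for what the connector does under the hood, here is a simplified, hypothetical sketch of streaming Datastream’s JSON files from GCS into a Delta table; the bucket paths, schema handling and checkpoint location are placeholders, and the real connector adds table discovery, schema migration and merge logic on top of this.

    # Simplified sketch only -- the actual datastream-delta connector handles table discovery,
    # schema migration and the staging/target merge automatically.
    stream = (
        spark.readStream.format("json")
        .schema(change_log_schema)            # schema derived from the Datastream metadata
        .load("gs://<bucket>/<datastream-path>/<table>/")
    )

    (
        stream.writeStream.format("delta")
        .option("checkpointLocation", "gs://<bucket>/_checkpoints/<table>/")
        .outputMode("append")                 # change-log records are appended as they arrive
        .start("gs://<bucket>/delta/staging/<table>")
    )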

    The connector creates two Delta Lake tables per source table:

    1. Staging table: This table contains every single change that was made in the source database since the replication started. Each row represents a Datastream DML statement (insert, update, delete). It can be replayed to rebuild the state of the database at any given point in the past. Below is an example of the staging table.
    Staging table columns: read_timestamp, source_timestamp, object, source_metadata, payload

    Row 1 (INSERT):
      read_timestamp / source_timestamp: 2021-05-16T00:40:05.000+0000
      object: demo_inventory.voters
      source_metadata: {"table":"inventory.voters","database":"demo","primary_keys":["id"],"log_file":"mysql-bin.000002","log_position":27105167,"change_type":"INSERT","is_deleted":false}
      payload: {"id":"743621506","name":"Mr. Joshua Jackson","address":"567 Jessica Plains Apt. 106\nWhitestad, HI 51614","gender":"t"}

    Row 2 (UPDATE):
      read_timestamp / source_timestamp: 2021-05-16T00:40:06.000+0000
      object: demo_inventory.voters
      source_metadata: {"table":"inventory.voters","database":"demo","primary_keys":["id"],"log_file":"mysql-bin.000002","log_position":27105800,"change_type":"UPDATE","is_deleted":false}
      payload: {"id":"299594688","name":"Ronald Stokes","address":"940 Jennifer Burg Suite 133\nRyanfurt, AR 92355","gender":"m"}

    Row 3 (DELETE):
      read_timestamp / source_timestamp: 2021-05-16T00:40:07.000+0000
      object: demo_inventory.voters
      source_metadata: {"table":"inventory.voters","database":"demo","primary_keys":["id"],"log_file":"mysql-bin.000002","log_position":27106451,"change_type":"DELETE","is_deleted":false}
      payload: {"id":"830510405","name":"Thomas Olson","address":"2545 Cruz Branch Suite 552\nWest Edgarton, KY 91433","gender":"n"}
    2. Target table: Contains the most recent snapshot of the source table. Its columns are id, name, address, gender, datastream_metadata_source_timestamp, datastream_metadata_source_metadata_log_file and datastream_metadata_source_metadata_log_position. For example:

      id: 207846446, name: Michael Thompson, address: 508 Potter Mountain, gender: m, source_timestamp: 2021-05-16T00:21:02.000+0000, log_file: mysql-bin.000002, log_position: 26319210
      id: 289483866, name: Lauren Jennings, address: 03347 Brown Islands, gender: t, source_timestamp: 2021-05-16T02:55:40.000+0000, log_file: mysql-bin.000002, log_position: 31366461
      id: 308466169, name: Patricia Riley, address: 991 Frederick Dam, gender: t, source_timestamp: 2021-05-16T00:59:59.000+0000, log_file: mysql-bin.000002, log_position: 27931699
      id: 348656975, name: Dr. Riley Moody, address: 89422 Devin Ridge, gender: t, source_timestamp: 2021-05-16T00:08:32.000+0000, log_file: mysql-bin.000002, log_position: 25820266
      id: 385058605, name: Elizabeth Gill, address: 728 Dorothy Locks, gender: f, source_timestamp: 2021-05-16T00:18:47.000+0000, log_file: mysql-bin.000002, log_position: 26226299

    The connector breaks the data ingestion into a multi-step process:

    1. Scan GCS to discover all active tables. Datastream stores each table in a separate subdirectory.
    2. Parse the table metadata to create a new Delta Lake database and table if required.
    3. Initialize two streams for each table:
       • a Structured Stream from a GCS source
       • a Structured Stream using a Delta table as a source
    4. Modify the schema of the staging and target tables if it differs from the schema of the current micro-batch. The staging table schema is migrated using Delta Lake’s automatic schema migration feature, while the target table schema is modified programmatically before executing the MERGE statement.
    5. Stream the changes (for each table) into a staging table. The staging table is an append-only table that stores rows of the change log, in which each row represents a DML statement (insert, update, delete).
    6. Stream changes from the staging table and merge them into the final table using Delta Lake MERGE statements.

    Table metadata discovery

    Datastream sends each event with all metadata required to operate on it: table schema, primary keys, sort keys, database, table info, etc.

    As a result, users don’t need to provide an additional configuration for each table they want to ingest. Instead, tables are auto-discovered and all relevant information is extracted from the events for each batch. This includes:

    1. Table and Database name
    2. Table Schema
    3. Primary keys, and sort keys to use in the merge statement.

    Merge logic

    This section will describe how the MERGE operation works at a high-level. This code is executed by the library and is not implemented by the user. The MERGE into the target table needs to be designed with care to make sure that all the records are updated correctly, in particular:

    1. Records representing the same entity are identified correctly using the primary key.
    2. If a micro-batch has multiple entries for the same record, only the latest entry is used.
    3. Out-of-order records are handled properly by comparing the timestamp of the record in the target table to the record in the batch, and using the latest version.
    4. Delete records are handled properly.

    First, for each micro-batch, we execute an operation such as the following to keep only the latest change-log entry per primary key:

    SELECT * FROM (
      SELECT A.*,
        ROW_NUMBER() OVER (
          PARTITION BY pkey1, pkey2
          ORDER BY source_timestamp DESC, source_metadata.log_file DESC, source_metadata.log_position DESC
        ) AS row_number
      FROM T_STAGING A
    )
    WHERE row_number = 1
    

    Then a merge operation comparable to the following SQL is executed:

    
    MERGE INTO target_table AS t
    USING staging_table AS s
    ON t.pKey1 = s.pKey1 AND t.pKey2 = s.pKey2
    WHEN MATCHED AND t.datastream_metadata_source_timestamp <= s.source_timestamp
         AND s.source_metadata.is_deleted THEN DELETE
    WHEN MATCHED AND t.datastream_metadata_source_timestamp <= s.source_timestamp
         THEN UPDATE SET t.colA = s.colA
    WHEN NOT MATCHED AND s.source_metadata.is_deleted != true
         THEN INSERT (colA) VALUES (s.colA)

    * For readability, we omitted updating the metadata columns used for ordering.
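
    For readers who prefer the programmatic API, a roughly equivalent merge can be expressed with the DeltaTable builder in PySpark. This is a sketch under the same simplifying assumptions as the SQL above (two primary-key columns and a single data column colA), not the connector's actual code; it could serve as the merge_batch helper referenced in the earlier streaming sketch.

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def merge_batch(batch_df, target_table):
        """Merge one micro-batch of change rows into the target Delta table."""
        # batch_df is assumed to be deduplicated already (see the RANK query above).
        target = DeltaTable.forName(spark, target_table)
        not_stale = "t.datastream_metadata_source_timestamp <= s.source_timestamp"

        (target.alias("t")
            .merge(batch_df.alias("s"), "t.pKey1 = s.pKey1 AND t.pKey2 = s.pKey2")
            # Latest change is a delete: remove the row from the target table.
            .whenMatchedDelete(condition=f"{not_stale} AND s.source_metadata.is_deleted")
            # Otherwise, overwrite the data columns with the newer values.
            .whenMatchedUpdate(condition=not_stale, set={"colA": "s.colA"})
            # Brand-new row that was not deleted at the source: insert it.
            .whenNotMatchedInsert(
                condition="s.source_metadata.is_deleted != true",
                values={"pKey1": "s.pKey1", "pKey2": "s.pKey2", "colA": "s.colA"})
            .execute())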
    

    Compaction and clean up

    Streaming workloads can result in sub-optimally sized Parquet files being written. Typically, if the data volume is not large enough, a tradeoff must be made between writing smaller files and increasing streaming latency to accumulate more data per write. Small files can degrade read and merge performance, as the job needs to scan many files.

    Further, MERGE queries tend to result in a lot of unused data when new entries for updated records overwrite older entries. The unused records don’t affect query correctness, but degrade both CDC and user query performance over time.

    To alleviate the problem, users are encouraged to do one of the following:

    1. If using a Databricks managed cluster, the best option is to use Auto Optimize and Auto Compaction to manage file sizes.
    2. Schedule a job to periodically run OPTIMIZE and VACUUM (see the example after this list).
    3. Use the connector's built-in feature to coalesce partitions before writing to the target table by setting the DELTA_MICROBATCH_PARTITIONS option. This is a simplified (and less effective) alternative to Databricks Auto Optimize.
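
    A minimal example of options 1 and 2, assuming the hypothetical inventory.voters table from the earlier sketches; the VACUUM retention window is illustrative and should match your recovery requirements.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    TARGET_TABLE = "inventory.voters"  # hypothetical table name

    # Option 1 (Databricks): enable optimized writes and auto compaction on the table.
    spark.sql(f"""
        ALTER TABLE {TARGET_TABLE} SET TBLPROPERTIES (
            delta.autoOptimize.optimizeWrite = true,
            delta.autoOptimize.autoCompact = true
        )
    """)

    # Option 2: schedule these periodically to compact small files and
    # clean up files no longer referenced by the Delta log.
    spark.sql(f"OPTIMIZE {TARGET_TABLE}")
    spark.sql(f"VACUUM {TARGET_TABLE} RETAIN 168 HOURS")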

    Why Delta Lake to build the Lakehouse

    Delta Lake is an open-source project to build reliable data lakes that you can easily govern and scale out to billions of files. Delta Lake uses open-source Apache Parquet as the columnar file format for data that can be stored in cloud object storage, including Google Cloud Storage (GCS), Azure Blob Storage, Azure Data Lake Storage (ADLS), AWS Simple Storage Service (S3) and the Hadoop Distributed File System (HDFS). Thousands of organizations use Delta Lake as the foundation for their enterprise data and analytics platforms. Reliability, scalability and governance for data lakes are achieved through the following features of Delta Lake:

    • ACID transactions for Apache Spark workloads: Serializable isolation levels ensure that multiple concurrent readers and writers can operate in parallel and never see inconsistent data. Supports merge, update and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations and streaming upserts.
    • Scalable metadata handling: Can handle large tables consisting of billions of partitions and files with ease.
    • Schema enforcement: Schema on read is useful for certain use cases, but this can lead to poor data quality and reporting anomalies. Delta Lake provides the ability to specify a schema and enforce it.
    • Audit history: A transaction log records all changes made to the data, providing a full audit trail of the operations performed, by whom, when, and more.
    • Time travel: Data versioning enables rollbacks and point-in-time recovery to restore data (see the example after this list).
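
    As a quick illustration of the last two features, again using the hypothetical inventory.voters table from the earlier sketches:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Audit history: inspect every operation performed on the table, by whom and when.
    spark.sql("DESCRIBE HISTORY inventory.voters").show(truncate=False)

    # Time travel: query the table as of an earlier version or timestamp.
    v0 = spark.sql("SELECT * FROM inventory.voters VERSION AS OF 0")
    as_of = spark.sql("SELECT * FROM inventory.voters TIMESTAMP AS OF '2021-05-16'")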

    Delta Lake is fully compatible with Apache Spark APIs so you can use it with existing data pipelines with minimal change. Databricks provides a managed cloud service to build your data lake and run your analytics workloads with several additional performance features for Delta Lake:

    • Photon execution engine: New execution engine that provides extremely fast performance and is compatible with Apache Spark APIs.
    • Data Skipping Indexes: Create file-level statistics to avoid scanning files that do not contain the relevant data. Imagine having millions of files containing sales data, but only a dozen of the files contain the actual information you need. With data skipping indexes, the optimizer will know exactly which files to read and skip the rest, thereby avoiding a full scan of the millions of files.
    • File Compaction (bin-packing): Improve the speed of read queries by coalescing small files into larger ones. Data lakes can accumulate lots of small files, especially when data is being streamed and incrementally updated. Small files cause read operations to be slow. Coalescing small files into fewer larger ones through compaction is a critical data lake maintenance technique for fast read access.
    • Z-Ordering: Co-locate related information in the same set of files to reduce the amount of data that needs to be read (see the example after this list).
    • Bloom Filter Indexes: Quickly search through billions of rows to test for membership of an element in a set.
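
    For example, file compaction and Z-Ordering can be combined in a single OPTIMIZE call; the table name and the column to cluster on are assumptions carried over from the earlier sketches.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Compact small files and co-locate rows by a frequently filtered column,
    # so data skipping can prune files for queries that filter on address.
    spark.sql("OPTIMIZE inventory.voters ZORDER BY (address)")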


    To get started, visit the Google Datastream Delta Lake connector GitHub project. If you don’t already have a Databricks account, then try Databricks for free.

    --

    Try Databricks for free. Get started today.

    The post Google Datastream Integration With Delta Lake for Change Data Capture appeared first on Databricks.
