
It was a Virtual Summer!


Databricks Summer 2020 interns at our Intern Olympics kick-off event

This summer Databricks hosted a completely virtual intern program, in which interns around the world worked on projects to help data teams solve the world’s toughest problems. Our interns were introduced to their teams and quickly got to work on their summer projects. Alongside their project work, interns participated in hackathons, Lunch & Learns with executive leaders, the first annual Intern Olympics, Employee Resource Group events, and coffee chats with employees across the company! Here are some highlights about what our interns were up to this summer:

“At Databricks, I was able to participate in not one, but two hackathons in twelve weeks. Unlike most companies, at Databricks all engineers participate in hackathons and all of our standard work pauses for two entire days. Not only was it a nice change of scenery from my regular intern project but I had the opportunity to work with some tenured staff engineers on my second project. It was pretty surreal to be able to collaborate side by side with them as we worked on implementing our feature, which involved diving deep into how Apache Spark™ execution works. After all, the saying at Databricks goes ‘all of our popular features come from Hackathons’!” – Ned, Compute Fabric Team, San Francisco

Databricks Summer 2020 Interns at our Improv Event

Our interns participated in a virtual improv session to build confidence in their public speaking skills and enjoy a fun team building activity together.

“If I could give advice to future interns, I’d say make sure to participate in social events and make time to hang out with your fellow interns!” – Hannah, Photon Team, San Francisco

Our interns got a visit from a few unexpected team members, some goats

Collaboration is a big part of the day-to-day at Databricks, and interns were able to work on projects within their teams, across the Engineering organization, and with some unexpected team members.

“Everyone on my team was super helpful and patient, willing to hop on a call on a moment’s notice to help with explaining or debugging. The project was very interesting and challenging, and I felt like I was a real contributor to both the design and implementation of it.” – Ryan, Photon Team, San Francisco

Intern AMA Event with our CEO Ali Ghodsi

Ali, our CEO, had a meet and greet session with interns to get to know everyone and answer any and all questions!

“Databricks has an amazing culture for fostering great talent. Interns are given full trust and responsibility for driving their projects forward and each project has the potential to make a huge impact on the company.” – Brandon, Observability Team, San Francisco

One of our interns, Philip, with his intern sweatshirt before kayaking

Every intern received a surprise swag box and we loved seeing everyone rep their swag around the world!

“When working remotely, the work you do every day becomes very important. I was assigned to work on a high stake project which showed the level of confidence my team and fellow team members had in me to succeed. Frequent contact with both my manager and mentor helped me to trust myself and ultimately succeed in my project.” – Philip, Jobs Team, Amsterdam

Our interns meet for a morning coffee chat

During our Intern Olympics, teams met for morning coffee to strategize, prepare for events, and get to know each other!

“My internship at Databricks this summer was amazing! I had the chance to work on high impact projects and large scale services and work with brilliant people. The company transparency was impressive and I had access to world-class learning materials.” – Xiaoqiao, Service Infrastructure Team, San Francisco

A huge thank you to our Summer 2020 intern class for their patience and positive attitudes as we all navigated through our first virtual internship program together. We’re so proud of all the work you contributed to Databricks and can’t wait to see you soon!

Interested in joining our next class of interns? Check out our Careers Page.



Diving Into Delta Lake: DML Internals (Update, Delete, Merge)



In the previous blogs Diving Into Delta Lake: Unpacking The Transaction Log and Diving Into Delta Lake: Schema Enforcement & Evolution, we described how the Delta Lake transaction log works and the internals of schema enforcement and evolution. Delta Lake supports DML (data manipulation language) commands including `DELETE`, `UPDATE`, and `MERGE`. These commands simplify change data capture (CDC), audit and governance, and GDPR/CCPA workflows, among others. In this post, we will demonstrate how to use each of these DML commands, describe what Delta Lake is doing behind the scenes when you run them, and offer performance tuning tips for each. More specifically, we will cover:

  • A quick primer on the Delta Lake ACID transaction log
  • The fundamentals of running DELETE, UPDATE, and MERGE
  • The actions Delta Lake performs when you run these commands
  • The basics of partition pruning in Delta Lake
  • How streaming queries work within Delta Lake

If you prefer watching this information, you can also review the Diving into Delta Lake Part 3: How do DELETE, UPDATE, and MERGE work tech talk.

Delta Lake: Basic Mechanics

Delta Lake Basic mechanisms

If you would like to know more about the basic mechanics of Delta Lake, see the earlier post Diving Into Delta Lake: Unpacking the Transaction Log.

Delta Lake DML: UPDATE

You can use the `UPDATE` operation to selectively update any rows that match a filtering condition, also known as a predicate. The code below demonstrates how to use a predicate as part of an `UPDATE` statement. Note that Delta Lake offers APIs for Python, Scala, and SQL; for the purposes of this post we’ll include only the SQL code.

-- Update events: fix a misspelled eventType value
UPDATE events SET eventType = 'click' WHERE eventType = 'clck'

UPDATE: Under the hood

Delta Lake performs an `UPDATE` on a table in two steps:

  1. Find and select the files containing data that match the predicate, and therefore need to be updated. Delta Lake uses data skipping whenever possible to speed up this process.
  2. Read each matching file into memory, update the relevant rows, and write out the result into a new data file.

Delta Lake Under the Hood: Replacing files using the UPDATE command.

Once Delta Lake has executed the `UPDATE` successfully, it adds a commit in the transaction log indicating that the new data file will be used in place of the old one from now on. The old data file is not deleted, though. Instead, it’s simply “tombstoned” — recorded as a data file that applied to an older version of the table, but not the current version. Delta Lake is able to use it to provide data versioning and time travel.
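
To see that commit, you can inspect the table history. A minimal sketch, run from Python on Databricks against the `events` table used above:

# List the table's commits, including the UPDATE operation and the version it created
display(spark.sql("DESCRIBE HISTORY events"))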

UPDATE + Delta Lake time travel = Easy debugging

Keeping the old data files turns out to be very useful for debugging because you can use Delta Lake “time travel” to go back and query previous versions of a table at any time. In the event that you update your table incorrectly and want to figure out what happened, you can easily compare two versions of a table to one another.

SELECT * FROM events VERSION AS OF 12
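
As a minimal sketch of such a comparison using the Python DataFrame reader (the table path and version numbers are illustrative), you can diff two versions directly:

# Load the table as of two different versions (path is a placeholder)
v11 = spark.read.format("delta").option("versionAsOf", 11).load("/data/events")
v12 = spark.read.format("delta").option("versionAsOf", 12).load("/data/events")

# Rows present in version 12 but not in version 11, i.e. the rows the UPDATE rewrote
display(v12.exceptAll(v11))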

UPDATE: Performance tuning tips

The main way to improve the performance of the `UPDATE` command on Delta Lake is to add more predicates to narrow down the search space. The more specific the search, the fewer files Delta Lake needs to scan and/or modify.

The Databricks managed version of Delta Lake features other performance enhancements like improved data skipping, the use of bloom filters, and Z-Order Optimize (multi-dimensional clustering), which is like an improved version of multi-column sorting. Z-ordering reorganizes the layout of each data file so that similar column values are strategically colocated near one another for maximum efficiency. Read more about Z-Order Optimize on Databricks.
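
For example, on Databricks you can Z-Order the table by the column you filter on most often; a hedged sketch via SQL from Python (the column choice is illustrative):

# Databricks-only: co-locate data files by eventType so predicates on it touch fewer files
spark.sql("OPTIMIZE events ZORDER BY (eventType)")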

Delta Lake DML: DELETE

You can use the `DELETE` command to selectively delete rows based upon a predicate (filtering condition).

DELETE FROM events WHERE date < '2017-01-01'
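
The same delete can also be expressed with the Delta Lake Python API; a minimal sketch, assuming the table lives at an illustrative path:

from delta.tables import DeltaTable

# Delete all events older than 2017 using the programmatic API (path is a placeholder)
deltaTable = DeltaTable.forPath(spark, "/data/events")
deltaTable.delete("date < '2017-01-01'")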

In the event that you want to revert an accidental DELETE operation, you can use time travel to roll back your table to the way it was, as demonstrated in the following Python snippet.

deltaPath = "/tmp/loans_delta"

# Read the correct (pre-delete) version of the table into memory
dt = spark.read.format("delta") \
               .option("versionAsOf", 4) \
               .load(deltaPath)

# Overwrite the current table with the DataFrame in memory
dt.write.format("delta") \
        .mode("overwrite") \
        .save(deltaPath)

DELETE: Under the hood

`DELETE` works just like `UPDATE` under the hood. Delta Lake makes two scans of the data: the first scan is to identify any data files that contain rows matching the predicate condition. The second scan reads the matching data files into memory, at which point Delta Lake deletes the rows in question before writing out the newly clean data to disk.

After Delta Lake completes a `DELETE` operation successfully, the old data files are not deleted — they’re still retained on disk, but recorded as “tombstoned” (no longer part of the active table) in the Delta Lake transaction log. Remember, those old files aren’t deleted immediately because you might still need them to time travel back to an earlier version of the table. If you want to delete files older than a certain time period, you can use the `VACUUM` command.

DELETE + VACUUM: Cleaning up old data files

Running the `VACUUM` command permanently deletes all data files that are:

  1. no longer part of the active table, and
  2. older than the retention threshold, which is seven days by default.

Delta Lake does not automatically `VACUUM` old files — you must run the command yourself, as shown below. If you want to specify a retention period that is different from the default of seven days, you can provide it as a parameter.

    
from delta.tables import *

deltaTable = DeltaTable.forPath(spark, deltaPath)

# vacuum files not required by versions older than the default
# retention period, which is 168 hours (7 days)
deltaTable.vacuum()

# vacuum files not required by versions more than 48 hours old
deltaTable.vacuum(48)

Caution: Running the `VACUUM` command with a retention period of 0 hours will delete all files that are not used in the most recent version of the table. Make sure that you do not run this command while there are active writes to the table in progress, as data loss may occur.

For more information about the VACUUM command, as well as examples of it in Scala and SQL, take a look at the documentation for the VACUUM command.

DELETE: Performance tuning tips

Just like with the `UPDATE` command, the main way to improve the performance of a `DELETE` operation on Delta Lake is to add more predicates to narrow down the search space. The Databricks managed version of Delta Lake also features other performance enhancements like improved data skipping, the use of bloom filters, and Z-Order Optimize (multi-dimensional clustering), as well. Read more about Z-Order Optimize on Databricks.

Delta Lake DML: MERGE

The Delta Lake `MERGE` command allows you to perform “upserts”, which are a mix of an `UPDATE` and an `INSERT`. To understand upserts, imagine that you have an existing table (a.k.a. a target table), and a source table that contains a mix of new records and updates to existing records. Here’s how an upsert works:

  • When a record from the source table matches a preexisting record in the target table, Delta Lake updates the record.
  • When there is no such match, Delta Lake inserts the new record.

MERGE INTO events
USING updates
    ON events.eventId = updates.eventId
    WHEN MATCHED THEN UPDATE
        SET events.data = updates.data
    WHEN NOT MATCHED THEN 
        INSERT (date, eventId, data) VALUES (date, eventId, data) 

The Delta Lake `MERGE` command greatly simplifies workflows that can be complex and cumbersome with other traditional data formats like Parquet. Common scenarios where merges/upserts come in handy include change data capture, GDPR/CCPA compliance, sessionization, and deduplication of records. For more information about upserts, read the blog posts Efficient Upserts into Data Lakes with Databricks Delta, Simple, Reliable Upserts and Deletes on Delta Lake Tables using Python API,  and Schema Evolution in Merge Operations and Operational Metrics in Delta Lake.

For more in-depth information about the `merge` programmatic operation, including the use of conditions with the `whenMatched` clause, visit the documentation.
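
As a hedged sketch of that programmatic API (the table path and the name of the source DataFrame, updatesDF, are illustrative), the SQL merge above could be written in Python as:

from delta.tables import DeltaTable

eventsTable = DeltaTable.forPath(spark, "/data/events")

# Upsert the updates DataFrame into the events table, keyed on eventId
eventsTable.alias("events").merge(
    updatesDF.alias("updates"),
    "events.eventId = updates.eventId") \
  .whenMatchedUpdate(set={"data": "updates.data"}) \
  .whenNotMatchedInsert(values={
      "date": "updates.date",
      "eventId": "updates.eventId",
      "data": "updates.data"}) \
  .execute()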

MERGE: Under the hood

Delta Lake completes a `MERGE` in two steps.

  1. Perform an inner join between the target table and source table to select all files that have matches.
  2. Perform an outer join between the selected files in the target and source tables and write out the updated/deleted/inserted data.

Delta Lake Under the Hood: Merging files using the MERGE command.

The main way that this differs from an `UPDATE` or a `DELETE` under the hood is that Delta Lake uses joins to complete a `MERGE`. This fact allows us to utilize some unique strategies when seeking to improve performance.

MERGE: Performance tuning tips

To improve performance of the `MERGE` command, you need to determine which of the two joins that make up the merge is limiting your speed.

If the inner join is the bottleneck (i.e., finding the files that Delta Lake needs to rewrite takes too long), try the following strategies:

    • Add more predicates to narrow down the search space.
    • Adjust shuffle partitions.
    • Adjust broadcast join thresholds.
    • Compact the small files in the table if there are lots of them, but don’t compact them into files that are too large, since Delta Lake has to copy the entire file to rewrite it.

On Databricks’ managed Delta Lake, use Z-Order optimize to exploit the locality of updates.

On the other hand, if the outer join is the bottleneck (i.e., rewriting the actual files themselves takes too long), try the strategies below (a brief configuration sketch follows the list):

  • Adjust shuffle partitions.
    • Can generate too many small files for partitioned tables.
    • Reduce files by enabling automatic repartitioning before writes (with Optimized Writes in Databricks Delta Lake)
  • Adjust broadcast thresholds. If you’re doing a full outer join, Spark cannot do a broadcast join, but if you’re doing a right outer join, Spark can do one, and you can adjust the broadcast thresholds as needed.
  • Cache the source table / DataFrame.
    • Caching the source table can speed up the second scan, but be sure not to cache the target table, as this can lead to cache coherency issues.
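
As a hedged illustration of a few of these knobs (the values are placeholders to tune for your workload, updatesDF stands for the merge's source DataFrame, and Optimized Writes is a Databricks Delta Lake feature):

# Tune the number of shuffle partitions used by the merge's joins
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Raise the broadcast threshold (in bytes) so a small source table can be broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))

# Databricks Delta Lake: repartition data before writes to avoid producing many small files
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# Cache the source of the merge (not the target table) to speed up the second scan
updatesDF.cache()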

Summary

Delta Lake supports DML commands including `UPDATE`, `DELETE`, and `MERGE INTO`, which greatly simplify the workflow for many common big data operations. In this article, we demonstrated how to use these commands in Delta Lake, shared information about how each one works under the hood, and offered some performance tuning tips.

 

Interested in the open source Delta Lake?
Visit the Delta Lake online hub to learn more, download the latest code and join the Delta Lake community.

 

Related

Articles in this series:
Diving Into Delta Lake #1: Unpacking the Transaction Log
Diving Into Delta Lake #2: Schema Enforcement & Evolution
Diving Into Delta Lake #3: DML Internals (Update, Delete, Merge)

Other resources:
Delta Lake Quickstart
Databricks documentation on UPDATE, MERGE, and DELETE
Simple, Reliable Upserts and Deletes on Delta Lake Tables using Python APIs


Data + AI Summit Europe Goes Virtual With a Data-Centric Agenda


Technical conferences evolve over time. They expand beyond their initial focus, adding new technologies, attracting new attendees and broadening their range of sessions and speakers. Formerly known as Spark + AI Summit Europe, the new Data + AI Summit Europe has embraced a data-centric approach — focusing on Apache Spark, MLflow and Delta Lake use cases; data engineering and Delta Lake infrastructure; SQL analytics, BI and data visualization; and machine learning automation, MLOps and AI use cases.

Data + AI Summit Europe 2020 features an expansive virtual agenda covering a range of big data and AI subjects for data engineers, data scientists and data analysts

We are pleased to share the agenda for this year’s Data + AI Summit. You’ll see a range of sessions that focus on how to put the latest technologies and techniques into practice. With 125+ sessions, the event will cover the most popular open source projects in the industry, including Apache Spark, MLflow, Delta Lake, Koalas, TensorFlow, PyTorch and the Python data science ecosystem.

With this data-centric focus, the Summit aims to bring together data teams — data visionaries, Spark experts, machine learning developers, data engineers, scientists and analysts — to demonstrate innovation at scale, share how they solve tough problems and realize the full potential of data and AI.

AI, data, SQL and BI analytics, and other sessions by theme

Among the organizations presenting are Databricks, Microsoft, Facebook, Salesforce, IBM, Informatica, Walmart, H&M, Ernst & Young, SNCF, ByteDance, Intel, Lyft, CERN, Nielsen, Levi Strauss, Seldon, KTH, National University of Singapore, Stanford, University of KwaZulu-Natal, and many more. Topics and themes include:

  • Automation and AI use cases: Learn how to use machine learning and other AI technologies to automate workflows, processes and systems. We have speakers from leading research groups and industry sectors, including IT and software, financial services, retail, logistics, IoT and media and advertising.
  • Machine learning and deep learning: Explore tracks on popular libraries and tools, use cases and applications in forecasting and anomaly detection, recommenders, computer vision, and natural language processing.
  • Building, deploying and maintaining data pipelines: As data and machine learning applications become more sophisticated, underlying data pipelines have become harder to build and maintain. We have a series of presentations on best practices for data engineering teams to build reliable data pipelines using Apache Spark and Delta Lake.
  • MLOps and productionising ML: Choose from more than 20 presentations focusing on managing the machine learning development lifecycle, and how to deploy and monitor models once they’ve been deployed. This is an area where open source projects such as MLflow coupled with MLOps best practices are starting to emerge.
  • Data management and platforms: Early in the year, Databricks introduced a new data management paradigm — Lakehouse — for the age of data, machine learning and AI. Summit features sessions that will examine the different components of a Lakehouse architecture, including data management and data ingestion.
  • SQL analytics, BI and data visualization: We’ve expanded the conference with a focus on SQL and BI workloads. We will feature a dedicated track for data analysts, covering these use cases and related open source technologies like Redash.
  • Training and deep dives: For developers, one of the most popular features of these conferences is training. Summit features a full day of training with courses in Delta Lake, Apache Spark 3.0, MLOps with MLflow, Deep Learning and Machine Learning with Apache Spark. And deep dives will immerse attendees into technical aspects of open source technologies such as Redash, Apache Spark, Delta Lake and MLflow.
  • Spark performance and scalability: As always, Apache Spark will play a central role in Data + AI Summit Europe, with more than 20 sessions on scaling and tuning machine learning models, Spark SQL internals, and what’s new in Apache Spark 3.0.

Come and join us

Join the European data community online and enjoy the camaraderie at Data + AI Summit Europe 2020. Register to save your free spot!

Also, check out who is giving keynotes and what courses we offer on training day.

Save Your Spot


Detecting Criminals and Nation States through DNS Analytics


You are a security practitioner, a data scientist or a security data engineer; you’ve seen the Large Scale Threat Detection and Response talk with Databricks. But you’re wondering, “How can I try Databricks in my own security operations?” In this blog post, you will learn how to detect a remote access trojan using passive DNS (pDNS) and threat intel. Along the way, you’ll learn how to store and analyze DNS data using Delta, Spark and MLflow. As you well know, APTs and cybercriminals are known to utilize DNS. Threat actors use the DNS protocol for command and control, beaconing, or resolution of attacker domains. This is why academic researchers and industry groups advise security teams to collect and analyze DNS events to hunt, detect, investigate and respond to threats. But, as you know, it’s not as easy as it sounds.

The complexity, cost, and limitations of legacy technology make detecting DNS security threats challenging for most enterprise organizations.

Detecting the Agent Tesla RAT with Databricks

Using the notebook below, you will be able to detect the Agent Tesla RAT. You will be using analytics for domain generation algorithms (DGA), typosquatting and threat intel enrichments from URLhaus. Along the way you will learn the Databricks concepts of:

  • Data ingestion
  • Ad hoc analytics
  • How to enrich event data, such as DNS queries
  • Model building and
  • Batch and Streaming analytics

Why use Databricks for this? Because the hardest thing about security analytics isn’t the analytics. You already know that analyzing large-scale DNS traffic logs is complicated. Colleagues in the security community tell us that the challenges fall into three categories:

  • Deployment complexity: DNS server data is everywhere. Cloud, hybrid, and multi-cloud deployments make it challenging to collect the data, have a single data store and run analytics consistently across the entire deployment.
  • Tech limitations: Legacy SIEM and log aggregation solutions can’t scale to cloud data volumes for storage, analytics or ML/AI workloads, especially when it comes to joining data like threat intel enrichments.
  • Cost: SIEMs and log aggregation systems charge by volume of data ingested. With so much data, SIEM/log licensing and hardware requirements make DNS analytics cost prohibitive, and moving data from one cloud service provider to another is also costly and time consuming. The hardware pre-commit in the cloud, or the expense of physical hardware on-prem, is a further deterrent for security teams.

In order to address these issues, security teams need a real-time data analytics platform that can handle cloud scale, analyze data wherever it is, natively support streaming and batch analytics, and have collaborative content development capabilities. And… if someone could make this entire system elastic to prevent hardware commits… now wouldn’t that be cool!

Below is the Databricks notebook to execute the analytics. You can use this notebook in the Databricks community edition or in your own Databricks deployment. There are a lot of lines here, but the high-level flow is this (a minimal ingestion sketch follows the list):

  • Read passive DNS data from AWS S3 bucket
  • Specify the schema for DNS and load the data into Delta
  • Explore the data with string matches
  • Build the DGA detection model. Build the typosquatting model.
  • Enrich the output of the DGA and typosquatting with threat intel from URLhaus
  • Run the analytics and detect the AgentTesla RAT
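
To make the first few steps concrete, here is a minimal ingestion sketch, assuming pDNS records land as CSV in S3; the bucket path, column names, and schema are illustrative placeholders rather than the notebook's actual definitions:

    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    # Illustrative pDNS schema -- real passive DNS feeds carry similar fields
    pdns_schema = StructType([
        StructField("query_time", TimestampType()),
        StructField("rrname", StringType()),   # the domain that was queried
        StructField("rrtype", StringType()),   # A, AAAA, CNAME, ...
        StructField("rdata", StringType()),    # the resolved answer
    ])

    # 1. Read passive DNS data from an S3 bucket (path is a placeholder)
    raw_pdns = spark.read.format("csv").option("header", "true").schema(pdns_schema).load("s3a://my-bucket/pdns/")

    # 2. Load the data into a Delta table
    raw_pdns.write.format("delta").mode("overwrite").save("/tmp/dns_analytics/bronze/pdns")

    # 3. Explore the data with simple string matches
    spark.read.format("delta").load("/tmp/dns_analytics/bronze/pdns") \
        .filter("rrname LIKE '%agenttesla%'") \
        .show(10, truncate=False)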

Databricks DNS analytics help to detect criminal threats using pDNS, URLHaus, dnstwist and Apache Spark.

Each section of the notebook has comments. We invite you to email us at cybersecurity@databricks.com with your questions and suggestions for making this notebook easier to understand and deploy.

Now, we invite you to log in to the community edition or your own Databricks account and run this notebook. We look forward to your feedback.

You can create a community edition account by going to this link. Then you can import the notebook:

  • Go to Databricks Community Edition
  • In the left navigation, click Workspace
  • Right-click in the whitespace of the Workspace pane and click Import
  • Select Import from URL
  • Paste this link in the URL field

Please refer to the docs for detailed instructions on importing the notebook to run.

TRY THE NOTEBOOK!


Measuring Advertising Effectiveness with Sales Forecasting and Attribution


Click below to download the notebooks for this solution accelerator:

Campaign Effectiveness — ETL

Campaign Effectiveness — Machine Learning

How do you connect the impact of marketing and ad spend to the sales they drive? As the advertising landscape continues to evolve, advertisers are finding it increasingly challenging to efficiently pinpoint the impact of various revenue-generating marketing activities within their media mix.

Brands spend billions of dollars annually promoting their products at retail. This marketing spend is planned 3 to 6 months in advance and is used to drive promotional tactics to raise awareness, generate trials, and increase consumption of the brand’s products and services. This entire model is being disrupted by COVID-19. Consumer behaviors are changing rapidly, and brands no longer have the luxury of planning promotional spend months in advance; they need to make decisions in weeks and days, or even in near real time. As a result, brands are shifting budgets to more agile channels such as digital ads and promotions.

Making this change is not easy for brands. Digital tactics hold the promise of increased personalization by delivering the message most likely to resonate with the individual consumer. Traditional statistical analysis and media planning tools, however, have been built around long lead times using aggregate data, which makes it harder to optimize messaging at the segment or individual level. Marketing or Media Mix Modeling (MMM) is commonly used to understand the impact of different marketing tactics in relation to other tactics, and to determine optimal levels of spend for future initiatives, but MMM is a highly manual, time-intensive and backwards-looking exercise due to the challenges of integrating a wide range of data sets at different levels of aggregation.

Your print and TV advertising agency might send a biweekly Excel spreadsheet providing impressions at a Designated Market Area (DMA) level; digital agencies might provide CSV files showing clicks and impressions at the zip-code level; your sales data may be received at a market level; and search and social each have their own proprietary reports via APIs slicing up audiences by a variety of factors. Compounding this challenge is that as brands shift to digital media and more agile methods of advertising, they increase the number of disparate data sets that need to be quickly incorporated and analyzed. As a result, most marketers conduct MMM exercises at most once a quarter (and most often just once a year) since rationalizing and overlaying these different data sources is a process that can take several weeks or months.

While MMM is useful for broader level marketing investment decisions, brands need the ability to quickly make decisions at a finer level. They need to integrate new marketing data, perform analysis, and accelerate decision making from months or weeks to days or hours. Brands that can respond to programs that are working in real time will see significantly higher return on investment as a result of their efforts.

Introducing the Sales Forecasting & Advertising Attribution Dashboard Solution Accelerator

Based on best-practices from our work with the leading brands, we’ve developed solution accelerators for common analytics and machine learning use cases to save weeks or months of development time for your data engineers and data scientists.

Whether you’re an ad agency or an in-house marketing analytics team, this solution accelerator allows you to easily plug in sales, ad engagement, and geo data from a variety of historical and current sources to see how these drive sales at a local level. With this solution, you can also attribute digital marketing efforts at the aggregate trend level without cookie/device ID tracking and mapping, which has become a bigger concern with the news of Apple deprecating IDFA.

Normally, attribution can be a fairly expensive process, particularly when running attribution against constantly updating datasets without the right technology. Fortunately, Databricks provides a Unified Data Analytics Platform with Delta Lake — an open source transaction layer for managing your cloud data lake — for large scale data engineering and data science on a multi-cloud infrastructure. This blog will demonstrate how Databricks facilitates the multi-stage Delta Lake transformation, machine learning, and visualization of campaign data to provide actionable insights.

Three things make this solution accelerator unique compared to other advertising attribution tools:

  1. Ability to easily integrate new data sources into the schema: One of the strengths of the Delta architecture is how easily it blends new data into the schema. Through the automated data enrichment within Delta Lake, you can easily, for example, integrate a new data source that uses a different time/date format compared to the rest of your data. This makes it easy to overlay marketing tactics into your model, integrating new data sources with ease.
  2. Real-time dashboarding: While most MMM results in a point-in-time analysis, the accelerator’s automated data pipelines feed easily-shared dashboards that allow business users to immediately map or forecast ad-impressions-to-sales as soon as those files are generated to get daily-level or even segment-level data visualizations.
  3. Integration with machine learning: With the machine learning models in this solution, marketing data teams can build more granular top-down or ground-up views into which advertising is resonating with which customer segments at the daily- or even individual-level.

By providing the structure and schema enforcement on all your marketing data, Delta Lake on Databricks can make this the central source of data consumption for BI and AI teams, effectively making this the Marketing Data Lake.

How this solution extends and improves on traditional MMM, forecasting, and attribution

The two biggest advantages of this solution are faster time to insight and increased granularity over traditional MMM, forecasting, and attribution, achieved by combining reliable data ingestion and preparation, agile data analysis, and machine learning efforts into a unified insights platform.

When trying to determine campaign spend optimizations via MMM, marketers have traditionally relied on manual processes to collect long-term media buying data, as well as observing macro factors that may influence campaigns, like promotions, competitors, brand equity, seasonality, or economic factors. The typical MMM cycle can take weeks or months, often not providing actionable insights until long after campaigns have gone live, or sometimes, not until a campaign has ended! By the time traditional marketing mix models are built and validated, it may be too late to act upon valuable insights and key factors in order to ensure a maximally effective campaign.

Furthermore, MMM focuses on recommending media mix strategies from a big-picture perspective providing only top-down insight without taking optimal messaging at a more granular level into account. As advertising efforts have heavily shifted toward digital media, traditional MMM approaches fail to offer insights into how these user-level opportunities can be effectively optimized.

By unifying the ingestion, processing, analysis, and data science of advertising data into a single platform, marketing data teams can generate insights at a top-down and bottom-up granular level. This will enable marketers to perform immediate daily-level or even user-level deep dives, and help advertisers determine precisely where along the marketing mix their efforts are having the most impact so they can optimize the right messaging at the right time through the right channels. In short, marketers will benefit tremendously from a more efficient and unified measurement approach.

Solution overview

Architecture overview for the Databricks Sales Forecasting and Advertising Attribution Dashboard Solution Accelerator.

At a high level we are connecting a time series of regional sales to regional offline and online ad impressions over the trailing thirty days. By using ML to compare the different kinds of measurements (TV impressions or GRPs versus digital banner clicks versus social likes) across all regions, we then correlate the type of engagement to incremental regional sales in order to build attribution and forecasting models. The challenge comes in merging advertising KPIs  such as impressions, clicks, and page views from different data sources with different schemas (e.g., one source might use day parts to measure impressions while another uses exact time and date; location might be by zip code in one source and by metropolitan area in another).

As an example, we are using a SafeGraph rich dataset for foot traffic data to restaurants from the same chain. While we are using mocked offline store visits for this example, you can just as easily plug in offline and online sales data provided you have region and date included in your sales data. We will read in different locations’ in-store visit data, explore the data in PySpark and Spark SQL, and make the data clean, reliable and analytics ready for the ML task. For this example, the marketing team wants to find out which of the online media channels is the most effective channel to drive in-store visits.

The main steps are:

  1. Ingest: Mock Monthly Foot Traffic Time Series in SafeGraph format – here we’ve mocked data to fit the schema (Bronze)
  2. Feature engineering: Convert to monthly time series data so we match numeric value for number of visits per date (row = date) (Silver)
  3. Data enrichment: Overlay regional campaign data to regional sales. Conduct exploratory analysis of features like distribution check and variable transformation (Gold)
  4. Advanced Analytics / Machine Learning: Build the Forecasting & Attribution model

About the Data:

We are using SafeGraph Patterns to extract in-store visits. SafeGraph’s Places Patterns is a dataset of anonymized and aggregated visitor foot-traffic and visitor demographic data available for ~3.6MM points of interest (POI) in the US. In this exercise, we look at historical data (Jan 2019 – Feb 2020) for a set of limited-service restaurant in-store visits in New York City.

1. Ingest data into Delta format (Bronze)

Start with the notebook “Campaign Effectiveness_Forecasting Foot Traffic_ETL”.

The first step is to load the data in from blob storage. In recent years, more and more advertisers have chosen to ingest their campaign data into blob storage. For example, you can retrieve data programmatically through the FBX Facebook Ads Insights API, querying endpoints for impressions, CTRs, and CPC. In most cases, data will be returned in either CSV or XLS format. In our example, the configuration is quite seamless: we pre-mount the S3 bucket to DBFS, so that once the source file directory is set up, we can directly load the raw CSV files from the blob store into Databricks.

    raw_sim_ft = (spark.read.format("csv")
                  .option("header", "true")
                  .option("sep", ",")
                  .load("/tmp/altdata_poi/foot_traffic.csv"))
    raw_sim_ft.createOrReplaceTempView("safegraph_sim_foot_traffic")
    

Then I create a temp view, which allows me to interact directly with those files using Spark SQL. Because the in-store visits by day are stored as one big array at this point, we will have some feature engineering work to do later. At this point I’m ready to write out the data in Delta format, creating the Delta Lake Bronze table that captures all my raw data at a blob storage location. Bronze tables serve as the first stop of your data lake, where raw data arrives continuously from various sources via batch or streaming, and where it can be captured and stored in its original raw format. The data at this step can be dirty because it comes from different sources.

    raw_sim_ft.write.format('delta').mode('overwrite').save('/home/layla/data/table/footTrafficBronze')
    

2. Feature engineering to make sales time series ready to plot (Silver)

After bringing the raw data in, we have some data cleaning and feature engineering tasks to do: for instance, adding the MSA region and parsing out month and year. And since visits_by_day is an array, we need to explode the data into separate rows. The block of functions below flattens the array; after running it, it returns a visits_by_day DataFrame with num_visits mapped to each row:

    import json
    from pyspark.sql.functions import udf, explode
    from pyspark.sql.types import MapType, StringType, IntegerType

    # Parse a JSON string column value into a Python dict
    def parser(element):
        return json.loads(element)
    jsonudf = udf(parser, MapType(StringType(), IntegerType()))

    # Convert a JSON array string into a {position: value} map
    convert_array_to_dict_udf = udf(lambda arr: {idx: x for idx, x in enumerate(json.loads(arr))}, MapType(StringType(), IntegerType()))

    # Explode a parsed map column into one row per (key, value) pair, keeping the location metadata
    def explode_json_column_with_labels(df_parsed, column_to_explode, key_col="key", value_col="value"):
        df_exploded = df_parsed.select("safegraph_place_id", "location_name", "msa", "date_range_start", "year", "month", "date_range_end", explode(column_to_explode)).selectExpr("safegraph_place_id", "date_range_end", "location_name", "msa", "date_range_start", "year", "month", "key as {0}".format(key_col), "value as {0}".format(value_col))
        return df_exploded
    

After feature engineering, the data is ready for use by downstream business teams. We can persist it to a Delta Lake Silver table so that everyone on the team can access the data directly. At this stage the data is clean, and multiple downstream Gold tables will depend on it. Different business teams may have their own business logic for further data transformation. For example, you can imagine a Silver table “Features for analytics” that hydrates several downstream tables with very different purposes, like populating an insights dashboard, generating reports using a set of metrics, or feeding ML algorithms.
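
A minimal sketch of that write (the output path mirrors the Bronze path above and is illustrative):

    # Persist the exploded, feature-engineered visits as the Silver Delta table
    visits_by_day.write.format('delta').mode('overwrite').save('/home/layla/data/table/footTrafficSilver')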

3. Data enrichment with advertising campaign overlay (Gold – Analytics Ready)

At this point, we are ready to enrich the dataset with the online campaign media data. In the traditional MMM data gathering phase, data enrichment services normally take place outside the data lake, before reaching the analytics platform. With either approach the goal is the same: transform a business’s basic advertising data (i.e. impressions, clicks, conversions, audience attributes) into a more complete picture of demographic, geographic, psychographic, and/or purchasing behaviors.

Data enrichment is not a one-time process. Information like audience locations, preferences, and actions changes over time. By leveraging Delta Lake, advertising data and audience profiles can be consistently updated to ensure data stays clean, relevant, and useful. Marketers and data analysts can build more complete consumer profiles that evolve with the customer. In our example, we have banner impressions, social media FB likes, and web landing page visits. Using Spark SQL, it’s really easy to join these different streams of data to the original dataframe. To further enrich the data, we call the Google Trends API to pull in a keyword search index that represents the organic search element.
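
A hedged sketch of such a join (the media DataFrame, its columns, and the join keys are illustrative; the channel names mirror those used later in the notebook):

    # Join daily online media metrics onto the in-store visit time series by region and date
    enriched_df = visits_by_day.join(
        online_media_df.select("msa", "date", "banner_imp", "social_media_like", "landing_page_visit"),
        on=["msa", "date"],
        how="left")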

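
The Google Trends pull could look something like the following sketch, using the open source pytrends client; the keyword, timeframe, and geography are placeholders, and this is not necessarily how the original notebook retrieved the index:

    from pytrends.request import TrendReq

    # Pull a search-interest index for an illustrative keyword in New York
    pytrends = TrendReq(hl="en-US", tz=360)
    pytrends.build_payload(kw_list=["fast food"], timeframe="2019-01-01 2020-02-29", geo="US-NY")
    google_trend_pdf = pytrends.interest_over_time()  # pandas DataFrame indexed by date

    # Convert to a Spark DataFrame so it can be joined onto the Gold table
    google_trend_df = spark.createDataFrame(google_trend_pdf.reset_index())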
    

With the Databricks marketing mix analysis solution, you can quickly generate insights by plotting time-series graphs to, for example, visualize trends in counts or numerical values over time.

Finally, a dataset combining the in-store num_visits and the online media data is produced. We can quickly derive insights by plotting the num_visits time series. For instance, you can use the graph to visualize trends in counts or numerical values over time. In our case, because the data is an aggregate count over continuous dates, points are plotted along the x-axis and connected by a continuous line. Missing data is displayed with a dashed line.

Time series graphs can answer questions about your data, such as: how does the trend change over time? Or do I have missing values? The graph below shows in-store visits in the period from January 2019 to February 2020. The highest volume of in-store visits occurred in mid-September 2019. If marketing campaigns occurred in those months, that would imply that the campaigns were effective, but only for a limited time.

Time-series visualizations produced by the Databricks marketing mix analysis solution give at-a-glance answers to questions like “how do trends change over time?” or “what data is missing?”

We write out this clean, enriched dataset to Delta Lake, and create a Gold table on top of it.

4. Advanced Analytics and Machine Learning to build forecast and attribution models

Traditional MMM uses a combination of ANOVA and multiple regression. In this solution we will demonstrate how to use the ML algorithm XGBoost, which has the advantage of pairing natively with the SHAP model explainer, in the second ML notebook. Even though this solution does not replace traditional MMM processes, traditional MMM statisticians can still write single-node code and use pandas_udf to run it.
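
For example, a statistician's single-node fit can be scaled out per region with a grouped pandas UDF; the sketch below is a hypothetical illustration (the DataFrame and column names are placeholders), not the approach used in these notebooks:

    import numpy as np
    import pandas as pd
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    result_schema = StructType([
        StructField("region", StringType()),
        StructField("social_media_coef", DoubleType()),
    ])

    def fit_per_region(pdf: pd.DataFrame) -> pd.DataFrame:
        # Purely illustrative: a one-variable least-squares slope per region
        slope = float(np.polyfit(pdf["social_media_like"], pdf["num_visits"], 1)[0])
        return pd.DataFrame({"region": [pdf["region"].iloc[0]], "social_media_coef": [slope]})

    # One independent single-node model fit per region, run in parallel across the cluster (Spark 3.x)
    per_region_fits = foot_traffic_df.groupBy("region").applyInPandas(fit_per_region, schema=result_schema)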

For this next step, use the notebook “Campaign Effectiveness_Forecasting Foot Traffic_Machine Learning”.

Up to this point, we’ve used Databricks to ingest and combine all the raw data; we then cleaned, transformed, and added extra reliability to the data by writing it to Delta Lake for faster query performance.

At this point, we should feel pretty good about the dataset. Now it’s time to create the attribution model. We are going to use the curated gold table data to look closely at foot traffic in New York City to understand how the fast food chain’s various advertising campaign efforts drove in-store visits.

The main steps are:

  1. Create a machine learning approach that predicts the number of in-store visits given a set of online media data
  2. Leverage the SHAP model interpreter to decompose the model prediction and quantify how much foot traffic a certain media channel drove

As a standard step in data science, we want to understand the probability distribution of the target variable, store visits, and of the potential features, because it tells us all the possible values (or intervals) of the data and implies underlying characteristics of the population. We can quickly identify from this chart that for all New York State in-store visits there are two peaks, indicating a multimodal distribution. This implies underlying differences between population segments that we should drill into further.

    %sql
    select * from 
    (select region, city, cast(year as integer) year, cast(month as integer) month, cast(day as integer) day, sum(num_visits) num_visits 
    from layla_v2.Subway_foot_traffic 
    where region = 'NY' and num_visits >= 50
    group by region, city, cast(year as integer), cast(month as integer), cast(day as integer)
    order by year, month, day, num_visits
    )
    

Databricks’ marketing mix analysis solution uses attribution models to help advertisers understand probability distributions and answer questions like how or why one city’s foot traffic differs from others.

When separating New York City traffic from all the other cities, the distribution looks close to normal – NYC must be a unique region!

Databricks’ marketing mix analytics solution allows the advertiser to drill down into a dataset and isolate and draw conclusions from, for example, foot traffic specific to NYC.

Then we also check the distribution for all the features using  Q-Q Plots and normality tests. From the charts we can tell the features look quite normally distributed. Good bell curves here.

The Databricks marketing mix analytics solution allows the data science team to check the distribution for all features using Q-Q Plots and normality tests.

One great advantage of doing analysis on Databricks is that I can freely switch between Spark DataFrames and pandas, and use popular visualization libraries like plotly to plot charts in the notebook to explore my data. The following chart is from plotly. We can zoom in, zoom out, and drill in to look closely at any data points.

plotly chart with zoom in out panel

As we can see, it’s really easy to create all the stat plots that we needed without leaving the same notebook environment.

Now we are confident that the data is suitable for model training, so let’s train a prediction model. For the algorithm, we will use XGBoost. The dataset isn’t big, so single-node training is an efficient approach: when the training data fits into memory (say, < 10 GB), we recommend training ML models on a single machine, as distributed training and inference can be more complex and slower due to inter-node communication overhead. However, you still have the option to distribute your single-node training across the cluster and have multiple models trained in parallel.

We can also leverage the Databricks Runtime AutoML capabilities – HyperOpt – to tune the model’s hyperparameters in a distributed fashion, increasing the efficiency of finding the best hyperparameters:

    import numpy as np
    from hyperopt import fmin, tpe, rand, hp, Trials, STATUS_OK
    import xgboost
    from xgboost import XGBRegressor
    from sklearn.model_selection import cross_val_score
    import mlflow
    import mlflow.xgboost
    
    from sklearn.model_selection import train_test_split
    pdf = city_pdf.copy()
    X_train, X_test, y_train, y_test = train_test_split(pdf.drop(['region',	'year',	'month','day','date', 'num_visits'], axis=1), pdf['num_visits'], test_size=0.33, random_state=55)
    
    def train(params):
        """
        Train an XGBoost regressor with the given hyperparameters and return the cross-validated MSE as the loss.
        This method will be passed to `hyperopt.fmin()`.
        
        :param params: hyperparameters. Its structure is consistent with how search space is defined. See below.
        :return: dict with fields 'loss' (scalar loss) and 'status' (success/failure status of run)
        """
        curr_model =  XGBRegressor(learning_rate=params[0],
                                gamma=int(params[1]),
                                max_depth=int(params[2]),
                                n_estimators=int(params[3]),
                                min_child_weight = params[4], objective='reg:squarederror')
        score = -cross_val_score(curr_model, X_train, y_train, scoring='neg_mean_squared_error').mean()
        score = np.array(score)
        
        return {'loss': score, 'status': STATUS_OK, 'model': curr_model}
    
    
    # define search parameters and whether discrete or continuous
    search_space = [ hp.uniform('learning_rate', 0, 1),
                        hp.uniform('gamma', 0, 5),
                        hp.randint('max_depth', 10),
                        hp.randint('n_estimators', 20),
                        hp.randint('min_child_weight', 10)
                    ]
    # define the search algorithm (TPE or Randomized Search)
    algo= tpe.suggest
    
    from hyperopt import SparkTrials
    search_parallelism = 4
    spark_trials = SparkTrials(parallelism=search_parallelism)
    
    with mlflow.start_run():
        argmin = fmin(
        fn=train,
        space=search_space,
        algo=algo,
        max_evals=8,
        trials=spark_trials)
    
    
    def fit_best_model(X, y): 
        client = mlflow.tracking.MlflowClient()
        experiment_id = client.get_experiment_by_name(experiment_name).experiment_id
    
        runs = mlflow.search_runs(experiment_id)
        best_loss = runs['metrics.loss'].min()
        best_run=runs[runs['metrics.loss'] == best_loss]
    
        best_params = {}
        best_params['gamma'] = float(best_run['params.gamma'])
        best_params['learning_rate'] = float(best_run['params.learning_rate'])
        best_params['max_depth'] = float(best_run['params.max_depth'])
        best_params['min_child_weight'] = float(best_run['params.min_child_weight'])  
        best_params['n_estimators'] = float(best_run['params.n_estimators'])
        
        xgb_regressor =  XGBRegressor(learning_rate=best_params['learning_rate'],
                                max_depth=int(best_params['max_depth']),
                                n_estimators=int(best_params['n_estimators']),
                                gamma=int(best_params['gamma']),
                                min_child_weight = best_params['min_child_weight'], objective='reg:squarederror')
    
        xgb_model = xgb_regressor.fit(X, y, verbose=False)
    
        return(xgb_model)
    
    # fit model using best parameters and log the model
    xgb_model = fit_best_model(X_train, y_train) 
    mlflow.xgboost.log_model(xgb_model, "xgboost") # log the model here 
    
    
    from sklearn.metrics import r2_score
    from sklearn.metrics import mean_squared_error
    train_pred = xgb_model.predict(X_train)
    test_pred = xgb_model.predict(X_test)
    

With Databricks’ marketing mix analytics solution, one can leverage the Runtime AutoML capabilities (HyperOpt) to tune the model’s hyperparameters in a distributed fashion and increase the efficiency of finding the best hyperparameters.

Note that by specifying SparkTrials, HyperOpt automatically distributes the tuning job across an Apache Spark cluster. After HyperOpt finds the best set of parameters, we only need to fit the model once to get the best model fit, which is much more efficient than running hundreds of iterations of model fits and cross-validating to find the best model. Now we can use the fitted model to forecast NYC in-store traffic:

    # Forecast NYC in-store traffic with the tuned model
    train_pred = xgb_model.predict(X_train)
    test_pred = xgb_model.predict(X_test)

With Databricks’ marketing mix analytics solution, one can use HyperOpt to automatically distribute a tuning job across a Spark cluster to, for example, find the optimal model to forecast in-store traffic for a specific city.

The red line is prediction while blue is actual visits – looks like the model captures the major trend though it misses a few spikes here and there. It definitely needs some tweaking later. Still, it’s pretty decent for such a quick effort!

Once we have the prediction model, one natural question is: how does the model make its predictions? How does each of the features contribute to this black-box algorithm? In our case, the question becomes, “How much does each media input contribute to in-store foot traffic?”

By directly using the SHAP library, an open source model interpreter, we can quickly derive insights such as “What are the most important media channels driving my offline activities?”

There are a few benefits that we get from using SHAP. Firstly, it can produce explanations at the level of individual inputs: each individual observation has its own set of SHAP values, whereas traditional feature importance algorithms only tell us which features are most important across the entire population. By looking only at trends at a global level, these individual variations can get lost, with only the most common denominators remaining. With individual-level SHAP values, we can pinpoint which factors are most impactful for each observation, allowing us to make the resulting model more robust and the insights more actionable. In our case, SHAP will compute values for each media input for each day’s in-store visits.
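
Before plotting, we compute the SHAP values for the fitted XGBoost model; a minimal sketch, where X is assumed to be the feature DataFrame used above (e.g., X_train):

    import shap

    # TreeExplainer works directly with tree ensembles such as XGBoost
    explainer = shap.TreeExplainer(xgb_model)
    shap_values = explainer.shap_values(X)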

From the Shapley value chart, we can quickly identify that social media and landing page visits had the highest contribution to the model:

    shap.summary_plot(shap_values, X, plot_type="bar")
    

With Databricks’ marketing mix analytics solution, one can use the SHAP library to quickly identify, for example, that social media and landing page visits had the highest contribution to the model.

    # Mean absolute SHAP value per feature (assumes numpy is imported as np)
    mean_abs_shap = np.abs(shap_values).mean(axis=0).tolist()
    display(spark.createDataFrame(sorted(list(zip(mean_abs_shap, X.columns)), reverse=True)[:8], ["Mean |SHAP|", "Feature"]))
    

With Databricks’ marketing mix analytics solution, one can use the SHAP library to quickly derive insights such as “what are the most important media channels driving my offline activities?”

SHAP can provide this granular insight into media mix contribution at the individual level. We can directly relate feature values to the unit of output: here, SHAP quantifies the impact of a feature in the unit of the model target, the in-store visit. In this graph, we can read the impact of each feature in units of visits, which greatly improves the interpretation of the results compared to the relative scores from feature importance.

    # 'bundle_js' is assumed to hold shap's JavaScript bundle so the force plot renders via displayHTML,
    # and 'n' is the index of the observation (day) being explained
    plot_html = shap.force_plot(explainer.expected_value, shap_values[n:n+1], feature_names=X.columns, plot_cmap='GnPR')
    displayHTML(bundle_js + plot_html.data)
    

With Databricks’ marketing mix analytics solution, one can use the SHAP library to quantify, for example, the impact of a feature on the unit of model target, the in-store visit.

And finally, we can create the full decomposition chart for the daily foot-traffic time series and get a clear understanding of how the in-store visits attribute to each online media input. Traditionally, this requires data scientists to build a decomposition matrix, involving messy transformations and back-and-forth calculations. Here, with SHAP, you get the values out of the box!

    import plotly.graph_objects as go
    
    fig = go.Figure(data=[
        go.Bar(name='base_value', x=shap_values_pdf['date'], y=shap_values_pdf['base_value'], marker_color='lightblue'),
        go.Bar(name='banner_imp', x=shap_values_pdf['date'], y=shap_values_pdf['banner_imp']),
        go.Bar(name='social_media_like', x=shap_values_pdf['date'], y=shap_values_pdf['social_media_like']),
        go.Bar(name='landing_page_visit', x=shap_values_pdf['date'], y=shap_values_pdf['landing_page_visit']),
        go.Bar(name='google_trend', x=shap_values_pdf['date'], y=shap_values_pdf['google_trend'])
    ])
    # Change the bar mode
    fig.update_layout(barmode='stack')
    fig.show()
    

Plotly chart with zoom in/out panel

We are making the code behind our analysis available for download and review.  If you have any questions about how this solution can be deployed in your environment, please don’t hesitate to reach out to us.

--

Try Databricks for free. Get started today.

The post Measuring Advertising Effectiveness with Sales Forecasting and Attributing appeared first on Databricks.

Flipp Presents Their Lakehouse Architecture with Delta Lake at Tableau Conference


Databricks at the Tableau Conference 2020

The Tableau Conference 2020 begins tomorrow, with our session Databricks: Data Science & Analytics for Data Lakes at 1:30 PM PDT. In this session, Ameya Malondkar and Yana Yang from Flipp, a joint customer using Databricks and Tableau together, will present how they enable all their analysts to access and analyze their entire data lake. They have a great story about their journey to a modern cloud data platform that data-driven teams aspire to create!

Flipp is utilizing a Lakehouse data management paradigm on Databricks and Delta Lake that enables them to get data into the system fast, to progressively refine it, and to deliver it to multiple audiences depending upon their use cases. Business analysts and sales teams can see how partners and customers are progressing, data science teams can build powerful predictive analytics, and engineering teams can create new product features. The Lakehouse approach makes all the data in the data lake available to these groups for both regular reporting and for ad-hoc investigations.

The Flipp team will talk through their data pipelines on Delta Lake, and how their data is refined through bronze, silver and gold stages. They’ll also cover a number of analytics use cases, with sample visualizations to show how they represent the data.

Flipp use case: Tableau reporting expedited by the use of the Databricks’ Lakehouse architecture

You can find out more about Databricks and our presence at the Tableau conference on the Databricks Tableau Conference page. We have some cool games and the ability to win a limited edition t-shirt!

Learn More About Databricks at Tableau Conference

--

Try Databricks for free. Get started today.

The post Flipp Presents Their Lakehouse Architecture with Delta Lake at Tableau Conference appeared first on Databricks.

Analyzing Algorand Blockchain Data with Databricks Delta


Algorand is a public, decentralized blockchain system that uses a proof of stake consensus protocol. It is fast and energy-efficient, with a transaction commit time under 5 seconds and throughput of one thousand transactions per second. The Algorand system is composed of a network of distributed nodes that work collaboratively to process transactions and add blocks to its distributed ledger.

The following diagram illustrates how blocks containing transactions link together to form the blockchain.

Blocks linked sequentially in Algorand Blockchain
Figure 1: Blocks linked sequentially in Algorand Blockchain

To ensure optimum network performance, it is important to continually monitor and analyze business and operational metrics.

Databricks provides a Unified Data Analytics Platform for massive-scale data engineering and collaborative data science on multi-cloud infrastructure. Delta is an open-source storage layer from Databricks that brings reliability and performance to big data processing. This blog post will demonstrate how Delta facilitates real-time data ingestion, transformation, and visualization of blockchain data to provide the necessary insights.

In this article, we will show how to use Databricks to analyze the operational aspects of the Algorand network. This will include ingestion, transformation, and visualization of Algorand network data to answer questions like:

  • To ensure optimum network health, is the ratio of nodes to relays similar across different regions?
  • Are the average incoming/outgoing connections by host by country within a reasonable threshold to ensure good utilization of the network without straining specific hosts/regions?
  • Do average connection durations between nodes fall below a certain threshold that might indicate high rates of connection failures or a more persistent problem?
  • Are there any potential weak links in the network topology for global reach/access?
  • Given a node, what are the other nodes that it is connected to, and how does that change over time?

Having different methods of visualizing the data facilitates detecting and addressing such issues.

This is the first of a two-part blog. In part 2, we’ll analyze block, transaction, and account data.

Algorand Network

The Algorand blockchain is a decentralized network of nodes and relays geographically distributed and connected by the Internet. The nodes and relays follow the consensus protocol to agree on the next block of the blockchain.  The proof of stake consensus protocol is fast and new blocks are produced in less than 5 seconds.  In order to produce a block, 77.5% of the stake must agree on the block.  All nodes have a copy of the ledger, which is the collection of all blocks produced to date.  There is a significant amount of communication between the nodes and relays so good connectivity is essential for proper operation.

The Algorand network is composed of:

  • Node: An instance of Algorand software that is primarily responsible for participating in the consensus protocol.  Nodes communicate with other Nodes through Relays. Because it is a distributed ledger, each node has its own copy of the transaction details.
  • Relay: An instance of the Algorand software that provides a communication hub for the Nodes
  • Nodes and relays form a star topology where:
    • A node connects only to relays (one or more).
    • A relay can connect to other relays.
  • The connections between nodes and relays are periodically updated to favor the best performing connections by disconnecting from slow connections.

Node Telemetry Data

Nodes and Relays in an Algorand Blockchain Networks
Figure 2: Nodes and Relays in an Algorand Blockchain Networks
  • Nodes process transactions and participate in the consensus protocol by voting on blocks
  • 77.5% of the stake has to agree on the next block proposal, so it is important for nodes to communicate efficiently with each other during the different stages of voting
  • Nodes propagate votes and transactions through relays
  • Relays act as communication hubs for the nodes
  • Relays connect to 1 or more other relays
  • Nodes and Relays all maintain a copy of the distributed ledger

Note: The data used is from the Algorand mainnet blockchain. The node identities (IP and names) have been obfuscated in the notebook.

Algorand Data 

  1. Node Telemetry Data (JSON data from Elasticsearch)
    • What: Peer connection data that describes the network topology of nodes and relays. It gives a real-time view of the active nodes & relays and their interconnectivity.
    • Why: It is important to ensure that the network is not partitioned, that nodes are distributed evenly across the world, and that the load is shared across them in a balanced way.
    • Where: The nodes (compute instances) periodically transmit this information to a configured Elasticsearch endpoint, and the analytic system ingests it from there.
  2. Block, Transaction, Account Data (JSON/CSV data from S3)
    • What: Transaction data and individual account balances are committed into blocks chained sequentially.
    • Why: This gives visibility into usage of the blockchain network and the people (accounts) transacting on it, where each account is an established identity and each transaction/block has a unique identifier.
    • Where: This data is generated on the individual nodes comprising the blockchain network, pushed into S3, and ingested by the analytic system.

Table 1: Algorand data types

Analytics Workflow

The following diagram illustrates the present data flow.

Algorand Analytics Primary Components
Figure 3: Algorand Analytics Primary Components

The ability to stream data into Delta tables as it arrives supports near real-time analysis of the network. The data is received in JSON format with the following schema:

  • Timestamp
  • Host
  • MessageType
  • Data
    • List of OutgoingPeers
      • Address, ConnectionDuration, Endpoint, HostName
    • List of IncomingPeers
      • Address, ConnectionDuration, Endpoint, HostName

Algorand nodes and relays send telemetry data updates to an Elasticsearch cluster once per hour. Elasticsearch is based on Lucene and is primarily a search technology, with analytics layered on top and Kibana providing a real-time dashboard with time-series functionality out of the box. However, the ingested data is in a proprietary format, there is no separation of compute and storage, and over time this can get expensive and require substantial effort to maintain. Elasticsearch does not support transactions and offers limited data manipulation options. Its query syntax involves a learning curve, and for any advanced ML the data has to be pulled out.

To support analysis using Databricks, the data is pulled from Elasticsearch using the Elastic Connector and stored in S3. The data is then transformed into Delta Tables which support the full range of CRUD operations with ACID compliance. S3 is used as the data storage layer and is scalable, reliable, and affordable. A Spark cluster is spun up only when compute is required. BI reporting, interactive queries for EDA (Exploratory Data Analysis), and ML workloads can all work off the data in S3 using a variety of tools, frameworks, and familiar languages including SQL, R, Python, and Scala. BI reporting tools like Tableau can directly tap into the data in Delta. For very specific BI reporting needs, some datasets can be pushed to other systems including a data warehouse or an Elasticsearch cluster.

It is worth noting that the present data flow can be simplified by writing the telemetry data directly into S3 to provide a consolidated data lake, skipping Elasticsearch altogether. The existing analytics currently done in Elasticsearch can be transferred to Databricks.

Blockchain Network Analysis Process

The following steps define the process to retrieve and transform the data into the resulting Delta table.

  1. Periodically retrieve telemetry data from Elasticsearch
  2. Parse & flatten the JSON data and save the incoming & outgoing peer connection details in separate tables.
  3. Call geocoding APIs for new node IPs
    1. Convert IP addresses to ISO3 country code and lat/long coordinates
    2. Convert from ISO3 to ISO2 country codes
  4. Analyze the node telemetry data for trends and outliers
    1. Using SQL
    2. Using Graph libraries
  5. Visualize Node Network Data
    1. Using pyvis libraries
    2. Expose data in a Delta table for geospatial visualization using Tableau
    3. Output node and edge CSV data files in S3 for visualization in a web browser using D3

The following diagram illustrates the analysis process:
Algorand Analytics Data Flow
Figure 4: Algorand Analytics Data Flow

Step 1: Pull data from Elasticsearch

Use SQL to periodically read from ElasticSearch Connector to pull the telemetry data from Elasticsearch into S3. Data can be indexed and queried from Spark SQL transparently as Elasticsearch is a native source for Spark SQL. The connector pushes down the operations directly to the source, where the data is efficiently filtered out so that only the required data is streamed back to Spark. This significantly increases the query performance and minimizes the CPU, memory, and I/O on both Spark and Elasticsearch clusters as only the needed data is returned.

    s_sql= '''
    create temporary table node_telemetry
    using org.elasticsearch.spark.sql
    options('resource'='stable-mainnet-v1.0', 
    'nodes'= '{}',
    'es.nodes.wan.only'='true',
    'es.port'='{}',
    'es.net.ssl'='true',
    'es.read.field.as.array.include'='Data.details.OutgoingPeers,Data.details.IncomingPeers',
    'es.net.http.auth.user'= '{}',
    'es.net.http.auth.pass'= '{}')
    '''.format(ES_EP, ES_PORT, ES_USER, ES_PWD)
    spark.sql(s_sql)
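
Before the transformations in Step 2, the raw hourly pull can also be persisted as-is into a Delta table on S3, in line with the architecture described above. A minimal PySpark sketch (the target path is illustrative):

    # Persist the hourly Elasticsearch pull into a raw Delta table on S3
    raw_df = spark.table("node_telemetry")

    (raw_df.write
        .format("delta")
        .mode("append")
        .save("/mnt/algo-data/node_telemetry/raw"))  # illustrative mount path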

Step 2:  Parse and flatten JSON data

Create ‘out’ and ‘in’ tables to hold peer connection data received over the last hour.  Use the SparkSQL explode function to create a row for each connection.

    %sql
    CREATE TABLE algo.out 
    LOCATION "/mnt/algo-data/node_telemetry/Outgoing"
    AS
    select `@timestamp`, Host, Message ,  explode(Data.details.OutgoingPeers) as Peers
    from node_telemetry where `@timestamp` >= cast('2020-04-27T02:00:00.000+0000' as TIMESTAMP) and `@timestamp` < cast('2020-04-27T03:00:00.000+0000' as TIMESTAMP) and Message = '/Network/PeerConnections';
    
    CREATE TABLE algo.in 
    LOCATION "/mnt/algo-data/node_telemetry/Incoming"
    AS
    select `@timestamp`, Host, Message ,  explode(Data.details.IncomingPeers) as Peers
    from node_telemetry where `@timestamp` >= cast('2020-04-27T02:00:00.000+0000' as TIMESTAMP) and `@timestamp` < cast('2020-04-27T03:00:00.000+0000' as TIMESTAMP) and Message = '/Network/PeerConnections'

Step 3a: Create Edge information and save in S3 as CSV file

    edge_s = '''
    SELECT Host as origin, Peers.Hostname as destination, 1 as num_cnx FROM algo.Out where Host != 'null'
    UNION
    SELECT Peers.Hostname as origin, Host as destination, 1 as num_cnx FROM algo.In where Peers.Hostname != 'null'
    '''
    
    edge_df = spark.sql(edge_s)
    edge_df.coalesce(1).toPandas().to_csv("/dbfs/mnt/algo-data/edge_display/edge.csv", header=True) 

Step 3b: Create Node information and add geocoding

Create node information and save it in S3. Call a geocoding REST API (e.g., IPStack, Google Geocoding API) within a UDF to map each node’s IP address to its location (latitude, longitude, country, state, city, zip, etc.). For each node, convert from ISO3 to ISO2 country codes. Add a column to indicate the node type (Relay or Node); this is determined from the number of incoming connections, since only relays have incoming connections (a sketch of this classification follows Table 2 below).

    %scala
    import org.apache.spark.sql.functions.{col, udf, lit}

    val IPACCESS_KEY = dbutils.secrets.get(scope = "anindita_scope", key = "IPACCESS_KEY")
    val geo_map = (ip: String) => { 
        val url = "http://api.ipstack.com/"+ip+"?access_key="+IPACCESS_KEY+"&format=1"
        val result = scala.io.Source.fromURL(url).mkString
        result
    }
    val geoMapUDF = udf(geo_map)
    
    val select_sql="""
    SELECT DISTINCT(A.ip), A.host from
    (select Peers.address as ip, Peers.Hostname as host  from algo.In
    UNION
    select Peers.Address as ip, Peers.Hostname as host from algo.Out)A
    """
    val geo_df=spark.sql(select_sql)
    val mapped_df=geo_df.withColumn("address", geoMapUDF(col("ip")))
The geo-augmented DataFrame for each node has the following schema:

    host: string
    ip: string
    continent_name: string
    country_name: string
    country_code: string
    region_name: string
    region_code: string
    city: string
    latitude: string
    longitude: string
    zip: string
    country_iso3: string
    is_relay: integer

Table 2: Geo Augmented Data for each node
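
The is_relay flag in Table 2 can be derived from the connection data itself, since only relays receive incoming connections. A PySpark sketch, assuming the geo-augmented rows are available as the algo.geo_map table used later for the Tableau join:

    from pyspark.sql.functions import lit

    # Any host that appears in algo.in has incoming connections and is therefore a relay
    relay_hosts_df = (spark.sql("SELECT DISTINCT Host AS host FROM algo.In")
                      .withColumn("is_relay", lit(1)))

    node_type_df = (spark.table("algo.geo_map")      # geo-augmented node data (see Table 2)
                    .join(relay_hosts_df, on="host", how="left")
                    .na.fill({"is_relay": 0}))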

Step 4a: Analyze node telemetry data using SQL and charting

Node Analysis using SQL
Figure 5: Node Analysis using SQL

(A) shows the percentages of nodes and relays for the top 10 countries. The data shows a higher concentration of nodes and relays in the US. Over time, as the Algorand network grows, it should become more equally distributed across the world, as is desirable in a decentralized blockchain.

(B) shows the load distribution (the top 10 countries with the highest average number of incoming connections per relay node). The incoming connection distribution looks fairly balanced, except for the regions around Ireland, Japan and Italy, which are under higher load.

(C) shows the heat map of nodes. The geographic distribution shows higher deployments in the Americas; as the Algorand network grows, the distribution of nodes should cover more of the world.

Step 4b: Analyze node telemetry data using Graph Libraries.

A GraphFrame is created from the node and edge information and provides high-level APIs for DataFrame-based graph manipulation. It is the successor to GraphX and encompasses all of its functionality with the added convenience of seamlessly leveraging the same underlying DataFrames.

Detect potential weak links in the network topology using the Connected Components algorithm. Connected component membership returns a graph with each vertex assigned a component ID.

    import org.graphframes._
    
    val vertices_df = spark.sql("SELECT distinct(name) as id  FROM algo.node")
    val edges_df = spark.sql("SELECT concat(origin, destination) as id, origin as src, destination as dst FROM algo.edge").distinct()
    val nodeGraph = GraphFrame(vertices_df, edges_df)
    
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoint")
    val cc = nodeGraph.connectedComponents.run


All nodes belong to the same component, indicating that there are currently no partitions or breaks. This is evident when we select distinct on the generated components and it returns just one value: in this case, component 0.

Analyze network load using the PageRank algorithm

    val rank = nodeGraph.pageRank.maxIter(10).run().vertices
    display(rank.orderBy(rank("pagerank").desc))

PageRank measures the importance of each relay based on the number of incoming connections. The resulting table identifies the most important relays in the network. GraphX comes with static and dynamic implementations of PageRank; here we use the static one with 10 iterations, while the dynamic PageRank runs until the ranks converge.
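
For reference, the same analysis can be expressed through the GraphFrames Python API, where the convergence-based variant is selected with a tolerance instead of a fixed iteration count. A sketch reusing the same algo.node and algo.edge tables:

    from graphframes import GraphFrame

    vertices_df = spark.sql("SELECT DISTINCT(name) AS id FROM algo.node")
    edges_df = spark.sql("SELECT origin AS src, destination AS dst FROM algo.edge").distinct()
    node_graph = GraphFrame(vertices_df, edges_df)

    # Run PageRank until the ranks converge to within the given tolerance
    rank = node_graph.pageRank(resetProbability=0.15, tol=0.01).vertices
    display(rank.orderBy(rank["pagerank"].desc()))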

Step 5: Visualization of the geographic distribution of nodes and relays

The entire network needs to be monitored for global reach and access to ensure the decentralized nature of the blockchain. The map is interactive and allows zooming into specific regions for more detail; in Tableau, which has rich geospatial integration, the search feature is handy for this.

Step 5a: Pyvis visualization of the star topology

The pyvis library is intended for quick generation of visual network graphs with minimal Python code, and is especially handy for interactive network visualizations. There are many connections between nodes and relays, and visualizing the data as a directed graph helps to understand the structure. Apart from source/destination information, there are additional properties such as the number of connections between two nodes and the node type.

The red vertices denote relays and the blue ones denote nodes. Yellow edges represent connections between relays, and blue edges represent connections originating from nodes. The number of connected edges determines the size of a vertex, which is why the relays (in red) are generally bigger: they typically have more connections than the nodes. This results in the star topology we see below.

Pyvis visualization of Algorand Blockchain Network of nodes (blue) and relays (red), edges between relays are yellow and those with nodes are blue

Figure 6: Pyvis visualization of Algorand Blockchain Network of nodes (blue) and relays (red), edges between relays are yellow and those with nodes are blue

From the diagram, you can see that most nodes connect to 4 relays.  Relays have more connections than nodes and act as a communication mesh to quickly distribute messages between the nodes.

    from pyvis.network import Network
    
    #Ex. of network creation
    algo_net = Network(height="1050px", width="100%", directed=True, bgcolor="#222222", font_color="white")
    
    #Ex. of adding a node (relay node)
    algo_net.add_node(src, label=src, size=w, title=src, color="red") 
    
    #Ex. of adding edge  (relay --> relay communication)
    algo_net.add_edge(src, dst, value=1, color="yellow") 

Step 5b: Geospatial  Visualization of Node and Edge data in Delta using Tableau

Geospatial data from the Delta tables can be natively overlaid in Tableau to visualize the network on a world map. One convenient way to visualize the data in Tableau is to use a spider map which is an origin-destination path map.

This requires the Delta table to be in a schema like this.

The schema requires two rows for each path - one using the origin as the location, and the other using the destination as the location. This is crucial to enable Tableau to draw the paths correctly.

    tableau_df = spark.sql('''
    SELECT A.*, B.latitude, B.longitude, B.is_relay FROM
    (SELECT host as location, concat(host, '_', Peers.Hostname) as path_ID  from algo.out
    UNION
    SELECT Peers.Hostname as location, concat(host, '_', Peers.Hostname) as path_ID  from algo.out
    UNION
    SELECT Peers.Hostname as location, concat(Peers.Hostname, '_', host ) as path_ID  from algo.in
    UNION
    SELECT host as location, concat(Peers.Hostname, '_', host ) as path_ID  from algo.in
    )A
    JOIN algo.geo_map B ON A.location = B.host
    ''')

Here, location refers to the node, and path_ID is the concatenation of the source and destination nodes, along with latitude, longitude, and relay information.

To connect this table to Tableau, follow the instructions located here: notes to connect Tableau to Delta table. Periodic refreshes of the data will be reflected in the Tableau dashboard.

Tableau recognizes the latitude and longitude coordinates to distribute the nodes on the world map.

Tableau geospatial display of Algorand nodes and relays across the US
Figure 7: Tableau geospatial display of Algorand nodes and relays across the US

The map view makes it easy to visualize the geographical distribution of nodes and relays.

Add the path qualifier to show the edges connecting the nodes and relays.

Tableau geospatial display of Algorand network paths
Figure 8: Tableau geospatial display of Algorand network paths

With the connections enabled, the highly connected nature of the Algorand network becomes clear. The orange connections are between relays and the blue ones connect nodes to relays.

Step 5c: Visualization of Node and Edge data in D3

D3 (Data-Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations that can be rendered in web browsers. It combines map and graph visualizations in a single display using a data-driven approach to DOM manipulation. In the previous section, we saw how an external tool like Tableau can access the data; D3 can be run from within a Databricks notebook as well as from a standalone external script. Manipulating and presenting geographic data can be tricky, but D3.js makes it simple with the following steps:

  • Draw a world map based on data stored in a JSON-compatible format.
  • Define the size of the map and the geographic projection to use.
  • Use the spherical Mercator projection to map the 3-dimensional spherical Earth onto a 2-dimensional surface (d3.geo.mercator).
  • Define an SVG element and append it to the DOM.
  • Load the map data as JSON (in the TopoJSON format).
  • Style the map via CSS.

The following diagram demonstrates map projections, TopoJSON, Voronoi diagrams, force-directed layouts, and edge bundling based on this example. Updated node and edge data in Delta can be used to periodically refresh an HTML dashboard similar to the one below.


Figure 9: Algorand relays (orange) and nodes (blue) in US reporting their connection links

In this post, we have shown how Databricks is a versatile platform for analyzing the operational data (node telemetry) from the Algorand blockchain. Spark’s distributed compute architecture along with Delta provides a scalable cloud infrastructure to perform exploratory analysis using SQL, Python, Scala & R to analyze structured, semi-structured, and unstructured data. Graph algorithms can be applied with very few lines of code to analyze node importance and component connectedness. Different visualization libraries and tools help inspect the network state. In the next post, we will show how block, transaction, and account information can be analyzed in real-time using Databricks. To get started, view the Delta Architecture Webinar.  To learn more about Algorand Blockchain, please visit Algorand.

DOWNLOAD THE NOTEBOOK

--

Try Databricks for free. Get started today.

The post Analyzing Algorand Blockchain Data with Databricks Delta appeared first on Databricks.

Integrating large-scale Genomic Variation and Annotation Data with Glow


Genomic annotations augment variant data by providing context for each change in the genome. For example, annotations help answer questions like does this mutation cause a change in the protein-coding sequence of a gene? If so, how does it change the protein? Or is the mutation in a low information part of the genome, also known as “junk DNA”? And everything in between.

Recently (since release 0.4), Glow, an open-source toolkit for large-scale genomic analysis, introduced the capability to ingest genomic annotation data from the GFF3 (Generic Feature Format Version 3) flat-file format. GFF3 was proposed by the Sequence Ontology Project in 2013 and has become the de-facto format for genome annotation. This format is widely used by genome browsers and databases such as NCBI RefSeq and GenBank. GFF3 is a 9-column tab-separated text format that typically carries the majority of the annotation data in the ninth column. This column is called attributes and stores the annotations as a semi-colon separated list of <tag>=<value> entries. As a result, although GFF3 files can be read as Apache Spark™ DataFrames using Spark’s standard csv data source, the resulting DataFrame is unwieldy for query and data manipulation of annotation data, because the whole list of attribute tag-value pairs for each sequence will appear as a single semicolon-separated string in the attributes column of the DataFrame.

Glow’s new and flexible gff Spark data source addresses this challenge. While reading the GFF3  file, the gff data source parses the attributes column of the file to create an appropriately typed column for each tag. In each row, this column will contain the value corresponding to that tag in that row (or null if the tag does not appear in the row). Consequently, all tags in the GFF3 attributes column will have their own corresponding column in the Spark DataFrame, making annotation data query and manipulation much easier.

Ingesting GFF3 Annotation Data

Like any other Spark data source, reading GFF3 files using Glow’s gff data source can be done in a single line of code. Below, we ingest the annotations of the Homo sapiens genome assembly GRCh38.p13 from a GFF3 file (obtained from RefSeq); Figure 1 shows a small section of the result. Here, we have also filtered the annotations to chromosome 22 in order to use the resulting annotations_df DataFrame in the continuation of our example, and we have added the annotations_df alias for the same purpose.

import glow
glow.register(spark)

gff_path = '/databricks-datasets/genomics/gffs/GCF_000001405.39_GRCh38.p13_genomic.gff.bgz'

annotations_df = spark.read.format('gff').load(gff_path) \
        .filter("seqid = 'NC_000022.11'") \
        .alias('annotations_df')

Sample Spark DataFrame demonstrating Glow’s ability to parse attributes while bringing in genomic annotation data.

Figure 1: A small section of the annotations_df DataFrame

In addition to reading uncompressed .gff files, the gff data source supports all compression formats supported by Spark’s csv data source, including .gz and .bgz. It is strongly recommended to use splittable compression formats like .bgz instead of .gz to enable parallelization of the read process.

Schema

Let us have a closer look at the schema of the resulting DataFrame, which was automatically inferred by Glow’s gff data source:

annotations_df.printSchema()

    root
    |-- seqId: string (nullable = true)
    |-- source: string (nullable = true)
    |-- type: string (nullable = true)
    |-- start: long (nullable = true)
    |-- end: long (nullable = true)
    |-- score: double (nullable = true)
    |-- strand: string (nullable = true)
    |-- phase: integer (nullable = true)
    |-- ID: string (nullable = true)
    |-- Name: string (nullable = true)
    |-- Parent: array (nullable = true)
    |    |-- element: string (containsNull = true)
    |-- Target: string (nullable = true)
    |-- Gap: string (nullable = true)
    |-- Note: array (nullable = true)
    |    |-- element: string (containsNull = true)
    |-- Dbxref: array (nullable = true)
    |    |-- element: string (containsNull = true)
    |-- Is_circular: boolean (nullable = true)
    |-- align_id: string (nullable = true)
    |-- allele: string (nullable = true)
    .
    .
    .
    |-- transl_table: string (nullable = true)
    |-- weighted_identity: string (nullable = true)

This schema has 100 fields (not all shown here). The first eight fields (seqId, source, type, start, end, score, strand, and phase), here referred to as the “base” fields, correspond to the first eight columns of the GFF3 format cast in the proper data types. The rest of the fields in the inferred schema are the result of parsing the attributes column of the GFF3 file. Fields corresponding to any “official” tag (those referred to as “tags with a pre-defined meaning” in the GFF3 format description), if present in the GFF3 file, are automatically assigned the appropriate data types. The official fields are then followed by the “unofficial” fields (fields corresponding to any other tag) in alphabetical order. In the example above, ID, Name, Parent, Target, Gap, Note, Dbxref, and Is_circular are the official fields, and the rest are the unofficial fields. The gff data source discards the comments, directives, and FASTA lines that may be in the GFF3 file.

As it is not uncommon for the official tags to be spelled differently in terms of letter case and underscore usage within and/or across different GFF3 files, the gff data source is designed to be insensitive to letter case and underscores when extracting official tags from the attributes field. For example, the official tag Dbxref will be correctly extracted as an official field even if it appears as dbxref or dbx_ref in the GFF3 file. Please see the Glow documentation for more details.

Like other Spark data sources, Glow’s gff data source is also able to accept a user-specified schema through the .schema command. The data source behavior in this case is also designed to be quite flexible. More specifically, the fields (and their types) in the user-specified schema are treated as the list of fields, whether base, official, or unofficial, to be extracted from the GFF3 file (and cast to the specified types). Please see the Glow documentation for more details on how user-specified schemas can be used.
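
As a minimal sketch of that option, the following (illustrative) subset schema restricts the read to a handful of base and official fields; the field list is an assumption for demonstration, not a required schema:

from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType

subset_schema = StructType([
    StructField('seqId', StringType()),
    StructField('type', StringType()),
    StructField('start', LongType()),
    StructField('end', LongType()),
    StructField('ID', StringType()),
    StructField('Parent', ArrayType(StringType()))
])

subset_df = spark.read.format('gff').schema(subset_schema).load(gff_path)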

Example: Gene Transcripts and Transcript Exons

With the annotation tags extracted as individual DataFrame columns using Glow’s gff data source, query and data preparation over genetic annotations becomes as easy as writing common Spark commands in the user’s API of choice. As an example, here we demonstrate how simple queries can be used to extract data regarding hierarchical grouping of genomic features from the annotations_df created earlier.

One of the main advantages of the GFF3 format compared to older versions of GFF is the improved presentation of feature hierarchies (see the GFF3 format description for more details). Two examples of such hierarchies are:

  • Transcripts of a gene (here, gene is the “parent” feature and its transcripts are the “children” features).
  • Exons of a transcript (here, the transcript is the parent and its exons are the children).

In the GFF3 format, the parents of the feature in each row are identified by the value of the parent tag in the attributes column, which includes the ID(s) of the parent(s) of the row. Glow’s gff data source extracts this information as an array of parent ID(s) in a column of the resulting DataFrame called parent.

Assume we would like to create a DataFrame, called gene_transcript_df, which, for each gene on chromosome 22, provides some basic information about the gene and all its transcripts. As each row in the annotations_df of our example has at most a single parent, the parent_child_df DataFrame created by the following query will help us achieve our goal. This query joins annotations_df with a subset of its own columns, using the parent column as the key. Figure 2 shows a small section of the parent_child_df DataFrame.

from pyspark.sql.functions import *

    parent_child_df = annotations_df \
    .join(
        annotations_df.select('id', 'type', 'name', 'start', 'end').alias('parent_df'),
        col('annotations_df.parent')[0] == col('parent_df.id') # each row in annotation_df has at most one parent
    ) \
    .orderBy('annotations_df.start', 'annotations_df.end') \
    .select(
        'annotations_df.seqid',
        'annotations_df.type',
        'annotations_df.start',
        'annotations_df.end',
        'annotations_df.id',
        'annotations_df.name',
        col('annotations_df.parent')[0].alias('parent_id'),
        col('parent_df.Name').alias('parent_name'),
        col('parent_df.type').alias('parent_type'),
        col('parent_df.start').alias('parent_start'),
        col('parent_df.end').alias('parent_end')
    ) \
    .alias('parent_child_df')
Sample Spark DataFrame demonstrating Glow’s ability to parse hierarchies while bringing in genomic annotation data.

Figure 2: A small section of the parent_child_df DataFrame

Having the parent_child_df DataFrame, we can now write the following simple function, called parent_child_summary, which, given this DataFrame, the parent type, and the child type, generates a DataFrame containing basic information on each parent of the given type and all its children of the given type.

from pyspark.sql.dataframe import *

    def parent_child_summary(parent_child_df: DataFrame, parent_type: str, child_type: str) -> DataFrame:
        return parent_child_df \
        .select(
            'seqid',
            col('parent_id').alias(f'{parent_type}_id'),
            col('parent_name').alias(f'{parent_type}_name'),
            col('parent_start').alias(f'{parent_type}_start'),
            col('parent_end').alias(f'{parent_type}_end'),
            col('id').alias(f'{child_type}_id'),
            col('start').alias(f'{child_type}_start'),
            col('end').alias(f'{child_type}_end'),
        ) \
        .where(f"type == '{child_type}' and parent_type == '{parent_type}'") \
        .groupBy(
            'seqid',
            f'{parent_type}_id',
            f'{parent_type}_name',
            f'{parent_type}_start',
            f'{parent_type}_end'
        ) \
        .agg(
            collect_list(
            struct(
                f'{child_type}_id',
                f'{child_type}_start',
                f'{child_type}_end'
            )
            ).alias(f'{child_type}s')
        ) \
        .orderBy(
            f'{parent_type}_start',
            f'{parent_type}_end'
        ) \
        .alias(f'{parent_type}_{child_type}_df')

Now we can generate our intended gene_transcript_df DataFrame, shown in Figure 3, with a single call to this function:

gene_transcript_df = parent_child_summary(parent_child_df, 'gene', 'transcript')

Sample Spark DataFrame demonstrating Glow’s ability to summarize parent and child genomic annotation data with a single call.

Figure 3: A small section of the gene_transcript_df DataFrame.

In each row of this DataFrame, the transcripts column contains the ID, start and end of all transcripts of the gene in that row as an array of structs.

The same function can now be used to generate any parent-child feature summary. For example, we can generate the information of all exons of each transcript on chromosome 22 with another call to the parent_child_summary function as shown below. Figure 4 shows the generated transcript_exon_df DataFrame.

transcript_exon_df = parent_child_summary(parent_child_df, 'transcript', 'exon')

Sample Spark DataFrame demonstrating Glows ability to summarize a specific parent-child feature of any exon.

Figure 4: A small section of the transcript_exon_df DataFrame

Example Continued: Integration with Variant Data

Glow has data sources to ingest variant data from common flat file formats such as VCF, BGEN, and PLINK. Combining Glow’s variant data sources with the new gff data source, users can seamlessly annotate their variant DataFrames by joining them with annotation DataFrames.

As an example, let us load the chromosome 22 variants of the 1000 Genome Project (on GRCh38 genome assembly) from a VCF file (obtained from the project’s ftp site). Figure 5 shows the resulting variants_df.

    vcf_path = "/databricks-datasets/genomics/1kg-vcfs/ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz"

    variants_df = spark.read \
      .format("vcf") \
      .load(vcf_path) \
      .alias('variants_df')

Figure 5: A small section of the variants_df DataFrame

Figure 6 shows a DataFrame which, for each variant on a gene on chromosome 22, provides the information of the variant as well as the exon, transcript, and gene on which the variant resides. This is computed using the following query with two joins. Note that the first two exploded DataFrames can also be constructed directly from parent_child_df. Here, since we had already defined gene_transcript_df and transcript_exon_df, we generated these exploded DataFrames simply by applying the explode function followed by Glow’s expand_struct function on them.
Sample Spark DataFrame demonstrating Glow’s ability to seamlessly annotate variant DataFrames by joining them with annotation DataFrames.

    from glow.functions import *

    gene_transcript_exploded_df = gene_transcript_df \
      .withColumn('transcripts', explode('transcripts')) \
      .withColumn('transcripts', expand_struct('transcripts')) \
      .alias('gene_transcript_exploded_df')

    transcript_exon_exploded_df = transcript_exon_df \
      .withColumn('exons', explode('exons')) \
      .withColumn('exons', expand_struct('exons')) \
      .alias('transcript_exon_exploded_df')

    variant_exon_transcript_gene_df = variants_df \
    .join(
      transcript_exon_exploded_df,
      (variants_df.start < transcript_exon_exploded_df.exon_end) &
      (transcript_exon_exploded_df.exon_start < variants_df.end)
    ) \
    .join(
      gene_transcript_exploded_df,
      transcript_exon_exploded_df.transcript_id == gene_transcript_exploded_df.transcript_id
    ) \
    .select(
      col('variants_df.contigName').alias('variant_contig'),
      col('variants_df.start').alias('variant_start'),
      col('variants_df.end').alias('variant_end'),
      col('variants_df.referenceAllele'),
      col('variants_df.alternateAlleles'),
      'transcript_exon_exploded_df.exon_id',
      'transcript_exon_exploded_df.exon_start',
      'transcript_exon_exploded_df.exon_end',
      'transcript_exon_exploded_df.transcript_id',
      'transcript_exon_exploded_df.transcript_name',
      'transcript_exon_exploded_df.transcript_start',
      'transcript_exon_exploded_df.transcript_end',
      'gene_transcript_exploded_df.gene_id',
      'gene_transcript_exploded_df.gene_name',
      'gene_transcript_exploded_df.gene_start',
      'gene_transcript_exploded_df.gene_end'
    ) \
    .orderBy(
      'variant_contig',
      'variant_start',
      'variant_end'
    )

Sample Spark DataFrame demonstrating Glow’s ability to generate exploded DataFrames simply by applying the explode function followed by Glow's expand_struct function on them.

Figure 6: A small section of the variant_exon_transcript_gene_df DataFrame

Try Glow!

Glow is installed in the Databricks Genomics Runtime (Azure | AWS) and is optimized for improved performance when using cloud computing to analyze large genomics datasets. Learn more about our genomics solutions, how we’re helping to further human, agricultural, and other genomic research, and how the Databricks Unified Analytics Platform for Genomics enables advances like population-scale next-generation sequencing, and try out a preview today.

TRY THE NOTEBOOK!

--

Try Databricks for free. Get started today.

The post Integrating large-scale Genomic Variation and Annotation Data with Glow appeared first on Databricks.


Using MLOps with MLflow and Azure


The blog contains code examples in Azure Databricks, Azure DevOps and plain Python. Please note that much of the code depends on being inside an Azure environment and will not work in the Databricks Community Edition or in AWS-based Databricks.

Most organizations today have a defined process to promote code (e.g. Java or Python) from development to QA/Test and production.  Many are using Continuous Integration and/or Continuous Delivery (CI/CD) processes and oftentimes are using tools such as Azure DevOps or Jenkins to help with that process. Databricks has provided many resources to detail how the Databricks Unified Analytics Platform can be integrated with these tools (see Azure DevOps Integration, Jenkins Integration). In addition, there is a Databricks Labs project – CI/CD Templates – as well as a related blog post that provides automated templates for GitHub Actions and Azure DevOps, which makes the integration much easier and faster.

When it comes to machine learning, though, most organizations do not have the same kind of disciplined process in place. There are a number of different reasons for that:

  • The Data Science team does not follow the same Software Development Lifecycle (SDLC) process as regular developers. Key differences in the Machine Learning Lifecycle (MLLC) are related to goals, quality, tools and outcomes (see diagram below).
  • Machine Learning is still a young discipline and it is often not well integrated organizationally.
  • The Data Science and deployment teams do not treat the resulting models as separate artifacts that need to be managed properly.
  • Data Scientists are using a multitude of tools and environments which are not integrated well and don’t easily plug into the above mentioned CI/CD Tools.

To address these and other issues, Databricks is spearheading MLflow, an open-source platform for the machine learning lifecycle.  While MLflow has many different components, we will focus on the MLflow Model Registry in this Blog.

The MLflow Model Registry component is a centralized model store, set of APIs, and a UI, to collaboratively manage the full lifecycle of a machine learning model. It provides model lineage (which MLflow experiment and run produced the model), model versioning, stage transitions (for example from staging to production), and annotations.

The Azure Databricks Unified Data and Analytics platform includes managed MLflow and makes it very easy to leverage advanced MLflow capabilities such as the MLflow Model Registry.  Moreover, Azure Databricks is tightly integrated with other Azure services, such as Azure DevOps and Azure ML.

Azure DevOps is a cloud-based CI/CD environment integrated with many Azure Services.  Azure ML is a Machine Learning platform which in this example will serve the resulting model. This blog provides an end-to-end example of how all these pieces can be connected effectively.

An end-to-end model governance process

To illustrate why an MLOps pipeline is useful, let’s consider the following business scenario: Billy is a Data Scientist working at Wine Inc. Wine Inc. is a global wholesaler of wines that prides itself on being able to find and promote high-quality wines that are a lot less expensive than wines of comparable quality. The key success factor of Wine Inc. is a machine learning model that can predict the quality of a wine (for example purposes, we are using a public wine qualities dataset published by Cortez et al.). Key features of the dataset include chemical ones such as fixed acidity, citric acid, residual sugar, chlorides, density, pH and alcohol. It also includes a sensory-based quality score between 0 and 10. Billy is constantly rolling out improvements to the model to make it as accurate as possible. The main consumers of the model are the field wine testers, who test wines across the globe and can quickly analyze the key features of a wine. The testers promptly enter the wine features into a mobile app, which immediately returns a predictive quality score. If the score is high enough, the testers can acquire the wine for wholesale distribution on the spot.

Billy has started to use the MLflow Model Registry to store and manage the different versions of his wine quality model. The MLflow Model Registry builds on MLflow’s existing capabilities to provide organizations with one central place to share ML models, collaborate on moving them from experimentation to testing and production, and implement approval and governance workflows.

The registry is a huge help in managing the different versions of the models and their lifecycle.

Once a machine learning model is properly trained and tested, it needs to be put into production. This is also known as the model serving or scoring environment. There are multiple types of architectures for ML model serving. The right type of ML production architecture is dependent on the answer to two key questions:

  1. Frequency of data refresh: How often will data be provided, e.g. once a day, a few times a day, continuously, or ad hoc?
  2. Inference request response time: How quickly do we need a response to inference requests to this model, e.g. within hours, minutes, seconds, or sub-second/milliseconds?

If the frequency is a few times a day and the required inference response time is minutes to hours, a batch scoring model is ideal. If the data is provided continuously, a streaming architecture should be considered, especially if answers are needed quickly. If the data is provided ad hoc and the answer is needed within seconds or milliseconds, a REST API-based scoring model is ideal.

In the case of Wine Inc., we assume the latter is the case, i.e. the field testers request results ad hoc and expect an immediate response. There are multiple options for REST-based model serving, e.g. using Databricks REST model serving or a simple Python-based model server, which is supported by MLflow. Another popular option for model serving inside the Azure ecosystem is Azure ML. Azure ML provides a container-based backend that allows for the deployment of REST-based model scoring. MLflow directly supports Azure ML as a serving endpoint. The remainder of this blog will focus on how to best utilize this built-in MLflow functionality.

The diagram above illustrates which end-to-end steps are required. I will use the diagram as the guide to walk through the different steps of the pipeline. Please note that this pipeline is still somewhat simplified for demo purposes.

The main steps are:

  • Steps 1, 2 and 3: Train the model and deploy it in the Model Registry
  • Steps 4 through 9: Setup the pipeline and run the ML deployment into QA
  • Steps 10 through 13: Promote ML model to production

Steps 1, 2 and 3: Train the model and deploy it in the Model Registry

Please look at the following Notebook for guidance:

Azure Databricks model training and deployment via MLflow

Billy continuously develops his wine model using the Azure Databricks Unified Data and Analytics Platform. He uses Databricks managed MLflow to train his models and runs many model variations, using MLflow’s tracking server to find the best model possible. Once Billy has found a better model, he stores the resulting model in the MLflow Model Registry, using the Python code below.

import time

result = mlflow.register_model(
    model_uri,
    model_name
)
time.sleep(10)
version = result.version

(The sleep step is needed to make sure that the registry has enough time to register the model.)

Once Billy has identified his best model, he registers it in the Model Registry as a "staging" model.

client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name=model_name,
    version=version,
    stage="staging")    

Azure DevOps provides a way to automate the end-to-end process of promoting, testing and deploying the model in the Azure ecosystem. It requires the creation of an Azure DevOps pipeline. The remainder of this blog will dive into how best to define the Azure DevOps pipeline and integrate it with Azure Databricks and Azure.

Once Billy defines the Azure DevOps pipeline, he can then trigger the pipeline programmatically, which will test and promote the model into the production environment used by the mobile app.

The Azure Pipeline is the core component of Azure DevOps. It contains all the necessary steps to access and run code that will allow the testing, promotion and deployment of a ML pipeline. More info on Azure pipelines can be found here.

Another core component of Azure DevOps is the repo. The repo contains all the code that is relevant for a build and deploy pipeline.  The repo stores all the artifacts that are required, including:

  1. Python notebooks and/or source code
  2. Python scripts that interact with Databricks and MLflow
  3. Pipeline source files (YAML)
  4. Documentation/read me etc.

The image below shows the DevOps project and repo for the Wine Inc. pipeline:

Sample DevOps project and repository for pipeline.

The DevOps pipeline is defined in YAML. This is an example YAML file for the pipeline in this blog post:

Sample YAML file defining the DevOps pipeline.

Line 3: Trigger: Oftentimes, pipelines will be triggered automatically by code changes. Since promoting a model in the Model Registry is not a code change, the Azure DevOps REST API can be used to trigger the pipeline programmatically. The pipeline can also be triggered manually via the UI.

This is the code in the training notebook that uses the DevOps REST API to trigger the pipeline:

from azure.devops.connection import Connection
from msrest.authentication import BasicAuthentication
from azure.devops.v6_0.pipelines.models import RunPipelineParameters,Variable

# Fill in with your personal access token and org URL
personal_access_token = dbutils.secrets.get('ml-gov','ado-token')
organization_url = 'https://dev.azure.com/ML-Governance'
# Create a connection to the org
credentials = BasicAuthentication('', personal_access_token)
connection = Connection(base_url=organization_url, creds=credentials)

# Get a client (the "core" client provides access to projects, teams, etc)
pipeline_client=connection.clients_v6_0.get_pipelines_client()

#Set the variables for the pipeline
variable=Variable(value=model_name)
variables={'model_name':variable}
run_parameters=RunPipelineParameters(variables=variables)
print(run_parameters)

# Run the pipeline (id 6) in the ML Governance V2 project
runPipeline = pipeline_client.run_pipeline(run_parameters=run_parameters,project='ML Governance V2',pipeline_id=6)

Steps 4 through 9: Setup the pipeline and run the ML deployment into QA

The Azure pipeline is a YAML file. It will first set up the environment (all Python-based) and then deploy the model into an Azure QA environment where it can be tested.

Lines 15 to 19: Prerequisites: the pipeline installs a set of libraries that it needs to run the scripts. We are using Python to run the scripts; there are a variety of options for running Python code against Azure Databricks, and we will use a few of them in this blog.

Using the Databricks Command Line Interface: The Databricks CLI provides a simple way to interact with the REST API. It can create and run jobs, upload code, etc. The CLI is most useful when no complex interactions are required. In this example, the CLI is used to upload the deployment code for Azure ML into an isolated part of the Azure Databricks workspace where it can be executed. The execution itself is a little more complicated, so it is done using the REST API in a Python script further below.

Lines 32 to 37: This step executes the Python script executenotebook.py. It takes a number of values as parameters, e.g. the Databricks host name. It also allows parameters to be passed into the notebook, such as the name of the model that should be deployed and tested.

The code is stored inside the Azure DevOps repository along with the Databricks notebooks and the pipeline itself. Therefore it is always possible to reproduce the exact configuration that was used when executing the pipeline.

- task: PythonScript@0
  inputs:
    scriptSource: 'filePath'
    scriptPath: '$(Build.Repository.LocalPath)/cicd-scripts/executenotebook.py'
    arguments: '--shard $(DATABRICKS_HOST) --token $(DATABRICKS_TOKEN) --cluster $(EXISTING_CLUSTER_ID) --localpath $(Build.Repository.LocalPath)/notebooks/Users/oliver.koernig@databricks.com/ML/deploy --workspacepath /Demo/Test --outfilepath /home/vsts/work/1/s/notebooks/Users/oliver.koernig@databricks.com --params model_name=$(model_name)'
  displayName: 'Deploy MLflow Model from Registry to Azure ML for Testing'

This notebook, “deploy_azure_ml_model”, performs one of the key tasks in the scenario, namely deploying an MLflow model into an Azure ML environment using the built-in MLflow deployment capabilities. The notebook is parameterized, so it can be reused for different models, stages, etc.

Sample Azure Databricks with code snippet deploying the model in Azure ML using the MLflow libraries.

The following code snippet from the Notebook is the key piece that deploys the model in Azure ML using the MLflow libraries:

import mlflow.azureml

model_image, azure_model = mlflow.azureml.build_image(model_uri=model_uri, 
                                                        workspace=workspace, 
                                                        model_name=model_name+"-"+stage,
                                                        image_name=model_name+"-"+phase+"-image",
                                                        description=model_name, 
                                                        tags={
                                                        "alpha": str(latest_sk_model.alpha),
                                                        "l1_ratio": str(latest_sk_model.l1_ratio),
                                                        },
                                                        synchronous=True)

This will create a container image in the Azure ML workspace. The following is the resulting view within the Azure ML workspace:

Sample container image in the ML Workspace created via code in an Azure Databricks notebook.

The next step is to create a deployment that will provide a REST API:

Sample deployment with the requisite REST API created via code in an Azure Databricks notebook.

The execution of this notebook takes around 10 to 12 minutes. The executenotebook.py script provides the code that allows the Azure DevOps environment to wait until the Azure ML deployment task has completed. It checks every 10 seconds whether the job is still running and goes back to sleep if it is. When the model is successfully deployed on Azure ML, the notebook returns the URL for the resulting model REST API. This REST API will be used further down to test whether the model is properly scoring values.
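
The waiting logic itself is a simple polling loop against the Databricks Jobs REST API. A sketch of the idea, assuming host, token and run_id were captured when the notebook run was submitted:

import time
import requests

# Poll the run status every 10 seconds until the notebook run reaches a terminal state
while True:
    run = requests.get(
        f"{host}/api/2.0/jobs/runs/get",
        headers={"Authorization": f"Bearer {token}"},
        params={"run_id": run_id}
    ).json()
    if run["state"]["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        break
    time.sleep(10)

# The notebook's exit value (e.g. the scoring URI) can then be read via jobs/runs/get-output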

When the pipeline is running, users can monitor the progress. The screenshot shows the API calls and the 10-second wait between calls.

Sample deployment of ML model executed AzureML from an Azure Databricks notebook.

The next step executes the test notebook. Like the previous step, it triggers the executenotebook.py code and passes the name of the test notebook (“test_api”) as well as the REST API URI from the previous step.

- task: PythonScript@0
  inputs:
    scriptSource: 'filePath'
    scriptPath: '$(Build.Repository.LocalPath)/cicd-scripts/executenotebook.py'
    arguments: '--shard $(DATABRICKS_HOST) --token $(DATABRICKS_TOKEN) --cluster $(EXISTING_CLUSTER_ID) --localpath $(Build.Repository.LocalPath)/notebooks/Users/oliver.koernig@databricks.com/ML/test --workspacepath /Demo/Test --outfilepath /home/vsts/work/1/s/notebooks/Users/oliver.koernig@databricks.com --params model_name=$(model_name),scoring_uri=$(response)'
  displayName: 'Test MLflow Model from Registry against REST API'

The testing code can be as simple or as complicated as necessary. The “test_api” notebook simply takes a record from the initial training data and submits it via the model REST API in Azure ML. If it returns a meaningful value, the test is considered a success. A better test would define a set of expected results via the API and use a much larger set of records.
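As an illustration of what such a test could look like, here is a minimal sketch (not the actual “test_api” notebook; the feature names are hypothetical, and the JSON payload assumes the pandas split-orientation convention commonly used by MLflow scoring servers):

import json
import requests

# One hypothetical record taken from the training data
sample = {"columns": ["fixed acidity", "volatile acidity"],
          "data": [[7.4, 0.7]]}

response = requests.post(scoring_uri,
                         data=json.dumps(sample),
                         headers={"Content-Type": "application/json"})

# The test passes if the endpoint answers with a meaningful value
assert response.status_code == 200 and response.json() is not None
print(response.json())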

Steps 10 through 13: Promote ML model to production

Given a successful test, two things need to subsequently happen:

  1. The model in the MLflow Model Registry should be promoted to “Production”, which will tell Billy and other Data Scientists which model is the latest production model in use.
  2. The model needs to be put into production within Azure ML itself.

The next step takes care of the first item. It uses the managed MLflow REST API on Azure Databricks. Using the API, the model can be promoted (using the mlflow.py script within Azure DevOps) without executing any code on Azure Databricks itself. It only takes a few seconds.

The following script promotes the latest model with the given name out of staging into production:

# This script promotes the latest model with the given name out of staging into production
import json, os
import pprint

from mlflow_http_client import MlflowHttpClient, get_host, get_token

client = MlflowHttpClient(host=get_host(), token=get_token())
pp = pprint.PrettyPrinter(indent=4)

model_name = os.environ.get('MODEL_NAME')
print("Model Name is: " + model_name)

rsp = client.get("registered-models/get-latest-versions?name=" + model_name + "&stages=staging")
if len(rsp) >= 1:
    version = rsp['model_versions'][0]['version']
else:
    raise Exception('There is no staging model for the model named: ' + model_name)

data = {"name": model_name, "version": version, "stage": "production", "archive_existing_versions": False}
rsp = client.post("model-versions/transition-stage", data)
pp.pprint(rsp)

response = rsp['model_version']['version']
print("Return value is: " + response)
# Hand the promoted version back to Azure DevOps as the pipeline variable "response"
print('##vso[task.setvariable variable=response;]%s' % (response))

We can verify in the Azure Databricks model registry UI that this has indeed happened: there is a new production-level model (version 4). As a side note, while it is possible to have multiple models in production, we don’t consider that good practice, so all other production versions should be archived (MLflow can automate this by setting archive_existing_versions=true).

Sample ML model deployment to production executed in AzureML from an Azure Databricks notebook.

The next step is simply a repeat of steps 4 through 11.  We will re-deploy the model in Azure ML and indicate that this is the production environment. Please note that Azure DevOps has a separate set of deploy pipelines which we are not utilizing in this blog in order to keep things a little simpler.

Discussion

This blog post has demonstrated how the machine learning lifecycle can be automated using Azure Databricks, Azure DevOps and Azure ML. It demonstrated the different ways Databricks can integrate with different Azure services using the Databricks REST API, notebooks and the Databricks CLI. See below for links to the three notebooks referenced in this blog.

For questions or comments, please contact oliver.koernig@databricks.com.

--

Try Databricks for free. Get started today.

The post Using MLOps with MLflow and Azure appeared first on Databricks.

Azure Databricks in Public Preview for the Azure China Region

We are excited to announce that Azure Databricks is now available in public preview for Microsoft’s Azure China region, enabling new data and AI use cases with fast, reliable and scalable data processing, analytics, data science and machine learning on the cloud. With availability across more than 30 Azure regions, global organizations appreciate the consistency, ease of use and collaboration enabled by Azure Databricks.

Helping customers and partners scale with global availability

Organizations need a consistent set of cloud services across their global operations, and customers looking to migrate on-premises big data workloads from a data center to the cloud frequently need a local Microsoft Azure region to meet data residency and data sovereignty requirements. Azure Databricks helps customers deploy and scale batch and streaming data processing, simplify analytics and data science, and implement machine learning in a way that is consistent and collaborative.

Now, with Azure Databricks availability in Azure China, organizations that operate in China can leverage the same scalable service and collaborative workbooks in the region as well. From retail recommendations and financial services risk analysis to improved diagnostics in healthcare and life sciences, Azure Databricks enables data teams of all sizes and across industries to innovate more quickly and more collaboratively on the cloud.

“Azure Databricks enables our data engineering and data science teams to deliver results even faster than ever. Scalable data processing and machine learning enable our teams to quickly adapt to shifts in consumer demand and buying behavior,” says Kevin Zeng, Greater China IT CTO at Procter & Gamble. “The availability of Azure Databricks in China enables our global teams to provide a consistent experience for our customers in the region.”

Learn more about the Azure China region and Azure Databricks

You can learn more about Azure Databricks availability in the Azure China region by visiting the Azure Products by Region page. Learn more about what you need to consider before moving your workloads to the Azure China region with Microsoft’s Azure China checklist and for questions please reach out to the team through Azure Support.

Get started with Azure Databricks by attending a live event and this free, 3-part training series.

--

Try Databricks for free. Get started today.

The post Azure Databricks in Public Preview for the Azure China Region appeared first on Databricks.

Building the Latinx Network and Celebrating Our Heritage


Virtual performance from Mariachi Sol Mixteco during the Cafe con Leche event 

This past spring, we launched our Latinx Network Employee Resource Group with a lot of excitement. We co-founded this group to foster a comunidad (community) that supports, learns and grows together. We love the diversity of our culture and want to embrace it and share it with our colleagues around the world. The mission of the Latinx Network is to promote inclusion, career growth, leadership, mentorship and networking within Databricks and our local communities.

Over the last few months, our team came together to plan our first annual Hispanic Heritage Month at Databricks. This was a time for us to reflect on where we came from, where we are today and what we can do to create a better future for our community. To celebrate, we hosted a few events, one of which was called Café con Leche, a virtual event that created space for our colleagues to share some of their favorite recipes, poetry and family stories. This was a great way for us to highlight our talented Latinx colleagues and allies across the globe while we connected over our individual and collective history.

To reflect back on this month, we connected with fellow Databricks employees to learn more about their perspectives on celebrating our heritage with their families and the importance of building community.

Cynthia Garcia — Sales Development Representative

Members of the Databricks Latinx Network with family
Cynthia (second to the left) with her family

Q: Why do you think it is important to have a Latinx Network at Databricks? 
I believe that it’s important to have a Latinx Network at Databricks because it creates a sense of community and belonging. Since going remote, this has been a great way to network with my colleagues who are outside my immediate team and learn from them, bounce ideas off them, and make more connections.

Q: What are you most proud of as it relates to your Latin heritage?
There’s a lot of things to be proud of when it comes to Latin heritage (food, music, family, etc.), but one thing that I am most proud of is how we always put family first. We spend all our special moments together laughing, crying and enjoying each other’s company. Being Hispanic is more than just the parties and food, it’s about our history, our struggles and our hard work. My parents sacrificed a lot to give my siblings and me a better opportunity and that is something I will always cherish.

Miguel Peralvo —  Solutions Architect 

Members of the Databricks Latinx Network with family
Miguel (on the left) with his wife

Q: Why do you think it is important to have a Latinx Network at Databricks?  
My wife is Puerto Rican and I’m Spanish. So having a Latinx Network at Databricks makes me feel a stronger sense of community. An open community, because at Databricks our Employee Resource Groups welcome and educate all people, regardless of their background. Sientes que perteneces. (You feel like you belong.)

Q: What advice do you have for other people from the Latinx community who want to pursue a career in tech?
At the beginning of your career, try to work in places that allow you to fail quickly and embrace who you are, including your Latinx background. If there are things you don’t like, try to change them. If you can’t, that’s a signal. You should feel like you are in charge of your career and situation. You should feel like what you are doing and how you are treated will help you blossom, eventually. Intenta cambiar la habitación, y si no puedes, cambia de habitación. (Try to change the room, and if you can’t, change to another room.)

Diane Romualdez — Sr. Event Marketing Manager

Members of the Databricks Latinx Network with family
Diane (on the far left) with her sisters 

Q: How do you and your family like to celebrate this month? 
Hispanic Heritage Month is a great time to reflect on the paths paved by our ancestors through song, dance, food and more. It is a  time to revisit recipes passed down from generation to generation and come together over a reminiscent meal that is meaningful for our family and culture.

Q: How have you and your family been involved in the events that our Latinx Network ERG group has hosted for Hispanic Heritage Month?
I have had the pleasure to support the Latinx Network ERG with pre-planning and working on co-hosted events with other ERG groups such as the Intro to Stocks event with our Black Employee Network. There is a lot to look forward to beyond this month, and it is special that we are able to celebrate it with all Databricks employees.

Daniel Alvarez — Recruiting Coordinator

Members of the Databricks Latinx Network with family
Daniel (on the left) with his dad

Q: How do you and your family like to celebrate this month?
In addition to enjoying the most delicious Salvadorian foods, we also celebrate this month by taking some time to look back at how far we have truly come as a family. My mom immigrated from El Salvador, and my dad is from Mexico. They both sacrificed an opportunity for a great life in order to give me and my brother an even better one. Now, I am a recruiting coordinator and my brother is a software engineer, and they couldn’t be prouder! We are truly grateful to have our careers in the tech industry and to never forget our family’s sacrifice that allowed us to reach this point.

Q: What are you most proud of as it relates to your Latin heritage?
I am most proud of the grit and resilience of our Latinx community. Our history is full of stories of parents who work countless hours to provide a roof and an education for their children. I am proud and inspired by the hardworking members of the Latinx community who dream of opportunities and turn them into reality because the will to succeed is always welcomed here.

Creating a space to blend both our communities at home and within Databricks allowed us to connect with one another and celebrate Hispanic Heritage Month. To learn more about how you can join our community at Databricks, check out our Careers page.

--

Try Databricks for free. Get started today.

The post Building the Latinx Network and Celebrating Our Heritage appeared first on Databricks.

Announcing Summit Keynotes: Malcolm Gladwell, Dr. Kira Radinsky, Jeremy Singer-Vine

In less than a month, we kick off the inaugural Data + AI Summit Europe, with free general admission and expanding upon the Spark + AI Summit to cover all things data — analytics, science, engineering and AI. We have an amazing lineup of speakers and training courses, covering the latest in SQL on data lakes, designing lakehouse models, deep learning and, of course, plenty on Spark best practices and internals. While these breakout sessions and training classes are delivered by the smartest minds in the industry, we also have a slate of keynote speakers on the (virtual) main stage.

Malcolm Gladwell, award-winning author of Outliers, The Tipping Point, Blink and more

Malcolm Gladwell is the author of five New York Times bestsellers — The Tipping Point, Blink, Outliers, What the Dog Saw, and David and Goliath: Underdogs, Misfits and the Art of Battling Giants. He has been named one of the 100 most influential people by TIME magazine and one of Foreign Policy’s Top Global Thinkers. Gladwell’s new book, Talking to Strangers: What We Should Know About the People We Don’t Know, offers a powerful examination of our interactions with strangers and why they often go wrong. Through a series of encounters and misunderstandings – from history, psychology and infamous legal cases – Gladwell takes us on an intellectual adventure and challenges our assumptions about human nature and the strategies we use to make sense of strangers, who are never simple. He explains why we act the way we do, and how we all might know a little more about those we don’t. He explored how ideas spread in The Tipping Point, decision making in Blink, and the roots of success in Outliers. In David and Goliath, he examines our understanding of the advantages of disadvantages, arguing that we have underestimated the value of adversity and overestimated the value of privilege.

Dr. Kira Radinsky, researcher on AI and Predictive Analytics in Healthcare

Dr. Kira Radinsky is the chairperson and CTO of Diagnostic Robotics, where the most advanced technologies in the field of artificial intelligence are harnessed to make healthcare better, cheaper and more widely available. Dr. Radinsky founded SalesPredict, acquired by eBay in 2016, and served as eBay’s Chief Scientist (IL). She gained international recognition for her work at the Technion and Microsoft Research, developing predictive algorithms that recognized the early warning signs of globally impactful events such as disease epidemics and political unrest. In 2013, she was named one of MIT Technology Review’s 35 Innovators Under 35, and in 2015 Forbes included her among its “30 Under 30” rising stars in enterprise tech. Radinsky also serves as a board member of the Israel Securities Authority and on the technology board of HSBC bank. She also holds a visiting professor position at the Technion, focusing on the application of predictive data mining in medicine.

Jeremy Singer-Vine, Data Editor at BuzzFeed and Investigative Journalist on FinCEN files

Jeremy Singer-Vine is the data editor at BuzzFeed News, where he undertakes projects that combine data analysis and traditional reporting. He was most recently one of the leads on the FinCEN Files investigation, working with the International Consortium of Investigative Journalists (ICIJ) and hundreds of reporters around the world to uncover the origins of $2 trillion in financial transactions. He also publishes Data Is Plural, a weekly newsletter that highlights useful and interesting datasets.

During Jeremy’s career in journalism, he’s been a co-finalist for a Pulitzer Prize, a co-winner of a National Magazine Award, and a co-winner of two Scripps Howard Awards. In addition to his position at BuzzFeed News, he’s also worked at Slate Magazine and The Wall Street Journal.

More Keynotes from Databricks and Customers

We have other keynotes lined up that we’ll be announcing soon. Some of the speakers are already featured on the Data + AI Summit Europe site.

Come and join us

Join the European data community online and enjoy the camaraderie at Data + AI Summit Europe 2020. Register to save your free spot!

Save Your Spot

The post Announcing Summit Keynotes: Malcolm Gladwell, Dr. Kira Radinsky, Jeremy Singer-Vine appeared first on Databricks.

Announcing Single-Node Clusters on Databricks

Databricks is used by data teams to solve the world’s toughest problems. This can involve running large-scale data processing jobs to extract, transform, and analyze data. However, it often also involves data analysis, data science, and machine learning at the scale of a single machine, for instance using libraries like scikit-learn. To streamline these single machine workloads, we are happy to announce native support for creating single-node clusters on Databricks.

Background and motivation

Standard Databricks Spark clusters consist of a driver node and one or more worker nodes. These clusters require a minimum of two nodes — a driver and a worker — in order to run Spark SQL queries, read from a Delta table, or perform other Spark operations. However, for many machine learning model training or lightweight data workloads, multi-node clusters are unnecessary.

Single-node clusters are a cost-efficient option for single machine workloads. They support Spark and Spark data sources including Delta, as well as libraries such as scikit-learn and TensorFlow that are included in the Databricks Runtime for Machine Learning.

For instance, suppose one wanted to train a scikit-learn machine learning model on a Delta table containing the UCI adult census dataset. This relatively small dataset (< 50k tabular rows) can be easily processed, converted to a Pandas dataframe, and used for training a scikit-learn model on a single machine. Spark SQL queries also scale down well to a single-node cluster, as seen in a previous blog post benchmarking Spark on a single machine.
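As a rough illustration of this single-machine pattern, the sketch below reads a Delta table into pandas and trains a scikit-learn model on the driver (the table and column names are placeholders, not from an actual notebook):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# A small Delta table (< 50k rows) easily fits in driver memory as a pandas DataFrame
pdf = spark.table("adult_census").toPandas()

X = pd.get_dummies(pdf.drop("income", axis=1))  # hypothetical label column "income"
y = pdf["income"]

model = LogisticRegression(max_iter=1000).fit(X, y)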

Creating Single-Node Clusters

Single-node clusters are now available in Public Preview as a new cluster mode in the interactive cluster creation UI. Selecting this mode configures the cluster to launch only a driver node, while still supporting Spark jobs in local mode on the driver.

Creating a single node cluster from the UI
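For users who prefer to create such a cluster programmatically, here is a hedged sketch of a Clusters API request (the spark_version and node_type_id values are placeholders, host and token are environment-specific assumptions, and the spark_conf/custom_tags settings follow the single-node configuration described in the documentation):

import requests

cluster_spec = {
    "cluster_name": "single-node-demo",
    "spark_version": "7.3.x-cpu-ml-scala2.12",   # placeholder runtime version
    "node_type_id": "i3.xlarge",                 # placeholder node type
    "num_workers": 0,                            # driver only
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]"
    },
    "custom_tags": {"ResourceClass": "SingleNode"}
}

resp = requests.post("%s/api/2.0/clusters/create" % host,
                     headers={"Authorization": "Bearer %s" % token},
                     json=cluster_spec)
print(resp.json())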

To further simplify the cluster creation process, administrators can also create cluster policies for single-node cluster creation. Using these policies, users can then launch single-node clusters with zero additional configuration, subject to budget controls. For more details, including example single-node cluster policies, see the user guide. In the video below, we illustrate how, if a cluster administrator has set up single-node policies, users can create pre-configured single-node clusters by directly selecting the policy.

Learn more about single-node clusters and start using them today.

--

Try Databricks for free. Get started today.

The post Announcing Single-Node Clusters on Databricks appeared first on Databricks.

Detecting At-risk Patients with Real World Data

With the rise of low cost genome sequencing and AI-enabled medical imaging, there has been substantial interest in precision medicine. In precision medicine, we aim to use data and AI to come up with the best treatment for a disease. While precision medicine has improved outcomes for patients diagnosed with rare diseases and cancers, precision medicine is reactive: the patient has to be sick for precision medicine to be deployed.

When we look at healthcare spending and outcomes, there is a tremendous opportunity to improve cost-of-care and quality of living by preventing chronic conditions such as diabetes, heart disease, or substance use disorders. In the United States, 7 out of 10 deaths and 85% of healthcare spending are driven by chronic conditions, and similar trends are found in Europe and Southeast Asia. Noncommunicable diseases are generally preventable through patient education and by addressing underlying issues that drive the chronic condition. These issues can include underlying biological risk factors such as known genetic risks that drive neurological conditions, socioeconomic factors like environmental pollution or lack of access to healthy food/preventative care, and behavioral risks such as smoking status, alcohol consumption, or having a sedentary lifestyle.

Precision prevention is focused on using data to identify patient populations at risk of developing a disease, and then providing interventions that reduce disease risk. An intervention might include a digital app that remotely monitors at-risk patients and provides lifestyle and treatment recommendations, increased monitoring of disease status, or offering supplemental preventative care. However, deploying these interventions first depends on identifying the patients at risk.

One of the most powerful tools for identifying patients at risk is the use of real world data (RWD), a term that collectively refers to data generated by the healthcare ecosystem, such as electronic medical records (EMR) and health records (EHR) from hospitalizations, clinical practices, pharmacies, healthcare providers, and increasingly data collected from other sources such as genomics, social media, and wearables. In our last blog we demonstrated how to build a clinical data lake from EHR data. In this blog, we build on that by using the Databricks Unified Data Analytics Platform to track a patient’s journey and create a machine learning model. Using this model, given a patient’s encounter history and demographics information, we can assess the risk of a patient for a given condition within a given window of time. In this example, we will look at drug overuse, an important topic given the broad range of poor health outcomes driven by substance use disorders. By tracking our models using MLflow, we make it easy to track how models have changed over time, adding confidence to the process of deploying a model into patient care.

Disease prediction using machine learning on Databricks

Reference architecture for predicting disease risk from EHR data

Data preparation

To train a model to predict risk at a given time, we need a dataset that captures relevant demographic information about the patient (such as age at time of encounter, ethnicity etc) as well as time series data about the patient’s diagnostic history. We can then use this data to train a model that learns the diagnoses and demographic risks that influence the patient’s likelihood of being diagnosed with a disease in the upcoming time period.

Diagram showing the schemas of and relationships between tables extracted from the EHR.
Figure 1: Data schemas and relationships between tables extracted from the EHR

To train this model, we can leverage the patient’s encounter records and demographic information, as would be available in an electronic health record (EHR). Figure 1 depicts the tables we will use in our workflow. These tables were prepared using the notebooks from our previous blog. We will proceed to load encounters, organizations and patient data (with obfuscated PII information) from Delta Lake and create a dataframe of all patient encounters along with patient demographic information.

patient_encounters = (
    encounters
    .join(patients, ['PATIENT'])
    .join(organizations, ['ORGANIZATION'])
)
display(patient_encounters.filter('REASONDESCRIPTION IS NOT NULL').limit(10))

Based on the target condition, we also select the set of patients that qualify to be included in the training data. Namely, we include cases (patients that have been diagnosed with the disease at least once in their encounter history) and an equal number of controls (patients without any history of the disease).

positive_patients = (
    patient_encounters
    .select('PATIENT')
    .where(lower("REASONDESCRIPTION").like("%{}%".format(condition)))
    .dropDuplicates()
    .withColumn('is_positive',lit(True))
)
negative_patients = (
    all_patients
    .join(positive_patients,on=['PATIENT'],how='left_anti')
    .limit(positive_patients.count())
    .withColumn('is_positive',lit(False))
)
patients_to_study = positive_patients.union(negative_patients)

Now we limit our set of encounters to patients included in the study.

qualified_patient_encounters_df = (
    patient_encounters
    .join(patients_to_study, on=['PATIENT'])
    .filter("DESCRIPTION is not NULL")
)

Now that we have the records of interest, our next step is to add features. For this forecasting task, in addition to demographic information, we choose as historical context for a given encounter the total number of times the patient has been diagnosed with the condition or any known coexisting conditions (comorbidities), as well as the number of previous encounters.

Although for most diseases there is extensive literature on comorbid conditions, we can also leverage the data in our real world dataset to identify comorbidities associated with the target condition.

comorbid_conditions = (
    positive_patients.join(patient_encounters, ['PATIENT'])
    .where(col('REASONDESCRIPTION').isNotNull())
    .dropDuplicates(['PATIENT', 'REASONDESCRIPTION'])
    .groupBy('REASONDESCRIPTION').count()
    .orderBy('count', ascending=False)
    .limit(num_conditions)
    )

In our code, we use notebook widgets to specify the number of comorbidities to include, as well as the length of time (in days) to look across encounters. These parameters are logged using MLflow’s tracking API.

Use MLflow to log parameters, like the condition we are studying.

Now we need to add comorbidity features to each encounter. Corresponding to each comorbidity c, we add a column that indicates how many times that condition has been observed in the past w days, i.e.

$\text{recent}_{c}(t) = \sum_{t-w \,\le\, i \,<\, t} x_{i,c}$

where

$x_{i,c} = \begin{cases} 1, & \text{if the patient was diagnosed with condition } c \text{ at time } i \\ 0, & \text{otherwise} \end{cases}$

We add these features in two steps. First, we define a function that adds the comorbidity indicator functions (i.e. $x_{i,c}$):

def add_comorbidities(qualified_patient_encounters_df, comorbidity_list):
    output_df = qualified_patient_encounters_df
    idx = 0
    for comorbidity in comorbidity_list:
        output_df = (
            output_df
            .withColumn("comorbidity_%d" % idx, (output_df['REASONDESCRIPTION'].like('%' + comorbidity['REASONDESCRIPTION'] + '%')).cast('int'))
            .withColumn("comorbidity_%d" % idx, coalesce(col("comorbidity_%d" % idx), lit(0)))  # replace null values with 0
            .cache()
        )
        idx += 1
    return output_df

And then we sum these indicator functions over a contiguous range of days using Spark SQL’s powerful support for window functions:

def add_recent_encounters(encounter_features):
  lowest_date = (
    encounter_features
    .select('START_TIME')
    .orderBy('START_TIME')
    .limit(1)
    .withColumnRenamed('START_TIME', 'EARLIEST_TIME')
    )
  output_df = (
    encounter_features
    .crossJoin(lowest_date)
    .withColumn("day", datediff(col('START_TIME'), col('EARLIEST_TIME')))
    .withColumn("patient_age", datediff(col('START_TIME'), col('BIRTHDATE')))
    )
  w = (
    Window.orderBy(output_df['day'])
    .partitionBy(output_df['PATIENT'])
    .rangeBetween(-int(num_days), -1)
  )
  for comorbidity_idx in range(num_conditions):
    col_name = "recent_%d" % comorbidity_idx
    
    output_df = (
        output_df
        .withColumn(col_name, sum(col("comorbidity_%d" % comorbidity_idx)).over(w))
        .withColumn(col_name,coalesce(col(col_name),lit(0)))
    )
  return(output_df)

After adding comorbidity features, we need to add the target variable, which indicates whether the patient is diagnosed with the target condition in a given window of time in the future (for example a month after the current encounter). The logic of this operation is very similar to the previous step, with the difference that the window of time covers future events. We only use a binary label, indicating whether the diagnosis we are interested in will occur in the future or not.

def add_label(encounter_features,num_days_future):
  w = (
    Window.orderBy(encounter_features['day'])
    .partitionBy(encounter_features['PATIENT'])
    .rangeBetween(0,num_days_future)
  )
  output_df = (
    encounter_features
    .withColumn('label', max(col("comorbidity_0")).over(w))
    .withColumn('label',coalesce(col('label'),lit(0)))
  )
  return output_df

Now we write these features into a feature store within Delta Lake. To ensure reproducibility, we add the MLflow experiment ID and the run ID as columns to the feature store. The advantage of this approach is that, as we receive more data, we can add new features to the feature store that can be reused in the future.
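A minimal sketch of that write is shown below (the Delta path and the experiment/run ID variables are assumptions for illustration):

from pyspark.sql.functions import lit

(dataset_df
  .withColumn("experiment_id", lit(experiment_id))  # MLflow experiment ID for reproducibility
  .withColumn("run_id", lit(run_id))                # MLflow run ID for reproducibility
  .write
  .format("delta")
  .mode("append")
  .save(feature_store_path))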

Controlling for quality issues in our data

Before we move ahead with the training task, we take a look at the data to see how the labels are distributed among classes. In many applications of binary classification, one class can be rare, for example in disease prediction. This class imbalance will have a negative impact on the learning process. During estimation, the model tends to focus on the majority class at the expense of rare events. Moreover, the evaluation process is also compromised: for example, in an imbalanced dataset with 0/1 labels distributed as 95% and 5% respectively, a model that always predicts 0 would have an accuracy of 95%. If the labels are imbalanced, we need to apply one of the common techniques for correcting for imbalanced data.

Only 4% of encounters in our training set preceded a disease diagnosis.

Looking at our training data, we see (Figure 2) that this is a very imbalanced dataset: over 95% of the observed time windows do not show evidence of a diagnosis. To adjust for imbalance, we can either downsample the control class or generate synthetic samples. This choice depends on the dataset size and the number of features. In this example, we downsample the majority class to obtain a balanced dataset. Note that in practice, you can choose a combination of methods, for example downsample the majority class and also assign class weights in your training algorithm.

df1 = dataset_df.filter('label==1')
n_df1=df1.count()
df2 = dataset_df.filter('label==0').sample(False,0.9).limit(n_df1)
training_dataset_df = df1.union(df2).sample(False,1.0)
display(training_dataset_df.groupBy('label').count())

Using sampling to rebalance our dataset.

Model training

To train the model, we augment our conditions with a subset of demographic and comorbidity features, apply labels to each observation, and pass this data to a model for training downstream. For example, here we augment our recently diagnosed comorbidities with the encounter class (e.g., was this appointment for preventative care or an ER visit?) and the cost of the visit, and for demographic information we choose race, gender, zip code and the patient’s age at the time of the encounter.

Most often, although the original clinical data can add up to terabytes, after filtering and limiting records based on inclusion/exclusion criteria we end up with a dataset that can be trained on a single machine. We can easily convert Spark DataFrames to pandas DataFrames and train a model with any algorithm of choice. When using the Databricks ML Runtime, we have a wide range of open source ML libraries readily available.

Any machine learning algorithm takes a set of parameters (hyperparameters), and depending on these input parameters the score can change. In addition, in some cases the wrong parameters or algorithm can result in overfitting. To ensure that the model performs well, we use hyperparameter tuning to choose the best model architecture, and then train the final model with the parameters obtained from this step.

To perform model tuning, we first need to pre-process the data. In this dataset, in addition to numeric features (counts of recent comorbidities, for example), we also have categorical demographic data that we would like to use. For categorical data, the best approach is to use one-hot encoding. There are two main reasons for this: first, most classifiers (logistic regression in this case) operate on numeric features. Second, if we simply convert categorical variables to numeric indices, it would introduce ordinality into our data which could mislead the classifier: for example, if we convert state names to indices, e.g. California to 5 and New York to 23, then New York becomes “bigger” than California. While this reflects the index of each state name in an alphabetized list, in the context of our model this ordering does not mean anything. One-hot encoding eliminates this effect.
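As a tiny illustration of the ordinality point (the values are made up):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

states = np.array([['California'], ['New York'], ['California']])
print(OneHotEncoder(sparse=False).fit_transform(states))
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]  -- each state gets its own column, so no artificial ordering is introduced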

The pre-processing step in this case does not take any input parameters, and the hyperparameters only affect the classifier, not the preprocessing part. Hence, we perform pre-processing separately and then use the resulting dataset for model tuning:

from sklearn.preprocessing import OneHotEncoder
import numpy as np
def pre_process(training_dataset_pdf):
    X_pdf=training_dataset_pdf.drop('label',axis=1)
    y_pdf=training_dataset_pdf['label']
    onehotencoder = OneHotEncoder(handle_unknown='ignore')
    one_hot_model = onehotencoder.fit(X_pdf.values)
    X=one_hot_model.transform(X_pdf)
    y=y_pdf.values
    return(X,y)

Next, we would like to choose the best parameters for the model. For this classification, we use logistic regression with elastic net penalization. Note that after applying one-hot encoding, depending on the cardinality of the categorical variable in question, we can end up with many features, which can surpass the number of samples. To avoid overfitting in such problems, a penalty is applied to the objective function. The advantage of elastic net regularization is that it combines two penalization techniques (LASSO and ridge regression), and the degree of the mixture can be controlled by a single variable during hyperparameter tuning.

To improve on the model, we search a grid of hyperparameters using hyperopt to find the best parameters. In addition, we use the SparkTrials mode of hyperopt to perform the hyperparameter search in parallel. This process leverages Databricks’ managed MLflow to automatically log the parameters and metrics corresponding to each hyperparameter run. To validate each set of parameters, we use a k-fold cross validation scheme with the F1 score as the metric to assess the model. Note that since k-fold cross validation generates multiple values, we choose the minimum of the scores (the worst case scenario) and try to maximize that when we use hyperopt.

Applying pre-processing function to training dataframe.

from math import exp

import mlflow
from hyperopt import STATUS_OK
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def params_to_lr(params):
    return {
        'penalty':      'elasticnet',
        'multi_class':  'ovr',
        'random_state': 43,
        'n_jobs':       -1,
        'solver':       'saga',
        'tol':          exp(params['tol']),  # exp() here because hyperparams are searched in log space
        'C':            exp(params['C']),
        'l1_ratio':     exp(params['l1_ratio'])
    }

def tune_model(params):
    with mlflow.start_run(run_name='tuning-logistic-regression', nested=True) as run:
        clf = LogisticRegression(**params_to_lr(params)).fit(X, y)
        loss = -cross_val_score(clf, X, y, n_jobs=-1, scoring='f1').min()
        return {'status': STATUS_OK, 'loss': loss}

To improve our search over the space, we choose the grid of parameters in logspace and define a transformation function to convert the suggested parameters by hyperopt. For a great overview of the approach and why we chose to define the hyperparameter space like this, look at this talk that covers how you can manage the end-to-end ML life cycle on Databricks.

from hyperopt import fmin, hp, tpe, SparkTrials, STATUS_OK
search_space = {
    # use uniform over loguniform here simply to make metrics show up better in mlflow comparison, in logspace
    'tol':                  hp.uniform('tol', -3, 0),
    'C':                    hp.uniform('C', -2, 0),
    'l1_ratio':             hp.uniform('l1_ratio', -3, -1),
}
spark_trials = SparkTrials(parallelism=2)
best_params = fmin(fn=tune_model, space=search_space, algo=tpe.suggest, max_evals=32, rstate=np.random.RandomState(43), trials=spark_trials)

The outcome of this run is the best parameters, assessed based on the F1-score from our cross validation.

params_to_lr(best_params)
Out[46]: {'penalty': 'elasticnet',
    'multi_class': 'ovr',
    'random_state': 43,
    'n_jobs': -1,
    'solver': 'saga',
    'tol': 0.06555920596441883,
    'C': 0.17868321158011416,
    'l1_ratio': 0.27598949120226646}

Now let’s take a look at the MLflow dashboard. MLflow automatically groups all runs of the hyperopt together and we can use a variety of plots to inspect the impact of each hyperparameter on the loss function, as shown in Figure 3. This is particularly important for getting a better understanding of the behavior of our model and the effect of the hyperparameters. For example, we noted that lower values for C, the inverse of regularization strength, result in higher values for F1.

Parallel coordinates plots for our models in MLflow.
Fig 3. Parallel coordinates plots for our models in MLflow.

After finding the optimal parameter combination, we train a binary classifier with the optimal hyperparameters and log the model using MLflow. MLflow’s model API makes it easy to store a model, regardless of the underlying library that was used for training, as a Python function that can later be called during model scoring. To help with model discoverability, we log the model with a name associated with the target condition (for example in this case, “drug-overdose”).

import mlflow.sklearn
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from mlflow.models.signature import infer_signature

## since we want the model to output probabilities (risk) rather than predicted labels, we overwrite
## mlflow.pyfunc's predict method:
class SklearnModelWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, model):
        self.model = model

    def predict(self, context, model_input):
        return self.model.predict_proba(model_input)[:, 1]

def train(params):
    with mlflow.start_run(run_name='training-logistic-regression', nested=True) as run:
        mlflow.log_params(params_to_lr(params))

        X_arr = training_dataset_pdf.drop('label', axis=1).values
        y_arr = training_dataset_pdf['label'].values

        ohe = OneHotEncoder(handle_unknown='ignore')
        clf = LogisticRegression(**params_to_lr(params)).fit(X, y)

        pipe = Pipeline([('one-hot', ohe), ('clf', clf)])

        lr_model = pipe.fit(X_arr, y_arr)

        score = cross_val_score(clf, ohe.transform(X_arr), y_arr, n_jobs=-1, scoring='accuracy').mean()
        wrapped_lr_model = SklearnModelWrapper(lr_model)

        model_name = '-'.join(condition.split())
        mlflow.log_metric('accuracy', score)
        mlflow.pyfunc.log_model(model_name, python_model=wrapped_lr_model)
        displayHTML('The model accuracy is: %s' % (score))
        return mlflow.active_run().info

Now, we can train the model by passing the best params obtained from the previous step.
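For example, assuming best_params returned from the hyperopt search above:

# Train and log the final model with the tuned hyperparameters
run_info = train(best_params)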

Note that for model training, we have included preprocessing (one hot encoding) as part of the sklearn pipeline and log the encoder and classifier as one model. In the next step, we can simply call the model on patient data and assess their risk.

Model deployment and productionalization

After training the model and logging it to MLflow, the next step is to use the model for scoring new data. One of the features of MLflow is that you can search through experiments based on different tags. For example, in this case we use the run name that was specified during model training to retrieve the artifact URI of the trained models. We can then order the retrieved experiments based on key metrics.

import mlflow
best_run=mlflow.search_runs(filter_string="tags.mlflow.runName = 'training-logistic-regression'",order_by=['metrics.accuracy DESC']).iloc[0]
model_name='drug-overdose'
clf=mlflow.pyfunc.load_model(model_uri="%s/%s"%(best_run.artifact_uri,model_name))
clf_udf=mlflow.pyfunc.spark_udf(spark, model_uri="%s/%s"%(best_run.artifact_uri,model_name))   

Once we have chosen a specific model, we can then load the model by specifying the model URI and name:

Loading in features.

Applying loaded model from MLflow to dataframe of features.
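As a minimal usage sketch (features_df and its columns are assumptions for illustration, matching the feature columns used at training time):

from pyspark.sql.functions import col

risk_df = features_df.withColumn("risk", clf_udf(*features_df.columns))
display(risk_df.orderBy(col("risk").desc()).limit(10))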

We can also use the Databricks Model Registry to manage model versions and the production lifecycle, and to simplify model serving.

Translating disease prediction into precision prevention

In this blog, we walked through the need for a precision prevention system that identifies clinical and demographic covariates that drive the onset of chronic conditions. We then looked at an end-to-end machine learning workflow that used simulated data from an EHR to identify patients who were at risk of drug overdose. At the end of this workflow, we were able to export the ML model we trained from MLflow, and we applied it to a new stream of patient data.

While this model is informative, it doesn’t have impact until translated into practice. In real world practice, we have worked with a number of customers to deploy these and similar systems into production. For instance, at the Medical University of South Carolina, they were able to deploy live-streaming pipelines that processed EHR data to identify patients at risk of sepsis. This led to detection of sepsis-related patient decline 8 hours in advance. In a similar system at INTEGRIS Health, EHR data was monitored for emerging signs of pressure ulcer development. In both settings, whenever a patient was identified, a care team was alerted to their condition. In the health insurance setting, we have worked with Optum to deploy a similar model. They were able to develop a disease prediction engine that used recurrent neural networks in a long-term short-term architecture to identify disease progression with good generalization across nine different disease areas. This model was used to align patients with preventative care pathways, leading to improved outcomes and cost-of-care for chronic disease patients.

While most of our blog has focused on the use of disease prediction algorithms in healthcare settings, there is also a strong opportunity to build and deploy these models in a pharmaceutical setting. Disease prediction models can provide insights into how drugs are being used in a postmarket setting, and even detect previously undetected protective effects that can inform label expansion efforts. Additionally, disease prediction models can be useful when looking at clinical trial enrollment for rare—or otherwise underdiagnosed—diseases. By creating a model that looks at patients who were misdiagnosed prior to receiving a rare disease diagnosis, we can create educational material that educates clinicians about common misdiagnosis patterns and hopefully create trial inclusion criteria that leads to increased trial enrollment and higher efficacy.

Get started with precision prevention on a health Delta Lake

In this blog, we demonstrated how to use machine learning on real-world data to identify patients at risk of developing a chronic disease. To learn more about using Delta Lake to store and process clinical datasets, download our free eBook on working with real world clinical datasets. You can also start a free trial today using the patient risk scoring notebooks from this blog.

--

Try Databricks for free. Get started today.

The post Detecting At-risk Patients with Real World Data appeared first on Databricks.

Faster SQL: Adaptive Query Execution in Databricks

Earlier this year, Databricks wrote a blog on the whole new Adaptive Query Execution framework in Spark 3.0 and Databricks Runtime 7.0. The blog has sparked a great amount of interest and discussions from tech enthusiasts. Today, we are happy to announce that Adaptive Query Execution (AQE) has been enabled by default in our latest release of Databricks Runtime, DBR 7.3.

AQE is an execution-time SQL optimization framework that aims to counter the inefficiency and the lack of flexibility in query execution plans caused by insufficient, inaccurate, or obsolete optimizer statistics. As we continue our effort to expand AQE functionalities, below are the specific use cases you can find AQE most effective in its current status.

Optimizing Shuffles

While Spark shuffles are a crucial part of the query performance, finding the right shuffle partition number has always been a big struggle for Spark users. That is because the amount of data varies from query to query, or even from stage to stage within the same query, and using the same shuffle partition number can lead to either small tasks that make inefficient use of the Spark scheduler, or otherwise big tasks that may end up with excessive garbage collection (GC) overhead and disk spilling.

Now, AQE adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output. So as data size grows or shrinks over different stages, the task size will remain roughly the same, neither too big nor too small.

However, it is important to note that AQE does not set the map-side partition number automatically. This means that, in order for this AQE feature to work well, it is recommended that users set a relatively high initial shuffle partition number through the SQL config spark.sql.shuffle.partitions. Alternatively, they can enable the Databricks edge feature “Auto-Optimized Shuffle” by setting the config spark.databricks.adaptive.autoOptimizeShuffle.enabled to true.
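For illustration, the settings mentioned above can be applied as follows (the partition count is an arbitrary example):

# AQE is enabled by default in DBR 7.3; shown here only for completeness
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Set a relatively high initial shuffle partition number for AQE to coalesce from
spark.conf.set("spark.sql.shuffle.partitions", "2000")
# Or, alternatively, enable the Databricks auto-optimized shuffle edge feature
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")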

Choosing Join Strategies

One of the most important cost-based decisions made in the Spark optimizer is the selection of join strategies, which is based on the size estimation of the join relations. But since this estimation can go wrong in both directions, it can either result in a less efficient join strategy because of overestimation, or even worse, out-of-memory errors because of underestimation.

AQE offers a trouble-free solution here: by re-examining the actual relation sizes at runtime, it can switch to the faster broadcast hash join during execution.

Handling Skew Joins

Data skew is a common problem in which data is unevenly distributed, causing bottlenecks and significant performance downgrade, especially with sort merge joins. Those individual long running tasks will become stragglers, slowing down the entire stage. And on top of that, spilling data out of memory onto disk usually happens in those skew partitions, worsening the effect of the slowdown.

The unpredictable nature of the data skew often makes it hard for the static optimizer to handle skew automatically, or even with the help of querying hints. By collecting runtime statistics, AQE can now detect skew joins at runtime and split skew partitions into smaller sub-partitions, thus eliminating the negative impact of skew on query performance.

Understand AQE Query Plans

One major difference for the AQE query plan is that it often evolves as execution progresses. Several AQE specific plan nodes are introduced to provide more details about the execution. Furthermore, AQE uses a new query plan string format that can show both the initial and the final query execution plans. This section will help users get familiar with the new AQE query plan, and show users how to identify the effects of AQE on the query.

The AdaptiveSparkPlan Node

AQE-applied queries usually have one or more AdaptiveSparkPlan nodes as the root node of each query or subquery. Before or during the execution, the `isFinalPlan` flag will show as `false`. Once the query is completed, this flag will turn to `true` and the plan under the AdaptiveSparkPlan node will no longer change.

The AdaptiveSparkPlan Node, where AQE-applied queries have one or more AdatpiveSparkPlan nodes as the root node of each query or subquery.
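One simple way to observe this flag is to re-explain a DataFrame after running it (a minimal sketch with an illustrative query, not taken from this blog):

from pyspark.sql.functions import col

df = (spark.range(0, 1000000)
        .withColumn("key", col("id") % 100)
        .groupBy("key")
        .count())

df.explain()   # the root node prints as something like: AdaptiveSparkPlan isFinalPlan=false
df.collect()   # execute the query
df.explain()   # the same root node should now report isFinalPlan=true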

The CustomShuffleReader Node

The CustomShuffleReader node is the key to AQE optimizations. It can dynamically adjust the post shuffle partition number based on the statistics collected during the shuffle map stage. In the Spark UI, users can hover over the node to see the optimizations it applied to the shuffled partitions.

When the flag of CustomShuffleReader is `coalesced`, it means AQE has detected and coalesced small partitions after the shuffle based on the target partition size. The details of this node show the number of shuffle partitions and the partition sizes after the coalesce.

When the flag of CustomShuffleReader is `coalesced`, it means AQE has detected and coalesced small partitions after the shuffle based on the target partition size.

When the flag of CustomShuffleReader is `skewed`, it means AQE has detected data skew in one or more partitions before a sort-merge join operation. The details of this node show the number of skewed partitions as well as the total number of new partitions split from the skewed partitions.

When the flag of CustomShuffleReader is `skewed`, it means AQE has detected data skew in one or more partitions before a sort-merge join operation.

Both effects can also take place at the same time:

The `coalesced` and `skewed` effects can take place simultaneously, meaning AQE can both detect and coalesce partition size and handle data skew before a sort-merge join.

Detecting Join Strategy Change

A join strategy change can be identified by comparing changes in query plan join nodes before and after the AQE optimizations. In DBR 7.3, AQE query plan string will include both the initial plan (the plan before applying any AQE optimizations) and the current or the final plan. This provides better visibility into the optimizations AQE applied to the query. Here is an example of the new query plan string that shows a broadcast-hash join being changed to a sort-merge join:
Detecting Join Strategy Change: A join strategy change can be identified by comparing changes in query plan join nodes before and after the AQE optimizations.

The Spark UI will only display the current plan. In order to see the effects using the Spark UI, users can compare the plan diagrams before the query execution and after execution completes:
The Spark UI will only display the current SQL query execution plan. In order to see the effects using the Spark UI, users can compare the plan diagrams before the query execution and after execution completes.

Detecting Skew Join

The effect of skew join optimization can be identified via the join node name.

In the Spark UI:
Detecting Skew Join:The effect of skew join optimization can be identified via the join node name.

In the query plan string:
AQE: Detecting skew join in SQL query execution plan string.

Adaptive query execution incorporates runtime statistics to make query execution more efficient. Unlike other optimization techniques, it can automatically pick an optimal post shuffle partition size and number, switch join strategies, and handle skew joins. Learn more about AQE in the Spark + AI Summit 2020 talk: Adaptive Query Execution: Speeding Up Spark SQL at Runtime and the AQE user guide. Get started today and try out the new AQE features in Databricks Runtime 7.3.

--

Try Databricks for free. Get started today.

The post Faster SQL: Adaptive Query Execution in Databricks appeared first on Databricks.


Reputation Risk: Improving Business Competency and Nurturing Happy Customers by Building a Risk Analysis Engine

Why reputation risk matters

When it comes to the term “risk management”, Financial Service Institutions (FSIs) have seen guidance and frameworks around capital requirements from the Basel standards. But none of these guidelines mention reputation risk, and for years organizations have lacked a clear way to manage and measure non-financial risks such as reputation risk. Given how the conversation has shifted recently towards the importance of Environmental, Social and Governance (ESG), companies must bridge the reputation-reality gap and ensure processes are in place to adapt to changing beliefs and expectations from stakeholders and customers.

For a FSI, reputation is arguably its most important asset.

For financial institutions, reputation is arguably their most important asset. For example, Goldman Sachs’ renowned business principles state that “Our assets are our people, capital and reputation. If any of these are ever diminished, the last is the most difficult to restore.” In commercial banking, for example, brands that act on consumer complaints and feedback are able to manage legal, commercial and reputation risks better than their competitors. American Banker published this article, which reiterates that non-financial risks, such as reputation risk, are critical factors for FSIs to address in a rapidly changing landscape.

The process of winning a customer’s trust typically involves harnessing vast amounts of data through multiple disparate channels to mine for insights related to issues that may adversely impact a brand’s reputation. Despite the importance of data in nurturing happier customers, most organizations struggle to architect a platform that solves fundamental challenges related to data privacy, scale, and model governance as typically seen in the financial services industry.

In this blog post, we will demonstrate how to leverage the power of the Databricks Unified Data Analytics Platform to solve these challenges, unlock insights and initiate remediation actions. We will look at Delta Lake, an open source storage layer that brings reliability and performance to data lakes and simplifies compliance with GDPR and CCPA regulations for both structured and unstructured data. We also cover the Machine Learning Runtime and Managed MLflow, which are part of the Databricks platform and enable data scientists and business analysts to leverage popular open source machine learning and governance frameworks to build and deploy state-of-the-art machine learning models. This approach to reputation risk enables FSIs to measure brand perception and brings together multiple stakeholders to work collaboratively to drive higher levels of customer satisfaction and trust.

Databricks Unified Risk Architecture for assessing reputational risk.

This blog post references notebooks which cover the multiple data engineering and data science challenges that must be addressed to effectively modernize reputation risk management practices:

  • Using Delta Lake to ingest anonymized customer complaints in real time
  • Explore customer feedback at scale using Koalas
  • Leverage AI and open source to enable proactive risk management
  • Democratizing AI to risk and advocacy teams using SQL and Business Intelligence (BI) / Machine Learning (ML) reports

Harnessing cloud storage

Object storage has been a boon to organizations looking to park massive amounts of data at a cheaper cost when compared to traditional data warehouses. But, this comes with operational overhead. When data arrives in rapid volumes, managing this data becomes a huge challenge as often corrupt and unreliable data points lead to inconsistencies that are hard to correct at later points in time.

This has been a major pain-point for many FSIs who have started on an AI journey to develop solutions that enable faster insights and get more from the data that is being collected. Managing reputation risk requires major effort by organizations to measure customer satisfaction and brand perception. Taking a data + AI approach to preserving customer trust requires infrastructure that can support storing massive amounts of customer data in a secure manner, ensuring no personally identifiable information (PII) is exploited, and full compliance with PCI-DSS regulation. While securing and storing the data is only the beginning, performing exploration at scale on millions of complaints and building models that provide prescriptive insights are key to a successful implementation.

As a unified data analytics platform, Databricks not only allows the ingestion and processing of large amounts of data but also enables users to apply AI – at scale – to uncover insights about reputation and customer perceptions. Throughout this blog post, we will ingest data from the Consumer Finance Protection Bureau (CFPB) and build data pipelines to better explore product feedback from consumers using Delta Lake and Koalas API. Open-source libraries will be used to build and deploy ML models in order to classify and measure customer complaint severity across various products and services. By unifying batch and streaming, complaints can be categorized and re-routed to the appropriate advocacy teams in real-time, leading to better management of incoming complaints and greater customer satisfaction.

Establishing gold data standards

As Databricks already leverages all the security tools provided by the cloud vendors, Apache Spark™ and Delta Lake offer additional enhancements such as data quarantine and schema enforcement to maintain and protect data quality in a timely manner. We use Spark to read the complaints data with an explicit schema and persist it to Delta Lake. In the process, we also provide a path for bad records caused by schema mismatches, data corruption, or syntax errors, writing them to a separate location that can be investigated later for consistency.

df = (spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("quote", "\"")
  .option("escape", "\"")
  .option("badRecordsPath", "/tmp/complaints_invalid")
  .schema(schema)
  .csv("/tmp/complaints.csv"))
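As a minimal sketch of the next step (the table name and paths here are illustrative), the parsed records can be persisted to Delta Lake and any quarantined rows reviewed separately:

df.write.format("delta").mode("overwrite").saveAsTable("complaints.complaints_bronze")

# Rejected rows typically land as JSON under a timestamped bad_records folder
# beneath the badRecordsPath configured above
bad_records = spark.read.json("/tmp/complaints_invalid/*/bad_records/*")
display(bad_records)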

It is well known that sensitive data like PII is a major threat and increases the attack surface for any enterprise. Pseudonymization, along with ACID transactional capabilities and time-based data retention enforcement, helps maintain data compliance when using Delta Lake for column-based operations. However, this becomes a real challenge with unstructured data, where each complaint could be a transcript from an audio call, web chat, or e-mail and contain personal information such as customer first and last names, not to mention the consumer's right to be forgotten (as required by GDPR). In the example below, we demonstrate how organizations can leverage natural language processing (NLP) techniques to anonymize highly unstructured records whilst preserving their semantic value (i.e. replacing a mention of a name should preserve the underlying meaning of a consumer complaint).

Using open-source libraries like spaCy, organizations can extract specific entities such as customer and agent names, Social Security Numbers (SSNs), account numbers, and other PII (such as the names in the example below).

Example of how Databricks’ reputational risk framework uses spaCy to highlight entities.

In the code below, we show how a simple anonymization strategy based on natural language processing technique can be enabled as a user-defined function (UDF).

from typing import Iterator
import pandas as pd
import spacy
from pyspark.sql.functions import pandas_udf

def anonymize_record(original, nlp):
  # Replace any detected person name with a placeholder, preserving meaning
  doc = nlp(original)
  for X in doc.ents:
    if(X.label_ == 'PERSON'):
      original = original.replace(X.text, "John Doe")
  return original

@pandas_udf('string')
def anonymize(csi: Iterator[pd.Series]) -> Iterator[pd.Series]:

  # download and load the spaCy model only once per executor task
  spacy.cli.download("en_core_web_sm")
  nlp = spacy.load("en_core_web_sm")

  # anonymize each batch of complaint text
  for cs in csi:
    yield cs.map(lambda x: anonymize_record(x, nlp))

By understanding the semantic value of each word (e.g. a name) through NLP, organizations can easily obfuscate sensitive information from unstructured data as per the example below.

With Databricks’ approach to reputational risk assessment, more advanced entity recognition models can be applied to obfuscate sensitive information from an unstructured dataset.

This method scales well to handle multiple streams of data in real time as well as batch processing, continuously updating the latest information in the target Delta table so it can be consumed by data scientists and business analysts for further analysis.
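A minimal sketch of such a streaming pipeline using Structured Streaming and the anonymize() UDF defined above; the table, checkpoint, and output paths are illustrative:

(spark.readStream
  .table("complaints.complaints_bronze")              # raw complaints (illustrative name)
  .withColumn("complaint", anonymize("complaint"))    # scrub PII from the free-text field
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/checkpoints/complaints_anonymized")
  .outputMode("append")
  .start("/tmp/delta/complaints_anonymized"))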

Databricks increases data controls and quality in real time, enabling data engineers, data scientists, and business analysts to collaborate on a unified data analytics platform.

Such a practical approach to data science demonstrates the need for organizations to break the silos that exist between traditional data science activities and day-to-day data operations, bringing all personas onto the same data and analytics platform.

Measuring brand perception and customer sentiment

With better reputation management systems, FSIs can build a superior customer experience by tracking and isolating customer feedback on specific products and services offered by the institution. This not only helps discover problem areas but also helps internal teams be more proactive and reach out to customers in distress. To better understand data, data scientists traditionally sample large datasets to produce smaller sets they can explore in greater depth (sometimes on their laptops) using tools they are familiar with, such as pandas DataFrames and Matplotlib visualizations. To minimize data movement across platforms (and the risk associated with moving data) and maximize the efficiency and effectiveness of exploratory data analysis at scale, Koalas can be used to explore all of your data with a syntax most data scientists are already familiar with (similar to pandas).

In the example below, we explore all of JPMorgan Chase’s complaints using simple pandas-like syntax while still utilizing the distributed Spark engine under the hood.

import databricks.koalas as ks
kdf = spark.read.table("complaints.complaints_anonymized").to_koalas()

jp_kdf = kdf[kdf['company'] == 'JPMORGAN CHASE & CO.']
jp_kdf['product'].value_counts().plot('bar')

Sample chart visualizing number of complaints across multiple products using Koalas API.

To take the analysis further, we can run a term frequency analysis on customer complaints to identify the top issues reported by customers across all the products of a particular FSI. At a glance, we can easily identify issues related to identity theft and unfair debt collection.

Sample term frequency analysis chart visualizing the most descriptive n-gram mentioned in consumer complaints, produced via the Databricks approach to reputational risk analysis.
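A rough sketch of how such an n-gram frequency analysis can be produced; the table name, column name, and sample size are illustrative:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Pull a sample of the anonymized complaints to the driver
complaints_pdf = (spark.read.table("complaints.complaints_anonymized")
                  .select("complaint").limit(50000).toPandas())

# Count the most frequent 2- and 3-grams across the sampled complaints
vectorizer = CountVectorizer(ngram_range=(2, 3), stop_words="english", max_features=20)
counts = vectorizer.fit_transform(complaints_pdf["complaint"].fillna(""))

freq = pd.Series(counts.sum(axis=0).A1, index=vectorizer.get_feature_names())
freq.sort_values().plot(kind="barh", title="Most frequent n-grams in complaints")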

We can dig further into individual products, such as consumer loans and credit cards, using a word cloud to better understand what customers are complaining about.

Understanding consumer complaints through word cloud visualization, produced via the Databricks approach to reputational risk analysis.
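A minimal sketch of the word cloud, assuming the open-source wordcloud package is installed (for example via %pip install wordcloud); the table, column, and product label are illustrative:

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Gather the complaint text for a single product
pdf = (spark.read.table("complaints.complaints_anonymized")
       .select("product", "complaint").toPandas())
text = " ".join(pdf.loc[pdf["product"] == "Credit card", "complaint"].fillna(""))

# Render the most frequent terms for that product
wc = WordCloud(stopwords=STOPWORDS, background_color="white", max_words=100).generate(text)
plt.figure(figsize=(10, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")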

While exploratory data analysis is great for business intelligence (BI) and reactive analytics, it is important to understand, predict, and categorize direct customer feedback, public reviews, and other social media interactions in real time in order to build trust, enable effective customer service, and measure individual product performance. While many solutions enable us to collect and store data, the ability to seamlessly analyze and act on that data within a unified platform is a must when building reputation management systems.

In order to validate the predictive potential of our consumer data, and therefore confirm our dataset is a good fit for ML, we can identify similarity between complaints using t-Distributed Stochastic Neighbor Embedding (t-SNE), as per the example below. Although some consumer complaints may overlap in terms of possible categories (secured and unsecured lending exhibit similar keywords), we can observe distinct clusters, indicative of patterns that could easily be learned by a machine.

Validating the predictive potential of consumer complaints through t-SNE visualization.

The above plot confirms patterns that would enable us to classify complaints. The partial overlap also indicates that some complaints could easily be misclassified by end users or agents, resulting in a suboptimal complaint management system and a poor customer experience.
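A rough sketch of this t-SNE validation over TF-IDF features; the table, columns, and sampling fraction are illustrative:

import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Sample complaints and embed them with TF-IDF
sample = (spark.read.table("complaints.complaints_anonymized")
          .select("product", "complaint").sample(0.01).toPandas())

tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
X = tfidf.fit_transform(sample["complaint"].fillna(""))

# Reduce dimensionality before t-SNE to keep the projection tractable
X_reduced = TruncatedSVD(n_components=50, random_state=42).fit_transform(X)
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_reduced)

# Colour points by product category to reveal cluster structure
plt.scatter(embedding[:, 0], embedding[:, 1],
            c=sample["product"].astype("category").cat.codes, s=5, cmap="tab10")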

ML and augmented intelligence

The Databricks ML Runtime provides access to reliable and performant open-source frameworks including scikit-learn, XGBoost, TensorFlow, and John Snow Labs NLP, among others, helping data scientists focus on delivering value through data rather than spending time and effort managing infrastructure, packages, and dependencies.

In this example, we build a simple scikit-learn pipeline to classify complaints into the four major product categories seen in the t-SNE plot and to predict the severity of complaints by training on previously disputed claims. Whilst Delta Lake provides reliability and performance for your data, MLflow provides efficiency and transparency for your insights. Every ML experiment is tracked and its hyperparameters logged in a common place, resulting in high-quality artifacts one can trust and act upon.
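For reference, one possible definition of such a pipeline is sketched below so that the MLflow snippet that follows has a concrete pipeline and train/test split to work with; the vectorizer and estimator choices (TF-IDF plus logistic regression) are illustrative rather than the exact ones used in the referenced notebooks:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the anonymized complaints; table and column names are illustrative
data = (spark.read.table("complaints.complaints_anonymized")
        .select("product", "complaint").toPandas())

X_train, X_test, y_train, y_test = train_test_split(
    data["complaint"].fillna(""), data["product"], test_size=0.2, random_state=42)

# A simple text classification pipeline
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=10000)),
    ("clf", LogisticRegression(max_iter=1000)),
])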

import mlflow
import mlflow.sklearn

with mlflow.start_run(run_name='complaint_classifier'):

  # Train pipeline and evaluate on the held-out test set
  pipeline.fit(X_train, y_train)
  y_pred = pipeline.predict(X_test)
  accuracy = accuracy_score(y_test, y_pred)

  # Log pipeline and metrics to mlflow
  mlflow.sklearn.log_model(pipeline, "pipeline")
  mlflow.log_metric("accuracy", accuracy)

With all experiments logged in one place, data scientists can easily find the best model fit, enabling operations teams to retrieve the approved model (as part of their model risk management process) and surface those insights to end users or downstream processes, shortening the model lifecycle from months to weeks.

# load our model as a spark UDF
model_udf = mlflow.pyfunc.spark_udf(spark, "models:/complaints/production")

# load our model as a SQL function
spark.udf.register("classify", model_udf)

# classify complaints in real time
(spark
  .readStream
  .table("complaints_fsi.complaints_anonymized")
  .withColumn("product", model_udf("complaint")))

Now that we can apply ML to automatically classify and re-route new complaints in real time, as they arrive, the ability to use the same UDF in SQL gives business analysts a way to interact directly with our models while querying data for visualization.

SELECT 
  received_date, 
  classify(complaint) AS product,
  COUNT(1) AS total
FROM 
  complaints.complaints_anonymized
GROUP BY
  received_date,
  product

Databricks’ approach to reputational risk assessment augments BI with artificial intelligence for a more descriptive analysis of complaints and disputes in reputational risk management.

This enables us to produce further actionable insights using Databricks’ notebook visualizations or Redash, an easy-to-use web-based visualization and dashboarding tool within Databricks that lets users explore, query, visualize, and share data. Using simple SQL syntax, we can easily look at complaints attributed to different products over a period of time in a given location. Implemented on a stream, this can provide rapid insights for advocacy teams to act on and respond to customers. For example, typical complaints from customers involve identity theft and data security, which can have huge implications for brand reputation and carry large fines from regulators. These types of incidents can be managed by building the pipelines outlined in this blog post, helping enterprises manage reputation risk as part of a corporate strategy for happy customers in a changing digital landscape.

Building reputation risk into corporate governance strategy

Throughout this blog, we showed how enterprises can harness the Databricks Unified Data Analytics Platform to build a risk engine that analyzes customer feedback, both securely and in real time, allowing early assessment of reputational risks. While the blog highlights data sourced from the CFPB, this approach can be applied to other sources of data such as social media, direct customer feedback, and other unstructured sources. This enables data teams to collaborate and iterate quickly on building reputation risk platforms that scale as data volumes grow, while utilizing best-of-breed open-source AI tools.

Try the notebooks below on Databricks to harness the power of AI to mitigate reputation risk, and contact us to learn more about how we assist FSIs with similar use cases.

  1. Using Delta Lake for ingesting anonymized customer complaints in real time
  2. Exploring complaints data at scale using Koalas
  3. Leverage AI to better operate customer complaints
  4. Supercharge your BI reports with augmented intelligence

--

Try Databricks for free. Get started today.

The post Reputation Risk: Improving Business Competency and Nurturing Happy Customers by Building a Risk Analysis Engine appeared first on Databricks.

Ten Simple Databricks Notebook Tips & Tricks for Data Scientists


Often, small things make a huge difference, hence the adage that “some of the best ideas are simple!” Over the course of a few releases this year, and in our efforts to make Databricks simple, we have added several small features in our notebooks that make a huge difference.

In this blog and the accompanying notebook, we illustrate simple magic commands and explore small user-interface additions to the notebook that shave time from development for data scientists and enhance developer experience.

Collectively, these enriched features include the following:

  1. %pip install
  2. %conda env export and update
  3. %matplotlib inline
  4. %load_ext tensorboard and %tensorboard
  5. %run auxiliary notebooks to modularize code
  6. Upload data
  7. MLflow: Dynamic Experiment counter and Reproduce run button
  8. Simple UI nuggets and nudges
  9. Format SQL code
  10. Web terminal to log into the cluster

For brevity, we summarize each feature usage below. However, we encourage you to download the notebook. If you don’t have Databricks Unified Analytics Platform yet, try it out here. Import the notebook in your Databricks Unified Data Analytics Platform and have a go at it.

1. Magic command %pip: Install Python packages and manage Python Environment

Databricks Runtime (DBR) or Databricks Runtime for Machine Learning (MLR) installs a set of Python and common machine learning (ML) libraries. But the runtime may not have a specific library or version pre-installed for your task at hand. To that end, you can customize and manage your Python packages on your cluster just as easily as on your laptop using %pip and %conda.

Before the release of this feature, data scientists had to develop elaborate init scripts: build a wheel file locally, upload it to a DBFS location, and use init scripts to install packages. This approach is brittle. Now, you can use %pip install <package> from your private or public repo.

%pip install vaderSentiment

Alternatively, if you have several packages to install, you can use %pip install -r <path>/requirements.txt.

To further understand how to manage a notebook-scoped Python environment, using both pip and conda, read this blog.

2. Magic command %conda and %pip: Share your Notebook Environments

Once your environment is set up for your cluster, you can do a couple of things: a) preserve the file to reinstall for subsequent sessions and b) share it with others.

Since clusters are ephemeral, any packages installed will disappear once the cluster is shut down. A good practice is to preserve the list of packages installed. This helps with reproducibility and helps members of your data team to recreate your environment for developing or testing. With %conda magic command support as part of a new feature released this year, this task becomes simpler: export and save your list of Python packages installed.

%conda env export -f /jsd_conda_env.yml or %pip freeze > /jsd_pip_env.txt

From a common shared or public dbfs location, another data scientist can easily use %conda env update -f <yaml_file_path> to reproduce your cluster’s Python packages’ environment.

3. Magic command %matplotlib inline: Display figures inline

As part of an Exploratory Data Analysis (EDA) process, data visualization is a paramount step. After initial data cleansing, but before feature engineering and model training, you may want to visually examine the data to discover any patterns and relationships.

Among many data visualization Python libraries, matplotlib is commonly used to visualize data. Although DBR or MLR includes some of these Python libraries, only matplotlib inline functionality is currently supported in notebook cells.

With this magic command built-in in the DBR 6.5+, you can display plots within a notebook cell rather than making explicit method calls to display(figure) or display(figure.show()) or setting spark.databricks.workspace.matplotlibInline.enabled = true.
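As a minimal, self-contained illustration (the data here is synthetic), the following cell renders a figure inline without any explicit display call:

import numpy as np
import matplotlib.pyplot as plt

# With %matplotlib inline built into DBR 6.5+, the figure below renders
# directly under the cell with no explicit display(fig) call
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.title("Rendered inline below the notebook cell")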

4. Magic command %tensorboard with PyTorch or TensorFlow

Recently announced in a blog as part of the Databricks Runtime (DBR), this magic command displays your training metrics from TensorBoard within the same notebook. This new functionality deprecates the dbutils.tensorboard.start(), which requires you to view TensorBoard metrics in a separate tab, forcing you to leave the Databricks notebook and breaking your flow.

No longer must you leave your notebook and launch TensorBoard from another tab. The in-place visualization is a major improvement toward simplicity and developer experience.

While you can use either the TensorFlow or PyTorch libraries installed on a DBR or MLR for your machine learning models, we use PyTorch for this illustration (see the notebook for code and display).

%load_ext tensorboard

%tensorboard --logdir=./runs

5. Magic command %run to instantiate auxiliary notebooks

Borrowing common software design patterns and practices from software engineering, data scientists can define classes, variables, and utility methods in auxiliary notebooks. That is, they can “import”—not literally, though—these classes as they would from Python modules in an IDE, except in a notebook’s case, these defined classes come into the current notebook’s scope via a %run auxiliary_notebook command.

Though not a new feature like some of the ones above, this usage makes the driver (or main) notebook easier to read and a lot less cluttered. Some developers use these auxiliary notebooks to split data processing into distinct notebooks, each for data preprocessing, exploration, or analysis, bringing the results into the scope of the calling notebook.

Another candidate for these auxiliary notebooks is reusable classes, variables, and utility functions. For example, Utils and RFRModel, along with other classes, are defined in the auxiliary notebook cls/import_classes. After running %run ./cls/import_classes, all the classes come into the scope of the calling notebook. With this simple trick, you don’t have to clutter your driver notebook. Just define your classes elsewhere, modularize your code, and reuse them!
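As a minimal sketch (the auxiliary notebook path comes from the example above, while the constructor calls and arguments are illustrative), the driver notebook first pulls the classes into scope in a cell of its own:

%run ./cls/import_classes

and can then use them directly in subsequent cells:

# Utils and RFRModel are defined in ./cls/import_classes; arguments are illustrative
utils = Utils()
model = RFRModel(params={"n_estimators": 100})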

6. Fast Upload new data

Sometimes you may have access to data that is available locally on your laptop that you wish to analyze using Databricks. A new Upload Data feature, in the notebook File menu, uploads local data into your workspace. The target directory defaults to /shared_uploads/your-email-address; however, you can select the destination and use the code from the Upload File dialog to read your files. In our case, we select the pandas code to read the CSV files.

Once uploaded, you can access the data files for processing or machine learning training.

A new feature Upload Data, with a notebook File menu, uploads local data into your workspace.
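The snippet generated by the Upload File dialog looks roughly like the following; the path shown here is illustrative and will reflect your own e-mail address and file name:

import pandas as pd

# The /dbfs prefix exposes DBFS paths to local file APIs such as pandas
df = pd.read_csv("/dbfs/FileStore/shared_uploads/your-email-address/uploaded_data.csv")
df.head()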

7.1 MLflow Experiment Dynamic Counter

The MLflow UI is tightly integrated within a Databricks notebook. As you train your model using MLflow APIs, the Experiment label counter dynamically increments as runs are logged and finished, giving data scientists a visual indication of experiments in progress.

By clicking on the Experiment, a side panel displays a tabular summary of each run’s key parameters and metrics, with ability to view detailed MLflow entities: runs, parameters, metrics, artifacts, models, etc.

7.2 MLflow Reproducible Run button

Another feature improvement is the ability to recreate a notebook run to reproduce your experiment. From any of the MLflow run pages, a Reproduce Run button allows you to recreate a notebook and attach it to the current or shared cluster.

Another feature improvement is the ability to recreate a notebook run to reproduce your experiment

8. Simple UI nuggets and task nudges

To offer data scientists a quick peek at data, the ability to undo deleted cells, split-screen views, or simply a faster way to carry out tasks, the notebook improvements include:

Light bulb hint for better usage or faster execution: Whenever a block of code in a notebook cell is executed, the Databricks runtime may nudge you with a hint about a more efficient way to execute the code or about additional features that augment the current cell’s task. For example, if you are training a model, it may suggest tracking your training metrics and parameters using MLflow.

Databricks’ notebooks include pop-up hints and tips, such as suggesting the use of MLflow to track training metrics and parameters, to promote better usage or faster execution.

Or if you are persisting a DataFrame in Parquet format as a SQL table, it may recommend using a Delta Lake table for efficient and reliable future transactional operations on your data source. Also, if the underlying engine detects that you are performing a complex Spark operation that can be optimized, or joining two uneven Spark DataFrames—one very large and one small—it may suggest that you enable Apache Spark 3.0 Adaptive Query Execution for better performance.

These little nudges can help data scientists or data engineers capitalize on the underlying Spark’s optimized features or utilize additional tools, such as MLflow, making your model training manageable.

Undo deleted cells: How many times have you developed vital code in a cell and then inadvertently deleted it, only to realize that it’s gone, irretrievable? Now you can undo deleted cells, as the notebook keeps track of deleted cells.

Databricks’ notebooks include a short-cut undo feature that allows you to ‘Undo’ deleted cells at the click of a button.

Run All Above: In some scenarios, you may have fixed a bug in cells above the current one and wish to run them all again from the current cell. This handy option does exactly that.

Another Databricks notebook shortcut allows you to easily rerun previously executed cells, for example after fixing a bug in an earlier cell.

Tab for code completion and function signature: Both for general Python 3 functions and Spark 3.0 methods, pressing Tab after a method name shows a drop-down list of methods and properties you can select for code completion.

Tab for code completion and function signature

Use Side-by-Side view:

As in a Python IDE such as PyCharm, you can compose your markdown files and view their rendering in a side-by-side panel within the notebook. Select View->Side-by-Side to compose and view a notebook cell.

With Databricks’ notebooks you can, as in a Python IDE and PyCharm, compose your markdown files and view their rendering in a side-by-side panel.

9. Format SQL code

Though not a new feature, this trick lets you quickly type free-form SQL code and then use the cell menu to format it.

With Databricks’ notebooks you can quickly and easily type in a free-formatted SQL code and then use the cell menu to format the SQL code.

10. Web terminal to log into the cluster

Any member of a data team, including data scientists, can directly log into the driver node from the notebook. There is no need for %sh ssh magic commands, which require tedious SSH setup and authentication tokens. Moreover, system administrators and security teams loathe opening the SSH port to their virtual private networks. As a user, you do not need to set up SSH keys to get an interactive terminal to the driver node of your cluster. If your Databricks administrator has granted you “Can Attach To” permissions on a cluster, you are set to go.

Announced in the blog, this feature offers a full interactive shell and controlled access to the driver node of a cluster. To use the web terminal, simply select Terminal from the drop down menu.

Databricks’ notebooks provides any member of the data team full access to the interactive shell and controlled access to the driver node of a cluster via a simple drop-down menu.

Collectively, these features—little nudges and nuggets—can reduce friction and make your code flow more easily into experimentation, presentation, or data exploration. Give one or more of these simple ideas a go next time you are in your Databricks notebook.

Download the notebook today and import it to Databricks Unified Data Analytics Platform (with DBR 7.2+ or MLR 7.2+)  and have a go at it.

To discover how data teams solve the world’s tough data problems, come and join us at the Data + AI Summit Europe.

--

Try Databricks for free. Get started today.

The post Ten Simple Databricks Notebook Tips & Tricks for Data Scientists appeared first on Databricks.

Announcing Databricks on AWS Quick Starts to Deploy in Under 15 minutes


We are pleased to announce the availability of Databricks on the AWS Quick Starts program. With this release, our customers can easily deploy the Databricks Unified Data Analytics Platform on Amazon Web Services (AWS) along with the rest of their infrastructure using a flexible and powerful tool.

The AWS Quick Start includes an AWS CloudFormation template that rapidly automates the deployment of a Databricks workspace, along with a deployment guide that outlines the architecture and provides step-by-step deployment instructions.

Given that security privileges may vary from customer to customer, we have created an AWS Quick Starts solution that allows customers to automate:

  1. The deployment of a Databricks workspace with the creation of a new cross-account IAM role
  2. The deployment of a Databricks workspace using an existing cross-account IAM role

AWS CloudFormation templates, custom resource and AWS Lambda

AWS CloudFormation is a service that enables you to describe and provision all the infrastructure resources in your cloud environment. The Databricks CloudFormation templates are written in YAML and extended by an AWS Lambda-backed custom resource written in Python.

You also have direct access to the templates. You can download them, customize them, and extract interesting elements for use in your projects. Please see the deployment guide for customizing the templates.

How to get started?

The Quick Start solution launches a CloudFormation template that creates and configures the necessary AWS resources to deploy and configure a Databricks workspace, invoking the Databricks API for a given Databricks account, AWS account, and region.

The Databricks Quick Start solution is available under the Analytics, Data Lake, and Machine Learning & AI categories, or by simply filtering using the search bar.

You can then review the full deployment guide from the Databricks reference deployment page which contains the architecture overview, deployment options and steps.

Simple and easy deployment process

The deployment process is simple and will complete in less than 15 minutes. First, you’ll need to be signed into your AWS account prior to launching the deployment. If you don’t already have an AWS account, sign up at https://aws.amazon.com. Then, from the CloudFormation UI, select the template of your choice and the region where you want to deploy your Databricks workspace.

In 15 minutes, with just a few parameters …

The template is easy to follow and requires only a few mandatory parameters to launch the workspace deployment, as follows (a programmatic sketch is shown after the list):

  1. CloudFormation stack name
  2. E2 account ID
  3. Username
  4. Password
  5. Unique workspace deployment name
  6. Unique IAM role
  7. Unique S3 root bucket
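For teams that prefer to script the launch, here is a hedged boto3 sketch of the same deployment; the template URL and parameter keys below are illustrative placeholders rather than the exact names used by the published Quick Start template.

import boto3

cfn = boto3.client("cloudformation", region_name="us-west-2")

# Launch the Quick Start stack; all keys and values below are placeholders
cfn.create_stack(
    StackName="databricks-workspace",
    TemplateURL="https://<quickstart-bucket>.s3.amazonaws.com/databricks-workspace.template.yaml",
    Capabilities=["CAPABILITY_NAMED_IAM"],
    Parameters=[
        {"ParameterKey": "AccountId", "ParameterValue": "<E2 account ID>"},
        {"ParameterKey": "Username", "ParameterValue": "<account email>"},
        {"ParameterKey": "Password", "ParameterValue": "<account password>"},
        {"ParameterKey": "WorkspaceName", "ParameterValue": "my-workspace-deployment"},
        {"ParameterKey": "IAMRole", "ParameterValue": "my-cross-account-role"},
        {"ParameterKey": "BucketName", "ParameterValue": "my-workspace-root-bucket"},
    ],
)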

Ability to innovate while enforcing security controls and governance

With the general availability of E2, our AWS customers can unleash the potential of their data teams, enabling them to solve the toughest data problems on a highly secure, scalable, and simple-to-manage data analytics and machine learning platform.

Innovation is 15 minutes away!

What’s Next?

The new features for the Unified Data Analytics Platform on AWS are now available in the following AWS Regions: ap-south-1, ap-southeast-2, us-west-1, us-west-2, us-east-1, us-east-2, ca-central-1, eu-west-1, and eu-central-1. Learn more about how we are enabling you with comprehensive platform security, elastic scalability, and 360° administration for all your data analytics and machine learning needs.

You can also watch the on-demand webinar discussing how to Simplify, Secure, and Scale your Enterprise Cloud Data Platform on AWS in an automated way.

--

Try Databricks for free. Get started today.

The post Announcing Databricks on AWS Quick Starts to Deploy in Under 15 minutes appeared first on Databricks.

Announcing Azure Databricks Power BI Connector (Public Preview)


Databricks and Microsoft Power BI customers will be delighted to know that an enhanced Azure Databricks Power BI connector is now natively integrated into Power BI Desktop (2.85.681.0 and above) and the Power BI service!

The native connector lets users connect to Databricks from Power BI Desktop with a couple of clicks, using Azure Active Directory (Azure AD) credentials and SSO for Power BI service users. With support for DirectQuery, users access data directly in Databricks, querying fresh data while data lake security controls are enforced – there is no need to duplicate security controls in Power BI. The Databricks ODBC driver has been further optimised to speed up the transfer of results.

Support for Azure AD and SSO for PowerBI Service
Users can use their Azure AD credentials to connect to Databricks. Power BI service users can access shared reports using SSO, with their own Azure AD credentials applied when accessing Databricks in DirectQuery mode. Administrators no longer need to generate PAT tokens for authentication.

Simple connection configuration
The new Databricks connector is natively integrated into Power BI. Connections to Databricks are configured with a couple of clicks. In Power BI Desktop, users select Databricks as a data source (1), authenticate once using Azure AD (2), and enter the Databricks-specific connection details (3). Just like that, you are ready to query the data!

(1)
New Databricks connector for Microsoft PowerBI lets you quickly access data and speed queries with just a few clicks.
(2)
New Databricks connector for Microsoft PowerBI lets you quickly access data and speed queries with just a few clicks.
(3)
New Databricks connector for Microsoft PowerBI lets you quickly access data and speed queries with just a few clicks.

Direct access to Data Lake via DirectQuery
When using Power BI DirectQuery, data is accessed directly in Databricks, allowing users to query and visualise large datasets without the size limitations imposed by import queries. Query results are always fresh and Delta Lake data security controls are enforced. For Power BI service users, SSO ensures that users access Databricks with their own credentials. There is no need to duplicate security controls in Power BI.

Faster results via Databricks ODBC
The Databricks ODBC driver has been optimised with reduced query latency, increased result transfer speed based on Apache Arrow serialization, and improved metadata retrieval performance.

The enhanced Azure Databricks connector is the result of an on-going collaboration between Databricks and Microsoft. Take this enhanced connector for a test drive to improve your Databricks connectivity experience, and let us know what you think. We would love to hear from you!

References
Azure Databricks Power BI Documentation
Databricks ODBC driver release notes.

--

Try Databricks for free. Get started today.

The post Announcing Azure Databricks Power BI Connector (Public Preview) appeared first on Databricks.

Quickly Deploy, Test, and Manage ML Models as REST Endpoints with MLflow Model Serving on Databricks


MLflow Model Registry now provides turnkey model serving for dashboarding and real-time inference, including code snippets for tests, controls, and automation.

MLflow Model Serving on Databricks provides a turnkey solution to host machine learning (ML) models as REST endpoints that are updated automatically, enabling data teams to own the end-to-end lifecycle of a real-time machine learning model from training to production.

Since its launch, Model Serving has enabled many Databricks customers to seamlessly deliver their ML models as REST endpoints without having to manage additional infrastructure or configure integrations. To simplify Model Serving even more, the MLflow Model Registry now shows the serving status of each model and deep links into the Model Serving page.

To simplify Model Serving even more, the MLflow Model Registry now shows the serving status of each model and deep links into the Model Serving page.

To simplify the consumption of MLflow models even more, the Model Serving page now provides curl and Python snippets for making requests to the model. Requests can be made either to the latest version at a deployment stage, e.g. model/clemens-windfarm-signature/Production, or to a specific version number, e.g. model/clemens-windfarm-signature/2.

To simplify the consumption of MLflow Models even more, the Model Serving page now provides curl and Python snippets to make requests to the model.
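As an illustration of what those snippets look like, here is a hedged Python sketch of scoring the Production version of the model over REST; the workspace URL, token, and input columns are placeholders, and the exact snippet is generated for you on the Model Serving page.

import requests
import pandas as pd

# Placeholders: substitute your workspace URL and a valid access token
url = "https://<databricks-instance>/model/clemens-windfarm-signature/Production/invocations"
headers = {"Authorization": "Bearer <personal-access-token>"}

# Two example records in pandas 'records' orientation; the feature columns
# here are illustrative and depend on your model's signature
records = pd.DataFrame([
    {"wind_speed": 9.2, "wind_direction": 180},
    {"wind_speed": 4.1, "wind_direction": 95},
]).to_dict(orient="records")

response = requests.post(url, headers=headers, json=records)
print(response.json())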

Databricks customers have utilized Model Serving for several use cases, including making model predictions in dashboards and serving forecasts for finance teams. Freeport-McMoRan is serving TensorFlow models to simulate operations for their plants:

“We simulate different scenarios for our plants and operators need to review recommendations in real-time to make decisions, optimizing plant operations and saving cost. Databricks MLflow Model Serving enables us to seamlessly deliver low latency machine learning insights to our operators while maintaining a consolidated view of end to end model lifecycle.”

Model Serving on Databricks is now in public preview and provides cost-effective, one-click deployment of models for real-time inference, tightly integrated with the MLflow Model Registry for ease of management. See our documentation for how to get started [AWS, Azure]. While this service is in preview, we recommend its use for low throughput and non-critical applications.

--

Try Databricks for free. Get started today.

The post Quickly Deploy, Test, and Manage ML Models as REST Endpoints with MLflow Model Serving on Databricks appeared first on Databricks.
