
Profit-Driven Retention Management with Machine Learning


Companies with the highest loyalty ratings and retention rates grew revenues 250% faster than their industry peers and delivered two to five times the shareholder returns over a 10-year period. Earning loyalty and getting the largest number of customers to stick around is in the best interest of both a company and its customer base.

So why do companies struggle with retention? Other than some subscription-based businesses such as telecom that report Average Revenue Per User (ARPU), most companies aren’t required or compelled to disclose this in public filings. Many companies focus on functional priorities instead of the customer, believing customer loyalty will naturally emerge through these efforts. In fact, a recent survey by Nielsen highlights that addressing “customer churn is the last priority when it comes to companies’ marketing objectives.”

This is particularly problematic in light of increasing evidence that customers are rethinking how and where they spend their money. And while most studies identify these shifts as part of consumers’ response to COVID, the reality is that this growing disinterest in brand loyalty predates the current crisis.

Value Is Delivered Over a Customer’s Lifetime

Customer retention must become a priority for any company seeking long-term growth. In a series of recent posts on customer lifetime value in both subscription and non-subscription models, we examined how retention plays a critical role in building profitable customer relationships. At a minimum, customers need to remain engaged long enough for a firm to offset their acquisition costs, but an ideal relationship continues to deliver profits well beyond this.

Figure 1. Churn at different stages of the customer lifetime journey

The key to effectively managing retention, and reducing your churn rate, is developing an understanding of how a customer lifetime should progress (Figure 1) and examining where in that lifetime journey customers are likely to churn. In early stages, customers are still learning about the products and services they are consuming and how best to derive benefits from them. Proactive engagement to encourage the adoption of behaviors that maximize these benefits may help transition customers into later stages of sustained consumption. In those later stages, connecting with customers through brand identity can not only encourage continued loyalty but help customers become brand ambassadors, helping to organically bring new customers to the business and reducing on-boarding challenges.

You Can’t Save Them All…

When a customer abandons the relationship, it is important that we understand why. Some churn may represent the natural conclusion of a long-standing relationship which has finally run its course. In such scenarios, we may continue to derive value by transitioning the customer to other products and services within our portfolio or provided by a partner organization.  Or we may simply allow the customer to leave, content in knowing a satisfied customer is likely to continue to serve as a net promoter.

It’s when a customer leaves prematurely that we need to take corrective action. Churn in early stages of the lifetime journey may indicate difficulty in using products or services or recognizing value through them.  Churn in later stages may indicate diminished value, real or perceived, due to changes in the product, its delivery or the competitive landscape.  And at any stage, business process issues such as the failure to identify expiring credit cards may inadvertently push customers out.  The specific reasons for churn are highly varied and each requires a different kind of response both at the individual and the organizational levels.

Nor Should You Try 

When addressing the individual, it’s important to consider the costs and benefits of any corrective action. Every customer has a potential value to the firm, derived over the lifetime of the relationship. The cost of avoiding churn, whether through promotions, discounts or other incentives, should never exceed the residual value we might hope to preserve. Our goal should always be to retain profitability.

This requires not only a careful consideration of an individual’s CLV but also the cost of implementing an active retention campaign as a whole. The planning and administration as well as the labor costs associated with consistent, sustained engagement must be averaged over the (ideally large) fraction of at-risk customers retained.

This is in no way intended to discourage organizations from pursuing a retention management strategy. Indeed, numerous studies have shown that it costs five times (or more) as much to acquire a new customer as to retain an existing one, and that firms may see as much as a 95% increase in profits with each 5% reduction in churn. Still, we must be careful to recognize and address the macro-level patterns behind churn to keep retention pressure down, while also selectively engaging the at-risk customers with whom there is the most at stake. And this is where machine learning and predictive analytics can help.

Use Machine Learning to Quantify Likelihood of Churn 

The signals customers emit ahead of departure are often buried in the noise of overall customer activity. Preventing a customer from leaving requires some amount of advance notice, which is obtained through the careful examination of large volumes of historical data, something for which machine learning models are ideally suited.

Classic techniques for proactive churn detection, such as logistic regression or decision trees, may be insensitive to events like churn that (ideally) occur with low frequency (Figure 2). More modern techniques such as neural networks and gradient boosted trees are much more capable of picking up on the subtle shifts in patterns that denote churn, but require careful configuration and evaluation to do so.

Figure 2. The imbalance between churning and not-churning classes in a real-world dataset
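
To make this concrete, here is a minimal, illustrative sketch (not our Solution Accelerator itself) of training a gradient boosted tree classifier on an imbalanced churn dataset and scoring churn probabilities. The feature path, column names and parameters are assumptions:

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

features = pd.read_parquet("/dbfs/tmp/churn_features.parquet")  # hypothetical feature table
X = features.drop(columns=["churned"])
y = features["churned"]  # 1 = churned, 0 = retained (the rare positive class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# weight the rare churn class more heavily so the model does not simply
# predict "retained" for everyone
weights = compute_sample_weight(class_weight="balanced", y=y_train)

model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
model.fit(X_train, y_train, sample_weight=weights)

# score churn *probabilities*; average precision is far more informative
# than accuracy when the classes are heavily imbalanced
churn_risk = model.predict_proba(X_test)[:, 1]
print("Average precision:", average_precision_score(y_test, churn_risk))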

The key to success with these models is to move away from a will-they-or-won’t-they mindset and instead embrace the uncertainty inherent in any churn prediction. When we begin to examine all vulnerable customers as having a quantifiable risk of churning, we can focus on removing uncertainty from our calculations. Armed with more reliable predictions of churn risk, we can more carefully examine the residual CLV associated with individual customers and make more targeted decisions regarding when and how to intervene.

Use Databricks to Focus on Business Outcomes

Machine learning, and data science in general, is not easy. But bringing the data together with specialized software, managing the infrastructure to enable model processing and frequent reprocessing, and delivering outputs to downstream business systems shouldn’t be what consumes your organization’s time.

With elastic, cloud-based infrastructure under a platform that comes with the most popular machine learning libraries pre-integrated, your data scientists have immediate access to the capabilities needed to get in motion. Using pre-integrated frameworks like hyperopt and mlflow, the previously laborious and time-consuming chore of optimizing model performance and configurations can be automated (Figure 3). And backed by a powerful, dynamically scalable data processing engine, the mountains of data within which customer signals reside can be quickly and efficiently examined.

Figure 3. Model precision relative to various hyperparameter values
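
As an illustration of that automation, here is a compact sketch of tuning the gradient boosted tree from the previous snippet with hyperopt while logging every trial to MLflow. The search space, data (X_train, X_test, y_train, y_test, weights from above) and metric are assumptions rather than the accelerator’s exact setup:

import mlflow
from hyperopt import fmin, tpe, hp, STATUS_OK
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score

search_space = {
    "n_estimators": hp.choice("n_estimators", [100, 200, 400]),
    "max_depth": hp.choice("max_depth", [2, 3, 4, 5]),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}

def objective(params):
    # each trial is logged as a nested MLflow run so configurations are comparable later
    with mlflow.start_run(nested=True):
        model = GradientBoostingClassifier(**params)
        model.fit(X_train, y_train, sample_weight=weights)
        ap = average_precision_score(y_test, model.predict_proba(X_test)[:, 1])
        mlflow.log_params(params)
        mlflow.log_metric("avg_precision", ap)
        # hyperopt minimizes, so return the negative of the metric we want to maximize
        return {"loss": -ap, "status": STATUS_OK}

with mlflow.start_run(run_name="churn_tuning"):
    best = fmin(fn=objective, space=search_space, algo=tpe.suggest, max_evals=25)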

To see how these capabilities come together to tackle customer churn prediction, check out our Solution Accelerator assets, which demonstrate how to go from raw data to prediction using real-world data.

--

Try Databricks for free. Get started today.



A Guide to the Databricks + AWS Cloud Data Lake Dev Day Workshop


The Databricks team has been working hard to recreate content and enhance the experience as we transition all our events to a virtual experience. And we’ve learned a thing or two about what you want to learn: integrating Databricks with AWS services. So we’re excited to present our Cloud Data Lake Dev Day Workshop in partnership with AWS and Onica.

Organizations want to leverage the wealth of data accumulated in their data lake for deep analytics insights. However, most organizations struggle with preparing data for analytics and automating data pipelines to leverage new data as data lakes are constantly updated. Making the shift to automated data pipelines can be challenging, but it’s become more urgent as the COVID-19 pandemic accelerates the move to a completely virtual workforce and collaborative problem solving.

Learn how to move from manual management of data pipelines to seamless automation in this collaborative workshop with experienced partners and customers to pave the way. Join us Thursday, August 27, at 9:00 AM PDT to experience a deep dive into the technology that makes up a modern cloud-based data and analytics platform. The sessions will include live chat interactions with our system architects to answer all your questions.

Meet our speakers

Arsalan Tavakoli-Shiraji, Co-Founder and SVP of Field Engineering, Databricks
Prior to Databricks, Arsalan was an associate principal at McKinsey & Company, where he advised enterprises, vendors and the public sector on a broad spectrum of strategic topics, including next-generation IT, cloud computing and big data initiatives, as well as general IT and corporate strategy. Arsalan received a Ph.D. in computer science from UC Berkeley in the area of Networking and Distributed Systems and a B.Eng. from the University of Virginia.

Kevin Miller, General Manager, Amazon S3, AWS
With more than 8 years of experience on the Amazon Web Services team, Kevin has a deep understanding of the storage technologies and options available to a wide range of industries. Kevin can also speak to customer experiences with S3 and Databricks. Prior to Amazon, Kevin was an assistant director at Duke University and member of the Technology Architecture Group where he was charged with establishing technical strategies and developing organizational technical maturity.

Sally Hoppe, Big Data System Architect, HP
Sally is a big data system architect at HP. With a background in math and computer science, she is a versatile software engineering professional with experience developing enterprise software solutions and managing cross-functional teams. While working for a large corporation, she has sought out opportunities in new businesses to learn new technologies and work with passionate coworkers. Because she likes to make order out of chaos, she frequently finds herself in positions that require deep technical knowledge and management skills.

Daniel Ferrante, Director of Platform Engineering, Digital Turbine
For the past 4+ years, Daniel has been leading the Platform and Data Engineering teams at Digital Turbine. He is well skilled in data engineering techniques using Apache Spark™, Scala, Python, Java and Spring Boot. His interests and focus lie in helping businesses succeed in the automation of business analytics and data mastery.

Traey Hatch, Practice Manager, Onica
For the past year, Traey has been at Onica, a cloud consulting and managed services company, helping businesses enable, operate and innovate on the cloud. From migration strategy to operational excellence and immersive transformation, Onica is a full spectrum AWS integrator, helping hundreds of companies realize the value, efficiency and productivity of the cloud. Traey has experience in creating greenfield data lakes for multiple customers ranging from POC projects to full production deployments.

Denis Dubeau, AWS Partner Solution Architect Manager, Databricks
Denis is a partner solution architect providing guidance and enablement on modernizing data lake strategies using Databricks on AWS. Denis is a seasoned professional with significant industry experience in data engineering and data warehousing with previous stops at Greenplum, Hortonworks, IBM and AtScale.

An overview of what you’ll learn:

  • How to build highly scalable and reliable data pipelines for analytics
  • How you can make your existing S3 data lake analytics-ready with open-source Delta Lake technology
  • Evaluate options to migrate current on-premises data lakes (Hadoop, etc.) to AWS with Databricks Delta
  • How to integrate that data with services such as Amazon SageMaker, Amazon Redshift, AWS Glue and Amazon Athena as well as how to leverage your AWS security and roles without moving your data out of your account
  • Understand open-source technologies like Delta Lake and Apache Spark that are portable and powerful at any organization and for any analytics use case

Get ready

  1. Register: If you have not registered for the event, you can do so here.
  2. Training: If you are new to Databricks and want to learn more, check out our free online training course here.
  3. Learn more about Databricks on AWS at www.databricks.com/aws

--

Try Databricks for free. Get started today.


%tensorboard – a new way to use TensorBoard on Databricks


Introduction

With the Databricks Runtime 7.2 release, we are introducing a new magic command %tensorboard. This brings the interactive TensorBoard experience Jupyter notebook users expect to their Databricks notebooks. The %tensorboard command starts a TensorBoard server and embeds the TensorBoard user interface inside the Databricks notebook for data scientists and machine learning engineers to visualize and debug their machine learning projects. We’ve made it much easier to use TensorBoard in Databricks.

Motivation

In 2017, we released the dbutils.tensorboard.start() API to manage and use TensorBoard inside Databricks Python notebooks. This API only permits one active TensorBoard process on a cluster at any given time, which hinders multi-tenant use cases. Early last year, TensorBoard released its own API for notebooks via the %tensorboard Python magic command. This API not only starts TensorBoard processes but also exposes TensorBoard’s command-line arguments in the notebook environment. In addition, it embeds the TensorBoard UI inside notebooks, whereas the dbutils.tensorboard.start API prints a link to open TensorBoard in a new tab.

Welcoming %tensorboard

Upgrading to the %tensorboard magic command in Databricks has allowed us to take advantage of TensorBoard’s new API features. It is now possible to have multiple concurrent TensorBoard processes on a cluster as well as to interact with a TensorBoard UI inline in a notebook.

We’ve built upon the TensorBoard experience to better integrate it into the Databricks workflow:

  • A link on top of the embedded TensorBoard UI to open TensorBoard in a new browser tab.
  • Notebook-scoped process re-use to improve performance.
  • The ability to stop a notebook’s TensorBoard servers and free up cluster resources by detaching the notebook or clearing its state.

With the introduction of the %tensorboard magic command, we are deprecating dbutils.tensorboard.start and plan to remove it in a future major Databricks Runtime release.

How to get started

Here’s how you can quickly start using %tensorboard in your machine learning project. Inside your Databricks notebook:

  1. Run %load_ext tensorboard to enable the %tensorboard magic command
  2. Start and view your TensorBoard by running %tensorboard --logdir $experiment_log_dir, where experiment_log_dir is the path to a directory in DBFS dedicated to TensorBoard logs.
  3. Use TensorBoard callbacks, or TensorFlow or PyTorch file writers, to generate logs during your training process. To make sure your logs are separated by run, set the log directory to a run-specific subdirectory in DBFS. For TensorFlow, this is as simple as the snippet below (a PyTorch sketch follows this list):
import datetime
from tensorflow.keras import callbacks

# write logs to a run-specific subdirectory so each run shows up separately in TensorBoard
log_dir = experiment_log_dir + "/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
# in your model.fit call
model.fit(
    …
    callbacks=[
        …
        tensorboard_callback
    ]
)
  4. Refresh your TensorBoard user interface to visualize your training process using the data you just generated.
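
For PyTorch, a minimal sketch looks similar; here experiment_log_dir is the same DBFS directory as above, and num_epochs, model, optimizer, train_loader and train_one_epoch are placeholders for your own training loop:

import datetime
from torch.utils.tensorboard import SummaryWriter

log_dir = experiment_log_dir + "/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
writer = SummaryWriter(log_dir=log_dir)

for epoch in range(num_epochs):
    # train_one_epoch is a placeholder for your own training step
    train_loss = train_one_epoch(model, optimizer, train_loader)
    writer.add_scalar("loss/train", train_loss, epoch)

writer.close()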

For an end-to-end example, check out this notebook using TensorBoard in a TensorFlow project.

For more details on using %tensorboard in Databricks, you can read our official documentation.

--

Try Databricks for free. Get started today.


Women’s Equality Day: Cultivating Leadership, Communication and Negotiation Skills


Databricks celebrated Women’s Equality Day with a talk by Coco Brown

Databricks is committed to fostering an environment that promotes equality and inclusion. For Women’s Equality Day on August 26, we invited Coco Brown, founder and CEO of Athena Alliance — a digital platform dedicated to revolutionizing leadership — to present a talk about women in leadership and the path to the boardroom. This talk highlighted the importance of creating leadership opportunities for women.

At Databricks, we have a broad interconnected network of Employee Resource Groups (ERGs) that create and curate leadership and professional development opportunities for our employees. These include our Latinx, Black Employee, Queeries, and Women’s Networks.

The Women in Customer Success (WiCS) group, which is a partner to our Women’s Network ERG, is one example of an employee-led group focused on supporting women at Databricks. The mission of WiCS is to build a community that strengthens our culture of inclusion while supporting professional growth and leadership development for women. Chris Gilbert and Carrie Anderson, co-executive sponsors of WiCS, also invite other leaders in the company to participate in the monthly meetings to provide women with more executive face time.

This past month, the Women in Customer Success (WiCS) book club focused on learning more about the art of negotiation by reading and discussing “Never Split the Difference” by Chris Voss and Tahl Raz. If you don’t have time to read the book, here’s a fantastic 12-minute summary. Below, a few WiCS members spotlight their favorite quotes from the book that sparked a lively discussion about negotiation techniques:


Chengyin E.
Data Scientist

“‘No’ is not a failure. We have learned that ‘No’ is the anti-’Yes’ and, therefore, a word to be avoided at all costs. But it really often just means ‘Wait’ or ‘I’m not comfortable with that.’ Learn how to hear it calmly. It is not the end of the negotiation but the beginning.”


Lucille N.
Executive Assistant

“There are three types of ‘Yes’: counterfeit (which is utilized as an escape route), confirmation (an affirmation with no promise of movement), and commitment (an authentic agreement that points to action). Aim for a ‘commitment’ Yes.”


Nicole L.
Solutions Consultant

“Slow it down, make your counterpart feel in control, and show a sincere interest in his/her experience. Repeat back the last few key words, uncover the information you need with calibrated ‘what’ and ‘how’ questions, and lead the conversation toward a compassionate and solution-based outcome.”

 

To learn more about how you can join us, check out our Careers page.

--

Try Databricks for free. Get started today.


Enabling Spark SQL DDL and DML in Delta Lake on Apache Spark 3.0


Last week, we had a fun Delta Lake 0.7.0 + Apache Spark 3.0 AMA where Burak Yavuz, Tathagata Das, and Denny Lee provided a recap of Delta Lake 0.7.0 and answered your Delta Lake questions. The theme of this AMA was the release of Delta Lake 0.7.0, which coincided with the release of Apache Spark 3.0 and enabled a new set of Delta Lake features that can now be used directly from SQL.


Recap of Delta Lake 0.7.0

Here are some of the key highlights of Delta Lake 0.7.0 as recapped in the AMA; refer to the release notes for more information.

Support for SQL DDL commands to define tables in the Hive metastore

You can now define Delta tables in the Hive metastore and use the table name in all SQL operations when creating (or replacing) tables.

-- Create table in the metastore
CREATE TABLE events (
    date DATE,
    eventId STRING,
    eventType STRING,
    data STRING)
USING DELTA
PARTITIONED BY (date)
LOCATION '/delta/events'

-- If a table with the same name already exists, the table is replaced
-- with the new configuration, else it is created
CREATE OR REPLACE TABLE events (
    date DATE,
    eventId STRING,
    eventType STRING,
    data STRING)
USING DELTA
PARTITIONED BY (date)
LOCATION '/delta/events'
-- Alter table and schema
ALTER TABLE table_name ADD COLUMNS (
    col_name data_type 
        [COMMENT col_comment]
        [FIRST|AFTER colA_name],
    ...)      

You can also use the Scala/Java/Python APIs:

  • DataFrame.saveAsTable(tableName) and DataFrameWriterV2 APIs (#307).
  • DeltaTable.forName(tableName) API to create instances of io.delta.tables.DeltaTable which is useful for executing Update/Delete/Merge operations in Scala/Java/Python.
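
For example, a minimal Python sketch of these APIs might look like the following; the DataFrame df and the table and column names are illustrative:

from delta.tables import DeltaTable

# save a DataFrame as a metastore-defined Delta table
df.write.format("delta").partitionBy("date").saveAsTable("events")

# look the table up by name and run programmatic DML on it
events = DeltaTable.forName(spark, "events")
events.delete("date < '2017-01-01'")
events.update(
    condition="eventType = 'clck'",
    set={"eventType": "'click'"})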

Support for SQL Insert, Delete, Update and Merge

One of the most frequent questions from our Delta Lake Tech Talks was: when will DML operations such as delete, update, and merge be available in Spark SQL? Wait no more, these operations are now available in SQL! Below are examples of how you can write delete, update, and merge (i.e., insert, update, delete, and deduplication) operations using Spark SQL.

-- Using append mode, you can atomically add new data to an existing Delta table
INSERT INTO events SELECT * FROM newEvents

-- To atomically replace all of the data in a table, you can use overwrite mode
INSERT OVERWRITE events SELECT * FROM newEvents


-- Delete events
DELETE FROM events WHERE date < '2017-01-01'


-- Update events (e.g., fix a misspelled event type)
UPDATE events SET eventType = 'click' WHERE eventType = 'clck'

-- Upsert data to a target Delta 
-- table using merge
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN UPDATE
SET events.data = updates.data
WHEN NOT MATCHED THEN INSERT 
(date, eventId, data)
VALUES (date, eventId, data)      

It is worth noting that the merge operation in Delta Lake supports more advanced syntax than standard ANSI SQL. For example, merge supports:

  • Delete actions - Delete a target when matched with a source row. For example,  "... WHEN MATCHED THEN DELETE ..."
  • Multiple matched actions with clause conditions - Greater flexibility when target and source rows match. For example,
... 
WHEN MATCHED AND events.shouldDelete THEN DELETE 
WHEN MATCHED THEN UPDATE SET events.data = updates.data
  • Star syntax - Shorthand for setting a target column’s value from the similarly named source column. For example,
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
-- equivalent to updating/inserting with events.date = updates.date,
--     events.eventId = updates.eventId, events.data = updates.data

Refer to the Delta Lake documentation for more information.

Automatic and incremental Presto/Athena manifest generation

As noted in Query Delta Lake Tables from Presto and Athena, Improved Operations Concurrency, and Merge performance, Delta Lake allows other processing engines to read Delta tables by using manifest files; the manifest files contain the list of the most current version of files as of manifest generation. As described in that blog, you will need to:

  • Generate Delta Lake Manifest File
  • Configure Presto or Athena to read the generated manifests
  • Manually re-generate (update) the manifest file

New for Delta Lake 0.7.0 is the capability to update the manifest file automatically with the following command.

ALTER TABLE delta.`pathToDeltaTable` 
SET TBLPROPERTIES(
    delta.compatibility.symlinkFormatManifest.enabled=true
)   

For more information, please refer to the Delta Lake documentation.

Configuring your table through Table Properties

With the ability to set table properties on your table by using ALTER TABLE SET TBLPROPERTIES, you can enable, disable or configure many features of Delta such as automated manifest generation. For example, with table properties, you can block deletes and updates in a Delta table using delta.appendOnly=true.

You can also easily control the history of your Delta Lake table retention by the following properties:

  • delta.logRetentionDuration: Controls how long the history for a table (i.e. transaction log history) is kept. By default, thirty (30) days of history is kept but you may want to alter this value based on your requirements (e.g. GDPR historical context)
  • delta.deletedFileRetentionDuration: Controls how long ago a file must have been deleted before being a candidate for VACUUM.  By default, data files older than seven (7) days are deleted.

As of Delta Lake 0.7.0, you can use ALTER TABLE SET TBLPROPERTIES to configure these properties.

ALTER TABLE delta.`pathToDeltaTable`
SET TBLPROPERTIES(
    delta.logRetentionDuration = "interval <interval>",
    delta.deletedFileRetentionDuration = "interval <interval>"
)

For more information, refer to Table Properties in the Delta Lake documentation.

Support for Adding User-Defined Metadata in Delta Table Commits

You can specify user-defined strings as metadata in commits made by Delta table operations, either using the DataFrameWriter option userMetadata or the SparkSession configuration spark.databricks.delta.commitInfo.userMetadata (documentation).

In the following example, we are deleting a user (1xsdf1) from our data lake per user request.  To ensure we associate the user’s request with the deletion, we have also added the DELETE request ID into the userMetadata.

SET spark.databricks.delta.commitInfo.userMetadata={"GDPR":"DELETE Request 1x891jb23"};
DELETE FROM user_table WHERE user_id = '1xsdf1'
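
The same user-defined metadata can be attached from Python when writing with the DataFrameWriter; a minimal sketch (the DataFrame and path are illustrative):

(new_events_df.write
    .format("delta")
    .mode("append")
    .option("userMetadata", "GDPR: DELETE Request 1x891jb23")
    .save("/delta/events"))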

When reviewing the history operations of the user table (user_table), you can easily identify the associated deletion request within the transaction log.

Other Highlights

Other highlights for the Delta Lake 0.7.0 release include:

  • Support for Azure Data Lake Storage Gen2 - Spark 3.0 has support for Hadoop 3.2 libraries, which enables support for Azure Data Lake Storage Gen2 (documentation).
  • Improved support for streaming one-time triggers - With Spark 3.0, we now ensure that one-time trigger (Trigger.Once) processes all outstanding data in a Delta table in a single micro-batch even if rate limits are set with the DataStreamReader option maxFilesPerTrigger.

There were a lot of great questions during the AMA concerning structured streaming and using  trigger.once.  For more information, some good resources explaining this concept include:

Now to the Questions!

We had a lot of great questions during our AMA; below is a quick synopsis of some of those questions.

Can Delta tables be created on AWS Glue catalog service?

Yes, you can integrate your Delta Lake tables with the AWS Glue Data Catalog service.  The blog Transform Your AWS Data Lake using Databricks Delta and the AWS Glue Data Catalog Service provides a great how-to.

AWS Data Lake implementation using the Databricks Unified Analytics Platform.

It is important to note that not all of the Delta Lake metadata information is stored in Glue so for more details, you will still want to read the Delta Lake transaction log directly.

Can we query the Delta Lake metadata?  Does the cluster have to have live access to the metastore?

As noted in the previous question, there is a slight difference between the Delta Lake metadata vs. the Hive or Glue metastores.  The latter are metastores that act as catalogs to let any compatible framework determine what tables are available to query.

While the Delta Lake metadata contains this information, it also contains a lot of other information that may not be important for a metastore to catalog including the current schema of the table, what files are associated with which transaction, operation metrics, etc.  To query the metadata, you can use Spark SQL or DataFrame APIs to query the Delta Lake transaction log. For more information, refer to the Delta Lake Internals Online Tech Talks which dive deeper into these internals as well as provide example notebooks so you can query the metadata yourself.

Do we still need to define tables in Athena/Presto using symlinks or can we use the new SQL method of defining delta tables via glue catalog?

As noted earlier, one of the first steps to defining an Athena/Presto table is to generate manifests of a Delta table using Apache Spark. This task will generate a set of files - i.e. the manifest - that identifies which files Athena or Presto will read when looking at the most current catalog of data. The second step is to configure Athena/Presto to read those generated manifests. Thus, at this time, you will still need to create the symlinks so that Athena/Presto will be able to identify which files it needs to read.

Note, the SQL method for defining the Delta table defines the existence of the table and its schema but does not specify which files Athena/Presto should be reading (i.e. the snapshot of the latest Parquet files that make up the current version of the table). As Delta Lake table versions can change frequently (e.g. structured streams appending data, multiple batches running ForeachBatch statements to update the table, etc.), keeping a metastore in sync would very likely overload it with continuous updates to the metadata of the latest files.

Does the update, delete, merge immediately write out new Parquet files or use other tricks on the storage layer to minimize I/O?

As noted in the Delta Lake Internals Online Tech Talks, any changes to the underlying file system, whether from an update, delete, or merge, result in the addition of new files. Because Delta Lake writes new files every time, this process is not as storage-I/O intensive as (for example) a traditional delete that would require I/O to read the file, remove the deleted rows, and overwrite the original file. In addition, because Delta Lake uses a transaction log to identify which files are associated with each version of the data, reads are not nearly as storage-I/O intensive either. Instead of listing all of the files from distributed storage, which can be I/O intensive, time consuming, or both, Delta Lake can obtain the necessary files directly from its transaction log. Finally, deletes at partition boundaries are performed as pure metadata operations and are therefore very fast.

In our environment, the update happens as a separate process at regular intervals and the ETL happens on the bronze tables. Is it possible to leverage caching to improve performance for these processes?

For those who may be unfamiliar with bronze tables, this question references the Delta medallion architecture framework for data quality. We start with a fire hose of events that are written to storage as fast as possible as part of the data ingestion process; the data lands in these ingestion or Bronze tables. As you refine the data (joins, lookups, filtering, etc.) you create Silver tables. Finally, you have the features for your ML and/or aggregate table(s) - also known as Gold tables - on which to perform your analysis. For more information on the Delta Architecture, please refer to Beyond Lambda: Introducing Delta Architecture and Productionizing Machine Learning with Delta Lake.

Delta Lake Medallion Architecture framework for data quality.

With this type of architecture, as part of the Extract Transform Load (ETL) process, the extracted data is stored in your Bronze tables as part of data ingestion.  The transformations of your data (including updates) will occur as you go from Bronze to Silver resulting in your refined tables.

In terms of caching, there are two types of caching that may come into play: the Apache Spark cache and the Delta Engine cache, which is specific to Databricks. Using the Apache Spark cache via .cache and/or .persist allows you to keep data in memory, thus minimizing storage I/O. This can be especially useful when creating intermediary tables for multi-hop pipelines, where multiple downstream tables are created based on a set of intermediate tables. You can also leverage the Delta Engine cache (which can be used in tandem with the Apache Spark cache): it keeps local copies of remote data, which can be read and operated on faster than data accessed solely through the Apache Spark cache. In this scenario, you may benefit from materializing the DataFrames, not only to take advantage of the Delta Engine cache but also to handle fault recovery and simplify troubleshooting for your multi-hop data pipelines.

Benefits of using “Intermediate Hops” with Delta tables, especially where large numbers of transformations are involved.

For more information on intermediate hops, please refer to Beyond Lambda: Introducing Delta Architecture.  For more information on Delta Engine and Apache Spark caching, please refer to Optimize performance with caching.
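
For instance, a minimal sketch of materializing a Silver table and caching it for reuse across multiple downstream Gold aggregations might look like the following; paths, column names and filters are illustrative:

silver_df = (spark.read.format("delta").load("/delta/silver/events")
             .filter("eventType IS NOT NULL"))

# keep the intermediate table in memory so the two aggregations below
# do not each re-read it from cloud storage
silver_df.cache()

daily_counts = silver_df.groupBy("date").count()
counts_by_type = silver_df.groupBy("eventType").count()

daily_counts.write.format("delta").mode("overwrite").save("/delta/gold/daily_counts")
counts_by_type.write.format("delta").mode("overwrite").save("/delta/gold/counts_by_type")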

Can we use Delta Lake in scenarios where the tables are updated very frequently, say every five minutes? Basically, we have tables stored in an online system and want to create an offline system using Delta Lake and update the Delta tables every five minutes. What are the performance and cost implications, and is this something we can consider using Delta for?

Delta Lake can be both a source and a sink for both your batch and streaming processes. In the case of frequently updated tables, you can either run batch queries regularly (say, every five minutes) or use Trigger.Once (as noted in the previous section). In terms of performance and cost implications, below is a great slide that captures the cost vs. latency trade-off for these approaches.

Cost and performance considerations of using Delta Tables in scenarios where the tables are updated every few minutes.

To dive deeper into this, please refer to the tech talk Beyond Lambda: Introducing Delta Architecture. A few quick callouts for frequent updates (whether batch or streaming):

  • Adding data frequently may result in many small files. A best practice is to periodically compact your files (see the sketch after this list). If you’re using Databricks, you can also use auto-optimize to automate this task.
  • Using ForeachBatch to modify existing data may result in a lot of transactions and versions of data. You may want to be more aggressive about cleaning out log entries and/or vacuuming to reduce storage size.
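
A minimal sketch of the periodic compaction mentioned above (the path and target file count are illustrative); setting dataChange to false signals that the rewrite does not logically change the table, so streaming readers can skip it:

path = "/delta/events"
num_files = 16

(spark.read
    .format("delta")
    .load(path)
    .repartition(num_files)   # rewrite into fewer, larger files
    .write
    .format("delta")
    .option("dataChange", "false")
    .mode("overwrite")
    .save(path))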

Another good reference is the VLDB 2020 paper: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores.

What's the performance impact on live queries, when VACUUM is in progress?

There should be minimal to no impact on live queries, as vacuum typically runs on a different set of files than your queries touch. Where there is a potential impact is if you’re doing a time travel query on the same data that you’re about to vacuum (e.g. running vacuum with the default of 7 days while attempting to query data that is older than 7 days).

Concerning time travel, if a Parquet metadata file is created after every 10 commits, does that mean I can only go back 10 commits? Or do time travel queries just ignore Parquet metadata files?

You can go as far back in the transaction log as defined by delta.logRetentionDuration, which defaults to 30 days of history. That is, by default you can see 30 days of history within the transaction log. Note that while there are 30 days of log history, when running vacuum (which must be initiated manually; it does not run automatically), any data files older than 7 days are removed by default.

As for the Parquet metadata files, every Delta Lake transaction first records a JSON file in the transaction log. Every 10th transaction, a Parquet metadata file (a checkpoint) is generated that stores the previous transaction log entries to improve performance. Thus, if a new cluster needs to read all of the transaction log entries, it only needs to read the latest Parquet file and the most recent (up to 9) JSON files.

Handling massive metadata. How you can use Apache Spark to scale your Delta Tables with millions of files in them.

For more information, please refer to Diving into Delta Lake: Unpacking the Transaction Log.  

Get Started with Delta Lake 0.7.0

Try out Delta Lake with the preceding code snippets on your Apache Spark 3.0.0 (or greater) instance. Delta Lake makes your data lakes more reliable (whether you create a new one or migrate an existing data lake).  To learn more, refer to https://delta.io/, and join the Delta Lake community via Slack and Google Group.  You can track all the upcoming releases and planned features in GitHub milestones. You can also try out Managed Delta Lake on Databricks with a free account.

Credits

We want to thank the following contributors for updates, doc changes, and contributions in Delta Lake 0.7.0: Alan Jin, Alex Ott, Burak Yavuz, Jose Torres, Pranav Anand, QP Hou, Rahul Mahadev, Rob Kelly, Shixiong Zhu, Subhash Burramsetty, Tathagata Das, Wesley Hoffman, Yin Huai, Youngbin Kim, Zach Schuermann, Eric Chang, Herman van Hovell, Mahmoud Mahdi.

O’Reilly Learning Spark Book

Free 2nd Edition includes updates on Spark 3.0 and chapters on Spark SQL and Data Lakes.


Improving Public Health Surveillance During COVID-19 with Data Analytics and AI


As the leader of the State and Local Government business at Databricks, I get to see what governments all over the U.S. are doing to address the Novel Coronavirus and COVID-19 crisis. I am continually inspired by the work of public servants as they go about their business to save lives and address this crisis.

In the midst of all of the bad news, there is good news about the important work being done by public health officials on COVID-19. The good work performed by public health departments beyond the Centers for Disease Control and Prevention (CDC) doesn’t usually make the dramatic headlines, but it is having an amazing impact.

Like many of us, local and state governments are figuring things out as they go along, one step at a time. By observing successful COVID-19 response programs in countries where infections happened early, public health agencies first recognized the need for contact tracing as an important data source, and have scrambled to implement contact tracing programs. Once contact tracing is in place, vast amounts of data become available.

Across the globe, it has been proven, in countries like South Korea, that the COVID-19 case data from contact tracing can inform the management of outbreaks in powerful ways. How does all that data get used to inform government policy makers, to guide public health practices and define public policy, sometimes in spite of a less-than-enthusiastic public? The epidemiological study of this data informs research not just on individuals, but on populations, geographies, and risk factors that contribute to outbreaks, hospitalizations, and fatalities.

What is the right shelter-in-place or reopening policy for Los Angeles County vs. Humboldt County, California? What are the right group size limitations? The right policies for high-risk environments like skilled nursing facilities? Data can inform all of these policy recommendations. It must.

Unfortunately, it’s not that easy. Local departments of health and other public health agencies at the forefront of this pandemic are struggling with fundamental data challenges that are impeding their ability to drive meaningful insights. Challenges like:

  • How do we bring together clinical and case investigation datasets that reside in siloed, legacy data warehouses, EHR and operational systems managed by thousands of healthcare providers and agencies?
  • How do we provide the necessary compute power to process these population-scale datasets?
  • How do we blend structured data (e.g. medical records) with unstructured data (e.g. patient chatbot logs, medical images) to power novel insights and predictive models?
  • How do we reliably ingest streaming data for real-time insights on the spread of COVID-19, hospital usage trends, and more?

For many health organizations, building this analytics muscle has been a slow burn. The good news: powerful cloud-based software solutions, like the Databricks Unified Data Analytics Platform, are accelerating this transformation with the tooling and scale needed to analyze large volumes of health data in minutes. With these fundamental data problems solved, health organizations can refocus their efforts on building analytics and ML products instead of wrangling their data. One example is the COVID-19 surveillance solution developed on top of Databricks, which is being deployed in a number of state and local government health departments, as well as by a number of hospitals and care facilities across the U.S.

Included above is a brief demo of our public health surveillance solution. In the demo, we show how to take a data-driven approach to adaptive response, or in other words, apply predictive analytics to COVID-19 datasets to help drive more effective shelter-in-place policies.

With this solution on Databricks we’re able to yield important insights in a short amount of time and, as a cloud native offering, it can be deployed quickly and cost effectively at scale. We recently launched this program in one of the largest state government health departments in the country, and we had it running and delivering insights in less than two hours.

This solution includes COVID-19 data sets we have previously published, as well as workbooks used by public health departments to deliver data-driven insight to guide COVID-19 public policy. This is one of many solutions that can be built on Databricks using this dataset. Other use cases for COVID-19 data include Hotspot Analysis, Epidemiological Modeling, and Supply Chain Optimization. You can learn more on our COVID-19 hub.

Databricks is committed to fighting the COVID-19 epidemic and other infectious diseases by implementing powerful analytical tools for government agencies across the country. We invite you to inquire about how we might be able to help your agency.

Next Steps

--

Try Databricks for free. Get started today.


Operationalize 100 Machine Learning Models in as Little as 12 Weeks with Azure Databricks


In rapidly changing environments, Azure Databricks enables organizations to spot new trends, respond to unexpected challenges and predict new opportunities. Organizations are leveraging machine learning and artificial intelligence (AI) to derive insight and value from their data and to improve the accuracy of forecasts and predictions. Data teams are using Delta Lake to accelerate ETL pipelines and MLflow to establish a consistent ML lifecycle.

Solving the complexity of ML frameworks, libraries and packages

Customers frequently struggle to manage all of the libraries and frameworks for machine learning on a single laptop or workstation. There are so many libraries and frameworks to keep in sync (H2O, PyTorch, scikit-learn, MLlib). In addition, you often need to bring in other Python packages, such as pandas, Matplotlib, NumPy and many others. Mixing and matching versions and dependencies between these libraries can be incredibly challenging.

Diagram: Databricks Runtime for ML enables ready-to-use clusters with built-in ML Frameworks

With Azure Databricks, these frameworks and libraries are packaged so that you can select the versions you need as a single dropdown. We call this the Databricks Runtime. Within this runtime, we also have a specialized runtime for machine learning which we call the Databricks Runtime for Machine Learning (ML Runtime). All these packages are pre-configured and installed so you don’t have to worry about how to combine them all together. Azure Databricks updates these every 6-8 weeks, so you can simply choose a version and get started right away.

Establishing a consistent ML lifecycle with MLflow

The goal of machine learning is to optimize a metric such as forecast accuracy. Machine learning algorithms are run on training data to produce models. These models can be used to make predictions as new data arrive. The quality of each model depends on the input data and tuning parameters. Creating an accurate model is an iterative process of experiments with various libraries, algorithms, data sets and models. The MLflow open source project started about two years ago to manage each phase of the model management lifecycle, from input through hyperparameter tuning. MLflow recently joined the Linux Foundation. Community support has been tremendous, with over 200 contributors, including large companies. In June, MLflow surpassed 2.5 million monthly downloads.

Diagram: MLflow unifies data scientists and data engineers
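
For a sense of what that lifecycle tracking looks like in code, here is a minimal, generic MLflow sketch; the model, data (X_train, y_train, X_test, y_test) and metric are placeholders rather than a specific workload:

import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

with mlflow.start_run():
    params = {"C": 0.5, "max_iter": 500}
    model = LogisticRegression(**params).fit(X_train, y_train)

    mlflow.log_params(params)                   # tuning parameters for this run
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")    # the model artifact itself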

Ease of infrastructure management

Data scientists want to focus on their models, not infrastructure. You don’t have to manage dependencies and versions. It scales to meet your needs. As your data science team begins to process bigger data sets, you don’t have to do capacity planning or requisition/acquire more hardware. With Databricks, it’s easy to onboard new team members and grant them access to the data, tools, frameworks, libraries and clusters they need.

Building your first machine learning model with Azure Databricks

To help you get a feel for Azure Databricks, let’s build a simple model using sample data in Azure Databricks. Often a data scientist will see a blog post about an algorithm, or have some data they want to use for exploratory ML. It can be very hard to take a code snippet found online, shape a dataset to fit the algorithm, then find the correct infrastructure and libraries to pull it all together. With Azure Databricks, all that hassle is removed. This blog post talks about time-series analysis with a library called Prophet. It would be interesting to take this idea, of scaling single-node machine learning to distributed training with Spark and Pandas UDFs, and apply it to a COVID-19 dataset available on Azure Databricks.

Installing the library is as simple as typing fbprophet in a PyPi prompt, then clicking Install. From there, once the data has been read into a pandas DataFrame and transformed into the format expected by Prophet, trying out the algorithm was quick and simple.

from fbprophet import Prophet

model = Prophet(
    interval_width=0.9,
    growth='linear'
)

model.fit(summed_case_pd)

#set periods to a large number to see window of uncertainty grow
future_pd = model.make_future_dataframe(
    periods=200,
    include_history=True
)

# predict over the dataset
forecast_pd = model.predict(future_pd) 

With a DataFrame containing the predictions, plotting the results within the same notebook just takes a call to display().

predict_fig = model.plot(forecast_pd, xlabel='date', ylabel='new_cases')
display(predict_fig)    

Diagram: Creating a machine learning model in Azure Databricks

The referenced blog then used a Pandas UDF to scale up this model to much larger amounts of data. We can do the same, and train models in parallel on several different World Health Organization (WHO) regions at once. To do this we wrap the single-node code in a Pandas UDF:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(prophet_schema, PandasUDFType.GROUPED_MAP)
def forecast_per_region(keys, grouped_pd):

    region = keys[0]
    days_to_forecast = keys[1]

    model = Prophet(
        interval_width=0.9,
        growth='linear'
    )

    model.fit(grouped_pd[['ds', 'y']])

    future_pd = model.make_future_dataframe(
        periods=days_to_forecast,
        include_history=True
    )

    forecast_pd = model.predict(future_pd)
    forecast_pd['WHO_region'] = region

    return forecast_pd[[c.name for c in prophet_schema]]

We can then apply the function to each WHO region and view the results:

from pyspark.sql.functions import lit

results = (covid_spark
    .groupBy("WHO_region", lit(100).alias('days_to_forecast'))
    .apply(forecast_per_region))
results.createOrReplaceTempView("results")  # register a temp view so the %sql cell below can query it

Finally, we can use the Azure Databricks notebooks’ SQL functionality to quickly visualize some of our predictions:

%sql
SELECT WHO_region, ds, yhat
FROM results
WHERE WHO_region = "Eastern Mediterranean Region" 
    or WHO_region = "South-East Asia Region" 
    or WHO_region = "European Region" 
    or WHO_region = "Region of the Americas" 
ORDER BY ds, WHO_region 

From our results we can see that this dataset is not ideal for time-series forecasting. However, we were able to quickly experiment and scale up our model, without having to set up any infrastructure or manage libraries. We could then share these results with other team members just by sending the link of the collaborative notebook, quickly making the code and results available to the organization.

Alignment Healthcare

Alignment Healthcare, a rapidly growing Medicare insurance provider, serves one of the most at-risk groups of the COVID-19 crisis—seniors. While many health plans rely on outdated information and siloed data systems, Alignment processes a wide variety and large volume of near real-time data into a unified architecture to build a revolutionary digital patient ID and comprehensive patient profile by leveraging Azure Databricks. This architecture powers more than 100 AI models designed to effectively manage the health of large populations, engage consumers, and identify vulnerable individuals needing personalized attention—with a goal of improving members’ well-being and saving lives.

Start building your machine learning models on Azure Databricks

Try out the notebook hosted here and learn more about building ML models on Azure Databricks by attending this webinar, From Data to AI with Microsoft Azure Databricks, this Azure Databricks ML training module on MS Learn and our next Azure Databricks Office Hours. If you are ready to grow your business with machine learning in Azure Databricks, schedule a demo.

--

Try Databricks for free. Get started today.


Introducing the Databricks Web Terminal


Introduction

We’re excited to introduce the public preview of the Databricks Web Terminal in the 3.25 platform release. Any user with “Can Attach To” cluster permissions can now use the Web Terminal to interactively run Bash commands on the driver node of their cluster.

The new Databricks web terminal provides a fully interactive shell that supports virtually all command-line programs. The terminal is not intended for running Apache Spark jobs; however, it is a convenient environment for installing native libraries, debugging package management issues, or simply editing a system file inside the container.

New Databricks Web Terminal for running command-line programs via an interactive shell.

Motivation

Running shell commands has been possible through %sh magic commands in Databricks Notebooks. In addition, in some environments, cluster creators can set up SSH keys at cluster launch time and SSH into the driver container of their cluster. Both these features had limitations for power users. The new web terminal feature is more convenient and powerful than both these existing methods and is now our recommended way of running shell commands on the driver.

We heard from our users that they want a highly interactive shell environment which supports any command-line tools, including popular editors such as Vim or Emacs. They asked for interactive terminal sessions to install arbitrary Linux packages or download files. These were not convenient or possible with %sh magic commands.

SSH would offer an interactive shell, but it is limited to a single user, whose keys are registered on the cluster. Many users share clusters and most of them want the convenience of the interactive shell. In addition, system administrators and security teams are not comfortable with opening the SSH port to their virtual private networks. The web terminal addresses all these limitations. As a user, you do not need to setup SSH keys to get an interactive terminal on a cluster.

How to get started

Web terminal is in Public Preview (AWS|Azure) and disabled by default. Workspace admins can enable the feature (AWS|Azure) through the Advanced tab. After this step, users can launch web terminal sessions on any clusters running Databricks Runtime 7.0 or above if they have “Can Attach To” permission.

There are two ways to open a web terminal on a cluster. You can go to the Apps tab under a cluster’s details page and click on the web terminal button.


Or when inside a notebook, you can click on the Cluster dropdown menu and click the “Terminal” shortcut.

The new Databricks Web Terminal can be accessed via a “Terminal” shortcut inside a notebook’s Cluster dropdown menu.

If you are launching a cluster and you wish to restrict web terminal access on your cluster, you can do so by setting the DISABLE_WEB_TERMINAL=true environment variable. Also note that high concurrency clusters (AWS|Azure) with table ACLs (AWS|Azure) or credential passthrough (AWS|Azure) do not allow web terminal access.

Please see our user guide (AWS|Azure) for more details about using the web terminal on Databricks.

Limitations & Future Plans

When a web terminal session is not actively used for several minutes, it will time out, which leads to a new Bash process being created. This can result in losing your active shell session. To avoid this, we recommend managing your sessions with tools like tmux.

High concurrency clusters that have either table access control or credential passthrough enabled do not support the web terminal.

When a user logs out of Databricks or their permission is removed from a cluster, their active web terminal sessions are not terminated. Please refer to these security considerations (AWS|Azure) for more details. We are working on addressing these issues before the feature is generally available.

--

Try Databricks for free. Get started today.



Databricks Unified Data Analytics Platform for AWS Gets a Major Upgrade


Now generally available: an upgraded platform architecture that adds customer-managed IP access lists, customer-managed VPC, the Account API, multiple workspaces per account, cluster-level policies, IAM credential passthrough, and more.

We are excited to announce the general availability of a major upgrade to the Databricks Unified Data Analytics Platform on Amazon Web Services (AWS) that adds more security, scalability, and simpler management with features like IP access lists, customer-managed VPC, cluster-level policies, and much more.

Data leaders are tasked with creating data-driven business value by ensuring secure and manageable access to all of their data for all of their users. It’s extremely difficult to do this without building a data analytics and machine learning (ML) platform that provides strong data security and governance, simplified management of users and data initiatives, and high reliability and performance, so data teams can trust it to run business-critical workloads at scale.

In this blog, you will learn about the exciting new features that let data teams innovate faster by doing more experimentation and getting business-critical workloads into production at scale, while enforcing the right security and governance. You’ll also learn how customers like Wejo and Expedia are starting to benefit from this major platform upgrade.

Comprehensive platform security

Enterprises need to balance data democratization with enterprise data security and governance. As data grows, enterprises default to a defensive lockdown of all data. This limits innovation and the ability to use the data to create new insights, new data products, and improved operations. To help make data accessible while enforcing the right security controls and governance, Databricks provides capabilities that help you securely generate value from your data:

  • IP Access List – Databricks workspaces can be configured so that employees connect to the service only through existing corporate networks with a secure perimeter. Databricks customers can use the IP access lists feature to define a set of approved IP addresses. All incoming access to the web application and REST APIs requires users to connect from an authorized IP address or VPN; a sketch of defining such a list via the REST API follows this list.
  • Customer-managed VPC – Deploy Databricks data plane in your own enterprise-managed VPC, in order to do necessary customizations as required by your cloud engineering & security teams.
  • Secure Cluster Connectivity – Databricks establishes secure connectivity between the scalable control plane and the clusters in your private VPC data plane. Not a single public IP address is needed in your cluster infrastructure to interact with the control plane.
  • Customer-managed Keys for Notebooks – Databricks stores customer notebooks in the scalable control plane to provide a fast, seamless user experience via the web interface. You can now choose to use your own AWS KMS key to encrypt those notebooks.
  • IAM Credential Passthrough – Access S3 buckets and other IAM-enabled AWS data services using the identity that you use to log in to Databricks, either with SAML 2.0 Federation or SCIM.
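
As referenced in the IP Access List item above, here is a minimal sketch of defining an allow list through the REST API with Python, assuming the feature is enabled for the workspace; the workspace URL, token, and CIDR range are placeholders.

# Python (sketch; placeholder values throughout)
import requests

requests.post(
    "https://<databricks-instance>/api/2.0/ip-access-lists",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "label": "corp-vpn",                 # illustrative label
        "list_type": "ALLOW",
        "ip_addresses": ["203.0.113.0/24"],  # illustrative corporate CIDR block
    },
)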

The new capabilities are already enabling Wejo, the global leader in connected car data, to build a new connected car data platform-as-a-service. “Having the ability to effectively and efficiently digest, process, and extract value from over 15M active connected cars delivering over 2 trillion data points, is critical to our success at Wejo,” said Daniel Tibble, Head of Analytics at Wejo. “With Databricks, we are building a rich connected car data platform-as-a-service to enable our customers and partners, from global data providers to city traffic planning commissions, that simplifies analyzing and running machine learning workloads on all of our connected car data. Databricks platform is enabling our customers’ data teams to work in a more collaborative, more secure, and highly scalable solution without the need to invest in their own infrastructure.”

360° administration

With the proliferation of siloed tools for data analytics and machine learning, IT administrators are bogged down managing an ever-growing, complex infrastructure. Databricks helps IT teams easily manage users and costs on a single unified platform for analytics and ML, with full control. With a consistent experience across clouds, you can now deliver cloud-native data environments with:

  • Create on-demand data analytics workspaces in minutes – Setting up a workspace and the infrastructure for a new project can take months in some cases – the multi-workspace feature brings this down to minutes. Get a new project and team up and running with a few API calls while implementing existing policies and configuration. If you use Terraform, you could also utilize the Databricks Terraform Resource Provider to bootstrap and operate a workspace.
  • Trust But Verify with Databricks – Get visibility into relevant cloud platform activity in terms of who’s doing what and when, by configuring Databricks Audit Logs and other related audit logs in AWS. See how you could process the Databricks Audit Logs for continuous monitoring.
  • Cluster Policies – Implement cluster policies across multiple workspaces to make the cluster creation interface relevant to different data personas, and to enforce different security and cost controls (a minimal policy sketch follows this list).
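
As a minimal sketch of what a cluster policy could look like, here it is created through the Cluster Policies API with Python; the policy name, limits, and credentials are illustrative choices, not recommendations.

# Python (sketch; placeholder values throughout)
import json
import requests

policy = {
    # Force an auto-termination setting and hide it from the cluster form
    "autotermination_minutes": {"type": "fixed", "value": 30, "hidden": True},
    # Cap the approximate cost of any cluster created under this policy
    "dbus_per_hour": {"type": "range", "maxValue": 10},
}

requests.post(
    "https://<databricks-instance>/api/2.0/policies/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"name": "Data Science Policy", "definition": json.dumps(policy)},
)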

Elastic scalability

Your data teams can now use fully-configured data environments and APIs to quickly take initiatives from development to production, reducing the complexity and inefficiencies of manual processes that can add months to data initiatives. Once in production, they can use on-demand autoscaling to optimize performance and reduce downtime of data pipelines and ML models by efficiently matching resources to demand. Exciting new features enabling this include:

  • Productionize and Automate Your Data Platform at Scale – Create fully configured data environments and bootstrap them with users/groups, cluster policies, clusters, notebooks, object permissions, etc., all through APIs.
  • CI/CD for your Data Workloads – Streamline your application development and deployment process with integration to DevOps tools like Jenkins, Azure DevOps, CircleCI etc. Use REST API 2.0 under the hood to deploy your application artifacts and provision workspace-level objects.
  • Databricks Pools – Enable clusters to start and scale faster by creating a managed cache of virtual machine instances that can be acquired for use when needed.

Customers like Expedia.com are using Databricks to engage with their customers in a whole new way. “At Expedia we are future proofing the way we think about and engage with our customers to provide more personalized, seamless, and stellar experiences across our platforms as they plan their next big adventure,” says Ashin Moodithaya, Director of Technical Product Management at Expedia.  “We are expanding the way we use Databricks to now include Expedia.com. The ease of use, simple configuration, and collaborative environment of Databricks Unified Data Analytics Platform, will drastically improve our marketing data science teams productivity by using relevant data from across the enterprise in a secure and compliant manner to reimagine the customer experience across Expedia.com and our partner platforms.”

Unleash your data teams’ potential

Databricks Unified Data Analytics Platform is the highly secure, scalable, and simple-to-manage data analytics and machine learning platform that enables all your data teams to solve your toughest data problems. Securely democratize all your data to enable your data teams to extract insights, build new data products, and introduce new data-driven operational efficiencies. Get your data teams creating new value within minutes while maintaining control across workspaces, clusters, and users. Do more data analytics and machine learning, faster, securely, and at scale.

The new features for the Unified Data Analytics Platform on AWS are now available in the following AWS regions: US West 1, US West 2, US East 1, and US East 2. Learn more about how we are enabling you with comprehensive platform security, elastic scalability, and 360° administration for all your data analytics and machine learning needs.

--

Try Databricks for free. Get started today.

The post Databricks Unified Data Analytics Platform for AWS Gets a Major Upgrade appeared first on Databricks.

Spark + AI Summit Europe is Expanding and Getting a New Name: Data + AI Summit Europe

Back in 2013, we held the first Spark Summit — a gathering of the Apache Spark™ community with leading contributors and production users sharing their wisdom. Since the first event, Spark’s success has accelerated the evolution of data science, data engineering and analytics. As the data community has expanded, we’ve evolved the content and the name — including speakers such as Jeff Dean (of Google fame) and Adam Paszke (PyTorch). Most recently, we held the Spark + AI Summit in June as a global event that brought together 40,000 attendees from data teams around the world to shape the future of big data, analytics and artificial intelligence (AI). We usually follow this up later in the year with another Spark + AI Summit timed for the European audience, but this year we’ve decided to expand the event and to evolve the name once more.

In November 2020, we’re excited to launch the inaugural Data + AI Summit Europe, officially expanding Spark + AI Summit content and community to include all things data, with a focus on the best open source technologies for building enterprise data applications!

What about Spark?

Spark represents the cornerstone of the Data + AI Summit community. Over the last 10 years, Apache Spark has quickly become the open standard for large scale data processing. It is widely adopted, and is supported by an active global community that we remain deeply committed to. Data + AI Summit Europe will continue to be the home for the Spark community to gather, share ideas and accelerate the development and adoption of Spark. We’ll have a new Apache Spark track, and will also continue to include Spark content, speakers and activities throughout the event, just as we always have.

Over the last 10 years, data analysts, data scientists and others have joined the Spark community and are working on teams solving complex data challenges. Born out of this community are key open source technologies such as Delta Lake, MLflow, Redash and Koalas – all of which are growing rapidly. We’ve widened the conference program to cover all of these technologies and many others – including Spark – in more depth, and have adapted the name to be more inclusive of the communities starting to form around them.

We invite data scientists, data engineers, analysts, developers, chief data officers, industry experts, researchers and ML practitioners to attend the Summit and learn from the world’s leading experts on topics such as:

  • AI Use Cases and New Opportunities
  • Best Practices and Use Cases for Apache Spark, Delta Lake, MLflow
  • Data Engineering, including streaming architectures
  • SQL Analytics and business intelligence (BI) using data warehouses and data lakes
  • Data Science, including the Python ecosystem
  • Machine Learning and Deep Learning Applications
  • Productionizing Machine Learning (MLOps)
  • Research on large-scale data analytics and ML
  • Industry Use Cases

First Data + AI Summit Europe: 17-19 November

Formerly known as Spark + AI Summit, the Data + AI Summit Europe will take place virtually from 17-19 November. We’ll explore how data teams are increasingly working as one — performing BI, ML and AI concurrently by leveraging advanced technologies like Apache Spark 3.0, Delta Lake and MLflow. While the program will be optimized for European timezones, data scientists, data analysts, data engineers and data leaders across the globe are expected to participate because of this virtual format and free general admission. Register here to be one of the first to hear when it’s open for registration.

Call for Presentations Now Open

With the evolution of the conference name and broadened program, we’re delighted to open the call for presentations, which runs through 13 September, 11:59 PM PDT.

We’re especially interested in more ideas for sessions on Delta Lake and SQL Analytics, MLflow and MLOps, and Machine Learning and Deep Learning with PyTorch and TensorFlow, among other topics.

Interested in speaking? Submit your talk now

We really hope you can join us for the inaugural Data + AI Summit Europe in November!

--

Try Databricks for free. Get started today.

The post Spark + AI Summit Europe is Expanding and Getting a New Name: Data + AI Summit Europe appeared first on Databricks.

An Update on Project Zen: Improving Apache Spark for Python Users

Apache Spark™ has reached its 10th anniversary with Apache Spark 3.0, which has many significant improvements and new features including, but not limited to, type hint support in pandas UDFs, better error handling in UDFs, and Spark SQL adaptive query execution. It has grown to be one of the most successful open-source projects as the de facto unified engine for data science. In fact, Apache Spark has now reached the plateau phase of the Gartner Hype Cycle in data science and machine learning, pointing to its enduring strength.

As Apache Spark has grown, the number of PySpark users has grown rapidly: 68% of notebook commands on Databricks are in Python, and the number of PySpark users has nearly tripled over the last year. The Python programming language itself has become one of the most commonly used languages in data science.

With this momentum, the Spark community has started to focus more on Python and PySpark through an initiative we named Project Zen, after The Zen of Python, which defines the principles of Python itself.

This blog post introduces Project Zen and our future plans for PySpark development focusing on the following upcoming features:

  • Redesigning PySpark documentation
  • PySpark type hints
  • Visualization
  • Standardized warnings and exceptions
  • JDK, Hive and Hadoop distribution option for PyPI users

Redesigning PySpark documentation

The structure and layout of the PySpark documentation have not been updated for more than five years, since the days when RDD was the only user-facing API. The documentation has focused more on being a development reference than on readability. For example, it lists all classes and methods on a single page, has no user guide or quickstart, and is difficult to navigate.

As part of Project Zen, a redesign of the PySpark documentation is now under heavy development to provide users not only structured API references, but also meaningful examples, scenarios, and quickstart guides, as well as dedicated migration guides and advanced use cases. The new PySpark documentation was demonstrated in the SAIS 2020 keynote.

Demonstration of new PySpark documentation

In addition, the docstrings will follow the numpydoc style (ref: SPARK-32085), since the current PySpark docstrings and the HTML pages generated from them are less readable.

Switching to numpydoc enables us to have better docstrings. For example, this docstring for the hint method:

"""Specifies some hint on the current :class:`DataFrame`.

:param name: A name of the hint.
:param parameters: Optional parameters.
:return: :class:`DataFrame`

becomes a more readable docstring in the numpydoc style as below:

"""Specifies some hint on the current :class:`DataFrame`.

Parameters
----------
name : str
    A name of the hint.
parameters : dict, optional
    Optional parameters

Returns
-------
DataFrame

Moreover, it generates more readable and structured API references in HTML as noted in this example.

PySpark type hints

An important roadmap item is Python type hint support in the PySpark APIs. Python type hints were officially introduced in PEP 484 with Python 3.5 to statically indicate the type of a value in Python, and leveraging them has multiple benefits such as auto-completion, IDE support, and automated documentation.

Type hint support for the PySpark APIs was originally implemented as a third-party project, and we’re currently working to officially port it into PySpark to improve usability. This is now on the roadmap for the upcoming Apache Spark 3.1.
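
As an illustration only, the annotations being ported take roughly the following shape; this is a hypothetical stub, not the actual PySpark source.

# Python (hypothetical stub illustrating the annotation style)
from typing import Any

class DataFrame:
    def hint(self, name: str, *parameters: Any) -> "DataFrame": ...
    def limit(self, num: int) -> "DataFrame": ...

# With stubs like these, an IDE or a checker such as mypy can flag
# df.limit("ten") as a type error before the code ever runs.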

With type hint support, users will be able to do static error detection and autocompletion, as shown below:

Static error detection

 

Improved autocompletion

Visualization

Visualizing data is a critical component of data science as it helps people understand trends at a glance and make informed decisions quickly. Currently, there is no native visualization support in PySpark, so developers generally either call DataFrame.summary() to get a table of numbers, downsample a subset of their data to visualize it with libraries such as matplotlib or Koalas, or rely on other third-party business intelligence and big data analytics tools.

Visualization support is in the roadmap of Project Zen for PySpark to directly support APIs to plot users’ DataFrames. Koalas already implements plotting from a Spark DataFrame (see example below), and this can be easily leveraged to bring the visualization support into PySpark.

Example visualization plotting Spark DataFrame by using Koalas that will be supported by Databricks’ Project Zen.

Plotting Spark DataFrame by using Koalas
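
As a minimal sketch of that Koalas-based plotting, assuming a Spark DataFrame named spark_df with a numeric column named amount already exists:

# Python (sketch; spark_df and the "amount" column are assumptions)
import databricks.koalas as ks  # importing Koalas also adds to_koalas() to Spark DataFrames

kdf = spark_df.to_koalas()        # wrap the Spark DataFrame in a Koalas DataFrame
kdf["amount"].plot.hist(bins=20)  # plot directly; matplotlib renders the histogram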

Standardized warnings and exceptions

PySpark error and warning types are obscure and vaguely classified. When users face an exception or a warning, often it is a plain Exception or UserWarning. For example,

>>> spark.range(10).explain(1, 2)
Traceback (most recent call last):
    ...
Exception: extended and mode should not be set together.

This makes it difficult for users to programmatically take action on the exception, for example with try-except blocks, as the sketch below illustrates.
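
The sketch below, which assumes an active SparkSession named spark, shows the workaround users are left with today: matching on the message text of a generic Exception, which is exactly the brittleness that finer-grained exception classes would remove.

# Python (sketch)
try:
    spark.range(10).explain(1, 2)
except Exception as e:
    if "should not be set together" in str(e):
        pass  # handle the conflicting-argument case
    else:
        raise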

It is now on the roadmap to define a classification of exceptions and fine-grained warnings so that users know which exceptions and warnings are raised in which cases, and can take the appropriate action.

JDK, Hive and Hadoop distribution option for PyPI users

pip is a very easy way to install PySpark, with more than 5 million downloads every month from PyPI. Users just type pip install pyspark and run PySpark shells or submit an application to the cluster. At the same time, Apache Spark has introduced many build profiles to consider when distributing, for example JDK 11, Hadoop 3, and Hive 2.3 support.

Unfortunately, PySpark only supports one combination by default when it is downloaded from PyPI: JDK 8, Hive 1.2, and Hadoop 2.7 as of Apache Spark 3.0.

As part of Project Zen, the distribution option will be provided to users so users can select the profiles they want. Users will be able to simply install from PyPI and use any existing Spark cluster. This is being tracked at SPARK-32017.
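
As a purely illustrative sketch of how such an option might look from the command line; the final interface is still being discussed in SPARK-32017, and the variable name shown here is hypothetical.

# Shell (hypothetical; the final option name may differ)
PYSPARK_HADOOP_VERSION=3.2 pip install pyspark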

What’s next?

Project Zen is in progress thanks to the tremendous efforts from the community. PySpark documentation, PySpark type hints, and optional profiles in the PyPI distribution are targeted to be introduced for the upcoming Apache Spark 3.1. Other items that are under heavy development will be introduced in a later Spark release.

Apache Spark places more importance on PySpark and Python than ever and will keep improving its usability as well as its performance to help Python developers and data scientists become even more successful when working with Spark.

--

Try Databricks for free. Get started today.

The post An Update on Project Zen: Improving Apache Spark for Python Users appeared first on Databricks.

It’s an ESG World and We’re Just Living in it

The future of finance goes hand in hand with socially responsible investing, environmental stewardship, and corporate ethics. In order to stay competitive, Financial Services Institutions (FSI) are increasingly disclosing more information about their environmental, social, and corporate governance (ESG) performance. Hence the increasing importance of ESG ratings and ESG scores to investment managers and institutional investors. In fact, the value of data-driven ESG global assets has increased to $40.5 trillion in 2020.

With nearly ⅓ of global managed assets focused on ESG, it’s clear that the broad benefits of incorporating ESG goals and responsible investing are well understood by companies, investment professionals, and regulators. But how do you harness the insights buried deep in non-financial data sources from CSR reporting and social media to impact investment and carbon offsetting strategies?

Financial Services ESG Virtual Workshop

In the recent technical workshop “Data + AI in the World of ESG”, which attracted some of the biggest financial services institutions in the world, we aimed to answer that question by educating the community on how data and AI can help organizations better understand and quantify the sustainability and social impact of any investment in a company. It turns out that, although many companies are prioritizing ESG factors as a strategic initiative, many of them don’t have the resources and data strategy to really take it to the next level.

In fact, when asked about whether they have technical resources focused on ESG, only 31% of respondents said they had dedicated data engineers and data scientists for ESG. And when asked whether their ESG strategy leverages data and AI, less than a third of respondents said yes.

To help the attendees address some of these problems, Junta Nakai, the industry leader for Databricks’ financial services business, provided an insightful overview of the connections between companies and how understanding the positive or negative ESG consequences of these connections can impact one’s business, investment process, and strategy.

Joining Junta was Antoine Amend, technical director, who dove into how Databricks can enable asset managers to assess the sustainability of their investments, and demonstrated ways to use machine learning to extract key ESG initiatives, such as climate change mitigation, as communicated in yearly PDF reports, and compare these with the actual media coverage from news analytics data.

Through this novel approach to sustainable investing and asset management, companies can combine natural language processing (NLP) techniques and graph analytics to extract key strategic ESG initiatives and learn companies’ relationships in a global market and their impact on market risk calculations.

If you missed the workshop or would like to share the contents with your colleagues, you can access the replay below.

Try the below notebooks on Databricks to accelerate your ESG development strategy today and contact us to learn more about how we assist customers with similar use cases.

  1. Using NLP to extract key ESG initiatives from PDF reports
  2. Introducing a novel approach to ESG scoring using graph analytics
  3. Applying ESG framework to market risk calculation

WATCH RECORDING!

The post It’s an ESG World and We’re Just Living in it appeared first on Databricks.

Diving Deep Into the Inner Workings of the Lakehouse and Delta Lake

Earlier this year, Databricks wrote a blog that outlined how more and more enterprises are adopting the lakehouse pattern. The blog created a massive amount of interest from technology enthusiasts. While lots of people praised it as the next-generation data architecture, some thought the lakehouse is the same thing as a data lake. Recently, several of our engineers and founders wrote a research paper that describes some of the core technological challenges and solutions that set the lakehouse paradigm apart from the data lake, and it was accepted and published at the International Conference on Very Large Databases (VLDB) 2020. You can read the paper, “Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores”, here.

Henry Ford is often credited with having said, “If I had asked people what they wanted, they would have said faster horses.” The crux of this statement is that people often envision a better solution to a problem as an evolution of what they already know rather than rethinking the approach to the problem altogether. In the world of data storage, this pattern has been playing out for years. Vendors continue to try to reinvent the old horses of data warehouses and data lakes rather than seek a new solution.

More than a decade ago, the cloud opened a new frontier for data storage. Cloud object stores like Amazon S3 have become some of the largest and most cost-effective storage systems in the world, which makes them an attractive platform to store data warehouses and data lakes. However, their nature as a key-value store makes it difficult to achieve ACID transactions that many organizations require. Also, performance is hampered by expensive metadata operations (e.g. listing objects) and limited consistency guarantees.

Based on the characteristics of object stores, three approaches have emerged.

Data Lakes

The first is directories of files (i.e. data lakes) that store the table as a collection of objects, typically in a columnar format such as Apache Parquet. It’s an attractive approach, because the table is just a group of objects that can be accessed from a wide variety of tools without a lot of additional data stores or systems. However, both performance and consistency problems are common: hidden data corruption is common due to failed transactions, eventual consistency leads to inconsistent queries, latency is high, and basic management capabilities like table versioning and audit logs are unavailable.

Custom storage engines

The second approach is custom storage engines, such as proprietary systems built for the cloud like the Snowflake data warehouse. These systems can bypass the consistency challenges of data lakes by managing the metadata in a separate, strongly consistent service that’s able to provide a single source of truth. However, all I/O operations need to connect to this metadata service, which can increase resource costs and reduce performance and availability. Additionally, it takes a lot of engineering work to implement connectors to existing computing engines like Apache Spark, TensorFlow, and PyTorch, which can be challenging for data teams that use a variety of computing engines on their data. Engineering challenges can be exacerbated by unstructured data, because these systems are generally optimized for traditional structured data types. Finally, and most egregious, the proprietary metadata service locks customers into a specific service provider, leaving customers to contend with consistently high prices and expensive, time-consuming migrations if they decide to adopt a new approach later.

Lakehouse

With Delta Lake, an open source ACID table storage layer atop cloud object stores, we sought to build a car instead of a faster horse: not just a better data store, but a fundamental change in how data is stored and used, via the lakehouse. A lakehouse is a new paradigm that combines the best elements of data lakes and data warehouses. Lakehouses are enabled by a new system design: implementing similar data structures and data management features to those in a data warehouse, directly on the kind of low-cost storage used for data lakes. They are what you would get if you had to redesign storage engines in the modern world, now that cheap and highly reliable storage (in the form of object stores) is available.

Delta Lake maintains information about which objects are part of a Delta table in an ACID manner, using a write-ahead log, compacted into Parquet, that is also stored in the cloud object store. This design allows clients to update multiple objects at once, replace a subset of the objects with another, etc., in a serializable manner that still achieves high parallel read/write performance from the objects. The log also provides significantly faster metadata operations for large tabular datasets. Additionally, Delta Lake offers advanced capabilities like time travel (i.e. query point-in-time snapshots or roll back erroneous updates), automatic data layout optimization, upserts, caching, and audit logs. Together, these features improve both the manageability and performance of working with data in cloud object stores, ultimately opening the door to the lakehouse paradigm that combines the key features of data warehouses and data lakes to create a better, simpler data architecture.
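
For example, the time travel capability mentioned above is available directly from SQL; the events table name below is illustrative.

-- SQL
SELECT * FROM events VERSION AS OF 12;
SELECT * FROM events TIMESTAMP AS OF '2020-08-01';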

Today, Delta Lake is used across thousands of Databricks customers, processing exabytes of structured and unstructured data each day, as well as many organizations in the open source community. These use cases span a variety of data sources and applications. The data types stored include Change Data Capture (CDC) logs from enterprise OLTP systems, application logs, time-series data, graphs, aggregate tables for reporting, and image or feature data for machine learning. The applications include SQL workloads (most commonly), business intelligence, streaming, data science, machine learning, and graph analytics. Overall, Delta Lake has proven itself to be a good fit for most data lake applications that would have used structured storage formats like Parquet or ORC, and many traditional data warehousing workloads.

Across these use cases, we found that customers often use Delta Lake to significantly simplify their data architecture by running more workloads directly against cloud object stores, and increasingly, by creating a lakehouse with both data lake and transactional features to replace some or all of the functionality provided by message queues (e.g. Apache Kafka), data lakes, or cloud data warehouses (e.g. Snowflake, Amazon Redshift.)

In the research paper, the authors explain:

  • The characteristics and challenges of object stores
  • The Delta Lake storage format and access protocols
  • The current features, benefits, and limitations of Delta Lake
  • Both the core and specialized use cases commonly employed today
  • Performance experiments, including TPC-DS performance

Through the paper, you’ll gain a better understanding of Delta Lake and how it enables a wide range of DBMS-like performance and management features for data held in low-cost cloud storage. You’ll also see how the Delta Lake storage format and access protocols make it simple to operate, highly available, and able to deliver high-bandwidth access to the object store.

Download the research paper

--

Try Databricks for free. Get started today.

The post Diving Deep Into the Inner Workings of the Lakehouse and Delta Lake appeared first on Databricks.

Announcing Databricks Labs Terraform integration on AWS and Azure

Architecture for managing Databricks workspaces on Azure and AWS via Terraform.

We are pleased to announce integration for deploying and managing Databricks environments on Microsoft Azure and Amazon Web Services (AWS) with HashiCorp Terraform. It is a popular open source tool for creating safe and predictable cloud infrastructure across several cloud providers. With this release, our customers can manage their entire Databricks workspaces along with the rest of their infrastructure using a flexible, powerful tool. Previously on the company blog you may have read how we use the tool internally or how to share common building blocks of it as modules.

Increasing adoption of Databricks Labs Terraform Provider

Growing adoption from initial customer base

A few months ago, a customer-obsessed crew from Databricks Labs teamed up and started building the Databricks Terraform Provider. Since the very start, we’ve seen a steady increase in usage of this integration by a number of different customers.

Overall resource usage of Databricks Terraform Provider across AWS and Azure clouds.

Overall resource usage from all clouds

The aim of this provider is to support all Databricks APIs on Azure and AWS. This allows cloud infrastructure engineers to automate the most complicated parts of their data and AI platforms. The vast majority of the initial user group uses this provider to set up their clusters and jobs. Customers are also using it to provision workspaces on AWS and configure data access. Workspace setup resources are usually used only at the beginning of a deployment, along with virtual network setup.

Controlling compute resources and monetary spend

From a compute perspective, the provider makes it simple to create a cluster for interactive analysis or a job to run a production workload with guaranteed installation of libraries. It’s also quite simple to create and modify an instance pool, potentially with reserved instances, so that your clusters can start up several times faster and cost you less $$$.

Managing the cost of compute resources in Databricks data science workspaces is a top concern for platform admins. And for large organisations, managing all of these compute resources across multiple workspaces comes with a bit of overhead. To address this, the provider makes it easier to implement scalable cluster management using cluster policies and the HashiCorp Configuration Language (HCL), as sketched below.
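
A minimal sketch of such a policy in HCL follows; the policy name and attribute limits are illustrative, so check the provider documentation for the current schema.

resource "databricks_cluster_policy" "marketing" {
    name = "Marketing Autoscaling Policy"
    definition = jsonencode({
        # Force auto-termination and cap the approximate hourly cost
        "autotermination_minutes" = { "type" = "fixed", "value" = 20, "hidden" = true },
        "dbus_per_hour"           = { "type" = "range", "maxValue" = 10 }
    })
}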

resource "databricks_cluster" "shared_autoscaling" {
    cluster_name            = "Shared Autoscaling"
    instance_pool_id        = databricks_instance_pool.this.id
    spark_version           = "6.6.x-scala2.11"
    autotermination_minutes = 10

    autoscale {
        min_workers = 1
        max_workers = 1000
    }

    library {
        maven {
            coordinates = "com.amazon.deequ:deequ:1.0.4"
        }
    }

    init_scripts {
        dbfs {
            destination = databricks_dbfs_file.show_variables.path
        }
    }

    custom_tags = {
        Department = "Marketing"
    }
}

Controlling data access

From a workspace security perspective, administrators can configure different groups of users with different access rights and even add users; a minimal sketch follows. The general recommendation is to let Terraform manage groups, including their workspace and data access rights, while leaving group membership management to your identity provider with SSO or SCIM provisioning.
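
Here is a minimal sketch using the resource names the provider exposed at the time of writing; the user, group, and attribute values are illustrative, and exact attribute names may vary between provider versions.

resource "databricks_scim_user" "analyst" {
    user_name    = "analyst@example.com"
    display_name = "Data Analyst"
}

resource "databricks_scim_group" "datascience" {
    display_name = "Data Science"
    members      = [databricks_scim_user.analyst.id]
}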

For the sensitive data sources, one should create secret scopes to store the external API credentials in a secure manner. The secrets are redacted by default in the notebooks, and one could also manage access to those using access control lists. If you already use Hashicorp Vault, AWS Secrets Manager or Azure Key Vault, you can populate Databricks secrets from there and have them be usable for your AI and Advanced Analytics use cases. If you have workspace security enabled, permissions can be the single source of truth for managing user or group access to clusters (and their policies), jobs, instance pools, notebooks and other Databricks objects.

resource "databricks_permissions" "grant_policy_usage" {
    cluster_policy_id = databricks_cluster_policy.something_simple.id
    access_control {
        group_name       = databricks_scim_group.datascience.display_name
        permission_level = "CAN_USE"
    }
}

From a data security perspective, one could manage AWS EC2 instance profiles in a workspace and assign them to only the relevant groups of users. The key thing to note here is that you can define all of these cross-platform components (AWS and Databricks) in the same language and code base, with Terraform managing the intricate dependencies.

// now you can do `%fs ls /mnt/experiments` in notebooks
resource "databricks_s3_mount" "this" {
    instance_profile = databricks_instance_profile.ds.id
    s3_bucket_name = aws_s3_bucket.this.bucket
    mount_name = "experiments"
}

The integration also facilitates mounting object storage within the workspace into a “normal” file system for the supported storage types on AWS and Azure.

Managing workspaces

It is possible to create Azure Databricks workspaces using azurerm_databricks_workspace (this resource is part of the Azure provider that’s officially supported by Hashicorp). Customers interested in provisioning a setup conforming to their enterprise governance policy could follow this working example with Azure Databricks VNet injection.

Reference Architecture for AWS with data and network security across multiple Databricks workspaces

With general availability of the E2 capability, our AWS customers can now leverage enhanced security features and create workspaces within their own fully managed VPCs. Customers can configure a network resource that defines the subnets and security groups within an existing VPC. They can then create a cross-account role and register it as a credentials resource to grant Databricks the relevant permissions to provision compute resources within the provided VPC. A storage configuration resource can be used to configure the root bucket. A sketch of these supporting resources appears below, followed by the workspace resource itself.
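
As a hedged sketch, the credentials and storage configuration resources referenced by the workspace below might look like the following; the variable names are illustrative, so consult the provider documentation for the full schema.

resource "databricks_mws_credentials" "this" {
    provider         = databricks.mws
    account_id       = var.account_id
    credentials_name = "${var.prefix}-creds"
    role_arn         = var.cross_account_role_arn
}

resource "databricks_mws_storage_configurations" "this" {
    provider                   = databricks.mws
    account_id                 = var.account_id
    storage_configuration_name = "${var.prefix}-storage"
    bucket_name                = var.root_bucket
}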

// create workspace in given VPC with DBFS on root bucket
resource "databricks_mws_workspaces" "this" {
    provider        = databricks.mws
    account_id      = var.account_id
    workspace_name  = var.prefix
    deployment_name = var.prefix
    aws_region      = var.region
    
    credentials_id = databricks_mws_credentials.this.credentials_id
    storage_configuration_id = databricks_mws_storage_configurations.this.storage_configuration_id
    network_id = databricks_mws_networks.this.network_id
    verify_workspace_runnning = true
}

Please follow this complete example with a new VPC and new workspace setup. Pay special attention to the fact that there are two different instances of the Databricks provider – one for deploying workspaces (with host=https://accounts.cloud.databricks.com/) and another for managing Databricks objects within the provisioned workspace. If you would like to manage provisioning of workspaces as well as clusters within those workspaces in the same Terraform module (essentially the same directory), you should use the provider aliasing feature of Terraform. We strongly recommend having separate Terraform modules for provisioning the workspace (including generating the initial PAT token) and for managing resources within the workspace. This is because the Databricks APIs are nearly the same across all cloud providers, but workspace creation may be cloud specific. Once the PAT token has been created after workspace provisioning, it can be used in other modules to provision the relevant objects within the workspace.

Provider quality and support

The provider has been developed as part of the Databricks Labs initiative and has established issue tracking through GitHub. Pull requests are always welcome. The code undergoes heavy integration testing with each release and has significant unit test coverage. The goal is also to make sure every possible Databricks resource and data source definition is documented.

We extensively test all of the resources for all of the supported cloud providers through a set of integration tests before every release. We mainly test with Terraform 0.12, though soon we’ll switch to testing with 0.13 as well.

What’s Next?

Stay tuned for related blog posts in the future. You can also watch our on-demand webinar discussing how to Simplify, Secure, and Scale your Enterprise Cloud Data Platform on AWS and Azure Databricks in an automated way.

Read more

--

Try Databricks for free. Get started today.

The post Announcing Databricks Labs Terraform integration on AWS and Azure appeared first on Databricks.

Easily Clone your Delta Lake for Testing, Sharing, and ML Reproducibility

Introducing Clones

An efficient way to make copies of large datasets for testing, sharing and reproducing ML experiments 

We are excited to introduce a new capability in Databricks Delta Lake – table cloning. Creating copies of tables in a data lake or data warehouse has several practical uses. However, given the volume of data in tables in a data lake and the rate of its growth, making physical copies of tables is an expensive operation. Databricks Delta Lake now makes the process simpler and cost-effective with the help of table clones.

Databricks Delta Lake table cloning abstracts the complexity and cost from cloning for testing, sharing and ML reproducibility.

What are clones anyway?

Clones are replicas of a source table at a given point in time. They have the same metadata as the source table: same schema, constraints, column descriptions, statistics, and partitioning. However, they behave as a separate table with a separate lineage or history. Any changes made to clones only affect the clone and not the source. Any changes that happen to the source during or after the cloning process also do not get reflected in the clone due to Snapshot Isolation. In Databricks Delta Lake we have two types of clones: shallow or deep.

Shallow Clones

A shallow (also known as Zero-Copy) clone only duplicates the metadata of the table being cloned; the data files of the table itself are not copied. This type of cloning does not create another physical copy of the data resulting in minimal storage costs. Shallow clones are inexpensive and can be extremely fast to create. These clones are not self-contained and depend on the source from which they were cloned as the source of data. If the files in the source that the clone depends on are removed, for example with VACUUM, a shallow clone may become unusable. Therefore, shallow clones are typically used for short-lived use cases such as testing and experimentation.

Deep Clones

Shallow clones are great for short-lived use cases, but some scenarios require a separate and independent copy of the table’s data. A deep clone makes a full copy of the metadata and data files of the table being cloned. In that sense it is similar in functionality to copying with a CTAS command (CREATE TABLE ... AS SELECT ...). But it is simpler to specify, since it makes a faithful copy of the original table at the specified version and you don’t need to re-specify partitioning, constraints, and other information as you have to do with CTAS. In addition, it is much faster, more robust, and can work incrementally against failures!

With deep clones, we copy additional metadata, such as your streaming application transactions and COPY INTO transactions, so you can continue your ETL applications exactly where they left off on a deep clone!

Where do clones help?

Sometimes I wish I had a clone to help with my chores or magic tricks. However, we’re not talking about human clones here. There are many scenarios where you need a copy of your datasets – for exploring, sharing, or testing ML models or analytical queries. Below are some example customer use cases.

Testing and experimentation with a production table

When users need to test a new version of their data pipeline they often have to rely on sample test datasets which are not representative of all the data in their production environment. Data teams may also want to experiment with various indexing techniques to improve performance of queries against massive tables. These experiments and tests cannot be carried out in a production environment without risking production data processes and affecting users.

It can take many hours or even days to spin up copies of your production tables for a test or development environment. Add to that the extra storage costs for your development environment to hold all the duplicated data – there is a large overhead in setting up a test environment that is reflective of the production data. With a shallow clone, this is trivial:

-- SQL
CREATE TABLE delta.`/some/test/location` SHALLOW CLONE prod.events

# Python
DeltaTable.forName(spark, "prod.events").clone("/some/test/location", isShallow=True)

// Scala
DeltaTable.forName(spark, "prod.events").clone("/some/test/location", isShallow=true)

After creating a shallow clone of your table in a matter of seconds, you can start running a copy of your pipeline to test out your new code, or try optimizing your table in different dimensions to see how you can improve your query performance, and much much more. These changes will only affect your shallow clone, not your original table.

Staging major changes to a production table

Sometimes, you may need to perform some major changes to your production table. These changes may consist of many steps, and you don’t want other users to see the changes which you’re making until you’re done with all of your work. A shallow clone can help you out here:

-- SQL
CREATE TABLE temp.staged_changes SHALLOW CLONE prod.events;
DELETE FROM temp.staged_changes WHERE event_id is null;
UPDATE temp.staged_changes SET change_date = current_date() WHERE change_date is null;
...
-- Perform your verifications

Once you’re happy with the results, you have two options. If no other change has been made to your source table, you can replace your source table with the clone. If changes have been made to your source table, you can merge the changes into your source table.

-- If no changes have been made to the source
REPLACE TABLE prod.events CLONE temp.staged_changes;
-- If the source table has changed
MERGE INTO prod.events USING temp.staged_changes
ON events.event_id <=> staged_changes.event_id 
WHEN MATCHED THEN UPDATE SET *;
-- Drop the staged table
DROP TABLE temp.staged_changes;

Machine Learning result reproducibility

Coming up with an effective ML model is an iterative process. Throughout this process of tweaking the different parts of the model, data scientists need to assess the accuracy of the model against a fixed dataset. This is hard to do in a system where the data is constantly being loaded or updated. A snapshot of the data used to train and test the model is required. This snapshot allows the results of the ML model to be reproducible for testing or model governance purposes. We recommend leveraging Time Travel to run multiple experiments across a snapshot; an example of this in action can be seen in Machine Learning Data Lineage with MLflow and Delta Lake. Once you’re happy with the results and would like to archive the data for later retrieval, for example next Black Friday, you can use deep clones to simplify the archiving process. MLflow integrates really well with Delta Lake, and the auto logging feature (mlflow.spark.autolog()) will tell you which version of the table was used to run a set of experiments.

# Run your ML workloads using Python and then
DeltaTable.forName(spark, "feature_store").cloneAtVersion(128, "feature_store_bf2020")

Data Migration

A massive table may need to be moved to a new, dedicated bucket or storage system for performance or governance reasons. The original table will not receive new updates going forward and will be deactivated and removed at a future point in time. Deep clones make the copying of massive tables more robust and scalable.

-- SQL
CREATE TABLE delta.`zz://my-new-bucket/events` CLONE prod.events;
ALTER TABLE prod.events SET LOCATION 'zz://my-new-bucket/events';

With deep clones, since we copy your streaming application transactions and COPY INTO transactions, you can continue your ETL applications from exactly where it left off after this migration!

Data Sharing

In an organization, it is often the case that users from different departments are looking for data sets that they can use to enrich their analysis or models. You may want to share your data with other users across the organization. But rather than setting up elaborate pipelines to move the data to yet another store, it is often easier and more economical to create a copy of the relevant data set for users to explore and test, to see if it is a fit for their needs without affecting your own production systems. Here deep clones again come to the rescue.

-- The following code can be scheduled to run at your convenience
CREATE OR REPLACE TABLE data_science.events CLONE prod.events;

Data Archiving

For regulatory or archiving purposes, all data in a table needs to be preserved for a certain number of years, while the active table retains data for a few months. If you want your data to be updated as soon as possible but also have a requirement to keep data for several years, storing this data in a single table and performing time travel may become prohibitively expensive. In this case, archiving your data in a daily, weekly or monthly manner is a better solution. The incremental cloning capability of deep clones will really help you here.

-- The following code can be scheduled to run at your convenience
CREATE OR REPLACE TABLE archive.events CLONE prod.events;

Note that this table will have an independent history compared to the source table, therefore time travel queries on the source table and the clone may return different results based on your frequency of archiving.

Looks awesome! Any gotchas?

Just to reiterate some of the gotchas mentioned above as a single list, here’s what you should be wary of:

  • Clones are executed on a snapshot of your data. Any changes that are made to the source table after the cloning process starts will not be reflected in the clone.
  • Shallow clones are not self-contained tables like deep clones. If the data is deleted in the source table (for example through VACUUM), your shallow clone may not be usable.
  • Clones have a separate, independent history from the source table. Time travel queries on your source table and clone may not return the same result.
  • Shallow clones do not copy stream transactions or COPY INTO metadata. Use deep clones to migrate your tables and continue your ETL processes from where it left off.

How can I use it?

Shallow and deep clones support new advances in how data teams test and manage their modern cloud data lakes and warehouses. Table clones can now help your team implement production-level testing of their pipelines, fine-tune their indexing for optimal query performance, and create table copies for sharing – all with minimal overhead and expense. If this is a need in your organization, we hope you will take table cloning for a spin and give us your feedback – we look forward to hearing about new use cases and extensions you would like to see in the future.

The feature is available in Databricks 7.2 as a public preview for all customers. Learn more about the feature. To see it in action, sign up for a free trial of Databricks.

--

Try Databricks for free. Get started today.

The post Easily Clone your Delta Lake for Testing, Sharing, and ML Reproducibility appeared first on Databricks.


Building a Modern Risk Management Platform in Financial Services

This blog was collaboratively written with Databricks partner Avanade. A special thanks to Dael Williamson, Avanade CTO, for his contributions.

Financial Institutions today are still struggling to keep up with the emerging risks and threats facing their business. Managing risk, especially within the banking sector, has increased in complexity over the past several years.

First, new frameworks (such as FRTB) are being introduced that potentially require tremendous computing power and an ability to analyze years of historical data. Second, regulators are demanding more transparency and explainability from the banks they oversee. Finally, the introduction of new technologies and business models means that the need for sound risk governance is at an all-time high. However, effectively meeting these demands has not been an easy undertaking for the banking industry.

Agile approach to risk management

Traditional banks relying on on-premises infrastructure can no longer effectively manage risk. Banks must abandon the computational inefficiencies of legacy technologies and build an agile Modern Risk Management practice capable of rapidly responding to market and economic volatility using data and advanced analytics.

Our work with clients shows that as new threats, such as the last decade’s financial crisis, emerge, historical data and aggregated risk models quickly lose their predictive value. Luckily, modernization is possible today with open-source technologies powered by cloud-native big data infrastructure that bring an agile and forward-looking approach to financial risk analysis and management.

Traditional datasets limit transparency and reliability

Risk analysts must augment traditional data with alternative datasets to explore new ways of identifying and quantifying the risk factors facing their business, both at scale and in real time. Risk management teams must be able to efficiently scale their simulations from tens of thousands up to millions by leveraging both the flexibility of cloud compute and the robustness of open-source computing frameworks like Apache Spark™.

They must accelerate model development lifecycle by bringing together both the transparency of their experiment and the reliability in their data, bridging the gap between science and engineering and enabling banks to have a more robust approach to risk management.

Data organization is critical to understanding and mitigating risk

How data is organized and collected is critical to creating highly reliable, flexible and accurate data models. This is particularly important when it comes to creating financial risk models for areas such as wealth management and investment banking.

In the financial world, risk management is the process of identification, analysis and acceptance or mitigation of uncertainty in investment decisions.

When data is organized and designed to flow within an independent pipeline, separate from massive dependencies and sequential tools, the time to run financial risk models is significantly reduced. Data is more flexible, easier to slice and dice, so institutions can apply their risk portfolio at a global and regional level as well as firmwide.

Plagued by the limitations of on-premises infrastructure and legacy technologies, banks particularly have not had the tools until recently to effectively build a modern risk management practice. A modern risk management framework enables intraday views, aggregations on demand and an ability to future proof/scale risk assessment and management.

Replace historical returns with highly accurate predictive models

Financial risk modeling should include multiple data sources to create more predictive financial and credit risk models. A modern risk and portfolio management practice should not be solely based on historical returns but also must embrace the variety of information available today.

For example, a white paper from Atkins et al describes how financial news can be used to predict stock market volatility better than close price. As indicated in the white paper, the use of alternative data can dramatically augment the intelligence for risk analysts to have a more descriptive lens of modern economy, enabling them to better understand and react to exogenous shocks in real time.

A modern risk management model in the cloud

Avanade and Databricks have demonstrated how Apache Spark, Delta Lake and MLflow can be used in the real world to organize and rapidly deploy data into a value-at-risk (VAR) data model. This enables financial institutions to modernize their risk management practices into the cloud and adopt a unified approach to data analytics with Databricks.

Using the flexibility and scale of cloud compute and the level of interactivity in an organization’s data, clients can better understand the risks facing their business and quickly develop accurate financial market risk calculations. With Avanade and Databricks, businesses can identify how much risk can be decreased and then accurately pinpoint where and how they can quickly apply risk measures to reduce their exposure.

Join us at the Modern Data Engineering with Azure Databricks Virtual Event on October 8th to hear Avanade present on how Avanade and Databricks can help you manage risk through our Financial Services Risk Management model. Sign up here today.

--

Try Databricks for free. Get started today.

The post Building a Modern Risk Management Platform in Financial Services appeared first on Databricks.

Automate Azure Databricks Platform Provisioning and Configuration

Introduction

In our previous blog, we discussed the practical challenges related to scaling out a data platform across multiple teams and how a lack of automation adversely affects innovation and slows down go-to-market. Enterprises need consistent and scalable solutions that use repeatable templates to seamlessly comply with enterprise governance policies, with the goal of bootstrapping unified data analytics environments across data teams. With Microsoft Azure Databricks, we’ve taken an API-first approach for all objects that enables quick provisioning and bootstrapping of cloud data environments, by integrating into existing enterprise DevOps tooling without requiring customers to reinvent the wheel. In this article, we will walk through such a cloud deployment automation process using different Azure Databricks APIs.

The process for configuring an Azure Databricks data environment looks like the following:

  1. Deploy Azure Databricks Workspace
  2. Provision users and groups
  3. Create clusters policies and clusters
  4. Add permissions for users and groups
  5. Secure access to workspace within corporate network (IP Access List)
  6. Platform access token management

To accomplish the above, we will be using APIs for the following IaaS features or capabilities available as part of Azure Databricks:

  1. Token Management API allows admins to manage their users’ cloud service provider personal access tokens (PAT), including:
    1. Monitor and revoke users’ personal access tokens.
    2. Control the lifetime of future tokens in your public cloud workspace.
    3. Control which users can create and use PATs.
  2. AAD Token Support allows the use of AAD tokens to invoke the Azure Databricks APIs. One could also use Service Principals as first-class identities.
  3. IP Access Lists ensure that users can only connect to Azure Databricks through privileged networks thus forming a secure perimeter.
  4. Cluster policies are a construct that simplifies cluster management across workspace users, where admins can also enforce different security and cost control measures.
  5. Permissions API allows automation to set access control on different Azure Databricks objects like Clusters, Jobs, Pools, Notebooks, Models etc.

Automation options

There are a few options available to use the Azure Databricks APIs:

  • Databricks Terraform Resource Provider can be combined with the Azure provider to create an end-to-end architecture, utilizing Terraform’s dependency and state management features.
  • Python (or any other programming language) can be used to invoke the APIs (sample solution), providing a way to integrate with third-party or homegrown DevOps tooling.
  • A ready-made API client like Postman can be used to invoke the API directly.

To keep things simple, we’ll use the Postman approach below.
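If you prefer the Python route mentioned above, the general pattern is a thin wrapper around the REST endpoints. The sketch below is a minimal, hypothetical helper (the workspace URL and token are placeholders you would obtain in the later steps) that lists clusters through the Clusters API; it is an illustration, not part of the Postman collection.

    import requests

    # Placeholders: obtain these values in the provisioning steps described below.
    WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
    TOKEN = "<AAD token or Databricks PAT>"

    def databricks_get(path: str, token: str = TOKEN) -> dict:
        """Minimal GET helper for the Azure Databricks REST API."""
        resp = requests.get(
            f"{WORKSPACE_URL}{path}",
            headers={"Authorization": f"Bearer {token}"},
        )
        resp.raise_for_status()
        return resp.json()

    # Example: list clusters in the workspace.
    print(databricks_get("/api/2.0/clusters/list"))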

Common workflow

  1. Use an Azure AD service principal to create an Azure Databricks workspace.
  2. Use the service principal identity to set up IP Access Lists to ensure that the workspace can only be accessed from privileged networks.
  3. Use the service principal identity to set up cluster policies to simplify the cluster creation workflow. Admins can define a set of policies that can be assigned to specific users or groups.
  4. Use the service principal identity to provision users and groups using the SCIM API (an alternative to SCIM provisioning from AAD).
  5. Use the service principal identity to limit users’ personal access token (PAT) permissions using the Token Management API.
  6. All users (non-service-principal identities) will use Azure AD tokens to connect to the workspace APIs. This ensures conditional access (and MFA) is always enforced.

Pre-Requisites

Create Azure Resource Group and Virtual Network

Please go ahead and pre-create an Azure resource group. We will be deploying the Azure Databricks workspace in a customer-managed virtual network (VNET). VNET pre-creation is optional; please refer to this guide to understand the VNET requirements.

Provision Azure Application / Service Principal

We will be using an Azure service principal to automate the deployment process. Please create a service principal using this guide, generate a new client secret, and make sure to note down the following details:

  • Client Id
  • Client Secret (secret generated for the service principal)
  • Azure Subscription Id
  • Azure Tenant Id

Assign Role to Service Principal

Navigate to the Azure resource group where you plan to deploy the Azure Databricks workspace and add the “Contributor” role to your service principal.

Configure Postman Environment

We will be using the Azure Databricks ARM REST API option to provision a workspace. This is not to be confused with the REST API for different objects within a workspace.

Download the Postman collection from here. To run it, click the Run in Postman button.

Using the automation accelerator to automate the end-to-end set up of Azure Databricks in Postman.

The collection consists of several sections:


The environment config file is already imported into Postman; edit it by clicking the “gear” button.

Example environment configuration in Postman.

Configure the environment according to your settings:

Environment configuration settings available to Azure Databricks users.

Azure subscription details
  • tenantId: Azure Tenant ID (locate it here)
  • subscriptionId: Azure Subscription ID (locate it here)
  • clientCredential: Service principal secret
  • clientId: Service principal ID
  • resourceGroup: Resource group name (user defined)

Constants used
  • managementResource: https://management.core.windows.net/ (constant, more details here)
  • databricksResourceId: 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d (constant; the unique applicationId that identifies the Azure Databricks workspace resource inside Azure)

Azure Databricks deployment (ARM template) variables
  • workspaceName (e.g. adb-dev-workspace): unique name given to the Azure Databricks workspace
  • VNETCidr (e.g. 11.139.13.0/24): more details here
  • VNETName (e.g. adb-VNET): unique name given to the VNET where Azure Databricks is deployed; if the VNET already exists it is used, otherwise a new one is created
  • publicSubnetName (e.g. adb-dev-pub-sub): unique name given to the public subnet within the VNET. We highly recommend letting the ARM template create this subnet rather than pre-creating it.
  • publicSubnetCidr (e.g. 11.139.13.64/26): more details here
  • privateSubnetName (e.g. adb-dev-pvt-sub): unique name given to the private subnet within the VNET. We highly recommend letting the ARM template create this subnet rather than pre-creating it.
  • privateSubnetCidr (e.g. 11.139.13.128/26): more details here
  • nsgName (e.g. adb-dev-workspace-nsg): Network Security Group attached to the Azure Databricks subnets
  • pricingTier: premium or standard (more details here); the IP Access List feature requires the premium tier

Workspace tags
  • tag1 (e.g. dept101): demonstrates how to set tags on the Azure Databricks workspace

Provision Azure Databricks Workspace

Generate AAD Access Token

We will be using an Azure AD access token to deploy the workspace, utilizing the OAuth client credentials workflow (also referred to as two-legged OAuth), which accesses web-hosted resources using the identity of an application. This type of grant is commonly used for server-to-server interactions that must run in the background, without immediate interaction with a user.

Cloud provisioning the Azure Databricks workspace using the OAuth Client Credential workflow.

Executing “generate aad token for management resource” returns an AAD access token, which will be used to deploy the Azure Databricks workspace and to retrieve the deployment status. The access token is valid for 599 seconds by default; if you run into token expiry issues, simply rerun this API call to regenerate the access token.
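Outside of Postman, the same call can be reproduced in a few lines of Python. This is a minimal sketch of the client credentials request against the Azure AD token endpoint; the tenant ID, client ID and client secret are the values captured earlier.

    import requests

    tenant_id = "<tenantId>"
    client_id = "<clientId>"
    client_secret = "<clientCredential>"

    # Client credentials grant against the Azure AD token endpoint,
    # scoped to the Azure management resource.
    resp = requests.post(
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "resource": "https://management.core.windows.net/",
        },
    )
    resp.raise_for_status()
    management_token = resp.json()["access_token"]  # valid for ~599 seconds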

Deploy Workspace using the ARM template

An ARM template is used to deploy the Azure Databricks workspace. The template is passed as the request body payload of the “provision databricks workspace” step inside the Provisioning Workspace section, as highlighted above.

Deployment of the Azure Databricks workspace using the ARM template.

If the subnets specified in the ARM template already exist, they will be reused; otherwise they will be created for you. The Azure Databricks workspace will be deployed within your VNET, and a default Network Security Group will be created and attached to the subnets used by the workspace.
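For reference, the equivalent call outside Postman is a PUT against the Azure Resource Manager deployments endpoint, with the ARM template and its parameters as the request body. The sketch below assumes the template has been saved locally as template.json and reuses the management token generated above; the deployment name and parameter values are illustrative.

    import json
    import requests

    subscription_id = "<subscriptionId>"
    resource_group = "<resourceGroup>"
    deployment_name = "adb-workspace-deployment"    # illustrative name
    management_token = "<AAD token for the management resource>"

    with open("template.json") as f:                # the ARM template used by the collection
        template = json.load(f)

    body = {
        "properties": {
            "mode": "Incremental",
            "template": template,
            "parameters": {
                "workspaceName": {"value": "adb-dev-workspace"},
                "pricingTier": {"value": "premium"},
            },
        }
    }

    resp = requests.put(
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}/providers/Microsoft.Resources"
        f"/deployments/{deployment_name}?api-version=2020-06-01",
        headers={"Authorization": f"Bearer {management_token}"},
        json=body,
    )
    resp.raise_for_status()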

Get workspace URL

Workspace deployment takes approximately 5-8 minutes. Executing the “get deployment status and workspace url” call returns the workspace URL, which we’ll use in subsequent calls.

Using “Get workspace URL” to return the Azure Databricks workspace URL to use in subsequent calls.

We set a global variable called “workspaceUrl” inside the test step by extracting the value from the response. We use this global variable in subsequent API calls.
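The same value can be retrieved programmatically by reading the workspace resource once the deployment has succeeded; the workspace URL is returned under the resource properties. A minimal sketch, with placeholder identifiers:

    import requests

    subscription_id = "<subscriptionId>"
    resource_group = "<resourceGroup>"
    management_token = "<AAD token for the management resource>"

    resp = requests.get(
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}/providers/Microsoft.Databricks"
        f"/workspaces/adb-dev-workspace?api-version=2018-04-01",
        headers={"Authorization": f"Bearer {management_token}"},
    )
    resp.raise_for_status()
    workspace_url = "https://" + resp.json()["properties"]["workspaceUrl"]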

A note on using Azure Service Principal as an identity in Azure Databricks

Please note that an Azure service principal is considered a first-class identity in Azure Databricks and as such can invoke all of the APIs. What sets service principals apart from user identities is that they do not have access to the web application UI, i.e., they cannot log into the workspace web application and perform UI functions the way a typical user would. Service principals are primarily used to invoke APIs in a headless fashion.

Generate Access Token for Auth

To authenticate with and access the Azure Databricks REST APIs, we can use either of the following:

  • AAD access token generated for the service principal
    • Access token is managed by Azure AD
    • Default expiry is 599 seconds
  • Azure Databricks Personal Access Token generated for the service principal
    • Platform access token is managed by Azure Databricks
    • Default expiry is set by the user, usually in days or months

In this section we demonstrate the usage of both of these tokens.
Access to and authentication for Azure Databricks APIs are provided by the AAD access and Azure Databricks Personal Access tokens.

Generate AAD Access Token For Azure Databricks API Interaction

To generate an AAD token for the service principal, we’ll use the client credentials flow for the AzureDatabricks login application resource, which is uniquely identified by the resource ID 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d.

Generating AAD Access token for Azure Databricks API interaction.

The response contains an AAD access token. We’ll set a global variable “access_token” by extracting this value.

The response with the AAD access token will allow Azure Databricks users to set up a global variable “access_token” with the extracted value.

Please note that the AAD access token generated here is slightly different from the one we generated earlier to create the workspace: the AAD token for workspace deployment is generated for the Azure management resource, whereas the AAD access token used to interact with the APIs is for the Azure Databricks workspace resource.
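In Python, the only change compared to the earlier management-token sketch is the resource parameter, which now points at the Azure Databricks application ID:

    import requests

    tenant_id = "<tenantId>"
    client_id = "<clientId>"
    client_secret = "<clientCredential>"

    resp = requests.post(
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            # Constant applicationId identifying the Azure Databricks resource.
            "resource": "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d",
        },
    )
    resp.raise_for_status()
    access_token = resp.json()["access_token"]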

Generate Azure Databricks Platform Token

To generate an Azure Databricks platform access token for the service principal, we’ll use the access_token generated in the last step for authentication.

With the AAD access token value, users can generate the Azure Databricks platform access token for the service principal.

Executing “generate databricks platform token for service principal” returns a platform access token; we then set a global environment variable called sp_pat based on this value. To keep things simple, we will use sp_pat for authentication for the rest of the API calls.

With the Databricks platform access token, the Azure Databricks user can then set a global environment variable called sp_pat based on that value.
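The equivalent REST call is a POST to the workspace Token API, authenticated with the AAD access token generated above. This minimal sketch assumes the service principal has already been added to the workspace as an admin; the comment and lifetime values are illustrative.

    import requests

    workspace_url = "https://<workspace-url>"   # from the deployment step
    access_token = "<AAD token for the Azure Databricks resource>"

    resp = requests.post(
        f"{workspace_url}/api/2.0/token/create",
        headers={"Authorization": f"Bearer {access_token}"},
        json={"comment": "sp-automation", "lifetime_seconds": 86400},
    )
    resp.raise_for_status()
    sp_pat = resp.json()["token_value"]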
Users and Groups Management

The SCIM API allows you to manage:

  • Users (individual identities)
  • Azure service principals
  • Groups of users and/or service principals

Provision users and groups using SCIM API

Azure Databricks supports SCIM, or System for Cross-domain Identity Management, an open standard that allows you to automate user provisioning using a REST API and JSON. The Azure Databricks SCIM API follows version 2.0 of the SCIM protocol.

  • An Azure Databricks administrator can invoke all `SCIM API` endpoints.
  • Non-admin users can invoke the Me Get endpoint, the `Users Get` endpoint to read user display names and IDs, and the Group Get endpoint to read group display names and IDs.

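As an illustration, the call below creates a single user through the SCIM API using the sp_pat token generated earlier; the user name is a placeholder.

    import requests

    workspace_url = "https://<workspace-url>"
    sp_pat = "<Databricks platform token>"

    resp = requests.post(
        f"{workspace_url}/api/2.0/preview/scim/v2/Users",
        headers={
            "Authorization": f"Bearer {sp_pat}",
            "Content-Type": "application/scim+json",
        },
        json={
            "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
            "userName": "new.user@example.com",   # placeholder identity
        },
    )
    resp.raise_for_status()
    print(resp.json()["id"])   # SCIM id of the newly created user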

Manage PAT using Token Management API

Token Management provides Azure Databricks administrators with more insight and control over Personal Access Tokens in their workspaces. Please note that this does not apply to AAD tokens as they are managed within Azure AD.

Azure Databricks Token Management provides administrators with insight and control over Personal Access Tokens in their workspaces.

By monitoring and controlling token creation, you reduce the risk of lost tokens or long-lasting tokens that could lead to data exfiltration from the workspace.

The control and management of token creation made possible by Azure Databricks reduces the risk of lost tokens or long-lasting tokens that could lead to data exfiltration from the workspace.
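For example, an admin (or the service principal) can enumerate all PATs in the workspace and revoke a specific one. A minimal sketch; the token ID is a placeholder.

    import requests

    workspace_url = "https://<workspace-url>"
    sp_pat = "<Databricks platform token>"
    headers = {"Authorization": f"Bearer {sp_pat}"}

    # List all personal access tokens in the workspace.
    tokens = requests.get(
        f"{workspace_url}/api/2.0/token-management/tokens", headers=headers
    ).json()

    # Revoke a specific token by its ID (placeholder value).
    requests.delete(
        f"{workspace_url}/api/2.0/token-management/tokens/<token_id>",
        headers=headers,
    ).raise_for_status()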

Cluster Policies

A cluster policy limits the ability to create clusters based on a set of rules. A policy defines those rules as limitations on the attributes used for cluster creation. Cluster policies have ACLs that limit their use to specific users and groups. For more details, please refer to our blog on cluster policies.

Azure Databricks cluster policies limit the ability to create clusters based on a set of rules.

Only admin users can create, edit, and delete policies. Admin users also have access to all policies.

In Azure Databricks, only admin users can create, edit, and delete policies.
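A policy is created through the Cluster Policies API; the definition is a JSON document keyed by cluster attribute paths. The sketch below is illustrative and only pins the Spark version and caps auto-termination; consult the policy definition reference for the full rule language.

    import json
    import requests

    workspace_url = "https://<workspace-url>"
    sp_pat = "<Databricks platform token>"

    policy_definition = {
        # Pin the runtime version for every cluster created under this policy.
        "spark_version": {"type": "fixed", "value": "7.3.x-scala2.12"},
        # Force auto-termination to at most two hours.
        "autotermination_minutes": {"type": "range", "maxValue": 120},
    }

    resp = requests.post(
        f"{workspace_url}/api/2.0/policies/clusters/create",
        headers={"Authorization": f"Bearer {sp_pat}"},
        json={"name": "small-dev-clusters", "definition": json.dumps(policy_definition)},
    )
    resp.raise_for_status()
    policy_id = resp.json()["policy_id"]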

Cluster Permissions

The Cluster Permissions API allows setting permissions for users and groups on clusters (both interactive and job clusters). The same process can be used for Jobs, Pools, Notebooks, Folders, the Model Registry and Tokens.

Common use cases

  • Clusters are created based on policies, and admins want to give a user or a group permission to view cluster logs or job output.
  • Assigning “Can Attach” permissions to users for jobs submitted through a centralized orchestration mechanism, so they can view the job’s Spark UI and logs (see the sketch after this list). This can be achieved today for jobs created through the jobs/create endpoint and run via run-now or scheduled runs: the centralized automation service can retrieve the cluster_id when the job is run and set permissions on it.
  • Permission levels have been augmented to include permissions for all the supported objects, i.e., Jobs, Pools, Notebooks, Folders, Model Registry and Tokens.
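As a sketch of the second use case above, the call below grants a group “Can Attach To” permission on a cluster through the Permissions API; the cluster ID and group name are placeholders.

    import requests

    workspace_url = "https://<workspace-url>"
    sp_pat = "<Databricks platform token>"
    cluster_id = "<cluster_id>"   # e.g. retrieved by the orchestration service at job run time

    resp = requests.patch(
        f"{workspace_url}/api/2.0/preview/permissions/clusters/{cluster_id}",
        headers={"Authorization": f"Bearer {sp_pat}"},
        json={
            "access_control_list": [
                {"group_name": "data-analysts", "permission_level": "CAN_ATTACH_TO"}
            ]
        },
    )
    resp.raise_for_status()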

IP Access List

You may have a security policy that mandates that all access to Azure Databricks workspaces goes through your network and web application proxy. Configuring IP Access Lists ensures that employees have to connect via the corporate VPN before accessing a workspace.

Azure Databricks allows for the configuring of IP Access Lists,  ensuring that employees have to connect via corporate VPN before accessing a workspace.

This feature provides Azure Databricks admins a way to set an `allowlist` and `blocklist` for the `CIDR / IPs` that can access a workspace.

The Azure Databricks IP Access List feature provides admins a way to set `allowlist` and `blocklist` for `CIDR / IPs` that could access a workspace.
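As a sketch, an admin would first enable the feature through the workspace configuration endpoint and then register an allow list; the CIDR range and label below are placeholders.

    import requests

    workspace_url = "https://<workspace-url>"
    sp_pat = "<Databricks platform token>"
    headers = {"Authorization": f"Bearer {sp_pat}"}

    # Enable IP access lists on the workspace.
    requests.patch(
        f"{workspace_url}/api/2.0/workspace-conf",
        headers=headers,
        json={"enableIpAccessLists": "true"},
    ).raise_for_status()

    # Allow connections only from the corporate VPN range (placeholder CIDR).
    requests.post(
        f"{workspace_url}/api/2.0/ip-access-lists",
        headers=headers,
        json={
            "label": "corp-vpn",
            "list_type": "ALLOW",
            "ip_addresses": ["11.139.13.0/24"],
        },
    ).raise_for_status()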

Azure Databricks platform APIs not only enable data teams to provision and secure enterprise-grade data platforms, but also help automate some of the most mundane yet crucial tasks, from user onboarding to setting up a secure perimeter around these platforms.

As the unified data analytics platform is scaled across data teams, challenges around workspace provisioning, resource configuration, overall management and compliance with enterprise governance multiply for admins. End-to-end automation is a highly recommended best practice to address such concerns and to achieve better repeatability and reproducibility across the board.

We want to make workspace administration super simple, so that you get to do more and focus on solving some of the world’s toughest data challenges.

Troubleshooting

Expired token

    Error while parsing token: io.jsonwebtoken.ExpiredJwtException: JWT expired at 2019-08-08T13:28:46Z. Current time: 2019-08-08T16:19:10Z, a difference of 10224117 milliseconds. Allowed clock skew: 0 milliseconds.

Please rerun the “generate aad token for management resource” step to regenerate the management access token. The token has a time-to-live of 599 seconds.

Rate Limits

The Azure Databricks REST API supports a maximum of 30 requests/second per workspace. Requests that exceed the rate limit will receive a 429 response status code.
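When scripting against the API, a simple way to respect this limit is to retry on a 429 response with a short backoff. A minimal sketch:

    import time
    import requests

    def get_with_retry(url: str, headers: dict, max_retries: int = 5) -> requests.Response:
        """GET with basic exponential backoff on 429 (rate limit) responses."""
        for attempt in range(max_retries):
            resp = requests.get(url, headers=headers)
            if resp.status_code != 429:
                return resp
            time.sleep(2 ** attempt)   # back off: 1s, 2s, 4s, ...
        return resp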

Common token issues are listed here, along with mitigations.

--

Try Databricks for free. Get started today.

The post Automate Azure Databricks Platform Provisioning and Configuration appeared first on Databricks.

Registration Open for Inaugural Data + AI Summit Europe


Data + AI Summit Europe will take place 17-19 November and is now open for registration! Formerly known as the Spark + AI Summit, this free, virtual event has been expanded to bring together the world’s leading experts on data analysis, data engineering, data science, machine learning and artificial intelligence (AI) to explore the convergence of big data, analytics, and AI. Whether you’re in Europe or want to realign your body clock to European time zones, we really hope you will join us for three days of keynotes, sessions, demos, AMAs and meetups!


Expanded Topics and Community

Over the last 10 years, data analysts, data scientists and others have joined the Spark community and are working in teams solving complex data challenges. Born out of this community are key open source technologies such as Delta Lake, MLflow, Redash and Koalas – all of which are growing rapidly. We’ve widened the conference programme to cover all these technologies and many others – including Spark – in more depth, and have adapted the name to be more inclusive of the communities starting to form around them.

As data scientists, data engineers, data analysts, developers, chief data officers, industry experts, researchers and ML practitioners, you are invited to attend the Summit and learn from the world’s leading experts on topics such as:

  • AI use cases and opportunities being created by this tech in leading industries
  • Best practices and use cases for Apache Spark™, Delta Lake, MLflow
  • Data engineering, including simplifying and scaling streaming architectures
  • Lakehouse architectural pattern and empowering data lakes with data warehousing concepts
  • SQL analytics and business intelligence (BI)
  • Data science, including the growing influence of the Python ecosystem
  • Machine learning and deep learning applications
  • Productionising machine learning (MLOps)
  • Research on large-scale data analytics and ML

What about Spark?

Spark represents the cornerstone of the Data + AI Summit community. Over the last 10 years, Apache Spark™ has quickly become the open standard for large-scale big data processing. It is widely adopted, and is supported by an active global community to which we remain deeply committed. Data + AI Summit Europe will continue to be the home for the Spark community to gather, share ideas and accelerate the development and adoption of Spark. We invite you to learn from the leading Spark experts in our new Apache Spark track, and will also continue to include Spark content, speakers and activities throughout the event, just as we always have.

Virtual Can Be Awesome!

During the June event, we were excited to see the community push the boundaries of virtual events — with many opportunities for attendees to interact with each other, as well as the speakers and other experts.  While many of you came for the keynotes and sessions, we also held very popular AMAs, MLflow Yoga taught by a Databricks founder, DJ performances and networking through “birds of a feather” networking rooms and attendee direct messaging.

Because the event is virtual, you have the opportunity to watch all content available on demand as soon as it airs.  Whether you’re in Berlin, Amsterdam, London, New York, Tokyo or Mumbai, we encourage you to register and watch both the live sessions and on-demand content.

REGISTER NOW

--

Try Databricks for free. Get started today.

The post Registration Open for Inaugural Data + AI Summit Europe appeared first on Databricks.

It’s a Good Time to be a Brickster!


Despite what has been an incredibly challenging year for so many, both personally and professionally, we are proud to have a more passionate and engaged team of Bricksters than ever. And we’re fortunate that our momentum hasn’t slowed down – if anything, it’s accelerated. Databricks now has nearly 1,500 employees worldwide, and nearly 40% of the Fortune 500 depend on us to simplify data and AI so their data teams can innovate faster. We’re especially proud of the work our customers have done since the onset of the pandemic. Healthcare providers are using Databricks to deliver better care with real-time patient tracking and spread-prediction modeling. Research institutions and pharmaceutical companies are using our platform to accelerate clinical trials and drug discovery. And the entire data community is mobilizing to help combat the crisis by leveraging free datasets and tools that we’ve made available in Databricks Community Edition. Our mission is to help data teams solve the world’s toughest problems, and it has never felt more important or relevant than it does today.

We’re extremely proud and humbled by the many ways in which Databricks has been recognized over the past few months. More than anything, the acknowledgements listed below reflect our team’s passion and dedication towards our mission to help data teams solve the world’s toughest problems. They demonstrate how we’ve maintained our collaborative and inclusive culture even while remote. And they validate our focus and investment in innovation. There has never been a better time to be a Brickster!

Forbes Cloud 100 (#5)

We ranked #5 on the annual Forbes Cloud 100 list this year, up from #13 in 2019. This list ranks the top 100 private cloud companies in the world and is produced in collaboration with Salesforce Ventures and Bessemer Venture Partners. Our high ranking on this list, based on sales, growth, high valuations, and strong culture, is evidence that Databricks continues to establish the future of data and AI in the cloud.

LinkedIn Top Startups 

For the third consecutive year, Databricks was named to the LinkedIn Top Startups List, honoring the hottest companies attracting and retaining the best talent. This year we are ranked #5! The list is derived from a blended score of factors including employee engagement and growth, job interest, and the ability to attract top talent, and is informed by the billions of actions taken by LinkedIn members each year. It shows that talented candidates are continuously seeking out Databricks job postings and that Databricks employees have stayed highly engaged in sharing their experiences at the company across their networks.

Forbes Best Startup Employers 

Databricks recently made the Forbes Best Startup Employers list, compiled by Forbes and its partner Statista, which identifies 500 top employers based on reputation, employee satisfaction, and growth. The Forbes team considered and evaluated 2,500 American businesses with at least 50 employees. Making this list illustrates Databricks’ commitment to building a team that people want to be a part of and help grow.

WayUp Top 100 Internship Programs 

This year, Databricks made the WayUp Top 100 Internship Programs list, selected by a panel of industry expert judges and thousands of public votes. Databricks is called out on the WayUp website for its University Recruiting team’s dedication to meeting with interns weekly to discuss feedback on how to improve the virtual program and make it more impactful while remote.

What all of these awards have in common: we could not have won them without the help of every individual at Databricks contributing to our overarching mission, making Databricks a great place to work, and spreading the word about our positive team-first culture. We’ve come far in achieving rapid growth and keeping our culture strong along the way, but we also know there is so much more potential to make a big impact and help data teams do even more.

Ready to join us? Check out our open positions here.

--

Try Databricks for free. Get started today.

The post It’s a Good Time to be a Brickster! appeared first on Databricks.

How Relogix reduced infrastructure costs by 80% with Delta Lake and Azure Databricks


Introduction

Relogix is a leading workspace analytics and sensor-as-a-service provider for workplace monitoring, management, and performance. Founded on a decade of Corporate Real Estate (CRE) data and analytics experience, Relogix specializes in blending data used by CRE professionals with data collected from workspace occupancy and comfort sensors. Blending data enables CRE strategy leaders to tell compelling stories about the culture of the organization, which includes the workspace, the workplace, and the workforce. Relogix customers include the most recognizable brands in the world today. The company has grown to over 30 corporate and public sector customers and has deployed over 100,000 sensors covering over 20 million square feet of office space globally.

Relogix Workspace Occupancy Sensors provide real-time monitoring of workspace occupancy, utilization, densities and dwell times, while Workplace Comfort Sensors monitor ambient temperature, light, noise and humidity in the workplace. Relogix analytics and insights are designed to help CRE professionals understand how employees use the workplace, support organizations that are interested in achieving WELL certification, and prepare for the return to office in a post-COVID world.

Challenges with Cost in a Traditional Data Architecture

The maturity of data and technology has come a long way from descriptive data analysis, where simple relational databases, historical data, and spreadsheet expertise delivered all the insight the business needed. Questions like “what is my portfolio occupancy?” or even “is the workspace comfortable?” are no longer sufficient to meet and deliver on competitive business opportunities. Especially relevant today, it was important for Relogix to move forward and answer questions for their customers like “Can I reduce my office footprint while maintaining or improving the employee experience?” or “How can I ensure my workplace comfort levels fall within acceptable WELL ranges?” The cost of development in their existing architecture made answering these questions nearly impossible. Traditional data architecture costs go beyond the high cost of storage and compute; the even larger cost is the opportunity lost by not delivering on innovative and competitive business opportunities. Let’s break these data challenges out further:

  • Data Accessibility – Traditional data-serving architectures make it too difficult and costly to store all the data at high volumes and reliably retrieve it. This spans a wide variety of data, including semi-structured and unstructured data, arriving at high velocity in both batch and real-time intervals.
  • Inflexibility – Traditional data-serving and BI architectures usually support only a limited set of proprietary tools and languages. This results in additional costs when introducing new tools, and in data silos due to the lack of a central environment supporting all the necessary capabilities.
  • Performance – Ingesting complex, high volume data can add costly processing times that may take hours before incoming data is available for queries. Even then, typically only a static, partial view is available. As a result, analysts and data science teams do not benefit from exploring the latest data.
  • Reliability – Complex data pipelines are error-prone and complicated, requiring significant amounts of time and resources. In addition, data schemas evolve as business needs change. As new business opportunities arise every change to the schema may cause unforeseen data errors and failures in downstream dashboards, applications, and analysis.
  • Complexity – Combining streaming and batch analytics requires complex and low-level code. Updates to batch and streaming data often interfere with queries or cause inconsistent results. This also results in needing more expensive, skilled resources to develop and maintain, neglecting new development opportunities. The data silos from this lambda architecture resulting from inflexibility make it difficult for resources to maximize each other’s skill sets to speed up project development.

Given all of these challenges, Relogix’s data is driving demand for greater business insight, the foundation to deliver it, and the human resources to execute it.

How Azure Databricks and Delta Lake helped Relogix power Conexus

Azure Databricks is the jointly-developed data and AI service from Databricks and Microsoft that enables organizations to unlock value from their data by dramatically accelerating time to insight and by maximizing the productivity of their operations. Delta Lake is an open-source storage layer that brings reliability at scale to data lakes. Together Azure Databricks and Delta Lake are designed to pave the way for cost-effective, fast, and flexible analysis of complex data. This resulted in Relogix being able to reallocate IT resources to higher-value projects and reduce operational costs by up to 10x.

Azure Databricks is scalable, reliable and fast, providing a fully-integrated security model and data integration with other Azure services to enable a collaborative workspace for data teams across the full lifecycle.

Introducing Azure Databricks: fast, easy and collaborative

Higher value with the Relogix launch of Conexus

Relogix recently launched Conexus, a workspace intelligence platform powered by Azure Databricks and Delta Lake. Conexus enables organizations to access and leverage insights for the return to office in the post-COVID world and plan for the new workplace. Conexus surfaces insights for executives and decision makers, and provides the supporting details required by those who execute on those strategies, such as space planners and interior designers. Whether the objective is to optimize, rationalize or revolutionize the workplace, the information provided through the Relogix Conexus platform helps organizations transform the experience of work.


Conexus Data Stories provides narrative-based, easy to follow, actionable insights.

E.g., occupancy rates, actual desks required, desk surplus, and suggested sharing ratio for optimizing safe, shared workspace portfolios.

Relogix customers can visualize insights and KPIs via data ‘storyboards’ and quickly get answers to questions that were not possible before. The Conexus platform enables CRE professionals to:

  • Automate the blending and integration of multiple data sources in any format (e.g., API, xlsx, doc, txt, csv, IoT, PDF) at scale to improve reporting. Examples of additional data sources include, but are not limited to, meeting room and desk reservation systems, IWMS, HRIS, security badging, etc.
  • Monitor and report on Key Performance Indicators (KPIs) that directly impact cost savings, cost reduction and/or cost avoidance
  • Monitor COVID-19 workplace-related health and safety protocols and identify potential areas of concern, such as high occupancy or high densities (clustering), and inform the availability of safe seats, i.e., seats that have not been occupied for more than a predetermined number of days
  • Access always up-to-date insights related to their workplace and workspace performance on-demand

With Conexus, CRE professionals are able to extract factual assessments related to the workspace, workplace, and workforce. This eliminates guesswork from reporting and ensures real estate related requirements are always aligned to business objectives and effectively support how people work.

Conexus Safe Seats provides a visual representation of your workspace, showing the seats that need to be cleaned.

Safe Seats allows the customer to visualize the number of seats in a building that have been used and understand what needs to be cleaned, proactively reducing the spread of germs.

Conexus Workplace Analytics allows customers to visualize the occupancy rate per floor plan to better plan portfolio strategy and discover reduction opportunities.

Workplace Analytics allows customers to visualize the occupancy rate per floor plan to better plan portfolio strategy and discover reduction opportunities. This is meant to give a holistic view of the key KPIs and is commonly used to negotiate space planning with business leaders.

Reduced TCO with a Modern Data Architecture

Prior to their Delta Lake migration, Relogix worked off a single SQL Server instance. Although SQL Server supported their core use case, scaling it to meet the requirements of a modern data company became too expensive. Relogix took advantage of Delta Lake and Azure Databricks to provide a secure and intuitive environment for analytics while limiting compute spend to what is required. Data preparation and ad hoc analysis were moved to Azure Databricks where data could be centrally accessed and clusters could be turned on and off, as needed. Additionally, Delta empowered Relogix to experiment with ML and AI use cases at scale. The implementation of the modern data architecture allowed Relogix to scale back costs on wasted compute resources by 80% while further empowering their data team.

Azure Databricks and Delta Lake integrate with other Azure services such as Azure Data Factory, Azure Event Hubs, Azure Data Lake Storage Gen2, Azure Synapse Analytics, Azure Machine Learning and Power BI.

Simple, modern data architecture with Azure Databricks and Delta Lake.

Decreased Time to Market with Delta Lake and Databricks Connect

The medallion architecture (as noted in the following diagram) allows for flexible access and extendable data processing. The Bronze tables are for data ingestion and enable quick access (without the need for data modeling) to a single source of truth for incoming IoT and transactional events. As data flows to Silver tables, it becomes more refined and optimized for business intelligence and data science use cases through data transformations and feature engineering. The Bronze and Silver tables also act as Operational Data Store (ODS) style tables allowing for agile modifications and reproducibility of downstream tables. Deeper analysis is done on Gold tables where analysts are empowered to use their method of choice (PySpark, Koalas, SQL, BI, and Excel all enable business analytics at Relogix) to derive new insights and formulate queries. Productized through Databricks Connect, Relogix follows SaaS best practices such as Agile methodologies, DevOps standards, and functional programming principles to create highly resilient tables that their analysts, scientists, and customers can leverage with confidence. By connecting the desired business outcomes with implementation, Relogix takes tables from concept to production in a timely manner, ensuring quality each step of the way.

Process batch and streaming data with Delta Lake on your existing Azure data lake, including Azure Data Lake Storage, Hadoop File System (HDFS) and Azure Blob Storage

Architecting your Delta Lake with the medallion data quality data flow.
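To make the medallion flow described above concrete, here is a minimal PySpark sketch (not Relogix’s actual pipeline) that moves illustrative sensor events from a Bronze landing table through a cleansed Silver table to an aggregated Gold table. All paths and column names are assumptions, and a Databricks notebook (or an existing SparkSession named spark) is assumed.

    from pyspark.sql import functions as F

    bronze_path = "/mnt/datalake/bronze/sensor_events"   # raw ingested events (assumed path)
    silver_path = "/mnt/datalake/silver/sensor_events"
    gold_path = "/mnt/datalake/gold/occupancy_by_floor"

    # Bronze -> Silver: drop malformed rows and standardize types.
    (spark.read.format("delta").load(bronze_path)
        .where(F.col("sensor_id").isNotNull())
        .withColumn("event_time", F.to_timestamp("event_time"))
        .write.format("delta").mode("overwrite").save(silver_path))

    # Silver -> Gold: business-level aggregation used for reporting.
    (spark.read.format("delta").load(silver_path)
        .groupBy("floor_id", F.to_date("event_time").alias("event_date"))
        .agg(F.avg("occupied").alias("avg_occupancy"))
        .write.format("delta").mode("overwrite").save(gold_path))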

 

Increased Productivity with Delta Lake and Collaborative Notebooks

Azure Databricks has become the de-facto platform at Relogix for turning data into insights. From the moment data lands in the Bronze stage of their Delta Lake, Relogix connects the analytical with the technical, providing a secure sandbox-style environment for team members to collaborate in real-time. Subject matter experts propose and experiment with new ideas, analysts quickly answer customer requests, debug issues, or dream up new insights, and data scientists combine data sources to bring advanced analytics to life. Underlying the analysis are data engineers leveraging Delta Lake to securely support data access internally while rapidly serving insights to customers externally. By unifying the disciplines around a common platform, Relogix ensures business relevance and actionable outcomes are at the core of data initiatives.

  • Real-Time Delta Pipeline with SQL and Python Collaboration 

A relevant example is customers relying on real-time social media trends to gain a holistic view of their target market and anticipate their needs and desires. Reacting in an untimely manner or showing poor judgment can result in millions in lost revenue. The video tutorials below show real-time ingestion of tweets from Twitter, specifically about COVID-19, through the Bronze to Gold stages with Databricks and Delta Lake.


Bronze tables will contain raw data.


Silver tables will contain filtered, consistent information after data cleansing


Gold tables will contain business-level aggregations used by our customers for data visualization, business analytics and critical business decision-making

What’s Coming up Next

Stay tuned for the next blog post on Databricks and Relogix giving better context to IoT data by leveraging alternative and big data sources.

Gain an even deeper understanding of your workspace with the Conexus Platform.

Get Started

The Delta Lake quickstart provides an overview of the basics of working with Delta Lake. The quickstart shows how to build a pipeline that reads JSON data into a Delta table, and how to modify the table, read it, display its history, and optimize it. To see these Delta Lake features in action, see these introductory notebooks.
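As a taste of those quickstart steps, here is a minimal PySpark sketch, with placeholder paths, that writes JSON data to a Delta table, appends to it, reads it back, and displays its history (a Databricks notebook or existing SparkSession named spark is assumed).

    from delta.tables import DeltaTable

    json_path = "/mnt/raw/events.json"      # placeholder input
    delta_path = "/mnt/delta/events"        # placeholder Delta table location

    # Create the Delta table from JSON, then append a second batch.
    spark.read.json(json_path).write.format("delta").mode("overwrite").save(delta_path)
    spark.read.json(json_path).write.format("delta").mode("append").save(delta_path)

    # Read the table and inspect its version history.
    spark.read.format("delta").load(delta_path).show()
    DeltaTable.forPath(spark, delta_path).history().show()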

To learn more about Delta Lake, join our Modern Data Engineering virtual event or our next Azure Databricks Office Hours. To try Delta Lake, sign up for a free trial of Azure Databricks.

--

Try Databricks for free. Get started today.

The post How Relogix reduced infrastructure costs by 80% with Delta Lake and Azure Databricks appeared first on Databricks.
