
Allow Simple Cluster Creation with Full Admin Control Using Cluster Policies


What is a Databricks cluster policy?

A Databricks cluster policy is a template that restricts the way users interact with cluster configuration. Today, any user with cluster creation permissions is able to launch an Apache Spark cluster with any configuration. This leads to a few issues:

  • Administrators are forced to choose between control and flexibility. We often see clusters being managed centrally, with all non-admins stripped of cluster creation privileges; this provides acceptable control over the environment, but creates bottlenecks for user productivity. The alternative, giving free rein to all users, can lead to problems like runaway costs and an explosion of cluster types across the enterprise.
  • Users are forced to choose their own configuration, even when they may not need or want to. For many users, the number of options when creating a new cluster is overwhelming; many just want to create a small, basic cluster for prototyping, or to recreate a cluster that someone else has already configured. In these cases, more options are not better.
  • Standardization of configurations, for purposes such as tagging, chargeback, user onboarding, and replicability across environments, is often manual. These standards can mostly be enforced via API workarounds, but they are not well integrated.

To help solve these problems, we are introducing cluster policies to allow the creation of reusable, admin-defined cluster templates. These will control what a user can see and select when creating a cluster, and can be centrally managed and controlled via group or user privileges. We see two broad areas of benefit: increasing the ability of admins to balance control and flexibility, and simplifying the user experience for non-admins.

How do cluster policies help admins balance control and user flexibility?

While admins have historically had to choose between control and flexibility in designing usage patterns within Databricks, cluster policies will allow coexistence of both. By defining a set of templates that can be assigned to specific users or groups, admins can meet organizational guidelines on usage and governance without hampering the agility of an ad-hoc cluster model. To this end, policies will allow some of the most common patterns to be replicated and enforced automatically:

  • Maximum DBU burndown per cluster per hour can be enforced to prevent users from spinning up overly large or expensive clusters
  • Tagging of clusters can be mandated to enable chargeback/showback based on AWS resource tags
  • Instance types and number of instances can be controlled via whitelisting, range specification, and even regular expressions, providing fine-grained controls over the type and size of clusters created
  • Cluster type can be restricted to ensure users run jobs only on job clusters instead of all-purpose clusters

More complex templates, such as enforcing credential passthrough or enabling an external metastore, can also provide a reusable framework; instead of struggling with a complex configuration every time a cluster is created, the configuration can be defined once and then applied to new clusters repeatedly. All of this combines to provide better visibility, control, and governance to Databricks administrators and Cloud Ops teams, without taking away from the flexibility and agility that makes Databricks valuable to so many of our customers.

How do cluster policies help simplify the experience of non-admin users?

As a user of Databricks today, I need to make several choices when creating a cluster, such as what instance type and size to use for both my driver and worker nodes, how many instances to include, the version of Databricks Runtime, autoscaling parameters, etc. While some users may find these options helpful and necessary, the majority need to make only basic choices when creating a cluster, such as choosing small, medium or large. The advanced options can be unnecessarily overwhelming for less experienced users. Cluster policies will let such users pick a basic policy (such as “small”), provide a cluster name, and get directly to their notebook. For example, instead of the full create cluster screen all users see today, a minimal policy might look like this:

New Databricks cluster policy templates simplify cluster creation, reducing the configuration options to a few basic choices.

This is especially useful for users who may be new to the cloud computing world, or who are unfamiliar with Apache Spark™; they can now rely on the templates provided to them, instead of guessing. More advanced users might require additional options, for which policies can be created and assigned to specific users or groups. Policies are flexible enough to allow many levels of granularity, so that data science or data engineering teams see exactly the level of detail they need, without additional complexity that causes confusion and slows productivity.

What are some examples of cluster policies?

Although cluster policies will continue to evolve as we add more endpoints and interfaces, we have already taken some of the best practices from the field and formed them into a starting point to build upon. Some examples of these templates include:

  • Small/Medium/Large “t-shirt size” clusters: minimal clusters that require little to no configuration by the user; we use a standard i3.2xlarge node type with auto-scaling and auto-termination enforced. Users only need to provide a cluster name.
  • Max DBU count: allow all of the parameters of the cluster to be modified, but set a limit (e.g., 50 DBUs per hour) to prevent users from creating an excessively large or expensive cluster
  • Single-Node machine learning (ML) Clusters: limit runtime to Databricks ML Runtimes, enforce 1 driver and 0 workers, and provide options for either GPU or CPU machines acceptable for ML workloads
  • Jobs-only clusters: users can only create a jobs cluster and run Databricks jobs using this policy, and cannot create shared, all-purpose clusters

These are a small sample of the many different types of templates that are possible with cluster policies.

General Cluster Policy

DESCRIPTION: this is a general-purpose cluster policy meant to guide users and restrict some functionality, while requiring tags, restricting the maximum number of instances, and enforcing an auto-termination timeout.

{
	"spark_conf.spark.databricks.cluster.profile": {
		"type": "fixed",
		"value": "serverless",
		"hidden": true
	},
	"instance_pool_id": {
		"type": "forbidden",
		"hidden": true
	},
	"spark_version": {
		"type": "regex",
		"pattern": "6.[0-9].x-scala.*"
	},
	"node_type_id": {
		"type": "whitelist",
		"values": [
		"i3.xlarge",
		"i3.2xlarge",
		"i3.4xlarge"
		],
		"defaultValue": "i3.2xlarge"
	},
	"driver_node_type_id": {
		"type": "fixed",
		"value": "i3.2xlarge",
		"hidden": true
	},
	"autoscale.min_workers": {
		"type": "fixed",
		"value": 1,
		"hidden": true
	},
	"autoscale.max_workers": {
		"type": "range",
		"maxValue": 25,
		"defaultValue": 5
	},
	"autotermination_minutes": {
		"type": "fixed",
		"value": 30,
		"hidden": true
	},
	"custom_tags.team": {
		"type": "fixed",
		"value": "product"
	}
}

Note: For Azure users, “node_type_id” and “driver_node_type_id” need to be Azure supported VMs instead.

Simple Medium-Sized Policy

DESCRIPTION: this policy allows users to create a medium Databricks cluster with minimal configuration. The only required field at creation time is cluster name; the rest is fixed and hidden.

{
	"instance_pool_id": {
		"type": "forbidden",
		"hidden": true
	},
	"spark_conf.spark.databricks.cluster.profile": {
		"type": "forbidden",
		"hidden": true
	},
	"autoscale.min_workers": {
		"type": "fixed",
		"value": 1,
		"hidden": true
	},
	"autoscale.max_workers": {
		"type": "fixed",
		"value": 10,
		"hidden": true
	},
	"autotermination_minutes": {
		"type": "fixed",
		"value": 60,
		"hidden": true
	},
	"node_type_id": {
		"type": "fixed",
		"value": "i3.xlarge",
		"hidden": true
	},
	"driver_node_type_id": {
		"type": "fixed",
		"value": "i3.xlarge",
		"hidden": true
	},
	"spark_version": {
		"type": "fixed",
		"value": "7.x-snapshot-scala2.11",
		"hidden": true
	},
	"custom_tags.team": {
		"type": "fixed",
		"value": "product"
	}
}

Note: For Azure users, “node_type_id” and “driver_node_type_id” need to be Azure supported VMs instead.

Job-Only Policy

DESCRIPTION: this policy only allows users to create Databricks job (automated) clusters and run jobs using the cluster. Users cannot create an all-purpose (interactive) cluster using this policy.

{
	"cluster_type": {
		"type": "fixed",
		"value": "job"
	},
	"dbus_per_hour": {
		"type": "range",
		"maxValue": 100
	},
	"instance_pool_id": {
		"type": "forbidden",
		"hidden": true
	},
	"num_workers": {
		"type": "range",
		"minValue": 1
	},
	"node_type_id": {
		"type": "regex",
		"pattern": "[rmci][3-5][rnad]*.[0-8]{0,1}xlarge"
	},
	"driver_node_type_id": {
		"type": "regex",
		"pattern": "[rmci][3-5][rnad]*.[0-8]{0,1}xlarge"
	},
	"spark_version": {
		"type": "regex",
		"pattern": "6.[0-9].x-scala.*"
	},
	"custom_tags.team": {
		"type": "fixed",
		"value": "product"
	}
}

Note: For Azure users, “node_type_id” and “driver_node_type_id” need to be Azure supported VMs instead.

High Concurrency Passthrough Policy

DESCRIPTION: this policy allows users to create clusters that have passthrough enabled by default, in high concurrency mode. This simplifies setup for admins, since users would otherwise need to set the appropriate Spark parameters manually.

{
	"spark_conf.spark.databricks.passthrough.enabled": {
		"type": "fixed",
		"value": "true"
	},
	"spark_conf.spark.databricks.repl.allowedLanguages": {
		"type": "fixed",
		"value": "python,sql"
	},
	"spark_conf.spark.databricks.cluster.profile": {
		"type": "fixed",
		"value": "serverless"
	},
	"spark_conf.spark.databricks.pyspark.enableProcessIsolation": {
		"type": "fixed",
		"value": "true"
	},
	"custom_tags.ResourceClass": {
		"type": "fixed",
		"value": "Serverless"
	}
}

External Metastore Policy

DESCRIPTION: this policy allows users to create a Databricks cluster with an admin-defined metastore already attached. This is useful to allow users to create their own clusters without requiring additional configuration.

{
	"spark_conf.spark.hadoop.javax.jdo.option.ConnectionURL": {
		"type": "fixed",
		"value": "jdbc:sqlserver://"
	},
	"spark_conf.spark.hadoop.javax.jdo.option.ConnectionDriverName": {
		"type": "fixed",
		"value": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
	},
	"spark_conf.spark.databricks.delta.preview.enabled": {
		"type": "fixed",
		"value": "true"
	},
	"spark_conf.spark.hadoop.javax.jdo.option.ConnectionUserName": {
		"type": "fixed",
		"value": "{{secrets/metastore/databricks-poc-metastore-user}}"
	},
	"spark_conf.spark.hadoop.javax.jdo.option.ConnectionPassword": {
		"type": "fixed",
		"value": "{{secrets/metastore/databricks-poc-metastore-password}}"
	}
}

You can see more policies in our Databricks Labs repo at https://github.com/databrickslabs/policy-templates.

How can I get started?

You need to be on the Databricks Premium tier or above (Azure Databricks or AWS; see pricing details) to use cluster policies.

As a Databricks admin, go to the “Clusters” page and select the “Cluster Policies” tab to create your policies in the policy JSON editor. Alternatively, you can create policies via the API. See details in the Databricks documentation on cluster policies (AWS, Azure).
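For example, here is a minimal sketch of creating a policy programmatically through the Cluster Policies API, assuming a workspace URL and a personal access token stored in environment variables (the variable and policy names are illustrative):

import json
import os
import requests

# Illustrative environment variables; substitute your own workspace URL and token
host = os.environ["DATABRICKS_HOST"]    # e.g. https://<your-workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # a personal access token

# Reuse a subset of the "Simple Medium-Sized Policy" definition shown above
definition = {
    "autotermination_minutes": {"type": "fixed", "value": 60, "hidden": True},
    "node_type_id": {"type": "fixed", "value": "i3.xlarge", "hidden": True},
    "custom_tags.team": {"type": "fixed", "value": "product"}
}

response = requests.post(
    f"{host}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "simple-medium-policy",
        "definition": json.dumps(definition)  # the API expects the definition as a JSON string
    }
)
response.raise_for_status()
print(response.json())  # returns the policy_id of the newly created policy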

--

Try Databricks for free. Get started today.



The data community raised a total of $101,626 to help organizations fight racial injustice. Thank you to all who donated!


Our commitment to diversity and inclusion is inherent in our company values at Databricks, but recent events and protests around the world have reminded us that there’s much more we can do to bring awareness to racial injustice and drive meaningful change.

This year’s theme at Spark + AI Summit was “DATA TEAMS UNITE!” and while we created the theme a while back, it took on a new meaning with broader visibility of the Black Lives Matter movement and protests we’ve seen across the globe. As we organized our first-ever virtual Summit this year, we knew we had a responsibility to unite the data community in an effort to make the positive impact we want and need to see. With over 60,000 people registered for the event, we saw an opportunity to use Spark + AI Summit as a platform to do just that. We partnered with two important organizations committed to driving change — the NAACP Legal Defense and Educational Fund and Center for Policing Equity — and organized a fundraising program during Spark + AI Summit. We encouraged the data community to join the cause and Databricks matched all donations.

I’m proud to share that together, we took a significant step in the right direction: with over $50,000 in donations plus the matching program, we raised $101,626 for these important causes to continue the fight against social and racial injustice. We’re grateful to everyone who donated and inspired by what the data community can accomplish when we unite to help solve the world’s toughest problems.

Looking ahead, we can’t discount the importance of continuing to do our part to address racial injustice and empowering our employees and community to do more. We will always remain committed to making a difference — at Databricks, on behalf of our customers and partners, and in collaboration with the data community.

About the organizations

The NAACP Legal Defense and Educational Fund, Inc. (LDF) is America’s premier legal organization fighting for racial justice. Through litigation, advocacy, and public education, LDF seeks structural changes to expand democracy, eliminate disparities, and achieve racial justice in a society that fulfills the promise of equality for all Americans. LDF also defends the gains and protections won over the past 75 years of civil rights struggle and works to improve the quality and diversity of judicial and executive appointments.

As a research and action think tank, Center for Policing Equity (CPE) produces analyses identifying and reducing the causes of racial disparities in law enforcement. Using evidence-based approaches to social justice, it uses data to create levers for social, cultural and policy change. CPE’s work continues to simultaneously aid police departments to realize their own equity goals as well as advance the scientific understanding of issues of equity within organizations and policing.

I encourage you to watch CPE Co-founder and CEO Dr. Phillip Atiba Goff’s keynote presentation at Spark + AI Summit, where he discusses racism and the role data can play in changing policing.

“The role of nerds – data nerds and justice nerds – is not all that sexy, it is not in front of the camera and not at the front of the movement, but without it failure is absolutely guaranteed.” – Dr. Goff.

--

Try Databricks for free. Get started today.


A data-driven approach to Environmental, Social and Governance


The future of finance goes hand in hand with social responsibility, environmental stewardship and corporate ethics. In order to stay competitive, financial services institutions (FSIs) are increasingly disclosing more information about their environmental, social and governance (ESG) performance. By better understanding and quantifying the sustainability and societal impact of any investment in a company or business, FSIs can mitigate reputation risk and maintain trust with both their clients and shareholders. At Databricks, we increasingly hear from our customers that ESG has become a C-suite priority. This is not solely driven by altruism but also by economics: higher ESG ratings are generally positively correlated with valuation and profitability, and negatively correlated with volatility. In this blog post, we offer a novel approach to sustainable investing by combining natural language processing (NLP) techniques and graph analytics to extract key strategic ESG initiatives, learn companies’ relationships in a global market, and quantify their impact on market risk calculations.

Using the Databricks Unified Data Analytics Platform, we will demonstrate how Apache Spark™, Delta Lake and MLflow can enable asset managers to assess the sustainability of their investments and empower their business with a holistic and data-driven view of their environmental, social and corporate governance strategies. Specifically, we will extract the key ESG initiatives as communicated in yearly PDF reports and compare these with the actual media coverage from news analytics data.

A novel approach to ESG scoring using financial news, NLP and graph analytics

In the second part of this blog, we will learn the connections between companies and understand the positive or negative ESG consequences these connections may have to your business. While this blog will focus on asset managers to illustrate the modern approach to ESG and socially responsible investing, this framework is broadly applicable across all sectors in the economy from Consumer Staples and Energy to Media and Healthcare.

Extracting key ESG initiatives

Financial services organisations are facing increasing pressure from their shareholders to disclose more information about their environmental, social and governance strategies. Typically released on their websites on a yearly basis in the form of a PDF document, these reports communicate key ESG initiatives across multiple themes: how companies value their employees, clients or customers, how they positively contribute back to society, or how they mitigate climate change by, for example, reducing (or committing to reduce) their carbon emissions. Consumed by third-party agencies (such as MSCI or CSRHub), these reports are usually consolidated and benchmarked across industries to create ESG metrics.

Extracting statements from ESG reports

In this example, we programmatically access 40+ ESG reports from top-tier financial services institutions (some are listed in the table below) and learn key initiatives across different topics. However, with no standard schema nor regulatory guidelines, the way these PDF documents communicate varies widely, making this approach a perfect candidate for machine learning (ML).

Barclays https://home.barclays/content/dam/home-barclays/documents/citizenship/ESG/Barclays-PLC-ESG-Report-2019.pdf
JP Morgan Chase https://impact.jpmorganchase.com/content/dam/jpmc/jpmorgan-chase-and-co/documents/jpmc-cr-esg-report-2019.pdf
Morgan Stanley https://www.morganstanley.com/pub/content/dam/msdotcom/sustainability/Morgan-Stanley_2019-Sustainability-Report_Final.pdf
Goldman Sachs https://www.goldmansachs.com/what-we-do/sustainable-finance/documents/reports/2019-sustainability-report.pdf

Although our data set is relatively small, we show how one could distribute the scraping process using a user defined function (UDF), assuming the third-party library `PyPDF2` is available across your Spark environment.

import io
import requests
import PyPDF2
from pyspark.sql.functions import udf

@udf('string')
def extract_content(url):

    # retrieve PDF binary stream
    response = requests.get(url)
    open_pdf_file = io.BytesIO(response.content)
    pdf = PyPDF2.PdfFileReader(open_pdf_file)

    # return concatenated content
    text = [pdf.getPage(i).extractText() for i in range(0, pdf.getNumPages())]
    return "\n".join(text)

Beyond regular expressions and fairly complex data cleansing (reported in the attached notebooks), we also want to leverage more advanced NLP capabilities to tokenise content into grammatically valid sentences. Given the time it takes to load trained NLP pipelines in memory (such as the `spacy` library below), we ensure our model is loaded only once per Spark executor using a pandas UDF strategy, as follows.

import gensim
import spacy
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('array<string>', PandasUDFType.SCALAR_ITER)
def extract_statements(content_series_iter):

    # load the spacy english model only once per executor
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

    # clean and tokenize each batch of PDF content with the
    # process_text function defined in the attached notebooks
    for content_series in content_series_iter:
        yield content_series.map(lambda x: process_text(nlp, x))
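The `process_text` helper used above is defined in the attached notebooks; a minimal sketch of what such a function might look like, assuming the spaCy English model, splits the cleansed text into sentences and keeps lemmatised, lower-cased tokens:

def process_text(nlp, text):
    # Simplified sketch of the cleansing done in the attached notebooks:
    # split text into sentences and lemmatise each token, dropping stop words and punctuation
    statements = []
    for sentence in nlp(text).sents:
        lemmas = [
            token.lemma_.lower()
            for token in sentence
            if token.is_alpha and not token.is_stop
        ]
        # keep only reasonably long sentences, more likely to be grammatically valid
        if len(lemmas) > 5:
            statements.append(" ".join(lemmas))
    return statements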

With this approach, we were able to convert raw PDF documents into well-defined sentences (some are reported in the table below) for each of our 40+ ESG reports. As part of this process, we also lemmatised our content, that is, transformed each word into its simpler grammatical form (past tenses to the present form, plurals to the singular). This extra step will pay off in the modeling phase by reducing the number of words to learn topics from.

Goldman Sachs we established a new policy to only take public those companies in the us and europe with at least one diverse board director (starting next year, we will increase our target to two)
Barclays it is important to us that all of our stakeholders can clearly understand how we manage our business for good.
Morgan Stanley in 2019, two of our financings helped create almost 80 affordable apartment units for low-and moderate-income families in sonoma county, at a time of extreme shortage.
Riverstone in the last four years, the fund has conserved over 15,000 acres of bottomland hardwood forests, on track to meeting the 35,000-acre goal established at the start of the fund

Although it is relatively easy for the human eye to infer the themes around each of these statements (in this case diversity, transparency, social, environmental), doing so programmatically and at scale is of a different complexity and requires advanced use of data science.

Classifying ESG statements

In this section, we want to automatically classify each of the 8,000 sentences we extracted from our 40+ ESG reports. Together with non-negative matrix factorisation, Latent Dirichlet Allocation (LDA) is one of the core models in the topic modeling arsenal; we can use either its distributed implementation in Spark ML or its in-memory sklearn equivalent, as follows. We compute our term frequencies and capture our LDA model and hyperparameters using MLflow experiment tracking.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
import mlflow

# compute word frequencies
# stop words are common english words + banking related buzzwords
word_tf_vectorizer = CountVectorizer(stop_words=stop_words, ngram_range=(1,1))
word_tf = word_tf_vectorizer.fit_transform(esg['lemma'])
    
# track experiment on ml-flow
with mlflow.start_run(run_name='topic_modeling'):
    
    # Train a LDA model with 9 topics
    lda = LDA(random_state = 42, n_components = 9, learning_decay = .3)
    lda.fit(word_tf)
    
    # Log model 
    mlflow.sklearn.log_model(lda, "model")
    mlflow.log_param('n_components', '9')
    mlflow.log_param('learning_decay', '.3')
    mlflow.log_metric('perplexity', lda.perplexity(word_tf)) 

Following multiple experiments, we found that 9 topics would summarise our corpus best. By looking deeper at the importance of each keyword learned from our model, we try to describe our 9 topics into 9 specific categories, as reported in the table below.

Suggested name LDA descriptive keywords
company strategy board, company, corporate, governance, management, executive, director, shareholder, global, engagement, vote, term, responsibility, business, team
green energy energy, emission, million, renewable, use, project, reduce, carbon, water, billion, power, green, total, gas, source
customer focus customer, provide, business, improve, financial, support, investment, service, year, sustainability, nancial, global, include, help, initiative
support community community, people, business, support, new, small, income, real, woman, launch, estate, access, customer, uk, include
ethical investments investment, climate, company, change, portfolio, risk, responsible, sector, transition, equity, investor, sustainable, business, opportunity, market
sustainable finance sustainable, impact, sustainability, asset, management, environmental, social, investing, company, billion, waste, client, datum, investment, provide
code of conduct include, policy, information, risk, review, management, investment, company, portfolio, process, environmental, governance, scope, conduct, datum
strong governance risk, business, management, environmental, customer, manage, human, social, climate, approach, conduct, page, client, impact, strategic
value employees employee, work, people, support, value, client, company, help, include, provide, community, program, diverse, customer, service
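The keyword rankings in this table can be read directly from the fitted model; a minimal sketch, reusing the `lda` and `word_tf_vectorizer` objects from the previous snippet:

import numpy as np

# vocabulary learned by the CountVectorizer
# (use get_feature_names on older scikit-learn versions)
feature_names = word_tf_vectorizer.get_feature_names_out()

# each row of lda.components_ holds the pseudo-count of every word for one topic;
# the largest values are the most descriptive keywords of that topic
for topic_id, weights in enumerate(lda.components_):
    top_keywords = [feature_names[i] for i in np.argsort(weights)[::-1][:15]]
    print(f"topic {topic_id}: {', '.join(top_keywords)}")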

With our 9 machine learned topics, we can easily compare each of our FSI’s ESG reports side by side to better understand the key priority focus for each of them.

Comparison of ESG initiatives across 30 financial services organizations

Using a seaborn visualisation, we can easily flag key differences across companies (organisations’ names redacted). While some organisations put more focus on valuing employees and promoting diversity and inclusion (such as ORG-21), others seem more focused on ethical investments (ORG-14). As the output of LDA is a probability distribution across our 9 topics rather than one specific theme, we can easily unveil the most descriptive ESG initiative for any given organisation using a simple SQL statement and a partitioning function that captures the highest probability for each theme.

WITH ranked AS (
    SELECT
        e.topic,
        e.statement,
        e.company,
        dense_rank() OVER (
            PARTITION BY e.company, e.topic ORDER BY e.probability DESC
        ) as rank
    FROM esg_reports e
)
    
SELECT 
    t.topic,
    t.statement
FROM ranked t
WHERE t.company = 'goldman sachs'
AND t.rank = 1  

This SQL statement provides us with an NLP-generated executive summary for Goldman Sachs (see original report), summarising a complex 70+ page document into 9 ESG initiatives / actions.

Topic Statement
support community Called the Women Entrepreneurs Opportunity Facility (WEOF), the program aims to address unmet financing needs of women-owned businesses in developing countries, recognizing the significant obstacles that women entrepreneurs face in accessing the capital needed to grow their businesses.
strong governance The ERM framework employs a comprehensive, integrated approach to risk management, and it is designed to enable robust risk management processes through which we identify, assess, monitor and manage the risks we assume in conducting our business activities.
sustainable finance In addition to the Swedish primary facility, Northvolt also formed a joint venture with the Volkswagen Group to establish a 16 GWh battery cell gigafactory in Germany, which will bring Volkswagens total investment in Northvolt to around $1 billion.
green energy Besides reducing JFKs greenhouse gas emissions by approximately 7,000 tons annually (equivalent to taking about 1,400 cars off the road), the project is expected to lower the Port Authority’s greenhouse gas emissions at the airport by around 10 percent The GSAM Renewable Power Group will hold the power purchase agreement for the project, while SunPower will develop and construct the infrastructure at JFK.
customer focus Program alumni can also join the 10KW Ambassadors Program, an advanced course launched in 2019 that enables the entrepreneurs to further scale their businesses.10,000 Women Measures Impacts in China In Beijing, 10,000 Women held a 10-year alumni summit at Tsinghua University School of Economics and Management.
ethical investments We were one of the first US companies to commit to the White House American Business Act on Climate Pledge in 2015; we signed an open letter alongside 29 other CEOs in 2017 to support the US staying in the Paris Agreement; and more recently, we were part of a group of 80+ CEOs and labour leaders reiterating our support that staying in the Paris Agreement will strengthen US competitiveness in global markets.
value employee Other key initiatives that enhance our diversity of perspectives include: Returnship Initiative, which helps professionals restart their careers after an extended absence from the workforce The strength of our culture, our ability to execute our strategy, and our relevance to clients all depend on a diverse workforce and an inclusive environment that encourages a wide range of perspectives.
company strategy Underscoring our conviction that diverse perspectives can have a strong impact on company performance, we have prioritized board diversity in our stewardship efforts.
code of conduct 13%Please see page 96 of our 2019 Form 10-K for further of approach to incorporation of environmental, social and governance (ESG) factors in credit analysisDiscussion and AnalysisFN-CB-410a.2Environmental Policy Framework

Although we may observe some misclassification (mainly related to how we named each topic) and may have to tune our model further, we have demonstrated how NLP techniques can be used to efficiently extract well-defined initiatives from complex PDF documents. These statements, however, may not always reflect companies’ core priorities, nor do they capture every initiative for each theme. This can be further addressed using techniques borrowed from anomaly detection: grouping the corpus into broader clusters and extracting the sentences that deviate most from the norm (i.e. sentences specific to an organisation rather than mainstream). This approach, using K-Means, is discussed in the attached notebooks.

Create a data-driven ESG score

As covered in the previous section, we were able to compare businesses side by side across 9 different ESG initiatives. Although we could attempt to derive an ESG score from these reports (the approach many third-party organisations use), we want our score to be truly data-driven rather than subjective. In other words, we do not want to base our assessment solely on companies’ official disclosures, but rather on how companies’ reputations are perceived in the media, across all three environmental, social and governance variables. For that purpose, we use GDELT, the Global Database of Events, Language and Tone.

Data acquisition

Given the volume of data available in GDELT (100 million records for the last 18 months alone), we leverage the lakehouse paradigm by moving data from raw to filtered to enriched, respectively the Bronze, Silver and Gold layers, and extend our process to operate in near real time (GDELT files are published every 15 minutes). For that purpose, we use a Structured Streaming approach that we trigger in batch mode, with each batch operating on the data increment only. By unifying streaming and batch, Spark is the de-facto standard for data manipulation and ETL processes in modern data lake infrastructures.

from pyspark.sql import functions as F

gdelt_stream_df = spark \
    .readStream \
    .format("delta") \
    .table("esg_gdelt_bronze") \
    .withColumn("themes", filter_themes(F.col("themes"))) \
    .withColumn("organisation", F.explode(F.col("organisations"))) \
    .select(
        F.col("publishDate"),
        F.col("organisation"),
        F.col("documentIdentifier").alias("url"),
        F.col("themes"),
        F.col("tone.tone")
    )

gdelt_stream_df \
    .writeStream \
    .trigger(once=True) \
    .option("checkpointLocation", "/tmp/gdelt_esg") \
    .format("delta") \
    .table("esg_gdelt_silver")

From the variety of dimensions available in GDELT, we focus only on sentiment analysis (using the tone variable) for financial news articles. We assume financial news articles to be well captured by the GDELT taxonomies starting with ECON_*. Furthermore, we assume environmental articles to be captured by the ENV_* taxonomies and social articles by the UNGP_* taxonomies (UN Guiding Principles on human rights).
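The `filter_themes` user-defined function used in the streaming query above is not shown; one possible implementation, assuming the bronze table stores GDELT’s semicolon-delimited theme string, is sketched below.

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# keep only the taxonomies we treat as financial (ECON_*), environmental (ENV_*)
# and social (UNGP_*) signals
ESG_PREFIXES = ("ECON_", "ENV_", "UNGP_")

@udf(ArrayType(StringType()))
def filter_themes(raw_themes):
    if raw_themes is None:
        return []
    return [t for t in raw_themes.split(";") if t.startswith(ESG_PREFIXES)]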

Sentiment analysis as proxy for ESG

Without any industry standard or existing models to define environmental, social and governance metrics, and without any ground truth available to us at the time of this study, we assume that the overall tone captured from financial news articles is a good proxy for a company’s ESG score. For instance, a series of bad press articles related to maritime disasters and oil spills would strongly affect a company’s environmental performance. Conversely, news articles about […] financing needs of women-owned businesses in developing countries [source] with a more positive tone would contribute to a better ESG score. Our approach is to look at the difference between a company’s sentiment and its industry average: how much more “positive” or “negative” a company is perceived across all of its financial news articles, over time.

In the example below, we show the difference in sentiment (using a 15-day moving average) between one of our key FSIs and its industry average. Apart from a specific time window around the COVID-19 outbreak in March 2020, this company has consistently performed better than the industry average, indicating a good environmental score overall.

Sentiment analysis of financial news articles relative to industry average
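A minimal sketch of how such a comparison could be computed with Spark window functions, assuming the silver table exposes `publishDate`, `organisation` and `tone` columns (the industry average is simplified here to a per-day average across all organisations):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

daily_tone = (
    spark.table("esg_gdelt_silver")
    .groupBy("organisation", "publishDate")
    .agg(F.avg("tone").alias("tone"))
)

# 15-day moving average of tone per organisation
ma_window = (
    Window.partitionBy("organisation")
    .orderBy(F.col("publishDate").cast("timestamp").cast("long"))
    .rangeBetween(-14 * 86400, 0)
)

# simple industry-wide average for the same day
industry_window = Window.partitionBy("publishDate")

sentiment = (
    daily_tone
    .withColumn("tone_ma", F.avg("tone").over(ma_window))
    .withColumn("industry_avg", F.avg("tone").over(industry_window))
    .withColumn("tone_vs_industry", F.col("tone_ma") - F.col("industry_avg"))
)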

Generalising this approach to every entity mentioned in our GDELT dataset, we are no longer limited to the few FSIs we have an official ESG report for, and can create an internal score for every single company across the environmental, social and governance dimensions. In other words, we have started to shift our ESG lens from subjective to data-driven.

Introducing a propagated weighted ESG metric

In a global market, companies and businesses are interconnected, and the ESG performance of one (e.g. a seller) may affect the reputation of another (e.g. a buyer). As an example, if a firm keeps investing in companies directly or indirectly related to environmental issues, this risk should be quantified and reflected back in its reports as part of its ethical investment strategy. We could cite the example of Barclays’ reputation being impacted in late 2018 because of its indirect connections to tar sands projects (source).

Identifying influencing factors

Democratised by Google for web indexing, PageRank is a common technique for identifying nodes’ influence in large networks. In our approach, we use a variant, Personalised PageRank, to identify influential organisations relative to our key financial services institutions. The more influential these connections are, the more they are likely to contribute (positively or negatively) to our ESG score. An illustration of this approach is reported below, where indirect connections to the tar sands industry may negatively contribute to a company’s ESG score in proportion to their personalised PageRank influence.

Using GraphFrames, we can easily create a network of companies sharing common media coverage. Our assumption is that the more often companies are mentioned together in news articles, the stronger their link will be (edge weight). Although this assumption may also infer wrong connections because of random co-occurrence in news articles (see later), this undirected weighted graph will help us find companies’ importance relative to the core FSIs we would like to assess.

import org.apache.spark.sql.functions._

val buildTuples = udf((organisations: Seq[String]) => {
    // as undirected, we create both IN and OUT connections
    organisations.flatMap(x1 => {
        organisations.map(x2 => {
        (x1, x2)
        })
    }).toSeq.filter({ case (x1, x2) =>
        x1 != x2 // remove self edges
    })
})
    
val edges = spark.read.table("esg_gdelt")
    .groupBy("url")
    .agg(collect_list(col("organisation")).as("organisations"))
    .withColumn("tuples", buildTuples(col("organisations")))
    .withColumn("tuple", explode(col("tuples")))
    .withColumn("src", col("tuple._1"))
    .withColumn("dst", col("tuple._2"))
    .groupBy("src", "dst")
    .count()
    
import org.graphframes.GraphFrame 
val esgGraph = GraphFrame(nodes, edges)

By studying this graph further, we observe a power-law distribution of its edge weights: 90% of the connected businesses share very few connections. We drastically reduce the graph size from 51,679,930 down to 61,143 connections by filtering edges with a weight of 200 or above (an empirically chosen threshold). Prior to running PageRank, we also optimise our graph by further reducing the number of connections through a shortest-path algorithm, computing the maximum number of hops a node needs to follow to reach any of our core FSI vertices (captured in the `landmarks` array). The depth of a graph is the maximum of every shortest path possible, or the number of connections it takes for any random node to reach any other (the smaller the depth, the denser the network).

val shortestPaths = esgGraph.shortestPaths.landmarks(landmarks).run()
val filterDepth = udf((distances: Map[String, Int]) => {
    distances.values.exists(_ <= 4)
})

We filter our graph to have a maximum depth of 4. This process reduces our graph further down to 2,300 businesses and 54,000 connections, allowing us to run Page Rank algorithm more extensively with more iterations in order to better capture industry influence.

val prNodes = esgDenseGraph
    .parallelPersonalizedPageRank
    .maxIter(100)
    .sourceIds(landmarks)
    .run()  

We can directly visualise the top 100 influential nodes relative to a specific business (in this case Barclays PLC), as shown in the graph below. Unsurprisingly, Barclays is well connected with most of our core FSIs (such as the institutional investors JP Morgan Chase, Goldman Sachs or Credit Suisse), but also with the Securities and Exchange Commission, the Federal Reserve and the International Monetary Fund.

Influence of Barclays connections to ESG score

Further down this distribution, we find public and private companies such as Huawei, Chevron, Starbucks or Johnson & Johnson. Strongly or loosely related, directly or indirectly connected, all of these businesses (or entities, from an NLP standpoint) could theoretically affect Barclays’ ESG performance, positively or negatively, and as such impact Barclays’ reputation.

ESG as a propagated metric

By combining our ESG score captured earlier with the importance of each of these entities, it becomes easy to apply a weighted average on the “Barclays network” where each business contributes to Barclays’ ESG score proportionally to its relative importance. We call this approach a propagated weighted ESG score (PW-ESG).
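A minimal sketch of that weighted average, assuming a hypothetical DataFrame `connections` holding, for each entity connected to a given FSI, its personalised PageRank (`pagerank`) and its media-derived score (`esg`):

from pyspark.sql import functions as F

# `connections`: hypothetical DataFrame of (entity, pagerank, esg) for one FSI's network.
# Each connected entity contributes to the propagated score proportionally to its
# personalised PageRank: PW-ESG = sum(pagerank * esg) / sum(pagerank)
pw_esg = connections.agg(
    (F.sum(F.col("pagerank") * F.col("esg")) / F.sum("pagerank")).alias("pw_esg")
)
pw_esg.show()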

We observe the negative or positive influence of any company’s network using a word cloud visualisation. In the picture below, we show the negative influence (entities contributing negatively to ESG) for a specific organisation (name redacted).

Wordcloud representing companies with negative ESG influence to a given organisation

Due to the nature of news analytics, it is not surprising to observe news publishing companies (such as Thomson Reuters or Bloomberg) or social networks (Facebook, Twitter) among the strongly connected organisations. Since these connections do not reflect true business relationships but rather simple co-occurrence in news articles, we should consider filtering them out prior to the PageRank process by removing nodes with a high degree of connections, as sketched below. However, this additional noise seems constant across our FSIs and as such does not appear to disadvantage one organisation over another. An alternative approach would be to build our graph using established connections extracted from more advanced NLP on the raw text content. This, however, would drastically increase the complexity of this project and the costs associated with news scraping.
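A minimal sketch of that degree-based filter, here expressed with the Python GraphFrames API and assuming the same company graph as built above (the degree threshold is an illustrative assumption):

from graphframes import GraphFrame
from pyspark.sql import functions as F

# drop "hub" vertices (news publishers, social networks) whose very high degree
# only reflects co-occurrence in articles, then rebuild the graph
hubs = esgGraph.degrees.filter(F.col("degree") > 10000)

vertices = esgGraph.vertices.join(hubs, "id", "left_anti")
edges = (
    esgGraph.edges
    .join(hubs.withColumnRenamed("id", "src"), "src", "left_anti")
    .join(hubs.withColumnRenamed("id", "dst"), "dst", "left_anti")
)

esgFilteredGraph = GraphFrame(vertices, edges)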

Finally, we represent the original ESG scores as computed in the previous section, and how much these scores were reduced (or increased) by our PW-ESG approach across the environmental, social and governance dimensions. In the example below, for a given company, the initial scores of 69, 62 and 67 have been reduced to 57, 53 and 60, with the most negative PW-ESG influence on its environmental coverage (-20%).

The environmental, social and governance scores reduced by PW-ESG influence

Using the agility of Redash coupled with the efficiency of the Databricks runtime, this series of insights can be rapidly packaged up as a BI/MI report, bringing ESG-as-a-service to your organisation so that asset managers can better invest in sustainable and responsible finance.

It is worth mentioning that this new framework is generic enough to accommodate multiple use cases. Whilst core FSIs may consider their own company as a landmark to Page Rank in order to better evaluate reputational risks, asset managers could consider all their positions as landmarks to better assess the sustainability relative to each of their investment decisions.

ESG applied to market risk

In order to validate our initial assumption that [...] higher ESG ratings are generally positively correlated with valuation and profitability while negatively correlated with volatility, we create a synthetic portfolio made of random equities, run it through our PW-ESG framework, and combine it with actual stock information retrieved from Yahoo Finance. As reported in the graph below, despite an evident lack of data to draw scientific conclusions, it would appear that our highest and lowest ESG-rated companies are, respectively, the most and least profitable instruments in our portfolio over the last 18 months (we report sentiment analysis as a proxy for ESG in the top graph).

Correlation between ESG score and profitability

Interestingly, CSRHub reports the exact opposite, with Pearson (media) scoring 10 points above Prologis (property leasing) in terms of ESG, highlighting the subjectivity of ESG scoring and the inconsistency between what is communicated and what is actually observed.

Following up on our recent blog post about modernizing risk management, we can use this new information to drive better risk calculations. Splitting our portfolio into 2 distinct books, composed of the best and worst 10% of our ESG-rated instruments, we report in the graph below the historical returns and their corresponding 95% value-at-risk (historical VaR).

Correlation between ESG score and market volatility: a poor ESG rating results in a higher value-at-risk

Without any prior knowledge of our instruments beyond the metrics we extracted through our framework, we observe the risk exposure to be 2 times higher for the portfolio made of poorly ESG-rated companies, supporting the assumption found in the literature that “poor ESG [...] correlates with higher market volatility”, and hence with a greater value-at-risk.
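A minimal sketch of the 95% historical VaR used to compare the two books, assuming a pandas series of daily returns per book (the series names are illustrative):

import numpy as np
import pandas as pd

def historical_var(returns: pd.Series, confidence: float = 0.95) -> float:
    # the loss that daily returns only exceed (1 - confidence) of the time,
    # estimated directly from the empirical distribution of historical returns
    return -np.percentile(returns.dropna(), 100 * (1 - confidence))

# hypothetical usage: compare the books built from the best and worst 10% ESG-rated instruments
# var_best = historical_var(best_book_returns)
# var_worst = historical_var(worst_book_returns)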

As covered in our previous blog, the future of risk management lies with agility and interactivity. Risk analysts must augment traditional data with alternative data and alternative insights in order to explore new ways of identifying and quantifying the risks facing their business. Using the flexibility and scale of cloud compute and the level of interactivity in your data enabled through our Databricks runtime, risk analysts can better understand the risks facing their business by slicing and dicing market risk calculations at different industries, countries, segments, and now at different ESG ratings. This data-driven ESG framework enables businesses to ask new questions such as: how much of your risk would be decreased by bringing the environmental rating of this company up 10 points? How much more exposure would you face by investing in these instruments given their low PW-ESG scores?

Transforming your ESG strategy

In this blog, we have demonstrated how complex documents can be quickly summarised into key ESG initiatives to better understand the sustainability aspect of each of your investments. Using graph analytics, we introduced a novel approach to ESG by identifying the influence a global market has on both your organisation’s strategy and its reputational risk. Finally, we showed the economic impact of ESG factors on market risk calculations. As a starting point for a data-driven ESG journey, this approach can be further improved by bringing in the internal data you hold about your various investments and additional metrics from third-party data, propagating the risks through our PW-ESG framework to keep driving more sustainable finance and profitable investments.

Try the below notebooks on Databricks to accelerate your ESG development strategy today and contact us to learn more about how we assist customers with similar use cases.

  1. Using NLP to extract key ESG initiatives from PDF reports
  2. Introducing a novel approach to ESG scoring using graph analytics
  3. Applying ESG framework to market risk calculations

--

Try Databricks for free. Get started today.


Azure Databricks Now Available in Azure Government (Public Preview)


We are excited to announce that Azure Databricks is now available in Microsoft’s Azure Government region, enabling new data and AI use cases for federal agencies, state and local governments, public universities, and government contractors so they can make faster decisions, produce more accurate predictions, and unify their collaborative data analytics. More than a dozen federal agencies are building cloud data lakes and are looking to use Delta Lake for reliability.

Azure Databricks is trusted by organizations such as Credit Suisse, Starbucks, AstraZeneca, McKesson, Deutsche Telekom, ExxonMobil, H&R Block, and Dell for business-critical data and AI use cases. Databricks maintains the highest level of data security by incorporating industry leading best practices into our cloud computing security program. Azure Government public preview provides customers the assurance that Azure Databricks is designed to meet United States Government security and compliance requirements to support sensitive analytics and data science use cases. Azure Government is a gold standard among public sector organizations and their partners who are modernizing their approach to information security and privacy.

Enabling government agencies and partners to accelerate mission-critical workloads on Azure Government

High Impact data are frequently stored and processed in emergency services, financial, Department of Defense, and healthcare systems. For example, Azure Databricks enables government agencies and their contractors to analyze public records data such as tax history, financial records, and welfare and healthcare claims to improve processing times, reduce operating costs, and reduce claims fraud. In addition, government agencies and contractors need to analyze large geospatial datasets from GPS satellites, cell towers, ships and autonomous platforms for marine mammal and fish population assessments, highway construction, disaster relief, and population health. State and local governments who utilize federal data also depend on Azure Government to ensure they meet these same high standards of security and compliance.

Learn more about Azure Government and Azure Databricks

You can learn more about Azure Databricks and Azure Government by visiting the Azure Government website, seeing the full list of Azure services available in Azure Government, comparing Azure Government and global Azure, and reading Microsoft’s documentation here.

Get started with Azure Databricks by attending this free, 3-part training series. Learn more about Azure Databricks security best practices by attending this webinar and reading this blog post.

As always, we welcome your feedback and questions and commit to helping customers achieve and maintain the highest standard of security and compliance. Please feel free to reach out to the team through Azure Support.

Follow us on Twitter, LinkedIn, and Facebook for more Azure Databricks security and compliance news, customer highlights, and new feature announcements.

--

Try Databricks for free. Get started today.


How to Extract Market Drivers at Scale Using Alternative Data


Watch the on-demand webinar Alternative Data Analytics with Python for a demonstration of the solution discussed in this blog and/or download the following notebooks to try it yourself.


Why alternative data is critical

Alternative data is helping banking institutions, insurance companies and asset managers alike make better decisions by revealing valuable information about consumer behavior (e.g. utility payment history, transaction information), and its use extends across a variety of use cases including trade analysis, credit risk, and ESG risk. Traditional data sources, such as FICO scores or quarterly 10-Q reports, have been mined and analyzed to the point that they no longer provide a competitive advantage. To gain a real edge, financial services institutions (FSIs) need to leverage alternative data to obtain a better understanding of their customers, markets, and businesses. Some of the most common alternative datasets used in the industry include news articles, web/mobile app exhaust data, social media data, credit/debit card data, and satellite imagery data.

According to a survey completed by Dow Jones newswire, 66% of respondents believe alternative data is critical to the success of their FSI, yet only 7% believe they are leveraging it to the greatest extent possible. TransUnion reports that 90% of loan applicants would be no-hits (no credit score could be provided) without alternative data, which highlights the immense value of these data sources. Two main reasons FSIs fail to extract value are a) challenges in integrating alternative data sources (e.g. transactions) with traditional data sources (e.g. earnings reports) and b) difficulty iterating on experiments with unstructured data. On top of this, historical datasets are large, frequently updated, and require thorough cleansing to unlock value.

Fortunately, the Databricks Unified Data Analytics Platform and Delta Lake, an open-source storage layer that brings ACID transactions to big data workloads, help organizations overcome these challenges with a scalable platform for data analytics and AI in the cloud. More specifically, Databricks’ Auto Loader for Delta Lake provides ‘set and forget’ ETL into a format ready for analytics, improving productivity for data science teams. Additionally, the Databricks Apache Spark™ runtime on Delta Lake provides performance benefits for parsing and analyzing unstructured and semi-structured sources, such as news and images, thanks to optimized data lake parsers and simple library (e.g. NLP packages) management.

In this blog post, we explore an architecture based on Databricks and Delta Lake that combines analyses on unstructured text and structured time series foot traffic data, mocked in the SafeGraph data format. The specific business challenge is to extract insights from these alternative data sources and uncover a network of partners, competitors, and a proxy for sales at QSRs (quick-serve restaurants) for one plant-based meat company, Beyond Meat. We delivered this content in a joint webinar with SafeGraph, a company specializing in curated datasets for points of interest, geometries, and patterns; the webinar is available here.

Discovering key drivers for stock prices using alternative data

The biggest shifts in stock prices occur when new information is released regarding the past performance or future viability of the underlying companies. In this example, we examine alternative data, namely news articles and foot traffic data, to see if positive news, such as celebrity endorsements and free press garnered by innovative products such as plant-based meats, can attract foot traffic to fast-food restaurants, increase sales and ultimately move their stock prices.

To begin the task of getting insights from news articles, we will set up the architecture shown in the diagram below. We are using three data sources for our analysis:

  1. articles from an open-source online database called GDELT project
  2. restaurant foot traffic (mocked up using the SafeGraph schema)
  3. market data from Yahoo finance for initial exploratory analysis

In the attached notebooks, our analysis begins with a chart examining a simple moving average (SMA). As shown below, since Beyond Meat is a young company, we cannot extract even the simplest technical indicators, and this is our primary driver for understanding the company from other perspectives, namely news and foot traffic data.

Note that the green and orange lines for the moving averages for Beyond Meat are limited and only provide 5 months of quality technical data.

Alternative data sources often exist in raw format in external cloud storage, so a good first step is to ingest this into Delta Lake, an open format based on parquet and optimal for scalable and reliable analytics on a cloud data lake. We establish three layers of data storage using Delta Lake, each with an increasing level of data refinement from raw ingested data on the left (bronze) to parsed and enriched data in the middle layer (silver) to aggregated and annotated data ready for business intelligence (BI) and machine learning on the right.

Databricks Alternative Data Architecture for discovering new market signals using Python.

Incrementally ingesting alternative data sources

The first step towards analyzing alternative data is to set up ingestion pipelines from all the required data sources. In this example, we are going to show you how to ingest two data sources:

  1. Front Page News Articles: We are going to pull news articles related to plant-based meat. We are going to leverage an open database called GDELT which scans 50,000 news outlets daily and provides a snapshot of all the news headlines and links every hour.
  2. Geospatial Foot Traffic: We will import (mock) foot traffic data for popular restaurants such as Dunkin Donuts in the NY metro area for the past 12 months, using the data format of SafeGraph, an alternative data vendor specializing in curated points-of-interest and patterns datasets.

Autoload data files into Delta Lake

Many alternative data sources, such as news or web search data, arrive in real-time. In our example, GDELT articles are refreshed every 15 minutes. Other sources, such as transaction data, arrive on a daily basis, but it is cumbersome for data science teams to have to keep track of the latest dates in order to append new data to their data lake. To automate the process of ingesting data files from the above-mentioned data sources, we will leverage Databricks Delta Lake’s autoloader functionality to pick up these files continuously as they arrive. We take the data as-is from the raw files and store it into staging tables in Delta format. These tables form the “bronze” layer of the data platform. The code to auto ingest files is shown below. Note that the Spark stream writer has a ‘trigger once’ option: this is particularly useful to avoid always-on streaming queries and instead allows the writer to schedule it on a daily cadence, for ‘set and forget’ ingestion. Also, note that the checkpoint gives us built-in fault tolerance and ease-of-use at no cost; when we start this query at any time, it will only pick up new files from the source and we can safely restart in case of failure.

    new_data = spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "csv/json/parquet/etc") \
        .schema(schema) \
        .load("/mnt/raw/gdelt")  # illustrative path to the raw landing location

    new_data.writeStream \
        .trigger(once=True) \
        .partitionBy("date") \
        .format("delta") \
        .option("checkpointLocation", "/tmp/checkpoint/gdelt_delta_checkpoint") \
        .start("/delta/bronze/gdelt")  # illustrative bronze table location


Data analysis using Python

Python is a popular and versatile language used by many data scientists around the world. For tasks involving text cleansing and modeling, there are hundreds of libraries and packages, making it our go-to language in the analysis that follows. We first read the bronze layer and load it into an enriched dataset containing the article text, timestamp, and language of the GDELT source. This enriched format will constitute the silver layer and details of this extraction are in the attached notebooks. Once the data is in the silver layer, we can clean and summarize the text in three steps:

  1. Summarize terms from the corpus – discover important and unique terms among the corpus of articles using TF-IDF
  2. Summarize article topics from the corpus – topic modeling for articles using distributed LDA may tell us:
    1. The current TAM, by uncovering popular forms of plant-based meat (e.g. pork, chicken, beef)
    2. Which competitors or QSRs are showing up as major topics
  3. Find interesting named entities in articles to understand the influence of entities on plant-based meat products and affected companies

Important term exploration with TF-IDF

Given that we are examining many articles about plant-based meat, there are commonalities among them. Term Frequency-Inverse Document Frequency (TF-IDF for short) surfaces the important terms, normalized by their presence in the broader corpus of articles. This type of analysis is easy to distribute with Apache Spark and runs in seconds on thousands of articles. Below is a summary using the computed TF-IDF value, ranked in descending order.

    select textSWRemoved, linktext, tf_idf 
    from alt_data.plant_based_articles_gold  
    where ((lower(linktext) like '%beyond%meat%'))
    order by tf_idf desc
    limit 5
    

Sample Spark table returning important terms using Term Frequency-Inverse Document Frequency (TF-IDF), demonstrating the advanced text analytics capabilities of the Databricks Alternative Data architecture.
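For reference, here is a minimal sketch of how a distributed TF-IDF score can be computed with Spark ML before being written to a gold table; the DataFrame variable and column names are illustrative assumptions rather than the exact schema used in the attached notebooks.

    from pyspark.ml.feature import Tokenizer, HashingTF, IDF

    # Tokenize the cleaned article text (assumed column name: textSWRemoved)
    tokenizer = Tokenizer(inputCol="textSWRemoved", outputCol="tokens")
    tokenized = tokenizer.transform(articles_df)

    # Term frequencies per article, then inverse document frequencies across the corpus
    tf = HashingTF(inputCol="tokens", outputCol="tf_features").transform(tokenized)
    tfidf = IDF(inputCol="tf_features", outputCol="tf_idf_features").fit(tf).transform(tf)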

What is notable here are some of the names and the ticker information. Digging into the article text, we found that Lana Weidgenant created a petition to bring a new plant-based meat product to Dunkin’s menu. Another interesting observation is that the stock ticker TSX: QSR shows up here. This is the ticker for the Canadian stock Restaurant Brands International, which pulled Beyond Meat products from Tim Hortons locations due to failed adoption by customers. This simple term summary gives us a quick picture of a few quick-serve restaurants (QSRs for short) and Beyond Meat’s impact on them.

Applying topic modeling at scale

Now, let’s apply a topic model to understand articles as distributions of topics and topics as distributions of terms. Since we are using unsupervised learning techniques to summarize article content, the generative model LDA (Latent Dirichlet Allocation) is a natural choice for discovering latent topics among articles. Fortunately, this algorithm is distributable and built into Spark ML. The succinct code below shows the simple configuration needed to run this on our text. We’ve also included a bar chart showing a visual representation of topics extracted from the LDA model.

First, before running any type of modeling exercise, it is essential to clean our data further. To do this, let’s use the robust NLP library nltk to filter short words and stop words, remove punctuation, and stem our tokens. This will greatly improve model results. Below is the function we will apply to each row of our data frame, wrapped inside a pandas UDF. Note that we’ve customized the nltk stop word list since we have extra information about our article base. In particular, ‘veganism’, ‘plant-based’, and ‘meat’ are not necessarily interesting topics or terms – instead, we want information on QSRs serving these products or the actual partners themselves.

    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer

    tokens = word_tokenize(text)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # filter out stop words, including custom additions for our article base
    stop_words = set(stopwords.words('english'))
    stop_words.add('edt')
    stop_words.add('beyond')
    stop_words.add('meat')
    # ... additional custom stop words
    # drop punctuation and short words, then keep only non-stop-word tokens
    words = [w for w in tokens if w.isalpha() and len(w) > 2 and w not in stop_words]
    # stem the remaining tokens
    porter = PorterStemmer()
    words = [porter.stem(w) for w in words]

Now that we’ve cleaned our text, we’ll leverage Spark for two steps: i) run a count vectorizer with a vocabulary size of 1000 to featurize our data into vectors, and ii) fit LDA to the vectorized data. Note that we’ve included a parameter for document concentration (otherwise known as alpha in the world of LDA), which indicates document-topic density – higher values correspond to the assumption that there are many topics per article rather than a few. The default is 1/(number of topics), so here we choose a lower value to reflect our assumption of fewer topics per article.

    from pyspark.ml.clustering import LDA

    # Set LDA params: 6 topics, a repeatable seed, and a lower document
    # concentration (alpha) based on our assumption of fewer topics per article
    lda = LDA() \
        .setOptimizer("online") \
        .setMaxIter(40) \
        .setK(6) \
        .setSeed(1) \
        .setDocConcentration([0.1] * 6)
    

Representation of terms with probabilities within each topic
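To connect the configuration above to results like these, here is a minimal sketch of the vectorization and fitting steps (the DataFrame and column names are illustrative assumptions, not the exact notebook schema):

    from pyspark.ml.feature import CountVectorizer

    # i) Featurize the cleaned tokens (assumed column: cleaned_tokens) into count vectors
    cv = CountVectorizer(inputCol="cleaned_tokens", outputCol="features", vocabSize=1000)
    cv_model = cv.fit(cleaned_df)
    vectorized_df = cv_model.transform(cleaned_df)

    # ii) Fit the LDA configuration defined above and inspect the top terms per topic
    lda_model = lda.fit(vectorized_df)
    topics = lda_model.describeTopics(maxTermsPerTopic=3)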

Our LDA model has produced a set of six topics, which we summarize below using a bar chart showing the distribution over topics and the top 3 terms per topic. Notable observations include a Beyond Meat protein (chicken), partners (McDonald’s), and competitors such as Impossible Foods, all featured prominently.

Sample bar chart visualization using Latent Dirichlet Allocation (LDA) to return the topic summary and document distribution over the GDELT corpus being analyzed.

Named entity recognition

Our topic model gave us great information about more QSRs involved in selling plant-based meat. Now, we want to understand which entities are collocated in the same article to give us more information on any special partnerships or promotions. If we extract information on how Beyond Meat is penetrating the market, we can potentially use other alternative data sources to uncover transaction information or restaurant visits, for example.

Let’s use the Python NLP library SpaCy’s built-in named entity recognition to help solve this problem. Also, instead of running the entity recognition on the article text, let’s run it on the article title text itself to see if we can find new information on partnerships or promotions. We use a Spark Pandas UDF to parallelize the entity recognition annotation (as below).

    # Parallelize SpaCy entity recognition using a Pandas UDF
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    def addEntities(text):
        nlp_br = SpacyMagic.get(lang)  # SpacyMagic: helper that caches one SpaCy model per language
        sentence = str(text)
        sentence_nlp = nlp_br(sentence)
        return "#".join([str((word, word.ent_type_)) for word in sentence_nlp if word.ent_type_])

    @pandas_udf('string', PandasUDFType.SCALAR)
    def entitiesUDF(v: pd.Series) -> pd.Series:
        return v.map(lambda x: addEntities(x))
    

Now, we produce a summary of the entity information we received from SpaCy’s logic:

  • Dunkin’ was one of the top 10 entities
  • SQL exploration of all entities related to Dunkin’ (Snoop Dogg was a standout)
  • Visualization of the article highlighting a promotion by Dunkin’ and Snoop Dogg for a Beyond Meat sandwich, targeted for January 2020.

Now that we have a specific example of Beyond Meat’s sales strategy, let’s formulate a hypothesis about the effectiveness of the promotion using another popular alternative data type – geolocation data, in this case, provided by Safegraph’s POI offering.

Sample visualization demonstrating Python NLP library SpaCy’s built-in named entity recognition ability.

Sample SQL Query returning all entities in conjunction with the named entity.

Below is the visualized representation of entities that SpaCy produces, which is easily viewable in a Databricks notebook. To produce this, we took the text of the article we identified above through SQL exploration.

Example word highlight visualization demonstrating the Databricks’ solution ability to analyze alternative data to uncover additional insights in parallel.

Validating signals using Safegraph’s POI data

Up to this point, we have done many different kinds of news analyses. However, this has only leveraged our GDELT dataset. While the NLP summaries above have led us to the need to understand Beyond Meat promotions via partnerships, we also want to analyze some proxy for sales to see how effectively Beyond Meat is penetrating the fast-food market. We will do this using a mocked dataset based on SafeGraph’s geolocation data schema. SafeGraph’s data is coalesced from over 50M mobile devices and offers highly structured formats based on POI (point-of-interest) data for many different brands. The POI data is especially interesting for our use case since we are focused on QSRs (quick-serve restaurants), which tend to have high conversion rates and serve as a proxy for sales. In particular, given a promotion of a Beyond Meat-based product (in this case Snoop Dogg’s breakfast sandwich), we want to understand whether foot traffic was ‘significant’ during this short time period, given historical time series data on foot traffic at NYC Dunkin’ locations. We’ll use an additive time series forecasting model, Prophet, to forecast the foot traffic for a week in January based on the prior year’s history. As a crude metric, a significant uptick in foot traffic means actual SafeGraph traffic exceeding the upper bound of the 80% prediction interval for the majority of the promotional period (1 week).

Note: The foot traffic data used within the notebooks and shown in this blog is simulated based on the SafeGraph schema. In order to validate with true data, visit SafeGraph to get these samples.

While SafeGraph’s foot traffic data is well cleansed, we still need to explode each record to create a time series for Prophet. The code in the attached notebook creates one row per date with the number of visits, i.e. the num_visits array is converted to multiple records, one per date, as below.

    %sql 
    
    select location_name, street_address, city, region, year, month, day, num_visits 
    from ft_raw
    where location_name = 'Dunkin\''
    

Example table demonstrating the use of alternative geolocation data to analyze the effectiveness of marketing promotion.

Using the data above, we can aggregate by date to produce a time series for our forecasting model, which we’ve named all_nyc_dunkin. Once this aggregation is converted to a pandas data frame, we fit our Prophet models and produce a forecast, as shown below.
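As a minimal sketch of that aggregation step (assuming the exploded visits are registered as a Spark table named ft_exploded – the table and column names are illustrative, and Prophet expects the columns ds and y):

    from pyspark.sql import functions as F

    # Aggregate daily visits across all NYC Dunkin' locations into a Prophet-ready series
    daily_visits = (spark.table("ft_exploded")
        .filter(F.col("location_name") == "Dunkin'")
        .groupBy("date")
        .agg(F.sum("num_visits").alias("y"))
        .withColumnRenamed("date", "ds"))

    # Prophet works on pandas DataFrames, so collect the (small) aggregated series locally
    all_nyc_dunkin = daily_visits.toPandas().sort_values("ds")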

    # Python
    m = Prophet()
    m.fit(pdf)
    
    m_nyc_dunkin = Prophet()
    m_nyc_dunkin.fit(all_nyc_dunkin)
    
    # Python
    future = m.make_future_dataframe(periods=15)
    nyc_future = m_nyc_dunkin.make_future_dataframe(periods = 15)
    
    nyc_forecast = m_nyc_dunkin.predict(nyc_future)
    forecast = m.predict(future)
    
    forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(15)
    

Sample table with time series forecast and actual data, demonstrating the Databricks solution ability to accurately predict foot traffic using alternative data.

Overlaying the forecast above with actual foot traffic data, we see that the additional traffic doesn’t pass our test for ‘significant’ (all dates stayed within the prediction interval). Our POI data has allowed us to reach this interesting result quickly with a simple library import and an aggregation in SQL. To conclude this analysis, this is not to say the promotion didn’t lift the average bill amount per receipt – but we would need transaction data (another alternative dataset) to confirm that.
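The check itself is simple; here is a sketch, assuming the actual daily visit counts have been joined onto the forecast output as a column named actual (the column name and promotional window below are illustrative assumptions):

    # Restrict to the one-week promotional window and compare actuals to the upper
    # bound of Prophet's default 80% prediction interval
    promo = nyc_forecast[(nyc_forecast["ds"] >= "2020-01-06") & (nyc_forecast["ds"] <= "2020-01-12")]
    days_above = (promo["actual"] > promo["yhat_upper"]).sum()

    # Our crude test: the uptick is 'significant' only if most promo days exceed the interval
    is_significant = days_above > len(promo) / 2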

Sample visualization depicting actual vs. forecasted foot traffic using alternative data and the Databricks solution to process and analyze it.

Overall, our goal was to enhance our methodology for discovering market drivers. The analyses presented here greatly improve on the limited technical market data we first inspected. While we would likely conclude there is not a strong buy or sell signal for Beyond Meat based on the article summaries or foot traffic, all of the analyses above can be automated and extended to other use cases to provide hourly dashboards to aid financial decisions.

Leveraging alternative data

In summary, this blog post has provided a variety of ideas on how you can leverage alternative data to inform and improve your financial decision-making process. The pipeline code we have provided is only meant to get you started on this journey of exploring non-traditional data. In real-life scenarios, you may deal with messier data that requires more cleaning steps before it is ready for analysis. But irrespective of how many steps you need to execute or how big the datasets are, Databricks provides an easy-to-use, collaborative, and scalable platform to uncover insights from unstructured and siloed data in a matter of minutes and hours instead of weeks. Databricks is currently helping some of the largest financial institutions leverage alternative data for investment decisions, and we can do the same for your organization.

Getting Started

Download the Notebooks:

  • Market Data Notebook
  • Data Ingestion Notebook

More Resources

--

Try Databricks for free. Get started today.

The post How to Extract Market Drivers at Scale Using Alternative Data appeared first on Databricks.

Spark + AI Summit Reflections


Developers attending a conference have high expectations: what knowledge gaps they’ll fill; what innovative ideas or inspirational thoughts they’ll take away; who to contact for technical questions, during and after the conference; what technical trends are emerging in their domain of expertise; what serendipitous connections they’ll make and foster; and what theme will resonate across the conference content.

As developer advocates, we offer a developer’s perspective by reflecting on Spark + AI Summit 2020, held June 22–26 with nearly 70K registrants from 125+ countries.

As part of a data team of data engineers, scientists, architects and analysts, you’ll want to get to the heart of the matter. So let’s consider keynotes first.

Technical keynotes

Setting the overarching narrative of the conference, Ali Ghodsi, CEO and co-founder of Databricks, asserted why, more than ever, data teams must come together. By citing the social and health crises facing the world today, he elaborated on how data teams in organizations have embraced the idea of Data + AI as a team sport that heeds a clarion call: unlock the power of data and machine learning. This unifying theme resounded across products and open-source project announcements, training courses and many sessions.

We developers are enthralled by technical details: we expect to see architectural diagrams, an under-the-hood glimpse, code, notebooks and demos. We got all the technical details on what, why and how Lakehouse, a new data paradigm built atop Delta Lake and compatible with Apache Spark 3.0, allows data engineers and data scientists to amass oceans of structured, semi-structured and unstructured data for a myriad of use cases. All the historical issues attributed to data lakes, noted Ghodsi, are now addressed by this “opinionated” standard for building reliable Delta Lakes. At length, we got an under-the-hood view of how Delta Lake provides a transactional layer for processing your data.

Technical keynote at Spark + AI 2020 Virtual Summit on the Lakehouse paradigm: One platform for data warehousing and data science.

Adding to the conversation about how the Lakehouse stack of technologies helps data teams tackle tough data problems, Reynold Xin, chief architect and co-founder of Databricks, delivered a deep dive of a new component called Delta Engine. Built on top of Apache Spark 3.0 and compatible with its APIs, the Delta Engine affords developers on the Databricks platform “massive performance” when using DataFrame APIs and SQL workloads in three primary ways:

  1. Extends cost-based query optimizer and adaptive query execution with advanced statistics
  2. Adds a native vectorized execution engine written in C++ called Photon
  3. Implements caching mechanisms for high IO throughput for the Delta Lake storage layer

Though no developer APIs exist for this engine, it offers under-the-hood acceleration of Spark workloads running on Databricks – good news for developers using SQL, Spark DataFrames or Koalas on the platform. Photon is a native execution engine purpose-built for performance; it capitalizes on data-level and CPU instruction-level parallelism, taking advantage of modern hardware. Written in C++, it optimizes modern workloads that consist mainly of string processing and regular expressions.

Built on top of Apache Spark 3.0, Databricks Delta Engine affords developers “massive performance” when using DataFrame APIs and SQL workloads.

In a similar technical format and through the lens of one of the original creators of Apache Spark, Matei Zaharia, we journeyed through 10 years of Apache Spark. Zaharia explained how Spark improved with each release — incorporating feedback from its early users and developers; adopting new use cases and workloads, such as accelerating R and SQL interactive queries and incremental data set updates for streaming; expanding access to Spark with programming languages, machine learning libraries, and high-level, structured APIs — and always kept developers’ needs front and center for Spark’s ease of use in its APIs.

Intro to Apache Spark 3.0 and 10-year Spark retrospective featured at the Spark +AI 2020 Virtual Summit

Zaharia also shared his key takeaways for developers as they relate to Apache Spark 3.0.

One salient observation that Zaharia noted: 68% of commands in the Databricks notebooks developers issue are in Python and 18% in SQL, with Scala trailing at 11%. This observation is fitting with Spark 3.0’s emphasis on Spark SQL and Python enhancements.

68% of notebook commands on Databricks are in the Python programming language.
More than 90% of Spark API calls run via Spark SQL, underscoring the emphasis on Spark SQL and Python enhancements evident in Spark 3.0.

All this was not just a litany of slides; we saw pull requests, stacktraces, notebooks, demos, and tangible code with performance improvements.

Even better, you can download Spark 3.0, get a complimentary copy of “Learning Spark, 2nd Edition,” and start using the latest features today!

Data visualization narrates a story, and Redash joining Databricks brings a new set of data visualization capabilities. For data analysts more fluent in SQL than Python, it affords the ability to explore and query data from many sources, including Delta Lakes, and to visualize and share dashboards. You can see how to augment data visualization with Redash.

As an open-source project, with over 300 contributors and 7,000 deployments, the Redash acquisition reaffirms Databricks’ commitment to the open-source developer community.

Finally, two keynotes completed the narrative with the use of popular software development tools for data teams. First, Clemens Mewald and Lauren Richie walked us through how data scientists, using the newly introduced Databricks next-generation Data Science Workspace, could work collaboratively through a new concept — Projects — using their favorite Git repository.

Introduction to the next-generation Data Science Workspace featured at Spark + AI 2020 Virtual Summit

Five easy steps allow you to create a Project as part of your Databricks workspace:

  1. Create a new Project
  2. Clone your master Git repository and create a branch for collaboration
  3. Open a Jupyter notebook, if one exists in the repository, into the Databricks notebook editor
  4. Start coding and collaborating with developers in the data team
  5. Pull, commit and push your code as desired

Not much different from your Git workflow on your laptop, this collaborative process is now available in private preview in Databricks workspace, on a scalable cluster, giving you the ability to access data in the Lakehouse and configure a project-scoped Conda environment (coming soon) if required for your project.

Second, this last bit finished the narrative of how data teams come together with new capabilities in MLflow, a platform for a complete machine learning lifecycle. Presented by Matei Zaharia and Sue Ann Hong, the key takeaways for developers were:

  • The state of the MLflow project and community, new features, and what the future holds
  • Databricks is contributing MLflow to the Linux Foundation, expanding its scope and welcoming contributions from the larger machine learning developer community
  • MLflow on Databricks announces easy Model Serving
  • The MLflow 1.9 release introduces stronger model governance, experiment autologging capabilities in some of its ML libraries, and pluggable deployment APIs

All in all, these announcements from a developer’s lens afford us a sense of new trends and paradigms in the big data and AI space; they offer new tools, APIs, and insights into how data teams can come together to solve tough data problems.

Meetups and Ask Me Anything

Let’s shift focus to communal developer activities. Since the earliest days of this conference, there’s always been a kick-off meetup with cold brew and tech banter. By the end of the meetup, excitement is in the air and the eagerness in the venue is palpable — all for the next few days of keynotes and break-away sessions. This time, we hosted the meetup virtually. And the eagerness was just as palpable — in attendance and through interactivity via the virtual Q&A panels.

Like meetups, Ask Me Anything (AMA) sessions are prevalent at conferences. As an open forum, developers from around the globe asked unfiltered questions to committers, contributors, practitioners and subject matter experts on the panel. We hosted over a half dozen AMAs, including the following:

  • Four Delta Lake AMAs
  • MLflow AMA
  • Apache Spark 3.0 and Koalas AMA
  • AMA with Matei Zaharia
  • AMA with Nate Silver

Training, tutorials and tracks

Next, the jewels in the oasis of knowledge. Catering to a range of developer roles — data scientist, data engineer, SQL analyst, platform administrator, and business leader — Databricks Academy conducted paid training for two days, with half-day repeated sessions. Equally rich in technical depth were the diverse break-away sessions, deep dives and tutorials across tracks and topics aimed at all skill levels: beginner, intermediate and advanced.

These sessions are the pulsing heartbeat of knowledge that we relish to hear about when we attend conferences. They are what propel us to travel lands and cross oceans. Except this time, you could do it from your living room, without the panic of missing sessions or running across hallways and buildings to get to the next session in time. One participant noted that this virtual conference set a high bar for technical content and engagement as she attends future virtual conferences.

Swag, parties and camaraderie

Well, you can’t have delicious curry without masala, can you? Naturally, the key conference ingredients of swag, parties and camaraderie are as important as technical sessions. Got to have the T-shirt and goodies! But this time, you didn’t have to carry your backpack to hold the goodies. Instead, you could shop at the online Summit shop, redeem your Summit points (accumulated by attending numerous events, including the two nights of virtual parties with games and activities), and have your swag shipped to you — some restrictions applied for some global destinations.

Throughout the five days, you could pose for selfies and share on social media, interact and make new connections with a global community on the virtual platform, and schedule an appointment at the DevHub to ask technical questions — and follow up later.

As developer advocates, judging by our initial expectations, we can say they were met beyond a doubt. Had it not been for the inimitable Spark + AI Summit organizing committee uniting as a #datateam, this virtual event would not have been possible. So huge kudos and hats off to them and the global community!

What’s next

Well, if you missed some sessions, all are online now for perusal. A few favorite picks include:

--

Try Databricks for free. Get started today.

The post Spark + AI Summit Reflections appeared first on Databricks.

Analyzing Customer Attrition in Subscription Models


Download the notebooks to demo the solution covered below

The subscription model is experiencing a renaissance.  Gone are the days of the penny music CD clubs, replaced by an ever-increasing assortment of digital streaming services delivering music, videos and more directly to consumers’ devices in exchange for a modest recurring fee. Today, 70% of US households subscribe to at least one subscription streaming service with an average of 3.4 such subscriptions per subscriber household.

The success of these services combined with increasing consumer demand for convenience has pushed more and more retailers and consumer goods companies in on the act.  Between 2014 and 2017, the subscription box market grew 890% with a disproportionate share of consumer interest focused on Food, Beauty and Apparel.  By the end of this period, approximately 15% of online shoppers had signed up for such services.  By late 2019, that rate had grown to over 50% with registrations extending well beyond the millennial consumers that fed the rapid growth in this space.  And as consumers rethink past spending patterns in light of ongoing health and safety concerns, the offer of reliable, doorstep-delivery of essential goods is promoting even further growth in the subscription market.

For retailers, this new model represents an opportunity to reach new customers and secure recurring revenue streams.  For consumer goods manufacturers, this model provides the additional benefit of connecting directly with consumers – something consumers are increasingly coming to expect – a connection that would otherwise be hidden behind retailers. It also opens up additional avenues for promoting brands and delivering product to the customer via routes fully in the manufacturer’s control.  At the same time, these models provide retailers the ability to introduce their own private labels in a manner that overcomes some of the past barriers to consumer adoption. The potential of the direct-to-consumer subscription market is huge, but it’s yet to be determined exactly who the winners and losers in this space will ultimately be.

There are no guarantees of success

Success in the subscription space does not come easy as “consumers do not have an inherent love of subscriptions.” Services often have to drive awareness through increasingly expensive advertising buys and entice subscribers with free or discounted trials that very frequently fail to convert to full-priced subscriptions. Should a subscriber convert, keeping them engaged is an ongoing challenge as fatigue sets in or product simply stacks up.  Quick-exit policies, intended to ease subscribers’ concerns about long-term commitments to new service providers, make it simple for customers to leave a service on relatively short notice, putting the promise of steady revenues at risk.

One recent analysis of consumer-oriented subscription services estimated a segment-average monthly churn rate of 7.2%.  When narrowed to services focused on consumer goods, that rate jumped to 10.0%. This figure translates to a lifetime of 10 months for the average subscription box service, leaving businesses little time to recover acquisition costs and bring subscribers to net profitability.

Balancing customer acquisition & retention is critical

And this is the central challenge to the long-term success of any subscription service. High profile services such as Blue Apron have provided very public case studies on the consequences of high customer acquisition costs coupled with low customer lifetime value, but every subscription service must struggle with their own balance between customer acquisition and customer retention.

This is particularly challenging in that successful customer acquisition strategies needed to get services to scale tend to be followed by service disruptions or declines in quality and customer experience, accelerating subscription abandonment. To replenish lost subscribers, the acquisition engine continues to grind and expenses mount. As services reach for customers beyond the core segments they may have initially targeted, the service offerings may not resonate with new subscribers over the same durations of time or may overwhelm the ability of these subscribers to consume, reinforcing the overall problem of subscriber churn.

At some point, it becomes critical for organizations to take a cold, hard look at the cost of acquisition relative to the subscriber lifetime value (LTV) earned. These figures need to be brought into a healthy balance, and retention needs to be actively managed, not as a point-in-time problem to be solved, but as a “chronic condition” which needs to be managed for the ongoing health of the business.

Headroom for continued acquisition-driven growth can be created by carefully examining why some customers leave and some customers stay.  When centered on factors known at the time of acquisition, businesses may have the opportunity to rethink key aspects of their acquisition strategy that promote higher average retention rates and profitability.

Examining retention based on acquisition variables

Public data for subscription services is extremely hard to come by, but one service, KKBox, a Taiwan-based music streaming service, recently made 2+ years of anonymized subscription data available for the examination of customer churn. While not a retail or CPG subscription service, the customer dynamics found in the data should resonate with any subscription provider.

The vast majority of subscribers join the KKBox service under an initial 30-day trial offering.  Customers then appear to enlist in 1-year subscriptions which provide the service with a steady flow of revenue.  Within the 30-day trial and at regular one-year intervals, subscribers have the opportunity to churn as shown in Figure 1 where Survival Rate reflects the proportion of the initial (Day 1) subscriber population that is retained over time, first at the roll-to-pay milestone, and then at the renewal milestone.

Customer attrition by subscription day on the KKBox streaming service

Figure 1. Customer attrition by subscription day on the KKBox streaming service

This pattern of high initial drop-off, followed by a period of lock-in and lessening drop-off with each renewal cycle narratively makes intuitive sense. What’s striking is that if we consider the registration channel (Figure 2), initial payment method and initial payment terms/days (Figure 3) for these subscriptions, we find vastly different patterns of customer churn, not just in the first 30-day subscription window but over the two-year duration for which customer data was made available.

Customer attrition by subscription day on the KKBox streaming service for customers registering via different channels

Figure 2. Customer attrition by subscription day on the KKBox streaming service for customers registering via different channels

Customer attrition by subscription day on the KKBox streaming service for customers selecting different initial payment methods and terms/days

Figure 3. Customer attrition by subscription day on the KKBox streaming service for customers selecting different initial payment methods and terms/days

These patterns seem to indicate that KKBox could actually differentiate between customers based on their lifetime potential using information known at the time of acquisition. This information might help inform or steer specific discounts or promotions to customers as they register for a trial. This information might also inform KKBox of which offerings or capabilities to discontinue as some, e.g. Initial Payment Method 35 or the 7-day payment plan as shown in Figure 3, align with exceptionally high churn rates in the first 30-days with little long-term survivorship.

Of course, there are relationships between these factors, so we should be careful about viewing them in isolation. By deriving a baseline risk (hazard) of customer churn (Figure 4), we can calculate the influence of different factors on that baseline in such a manner that each factor may be considered an independent hazard multiplier.  When combined (through simple multiplication) with the baseline, we can plot a specific customer’s chances of abandoning a subscription by a given point in time (Table 1).

The baseline risk of customer attrition over the first 900 days of a subscription lifespan

Figure 4. The baseline risk of customer attrition over the first 900 days of a subscription lifespan

Category | Feature | First 30 Days | 1 Year Renewal | 2 Year Renewal
Registration Channel | channel_3 | 1.06 | 1.08 | 1.08
Registration Channel | channel_4 | 1.21 | 0.69 | 0.69
Registration Channel | channel_7 | 1.00 | 1.00 | 1.00
Registration Channel | channel_9 | 1.04 | 1.00 | 1.00
Registration Channel | channel_unknown | 1.00 | 1.00 | 1.00
Initial Payment Plan Days/Term | plan_7 | 1.23 | 0.81 | 0.81
Initial Payment Plan Days/Term | plan_10 | 1.06 | 3.17 | 3.17
Initial Payment Plan Days/Term | plan_30 | 1.00 | 1.00 | 1.00
Initial Payment Plan Days/Term | plan_31 | 2.00 | 1.08 | 1.08
Initial Payment Plan Days/Term | plan_90 | 0.19 | 0.59 | 0.59
Initial Payment Plan Days/Term | plan_100 | 0.44 | 1.28 | 1.28
Initial Payment Plan Days/Term | plan_120 | 0.53 | 0.04 | 0.04
Initial Payment Plan Days/Term | plan_180 | 0.57 | 0.74 | 0.74
Initial Payment Plan Days/Term | plan_195 | 0.55 | 0.80 | 0.80
Initial Payment Plan Days/Term | plan_395 | 0.07 | 0.34 | 0.34
Initial Payment Plan Days/Term | plan_410 | 0.09 | 0.43 | 0.43
Initial Payment Method | method_20 | 2.50 | 1.74 | 1.74
Initial Payment Method | method_22 | 1.57 | 2.51 | 2.51
Initial Payment Method | method_28 | 2.38 | 1.59 | 1.59
Initial Payment Method | method_29 | 1.00 | 0.05 | 0.05
Initial Payment Method | method_30 | 1.00 | 0.62 | 0.62
Initial Payment Method | method_31 | 0.30 | 0.56 | 0.56
Initial Payment Method | method_32 | 1.45 | 2.35 | 2.35
Initial Payment Method | method_33 | 0.50 | 1.27 | 1.27
Initial Payment Method | method_34 | 0.16 | 0.54 | 0.54
Initial Payment Method | method_35 | 4.19 | 1.74 | 1.74
Initial Payment Method | method_36 | 1.11 | 1.25 | 1.25
Initial Payment Method | method_37 | 0.38 | 0.80 | 0.80
Initial Payment Method | method_38 | 1.76 | 2.38 | 2.38
Initial Payment Method | method_39 | 1.20 | 0.08 | 0.08
Initial Payment Method | method_40 | 0.67 | 1.49 | 1.49
Initial Payment Method | method_41 | 1.00 | 1.00 | 1.00

Table 1. The channel, payment method and payment plan multipliers that combine to explain a customer’s risk of attrition at various points in time. The higher the value, the higher the proportional risk of churn in the associated period.
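To make the multiplication concrete: reading from Table 1, a subscriber who registered via channel_4 (1.21), on the 7-day plan (1.23), with payment method 35 (4.19) would carry a first-30-day hazard of roughly 1.21 × 1.23 × 4.19 ≈ 6.2 times the baseline risk shown in Figure 4.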

Applying churn analytics to your data

The exciting part of this analysis is that not only does it help to quantify the risk of customer churn but it paints a quantitative picture of exactly which factors explain that risk.  It’s important that we not draw too rash of a conclusion with regards to the causal linkage between a particular attribute and its associated hazard, but it’s an excellent starting point for identifying where an organization needs to focus its attention for further investigation.

The hard part in this analysis is not the analytic techniques.  The Kaplan-Meier curves and Cox Proportional Hazard models used to perform the analysis above are well established and widely supported across analytics platforms.  The principal challenge is organizing the input data.
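As an illustration of how approachable the modeling itself is, here is a minimal sketch using the Python lifelines library (one of several options; the DataFrame and column names are assumptions about a prepared dataset, not the schema of the accompanying notebooks):

    from lifelines import KaplanMeierFitter, CoxPHFitter

    # Kaplan-Meier: the survival (retention) curve over subscription days
    kmf = KaplanMeierFitter()
    kmf.fit(durations=subscriptions["days_total"], event_observed=subscriptions["churned"])
    kmf.plot_survival_function()

    # Cox Proportional Hazards: per-feature hazard multipliers like those in Table 1
    # (categorical columns such as channel would need to be one-hot encoded first)
    cph = CoxPHFitter()
    cph.fit(subscriptions, duration_col="days_total", event_col="churned")
    cph.print_summary()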

The vast majority of subscription services are fairly new as businesses.  As such, the data required to examine customer attrition may be scattered across multiple systems, making an integrated analysis more difficult. Data lakes are a starting point for solving this problem, but the complex transformations required to cleanse and restructure data – data that has evolved as the business itself has (often rapidly) evolved – require considerable processing power.  This is certainly the case with the KKBox information assets and is a point noted by the data provider in their public challenge.

The key to successfully completing this work is the establishment of transparent, maintainable data processing pipelines executed on an elastically-scalable (and therefore cost-efficient) infrastructure, a key driver behind the Delta Lake pattern. While most organizations may not be overly cost-conscious in their initial approach, it’s important to remember the point made above that churn is a chronic condition to be managed.  As such, this is an analysis that should be periodically revisited to ensure acquisition and retention practices are aligned.

To support this, we are making the code behind our analysis available for download and review.  If you have any questions about how this solution can be deployed in your environment, please don’t hesitate to reach out to us.

Download the notebooks

--

Try Databricks for free. Get started today.

The post Analyzing Customer Attrition in Subscription Models appeared first on Databricks.

Bucket Brigade — Securing Public S3 Buckets


Are your Amazon S3 buckets secure? Do you know which ones are public or private? Do you even know which ones are supposed to be?

Data breaches are expensive. Facebook notoriously exposed 540 million records of its users recently, which was a contributing factor to the $5 billion fine it was issued by the FTC. Unfortunately, these kinds of breaches feel increasingly common. By one account, 7% of all Amazon Web Services (AWS) S3 buckets are publicly accessible. While some of these buckets are intentionally public, it’s all too common for non-public sensitive data to be exposed accidentally in public-facing buckets.


The Databricks security team recently encountered this situation ourselves. We were alerted to a potentially non-public object contained in one of our public buckets. While the result was benign, we set out to improve our overall bucket security posture and be able to affirmatively answer the questions above at all times. We set out to provide not only tighter controls around what was and was not public, but also to monitor our intentionally public resources for unintentional exposures.

Our S3 Bucket Security Solution

As a response to our initial alert, we took action to identify all of our S3 buckets and their public/non-public status. Since Databricks is a cloud-native company, we had already deployed JupiterOne, a commercial cloud asset management solution that allowed us to quickly query and determine which assets were public. Open-source tools such as Cartography from Lyft offer similar query capabilities. With the outputs of our queries, we were able to quickly identify and triage our public buckets and determine whether they should remain public.

Having verified whether our buckets should be public, we wanted to address a few questions:

  1. Did we have any existing non-public files in our intentionally public buckets?
  2. How could we prevent buckets, and their objects, from becoming public unintentionally?
  3. How could we continuously monitor for secrets in our public buckets and get real-time alerts?

Tools

To address each of these questions, we found, implemented and created tools. Each tool below addresses one of the above questions.

YAR: (Y)et (A)nother (R)obber

To address our first question – whether we had any existing non-public files in our intentionally public buckets – we repurposed Níels Ingi’s YAR tool, typically used for scanning GitHub repositories for secrets, to scan our existing public buckets. We spun up EC2 instances, synchronized the bucket contents, customized the YAR configuration with additional patterns specific to our secrets, and ran YAR. To speed this up, we wrote a script that wraps YAR, allows for parallel execution, and applies pretty formatting to the results.

Sample YAR Output with Secret Context
Sample YAR output with secret context, deployed as part of Databricks’ AWS S3 bucket security solution.

As shown above, YAR provides context for lines surrounding the secret, which allowed us to more quickly triage identified potential secrets.

Cloud Custodian Access Management

To address our second question of how we could prevent buckets, and their objects, from becoming public unintentionally, we employed Cloud Custodian. We had already deployed this self-identified “opensource cloud security, governance and management” tool for other purposes, and it became clear it could help us in this effort too. Cloud Custodian’s real-time compliance uses AWS Lambda functions and CloudWatch events to detect changes to configuration.

We added a Cloud Custodian policy to automatically enable AWS public access blocks for buckets and their objects exposed publicly through any access control lists (ACLs). We created an internal policy and process to get exceptions for intentionally public buckets that required this functionality to remain disabled. This meant that between Cloud Custodian’s enforcement and JupiterOne’s alerting for buckets configured with public access, we could ensure that none of our buckets were unintentionally public or exposing objects.

S3 Secrets Scanner

To address our third question – how we could continuously monitor for secrets in our public buckets and get real-time alerts – we wanted additional assurance that objects in our intentionally public buckets did not get updated with non-public content. Importantly, we wanted this information as quickly as possible to limit the window of exposure in the event of an accidental leak.

We developed a simple solution. We employed S3 events to trigger on any modification to objects in our S3 buckets. Whenever bucket objects are created or updated, the bucket’s event creates an SQS queue entry for the change. Lambda functions then process these events inspecting each file for secrets using pattern matching (similar to the YAR method). If any matches are found, the output is sent to an alert queue. The security team receives alerts from the queue, allowing for a near-real-time response to the potential exposure.

S3 Secrets Scanner Architecture
AWS S3 Secrets Scanner Architecture deployed as part of Databricks’ S3 Bucket Security Solution.
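To make the flow concrete, here is a minimal sketch of such a Lambda handler; the secret patterns, environment variable and queue names are illustrative assumptions, not the actual Bucket Brigade code (which is available in the open source repository).

    import json
    import os
    import re

    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    # Hypothetical patterns and alert queue; a real deployment would tune these
    SECRET_PATTERNS = [
        re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key id
        re.compile(r"-----BEGIN (RSA )?PRIVATE KEY-----"),  # private key material
    ]
    ALERT_QUEUE_URL = os.environ["ALERT_QUEUE_URL"]

    def handler(event, context):
        # Each SQS message wraps one or more S3 object created/updated events
        for message in event["Records"]:
            s3_event = json.loads(message["body"])
            for record in s3_event.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
                text = body.decode("utf-8", errors="ignore")
                hits = [p.pattern for p in SECRET_PATTERNS if p.search(text)]
                if hits:
                    # Near-real-time alert for the security team to triage
                    sqs.send_message(
                        QueueUrl=ALERT_QUEUE_URL,
                        MessageBody=json.dumps({"bucket": bucket, "key": key, "patterns": hits}),
                    )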

With all of the above questions addressed, we were feeling pretty good until …

Complications

Shortly after we deployed the above tooling, the Databricks security team was alerted to an outage that appeared to be caused by our new tooling and processes. In order to resolve the immediate issue, we had to roll back our Cloud Custodian changes. Through Databricks’ postmortem process, it became clear that the notifications we had sent about the tooling and process changes hadn’t been sufficient. Our solution was not universally understood and was causing confusion for people who suddenly had buckets enforcing access limits. Additionally, in one case, a bucket was nearly deleted when the security team followed up on an alert that it was public. The apparent owner indicated that the bucket was no longer needed, but after removing access – a precaution prior to deleting the bucket – we discovered that another team, unknown to that owner, was still using it.

It became apparent that clear bucket ownership was key to our success for securing our buckets. Clear, authoritative ownership meant that we would be able to get quick answers to alerts and would provide an accountable party for the exception process for intentionally public buckets.

While we’d adopted some best practices for ownership identification using tags, we had some inconsistencies in the implementation. This meant that even when people were “doing the right thing”, the result was inconsistent and made it difficult to clearly identify owners.

Our Solution Revisited

We stepped back from enforcing our initial solution until we could gather clearer ownership information and provide better communication.

We turned to JupiterOne again to query our S3 ownership information. Armed with a literal bucket list, we went to each AWS account owner for help in identifying a clear owner for each bucket. We published policies on required tagging standards for ownership to ensure consistency in the method used. We also published policies clarifying the Cloud Custodian public access blocks and the exception process. Our CISO sent notifications to both leadership and the broader company, creating awareness and aligning support behind our new policies and process.

We added ownership tagging enforcement via Cloud Custodian to our overall technical strategy. Importantly, our communication process helped identify some modifications to our enforcement approach. Originally, we identified that we would remove buckets missing compliant ownership tags. While this was appropriate for non-production resources, it was too aggressive for production resources. Instead, for production resources we updated Cloud Custodian to create tickets within our ticketing system with defined service level agreements (SLAs) to ensure that ownership would not be missing for long and without putting critical resources at risk.

Implementing a clear ownership process and providing clear communication to leadership, AWS account owners and users took more time than our original technical solution implementation, but it meant that this time we could enable it without any concern of an outage or accidentally removing important contents.

Databricks Loves to Open Source

Databricks is a big participant in the open source community –  we are the original creators of Apache Spark™, Delta Lake™, Koalas and more – and secure data is extremely important to us. So, naturally having solved this problem for ourselves, we wanted to share the result with the community.

However, as security professionals, we considered what should be part of our offering. We have commonly seen tools open sourced, but in this case we identified that equally important to our tooling was the communication, policies and process surrounding those tools. For this reason, we are open sourcing Bucket Brigade – a complete solution for securing your public buckets.

The open source repository includes a detailed write-up of each of the solution phases with step-by-step implementation guides, including the code for the tools implemented. Perhaps equally importantly, it also includes policies for maintaining clear bucket ownership, enforcing public access blocks and getting exceptions. It extends the policies with email templates for leadership and user communication to ensure that the policies and process are well-understood prior to implementation.

Let’s all be able to answer the questions about our public S3 data and avoid being part of the problem.

Check out our open source Bucket Brigade solution.

--

Try Databricks for free. Get started today.

The post Bucket Brigade — Securing Public S3 Buckets appeared first on Databricks.


Optimizing User Defined Functions with Apache Spark™ and R in the Real World: Scaling Pitch Scenario Analysis with the Minnesota Twins Part 2


Introduction

In part 1 we talked about how Baseball Operations for the Minnesota Twins wanted to run up to 20k simulations on 15 million historical pitches – 300 billion total simulations –  to more accurately evaluate player performance.  The idea is simple:  if 15 million historical pitches sketched an image of player performance, 300 billion simulated pitches from each players’ distribution would sharpen that image and provide more reliable valuations.  This data would impact coaching and personnel decisions with the goal of generating more wins, and by extension, revenue for the club.

All of the scripts and machine learning models to generate and score data were written in R.  Even when running these scripts with multi-threading packages in R they estimated it would take 3.8 years to process all of the simulations.  With user defined functions (UDFs) in Apache Spark and Databricks we were able to reduce the data processing time to 2-3 days for 300 billion simulations on historical data sets, and near real time for in-game data. By enabling near real-time scoring of in-game pitches, the Twins are looking to eventually optimize lineup and strategy decisions based on in-game conditions, for example, choosing the best pitcher and pitch given the batter, weather, inning, and speed + rotation readings from the pitcher’s last throws.

By combining the vast ecosystem of R packages with the scalability and parallelism of Spark, UDFs can be extraordinarily powerful not just in sports but across industry use cases.  In addition to our model inference use case for the Twins, consider the following applications:

  • Generating sales forecasts for thousands of consumer products using time series packages like prophet
  • Simulating the performance of hundreds of financial portfolios
  • Simulating transportation schedules for fleets of vehicles
  • Finding the best model fit by searching thousands of hyperparameters in parallel

As exciting and tantalizing as these applications are, their power comes at a cost.  Ask anyone who has tried and they will tell you that implementing a UDF that can scale gracefully can be quite challenging.   This is due to the need to efficiently manage cores and memory on the cluster, and the tension between them.  The key is to structure the job in such a way that Spark can linearly scale it.

In this post we embark on a journey into what it takes to craft UDFs that scale.  Success hinges on an understanding of storage, Spark, R, and the interactions between them.

Understanding UDFs with Spark and R

Generally speaking, when you use SparkR or sparklyr your R code is translated into Scala, Spark’s native language.  In these cases the R process is limited to the driver node of the Spark cluster, while the rest of the cluster completes tasks in Scala.  User defined functions, however, provide access to an R process on each worker, letting you apply an R function to each group or partition of a Spark DataFrame in parallel before returning the results.

How does Spark orchestrate all of this?  You can see the control flow clearly in the diagram below.

With Spark UDFs, you can apply an R function to each group or partition of a Spark DataFrame in parallel before returning the results.

As part of each task Spark will create a temporary R session on each worker, serialize the R closure, then distribute the UDF across the cluster.  While the R session on each worker is active, the full power of the R ecosystem can be leveraged inside the UDF.  When the R code is finished executing the session is terminated, and results sent back to the Spark context.  You can learn more about this by watching the talk here and in this blog post.

Getting UDFs Right

“…you will be fastest if you avoid doing the work in the first place.” [1] 

Now that we understand the basics of how UDFs are executed, let’s take a closer look at potential bottlenecks in the system and how to eliminate them.  There are essentially four key areas to understand when writing these functions:

  1. Data Sources
  2. Data Transfer in Spark
  3. Data Transfer Between Spark and R
  4. R Process

1. Data Sources: Minimizing Storage I/O

The first step is to plan how data is organized in storage.  Many R users may be used to working with flat files, but a key principle is to only ingest what is necessary for the UDF to execute properly. A significant portion of your job will be I/O to and from storage, and if your data is currently in an unoptimized file format (like CSV) Spark may need to read the entire data set into memory.  This can be painfully slow and inefficient, particularly if you don’t need all of the contents of that file.

For this reason we recommend saving your data in a scalable format like Delta Lake.  Delta speeds up ingestion into Spark by partitioning data in storage, optimizing the size of these partitions, and creating a secondary index with Z-Ordering.  Taken together, these features help limit the volume of data that needs to be accessed in a UDF.

How so?  Imagine we partitioned our baseball data in storage by pitch type and directed Spark to read rows where pitch type equals ‘curveball’.  Using Delta we could skip the ingestion of all rows with other pitch types.  This reduction in the scan of data can speed up reads tremendously – if only 10% of your data contains curveball pitches, you can effectively skip reading 90% of your dataset into memory!

By using a storage layer like Delta Lake and a partitioning strategy that corresponds to what data will be processed by the UDF, you will have laid a solid foundation and eliminated a potential bottleneck to your job.

2a. Data Transfer in Spark: Optimizing Partition Size in Memory

The size of partitions in memory can affect the performance of feature engineering/ETL pipelines leading up to and including the UDF itself.  In general, whenever Spark has to perform a wide transformation like a join or group by, data must be shuffled across the cluster. The default setting for the number of shuffle partitions is arbitrarily set to 200, meaning that at the time of a shuffle operation the data in your Spark DataFrame is distributed across 200 partitions.

This can create inefficiencies depending on the size of your data.  If your dataset is smaller, 200 partitions may be over-parallelizing the work, causing unnecessary scheduling overhead and tasks with very little data in them.  If the dataset is large you may be under-parallelizing and not effectively using the resources on your cluster.

As a general rule of thumb, keeping the size of shuffle partitions between 128-200MB will maximize parallelism while avoiding spilling data to disk.  To identify how many shuffle partitions there should be, use the Spark UI for your longest job to sort the shuffle read sizes.  Divide the size of the largest shuffle read stage by 128MB to arrive at the optimal number of partitions for your job – for example, a 50GB shuffle read stage divided by 128MB suggests roughly 400 partitions.  You can then set the spark.sql.shuffle.partitions config in SparkR like this:

sparkR.session(sparkConfig = list(spark.sql.shuffle.partitions = "400"))

Actively repartitioning a Spark DataFrame is also impacted by this setting as it requires a shuffle.  As we’ll see, this behavior can be used to manage memory pressure in other parts of the system like garbage collection and data transfer between Spark and R.

2b. Data Transfer in Spark: Garbage Collection and Cluster Sizing

When you have a big data problem it can be tempting to adopt a brute force approach and reach for the largest worker type, but the solution may not be so simple.  Garbage collection in the Java Virtual Machine (JVM) tends to get out of control when there are large objects in memory that are no longer being used.  Using very large workers can exacerbate this problem because there’s more room to create large objects in the first place.  Managing the size of objects in memory is thus a key consideration of the solution architecture.

For this particular job we found that a few large workers or many small workers did not perform as well as many medium sized workers.  Large workers would generate excessive garbage collection causing the job to hang indefinitely, while small workers would simply run out of memory.

To address this we gradually increased the size of workers until we wound up in the middle range of RAM and CPU that JVM garbage collection can gracefully handle.  We also repartitioned the input Spark DataFrame to our UDF and increased its partitions.  Both measures were effective at managing the size of objects in the JVM and helped keep garbage collection to less than 10% of the total task time for each Spark executor.  If we wanted to score more records we could simply add more medium-sized workers to the cluster and increase the partitions of the input DataFrame in a linear fashion.

3. Data Transfer between Spark and R

The next step to consider was how data is passed between Spark and R. Here we identified two potential bottlenecks – overall I/O and the corresponding (de)serialization that occurs between processes.

First, only input what is necessary for the UDF to execute properly.  Similar to how we optimize I/O reads from storage, filter the input Spark DataFrame to contain only those columns necessary for the UDF. If our Spark DataFrame has 30 columns and we only need 4 of them for the UDF, subset your data accordingly and use that as input instead.  This will speed up execution by reducing I/O and related (de)serialization.

If you’ve subset the input data appropriately and still have out-of-memory issues, repartitioning can help control how much data is transferred between Spark and R.  For example, applying a UDF to 200GB of data across 100 partitions will result in 2GB of data sent to R in each task.  If we increase the number of partitions to 200 using the `repartition()` function from SparkR, then 1 GB will be sent to R in each task.  The tradeoff of more partitions is more (de)serialization tasks between the JVM and R, but less data and subsequent memory pressure in each task.

You might think a typical 14GB RAM Spark worker would be able to handle a 2GB partition of data with room to spare, but in practice you will require at least 30GB RAM if you want to avoid out-of-memory errors!  This can be a rude awakening for many developers trying to get started with UDFs in Spark and R, and can cause costs to skyrocket.  Why do the workers need so much memory?

The fact is that Spark and R represent data in memory quite differently.  To transfer data from Spark to R, a copy must be created and then converted to an in-memory format that R can use.  Recall that in the UDF architecture diagram above, objects need to be serialized and deserialized every time they move between the two contexts.  This is slow and creates enormous memory overhead, with workers typically requiring an order of magnitude greater memory than expected.

We can mitigate this bottleneck by replacing the two distinct in-memory formats with a single one using Apache Arrow.  Arrow is designed to quickly and efficiently transfer data between different systems – like Spark and R – by using a columnar format similar to Parquet.  This eliminates the time spent in serialization/deserialization as well as the increased memory overhead.  It is not uncommon to see speed-ups of 10-100x when comparing workloads with vs. without Arrow, so using an optimization like this is critical when working with UDFs.  Spark 3.0 will include support for Arrow in SparkR, and you can install Arrow for R on Databricks by following the instructions here.  It’s also worth noting that a similar out-of-the-box optimizer is available for SparkR on Databricks Runtime.

4. R Process: Managing Idiosyncrasies of R

Each language has its own quirks, and so now we turn our attention to R itself.  Consider that the function in your UDF may be applied hundreds or thousands of times on the workers, so it pays to be mindful of how R is using resources.

In tuning the pitch scoring function we identified loading the model object and commands that trigger R’s copy-on-modify behavior as potential bottlenecks in the job.  How’s that?  Let’s examine these two more carefully, beginning with loading the model object.

If you’ve worked with R long enough, you probably know that many R packages include training data as part of the model object.  This is fine when the data is small, but when the data grows it can become a significant problem – some of the pitch models were nearly 2GB in size!   In our case the overhead associated with loading and dropping these models from memory in each execution of the UDF limited the scale to 300 million rows, 3 orders of magnitude away from 300 billion.

Furthermore, we noticed that the model had been saved as part of a broader training function (trainmodel).

Inadvertently saving a model this way can drag along a significant amount of other objects in your R environment, inflating its size.  As it turns out there are several options to trim these models in a way that would make loading them in the UDF efficient while leaving their ability to predict intact.

Depending on what R package you used for training, you can use the aptly named butcher package to directly trim existing models of attributes that are not required for prediction.  You can also do this yourself - and we tried - but saw mixed results and found ourselves digging into deeply nested R objects while looking at model package source code to figure out where the bloat was coming from.  Not recommended!  Better to use a package designed to handle this when possible.

Another way to reduce the model size is to retrain on a subset of the initial features.  This can be worth it if you are trading a tiny bit of accuracy for massive scale.  Given that we had to save the model object separately from the training function anyway, it made sense to retrain. Saving the model outside of a function and retraining on a subset of features reduced the pitch outcome model size from 2GB to 93MB, a 94% reduction.

A word on serialization from R to disk: while saveRDS() and readRDS() are the standard for serializing R objects to and from storage, we found speed enhancements with qsave() and qload() from the qs (quick serialization) package.  Furthermore, reads from local storage are generally faster than from cloud storage.  To take advantage of this, add logic to your UDF that reads these serialized objects from the worker’s local storage rather than directly from cloud storage.

Lastly, to eliminate copy-on-modify behavior, we used the data.table package for model inference inside the UDF.  Looping and assigning predictions to a column of our output dataframe was creating many copies of that data.frame and the model object itself, consuming additional memory and causing the R process to choke.  Thankfully with data.table we were able to modify the column in place:

data_table[, pitch_outcome := predict(model, data_table, type = 'response')]

By efficiently loading model objects into memory and eliminating copy-on-modify behavior, we streamlined R’s execution within the UDF.

Testing, Debugging and Monitoring

Understanding how to begin testing, debugging and monitoring your UDF is just as important as knowing how the system works.  Here are some tips to get started.

For testing, it is best to start by simply counting rows in each group or partition of data, making sure every row is flowing through the UDF.  Then take a subset of input data and start to slowly introduce additional logic and the results that you return until you get a working prototype.  Finally, slowly scale up the execution by adding more rows until you’ve submitted all of them.

At some point you will almost certainly hit errors and need to debug. If the error is with Spark you will see the stack trace in the console of the driver.  If the error is with your R code inside the UDF, this will not be shown in the driver output.  You’ll need to open up the Spark UI and check stderr from the worker logs to see the stack trace from the R process.

Once you have an error free UDF, you can monitor the execution using Ganglia metrics.  Ganglia contains detailed information on how and where the cluster is being utilized and can provide valuable clues to where a bottleneck may lie.  For the pitch scoring pipeline, we used Ganglia to help diagnose multiple issues - we saw CPU idle time explode when model objects were too large, and eliminated swap memory utilization when we increased worker size.

Putting it All Together

At this point we have discussed I/O and resource management when reading into Spark, across workers, between Spark and R, and in R itself.  Let’s take what we’ve learned and compare what an unoptimized architecture looks like vs. an optimized one:

Without optimization, the typical UDF architecture for Spark and R is inherently inefficient, significantly increasing the time and cost of running your simulations.

With the addition of technologies like Delta Lake, Apache Arrow and Databricks Runtime, you can optimize each major component of your UDF architecture for Spark and R, thus increasing the performance and capacity of any simulation.

Conclusion

At the beginning of this post, we stated that the key to crafting a UDF with Spark and R is to structure the job in such a way that Spark can linearly scale it. In this post we learned about how to scale a model scoring UDF by optimizing each major component of the architecture - from storing data in Delta Lake, to data transfer in Spark, to data transfer between R and Spark, and finally to the R process itself.  While all UDFs are somewhat different, R developers should now feel confident in their ability to navigate this pattern and avoid major stumbling blocks along the way.

--

Try Databricks for free. Get started today.

The post Optimizing User Defined Functions with Apache Spark™ and R in the Real World: Scaling Pitch Scenario Analysis with the Minnesota Twins Part 2 appeared first on Databricks.

A Comprehensive Look at Dates and Timestamps in Apache Spark™ 3.0

Apache Spark is a very popular tool for processing structured and unstructured data. When it comes to processing structured data, it supports many basic data types, like integer, long, double, string, etc. Spark also supports more complex data types, like the Date and Timestamp, which are often difficult for developers to understand. In this blog post, we take a deep dive into the Date and Timestamp types to help you fully understand their behavior and how to avoid some common issues. In summary, this blog covers four parts:

  1. The definition of the Date type and the associated calendar. It also covers the calendar switch in Spark 3.0.
  2. The definition of the Timestamp type and how it relates to time zones. It also explains the detail of time zone offset resolution, and the subtle behavior changes in the new time API in Java 8, which is used by Spark 3.0.
  3. The common APIs to construct date and timestamp values in Spark.
  4. The common pitfalls and best practices to collect date and timestamp objects on the Spark driver.

Date and calendar

The definition of a Date is very simple: It’s a combination of the year, month and day fields, like (year=2012, month=12, day=31). However, the values of the year, month and day fields have constraints, so that the date value is a valid day in the real world. For example, the value of month must be from 1 to 12, the value of day must be from 1 to 28/29/30/31 (depending on the year and month), and so on.

These constraints are defined by one of many possible calendars. Some of them are only used in specific regions, like the Lunar calendar. Some of them are only used in history, like the Julian calendar. At this point, the Gregorian calendar is the de facto international standard and is used almost everywhere in the world for civil purposes. It was introduced in 1582 and is extended to support dates before 1582 as well. This extended calendar is called the Proleptic Gregorian calendar.

Starting from version 3.0, Spark uses the Proleptic Gregorian calendar, which is already being used by other data systems like pandas, R and Apache Arrow. Before Spark 3.0, it used a combination of the Julian and Gregorian calendars: the Julian calendar for dates before 1582 and the Gregorian calendar for dates from 1582 onward. This is inherited from the legacy java.sql.Date API, which was superseded in Java 8 by java.time.LocalDate, which uses the Proleptic Gregorian calendar as well.

Notably, the Date type does not consider time zones.

Timestamp and time zone

The Timestamp type extends the Date type with new fields: hour, minute, second (which can have a fractional part), together with a global (session scoped) time zone. It defines a concrete time instant on Earth. For example, (year=2012, month=12, day=31, hour=23, minute=59, second=59.123456) with session timezone UTC+01:00. When writing timestamp values out to non-text data sources like Parquet, the values are just instants (like timestamp in UTC) that have no time zone information. If you write and read a timestamp value with a different session timezone, you may see different values of the hour/minute/second fields, but they are actually the same concrete time instant.

The hour, minute and second fields have standard ranges: 0–23 for hours and 0–59 for minutes and seconds. Spark supports fractional seconds with up to microsecond precision. The valid range for fractions is from 0 to 999,999 microseconds.

At any concrete instant, we can observe many different values of wall clocks, depending on time zone.

And conversely, any value on wall clocks can represent many different time instants. The time zone offset allows us to unambiguously bind a local timestamp to a time instant. Usually, time zone offsets are defined as offsets in hours from Greenwich Mean Time (GMT) or UTC+0 (Coordinated Universal Time). Such a representation of time zone information eliminates ambiguity, but it is inconvenient for end users. Users prefer to point out a location around the globe such as America/Los_Angeles or Europe/Paris.

This additional level of abstraction from zone offsets makes life easier but brings its own problems. For example, we now have to maintain a special time zone database to map time zone names to offsets. Since Spark runs on the JVM, it delegates the mapping to the Java standard library, which loads data from the Internet Assigned Numbers Authority Time Zone Database (IANA TZDB). Furthermore, the mapping mechanism in Java’s standard library has some nuances that influence Spark’s behavior. We focus on some of these nuances below.

Since Java 8, the JDK has exposed a new API for date-time manipulation and time zone offset resolution, and Spark migrated to this new API in version 3.0. Although the mapping of time zone names to offsets has the same source, IANA TZDB, it is implemented differently in Java 8 and higher versus Java 7.

As an example, let’s take a look at a timestamp before the year 1883 in the America/Los_Angeles time zone: 1883-11-10 00:00:00. This year stands out from others because on November 18, 1883, all North American railroads switched to a new standard time system that henceforth governed their timetables.

Using the Java 7 time API, we can obtain the time zone offset at the local timestamp as -08:00:

scala> java.time.ZoneId.systemDefault
res0: java.time.ZoneId = America/Los_Angeles
scala> java.sql.Timestamp.valueOf("1883-11-10 00:00:00").getTimezoneOffset / 60.0
res1: Double = 8.0

Java 8 API functions return a different result:

scala> java.time.ZoneId.of("America/Los_Angeles")
.getRules.getOffset(java.time.LocalDateTime.parse("1883-11-10T00:00:00"))
res2: java.time.ZoneOffset = -07:52:58

Prior to November 18, 1883, time of day was a local matter, and most cities and towns used some form of local solar time, maintained by a well-known clock (on a church steeple, for example, or in a jeweler’s window). That’s why we see such a strange time zone offset.

The example demonstrates that Java 8 functions are more precise and take into account historical data from IANA TZDB. After switching to the Java 8 time API, Spark 3.0 benefited from the improvement automatically and became more precise in how it resolves time zone offsets.

As we mentioned earlier, Spark 3.0 also switched to the Proleptic Gregorian calendar for the date type. The same is true for the timestamp type. The ISO SQL:2016 standard declares the valid range for timestamps is from 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999999. Spark 3.0 fully conforms to the standard and supports all timestamps in this range. Compared to Spark 2.4 and earlier, we should highlight the following sub-ranges:

  1. 0001-01-01 00:00:00..1582-10-03 23:59:59.999999. Spark 2.4 uses the Julian calendar and doesn’t conform to the standard. Spark 3.0 fixes the issue and applies the Proleptic Gregorian calendar in internal operations on timestamps such as getting year, month, day, etc. Due to different calendars, some dates that exist in Spark 2.4 don’t exist in Spark 3.0. For example, 1000-02-29 is not a valid date because 1000 isn’t a leap year in the Gregorian calendar. Also, Spark 2.4 resolves time zone name to zone offsets incorrectly for this timestamp range.
  2. 1582-10-04 00:00:00..1582-10-14 23:59:59.999999. This is a valid range of local timestamps in Spark 3.0, in contrast to Spark 2.4 where such timestamps didn’t exist.
  3. 1582-10-15 00:00:00..1899-12-31 23:59:59.999999. Spark 3.0 resolves time zone offsets correctly using historical data from IANA TZDB. Compared to Spark 3.0, Spark 2.4 might resolve zone offsets from time zone names incorrectly in some cases, as we showed above in the example.
  4. 1900-01-01 00:00:00..2036-12-31 23:59:59.999999. Both Spark 3.0 and Spark 2.4 conform to the ANSI SQL standard and use Gregorian calendar in date-time operations such as getting the day of the month.
  5. 2037-01-01 00:00:00..9999-12-31 23:59:59.999999. Spark 2.4 can resolve time zone offsets and in particular daylight saving time offsets incorrectly because of the JDK bug #8073446. Spark 3.0 does not suffer from this defect.

One more aspect of mapping time zone names to offsets is the overlapping of local timestamps that can happen due to daylight saving time (DST) or switching to another standard time zone offset. For instance, on 3 November 2019, 02:00:00 clocks were turned backward 1 hour to 01:00:00. The local timestamp 2019-11-03 01:30:00 America/Los_Angeles can be mapped either to 2019-11-03 01:30:00 UTC-08:00 or 2019-11-03 01:30:00 UTC-07:00. If you don’t specify the offset and just set the time zone name (e.g., '2019-11-03 01:30:00 America/Los_Angeles'), Spark 3.0 will take the earlier offset, typically corresponding to “summer.” The behavior diverges from Spark 2.4, which takes the “winter” offset. In the case of a gap, where clocks jump forward, there is no valid offset. For a typical one-hour daylight saving time change, Spark will move such timestamps to the next valid timestamp corresponding to “summer” time.

As we can see from the examples above, the mapping of time zone names to offsets is ambiguous, and it is not one to one. In the cases when it is possible, we would recommend specifying exact time zone offsets when making timestamps, for example timestamp '2019-11-03 01:30:00 UTC-07:00'.
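
As a minimal sketch (assuming a Spark 3.0 session available as spark), the difference between the ambiguous local timestamp and one pinned with an explicit offset can be inspected directly from PySpark:

# Compare the ambiguous "fall back" timestamp with an explicit-offset literal
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("""
    SELECT timestamp '2019-11-03 01:30:00'           AS ambiguous_local,
           timestamp '2019-11-03 01:30:00 UTC-07:00' AS explicit_offset
""").show(truncate=False)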

Let’s move away from zone name to offset mapping, and look at the ANSI SQL standard. It defines two types of timestamps:

  1. TIMESTAMP WITHOUT TIME ZONE or TIMESTAMP – Local timestamp as (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND). These kinds of timestamps are not bound to any time zone, and actually are wall clock timestamps.
  2. TIMESTAMP WITH TIME ZONE – Zoned timestamp as (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, TIMEZONE_HOUR, TIMEZONE_MINUTE). The timestamps represent an instant in the UTC time zone + a time zone offset (in hours and minutes) associated with each value.

The time zone offset of a TIMESTAMP WITH TIME ZONE does not affect the physical point in time that the timestamp represents, as that is fully represented by the UTC time instant given by the other timestamp components. Instead, the time zone offset only affects the default behavior of a timestamp value for display, date/time component extraction (e.g. EXTRACT), and other operations that require knowing a time zone, such as adding months to a timestamp.

Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME ZONE, which is a combination of the fields (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, SESSION TZ) where the YEAR through SECOND field identify a time instant in the UTC time zone, and where SESSION TZ is taken from the SQL config spark.sql.session.timeZone. The session time zone can be set as:

  • Zone offset '(+|-)HH:mm'. This form allows us to define a physical point in time unambiguously.
  • Time zone name in the form of region ID 'area/city', such as 'America/Los_Angeles'. This form of time zone info suffers from some of the problems that we described above like overlapping of local timestamps. However, each UTC time instant is unambiguously associated with one time zone offset for any region ID, and as a result, each timestamp with a region ID based time zone can be unambiguously converted to a timestamp with a zone offset.

By default, the session time zone is set to the default time zone of the Java virtual machine.
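
For example, the session time zone can be set from PySpark in either form (a minimal sketch, assuming an existing SparkSession named spark):

# Region ID form
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
# Fixed-offset form
# spark.conf.set("spark.sql.session.timeZone", "+01:00")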

Spark’s TIMESTAMP WITH SESSION TIME ZONE is different from:

  1. TIMESTAMP WITHOUT TIME ZONE, because a value of this type can map to multiple physical time instants, but any value of TIMESTAMP WITH SESSION TIME ZONE is a concrete physical time instant. The SQL type can be emulated by using one fixed time zone offset across all sessions, for instance UTC+0. In that case, we could consider timestamps at UTC as local timestamps.
  2. TIMESTAMP WITH TIME ZONE, because according to the SQL standard column values of the type can have different time zone offsets. That is not supported by Spark SQL.

We should notice that timestamps that are associated with a global (session scoped) time zone are not something newly invented by Spark SQL. RDBMSs such as Oracle provide a similar type for timestamps too: TIMESTAMP WITH LOCAL TIME ZONE.

Constructing dates and timestamps

Spark SQL provides a few methods for constructing date and timestamp values:

  1. Default constructors without parameters: CURRENT_TIMESTAMP() and CURRENT_DATE().
  2. From other primitive Spark SQL types, such as INT, LONG, and STRING
  3. From external types like Python datetime or Java classes java.time.LocalDate/Instant.
  4. Deserialization from data sources CSV, JSON, Avro, Parquet, ORC or others.

The function MAKE_DATE introduced in Spark 3.0 takes three parameters: YEAR, MONTH of the year, and DAY in the month and makes a DATE value. All input parameters are implicitly converted to the INT type whenever possible. The function checks that the resulting dates are valid dates in the Proleptic Gregorian calendar, otherwise it returns NULL. For example in PySpark:

>>> spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)],
... ['Y', 'M', 'D']).createTempView('YMD')
>>> df = sql('select make_date(Y, M, D) as date from YMD')
>>> df.printSchema()
root
 |-- date: date (nullable = true)

To print DataFrame content, let’s call the show() action, which converts dates to strings on executors and transfers the strings to the driver to output them on the console:

>>> df.show()
+-----------+
|       date|
+-----------+
| 2020-06-26|
|       null|
|-0044-01-01|
+-----------+	

Similarly, we can make timestamp values via the MAKE_TIMESTAMP function.  Like MAKE_DATE, it performs the same validation for the date fields, and additionally accepts the time fields HOUR (0-23), MINUTE (0-59) and SECOND (0-60).  SECOND has the type Decimal(precision = 8, scale = 6) because seconds can be passed with a fractional part up to microsecond precision.  For example in PySpark:

>>> df = spark.createDataFrame([(2020, 6, 28, 10, 31, 30.123456),
... (1582, 10, 10, 0, 1, 2.0001), (2019, 2, 29, 9, 29, 1.0)],
... ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE', 'SECOND'])
>>> df.show()
+----+-----+---+----+------+---------+
|YEAR|MONTH|DAY|HOUR|MINUTE|   SECOND|
+----+-----+---+----+------+---------+
|2020|    6| 28|  10|    31|30.123456|
|1582|   10| 10|   0|     1|   2.0001|
|2019|    2| 29|   9|    29|      1.0|
+----+-----+---+----+------+---------+

>>> ts = df.selectExpr("make_timestamp(YEAR, MONTH, DAY, HOUR, MINUTE, SECOND) as MAKE_TIMESTAMP")
>>> ts.printSchema()
root
 |-- MAKE_TIMESTAMP: timestamp (nullable = true)

As we did for dates, let’s print the content of the ts DataFrame using the show() action. In a similar way, show() converts timestamps to strings but now it takes into account the session time zone defined by the SQL config spark.sql.session.timeZone. We will see that in the following examples.

>>> ts.show(truncate=False)
+--------------------------+
|MAKE_TIMESTAMP            |
+--------------------------+
|2020-06-28 10:31:30.123456|
|1582-10-10 00:01:02.0001  |
|null                      |
+--------------------------+

Spark cannot create the last timestamp because this date is not valid: 2019 is not a leap year.

You might notice that we didn’t provide any time zone information in the example above. In that case, Spark takes a time zone from the SQL configuration spark.sql.session.timeZone and applies it to function invocations. You can also pick a different time zone by passing it as the last parameter of MAKE_TIMESTAMP. Here is an example in PySpark:

>>> df = spark.createDataFrame([(2020, 6, 28, 10, 31, 30, 'UTC'),
...     (1582, 10, 10, 0, 1, 2, 'America/Los_Angeles'),
...     (2019, 2, 28, 9, 29, 1, 'Europe/Moscow')],
...     ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'TZ'])
>>> df = df.selectExpr('make_timestamp(YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, TZ) as MAKE_TIMESTAMP')
>>> df = df.selectExpr("date_format(MAKE_TIMESTAMP, 'yyyy-MM-dd HH:mm:SS VV') AS TIMESTAMP_STRING")
>>> df.show(truncate=False)
+---------------------------------+
|TIMESTAMP_STRING                 |
+---------------------------------+
|2020-06-28 13:31:00 Europe/Moscow|
|1582-10-10 10:24:00 Europe/Moscow|
|2019-02-28 09:29:00 Europe/Moscow|
+---------------------------------+

As the example demonstrates, Spark takes into account the specified time zones but adjusts all local timestamps to the session time zone. The original time zones passed to the MAKE_TIMESTAMP function will be lost because the TIMESTAMP WITH SESSION TIME ZONE type assumes that all values belong to one time zone, and it doesn’t even store a time zone per every value. According to the definition of the TIMESTAMP WITH SESSION TIME ZONE, Spark stores local timestamps in the UTC time zone, and uses the session time zone while extracting date-time fields or converting the timestamps to strings.

Also, timestamps can be constructed from the LONG type via casting. If a LONG column contains the number of seconds since the epoch 1970-01-01 00:00:00Z, it can be cast to Spark SQL’s TIMESTAMP:

spark-sql> select CAST(-123456789 AS TIMESTAMP);
1966-02-02 05:26:51

Unfortunately, this approach doesn’t allow us to specify the fractional part of seconds. In the future, Spark SQL will provide special functions to make timestamps from seconds, milliseconds and microseconds since the epoch: timestamp_seconds(), timestamp_millis() and timestamp_micros().

Another way is to construct dates and timestamps from values of the STRING type. We can make literals using special keywords:

spark-sql> select timestamp '2020-06-28 22:17:33.123456 Europe/Amsterdam', date '2020-07-01';
2020-06-28 23:17:33.123456	2020-07-01

or via casting that we can apply for all values in a column:

spark-sql> select cast('2020-06-28 22:17:33.123456 Europe/Amsterdam' as timestamp), cast('2020-07-01' as date);
2020-06-28 23:17:33.123456	2020-07-01

The input timestamp strings are interpreted as local timestamps in the specified time zone or in the session time zone if a time zone is omitted in the input string. Strings with unusual patterns can be converted to timestamp using the to_timestamp() function. The supported patterns are described in Datetime Patterns for Formatting and Parsing:

spark-sql> select to_timestamp('28/6/2020 22.17.33', 'dd/M/yyyy HH.mm.ss');
2020-06-28 22:17:33

The function behaves similarly to CAST if you don’t specify any pattern.
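
The same conversion can also be expressed through the DataFrame API in PySpark (a minimal sketch, assuming an existing SparkSession named spark):

from pyspark.sql import functions as F

df = spark.createDataFrame([('28/6/2020 22.17.33',)], ['ts_string'])
df.select(F.to_timestamp('ts_string', 'dd/M/yyyy HH.mm.ss').alias('ts')).show()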

For usability, Spark SQL recognizes special string values in all methods above that accept a string and return a timestamp and date:

  • epoch is an alias for date ‘1970-01-01’ or timestamp ‘1970-01-01 00:00:00Z’
  • now is the current timestamp or date at the session time zone. Within a single query it always produces the same result.
  • today is the beginning of the current date for the TIMESTAMP type or just current date for the DATE type.
  • tomorrow is the beginning of the next day for timestamps or just the next day for the DATE type.
  • yesterday is the day before current one or its beginning for the TIMESTAMP type.

For example:

spark-sql> select timestamp 'yesterday', timestamp 'today', timestamp 'now', timestamp 'tomorrow';
2020-06-27 00:00:00	2020-06-28 00:00:00	2020-06-28 23:07:07.18	2020-06-29 00:00:00
spark-sql> select date 'yesterday', date 'today', date 'now', date 'tomorrow';
2020-06-27	2020-06-28	2020-06-28	2020-06-29

One of Spark’s great features is creating Datasets from existing collections of external objects on the driver side, and creating columns of corresponding types. Spark converts instances of external types to semantically equivalent internal representations. PySpark allows you to create a Dataset with DATE and TIMESTAMP columns from Python collections, for instance:

>>> import datetime
>>> df = spark.createDataFrame([(datetime.datetime(2020, 7, 1, 0, 0, 0),
...     datetime.date(2020, 7, 1))], ['timestamp', 'date'])
>>> df.show()
+-------------------+----------+
|          timestamp|      date|
+-------------------+----------+
|2020-07-01 00:00:00|2020-07-01|
+-------------------+----------+

PySpark converts Python’s datetime objects to internal Spark SQL representations at the driver side using the system time zone, which can be different from Spark’s session time zone setting spark.sql.session.timeZone. The internal values don’t contain information about the original time zone. Future operations over the parallelized date and timestamp values will take into account only the Spark SQL session time zone, according to the TIMESTAMP WITH SESSION TIME ZONE type definition.

In a similar way as we demonstrated above for Python collections, Spark recognizes the following types as external date-time types in Java/Scala APIs:

  • java.sql.Date and java.time.LocalDate as external types for Spark SQL’s DATE type
  • java.sql.Timestamp and java.time.Instant for the TIMESTAMP type.

There is a difference between java.sql.* and java.time.* types. The java.time.LocalDate and java.time.Instant were added in Java 8, and the types are based on the Proleptic Gregorian calendar — the same calendar that is used by Spark from version 3.0. The java.sql.Date and java.sql.Timestamp have another calendar underneath — the hybrid calendar (Julian + Gregorian since 1582-10-15), which is the same as the legacy calendar used by Spark versions before 3.0. Due to different calendar systems, Spark has to perform additional operations during conversions to internal Spark SQL representations, and rebase input dates/timestamp from one calendar to another. The rebase operation has a little overhead for modern timestamps after the year 1900, and it can be more significant for old timestamps.

The example below shows making timestamps from Scala collections. In the first example, we construct a java.sql.Timestamp object from a string. The valueOf method interprets the input string as a local timestamp in the default JVM time zone, which can be different from Spark’s session time zone. If you need to construct instances of java.sql.Timestamp or java.sql.Date in a specific time zone, we recommend having a look at java.text.SimpleDateFormat (and its method setTimeZone) or java.util.Calendar.

scala> Seq(java.sql.Timestamp.valueOf("2020-06-29 22:41:30"), new java.sql.Timestamp(0)).toDF("ts").show(false)
+-------------------+
|ts                 |
+-------------------+
|2020-06-29 22:41:30|
|1970-01-01 03:00:00|
+-------------------+
scala> Seq(java.time.Instant.ofEpochSecond(-12219261484L), java.time.Instant.EPOCH).toDF("ts").show
+-------------------+
|                 ts|
+-------------------+
|1582-10-15 11:12:13|
|1970-01-01 03:00:00|
+-------------------+

Similarly, we can make a DATE column from collections of java.sql.Date or java.time.LocalDate. Parallelization of java.time.LocalDate instances is fully independent of either Spark’s session time zone or the JVM default time zone, but we cannot say the same about parallelization of java.sql.Date instances. There are nuances:

  1. java.sql.Date instances represent local dates at the default JVM time zone on the driver
  2. For correct conversions to Spark SQL values, the default JVM time zone on the driver and executors must be the same.

scala> Seq(java.time.LocalDate.of(2020, 2, 29), java.time.LocalDate.now).toDF("date").show
+----------+
|      date|
+----------+
|2020-02-29|
|2020-06-29|
+----------+

To avoid any calendar and time zone related issues, we recommend using the Java 8 types java.time.LocalDate/Instant as external types when parallelizing Java/Scala collections of timestamps or dates.

Collecting dates and timestamps

The reverse operation of parallelization is collecting dates and timestamps from executors back to the driver and returning a collection of external types. For the example above, we can pull the DataFrame back to the driver via the collect() action:

>>> df.collect()
[Row(timestamp=datetime.datetime(2020, 7, 1, 0, 0), date=datetime.date(2020, 7, 1))]

Spark transfers internal values of date and timestamp columns as time instants in the UTC time zone from executors to the driver, and performs conversions to Python datetime objects in the system time zone at the driver, not using the Spark SQL session time zone. collect() is different from the show() action described in the previous section. show() uses the session time zone while converting timestamps to strings, and collects the resulting strings on the driver.

In Java and Scala APIs, Spark performs the following conversions by default:

  • Spark SQL’s DATE values are converted to instances of java.sql.Date.
  • Timestamps are converted to instances of java.sql.Timestamp.

Both conversions are performed in the default JVM time zone on the driver. In this way, to get the same date-time fields that we can obtain via Date.getDay(), getHour(), etc. and via the Spark SQL functions DAY, HOUR, the default JVM time zone on the driver and the session time zone on the executors should be the same.

Similarly to making dates/timestamps from java.sql.Date/Timestamp, Spark 3.0 performs rebasing from the Proleptic Gregorian calendar to the hybrid calendar (Julian + Gregorian). This operation is almost free for modern dates (after the year 1582) and timestamps (after the year 1900), but it could bring some overhead for ancient dates and timestamps.

We can avoid such calendar-related issues, and ask Spark to return java.time types, which were added since Java 8. If we set the SQL config spark.sql.datetime.java8API.enabled to true, the Dataset.collect() action will return:

  • java.time.LocalDate for Spark SQL’s DATE type
  • java.time.Instant for Spark SQL’s TIMESTAMP type

Now the conversions don’t suffer from the calendar-related issues because Java 8 types and Spark SQL 3.0 are both based on the Proleptic Gregorian calendar. The collect() action doesn’t depend on the default JVM time zone any more. The timestamp conversions don’t depend on time zone at all. As for date conversion, it uses the session time zone from the SQL config spark.sql.session.timeZone. For example, let’s look at a Dataset with DATE and TIMESTAMP columns, with the default JVM time zone set to Europe/Moscow but the session time zone set to America/Los_Angeles.

scala> java.util.TimeZone.getDefault
res1: java.util.TimeZone = sun.util.calendar.ZoneInfo[id="Europe/Moscow",...]

scala> spark.conf.get("spark.sql.session.timeZone")
res2: String = America/Los_Angeles

scala> df.show
+-------------------+----------+
|          timestamp|      date|
+-------------------+----------+
|2020-07-01 00:00:00|2020-07-01|
+-------------------+----------+

The show() action prints the timestamp at the session time America/Los_Angeles, but if we collect the Dataset, it will be converted to java.sql.Timestamp and printed at Europe/Moscow by the toString method:

scala> df.collect()
res16: Array[org.apache.spark.sql.Row] = Array([2020-07-01 10:00:00.0,2020-07-01])

scala> df.collect()(0).getAs[java.sql.Timestamp](0).toString
res18: java.sql.Timestamp = 2020-07-01 10:00:00.0

Actually, the local timestamp 2020-07-01 00:00:00 is 2020-07-01T07:00:00Z at UTC. We can observe that if we enable Java 8 API and collect the Dataset:

scala> df.collect()
res27: Array[org.apache.spark.sql.Row] = Array([2020-07-01T07:00:00Z,2020-07-01])

The java.time.Instant object can be converted to any local timestamp later, independently from the global JVM time zone. This is one of the advantages of java.time.Instant over java.sql.Timestamp: the latter requires changing the global JVM setting, which influences other timestamps on the same JVM. Therefore, if your applications process dates or timestamps in different time zones, and the applications should not clash with each other while collecting data to the driver via the Java/Scala Dataset.collect() API, we recommend switching to the Java 8 API using the SQL config spark.sql.datetime.java8API.enabled.

Conclusion

In this blog post, we described the Spark SQL DATE and TIMESTAMP types. We showed how to construct date and timestamp columns from other primitive Spark SQL types and external Java types, and how to collect date and timestamp columns back to the driver as external Java types. Since version 3.0, Spark switched from the hybrid calendar, which combines the Julian and Gregorian calendars, to the Proleptic Gregorian calendar (see SPARK-26651 for more details). This allowed Spark to eliminate many issues such as those we demonstrated earlier. For backward compatibility with previous versions, Spark still returns timestamps and dates in the hybrid calendar (java.sql.Date and java.sql.Timestamp) from collect-like actions. To avoid calendar and time zone resolution issues when using the Java/Scala collect actions, the Java 8 API can be enabled via the SQL config spark.sql.datetime.java8API.enabled. Try it out today free on Databricks as part of our Databricks Runtime 7.0.

--

Try Databricks for free. Get started today.

The post A Comprehensive Look at Dates and Timestamps in Apache Spark™ 3.0 appeared first on Databricks.

Modern Industrial IoT Analytics on Azure – Part 1

This post and the three-part series about Industrial IoT analytics were jointly authored by Databricks and members of the Microsoft Cloud Solution Architecture team. We would like to thank Databricks Solutions Architect Samir Gupta and Microsoft Cloud Solution Architects Lana Koprivica and Hubert Dua for their contributions to this and the two forthcoming posts.

The Industrial Internet of Things (IIoT) has grown over the last few years from a grassroots technology stack, piloted predominantly in the oil & gas industry, to wide-scale adoption and production use across the manufacturing, chemical, utilities, transportation and energy sectors. Traditional IoT systems like SCADA, historians and even Hadoop do not provide the big data analytics capabilities needed by most organizations to predictively optimize their industrial assets, due to the following factors.

Challenge: Data volumes are significantly larger & more frequent
Required capability: The ability to capture and store sub-second granular readings reliably and cost effectively from IoT devices streaming terabytes of data per day

Challenge: Data processing needs are more complex
Required capability: ACID-compliant data processing – time-based windows, aggregations, pivots, backfilling, shifting with the ability to easily reprocess old data

Challenge: More user personas want access to the data
Required capability: Data is in an open format and easily shareable with operational engineers, data analysts, data engineers, and data scientists without creating silos

Challenge: Scalable ML is needed for decision making
Required capability: The ability to quickly and collaboratively train predictive models on granular, historic data to make intelligent asset optimization decisions

Challenge: Cost reduction demands are higher than ever
Required capability: A low-cost, on-demand managed platform that scales with the data and workloads independently, without requiring significant upfront capital

Organizations are turning to cloud computing platforms like Microsoft Azure to take advantage of scalable, IIoT-enabling technologies that make it easy to ingest, process, analyze and serve time-series data from sources like historians and SCADA systems.

In part 1, we discuss the end-to-end technology stack and the role Azure Databricks plays in the architecture and design for the industrial application of modern IoT analytics.

In part 2, we will take a deeper dive into deploying modern IIoT analytics, ingest real-time IIoT machine-to-machine data from field devices into Azure Data Lake Storage and perform complex time-series processing on Data Lake directly.

In part 3, we will look at machine learning and analytics with industrial IoT data.

The Use Case – Wind Turbine Optimization

Most IIoT Analytics projects are designed to maximize the short-term utilization of an industrial asset while minimizing its long-term maintenance costs. In this article, we focus on a hypothetical energy provider trying to optimize its wind turbines. The ultimate goal is to identify the set of optimal turbine operating parameters that maximizes each turbine’s power output while minimizing its time to failure.

The goal of IIoT is to maximize utility in the short term while minimizing downtime over the long term.

The final artifacts of this project are:

  1. An automated data ingestion and processing pipeline that streams data to all end users
  2. A predictive model that estimates the power output of each turbine given current weather and operating conditions
  3. A predictive model that estimates the remaining life of each turbine given current weather and operating conditions
  4. An optimization model that determines the optimal operating conditions to maximize power output and minimize maintenance costs thereby maximizing total profit
  5. A real-time analytics dashboard for executives to visualize the current and future state of their wind farms, as shown below:

An IIoT analytics dashboard can help business executives visualize, for example, the current and future state of an industrial asset, such as a wind farm.

The Architecture – Ingest, Store, Prep, Train, Serve, Visualize

The architecture below illustrates a modern, best-of-breed platform used by many organizations that leverages all that Azure has to offer for IIoT analytics.

The IIoT data analytic architecture featuring the Azure Data Lake Store and Delta storage format offers data teams the optimal platform for handling time-series streaming data.

A key component of this architecture is the Azure Data Lake Store (ADLS), which enables the write-once, access-often analytics pattern in Azure. However, Data Lakes alone do not solve the real-world challenges that come with time-series streaming data. The Delta storage format provides a layer of resiliency and performance on all data sources stored in ADLS. Specifically for time-series data, Delta provides the following advantages over other storage formats on ADLS:

Unified batch & streaming
Other formats on ADLS Gen 2: Data Lakes are often used in conjunction with a streaming store like CosmosDB, resulting in a complex architecture.
Delta format on ADLS Gen 2: ACID-compliant transactions enable data engineers to perform streaming ingest and historical batch loads into the same locations on ADLS.

Schema enforcement and evolution
Other formats on ADLS Gen 2: Data Lakes do not enforce schema, requiring all data to be pushed into a relational database for reliability.
Delta format on ADLS Gen 2: Schema is enforced by default. As new IoT devices are added to the data stream, schemas can be evolved safely so downstream applications don’t fail.

Efficient upserts
Other formats on ADLS Gen 2: Data Lakes do not support in-line updates and merges, requiring deletion and insertion of entire partitions to perform updates.
Delta format on ADLS Gen 2: MERGE commands are effective for handling delayed IoT readings, modified dimension tables used for real-time enrichment, or data that needs to be reprocessed.

File compaction
Other formats on ADLS Gen 2: Streaming time-series data into Data Lakes generates hundreds or even thousands of tiny files.
Delta format on ADLS Gen 2: Auto-compaction in Delta optimizes the file sizes to increase throughput and parallelism.

Multi-dimensional clustering
Other formats on ADLS Gen 2: Data Lakes provide push-down filtering on partitions only.
Delta format on ADLS Gen 2: ZORDERing time-series on fields like timestamp or sensor ID allows Databricks to filter and join on those columns up to 100x faster than simple partitioning techniques.
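
As an illustration of that last capability, the sketch below shows what multi-dimensional clustering might look like on a Delta table of turbine readings; the table name is illustrative, and OPTIMIZE/ZORDER BY are Delta Lake commands available on Databricks.

# Cluster the time-series Delta table on the columns most often filtered and joined on (illustrative table name)
spark.sql("OPTIMIZE turbine_raw ZORDER BY (timestamp, deviceId)")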

Summary

In this post we reviewed a number of different challenges facing traditional IIoT systems.  We walked through the use case and the goals for modern IIoT analytics, shared a repeatable architecture that organizations are already deploying at scale and explored the benefits of Delta format for each of the required capabilities.

In the next post we will ingest real-time IIoT data from field devices into Azure and perform complex time-series processing on Data Lake directly.

The key technology that ties everything together is Delta Lake. Delta on ADLS provides reliable streaming data pipelines and highly performant data science and analytics queries on massive volumes of time-series data. Lastly, it enables organizations to truly adopt a Lakehouse pattern by bringing best-of-breed Azure tools to a write-once, access-often data store.

What’s Next?

Learn more about Azure Databricks with this 3-part training series and see how to create modern data architectures by attending this webinar.

--

Try Databricks for free. Get started today.

The post Modern Industrial IoT Analytics on Azure – Part 1 appeared first on Databricks.

Data with impact: A look back at the first Hackathon for Social Good

As global citizens, more and more businesses are investing in corporate social responsibility (CSR) programs to help solve the issues of systemic social injustice and economic inequity highlighted by COVID-19 and the Black Lives Matter movement. We’ve seen first-hand the incredible power of data to solve seemingly intractable problems and decided to launch our own scalable initiative: to see what the world’s data teams could do when unleashed on both global and local community challenges.
 
Spark + AI Summit 2020 Hackathon for Social Good
The result was the first large-scale, month-long virtual hackathon on social good projects as part of the lead up to this year’s Spark + AI Summit, our annual gathering of data scientists, data engineers and analysts in San Francisco. The data teams who won the hackathon would be invited to fly out to SF to present their projects. Well, as y’all know, we turned the event into a global virtual conference due to the COVID-19 pandemic. But that didn’t keep many talented community members from making a positive impact by submitting a great project.
 
In fact, I believe that the pandemic and the Black Lives Matter movement only increased the importance of corporate citizenship and our motivation to ask “how can we help” our communities and our world promote human rights and environmental sustainability and bring about social change.  This is in addition to other efforts at the Spark + AI Summit – including leading the community to raise $101,626 towards the NAACP LDF and Center for Policing Equity (CPE).  We even had an amazing keynote from Dr. Phillip Atiba Goff of CPE where he talked about how data nerds can become justice nerds, and several great sessions on COVID-19.

For the hackathon, 44 teams made up of hundreds of engineers, data scientists, doctors, climate scientists, designers and concerned citizens submitted projects, competing for $35k in donations to charities of the winners’ choice. We announced the winners during the Summit keynote, but wanted to share them here in case you missed it.

First Place: 
Taking it to the Streets – used data science to determine the economic effects of street closures during the COVID-19 crisis, using Python, SQL, R and Delta Lake.

By the data team from Revgen in Denver, Colorado
Yulia Quintela, Brian Liberatore, Steve Idowu and Meghan Villard

$20,000 donation to The Gathering Place was made by Databricks on their behalf.

Learn more about their project from their Summit presentation:

Second Place: 
Wildfire Real-Time Detection System using Satellite Imagery – trained a TensorFlow U-Net model using Google satellite imagery and deployed the model for use within a web application.

By the data teams from Shell and Enbridge in Houston, Texas
Disha An, Boran Han, Yanxiang Yu, Zhijuan Zhang

$10,000 donation to the Amazon Conservation Association was made by Databricks on their behalf.

Learn more about their project from their Summit presentation:

Third Place:
Climate Resiliency for the Chesapeake Bay – looked at nitrogen in the water using data from NOAA, USGS and the Chesapeake Monitoring Cooperative.  They first did a variety of exploratory data analysis (EDA) and then used AutoML to train a machine learning model which predicts nitrogen levels based on environmental activities.

By the data team from Booz Allen Hamilton
Moe Steller, Grace Kim, Sarah Olson, Lucy Han

$5,000 donation to the Alliance for the Chesapeake Bay was made by Databricks on their behalf.

Learn more about their project from their Summit presentation:

Congratulations to the winners and thanks so much to all the data teams who participated in the hackathon. Without each of you, we wouldn’t have been able to gain these insights into problems facing our society today and wouldn’t be making these donations. You can find the complete gallery of submissions on the hackathon site. And, to learn more about the technologies used by the winning teams, all the sessions from Spark + AI Summit 2020 are available online and free for your binge watching pleasure.

--

Try Databricks for free. Get started today.

The post Data with impact: A look back at the first Hackathon for Social Good appeared first on Databricks.

On Demand Virtual Workshop: Predicting Churn to Improve Customer Retention

The proliferation of subscription models has increased across industries: from direct-to-consumer brands for shaving supplies and prepared meals to streaming media services, at-home fitness, auto insurance and even automobiles themselves. Consumers are flocking to these new offerings while moving away from long-term contracts, which for subscription-based businesses means they have to prove their value to their customers every month. From source of acquisition to devices used to detect frequency and type of interaction, which customer events signal an increased likelihood to churn versus renew in the near future? How do you decide the proper investment to save an at-risk subscriber from the risk of churning?

In short, how do you use customer behavior and interactions to predict churn and save your most valuable customers?

In this on-demand virtual workshop, learn how unified data analytics can bring data science, business analytics and engineering together to increase the precision in customer lifetime value and churn prediction models across industries like retail, media, telco, insurance, retail financial services, and others. Hear from innovative meat delivery subscription box ButcherBox about how this rapidly growing, digital-native brand is using customer data, such as user interactions and other data points, to better predict existing customer lifetime value and feed downstream supply chain analyses.

This virtual workshop will give you the opportunity to learn about:

  • Using Survival Analysis to understand when and possibly why customers abandon subscription services
  • Predicting customer churn at key stages in the subscription lifecycle

Relevant blog posts

Analyzing Customer Attrition in Subscription Models

Q&A 

Q: How frequently do you update the models to evaluate or improve the models?
A. There is no predefined standard period for updating or rerunning the churn prediction models. The best advice we can offer is to run it as frequently as your business makes decisions about churn. Are you dealing with churn each week because you have weekly changes in payment or offer options? In that scenario, you would run this weekly.

Q: How many of these survival models would you include multiple predictors?
A. All of these models include multiple predictors. Consider all the potential predictors that might offer value but be sure that predictors adhere to the Cox PH assumption of not being time-variant.  Also, carefully consider removing variables with strong collinearity as these will likely interfere with model calculations. Finally, use the statistical tests included in the notebooks to identify and remove any predictors that contribute no statistical value above the baseline.

Q: Could you elaborate on the data transformation challenges for survival models? For example, with large datasets, creating the Nevada-chart style of data can be a challenge, though it depends on what model you build.
A. The survival analysis routines expect the source data to be in very specific formats.  In the case of the Cox PH model, it is expected that each subscription has one record which includes the subscription duration (in days in our scenario) along with the status of the subscription at the end of that duration. Predictive features then follow with categorical features one-hot transformed.
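
A minimal sketch of that expected layout, with purely illustrative column names and values:

import pandas as pd

# One record per subscription: duration in days, a status/event flag at the end of that duration,
# and one-hot encoded categorical predictors (all names and values are illustrative)
survival_input = pd.DataFrame({
    "duration_days":   [120, 365, 48],
    "churned":         [1,   0,   1],
    "channel_web":     [1,   0,   1],
    "channel_partner": [0,   1,   0],
})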

Q: What are some difficulties you might expect to encounter when doing survival analysis on high frequency consumers (grocery, frequency is usually a few days or a week)?
A. Data engineering is absolutely the biggest challenge. Remember that you want to summarize all of those interactions down to a single record for analysis. That is a ton of data crunching and most systems can’t handle that. Adding to that complexity is that you’ll want to iterate on which features to extract for that single record.

Q: What is the difference between survival rate and retention rate? or are we just using it interchangeably?
A. For the sake of this webinar and accompanying blog posts we use customer retention and customer survival rate interchangeably 🙂

Q: How do you handle the number of observations? For example, day-30 signups might be 10x bigger than day-10 signups.
A. By stratifying these, we can calculate statistics for each of those strata.  In the case of the Kaplan-Meier curves, we actually have 95% confidence intervals for each, which will depend on the number of observations.  You can see this very clearly in the prior subscriptions K-M curve.
Q: How can we use these survival model outputs to calculate customer lifetime value (CLV)?
A. Once we have a predictive model, we can then identify the end dates of the periods for which we are calculating CLV and retrieve a retention ratio/survival probability. For example, if I were to calculate a three-year CLV on an annual basis, I would grab the retention rate at the 365, 730 and 1095 day points.
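
A hedged sketch of that calculation with purely illustrative numbers; in practice the survival probabilities would come from the fitted model at the 365, 730 and 1095 day points:

# Simple three-year CLV from annual survival probabilities (all values are illustrative)
annual_value   = 120.0                # expected net revenue per subscriber per year (assumption)
survival_probs = [0.80, 0.65, 0.55]   # survival probability at the 365, 730 and 1095 day points (assumption)

clv = sum(p * annual_value for p in survival_probs)
print(clv)   # 240.0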

Q: How long did it take for implementation of this approach (the whole architecture)? 
A. This really depends on your organization. If you have the data available you could deploy our notebook and connect your data in a couple of days. We commonly do POCs with customers on this code and it never takes longer than 2 weeks.

Q: Has your model considered the seasonality factor?
A.   There are machine learning models for CLV which consider seasonality but in general I’d very carefully examine what I’m trying to predict when seasonality becomes a consideration.  Often, when we start looking at seasonality we’re attempting to make a more precise revenue projection than what CLV is oriented around.

Notebooks from the webinar

Watch Webinar

--

Try Databricks for free. Get started today.

The post On Demand Virtual Workshop: Predicting Churn to Improve Customer Retention appeared first on Databricks.

Modern Industrial IoT Analytics on Azure – Part 2

Introduction

In part 1  of the series on Modern Industrial Internet of Things (IoT) Analytics on Azure,  we walked through the big data use case and the goals for modern IIoT analytics, shared a real-world repeatable architecture in use by organizations to deploy IIoT at scale and explored the benefits of Delta format for each of the data lake capabilities required for modern IIoT analytics.

The IIoT data analytic architecture featuring the Azure Data Lake Store and Delta storage format offers data teams the optimal platform for handling time-series streaming data.

The Deployment

We use Azure’s Raspberry PI IoT Simulator to simulate real-time machine-to-machine sensor readings and send them to Azure IoT Hub.

Data Ingest: Azure IoT Hub to Data Lake

Our deployment has sensor readings for weather (wind speed & direction, temperature, humidity) and wind turbine telematics (angle and RPM) sent to an IoT cloud computing hub. Azure Databricks can natively stream data from IoT Hubs directly into a Delta table on ADLS and display the input vs. processing rates of the data.

# Read directly from IoT Hubs using the EventHubs library for Azure Databricks
iot_stream = (
    spark.readStream.format("eventhubs")                                        # Read from IoT Hubs directly
    .options(**ehConf)                                                        # Use the Event-Hub-enabled connect string
    .load()                                                                   # Load the data
    .withColumn('reading', F.from_json(F.col('body').cast('string'), schema)) # Extract the payload from the messages
    .select('reading.*', F.to_date('reading.timestamp').alias('date'))        # Create a "date" field for partitioning
)

# Split our IoT Hubs stream into separate streams and write them both into their own Delta locations
write_turbine_to_delta = (
    iot_stream.filter('temperature is null')                          # Filter out turbine telemetry from other streams
    .select('date','timestamp','deviceId','rpm','angle')            # Extract the fields of interest
    .writeStream.format('delta')                                    # Write our stream to the Delta format
    .partitionBy('date')                                            # Partition our data by Date for performance
    .option("checkpointLocation", ROOT_PATH + "/bronze/cp/turbine") # Checkpoint so we can restart streams gracefully
    .start(ROOT_PATH + "/bronze/data/turbine_raw")                  # Stream the data into an ADLS Path
)
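
The weather readings can be written out the same way; here is a sketch of the analogous weather stream (the selected fields and paths mirror the turbine stream above and are assumptions, not the exact notebook code):

# Sketch of the analogous weather stream (field names and paths are assumptions)
write_weather_to_delta = (
    iot_stream.filter('temperature is not null')                     # Weather payloads carry the temperature readings
    .select('date','deviceid','timestamp','temperature','humidity','windspeed','winddirection')
    .writeStream.format('delta')                                     # Write our stream to the Delta format
    .partitionBy('date')                                             # Partition our data by Date for performance
    .option("checkpointLocation", ROOT_PATH + "/bronze/cp/weather")  # Separate checkpoint per stream
    .start(ROOT_PATH + "/bronze/data/weather_raw")                   # Stream the data into an ADLS path
)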

Delta allows our IoT data to be queried within seconds of it being captured in IoT Hub.

%sql 
-- We can query the data directly from storage immediately as it streams into Delta 
SELECT * FROM delta.`/tmp/iiot/bronze/data/turbine_raw` WHERE deviceid = 'WindTurbine-1'

With the Delta storage format,  IIoT data can be queried within seconds of capture, for use with downstream analytics, such as a time-series data visualization.

We can now build a downstream pipeline that enriches and aggregates our IIoT applications data for data analytics.

Data Storage and Processing: Azure Databricks and Delta Lake

Delta supports a multi-hop pipeline approach to data engineering, where data quality and aggregation increases as it streams through the pipeline. Our time-series data will flow through the following Bronze, Silver and Gold data levels.

Delta Lake  supports a multi-hop pipeline approach to data engineering, where data quality and aggregation increases as it streams through the pipeline.

Our pipeline from Bronze to Silver will simply aggregate our turbine sensor data to 1 hour intervals. We will perform a streaming MERGE command to upsert the aggregated records into our Silver Delta tables.

# Create functions to merge turbine and weather data into their target Delta tables
def merge_records(incremental, batch_id):
    incremental.createOrReplaceTempView("incremental")

    # MERGE consists of a target table, a source table (incremental),
    # a join key to identify matches (deviceid, date, time_interval), and operations to perform
    # (UPDATE, INSERT, DELETE) when a match occurs or not
    incremental._jdf.sparkSession().sql("""
        MERGE INTO turbine_hourly t
        USING incremental i
        ON i.date=t.date AND i.deviceId = t.deviceid AND i.time_interval = t.time_interval
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)


# Perform the streaming merge into our Silver Delta table
turbine_stream = (
    spark.readStream.format('delta').table('turbine_raw')        # Read data as a stream from our source Delta table
    .groupBy('deviceId','date',F.window('timestamp','1 hour')) # Aggregate readings to hourly intervals
    .agg({"rpm":"avg","angle":"avg"})
    .writeStream                                                                                         
    .foreachBatch(merge_records)                              # Pass each micro-batch to a function
    .outputMode("update")                                     # Merge works with update mode
    .start()
)

Our pipeline from Silver to Gold will join the two streams together into a single table for hourly weather and turbine measurements.

# Read streams from Delta Silver tables
turbine_hourly = spark.readStream.format('delta').option("ignoreChanges", True).table("turbine_hourly")
weather_hourly = spark.readStream.format('delta').option("ignoreChanges", True).table("weather_hourly")

# Perform a streaming join to enrich the data
turbine_enriched = turbine_hourly.join(weather_hourly, ['date','time_interval'])

# Perform a streaming merge into our Gold data stream
merge_gold_stream = (
    turbine_enriched.writeStream 
    .foreachBatch(merge_records)
    .start()
)

We can query our Gold Delta table immediately.

With Delta Lake, you can query your enriched, AI-ready data immediately, for use with IIoT data science predictive maintenance models to optimize asset utilization.

The notebook also contains a cell that will generate historical hourly power readings and daily maintenance logs that will be used for model training. Running that cell will:

  1. Backfill historical readings for 1 year in the turbine_enriched table
  2. Generate historical power readings for each turbine in the power_output table
  3. Generate historical maintenance logs for each turbine in the turbine_maintenance table

We now have enriched, artificial intelligence (AI)-ready data in a performant, reliable format on Azure Data Lake that can be fed into our data science modeling to optimize asset utilization.

%sql
-- Query all 3 tables together
CREATE OR REPLACE VIEW gold_readings AS
SELECT r.*, 
    p.power, 
    m.maintenance as maintenance
FROM turbine_enriched r 
    JOIN turbine_power p ON (r.date=p.date AND r.time_interval=p.time_interval AND r.deviceid=p.deviceid)
    LEFT JOIN turbine_maintenance m ON (r.date=m.date AND r.deviceid=m.deviceid);
    
SELECT * FROM gold_readings

Our data engineering pipeline is complete! Data is now flowing from IoT Hubs to Bronze (raw) to Silver (aggregated) to Gold (enriched). It is time to perform some analytics on our data.

Summary

To summarize, we have successfully:

  • Ingested real-time IIoT data from field devices into Azure
  • Performed complex time-series processing on Data Lake directly

The key technology that ties everything together is Delta Lake. Delta on ADLS provides reliable streaming data pipelines and highly performant data science and analytics queries on massive volumes of time-series data. Lastly, it enables organizations to truly adopt a Lakehouse pattern by bringing best-of-breed Azure tools to a write-once, access-often data store.

In the next post we will explore the use of machine learning to maximize the revenue of a wind turbine while minimizing the opportunity cost of downtime.

What’s Next?

Try out the notebook hosted here, learn more about Azure Databricks with this 3-part training series and see how to create modern data architectures on Azure by attending this webinar.

--

Try Databricks for free. Get started today.

The post Modern Industrial IoT Analytics on Azure – Part 2 appeared first on Databricks.

Data Teams Unite! Spark + AI Summit Recap


It’s been a few weeks since Spark + AI Summit 2020 and we can still feel the amazing energy from this global virtual event. Judging from the positive feedback across social media posts, press coverage and directly from attendees, it’s clear we created an exceptional experience.

Putting together a virtual summit of this scale was uncharted territory, and our initial goals kept expanding as we blew past every target. We launched the online event with a registration goal of 35,000 and an attendance goal of 50%. We not only met these goals but exceeded them, ending the conference with nearly 70,000 registrants. Over 52% of those registrants attended the event, a 10-20% higher attendance rate than comparable virtual conferences.

A key indicator of a successful program is the net promoter score (NPS), the metric used to measure an attendee’s willingness to recommend the event to their peers. The NPS for our virtual Spark + AI Summit was 66, nearly double the industry benchmark for live-streamed events.

There is no clear-cut formula on how to make a virtual conference work; it will be different for every organization. Rather than shrink the conference to fit the virtual format, we went big, expanding the event to five full days packed with over 200 breakout sessions, 75+ pre-conference training courses, 10+ keynotes, a full  exhibitor hall, a hackathon for social good, 10+ special events and two evening events — all designed to keep people engaged during a pandemic.

So how did we do all this? I want to share the top three things that really set us up for success, especially with such a short runway to make Spark + AI Summit happen virtually.

Teamwork makes the dream work

As cliche as it may sound, it really took a village to make this happen. Taking on a real-time virtual event demands that you consider every nuance of the experience, because you are relying solely on technology. Engaging with our internal teams in infosec, IT and more helped support the standards and protocols needed to deliver a sound virtual environment, and also helped put in place a series of contingency plans in case anything went wrong across multiple breakage points. Our internal partnerships were foundational in making the right decisions and shedding light on anything we could have overlooked from a technology and experiential standpoint.

Partnerships with our virtual conference platform vendor, MeetingPlay, our two agencies and our AV production vendor were integral to staying organized and moving things forward. These partners were true extensions of our team, in that they all committed long hours to deliver on our vision. MeetingPlay’s customer service and ability to customize the event platform were exceptional. Most importantly, they invested the time and effort to ensure the scalability and stability of our virtual environment. Our two agencies helped divide and conquer by managing the build process with MeetingPlay and kept all of us on track for the entire build, while also managing registration/data integrations, customer service and audio visual production for keynotes and sessions, with additional support from our incredible production team. All of our internal and external partnerships created a formula for a dream team of event experts that helped us pull everything off in less than two months.

Content is king (and so is the way it is delivered)

One of the main goals we set for the virtual event was to maintain the same level of content — if not more — than we would deliver for an in-person event. This was a huge undertaking, with over 200 breakout sessions over 13 tracks, morning and afternoon keynotes for three days, two days of pre-conference training, six industry forums, special events and more — packed into five days of virtual content.

In order for us to feel confident about the delivery, we had to curate each experience and find the right balance between pre-recorded sessions, live sessions, or a mix of both. We relied heavily on our production team for the polished production of our keynotes and the recordings of all of the breakout sessions. For training, industry forums and other live streaming events, we worked closely with MeetingPlay to make sure we had the right technology in place to fit all of our needs. For instance, we integrated Zoom when necessary for large group engagement and leveraged specific functionality within MeetingPlay for chat and Q&A. In the end, we limited the technologies involved to Zoom and/or MeetingPlay so that the experience was as consistent as possible across the board.

The result was an amazing amount of content for an event of this scale. At heart, Spark + AI Summit is a community event, so we wanted to make sure we offered as much technical and thought leadership content through our sessions, demos, training and more. We also planned ahead to ensure we had all of our content immediately available on demand. Within MeetingPlay, it loaded right into the on-demand library, but we also had our agenda page in the Spark + AI Summit site ready to go with recordings. It was imperative for us to have all of our content consumable immediately and planning ahead for this made it worthwhile.

Always make room for fun

Going virtual doesn’t mean that you need to sacrifice the fun of a live conference experience. This is especially important when your event attendees are sitting in front of their laptops, at home, often alone. So it was really important to infuse some fun into all aspects of the conference.

From the well-choreographed keynote transitions to the DJ set at the end of each day, FUN is necessary to keep people engaged and offer a change of pace within a virtual setting. Integrating daily body breaks, conference gamification, a virtual swag store and a virtual photo booth were just a few things we did to weave in some light-hearted attendee engagement. Our evening events were designed with our audience and networking opportunities in mind, encouraging them to “choose their own adventure” by listening to a live DJ set, participating in a gaming tournament, attending an Ask Me Anything (AMA) session or starting up a conversation in a birds of a feather room — Reddit-style rooms for a variety of topics/personas.

These little touches went a long way to enhance the overall experience of our attendees and helped us stay true to what Spark + AI Summit is all about — bringing data teams together to learn and connect.

There are probably a million other things to cover when it comes to everything we learned in putting together a successful virtual event. In the end, we will continue to learn from each other, share best practices and adjust in this new landscape of virtual events.

--

Try Databricks for free. Get started today.

The post Data Teams Unite! Spark + AI Summit Recap appeared first on Databricks.


Interoperability between Koalas and Apache Spark


Koalas is an open source project that provides a drop-in replacement for pandas, enabling efficient scaling out to hundreds of worker nodes for everyday data science and machine learning. After more than a year of development since it was first introduced, Koalas 1.0 has been released.

pandas is a Python package commonly used by data scientists, but it does not scale out to big data. When their data becomes large, pandas users have to choose and learn another system, such as Apache Spark, from scratch and convert their existing workloads. Koalas fills this gap by providing pandas-equivalent APIs that work on Apache Spark. Many of these are introduced in the previous blog post, which also covers best practices for working with Koalas.

Koalas is useful not only for pandas users but also for PySpark users, because Koalas supports many features that are difficult to do with PySpark. For example, Spark users can plot data directly from their PySpark DataFrame via the Koalas plotting APIs, just as they would with pandas. The PySpark DataFrame is more SQL-oriented, whereas the Koalas DataFrame is closer to Python itself, which makes it more intuitive to work with in some contexts. The Koalas documentation lists the various pandas-equivalent APIs implemented.

In this blog post, we focus on how PySpark users can leverage their existing knowledge and the native interaction between PySpark and Koalas to write code faster. We include many self-contained examples, which you can run if you have Spark with Koalas installed or if you are using the Databricks Runtime. Starting with Databricks Runtime 7.1, Koalas is packaged with the runtime, so you can use it without a manual installation.

Koalas and PySpark DataFrames

Before a deep dive, let’s look at the general differences between Koalas and PySpark DataFrames first.

Externally, they are different. Koalas DataFrames seamlessly follow the structure of pandas DataFrames and implement an index/identifier under the hood. The PySpark DataFrame, on the other hand, tends to be more compliant with the relations/tables in relational databases, and does not have unique row identifiers.

Internally, Koalas DataFrames are built on PySpark DataFrames. Koalas translates pandas APIs into the logical plan of Spark SQL. The plan is optimized and executed by the sophisticated and robust Spark SQL engine, which is continually being improved by the Spark community. Koalas also follows Spark’s lazy evaluation semantics to maximize performance. To implement the pandas DataFrame structure and pandas’ rich APIs that require an implicit ordering, Koalas DataFrames have internal metadata to represent pandas-equivalent indices and column labels, mapped to the columns in the PySpark DataFrame.

Even though Koalas leverages PySpark as the execution engine, you might still face slight performance degradation when compared to PySpark. As discussed in the migration experience in Virgin Hyperloop One, the major causes are usually:

  • The default index is used. The overhead of building the default index depends on the data size, cluster composition, etc. Therefore, it is always preferred to avoid using the default index. This is discussed in more detail in the sections below.
  • Some APIs in PySpark and pandas have the same name but different semantics. For example, both Koalas DataFrame and PySpark DataFrame have the count API. The former counts the number of non-NA/null entries for each column/row and the latter counts the number of retrieved rows, including rows containing null.
>>> ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).count()
a    3
b    3

>>> spark.createDataFrame(
...     [[1, 4], [2, 5], [3, 6]], schema=["a", "b"]).count()
3

Conversion from and to PySpark DataFrames

For a PySpark user, it’s good to know how easily you can go back and forth between a Koalas DataFrame and a PySpark DataFrame, and what’s happening under the hood, so that you don’t need to be afraid of entering the Koalas world to apply the highly scalable pandas APIs on Spark.

to_koalas()

Importing the Koalas package automatically attaches the to_koalas() method to PySpark DataFrames. You can simply use this method to convert PySpark DataFrames to Koalas DataFrames.

Let’s suppose you have a PySpark DataFrame:

>>> sdf = spark.createDataFrame([(1, 10.0, 'a'), (2, 20.0, 'b'), (3, 30.0, 'c')], schema=['x', 'y', 'z'])
>>> sdf.show()
+---+----+---+
|  x|   y|  z|
+---+----+---+
|  1|10.0|  a|
|  2|20.0|  b|
|  3|30.0|  c|
+---+----+---+ 

First, import the Koalas package. You conventionally use ks as an alias for the package.

>>> import databricks.koalas as ks

Convert your Spark DataFrame to a Koalas DataFrame with the to_koalas() method as described above.

>>> kdf = sdf.to_koalas()
>>> kdf
    x     y  z
0  1  10.0  a
1  2  20.0  b
2  3  30.0  c    

kdf is a Koalas DataFrame created from the PySpark DataFrame. The computation is lazily executed when the data is actually needed, for example showing or storing the computed data, the same as PySpark.
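
For example, here is a small sketch: building an expression on kdf does not run a Spark job by itself; the computation happens only when the result is actually needed.

doubled = kdf['y'] * 2        # This only builds a Spark plan; typically no job runs here
print(doubled.to_pandas())    # Collecting the result to pandas triggers the actual computation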

to_spark()

Next, you should also know how to go back to a PySpark DataFrame from Koalas. You can use the to_spark() method on the Koalas DataFrame.

>>> sdf_from_kdf = kdf.to_spark()
>>> sdf_from_kdf.show()
+---+----+---+
|  x|   y|  z|
+---+----+---+
|  1|10.0|  a|
|  2|20.0|  b|
|  3|30.0|  c|
+---+----+---+    

Now you have a PySpark DataFrame again. Notice that the index column the Koalas DataFrame contained is no longer present. Best practices for handling the index are discussed below.

Index and index_col

As shown above, Koalas internally manages a couple of columns as “index” columns in order to represent the pandas index. The “index” columns are used to access rows with the loc/iloc indexers, by the sort_index() method when no sort key columns are specified, and to match corresponding rows in operations that combine two or more DataFrames or Series, for example df1 + df2, and so on.
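
For example, here is a minimal sketch (the output shown is indicative): rows are matched by their index values rather than by position, and combining objects that originate from different DataFrames additionally requires enabling the compute.ops_on_diff_frames option.

>>> ks.set_option('compute.ops_on_diff_frames', True)
>>> s1 = ks.Series([1, 2, 3], index=[0, 1, 2])
>>> s2 = ks.Series([10, 20, 30], index=[1, 2, 3])
>>> (s1 + s2).sort_index()
0     NaN
1    12.0
2    23.0
3     NaN
dtype: float64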

If there are already such columns in the PySpark DataFrame, you can use the index_col parameter to specify the index columns.

>>> kdf_with_index_col = sdf.to_koalas(index_col='x')  # or index_col=['x']
>>> kdf_with_index_col
        y  z
x
1  10.0  a
2  20.0  b
3  30.0  c    

This time, column x is not treated as one of the regular columns but as the index.

If you have multiple columns as the index, you can pass the list of column names.

>>> sdf.to_koalas(index_col=['x', 'y'])
    z
x y
1 10.0  a
2 20.0  b
3 30.0  c

When going back to a PySpark DataFrame, you also use the index_col parameter to preserve the index columns.

>>> kdf_with_index_col.to_spark(index_col='index').show()  # or index_col=['index']
+-----+----+---+
|index|   y|  z|
+-----+----+---+
|    1|10.0|  a|
|    2|20.0|  b|
|    3|30.0|  c|
+-----+----+---+

Otherwise, the index is lost as below.

>>> kdf_with_index_col.to_spark().show()
+----+---+
|   y|  z|
+----+---+
|10.0|  a|
|20.0|  b|
|30.0|  c|
+----+---+

The number of column names given must match the number of index columns.

>>> kdf.to_spark(index_col=['index1', 'index2']).show()
Traceback (most recent call last):
...
ValueError: length of index columns is 1; however, the length of the given 'index_col' is 2.    

Default Index

As you have seen, if you don’t specify the index_col parameter, a new column is created as an index.

>>> sdf.to_koalas()
    x     y  z
 0  1  10.0  a
 1  2  20.0  b
 2  3  30.0  c 

Where does the column come from?

The answer is the “default index”. If the index_col parameter is not specified, Koalas automatically attaches one column as an index to the DataFrame. There are three types of default index: “sequence”, “distributed-sequence”, and “distributed”. Each has its distinct characteristics and limitations, such as a performance penalty. To reduce the performance overhead, it is highly encouraged to specify index columns via index_col when converting from a PySpark DataFrame.

The default index is also used when Koalas doesn’t know which column is intended as the index. For example, reset_index() called without any parameters tries to convert all the index data to regular columns and recreates a default index:

>>> kdf_with_index_col.reset_index()
    x     y  z
 0  1  10.0  a
 1  2  20.0  b
 2  3  30.0  c
 

You can change the default index type by setting it as a Koalas option “compute.default_index_type”:

ks.set_option('compute.default_index_type', 'sequence')

or

ks.options.compute.default_index_type = 'sequence'

sequence type

The “sequence” type is currently used by default in Koalas as it guarantees the index increments continuously, like pandas. However, it uses a non-partitioned window function internally, which means all the data needs to be collected into a single node. If the node doesn’t have enough memory, the performance will be significantly degraded, or OutOfMemoryError will occur.

>>> ks.set_option('compute.default_index_type', 'sequence')
>>> spark.range(5).to_koalas()
    id
0   0
1   1
2   2
3   3
4   4

distributed-sequence type

When the “distributed-sequence” index is used, the performance penalty is not as significant as with the “sequence” type. It computes and generates the index in a distributed manner, but it needs an extra Spark job to generate the global sequence internally. It also does not guarantee that the results are returned in their natural order, although the index values themselves still form a continuously increasing sequence in most cases.

>>> ks.set_option('compute.default_index_type', 'distributed-sequence')
>>> spark.range(5).to_koalas()
    id
3   3
1   1
2   2
4   4
0   0

distributed type

The “distributed” index has almost no performance penalty and always creates monotonically increasing numbers. If the index is needed only as unique numbers for each row, or to capture the order of rows, this index type is the best choice. However, the numbers have non-deterministic gaps, which means this index type is unlikely to be usable as an index for operations that combine two or more DataFrames or Series.

>>> ks.set_option('compute.default_index_type', 'distributed')
>>> spark.range(5).to_koalas()
                id
17179869184   0
34359738368   1
60129542144   2
77309411328   3
94489280512   4

Comparison

As you have seen, each index type has its distinct characteristics as summarized in the table below. The default index type should be chosen carefully considering your workloads.

Index type            Distributed computation      Map-side operation                   Continuous increment  Performance
sequence              No, in a single worker node  No, requires a shuffle               Yes                   Bad for large datasets
distributed-sequence  Yes                          Yes, but requires another Spark job  Yes, in most cases    Good enough
distributed           Yes                          Yes                                  No                    Good

See also Default Index type in Koalas document.

Using Spark I/O

There are a lot of functions to read and write data in pandas, and in Koalas as well.

Koalas implements many of the I/O functions from pandas, such as read_csv, read_json and read_parquet (and their to_* counterparts), using Spark I/O under the hood.

The APIs and their arguments follow the corresponding pandas APIs. However, there are currently subtle differences in behavior. For example, pandas’ read_csv can read a file over the HTTP protocol, but Koalas does not yet support this, since the underlying Spark engine itself does not support it.

These Koalas functions also have the index_col argument to specify which columns should be used as an index or what the index column names should be, similarly to the to_koalas() or to_spark() function as described above. If you don’t specify it, the default index is attached or the index column is lost.

For example, if you don’t specify the index_col parameter, the default index is attached as below (the distributed default index is used here for simplicity).

>>> kdf.to_csv('/path/to/test.csv')
>>> kdf_read_csv = ks.read_csv('/path/to/test.csv')
>>> kdf_read_csv
                x     y  z
0            2  20.0  b
8589934592   3  30.0  c
17179869184  1  10.0  a

Whereas if you specify the index_col parameter, the specified column becomes an index.

>>> kdf.to_csv('/path/to/test.csv', index_col='index')
>>> kdf_read_csv_with_index_col = ks.read_csv("/path/to/test.csv", index_col='index')
>>> kdf_read_csv_with_index_col
        x     y  z
index
2      3  30.0  c
1      2  20.0  b
0      1  10.0  a

In addition, each function takes keyword arguments to set options for the DataFrameWriter and DataFrameReader in Spark. The given keys are directly passed to their options and configure the behavior. This is useful when the pandas-origin arguments are not enough to manipulate your data but PySpark supports the missing functionality.

>>> # nullValue is the option specific to Spark’s CSV I/O.
>>> ks.read_csv('/path/to/test.csv', index_col='index', nullValue='b')
        x     y     z
index
2      3  30.0     c
1      2  20.0  None
0      1  10.0     a

Koalas specific I/O functions

In addition to the above functions from pandas, Koalas has its own functions.

Firstly, DataFrame.to_table and ks.read_table are used to write and read Spark tables by just specifying the table name. They are analogous to DataFrameWriter.saveAsTable and DataFrameReader.table in Spark, respectively.
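
For example, a minimal sketch (the table name test_table is hypothetical, the Delta format assumes Delta Lake is available, and the row order shown is indicative):

>>> kdf.to_table('test_table', format='delta', mode='overwrite', index_col='index')
>>> ks.read_table('test_table', index_col='index')
        x     y  z
index
0      1  10.0  a
1      2  20.0  b
2      3  30.0  c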

Secondly, DataFrame.to_spark_io and ks.read_spark_io are for general Spark I/O. There are a few optional arguments for ease of use, and the others are keyword arguments. You can freely set the options used for DataFrameWriter.save and DataFrameReader.load in Spark.

>>> # 'compression' is a Spark specific option.
>>> kdf.to_spark_io('/path/to/test.orc', format='orc', index_col='index', compression="snappy")
>>> kdf_read_spark_io = ks.read_spark_io('/path/to/test.orc', format='orc', index_col='index')
>>> kdf_read_spark_io
        x     y  z
index
1      2  20.0  b
0      1  10.0  a
2      3  30.0  c

The ORC format in the above example is not supported in pandas, but Koalas can write and read it because the underlying Spark I/O supports it.

Last but not least, Koalas can also write and read Delta tables if you have Delta Lake installed.

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.

Unlike the other file sources, the read_delta function enables users to specify the version of the table for time travel.

>>> kdf.to_delta('/path/to/test.delta', index_col='index')
>>> kdf_read_delta = ks.read_delta('/path/to/test.delta', index_col='index')
>>> kdf_read_delta
        x     y  z
index
0      1  10.0  a
1      2  20.0  b
2      3  30.0  c

>>> # Update the data and overwrite the Delta table
>>> kdf['x'] = kdf['x'] + 10
>>> kdf['y'] = kdf['y'] * 10
>>> kdf['x'] = kdf['x'] * 2
>>> kdf.to_delta('/path/to/test.delta', index_col='index')

>>> # Read the latest data
>>> ks.read_delta('/path/to/test.delta', index_col='index')
        x      y  z
index
0      22  100.0  a
1      24  200.0  b
2      26  300.0  c

>>> # Read the data of version 0
>>> ks.read_delta('/path/to/test.delta', version=0, index_col='index')
        x     y  z
index
0      1  10.0  a
1      2  20.0  b
2      3  30.0  c

Please see Delta Lake for more details.

Spark accessor

Koalas provides the spark accessor for users to leverage the existing PySpark APIs more easily.

Series.spark.transform and Series.spark.apply

The Series.spark accessor has transform and apply functions for working with the underlying Spark Column objects.

For example, suppose you have the following Koalas DataFrame:

>>> kdf = ks.DataFrame({'a': [1, 2, 3, 4]})
>>> kdf
    a
0  1
1  2
2  3
3  4

You can cast the type with the astype function, but you can also use the cast function of the Spark column via the Series.spark.transform function instead:

>>> import numpy as np
>>> from pyspark.sql.types import DoubleType
>>> 
>>> kdf['a_astype_double'] = kdf.a.astype(np.float64)
>>> kdf['a_cast_double'] = kdf.a.spark.transform(lambda scol: scol.cast(DoubleType()))
>>> kdf[['a', 'a_astype_double', 'a_cast_double']]
    a  a_astype_double  a_cast_double
0  1              1.0            1.0
1  2              2.0            2.0
2  3              3.0            3.0
3  4              4.0            4.0

The user function passed to the Series.spark.transform function takes Spark’s Column object and can manipulate it using PySpark functions.

You can also use the functions in pyspark.sql.functions within the transform/apply functions:

>>> from pyspark.sql import functions as F
>>> 
>>> kdf['a_sqrt'] = kdf.a.spark.transform(lambda scol: F.sqrt(scol))
>>> kdf['a_log'] = kdf.a.spark.transform(lambda scol: F.log(scol))
>>> kdf[['a', 'a_sqrt', 'a_log']]
    a    a_sqrt     a_log
0  1  1.000000  0.000000
1  2  1.414214  0.693147
2  3  1.732051  1.098612
3  4  2.000000  1.386294

The user function for Series.spark.transform should return a Spark column of the same length as its input, whereas the user function for Series.spark.apply can return a Spark column of a different length, for example when calling aggregate functions.

>>> kdf.a.spark.apply(lambda scol: F.collect_list(scol))
0    [1, 2, 3, 4]
Name: a, dtype: object

DataFrame.spark.apply

Similarly, the DataFrame.spark accessor has an apply function. The user function takes and returns a Spark DataFrame and can apply any transformation. If you want to keep the index columns in the Spark DataFrame, you can set the index_col parameter. In that case, the returned Spark DataFrame must contain a column of the same name.

>>> kdf.spark.apply(lambda sdf: sdf.selectExpr("index * 10 as index", "a + 1 as a"), index_col="index")
    a
index
0      2
10     3
20     4
30     5

If you omit index_col, it will use the default index.

>>> kdf.spark.apply(lambda sdf: sdf.selectExpr("a + 1 as a"))
    a
17179869184  2
42949672960  3
68719476736  4
94489280512  5

Spark schema

You can see the current underlying Spark schema by DataFrame.spark.schema and DataFrame.spark.print_schema. They both take the index_col parameter if you want to know the schema including index columns.

>>> import numpy as np
>>> import pandas as pd
>>> 
>>> kdf = ks.DataFrame({'a': list('abc'),
...                     'b': list(range(1, 4)),
...                     'c': np.arange(3, 6).astype('i1'),
...                     'd': np.arange(4.0, 7.0, dtype='float64'),
...                     'e': [True, False, True],
...                     'f': pd.date_range('20130101', periods=3)},
...                    columns=['a', 'b', 'c', 'd', 'e', 'f'])

>>> # Print the schema out in Spark’s DDL formatted string
>>> kdf.spark.schema().simpleString()
'struct<a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
>>> kdf.spark.schema(index_col='index').simpleString()
'struct<index:bigint,a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'

>>> # Print out the schema as same as Spark’s DataFrame.printSchema()
>>> kdf.spark.print_schema()
root
    |-- a: string (nullable = false)
    |-- b: long (nullable = false)
    |-- c: byte (nullable = false)
    |-- d: double (nullable = false)
    |-- e: boolean (nullable = false)
    |-- f: timestamp (nullable = false)

>>> kdf.spark.print_schema(index_col='index')
root
    |-- index: long (nullable = false)
    |-- a: string (nullable = false)
    |-- b: long (nullable = false)
    |-- c: byte (nullable = false)
    |-- d: double (nullable = false)
    |-- e: boolean (nullable = false)
    |-- f: timestamp (nullable = false)

Explain Spark plan

If you want to know the current Spark plan, you can use DataFrame.spark.explain().

>>> # Same as Spark’s DataFrame.explain()
>>> kdf.spark.explain()
== Physical Plan ==
Scan ExistingRDD[...]

>>> kdf.spark.explain(True)
== Parsed Logical Plan ==
...

== Analyzed Logical Plan ==
...

== Optimized Logical Plan ==
...

== Physical Plan ==
Scan ExistingRDD[...]

>>> # New style of mode introduced from Spark 3.0.
>>> kdf.spark.explain(mode="extended")
== Parsed Logical Plan ==
...

== Analyzed Logical Plan ==
...

== Optimized Logical Plan ==
...

== Physical Plan ==
Scan ExistingRDD[...]

Cache

The spark accessor also provides cache-related functions: cache, persist, unpersist, and the storage_level property. You can use the cache function as a context manager so the cache is unpersisted when the context ends. Let’s see an example.

>>> from pyspark import StorageLevel
>>> 
>>> with kdf.spark.cache() as cached:
...   print(cached.spark.storage_level)
...
Disk Memory Deserialized 1x Replicated

>>> with kdf.spark.persist(StorageLevel.MEMORY_ONLY) as cached:
...   print(cached.spark.storage_level)
...
Memory Serialized 1x Replicated

When the context finishes, the cache is automatically cleared. If you want to keep it cached, you can do as below:

>>> cached = kdf.spark.cache()
>>> print(cached.spark.storage_level)
Disk Memory Deserialized 1x Replicated

When it is no longer needed, you have to call DataFrame.spark.unpersist() explicitly to remove it from cache.

>>> cached.spark.unpersist()

Hints

There are some join-like operations in Koalas, such as merge, join, and update. Although the actual join strategy is determined by the underlying Spark planner, you can still specify a hint with the ks.broadcast() function or the DataFrame.spark.hint() method.

>>> kdf1 = ks.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'],
...                      'value': [1, 2, 3, 5]},
...                     columns=['key', 'value'])
>>> kdf2 = ks.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'],
...                      'value': [5, 6, 7, 8]},
...                     columns=['key', 'value'])
>>> kdf1.merge(kdf2, on='key').explain()
== Physical Plan ==
...
... SortMergeJoin ...
...

>>> kdf1.merge(ks.broadcast(kdf2), on='key').explain()
== Physical Plan ==
...
... BroadcastHashJoin ...
...

>>> kdf1.merge(kdf2.spark.hint('broadcast'), on='key').explain()
== Physical Plan ==
...
... BroadcastHashJoin ...
...

In particular, DataFrame.spark.hint() is more useful if the underlying Spark is 3.0 or above, since more hints are available in Spark 3.0.
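
For example, Spark 3.0 adds join strategy hints beyond broadcast, such as 'merge' (sort-merge join) and 'shuffle_hash'; here is a sketch, with the plan output elided and indicative of what you would typically see, depending on your Spark version:

>>> kdf1.merge(kdf2.spark.hint('merge'), on='key').explain()
== Physical Plan ==
...
... SortMergeJoin ...
...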

Conclusion

Koalas DataFrame is similar to PySpark DataFrame because Koalas uses PySpark DataFrame internally. Externally, Koalas DataFrame works as if it is a pandas DataFrame.

To bridge the gap, Koalas has numerous features that are useful for users familiar with PySpark to work with both Koalas and PySpark DataFrames easily. Although some extra care is required to deal with the index during conversion, Koalas offers PySpark users easy conversion between the two DataFrames, input/output APIs backed by Spark I/O, and the spark accessor, which exposes PySpark-friendly features such as caching and inspecting the DataFrame internals. In addition, the spark accessor provides a natural way to work with Koalas Series and PySpark columns.

PySpark users can benefit from Koalas as shown above. Try out the examples and learn more in Databricks Runtime.

Read More

To find out more about Koalas, see the following resources:

  1. Try the accompanying notebook
  2. Read the previous blog on 10 Minutes from pandas to Koalas on Apache Spark
  3. Spark+AI Summit 2020 talk “Koalas: Pandas on Apache Spark”
  4. Spark+AI Summit 2020 talk “Koalas: Making an Easy Transition from Pandas to Apache Spark”


The post Interoperability between Koalas and Apache Spark appeared first on Databricks.

How to accelerate your ETL pipelines from 18 hours to as fast as 5 minutes with Azure Databricks


Azure Databricks enables organizations to migrate on-premises ETL pipelines to the cloud to dramatically accelerate performance and increase reliability. If you are using SQL Server Integration Services (SSIS) today, there are a number of ways to migrate and run your existing pipelines on Microsoft Azure.

The challenges with on-premises ETL pipelines

In speaking with customers, some of the most common challenges we hear about with regard to on-premises ETL pipelines are reliability, performance and flexibility. ETL pipelines can be slow due to a number of factors, such as CPU, network, and disk performance as well as the available on-premises compute cluster capacity. In addition, data formats change and new business requirements come up that break existing ETL pipelines, underscoring lack of flexibility. On-premises ETL pipelines can slow growth and efficiency for the following reasons:

  • Cost – On-premises infrastructure is riddled with tangible and intangible costs related to hardware, maintenance, and human capital.
  • Scalability – Data volume and velocity are rapidly growing and ETL pipelines need to scale upward and outward to meet the computing, processing and storage requirements.
  • Data Integrations – Data must be combined from various financial, marketing, and operational sources to help direct your business investments and activities.
  • Reliability – On-premises big data ETL pipelines can fail for many reasons. The most common issues are changes to data source connections, failure of a cluster node, loss of a disk in a storage array, power interruption, increased network latency, temporary loss of connectivity, authentication issues and changes to ETL code or logic.

How Azure Databricks helps

Migrating ETL pipelines to the cloud provides significant benefits in each of these areas. Let’s look at how Azure Databricks specifically addresses each of these issues.

  • Lower Cost – Only pay for the resources you use. No need to purchase physical hardware far in advance or pay for specialized hardware that you only rarely use.
  • Optimized Autoscaling – Cloud-based ETL pipelines that use Azure Databricks can scale automatically as data volume and velocity increase.
  • Native Integrations – Ingest all of your data into a reliable data lake using 90+ native connectors.
  • Reliability – Leverage Azure compute, storage, networking, security, authentication and logging services to minimize downtime and avoid interruptions.

How Bond Brand Loyalty accelerated time to value with Azure Databricks

Bond Brand Loyalty provides customer experience and loyalty solutions for some of the world’s most influential and valuable brands across a wide variety of industry verticals, from banking and auto manufacturing, to entertainment and retail. For Bond, protecting customer data is a top priority, so security and encryption are always top of mind. Bond is constantly innovating to accelerate and optimize their solutions and the client experience. Bond provides customers with regular reporting to help them see their brand performance. They needed to improve reporting speed from weeks to hours.

Bond wanted to augment its reporting with data from third-party sources. Due to the size of the data, it did not make sense to store the information in a transactional database. The Bond team investigated the concept of moving to a data lake architecture with Delta Lake to support advanced analytics. The team began to gather data from different sources, landing it in the data lake and staging it through each phase of data processing. Using Azure Databricks, Bond was able to leverage that rich data, serving up more powerful, near real-time reports to customers. Instead of the typical pre-canned reports that took a month to produce, Bond is now able to provide more flexible reporting every 4 hours.

Advanced Analytics Architecture
Diagram: Advanced Analytics Architecture

Going from monthly reports to fresh reports every 4 hours has enabled efficiencies. Bond used to have warehouses, reporting cubes, and disaster recovery to support the old world. Now, with Azure Databricks, Bond has observed that its data pipelines and reporting are easier to maintain. ETL pipelines that previously took 18 hours to run on-premises with SSIS can now be run in as little as 5 minutes. As a byproduct of the transition to a modern, simplified architecture, Bond was able to reduce pipeline creation activities from two weeks down to about two days.

With Azure Databricks and Delta Lake, Bond now has the ability to innovate faster. The company is augmenting its data with Machine Learning (ML) predictions, to orchestrate campaigns, and enable deeper campaign personalization so clients can provide more relevant offers to their customers. Thanks to Azure Databricks, Bond can easily scale its data processing and analytics to support rapid business growth.

Get Started by migrating SSIS pipelines to the cloud

Customers have successfully migrated a wide variety of on-premises ETL software tools to Azure. Since many Azure customers use SQL Server Integration Services (SSIS) for their on-premises ETL pipelines, let’s take a deeper look at how to migrate an SSIS pipeline to Azure.

There are several options and the one you choose will depend on a few factors:

  • Complexity of the pipeline (number and types of input and destination data sources)
  • Connectivity to the source and destination data stores
  • Number and type of applications dependent on the source data store

Execute SSIS Packages on Azure Data Factory

This is a great transitional option for data teams that prefer a phased approach to migrating data pipelines to Azure Databricks. Leveraging Azure Data Factory, you can run your SSIS packages in Azure.

Execute SSIS packages with Azure Data Factory
Diagram: Execute SSIS packages with Azure Data Factory

Modernize ETL Pipelines with Azure Databricks Notebooks

Azure Databricks enables you to accelerate your ETL pipelines by parallelizing operations over scalable compute clusters. This option is best if the volume, velocity and variety of data processed by your ETL pipeline is expected to grow rapidly over time. You can leverage your SQL skills with Databricks notebooks to query your data lake with Delta Lake.

Transformation in Azure Data Factory with Azure Databricks
Diagram: Transformation in Azure Data Factory with Azure Databricks

Making the Switch from SSIS to Azure Databricks

When considering a migration of your ETL pipeline to Azure Databricks and Azure Data Factory, start your discovery, planning and road map by considering the following:

  1. Data Volume – how much data to process in each batch?
  2. Data Velocity – how often should you run your jobs?
  3. Data Variety – structured vs. unstructured data?

Databricks migration methodology: 1) Discovery, planning and road map, 2) Target Delta architecture, 3) Migration and 4) Validation
Diagram: Databricks migration methodology

Next, ensure that your target data architecture leverages Delta Lake for scalability and flexibility supporting varying ETL workloads.

ETL at scale with Azure Data Factory, Azure Data Lake Storage, Delta Lake and Azure Databricks
Diagram: ETL at scale with Azure Data Factory, Azure Data Lake Storage, Delta Lake and Azure Databricks

Migrate and validate your ETL pipelines

When you are ready to begin your ETL migration, start by migrating your SSIS logic to Databricks notebooks, where you can interactively run and test data transformations and movement. Once the notebooks are executing properly, create data pipelines in Azure Data Factory to automate your ETL jobs. Finally, validate the outcome of the migration from SSIS to Databricks by reviewing the data in your destination data lake and checking logs for errors; then schedule your ETL jobs and set up notifications. A simple example of what a migrated notebook cell might look like is sketched below.
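
As a rough illustration (the storage paths, container names and column names below are placeholders, not part of any specific migration), a simple SSIS-style extract/transform/load step often maps to a short notebook cell like this:

# Hypothetical example of an SSIS-style extract/transform/load step as a Databricks notebook cell
from pyspark.sql import functions as F

orders = (spark.read.format("csv")
          .option("header", True)
          .option("inferSchema", True)
          .load("abfss://raw@mydatalake.dfs.core.windows.net/orders/"))        # placeholder source path

cleaned = (orders
           .dropDuplicates(["order_id"])                                       # remove duplicate rows
           .withColumn("order_date", F.to_date("order_date"))                  # normalize data types
           .filter(F.col("amount") > 0))                                       # apply business rules

(cleaned.write.format("delta")
        .mode("overwrite")
        .save("abfss://curated@mydatalake.dfs.core.windows.net/orders_delta/")) # placeholder Delta target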

Migrating your ETL processes and workloads to the cloud helps you accelerate results, lower expenses and increase reliability. Learn more about Modern Data Engineering with Azure Databricks and Using SQL to Query Your Data Lake with Delta Lake, see this ADF blog post and this ADF tutorial. If you are ready to experience acceleration of your ETL pipelines, schedule a demo.

--

Try Databricks for free. Get started today.

The post How to accelerate your ETL pipelines from 18 hours to as fast as 5 minutes with Azure Databricks appeared first on Databricks.

Flagging at-risk subscribers for direct-to-consumer media services


“The biggest problem for streaming services is not so much getting new members, it’s holding them. It’s the churn factor.”

Tom Rogers, Executive Chairman at WinView, Inc and former NBC Cable President on CNBC

As more content owners monetize their content libraries through direct-to-consumer (D2C) streaming services, their biggest challenge isn’t getting new customers in the door: with no upfront costs like truck rolls or set-top boxes, digital services can easily generate free trials for their apps or channels (if selling through a channel platform like Prime Video Channels or Roku Channels).

But this ease of sampling is a double edged sword. With no 2-year commitments locking customers in, subscribers can cancel anytime with a few clicks, whether before rolling-to-pay, or when their sports or show season ends. Driving a relevant, 1:1 experience with consumers across all channels and at all times is now table stakes for many in the direct-to-consumer space as they look to reduce churn and win the war for attention for their subscription services.

But rather than broad save offers, what if you could segment subscribers by behavior so that you proactively send at-risk customers an offer before they go to cancel? What if that offer was personalized to their estimated customer lifetime value to ensure positive ROI? More broadly, what insights can customer behavior reveal about your product and content that would improve the customer experience and increase the value subscribers get every month? Databricks customer Showtime, for example, analyzed their D2C data to revamp their production schedule, ensuring there was no off-season in their new-release schedule, in order to keep subscribers year round and maximize their customer lifetime value.

Introducing the Churn Prediction Modeling Solution Accelerator

Based on best practices of subscription services across industries ranging from digital media to retail to financial services, we have released a solution accelerator allowing enterprises to better understand not only when but also why customers leave subscription services. In this blog we walk through the general consumer lifecycle issues facing subscription-based businesses, and provide pre-built notebooks and sample data to jumpstart data science teams in media companies tasked with reducing subscriber churn rates. While the sample customer data looks at signup source and offer types, you can customize the machine learning models and analytics tools to your own unique viewers so you can analyze churn risk based on:

  • when and how often they access subscription content
  • specific content or kinds of content consumed
  • quality of service events
  • type of devices used to access content
  • depth of engagement (e.g., do they watch full screen, how much do they watch)
  • referral source performance
  • and just about any other user behavior data tagged in your data lake

Link to the churn prediction solution accelerator

You can also watch our on-demand workshop where we walk through the customer churn analysis  solution accelerator.

--

Try Databricks for free. Get started today.

The post Flagging at-risk subscribers for direct-to-consumer media services appeared first on Databricks.

Modern Industrial IoT Analytics on Azure – Part 3


In part 2 of this three-part series on Azure data analytics for modern industrial internet of things (IIoT) applications, we ingested real-time IIoT data from field devices into Azure and performed complex time-series processing on Data Lake directly. In this post, we will leverage machine learning for predictive maintenance and to maximize the revenue of a wind turbine while minimizing the opportunity cost of downtime, thereby maximizing profit.

The end result of our model training and visualization will be a Power BI report shown below:

With Azure Databricks’ IIoT data analytics, you can use the output to generate powerful real-time BI dashboards.

The end-to-end architecture is again shown below.

The IIoT data analytic architecture featuring the Azure Data Lake Store and Delta storage format offers data teams the optimal platform for handling time-series streaming data.

Machine Learning:  Power Output and Remaining Life Optimization

Optimizing the utility, lifetime, and operational efficiency of industrial assets like wind turbines has numerous revenue and cost benefits. The real-world challenge we explore in this article is maximizing the revenue of a wind turbine while minimizing the opportunity cost of downtime, thereby maximizing our net profit.

Net profit = Power generation revenue – Cost of added strain on equipment

If we push a turbine to a higher RPM, it will generate more energy and therefore more revenue. However, the added strain on the turbine will cause it to fail more often, introducing cost.

To solve this optimization problem, we will create two models:

  1. Predict the power generated of a turbine  given a set of operating conditions
  2. Predict the remaining life of a turbine given a set of operating conditions

Using Azure Databricks data analytics for IIoT applications to predict the remaining life of a wind turbine.

We can then produce a profit curve to identify the optimal operating conditions that maximize power revenue while minimizing costs.

Using Azure Databricks with our Gold Delta tables, we will perform feature engineering to extract the fields of interest, train the two models, and finally deploy the models to Azure Machine Learning for hosting.

The Azure Databricks machine learning model lifecycle for an IIoT data analytics use case

To calculate the remaining useful lifetime of each Wind Turbine, we can use our maintenance records that indicate when each asset is replaced.

%sql
-- Calculate the age of each turbine and the remaining life in days
CREATE OR REPLACE VIEW turbine_age AS
WITH reading_dates AS (SELECT distinct date, deviceid FROM turbine_power),
	maintenance_dates AS (
	SELECT d.*, datediff(nm.date, d.date) as datediff_next, datediff(d.date, lm.date) as datediff_last
	FROM reading_dates d
		LEFT JOIN turbine_maintenance nm ON (d.deviceid=nm.deviceid AND d.date<=nm.date)  -- next maintenance on or after the reading date
		LEFT JOIN turbine_maintenance lm ON (d.deviceid=lm.deviceid AND d.date>=lm.date)) -- last maintenance on or before the reading date
SELECT date, deviceid, min(datediff_last) AS age, min(datediff_next) AS remaining_life
FROM maintenance_dates 
GROUP BY deviceid, date;

To predict power output at a six-hour time horizon, we calculate time series shifts using Spark window functions.

CREATE OR REPLACE VIEW feature_table AS
SELECT r.*, age, remaining_life,
	-- Calculate the power 6 hours ahead using Spark Windowing and build a feature_table to feed into our ML models
	LEAD(power, 6, power) OVER (PARTITION BY r.deviceid ORDER BY time_interval) as power_6_hours_ahead
FROM gold_readings r 
JOIN turbine_age a ON (r.date=a.date AND r.deviceid=a.deviceid)
WHERE r.date < CURRENT_DATE();

With Azure Databricks, you can calculate time-series shifts using Spark window functions to predict, for example, the power output of a wind farm at a six-hour time horizon.

With Azure Databricks’ IIoT data analytics, you can calculate, for example,  the remaining useful lifetime  of a wind turbine, using maintenance records that indicate when each asset has been replaced.

There are strong correlations between the power generated six hours ahead and both the turbine operating parameters (RPM and angle) and the weather conditions.

With Azure Databricks’ IIoT data analytics, you can uncover, for example,  the strong correlations between both turbine operating parameters (RPM and Angle) as well as weather conditions and future power generated.

We can now train an XGBoost Regressor model to use our feature columns (weather, sensor and power readings) to predict our label (power reading six hours ahead). We can train a model for each Wind Turbine in parallel using a Pandas UDF, which distributes our XGBoost model training code to all the available nodes in the Azure Databricks cluster.

# Imports assumed by this snippet
import mlflow
import mlflow.xgboost
import xgboost as xgb
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Create a Spark Dataframe that contains the features and labels we need
feature_cols = ['angle','rpm','temperature','humidity','windspeed','power','age']
label_col = 'power_6_hours_ahead'

# Read in our feature table and select the columns of interest
feature_df = spark.table('feature_table')

# Create a Pandas UDF to train an XGBoost Regressor on each turbine's data
@pandas_udf(feature_df.schema, PandasUDFType.GROUPED_MAP)
def train_power_model(readings_pd):
    mlflow.xgboost.autolog()  # Auto-log the XGB parameters, metrics, model and artifacts
    with mlflow.start_run():
        # Train an XGBoost regressor on the data for this turbine
        train_dmatrix = xgb.DMatrix(data=readings_pd[feature_cols].astype('float'), label=readings_pd[label_col])
        model = xgb.train(params={}, dtrain=train_dmatrix, evals=[(train_dmatrix, 'train')])  # default params shown; tune as needed
    return readings_pd

# Run the Pandas UDF against our feature dataset
power_predictions = feature_df.groupBy('deviceid').apply(train_power_model)

With Azure Databricks’ IIoT data analytics, you can predict, for example, the power output of a specific wind turbine and display the results in a time-series visualization.

Azure Databricks will automatically track each model training run with a hosted MLflow experiment. For XGBoost Regression, MLflow will track any parameters passed into the params argument, the RMSE metric, the turbine this model was trained on, and the resulting model itself. For example, the RMSE for predicting power on deviceid WindTurbine-18 is 45.79.
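
The logged runs can also be queried programmatically from the notebook; here is a small sketch (the exact metric column names depend on the XGBoost and MLflow versions and on the eval set name used above):

# Sketch: list the runs and metrics MLflow recorded for this notebook's experiment
import mlflow

runs = mlflow.search_runs()                        # returns a pandas DataFrame of logged runs
print(runs.filter(regex="run_id|metrics").head())  # e.g. a 'metrics.train-rmse' column, name may vary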

We can train a similar model for the remaining life of the wind turbine. The actuals vs. predicted for one of the turbines is shown below.

With Azure Databricks’ IIoT data analytics, you can predict, for example, the remaining lifespan of a wind turbine and generate a time-series visualization comparing the prediction against the actuality.

Model Deployment and Hosting

Azure Databricks is integrated with Azure Machine Learning for model deployment and scoring. Using the Azure ML APIs directly inside of Databricks, we can automatically deploy an image for each model to be hosted in a fast, scalable container service (ACI or AKS) by Azure ML.

# Create a model image inside of AzureML
model_image, azure_model = mlflow.azureml.build_image(model_uri=path, 
														workspace=workspace, 
														model_name=model,
														image_name=model,
														description="XGBoost model to predict power output”
														synchronous=False)

# Deploy a web service to host the model as a REST API
dev_webservice_deployment_config = AciWebservice.deploy_configuration()
dev_webservice = Webservice.deploy_from_image(name=dev_webservice_name, 
												image=model_image,                                                      
												workspace=workspace)

Once the model is deployed, it will show up inside the Azure ML studio, and we can make REST API calls to score data interactively.

import json
import requests

# Construct a payload to send with the request
payload = {
	'angle':12,
	'rpm':10,
	'temperature':25,
	'humidity':50,
	'windspeed':10,
	'power':200,
	'age':10
}

def score_data(uri, payload):
	rest_payload = json.dumps({"data": [list(payload.values())]})
	response = requests.post(uri, data=rest_payload, headers={"Content-Type": "application/json"})
	return json.loads(response.text)

print(f'Predicted power (in kwh) from model: {score_data(power_uri, payload)}')
print(f'Predicted remaining life (in days) from model: {score_data(life_uri, payload)}')

Now that both the power output and remaining useful life (RUL) models are deployed as prediction services, we can use both to optimize the net profit from each wind turbine.

Assuming $1 per kWh, annual revenue can simply be calculated by multiplying the expected hourly power by 24 hours and 365 days.

The annual cost can be calculated by multiplying the daily revenue by the number of times the turbine needs to be maintained in a year (365 days / remaining life).

We can iteratively score various operating parameters simply by making multiple calls to our models hosted in Azure ML, as sketched below. By visualizing the expected profit for various operating parameters, we can identify the optimal RPM to maximize profit.
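
As a sketch of that calculation, we can reuse the score_data helper and payload defined above; the RPM range and the assumption that each scoring call returns a list with a single numeric prediction are illustrative.

# Sketch of the profit calculation described above; score_data, payload,
# power_uri and life_uri are defined earlier. Assumes each service returns a
# list containing a single numeric prediction.
REVENUE_PER_KWH = 1.0  # $1 per kWh, per the assumption above

def annual_profit(operating_params):
	hourly_power   = score_data(power_uri, operating_params)[0]  # kWh produced per hour
	remaining_life = score_data(life_uri, operating_params)[0]   # days until next maintenance
	annual_revenue = hourly_power * 24 * 365 * REVENUE_PER_KWH
	daily_revenue  = hourly_power * 24 * REVENUE_PER_KWH
	annual_cost    = daily_revenue * (365 / remaining_life)      # one day of lost revenue per maintenance
	return annual_revenue - annual_cost

# Sweep candidate RPM settings (illustrative range) to find the most profitable operating point
profit_by_rpm = {rpm: annual_profit({**payload, 'rpm': rpm}) for rpm in range(5, 16)}
optimal_rpm = max(profit_by_rpm, key=profit_by_rpm.get)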

With Azure Databricks’ IIoT data analytics, you can iteratively score various operating parameters by calling the models hosted in Azure ML. The resulting visualization of the expected profit for various operating parameters can help to identify the optimal RPM to maximize profit.

Data Serving: Azure Data Explorer and Azure Synapse Analytics

Operational Reporting in ADX

Azure Data Explorer (ADX) provides real-time operational analytics on streaming time-series data. IIoT device data can be streamed directly into ADX from IoT Hub, or pushed from Azure Databricks using the Kusto Spark Connector from Microsoft as shown below.

stream_to_adx = (
	spark.readStream.format('delta').option('ignoreChanges',True).table('turbine_enriched')
		.writeStream.format("com.microsoft.kusto.spark.datasink.KustoSinkProvider")
		.option("kustoCluster",kustoOptions["kustoCluster"])
		.option("kustoDatabase",kustoOptions["kustoDatabase"])
		.option("kustoTable", kustoOptions["kustoTable"])
		.option("kustoAadAppId",kustoOptions["kustoAadAppId"])
		.option("kustoAadAppSecret",kustoOptions["kustoAadAppSecret"])
		.option("kustoAadAuthorityID",kustoOptions["kustoAadAuthorityID"])
		.start()                                                            # Start the stream to ADX
	)

Power BI can then be connected to the Kusto table to create a true real-time operational dashboard for turbine engineers.

ADX also contains native time-series analysis functions such as forecasting and anomaly detection. For example, the Kusto code below finds anomalous points for RPM readings in the data stream.

turbine_raw
| where rpm > 0
| make-series rpm_normal = avg(rpm) default=0 on todatetime(timestamp) in range(datetime(2020-06-30 00:00:00), datetime(2020-06-30 01:00:00), 10s)
| extend anomalies = series_decompose_anomalies(rpm_normal, 0.5)
| render anomalychart with(anomalycolumns=anomalies, title="RPM Anomalies")

With Azure Databricks’ IIoT data analytics, machine-to-machine data can be streamed directly into ADX from IoT Hub, or pushed from Azure Databricks, to generate real-time operational analytics on streaming time-series data.

Analytical Reporting in ASA

Azure Synapse Analytics (ASA) is the next generation data warehouse from Azure that natively leverages ADLS Gen 2 and integrates with Azure Databricks to enable seamless data sharing between these services.


While leveraging the capabilities of Synapse and Azure Databricks, the recommended approach is to use the best tool for the job given your team’s requirements and the user personas accessing the data. For example, data engineers that need the performance benefits of Delta and data scientists that need a collaborative, rich and flexible workspace will gravitate towards Azure Databricks. Analysts that need a low-code or data warehouse-based SQL environment to ingest, process and visualize data will gravitate towards Synapse.

The Synapse streaming connector for Azure Databricks allows us to stream the Gold Turbine readings directly into a Synapse SQL Pool for reporting.

spark.conf.set("spark.databricks.sqldw.writeSemantics", "copy")                           # Use COPY INTO for faster loads

write_to_synapse = (
	spark.readStream.format('delta').option('ignoreChanges',True).table('turbine_enriched') # Read in Gold turbine readings
	.writeStream.format("com.databricks.spark.sqldw")                                     # Write to Synapse
	.option("url",dbutils.secrets.get("iot","synapse_cs"))                                # SQL Pool JDBC (SQL Auth)
	.option("tempDir", SYNAPSE_PATH)                                                      # Temporary ADLS path
	.option("forwardSparkAzureStorageCredentials", "true")
	.option("dbTable", "turbine_enriched")                                                # Table in Synapse to write to
	.option("checkpointLocation", CHECKPOINT_PATH+"synapse")                              # Streaming checkpoint
	.start()
)

Alternatively, Azure Data Factory can be used to read data from the Delta format and write it into Synapse SQL Pools. More documentation can be found here.

Now that the data is clean, processed, and available to data analysts for reporting, we can build a live Power BI dashboard against the live data as well as the predictions from our ML model, as shown below.

Summary

To summarize, we have successfully:

  • Ingested real-time IIoT data from field devices into Azure
  • Performed complex time-series processing on Data Lake directly
  • Trained and deployed ML models to optimize the utilization of our Wind Turbine assets
  • Served the data to engineers for operational reporting and data analysts for analytical reporting

The key big data technology that ties everything together is Delta Lake. Delta on ADLS provides reliable streaming data pipelines and highly performant data science and analytics queries on massive volumes of time-series data. Lastly, it enables organizations to truly adopt a Lakehouse pattern by bringing best-of-breed Azure tools to a write-once, access-often data store.

What’s Next?

Try out the notebook hosted here, learn more about Azure Databricks with this 3-part training series and see how to create modern data architectures on Azure by attending this webinar.

--

Try Databricks for free. Get started today.

The post Modern Industrial IoT Analytics on Azure – Part 3 appeared first on Databricks.

Top 5 Reasons to Convert Your Cloud Data Lake to a Delta Lake


If you examine the agenda for any of the Spark Summits in the past five years, you will notice that there is no shortage of talks on how best to architect a data lake in the cloud using Apache Spark™ as the ETL and query engine and Apache Parquet as the preferred file format.  There are talks that give advice on how to [and how not to] partition your data, how to calculate the ideal file size, how to handle evolving schemas, how to build compaction routines, how to recover from failed ETL jobs, how to stream raw data into the data lake, etc.

Databricks has been working with customers throughout this time to encapsulate all of the best practices of a data lake implementation into Delta Lake, which was open-sourced at Spark + AI Summit in 2019.  There are many benefits to converting an Apache Parquet Data Lake to a Delta Lake, but this blog will focus on the Top 5 reasons:

  1. Prevent Data Corruption
  2. Faster Queries
  3. Increase Data Freshness
  4. Reproduce ML Models
  5. Achieve Compliance

Fundamentally, Delta Lake maintains a transaction log alongside the data.  This enables each Delta Lake table to have ACID-compliant reads and writes.
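
As a minimal sketch (the path is illustrative, and display/dbutils assume a Databricks notebook), the log is simply a _delta_log directory written next to the Parquet data files, and every committed write appears as a new version in the table history:

# Write a small Delta table, then inspect its transaction log and history
df = spark.range(0, 1000).withColumnRenamed("id", "reading_id")
df.write.format("delta").mode("overwrite").save("/tmp/delta/readings")

# The transaction log lives alongside the data files
display(dbutils.fs.ls("/tmp/delta/readings/_delta_log"))

# Each committed write is an ACID transaction recorded as a table version
display(spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/readings`"))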

Prevent Data Corruption

Data lakes were originally conceived as an on-premises big data complement to data warehousing, built on top of HDFS rather than in the cloud.  When the same design pattern was replicated onto blob storage, like Amazon Web Services (AWS) S3, unique challenges ensued because of its eventual consistency properties.  The original Hadoop commit protocol assumes RENAME functionality for transactions, which exists on HDFS but not in S3.  This forced engineers to choose between two different Hadoop commit protocols: one safe but slow, the other fast but unsafe.  Prior to releasing Delta Lake, Databricks developed its own commit protocol to address this.  This Spark Summit presentation from 2017, Transactional Writes to Cloud Storage, explains the challenges and our then-solution to address them.

Delta Lake was designed from the beginning to accommodate blob storage and the eventual consistency and data quality properties that come with it.  If an ETL job fails against a Delta Lake table before fully completing, it will not corrupt any queries.  Each SQL query will always refer to a consistent state of the table.  This allows an enterprise data engineer to troubleshoot why an ETL job may have failed, fix it, and re-run it without needing to worry about alerting users, purging partially written files, or reconciling to a previous state.

Before Delta Lake, a common design pattern was to partition the first stage of data by a batch id so that if a failure occurred upon ingestion, the partition could be dropped and a new one created on retry.  Although this pattern helps with ETL recoverability, it usually results in many partitions containing a few small Parquet files, thereby impeding downstream query performance.  This is typically rectified by duplicating the data into other tables with broader partitions.  Delta Lake still supports partitions, but you only need to match them to expected query patterns, and only if each partition contains a substantial amount of data.  This ends up eliminating many partitions in your data and improving performance by scanning fewer files.
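
A sketch of that guidance, with illustrative table and path names: partition only on a column that most queries filter on, such as the reading date, rather than on a batch id.

# Partition the Delta table by date only, matching the expected query pattern;
# avoid fine-grained partitions (e.g. batch id) that produce many small files
(spark.table("turbine_raw")
	.write.format("delta")
	.mode("overwrite")
	.partitionBy("date")
	.save("/mnt/delta/turbine_bronze"))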

Spark allows you to merge different Parquet schemas together with the mergeSchema option.  With a regular Parquet data lake, the schema can differ across partitions, but not within partitions.  However, a Delta Lake table does not have this same constraint.  Delta Lake gives the engineer a choice to either allow the schema of a table to evolve, or to enforce a schema upon write.  If an incompatible schema change is detected, Delta Lake will throw an exception and prevent the table from being corrupted with columns that have incompatible types.  Additionally, a Delta Lake table may include NOT NULL constraints on columns, which cannot be enforced on a regular Parquet table.  This prevents records from being loaded with NULL values for columns which require data (and could break downstream processes).
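
A minimal sketch of both behaviors, assuming an existing Delta table named turbine_enriched (the column names and the Databricks SQL constraint syntax are illustrative):

from pyspark.sql.functions import col, lit

# 1) Schema enforcement: an append with an incompatible column type is rejected
bad_df = spark.table("turbine_enriched").withColumn("rpm", col("rpm").cast("string"))
try:
	bad_df.write.format("delta").mode("append").saveAsTable("turbine_enriched")
except Exception as e:
	print("Write rejected by schema enforcement:", e)

# 2) Opt-in schema evolution for a compatible new column
new_df = spark.table("turbine_enriched").withColumn("firmware_version", lit("1.2.0"))
new_df.write.format("delta").option("mergeSchema", "true").mode("append").saveAsTable("turbine_enriched")

# 3) Require a value for a key column going forward
spark.sql("ALTER TABLE turbine_enriched ALTER COLUMN deviceid SET NOT NULL")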

One final way that Delta Lake prevents data corruption is by supporting the MERGE statement. Many tables are structured to be append-only, however, it is not uncommon for duplicate records to enter pipelines.  By using a MERGE statement, a pipeline can be configured to INSERT a new record or ignore records that are already present in the Delta Table.
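
A sketch of that pattern using the Delta Lake Python API, where new_readings_df and the (deviceid, timestamp) key are hypothetical:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "turbine_enriched")

(target.alias("t")
	.merge(new_readings_df.alias("s"),
		"t.deviceid = s.deviceid AND t.timestamp = s.timestamp")
	.whenNotMatchedInsertAll()   # insert new records; silently skip duplicates
	.execute())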

Faster Queries

Delta Lake has several properties that can make the same query much faster than it would be on regular Parquet.  Rather than performing an expensive LIST operation on the blob storage for each query, which is what the regular Parquet reader would do, the Delta transaction log serves as the manifest.

The transaction log not only keeps track of the Parquet filenames but also centralizes their statistics.  These are the min and max values of each column that is found in the Parquet file footers.  This allows Delta Lake to skip the ingestion of files if it can determine that they do not match the query predicate.

Another technique to skip the ingestion of unnecessary data is to physically organize the data in such a way that query predicates only map to a small number of files.  This is the concept behind the ZORDER reorganization of data.  This table design accommodates fast queries on columns that are not part of the partition key.  The combination of these data-skipping techniques is explained in the 2018 blog Processing Petabytes of Data in Seconds with Databricks Delta.
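
For example, a single command can both compact the table and co-locate rows by a non-partition column such as deviceid so that selective queries skip most files (table and column names are illustrative):

# Compact the table and co-locate rows by deviceid so that data skipping
# prunes files for predicates on that column
spark.sql("OPTIMIZE turbine_enriched ZORDER BY (deviceid)")

# Subsequent selective queries read far fewer files
spark.sql("SELECT avg(power) FROM turbine_enriched WHERE deviceid = 'WindTurbine-18'").show()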

At Spark + AI Summit 2020, Databricks announced our new Delta Engine, which adds even more performance enhancements.  It has an intelligent caching layer that caches data on the SSD/NVMe drives of a cluster as it is ingested, making subsequent queries on the same data faster.  It has an enhanced query optimizer to speed up common query patterns.  The biggest innovation, however, is Photon, a native vectorization engine written in C++.  Altogether, these components give Delta Engine a significant performance gain over Apache Spark, while keeping the same open APIs.

Because Delta Lake is an open-source project, a community has been forming around it and other query engines have built support for it.  If you’re already using one of these query engines, you can start to leverage Delta Lake and achieve some of the benefits immediately.

Increase Data Freshness

Many Parquet Data Lakes are refreshed every day, sometimes every hour, but rarely every minute.  Sometimes this is linked to the grain of an aggregation, but often it is due to the technical challenges of streaming real-time data into a data lake.  Delta Lake was designed from the beginning to accommodate both batch and streaming ingestion use cases.  By leveraging Structured Streaming with Delta Lake, you automatically get built-in checkpointing when transforming data from one Delta table to another.  With a single change to the Trigger configuration, the ingestion can switch from batch to streaming, as sketched below.
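
A sketch with illustrative table names and paths; the same Delta-to-Delta pipeline runs as a scheduled batch job or as a continuous stream depending only on the trigger.

query = (spark.readStream.format("delta").table("turbine_bronze")
	.writeStream.format("delta")
	.option("checkpointLocation", "/mnt/delta/_checkpoints/turbine_silver")  # built-in checkpointing
	# .trigger(once=True)                   # batch-style: process what is available, then stop
	.trigger(processingTime="1 minute")     # streaming: run a micro-batch every minute
	.start("/mnt/delta/turbine_silver"))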

One challenge in accommodating streaming ingestion is that more frequent writes generate many very small files, which negatively impacts downstream query performance.  As a general rule, it is faster to query a small number of large files than a large number of small files.  Each table should aim for a uniform Parquet file size, typically somewhere between 128 MB and 512 MB.  Over the years, engineers have developed their own compaction jobs to merge these small files into larger ones.  However, because blob storage lacks transactions, these routines are typically run in the middle of the night to prevent downstream queries from failing.  Delta Lake can compact these files with a single OPTIMIZE command, and because of ACID compliance, it can be run at the same time that users query the table.  Likewise, Delta Lake can leverage auto-optimize to continuously write files at the optimal size.
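
A sketch with an illustrative table name; the OPTIMIZE run can proceed while readers query the table, and the table properties enable Databricks auto-optimize for future writes:

# Compact small streaming files into larger ones without blocking readers
spark.sql("OPTIMIZE turbine_enriched")

# Have future writes produce right-sized files automatically
spark.sql("""
	ALTER TABLE turbine_enriched SET TBLPROPERTIES (
		delta.autoOptimize.optimizeWrite = true,
		delta.autoOptimize.autoCompact = true
	)
""")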

Reproduce ML Models

In order for a machine learning model to be improved, a data scientist must first reproduce the results of the model.  This can be particularly daunting if the data scientist who trained the model has since left the company.  It requires that the same logic, parameters, libraries, and environment be used.  Databricks developed MLflow in 2018 to solve this problem.

The other element that needs to be tracked for reproducibility is the training and test data sets.  Delta Lake’s Time Travel feature uses data versioning to let you query the data as it was at a certain point in time.  This means you can reproduce the results of a machine learning model by retraining it on exactly the same data, without needing to copy that data.
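
For example, a later run can query exactly the snapshot that was used for training (the table name, version number, and timestamp below are illustrative):

# Pin training data to an exact table version...
training_df = spark.sql("SELECT * FROM feature_table VERSION AS OF 12")

# ...or to a point in time
training_df = spark.sql("SELECT * FROM feature_table TIMESTAMP AS OF '2020-06-30'")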

Achieve Compliance

New laws such as GDPR and CCPA require that companies be able to purge data pertaining to a customer should a request by the individual be made.  Deleting or updating data in a regular Parquet Data Lake is compute-intensive.  All of the files that pertain to the personal data being requested must be identified, ingested, filtered, written out as new files, and the original ones deleted.  This must be done in a way as to not disrupt or corrupt queries on the table.

Delta Lake includes DELETE and UPDATE actions for the easy manipulation of data in a table.  For more information, please refer to the article Best practices: GDPR and CCPA compliance using Delta Lake and the tech talk Addressing GDPR and CCPA Scenarios with Delta Lake and Apache Spark™.
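
A sketch with illustrative table and column names; Delta rewrites only the files containing the affected rows and records the change in the transaction log:

# Purge all records for a customer who has requested deletion
spark.sql("DELETE FROM customer_readings WHERE customer_id = 'c-12345'")

# Or correct/redact a single field in place
spark.sql("UPDATE customer_readings SET email = NULL WHERE customer_id = 'c-12345'")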

Summary

In summary, there are many benefits from switching your Parquet Data Lake to a Delta Lake, but the top 5 are:

  1. Prevent Data Corruption
  2. Faster Queries
  3. Increase Data Freshness
  4. Reproduce ML Models
  5. Achieve Compliance

One final reason to consider switching your Parquet Data Lake to a Delta Lake is that it is simple and quick to convert a table in place with the CONVERT command (sketched below), and equally simple to undo the conversion.  Give Delta Lake a try today!
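
As a sketch (the path and partition column are illustrative), the conversion writes a transaction log next to the existing Parquet files without rewriting them:

# Convert an existing partitioned Parquet directory to Delta in place
spark.sql("CONVERT TO DELTA parquet.`/mnt/lake/events` PARTITIONED BY (date DATE)")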

--

Try Databricks for free. Get started today.

The post Top 5 Reasons to Convert Your Cloud Data Lake to a Delta Lake appeared first on Databricks.
