
Powering a Better Future With ESG Data and Analytics


By now, most professionals know that the future of business goes hand in hand with social responsibility, environmental stewardship and corporate ethics. It should come as no surprise, then, that Sustainability and Environmental, Social, and Governance (ESG) have become top priorities for consumers, investors and regulators alike.

But creating a better future for all stakeholders requires organizations to re-think what they make, how they operate, and where to make strategic bets for the future. Data and technology are playing an increasingly important role in driving innovation and transforming how organizations operate in a responsible and sustainable fashion.

To showcase some of the amazing ways data and AI are driving ESG initiatives, we’ll be hosting a global executive sustainability forum in early March. Featuring some of the foremost ESG thought leaders and innovators, the forum will explore the forces driving this shift in corporate responsibility and how global organizations are leveraging data and AI to make smarter investments towards a more sustainable, inclusive and equitable future. Here’s a snapshot of the amazing lineup of speakers:


Kathy Matsui
Former Vice-Chair, Goldman Sachs

Why womenomics is good business
Top financial strategist and female empowerment leader Kathy Matsui, best known for her insightful research on “womenomics,” will discuss why diversity and inclusion is good business and where challenges still remain.


Daiana Beitler, PhD
Cross-Industry Digital Strategist, Digital Transformation Partnerships, Microsoft

Panel Discussion: How leading companies are making a positive impact
As a digital transformation strategist, Daiana Beitler will discuss the human impact of digital transformation, and how sustainability is transforming not only businesses but also communities.


Ilaria Chan
Group Advisor on Social Impact, Grab

Panel Discussion: How leading companies are making a positive impact
As a social impact advisor for Grab, a company that provides safe and affordable transportation in sometimes dangerous areas for gig workers, Ilaria Chan will discuss the role data and AI play in Grab’s mission to unleash the potential of the underserved and underprivileged.


Sara Menker
Founder, CEO, Gro Intelligence

Fireside chat: Combating climate risk in your business with AI
As one of the most successful female CEOs in the world, Sara Menker of Gro Intelligence will discuss how the organization leverages data and AI to fight climate change by helping users better understand and predict global food and agriculture markets.

In addition to the expert panel, we’ll also be showcasing how to leverage data and AI to improve ESG investing. Specifically, we’ll demonstrate how institutional investors can use machine learning to extract key ESG insights about an organization by analyzing a variety of data sources including annual reports and media coverage.

As a special thank you, Databricks will be planting a tree in honor of each attendee.

Space is limited for this event. Register now to save your spot!



Paths Crossed Again


A Korean proverb many Koreans like myself often forget is:

옷깃만 스쳐도 인연.

Merely brushing by someone means you’re fatefully connected to that person.

So many social connections come and go through our lives, and we don’t always value them as much as they deserve. But so much can happen even with brief interactions, which is what I’m going to unfold in this post.

In 2014, I was having busy days as a lead of the Netty project. As with any moderately popular open-source project, its maintenance is often a race against your inbox. Overwhelmed by an abundance of questions and feature requests, it’s easy to focus too much on achieving ‘inbox zero’ and forget about the milestones users are trying to achieve with the project.

Apache Spark™ was one such user, and that was my first brush with Reynold Xin, Databricks’ Co-founder. Norman (another Netty maintainer) and I helped him with a few interesting issues related to zero-copy file transfer and /dev/epoll transport. We didn’t even know what project Reynold was working on at that time. A few months later, we finally learned from Reynold’s email about Apache Spark winning the 2014 Gray Sort competition, and that Netty had played a non-trivial role in it. It was a very pleasant surprise, and I was excited that Netty could make an impact in so many different areas.

Exciting moments are fleeting, and the life of an open-source project maintainer continues. We all, maybe Reynold too, forgot about this milestone and moved on to the next pile of issues in the issue tracker.

Fast forward — my focus moved from pure network programming to frameworks that help organizations scale by providing smooth migration paths to asynchronous and reactive programming models. Armeria was the open-source RPC/REST framework my team designed from scratch for this purpose. In the last few years, thanks to Armeria, I had the opportunity to help users from various companies who were facing similar migration challenges.

In essence, the journey with Armeria was not so different from the one I had with Netty. I was thrilled by our happy users and the pleasant surprises they gave us. At the same time, it was a race against an endless stream of issues and pull requests to address and review. We shipped more than 170 releases with 2,500 commits over 5 years. It was a great achievement, but it also made me crave something new to work on that would broaden my horizons.

That was when my and Databricks’ paths crossed again.

Earlier this year, the Databricks engineering team was designing a new communication protocol for the recently announced SQL Analytics product. While researching solutions for a scalability issue, they found Armeria. The team reached out to me explaining the challenges they had with sending massive amounts of data across the network, with the goal of optimizing for both higher throughput and lower latency. A series of interesting technical discussions led to Reynold asking whether I’d consider joining Databricks.

I must confess that, after all those years, I had no idea what Databricks does! Actually, I couldn’t even remember the name of the company. After briefly researching the company, I was even less sure whether Databricks would be the right place for me, given my lack of experience in data or machine learning. I decided to keep talking to the team, since the technical interactions were so interesting (sorting petabytes of data and protocol optimizations).

However, my impression of the company changed dramatically as I went through the interview process. The interviews were more like bidirectional technical discussions, even when I was solving a given problem. I was impressed that I was not treated as a student to fill in the blanks but as a partner with the same goal. All in all, the interviewers were very friendly, and I was able to express my ideas interactively with great comfort. Such an attitude does not come from a recruiting manual. It was a clear sign that Databricks is a healthy place built on top of good faith and curiosity.

What fascinated me was the engineers’ telltale excitement at every moment: when explaining what they’ve built, showing some growth charts, and confessing the challenges ahead–or even some technical debt. At the end of the interview day, I started to think Databricks may be a place worth betting my career on.

As you can guess, the rest is history – well, in progress. As part of our engineering team, I focus on our RPC stack to help scale the organization and optimize our software fleet of millions of machines. The three short months here have reinforced the choice I made. I love the technical challenges, the in-depth design review process, the investment in new technologies, and the heavy use of automation I found here. I also enjoy the impact I have already had, and will continue to have, along with the great culture and everything I am learning from the rest of the team.

Looking back, I’m amazed at how such a brief cross-continental collaboration led us here – even after many years. It is indeed a magical journey that leaves a lot to be discovered and remembered.

Visit our Careers page to explore open software engineer positions and other global opportunities.


Announcing the Launch of Databricks on Google Cloud


Today, we are proud to announce the availability of Databricks on Google Cloud. This jointly developed service provides a simple, open lakehouse platform for data engineering, data science, analytics, and machine learning. It brings together the Databricks capabilities customers love with the data analytics solutions and global scale available from Google Cloud.

Open data platform meets open cloud

Databricks and Google Cloud share a common vision for an open data platform built on open standards, open APIs, and open infrastructure. With this partnership, organizations get the choice and flexibility to manage infrastructure and access data with the tools they need across cloud and on-premises environments. By adopting open frameworks and APIs, customers get the benefits of open source combined with managed cloud analytics and AI products.

What does our new partnership mean for customers? Enterprises can now implement the Databricks Lakehouse Platform on Google Cloud — made possible by Delta Lake on Databricks. Delta Lake adds data reliability to data lakes with ACID transactions and versioning and better data governance and query performance to data in Google Cloud Storage. This announcement paves the way for one simple, unified architecture for all data applications on Google Cloud, including real-time streaming, SQL workloads, business intelligence, data science, machine learning, and graph analytics.

The open cloud approach also improves interoperability and portability for enterprises that want to use multiple public clouds for analytics applications. A recent Gartner study concluded that at least 80% of enterprises had adopted a multi-cloud strategy across multiple geographies. The multi-cloud capability of Databricks allows customers to increase the efficiency and productivity of data processes, improve customer experiences, and create new revenue opportunities even when data is distributed across more than one cloud. For example, one leading global fast-food company (and Google Cloud customer) wants to build and deploy marketing solutions, such as churn reduction, behavioral segmentation, and lifetime value for about a dozen global markets by year-end 2021. By architecting a global data platform with Databricks, they will provide each regional business with a choice for their public cloud platform.

Streamlined integrations

Databricks is tightly integrated with Google Cloud compute, storage, analytics, and management products to give customers a simple, unified experience with high performance and enterprise security.

Compute and Storage: Built on Google Kubernetes Engine (GKE), Databricks on Google Cloud is the first fully container-based Databricks runtime on any cloud. It takes advantage of GKE’s managed services for the portability, security, and scalability developers know and love. Read/write access to GCS from Databricks allows customers to execute workloads faster and at lower costs.

Analytics: Databricks has an optimized connector with Google BigQuery that allows easy access to data in BigQuery directly via its Storage API for high-performance queries. The connector supports predicate pushdown, querying named tables and views, and directly running SQL on BigQuery and loading the results into an Apache Spark™ DataFrame. Also, Looker’s integration with Databricks and support for SQL Analytics, along with an open API environment on Google Cloud, complements the open, multi-cloud architecture. This integration gives Looker users the ability to directly query the data lake, providing an entirely new visualization experience.
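As a rough illustration of what querying BigQuery from a Databricks notebook can look like with this connector (the project, dataset, table, and column names below are placeholders, and exact option names can vary by connector version):

# Read a BigQuery table into a Spark DataFrame (table name is a placeholder)
events = (spark.read
          .format("bigquery")
          .option("table", "my_project.my_dataset.events")
          .load())

# Predicate pushdown: only the matching rows and columns are read via the Storage API
recent = events.filter("event_date >= '2021-01-01'").select("user_id", "event_type")
display(recent)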

Security and Administration: Experience a simplified deployment from the Google Cloud Marketplace with unified billing and one-click setup inside the Google Cloud console. Databricks’ integration with Google Cloud Identity allows customers to simply use their Google Cloud credentials for single sign-on and user provisioning on Databricks.

Put Databricks to work on Google Cloud

Some of the most innovative use cases for Databricks on Google Cloud are in retail, telco, media and entertainment, manufacturing, and financial services. Across every industry, data is driving digital transformation initiatives. With the lakehouse architecture, Databricks and Google Cloud customers are finding new ways to accelerate data-driven innovation.

Here are some of the most popular workloads customers are using Databricks for today. To learn more about industry-specific use cases, visit the Industry Solutions page.

Data lake modernization

Delta Lake on Databricks provides a modern foundation to transition from expensive, hard-to-scale on-premises systems to well-architected Google Cloud Storage–based data lakes. Even cloud-based Hadoop services lack the performance benefits of modern, cloud-native data platforms. In fact, companies that have migrated to Databricks from a cloud-based Hadoop service realize up to 50% performance improvement in data processing and 40% lower monthly infrastructure cost. Moving to Databricks on Google Cloud helps customers reduce administrative overhead, quickly scale compute resources up or down, and reduce operational costs with autoscaling and job termination.

Scalable data processing to prepare data for analytics

Databricks simplifies your ETL architecture and lowers costs to ingest and process data using a high-performance runtime on clusters optimized for data processing at scale. With Delta Lake, you can reliably store all data (structured, semi-structured, and unstructured) in raw format and incrementally move it through the transformation stages to an aggregated, BI-ready tier with ACID guarantees.
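A minimal sketch of this incremental, multi-stage pattern with Delta Lake (the paths and column names are illustrative, not a prescribed layout):

from pyspark.sql import functions as F

# Bronze: land raw JSON in Delta as-is
raw = spark.read.json("/mnt/raw/events")
raw.write.format("delta").mode("append").save("/mnt/delta/bronze/events")

# Silver: deduplicate and conform types, with ACID guarantees on every write
bronze = spark.read.format("delta").load("/mnt/delta/bronze/events")
silver = (bronze
          .dropDuplicates(["event_id"])
          .withColumn("event_ts", F.to_timestamp("event_time")))
silver.write.format("delta").mode("append").save("/mnt/delta/silver/events")

# Gold: BI-ready aggregates
gold = silver.groupBy(F.to_date("event_ts").alias("event_date")).count()
gold.write.format("delta").mode("overwrite").save("/mnt/delta/gold/daily_event_counts")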

Reliable analytics on the data lake

Customers use Delta Lake on top of Google Cloud Storage–based data lakes to bring reliability, performance, and lifecycle management. Delta Lake helps prevent data corruption, speed up queries, improve data freshness, and reproduce ML models, allowing customers to always trust their data for analytical insights. In addition, Databricks provides Delta Engine to significantly accelerate query performance on data lakes, especially those built on Delta Lake.
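One capability behind the reproducibility point above is Delta Lake time travel; here is a brief sketch (the table path is a placeholder):

# Query a Delta table as of an earlier version to reproduce a past analysis or model input
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/mnt/delta/silver/events"))

# Or pin the read to a point in time
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2021-01-01")
            .load("/mnt/delta/silver/events"))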

Data science and machine learning

Managed MLflow on Databricks allows data teams to track all experiments and models in one place, publish dashboards, and facilitate handoffs with peers and stakeholders across the entire workflow — from raw data to insights. Databricks’ collaborative workspace allows data teams to explore data, share insights, run experiments, and build ML models faster to be more productive.
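A minimal sketch of experiment tracking with MLflow (the model, parameter, and metric below are illustrative, not part of a specific Databricks workflow):

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="churn_baseline"):
    model = LogisticRegression(max_iter=500).fit(X_train, y_train)
    mlflow.log_param("max_iter", 500)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")  # model artifact tracked alongside the run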

Getting started

The launch of Databricks on Google Cloud is a win-win for customers. The tight integration of Databricks with Google Cloud‘s analytics and AI products delivers a broad range of capabilities — with more to come. Together, we will continue to innovate and support customers in building intelligent applications that solve tough data problems.

If you are interested in Databricks on Google Cloud, request access via the product page. To learn more, visit us at the launch event hosted by TechCrunch, where Ali Ghodsi and Thomas Kurian share their vision for this partnership and its benefits to customers.

SIGN UP FOR PUBLIC PREVIEW


The Hidden Value of Hadoop Migration


For years Hadoop was the default technology for big data analytics. But over time, it has fallen behind as new technologies have been introduced to provide better analytics solutions. Many organizations are looking at their Hadoop costs and trying to justify migrating to a modern cloud-based analytics platform. Databricks just released the whitepaper, “The Hidden Value of Hadoop Migration.” This whitepaper shows how practitioners can frame the business value of migrating off Hadoop to, for example, convince their leadership team to invest in switching over.

Databricks provides $13.8M+ potential value

There are three key topics in the whitepaper:

  1. Moving to a cloud-based analytics platform to reduce the operational costs of licenses and maintenance.
  2. The power of a modern cloud-based analytics platform to address more advanced use cases with huge business impact. This is the Hidden Value of Hadoop Migration.
  3. The importance of avoiding the temptation to simply recreate your Hadoop platform in the cloud, which increases migration costs and delays the migration’s productivity and innovation gains.

Recognizing the true cost of ownership

Many organizations initially look to migrate from on-premises Hadoop to lower their total cost of ownership (TCO). The focus is on the licensing cost of their Hadoop system, which alone makes a compelling case to migrate. Yet, getting internal funding can still be challenging, since a surface-level analysis doesn’t accurately depict just how costly Hadoop can be—and how valuable migrating to cloud-based solutions is. To get a real sense of what Hadoop costs your organization, you have to step back.

Licensing costs are usually the smallest component of the TCO. From a benchmark of Databricks customers, we found that licensing is less than 15% of the total cost, whereas data center management is nearly 50%. Additional costs include:

  • Hardware overcapacity is a given in on-premises implementations so that you can scale up to your largest needs, but much of that capacity sits idle most of the time.
  • Scaling costs add up fast. The ability to separate storage and compute does not exist in an on-premises Hadoop implementation, so costs grow as datasets grow. On top of that, to achieve significant insights, organizations need big datasets that cross many different sources, so this cost is part of the process.
  • DevOps burden is another factor. Based on our customers’ experience, you can assume 4-8 full-time employees for every 100 nodes.
  • Power costs can be as much as $800 per server per year based on consumption and cooling. That’s $80K per year for a 100-node Hadoop cluster!
  • Purchasing new and replacement hardware accounts for ~20% of TCO—roughly equal to the cost of administering the Hadoop clusters.

When the costs are all factored in, migration becomes an obvious decision.

But the power of a modern cloud-based analytics platform to address more advanced use cases and increased productivity is the Hidden Value of Hadoop Migration. By having all your data, analytics and AI on one unified data platform, you can combine the best of data warehouses and data lakes into a lakehouse architecture, enabling collaboration on all of your data, analytics and AI workloads. It makes your data processes faster and more streamlined, improves the productivity and collaboration of your data teams, and gives you the scale to handle game-changing machine learning use cases. These use cases can increase revenue, reduce costs, and mitigate risk. Companies that do not adopt a modern cloud-based analytics platform will find themselves falling further and further behind organizations that invest in using data to drive their business.

Calculating the business cost of legacy technology

A typical path is to try to take the Hadoop experience and recreate it in the cloud. But it brings the same problems in terms of performance limitations, use case limitations, and employee productivity. Databricks is built to tackle these challenges while driving value for customers who migrate off Hadoop in these three areas: infrastructure, productivity, and business impact. You can learn more about our approach and real-life examples in the white paper.

Determining these hidden costs isn’t always intuitive. To help, we created a cost framework for developing estimates on the impact migration will have on your organization.
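As a back-of-the-envelope illustration of such a framework (reusing the rough benchmarks cited above; the fully loaded FTE cost is an assumption, and hardware, licensing, and data center costs would be added similarly):

def annual_onprem_hadoop_cost(nodes, power_per_server=800, devops_ftes_per_100_nodes=6,
                              fte_cost=150_000):
    """Very rough annual on-premises Hadoop cost estimate; all inputs are illustrative."""
    power = nodes * power_per_server                      # e.g. ~$80K/year for 100 nodes
    devops = (nodes / 100) * devops_ftes_per_100_nodes * fte_cost
    return {"power": power, "devops": devops, "total": power + devops}

print(annual_onprem_hadoop_cost(nodes=100))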


What’s next

This blog just touched the surface of how Databricks can drastically cut down costs and boost your team’s business value. To dive deeper into these topics and real-life customer examples, read “The Hidden Value of Hadoop Migration.” For more information, visit www.databricks.com/migration.

 


Growth-Hacking Impact: Why I’m Joining Databricks


When I look at every wave of computing — be it the advent of mobile or the Internet — there’s a core theme in their underpinnings. The power behind the most impactful technologies, though often invisible, has been robust developer platforms. These platforms spur innovation, creativity and collaboration at a truly global scale and speed up development to keep up with rapidly evolving market and customer demands.

At the same time, data — in all of its forms — is the heart of nearly every enterprise. More and more companies are tapping into sophisticated data analytics and ML-enriched applications to innovate faster, increase profitability and stay steps ahead of their competition. I truly believe that enabling every organization to build these capabilities will unleash impact in life-changing ways, ranging from improving transportation logistics to driving sustainability to powering the discovery and accessibility of life-saving drugs.

Opportunities to build a platform with this impact potential only come along so often. And it’s no secret that I have a big passion for developer platforms — something I was fortunate to really dig into during my time at Google, where I led the Actions on Google platform. That’s why, today, I’m thrilled to announce that I’ve joined Databricks as SVP of Engineering!

So, why Databricks? The prospect of helping to build such an impactful platform is compelling enough. But here are a few other reasons…

The pioneers of innovation: The opportunity to work with a team that has practically defined the space made Databricks irresistible. The Databricks Lakehouse platform is a new, open architecture that combines the best elements of data lakes and data warehouses to support all data-driven use cases for the modern enterprise. They bring a true platform mindset, which is reflected in the adoption of the platform and the vibrant developer community that they have nurtured.

Customer-first mindset: Nearly every company works to be customer-focused, but customer obsession is unequivocally in Databricks’ DNA. At the highest level, Databricks’ platform is about bringing together data engineers, data scientists and analysts to solve some of the world’s toughest problems.

Massive adoption and market potential: With over 5,000 global customers — including more than 40% of the Fortune 500 — relying on Databricks’ Lakehouse platform for data engineering, machine learning and analytics, it’s clear that there’s a real need for what we’ve built.

The velocity of achievements Databricks has accomplished thus far is in many ways unprecedented. And this is just the beginning. I’m thrilled to be part of this journey.

Interested in joining Databricks?

Visit our career page to explore our global opportunities and to learn more about our people and the impact we’re making around the world.


Solution Accelerator: Telco Customer Churn Predictor


Skip directly to the notebooks referenced throughout this post.

When T-Mobile embraced the un-carrier label, they didn’t just kick off a marketing campaign; they fundamentally changed the dynamics in the US market for telecom. Previously, telecom had been a staid, utility-like industry with steady growth and subscribers locked into two-year contracts to cover a “free” handset with a phone plan. But three factors changed the nature of the business:

  1. Starting in 2004, users could change phone carriers but keep their phone number, eliminating one of the largest barriers to switching providers.
  2. Escalating handset prices led carriers to discontinue handset subsidies, resulting in the elimination of phone plan contracts.
  3. T-Mobile gained market share with aggressive data plan pricing combined with increased advertising spend, bringing a strong third competitor to what had previously been a duopoly.

These rapidly changing dynamics have moved telco providers from being utilities to being value-added service providers across multiple lines of business, including broadband, security, cable, and streaming video services. This, along with increased competition from new entrants, has accelerated communications service providers’ investment in personalized, frictionless customer experiences across all channels, at all times. Core to building these experiences is understanding where existing customers are in the subscription life cycle, and in particular identifying those most at risk for churn. Reducing churn continues to be one of the most strategic areas of focus for every provider, and the goal of many churn initiatives is to predict customer life cycle events and find ways to extend the life cycle profitably.

Introducing the Telco Customer Churn Predictor Solution Accelerator from Databricks

Based on best practices from our work with the leading communication service providers, we’ve developed solution accelerators for common analytics and machine learning use cases to save weeks or months of development time for your data engineers and data scientists.

This solution accelerator complements our work on customer lifetime value, attrition for subscription services, and profitable customer retention, but with a telco-specific lens.

Using sample telco datasets from IBM, and the Lifelines library, this solution accelerator will:

  • Introduce survival analysis, a collection of statistical methods used to examine and predict the time until an event of interest occurs.
  • Review three methods that are commonly used for survival analysis: Kaplan-Meier, Cox Proportional Hazards, Accelerated Failure Time.
  • Build a churn prediction model and use the model output as an input for calculating lifetime value.
  • Build an interactive dashboard for calculating the net present value of a given cohort of subscribers over a three-year time horizon.

The contents of this solution accelerator are contained in Databricks notebooks that are linked to at the end of this post.

About survival analysis

Survival analysis is a collection of statistical methods used to examine and predict the time until an event of interest occurs. This form of analysis originated in healthcare, with a focus on time to death. Since then, survival analysis has been successfully applied to use cases in virtually every industry around the globe.

In Telco specifically, use cases include:

  • Customer retention: It is widely accepted that the cost of retention is lower than the cost of acquisition. With the event of interest being a service cancellation, Telco companies can more effectively manage customer retention efforts by using survival analysis to better predict at what point in time specific customers are likely to be at risk of churning.
  • Hardware failures: The quality of experience a customer has with your products and services plays a key role in the decision to renew or cancel. The network itself is at the epicenter of this experience. With time to failure as the event of interest, survival analysis can be used to predict when hardware will need to be repaired or replaced.
  • Device and data plan upgrades: There are key moments in a customer’s lifecycle when changes to their plan take place. With the event of interest being a plan change, survival analysis can be used to predict when such change will take place and then actions can be taken to positively influence the selected products or services.

In contrast to other methods that may seem similar on the surface, such as linear regression, survival analysis takes censoring into account. Censoring occurs when the start and/or end of a measured value is unknown. For example, suppose our historical data includes records for the two customers below. In the case of customer A, we know the precise duration of the subscription because the customer churned in December 2020. For customer B, we know that the contract started four months ago and is still active, but we do not know how much longer they will be a customer. This is an example of right censoring because we do not yet know the end date for the measured value. Right censoring is what we most commonly see with this form of analysis.

Customer | Subscription Start Date | Subscription End Date   | Subscription Duration | Active Subscription Flag
A        | Feb 3, 2020             | Dec 2, 2020             | 10 months             | 0
B        | Nov 11, 2020            | Unknown (still active)  | 4 months              | 1

Figure 1: Survival Probability Curve

As illustrated above, we could move forward with a duration of four months for customer B, but this would lead to underestimating survival time. This problem is alleviated when using survival analysis since censoring is taken into account.
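A minimal sketch of how right censoring is handled in practice with the Lifelines library (the column names are illustrative, not the accelerator’s schema): active subscribers are passed with event_observed=0, so their durations are treated as censored rather than as observed churn.

import pandas as pd
from lifelines import KaplanMeierFitter

df = pd.DataFrame({
    "duration_months": [10, 4],  # customer A churned after 10 months; B has been active for 4
    "churned":         [1, 0],   # 1 = churn observed, 0 = right-censored (still active)
})

kmf = KaplanMeierFitter()
kmf.fit(durations=df["duration_months"], event_observed=df["churned"])

print(kmf.survival_function_)     # estimated P(survival >= t) at each observed time
print(kmf.median_survival_time_)  # median survival time implied by the curve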

Using survival analysis in production

After accounting for censoring, the key output of a survival analysis machine learning model is a survival probability curve. As shown in Figure 1 above, a survival probability curve plots time on the x-axis and survival probability on the y-axis. Starting at 0 months, this chart can be interpreted as saying: the probability of a customer staying at least 0 months is 100%. This is represented by the point (0, 1.0). Likewise, moving down the survival curve to the median (34 months) shows that a customer has a 50% probability of surviving at least 34 months, given that they have survived 33 months. Note that this last clause, “given that…”, signifies that this is a conditional probability.

Visualizing survival probability curves is particularly helpful when building a model and/or analyzing a model for inference. In many cases, however, the end goal is to use the output of a survival analysis model as an input for another model. For example, in this solution accelerator, we use the output of a survival analysis model as an input for calculating customer lifetime value. We then build an application that provides visibility into the net present value for a given cohort of users throughout a three-year time horizon. This is powerful because it enables marketers to understand what the payback period will be for various new customer acquisition campaigns. Similarly, one could use the output of the survival analysis model we build in this solution accelerator to align marketing messages to where consumers are in their customer journey.
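As a hedged illustration of feeding a survival curve into a lifetime value calculation (the discount rate, margin, and placeholder curve below are illustrative, not the accelerator’s actual inputs), the expected net present value of a subscriber can be approximated by discounting the expected monthly margin weighted by the survival probability at each month:

import numpy as np

def expected_npv(survival_probs, monthly_margin, annual_discount_rate=0.10):
    """Approximate NPV over the horizon covered by survival_probs, where
    survival_probs[t-1] = P(customer still active at month t)."""
    monthly_rate = (1 + annual_discount_rate) ** (1 / 12) - 1
    months = np.arange(1, len(survival_probs) + 1)
    discount = 1.0 / (1 + monthly_rate) ** months
    return float(np.sum(np.asarray(survival_probs) * monthly_margin * discount))

# Example: a 36-month horizon with a placeholder survival curve
survival_probs = np.linspace(0.99, 0.50, 36)
print(expected_npv(survival_probs, monthly_margin=40.0))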

In practice, the reference architecture that enables these types of use cases in production resembles the following:

Recommended analytics architecture for enabling customer churn prediction use cases.

Getting started

The goal of this solution accelerator is to help you leverage survival analysis for your own customer retention use case as quickly as possible. As such, this solution accelerator contains an in-depth review of commonly used methods: Kaplan-Meier, Cox Proportional Hazards, and Accelerated Failure Time. Get started today by importing this solution accelerator directly into your Databricks workspace.

You can also view our on-demand webinar on Survival Analysis in Telecommunications.


 

Try the notebooks


Announcing General Availability (GA) of the Power BI connector for Databricks


We are excited to announce General Availability (GA) of the Microsoft Power BI connector for Databricks for Power BI Service and Power BI Desktop 2.85.681.0. Following the public preview, we have already seen strong customer adoption, so we are pleased to extend these capabilities to our entire customer base. The native Power BI connector for Databricks in combination with the recently launched SQL Analytics service provides Databricks customers with a first-class experience for performing BI workloads directly on their Delta Lake. SQL Analytics allows customers to operate a multi-cloud lakehouse architecture that provides data warehousing performance at data lake economics for up to 4x better price/performance than traditional cloud data warehouses.
 
 
Call Detail Record (CDR) Power BI report to visualize over 1 billion rows of data
The Power BI connector for Databricks enables seamless connectivity through the following capabilities:

Support for Azure Active Directory (Azure AD) and SSO: Users can use their Azure AD credentials to connect to Azure Databricks. Administrators no longer need to generate Personal Access Tokens (PATs) for authentication.

Simple connection configuration: The Databricks connector is natively integrated into Power BI. Connections to Databricks are configured with a couple of clicks — users select Databricks as a data source, enter Databricks-specific connection details and authenticate. Just like that, you are ready to query the data!

Secure and direct access to Azure Data Lake Storage via DirectQuery: When using Power BI DirectQuery, data is directly accessed in Databricks, enabling users to query and visualize large datasets. DirectQuery results are always fresh and Delta Lake data security controls are enforced.

Faster results via Databricks ODBC: The Databricks ODBC driver is optimized with reduced query latency, increased result transfer speed and improved metadata retrieval performance.

Get started with the Power BI connector

The enhanced Power BI connector for Databricks is the result of an ongoing collaboration between Databricks and Microsoft. Attend a Quickstart Lab to get hands-on experience with Databricks and connect to it using the Power BI connector.


Analyzing Algorand Blockchain Data With Databricks Delta (Part 2)


This post was written in collaboration between Eric Gieseke, principal software engineer at Algorand, and Anindita Mahapatra, solutions architect at Databricks.

 
Algorand is a public, decentralized blockchain system that uses a proof of stake consensus protocol. It is fast and energy efficient, with a transaction commit time under five seconds and a throughput of one thousand transactions per second. Blockchain is a disruptive technology that will transform many industries including Fintech. Algorand, being a public blockchain, generates large amounts of transaction data and this provides interesting opportunities for data analysis.

Databricks provides a Unified Data Analytics Platform for massive-scale data engineering and collaborative data science on multi-cloud infrastructure. This blog post will demonstrate how Delta Lake facilitates real-time data ingestion, transformation, and SQL Analytics visualization of the blockchain data to provide valuable business insights. SQL is a natural choice for business analysts who benefit from SQL Analytics’ out-of-box visualization capabilities. Graphs are also a powerful visualization tool for blockchain transaction data. This article will show how Apache Spark™ GraphFrame and graph visualization libraries can help analysts identify significant patterns.

This article is the second part of a two-part blog series. In part one, we demonstrated the analysis of operational telemetry data. In part two, we will show how to use Databricks to analyze the transactional aspects of the Algorand blockchain. A robust ecosystem of accounts, transactions and digital assets is essential for the health of the blockchain. Assets are digital tokens that represent reward tokens, cryptocurrencies, supply chain assets, etc. The Algo digital currency price reflects the intrinsic value of the underlying blockchain. Healthy transaction volume indicates user engagement.

Data processing includes ingestion, transformation and visualization of Algorand transaction data. The insights derived from the resulting analysis will help in determining the health of the ecosystem. For example:

  • Which assets are driving transaction volume? What is the daily trend of transaction volume? How does it vary over time? Are there certain times of the day that transaction volumes peak?
  • Which applications or business models are driving growth in the number of accounts or transactions?
  • What is the distribution of asset types and transaction types? Which assets are most widely used, and which assets are trending up or down?
  • What is the latest block? How long did it take to create, and what are the transactions that it contains?
  • Which are the most active accounts, and how does their activity vary over time?
  • What is the relationship between accounts? Is it possible to detect illicit activity?
  • How does the Algo price vary with the transaction volume over time?  Can fluctuations in price or volume be predicted?

Algorand network
The Algorand Blockchain network is composed of nodes and relays hosted on servers connected via the internet. The nodes provide the compute and storage needed to host the immutable blocks. The blocks hold the individual transactions committed on the blockchain, and each block links to the preceding block in the chain.

Figure 1: Each block is connected to the prior block to form the blockchain

Algorand data

Data Type: Node Telemetry (JSON data from the ElasticSearch API)
  What: Peer connection data that describes the network topology of nodes and relays
  Why: Gives a real-time view of where the nodes and relays are, how they are connected and communicating, and the network load
  Where: The nodes periodically transmit this information to a configured ElasticSearch endpoint

Data Type: Block, Transaction, Account Data (JSON/CSV data from S3)
  What: Transaction data committed into sequentially chained blocks, plus individual account balances
  Why: Gives visibility into usage of the blockchain network and the people (accounts) transacting; each account is an established identity, and each transaction/block has a unique identifier
  Where: The Algorand blockchain generates block, account, and transaction data; the Algorand Indexer aggregates the data, which is accessible via a REST API

Block, transaction and account data

The Algorand blockchain uses an efficient and high-performance consensus protocol based on proof of stake, which enables a throughput of 1,000 transactions per second.

Transactions transfer value between Accounts. Blocks aggregate Transactions that are committed to the blockchain. A new Block is created in less than 5 seconds and is linked to the previous Block to form the blockchain.

Figure 2: Entity Relation

Analytics workflow

The following diagram describes the data flow. Originating from the Algorand Nodes, blocks are aggregated by the Algorand Indexer into a Postgres database. A Databricks job uses the Algorand Python SDK to retrieve blocks from the Indexer as JSON documents and stores them in an S3 bucket. Using the Databricks Autoloader, the JSON documents are auto-ingested from S3 into Delta Tables as they arrive. Additional processing converts the block records to transaction records in a silver table. Then, from the transaction records, aggregates are produced in the gold table. Both the silver and gold tables support visualization and analytics.

Figure 3: Analytic Workflow

As the blockchain creates new blocks, our streaming pipeline automatically updates the Delta tables with the latest transaction and account data. The processing steps follow:

  1. Data ingestion: Fetch data as JSON files using the Algorand Indexer V2 SDK into an S3 bucket:
  • Block data containing transactions (sender, payer, amount, fee, time, type, asset type)
  • Account balances (account, asset, balance)
  • Asset data (asset id, asset name, unit name)
  • Algo trading details using the CryptoCompare API (time, price, volume). This trading data augments the transaction data to correlate the Algo price with transaction activity.
  2. Auto data loader: The Databricks Auto Loader adds new block data into the ‘bronze’ Delta table as a real-time streaming job.
  3. Data refinement: A second streaming job reads the block data from the bronze table, flattens the blocks into individual transactions, and stores the resulting transactions in the ‘silver’ Delta table. Additional transformations include:
  • Compute statistics for each block, e.g., block time and number of transactions
  • A User-Defined Function (UDF) extracts the note field from each transaction, decodes it, and removes non-ASCII characters
  • Compute word counts from the processed note field to determine trending words
  4. In-stream aggregation: Compute in-stream aggregation statistics on transaction data from the silver table and persist them into the ‘gold’ Delta table:
  • Compute count, sum, average, min, and max of transaction amounts grouped hourly over each asset type to study trends over time
  5. Analysis: Perform data analysis and visualization using the silver and gold tables.
  • Using GraphFrames & pyvis:
    • Create vertices and edges from the transaction data to form a directed graph representing accounts as vertices and transactions as edges.
    • Using Graph APIs, analyze the data for top users and their incoming/outgoing transactions.
    • Visualize the resulting graph using pyvis.
  • Using SQL Analytics:
    • Use SQL queries to analyze data in the Delta Lake and build parameterized Redash dashboards with alerting webhooks.

Figure 4: Multihop Data Flow

Step 1: Data ingestion into S3
This notebook is run as a periodic job to retrieve new Algorand blocks as JSON files from Algorand Indexer V2 into the S3 bucket location. It also retrieves and refreshes the asset information.

  • With %pip, install the notebook-scoped library.
  • Databricks Secrets securely stores credentials and sensitive information.
  • For the initial bulk load of historical block data, a Spark UDF utilizes the distributed computing of the Spark worker nodes.
%pip install py-algorand-sdk

def getBlocks_map(round, indexer_token, indexer_address):
    from algosdk.v2client import indexer
    import json
    import boto3

    # Fetch the block for the given round from the Indexer and write it to S3 as JSON
    myindexer = indexer.IndexerClient(indexer_token=indexer_token, indexer_address=indexer_address)
    s3 = boto3.client('s3')
    block_data = myindexer.block_info(block=round)
    s3.put_object(Body=json.dumps(block_data), Bucket="delta-autoloader",
                  Key="algorand-mainnet-blocks/" + str(round) + ".txt")

getBlocksMapUDF = udf(getBlocks_map)

# Blocks are sequentially numbered; a chunk of blocks is retrieved starting from starting_block
# Each node processes X sequential blocks
dataset = spark.range(starting_block, starting_block + numBlocks, blockChunk)

result = dataset.select(getBlocksMapUDF("id", lit(indexer_token), lit(indexer_address)))
A transaction can be associated with any asset type. The asset information has details on each asset created on the blockchain, including the ID, unit, name and decimals. The decimals specify the number of decimal places for the amount. For example, Tether (USDt) amounts are adjusted by 2 decimal places, Meld Gold & Silver by 5, whereas Bitcoin needs no adjusting. A UDF adjusts the amount by assetId during the aggregation phase, as sketched below.
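A rough sketch of what such a decimal-adjustment UDF could look like (the asset ids and decimal counts in the mapping are placeholders; the accelerator derives them from the asset information retrieved above):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Placeholder mapping of asset-id -> decimals, built from the asset information table
asset_decimals = {111111: 2, 222222: 5}   # hypothetical ids for a USDt-like and a Meld-like asset

def adjust_amount(asset_id, amount):
    """Scale a raw integer amount by the asset's configured number of decimal places."""
    if amount is None:
        return None
    return float(amount) / (10 ** asset_decimals.get(asset_id, 0))

adjust_amount_udf = udf(adjust_amount, DoubleType())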

  • This notebook runs periodically to retrieve Algo trading information on a daily and hourly basis.
  • Data is converted to a Spark dataframe and persisted in a Delta table.
%pip install cryptocompare
import cryptocompare

res_hourly = cryptocompare.get_historical_price_hour('ALGO', 'USD', limit=history_limit, exchange='CCCAGG', toTs=datetime.now())
Step 2: AutoLoader
This notebook is the primary notebook that drives the streaming pipeline for block data ingestion.
  • The Autoloader incrementally and efficiently processes new data files as they arrive in S3 using a Structured Streaming source named ‘cloudFiles’
df_autoloader = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      ...

      .schema(raw_schema)
      .load("/mnt/algo-autoload") )
  • The data engineer can monitor live stream processing with the live graph in the notebook (using display command) or using the Streaming Query Statistics tab in the Spark UI.

Figure 5: (A) Live Graph and (B) Streaming Query Statistics tab in the Spark UI

  • The stream is written out in micro-batches into Delta format. A Delta table is created to point to the data location for easy SQL access. Structured Streaming uses checkpoint files to provide resilience to job failures and ‘exactly once’ semantics.
(df_autoloader
  .writeStream                                        # Write the stream
  .format("delta")                                    # Use the Delta format
  .option("checkpointLocation", bronzeCheckpointPath) # Specify where to log
  .option("path", bronzeOutPath)                      # Specify the output path
  .outputMode("append")                               # Append to the output path
  .queryName("delta_block_bronze")                    # The name of the stream
  .start()                                            # Start the operation
)

Step 3: Stream refinement
Stream from the bronze table, flatten the transactions and persist the transaction stream into the silver table.

  • ‘round’ is the block number.
  • ‘timestamp’ is the time the block was committed.
block_df = spark.readStream.format("delta").table(bronze_tbl)

tx_df = block_df.select("genesis-hash", "genesis-id", "previous-block-hash", "rewards", "round",
                        "timestamp", from_unixtime("timestamp", 'yyyy-MM-dd').alias("tx_date"),
                        explode("transactions").alias("tx"))
Step 4: In-stream aggregations
Compute in-stream aggregations and persist into the ‘gold’ Delta table.
  • Read from the silver Delta table as a streaming job.
  • Compute aggregation statistics on transaction data over a sliding window on tx_time. An interval of one hour specifies the aggregation time unit. The watermark allows late-arriving data to be included in the aggregates. In distributed and networked systems, there is always a chance of disruption, which is why it is necessary to preserve the state of the aggregate for a little longer (keeping it indefinitely would exceed memory capacity).
  • Persist into the gold Delta table
(spark.readStream
      .format("delta").table("algo.silver_transactions")
      .select("timestamp",to_timestamp("timestamp").alias("ts"),
              "tx.asset-transfer-transaction.asset-id",
              "tx.asset-transfer-transaction.amount")
             .withWatermark("ts", "1 hour")
             .groupBy(window("ts", "1 hour"), "asset-id")
             .agg(avg("amount").alias("avg_amount"), count("amount").alias("cnt_tx"),
                  max("amount").alias("max_amount"),min("amount").alias("min_amount"),
                  sum("amount").alias("sum_amount"))
            .select(year("window.end").alias("yr"), month("window.end").alias("mo"),
                    dayofmonth("window.end").alias("dy"),
                    weekofyear("window.end").alias("wk"),
                    hour("window.end").alias("hr"),"asset-id", "avg_amount", "cnt_tx",
                    "max_amount","min_amount","sum_amount")
.writeStream
.format("delta")
.option("checkpointLocation", goldCheckpointPath)
.option("path", goldOutPath )
.outputMode("append")
.queryName("delta_gold_agg")
.start()
)
Step 5a: Graph analysis
A graph is an intuitive way to model data to discover inherent relationships. With SQL, single hop relations are easy to identify, and graphs are better suited for more complex relationships. This notebook leverages graph processing on the transaction data. Sometimes it is necessary to push data to a specialized graph database. With Spark, it is possible to use the Delta Lake data by applying Graph APIs directly. The notebook utilizes Spark’s distributed computing with the Graph APIs’ flexibility, augmented with additional ML models – all from the same source of truth.

While blockchain reduces the potential for fraud, some risk always remains, and graph semantics can help discover indicative features. Properties of a transaction other than the sender/receiver account ids are more useful in detecting suspicious patterns. A single actor can hide behind multiple identities on the blockchain: for example, a fan-out from a single account to multiple accounts through several layers of intermediate accounts, followed by a convergence to a target account, where the original source and target accounts are distinct but in reality map to the same user.

  • Create an optimized Delta table for the graph analysis Z-ordered by sender & receiver
%sql
CREATE TABLE IF NOT EXISTS algo.tx_graph (
  sender STRING,
  receiver STRING,
  amount BIGINT,
  fee INT,
  tx_id STRING,
  tx_type STRING,
  tx_date DATE)
  USING DELTA
  PARTITIONED BY (tx_date)
  LOCATION 'dbfs:/mnt/algo-processing/blocks/silver/graph/data';
OPTIMIZE algo.tx_graph ZORDER BY (sender, receiver);
  • Create Vertices, Edges and construct the transaction Graph from it
%scala
val vertices_df = sqlContext.sql("SELECT tx.from as id FROM algo.bronze_block UNION SELECT tx.payment.to as id  FROM algo.bronze_block").distinct()

val edges_df = sqlContext.sql("SELECT tx_id, sender as src, receiver as dst, fee as fee, amount as amount, tx_type as type FROM algo.tx_graph")

val txGraph = GraphFrame(vertices_df, edges_df)

println("Total Number of accounts: " + txGraph.vertices.count)
println("Total Number of Tx in Graph: " + txGraph.edges.count)
  • Once the graph is in memory, analyze user activity such as:
// Which accounts are the heavy users
val topTransfers = txGraph.edges
  .groupBy("src", "dst")
  .count()
  .orderBy(desc("count")).limit(10)

// Highest transfers into an account
val inDeg = txGraph.inDegrees
display(inDeg.orderBy(desc("inDegree")).limit(15))

// Highest transfers out of an account
val outDeg = txGraph.outDegrees
display(outDeg.orderBy(desc("outDegree")).limit(15))

// Degree ratio of inbound Vs outbound transfers
val degreeRatio = inDeg.join(outDeg, inDeg.col("id") === outDeg.col("id"))
  .drop(outDeg.col("id"))
  .selectExpr("id", "double(inDegree)/double(outDegree) as degreeRatio")
degreeRatio.cache()
display(degreeRatio.orderBy(desc("degreeRatio")).limit(10))
  • PageRank measures the importance of a vertex (i.e., account) using link analysis over the directed edges. It can be run either for a fixed number of iterations or until it converges.
// Run page rank with 10 iterations
val rank = txGraph.pageRank.maxIter(10).run().vertices

var rankedNodes = rank.orderBy(rank("pagerank").desc)
display(rankedNodes) 
  • SP745JJR4KPRQEXJZHVIEN736LYTL2T2DFMG3OIIFJBV66K73PHNMDCZVM was on top of the list. Investigation shows this account processes asset exchanges, which explains the high activity.
D3 chord visualization right inside the Databricks notebook can help show relationships, especially fan-out/in type transactions, using the top active accounts.

The notebook converts the graph transaction data into an N x N adjacency matrix, and each vertex is assigned a unique color along the circumference (a rough sketch of building the matrix follows Figure 6).

The chart uses the first 6 characters of the account ids for readability.

Figure 6: Account Interaction using D3 Chord
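A rough sketch of building that adjacency matrix from the algo.tx_graph table (aggregating by total amount transferred here; the notebook may aggregate differently):

import pandas as pd

edges_pdf = spark.table("algo.tx_graph").toPandas()

# Truncate account ids to their first 6 characters for readability
edges_pdf["src6"] = edges_pdf["sender"].str[:6]
edges_pdf["dst6"] = edges_pdf["receiver"].str[:6]

# N x N adjacency matrix: rows are senders, columns are receivers
adjacency = pd.crosstab(edges_pdf["src6"], edges_pdf["dst6"],
                        values=edges_pdf["amount"], aggfunc="sum").fillna(0)
matrix = adjacency.values.tolist()  # the list-of-lists layout a d3 chord expects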

Motif Search is another powerful search technique to find structural patterns in the graph using a simple DSL (Domain Specific Language).
  • For example, to find all paths (intermediate vertices and edges) between a given start (A) and end vertex (D), separated by say 3 hops (e1, e2, e3), one can use an expression like:
val motifs_3_hops =
txGraph.find("(A)-[e1]->(B);(B)-[e2]->(C);(C)-[e3]->(D)")

Figure 7: Algorand blockchain transaction data visualized with pyvis, by time range or for a selected user

A: Distinct clusters form around the most active accounts.
B: Different types of graphs can be constructed: account to account, with nodes representing accounts and edges representing transactions. These are directed graphs, with arrows going from sender to receiver, and the thickness of an edge indicates the volume of traffic (see the pyvis sketch after this list).
C: Another method is to give each asset a different color to see how the various assets interact. The diagram above shows a subset of the assets, and the observation is that they are generally distinct, with some overlap.
D: Zooming into a vertex displays the account id, which aligns with the top senders from the Graph APIs.
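A minimal pyvis sketch along the lines of panel B (limiting to the heaviest account-to-account edges; the limit and styling are illustrative):

from pyvis.network import Network

top_edges = (spark.table("algo.tx_graph")
             .groupBy("sender", "receiver").count()
             .orderBy("count", ascending=False)
             .limit(200)
             .toPandas())

net = Network(height="600px", width="100%", directed=True, notebook=True)
for _, row in top_edges.iterrows():
    net.add_node(row["sender"], label=row["sender"][:6])
    net.add_node(row["receiver"], label=row["receiver"][:6])
    # Edge thickness scales with the number of transactions between the two accounts
    net.add_edge(row["sender"], row["receiver"], value=int(row["count"]))

displayHTML(net.generate_html())  # render inline; net.show("tx_graph.html") writes a file instead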

Step 5b: SQL Analytics
SQL Analytics offers a performant and full-featured SQL-native query editor that allows data analysts to write queries in a familiar syntax and easily explore Delta Lake table schemas. Queries can be saved in a catalog for reuse and are cached for quicker execution. These queries can be parameterized and set to refresh on an interval and are the building blocks of dashboards that can be created quickly and shared. Like the queries, the dashboards can also be configured to automatically refresh with a minimum refresh interval of a minute and alert the team to meaningful changes in the data. Tags can be added to queries and dashboards to organize them into a logical entity.

The dashboard is divided into multiple sections and each has a different focus:

1. High-level blockchain stats: Provides a general overview of key aggregate metrics indicating the health of the blockchain.
2. Algo Price and Volume: Monitors the Algo cryptocurrency price and volume for correlation with blockchain stats.
3. Latest (‘Last’) block status: Provides stats on the most recent block, which is an indicator of the operational state of the blockchain.
4. Block Trends: Provides a historical view of the number of transactions per block and the time required to produce each block.
5. Transaction Trends: Provides a more detailed analysis of transaction activity, including volume, transaction type, and assets transferred.
6. Account Activity: Provides a view of account behavior, including the most active accounts and the assets transferred between them.

Section 1: High-level blockchain stats
This section is a bird’s-eye view of aggregate stats, including the count of distinct asset types, transaction types, and active accounts in a given time period.

Figure 8: High-level details of the Algorand Blockchain

A: The cumulative number of active accounts in a given time period
B: The average number of transactions per block in the given time period
C: The cumulative number of distinct assets used in the given time period
D: The cumulative number of distinct transaction types in the given time period
E: A word cloud representing the top trending words extracted from the note field of the transactions
F: An alphabetic listing of the asset types. Each asset has a unique identifier, a unit name, and the total number of assets available.

Section 2: Algo price and volume
This section provides price and volume data for Algos, the Algorand cryptocurrency. The price and volume are retrieved using the CryptoCompare API to correlate with the transaction volume.

Figure 9: Price and Volume data for Algos

A: Shows trading details daily (Algo price on left axis and volume traded on right axis) since a given date
B: Shows the same on an hourly basis for a given day

Section 3: Latest (‘Last’) block status
The latest block stats are an indicator of the operational state of the blockchain. For example, if the number of transactions in a block or the amount of time it takes to generate a block moves outside acceptable thresholds, it could indicate that the underlying blockchain nodes may not be functioning optimally.

Figure 10: Latest (‘Last’) block status of the Algorand Blockchain

A:  The latest Block number
B:  Number of transactions in the most recent block
C: Time in seconds for the block to be created
D: The distribution of transaction types in this block
E: The asset type distribution for each transaction type within this block. Pay transactions use Algo and are not associated with an asset type.
F: The individual transactions within this block
Section 4: Block trends
This section is an extension of the previous and provides the historical view of the number of transactions per block and the time required to produce each block.

Figure 11: Per block trends

A:  The number of transactions per block has a few spikes but shows a regular pattern. Transaction volume is significant since it reflects user adoption on the Algorand blockchain.

B:  The time in seconds to create a new block is always less than 5 seconds. The latency indicates the health of the blockchain network and the efficiency of the consensus protocol. An alert monitors this critical metric to ensure that it remains below 5 seconds.

Figure 12: Configuring thresholds for Alert notifications

Section 5: Transactions trends
This section provides a more detailed analysis of transaction activity, including volume, transaction type and assets transferred.

Figure 13: Trends in Transactions

A: Asset statistics (count, average, min, max, sum on the amount) by hour and asset type
B: For a given day, the trend of the transaction count by hour
C: Average transaction volume by hour across the entire time period. The pattern is similar to the previous chart: hour 10 appears to be the trough and hour 21 the crest, possibly on account of Asia's markets waking up.
D: The distribution of transaction types on a daily basis shows a high number of asset transfers (axfer) followed by payments with Algos (pay)
E: The transaction type distribution is the same for the selected day
F: The transaction volume distribution by asset type
G: The transaction volume by asset id over time on a daily basis. YouNow and Planet are the top asset ids traded in the given period.
H: The max transaction amount by asset id over time on a daily basis

Section 6: Account activity
This section provides a view of current account activity, including the most active accounts and the assets transferred. A Sankey diagram illustrates the flow of assets between the most active accounts.


Figure 14: Top Accounts by transaction volume

A: Top Senders by transaction volume
B: Daily transaction volumes of the identified top 20 senders
C: Sankey diagrams are useful for capturing behavioral flows and sequences. An individual transaction is a point-in-time view of a single sender and receiver; how transactions flow on either side tells the bigger story and helps us understand the hidden nuances of a large source or sink account and the contributors along the path.

Summary

This post has shown the Databricks platform's versatility for analyzing transactional blockchain data (block, transaction and account) from the Algorand blockchain in real time. Apache Spark's open source distributed compute architecture and Delta Lake provide scalable, performant cloud infrastructure with reliable, real-time data streaming and curation. Machine learning practitioners and data analysts perform multiple types of analysis on all the data, in place, on a single platform. SQL, Python, R and Scala provide the tools for exploratory data analysis. Graph algorithms are applied to analyze account behavior. With SQL Analytics, business analysts derive better business insights through powerful visualizations using queries directly on the data lake.

Try the notebooks


Databricks Named a Leader in 2021 Gartner Magic Quadrant for Data Science and Machine Learning Platforms


Today, we’re pleased to announce that Databricks has been named a Leader in the 2021 Gartner Magic Quadrant for Data Science and Machine Learning Platforms for the second year running. This recognition builds off an already momentous kickstart to the year—including our recent funding round (at a $28B valuation)—and we believe it is a testament to our healthy obsession with building the lakehouse platform: one platform to unify all of your data, analytics, and AI workloads.


AI is driving innovation and disruption across every industry

Why are we so excited about this? Industry leaders predict that data science and ML will drive trillions of dollars in value across all industries. In a May 2020 Gartner report: Top 10 Trends in Data and Analytics, 2020, analysts predict that “By the end of 2024, 75% of enterprises will shift from piloting to operationalizing AI, driving 5X increase in streaming data and analytics infrastructures.”

Databricks’ ability to execute and completeness of vision has led to our positioning in two Gartner Magic Quadrant reports: the November 2020 Gartner Magic Quadrant for Cloud Database Management Systems (DBMS) and the March 2021 Gartner Magic Quadrant for Data Science and Machine Learning Platforms.

Databricks Lakehouse platform helps data teams solve the world’s toughest problems. Here are just some of the ways customers across industries are leveraging Databricks to level-up their data-driven decision-making and business outcomes:

  • Shell delivers innovative energy solutions for a cleaner world. Hampered by large volumes of data, Shell chose Databricks to be a foundational component of its Shell.ai platform. Databricks empowers hundreds of Shell’s engineers, scientists, and analysts to innovate together as part of their ambition to deliver cleaner energy solutions more rapidly and efficiently.
  • Comcast unlocks the future of entertainment with AI. Comcast struggled with massive data, fragile data pipelines, and insufficient data science collaboration. With our solutions, including Delta Lake and MLflow, they can create an innovative, unique, and award-winning viewer experience using voice recognition and ML.
  • Regeneron discovers new treatments with AI. Regeneron’s mission is to leverage genomic data to bring new medicines to patients in need. Yet, transforming this data into targeted therapies is challenging. Databricks empowers them to quickly analyze entire genomic data sets to accelerate the discovery of new therapeutics.

A Unified Approach for the Full Data and ML Lifecycle

Today’s data leaders must look at the entire data and machine learning landscape when considering new solutions. The 2021 Gartner Magic Quadrant for Data Science and Machine Learning Platforms is based on the rigorous evaluation of 20 vendors for their completeness of vision and ability to execute. We believe these are three key strengths of Databricks:

Collaborative Platform for the Full ML Lifecycle: Databricks empowers data science and machine learning teams with one unified platform to prepare and process data, train models in a self-service manner, and manage the full ML lifecycle from experimentation to production. No matter your background, skill set or favorite tools, Databricks makes it easy to collaborate, access the compute power you need, and keep all data science projects in one place – on a managed, secure, and scalable platform.

Open Source Leadership and Rapid Adoption: Open source is in our DNA. We are the original creators of widely-used data and machine learning open source projects, including MLflow, an open platform to manage the ML lifecycle: experimentation, reproducibility, deployment and a central model registry. MLflow is on its way to becoming the de facto standard for managing the ML lifecycle — it currently averages 2.2 million monthly downloads and has over 260 contributors.

Multi-cloud Platform: Databricks is built for multi-cloud enterprises, offering a unified data platform on Microsoft Azure, AWS and Alibaba Cloud. Users can do data science and machine learning without having to learn cloud-specific tools and processes. Oh, and don’t just take our word for it. Earlier this month, we raised a $1 billion funding round with the buy-in of the “cloud elite”: Amazon, Google, Microsoft and Salesforce.

At Databricks, our ultimate goal is to empower customers to make better, faster use of data with one simple, open platform for analytics, data science, and ML that brings together teams, processes, and technologies. Read the Gartner Magic Quadrant for Data Science and Machine Learning Platforms to learn more.

Sources:
Gartner, Magic Quadrant for Data Science and Machine Learning Platforms, Peter Krensky, Carlie Idoine, Erick Brethenoux, Pieter den Hamer, Farhan Choudhary, Afraz Jaffri, Shubhangi Vashisth, 01 March 2021

Gartner, Magic Quadrant for Cloud Database Management Systems, Donald Feinberg, Merv Adrian, Rick Greenwald, Adam Ronthal, Henry Cook, 23 November 2020

Gartner, Top 10 Trends in Data and Analytics, 2020, Rita Sallam, Svetlana Sicular, Pieter den Hamer, Austin Kronz, W. Roy Schulte, Erick Brethenoux, Alys Woodward, Stephen Emmott, Ehtisham Zaidi, Donald Feinberg, Mark Beyer, Rick Greenwald, Carlie Idoine, Henry Cook, Guido De Simoni, Eric Hunter, Adam Ronthal, Bettina Tratz-Ryan, Nick Heudecker, Jim Hare, Lydia Clougherty Jones, 11 May 2020

Disclaimer: Gartner does not endorse any vendor, product or service depicted in its research publications.

--

Try Databricks for free. Get started today.

The post Databricks Named a Leader in 2021 Gartner Magic Quadrant for Data Science and Machine Learning Platforms appeared first on Databricks.

How S&P Global Leverages ESG Data to Help Customers Make Sustainable Investments


Don’t miss our Global Sustainability Leadership Forum to hear how leaders from Goldman Sachs, Grab, Gro Intelligence and Microsoft are using data and AI to build a more sustainable, inclusive and greener future.

 
Environmental, social and governance (ESG) investing is a form of sustainable responsible investing (SRI) that considers both the financial returns and overall impact an investment has on the world. This is a growing focus for organizations looking to build more ethical portfolios to both drive social good and meet shifting consumer demands. These initiatives are also good for businesses, as ESG funds frequently outperform other funds. Unfortunately, wealth advisors, portfolio managers and asset allocators face three key challenges with sustainable investing:

  1. Lack of standardization around how companies disclose sustainability metrics
  2. Inability to effectively monitor ESG data
  3. Lack of automation in collecting the data due to its unstructured, unstandardized nature

Ultimately, taking a data- and AI-driven approach to sustainable investing closes all three of these gaps to ensure smarter, more informed investment decisions.

S&P Global (S&P for short) is an example of how organizations are driving growth while making ESG or sustainability initiatives core to their business. S&P Global enables financial services professionals to take a much more data-driven approach by providing companies with practical, useful data and expert analysis. In this on-demand webinar, Junta Nakai, Industry Leader for Financial Services and Sustainability at Databricks and Mark Avallone, Vice President of Architecture at S&P Global Market Intelligence, discuss real-world examples of the value of ESG data and third-party datasets to guide the investment process and make decisions that generate sustainable alpha.

The promise of a data-driven tomorrow

Just like every other area of business, financial intermediation is under a lot of pressure (fee compression, margin compression, asset investments, etc.) to perform. By taking a much more data-driven approach to ESG factors, professionals can more easily reassess and benchmark their business, redefine future plans and reinvent themselves if needed, ultimately relieving much of the related stress.

The promise of data can often sound too good to be true, but S&P has proven otherwise with its ability to leverage ESG data and artificial intelligence to help its customers make smarter socially responsible investments. Through their Global Marketplace, companies can access alternative and third-party ESG data such as social media and other non-financial data sources, and extract intermediary insights to make more sustainable decisions.

“Our ESG scores offer a really quick and dirty way to get working with ESG data,” explained Mark Avallone, Vice President of Architecture at S&P Global. “Through the use of Databricks, we’re able to use AI to parse unstructured data and to evaluate numerous factors and how companies are operating against them. The result is a score that’s reliable and standardized because it’s based on data that’s been collected over a long period of time. It’s incredibly important to have that history; you can’t just take a snapshot and base business decisions on it.”

S&P leverages the Databricks Lakehouse platform to analyze various ESG data sources to produce normalized ESG benchmarks to make smarter investment decisions.

Thanks to Databricks, S&P is also able to drill down and offer insights on more granular unstructured data, such as environmental and weather data, as well as a marketplace for customers to explore different offerings. Enabled by a Lakehouse architecture on Databricks, S&P can easily federate their data and build reliable, performant data pipelines for all forms of downstream analytics, from business intelligence dashboards for customers to machine learning models that deliver ESG score predictions.

For example, they have the ability to model the impact of weather on airlines. “It’s a massive data set with over 9 million flights every year in the US alone,” said Mark. “Combined with weather data, we can determine trends over decades of time.”

S&P uses Databricks to help extract insights across a range of ESG issues.

Linking important information for better business decision-making (like airline and weather data) is what S&P is able to do for their customers on a regular basis. “We provide standardized data and wherever possible, the hooks in the data that link it to other data sets,” said Mark. “For years we’ve managed regulated financial attributes for millions of companies, and providing the ability to cross-reference environmental data with asset data is extremely powerful.”

Databricks enables S&P to leverage machine learning to predict ESG risk scores and mitigate investment risk.

Delivering self-service ESG Analytics through Databricks

It’s so powerful, in fact, that S&P is productizing it. The company offers an API that can model physical risk over the next 10-50 years across a number of climate-related factors and then cross-references that to the geolocation of various assets. This enables customers to pick assumptions based on climate change and get a quick readout of exactly what the risks are across a number of related dimensions. And all of this is made possible thanks to the Databricks Lakehouse Platform, an open, simple platform to store and manage all of your data for all of your analytics workloads.

Through the Databricks notebooks environment, S&P has developed a customer-facing Analytics Workbench, which allows customers to easily access available alternative and ESG data, unify it with their own proprietary data, and explore it to gain deeper insights. All of these capabilities combine to enable customers to make smarter and more sustainable investments.

The S&P Analytics Workbench, powered by Databricks, empowers their customers to unify ESG data with proprietary data to gain deeper and more targeted insights.

Start taking a data-driven approach to sustainable investing

Ready to get started on your ESG journey? Check out these resources:

--

Try Databricks for free. Get started today.

The post How S&P Global Leverages ESG Data to Help Customers Make Sustainable Investments appeared first on Databricks.

Upgrade Production Workloads to Be Safer, Easier, and Faster With Databricks Runtime 7.3 LTS


What a difference a year makes. One year ago, Databricks Runtime (DBR) 6.4 was released, followed by 8 more DBR releases. But now it's time to plan for an upgrade to 7.3 for Long-Term Support (LTS) and compatibility, as support for DBR 6.4 will end on April 1, 2021. (Note that a new DBR 6.4 (Extended Support) release was published on March 5 and will be supported until the end of the year.) Upgrading now allows you to take advantage of all the improvements from 6.4 to 7.3 LTS, which has long-term support until September 2022. This blog highlights the major benefits of doing so.

DBR 6.4 is the last supported release of the Apache Spark 2.x code line. Spark 3.0 was released in June 2020, bringing with it a whole bevy of improvements. Databricks Runtime 7.3, built on the Apache Spark 3.x code line, includes many new features for Delta Lake and the Databricks platform as a whole, resulting in these improvements:


Easier to use

DBR 7.3 LTS makes it easier to develop and run your Spark applications thanks to the Apache Spark 3.0 improvements. The goal of Project Zen within Spark 3.0 is to align PySpark more closely with Python principles and conventions. Perhaps the most noticeable improvement is the new interface for Pandas UDFs, which leverages Python type hints. This standardizes a preferred way to write Pandas UDFs and uses type hints to give a better developer experience within your IDE.
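As a small, hedged illustration of the new interface (the function and column names are made up for the example):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # Spark 3.0-style Pandas UDF: the type hints (pd.Series -> pd.Series)
    # replace the older PandasUDFType declaration.
    @pandas_udf("double")
    def fahrenheit_to_celsius(temp_f: pd.Series) -> pd.Series:
        return (temp_f - 32) * 5.0 / 9.0

    # Usage on a DataFrame assumed to have a numeric "temp_f" column:
    # df.select(fahrenheit_to_celsius("temp_f")).show()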

If you haven’t yet converted your Apache Parquet data lake into a Delta Lake, you are missing out on many benefits, such as:

  1. Preventing data corruption
  2. Faster queries
  3. Increased data freshness
  4. Easy reproducibility of machine learning models
  5. Easy implementation of data compliance

These Top 5 Reasons to Convert Your Cloud Data Lake to Delta Lake should provide an opportunity to upgrade your Data Lake along with your Databricks Runtime.

Data ingestion from cloud storage has been simplified with Delta Auto Loader, which was released for general availability in DBR 7.2. This enables a standard API across cloud providers to stream data from blob storage into your Delta Lake. Likewise, the COPY INTO (AWS | Azure) command was introduced to provide an easy way to import data into a Delta table using SQL.
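A minimal sketch of the Auto Loader pattern (paths, file format and checkpoint locations are illustrative placeholders):

    # Incrementally ingest new files from cloud storage into a Delta table.
    stream = (spark.readStream
              .format("cloudFiles")                 # Auto Loader source
              .option("cloudFiles.format", "json")  # format of the incoming files
              .load("/mnt/raw/events"))

    (stream.writeStream
           .format("delta")
           .option("checkpointLocation", "/mnt/checkpoints/events")
           .start("/mnt/delta/events"))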

Refactoring changes to your data pipeline became easier with the introduction of Delta Table Cloning. This allows you to quickly clone a production table in a safe way so that you can experiment on it with the next version of your data pipeline code without the risk of corrupting your production data. Another common scenario is the need to move a table to a new bucket or storage system for performance or governance reasons. You can easily do this with the CLONE command to copy massive tables in a more scalable and robust way. Additionally, you can add your own metadata (AWS | Azure) to the transaction log when committing to Delta.
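A hedged sketch of the cloning workflow described above (table names and the metadata string are placeholders):

    # Create an isolated copy of a production table to test pipeline changes against.
    # SHALLOW CLONE copies only metadata; DEEP CLONE also copies the data files.
    spark.sql("CREATE TABLE IF NOT EXISTS prod_events_dev SHALLOW CLONE prod_events")

    # Attach custom metadata to the commits this job writes to the Delta transaction log
    # (config name per the Databricks Delta documentation).
    spark.conf.set("spark.databricks.delta.commitInfo.userMetadata", "pipeline-v2-backfill")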

When troubleshooting long-running queries, it is common to generate an explain plan of the query or DataFrame. Large explain plans can be unwieldy to navigate. Explain plans have become much more consumable with the reformatting introduced in Spark 3.0.

Here is an example of an explain plan pre Spark 3.0:

An example of an explain plan pre Spark 3.

Here is the newly formatted explain plan. It is separated into a header, which shows the basic operator tree for the execution plan, and a footer, where each operator is listed with additional attributes.

Finally, any subqueries are listed separately:

Newly formatted explain plan introduced with Spark 3.0.
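As a hedged illustration (the query below is a throwaway example, not from the original post), the formatted plan can be requested directly in PySpark:

    from pyspark.sql import functions as F

    # A small aggregation query whose plan we want to inspect.
    df = spark.range(1000).groupBy((F.col("id") % 10).alias("bucket")).count()

    # mode="formatted" (Spark 3.0+) produces the header/footer layout described above.
    df.explain(mode="formatted")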

Fewer failures

Perhaps no feature has been more hotly anticipated than the ability for Spark to automatically calculate the optimum number of shuffle partitions. Gone are the days of manually adjusting spark.sql.shuffle.partitions. This is made possible by the new Adaptive Query Execution (AQE) added in Spark 3.0, a major step-change to the execution engine for Spark.

Spark now has an adaptive planning component to its optimizer so that as a query is executing, statistics can automatically be collected and fed back into the optimizer to replan subsequent sections of the query.


Another benefit of AQE is the ability for Spark to automatically replan queries when it detects data skew. When joining large datasets, it’s not uncommon to have a few keys with a disproportionate amount of data. This can result in a few tasks taking an excessive amount of time to complete, or in some cases, fail the entire job. The AQE can re-plan such a query as it executes to evenly spread the work across multiple tasks.
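A hedged sketch of the relevant flags (these are standard Spark 3.0 settings; on Databricks 7.x some of them may already be enabled by default):

    # Enable Adaptive Query Execution and its shuffle/skew optimizations.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # auto-coalesce shuffle partitions
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions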

Sometimes failures can occur on a shared cluster because of the actions of another user, such as when two users are experimenting with different versions of the same library. The latest runtime includes a feature for Notebook-scoped Python Libraries (AWS | Azure). This ensures that you can easily install Python libraries with pip, but their scope is limited to the current notebook and any associated jobs. Other notebooks attached to the same Databricks cluster are not affected.
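For example, a notebook-scoped install is just a magic command in a notebook cell (the package and version here are illustrative); the library is visible only to that notebook and its associated jobs:

    # Run in a Databricks notebook cell; scoped to this notebook only.
    %pip install matplotlib==3.3.4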

Improved performance

In the cloud, time is money. The longer it takes to run a job, the more you pay for the underlying infrastructure. Significant performance speedups were introduced in Spark 3.0. Much of this is due to the AQE, dynamic partition pruning, automatically selecting the best join strategy, automatically optimizing shuffle partitions and other optimizations. Spark 3.0 was benchmarked as being 2x faster than Spark 2.4 on the TPC-DS 30TB dataset.


UDFs created with R now execute with a 40x improvement by vectorizing the processing and leveraging Apache Arrow.


Finally, our Delta Engine was enhanced to provide even faster performance when reading and writing to your Delta Lake. This includes the collection of optimizations that reduce the overhead of Delta Lake operations from seconds to tens of milliseconds. We introduced a number of optimizations so that the MERGE statement performs much faster.

Getting started

The past year has seen a major leap in usability, stability, and performance. If you are still running DBR 6.x, you are missing out on all of these improvements. If you have not upgraded yet, then you should plan to do so before extended support ends at the close of 2021. Doing so will also prepare you for future improvements that are to be released for Delta Engine later this year–all dependent on Spark 3.0 APIs.

You can jumpstart your planning by visiting our documentation on DBR 7.x Migration – Technical Considerations (AWS | Azure), and by reaching out to your Account Team.

--

Try Databricks for free. Get started today.

The post Upgrade Production Workloads to Be Safer, Easier, and Faster With Databricks Runtime 7.3 LTS appeared first on Databricks.

Glow V1.0.0, Next Generation Genome Wide Analytics


Genomics data has exploded in recent years, especially as some datasets, such as the UK Biobank, become freely available to researchers anywhere. Genomics data is leveraged for high-impact use cases – gene discovery, research and development prioritization, and to conduct randomized controlled trials. These use cases will help in developing the next generation of therapeutics.

The catch: deriving insights from this data requires data teams to scale their analytics. And scaling requires data scientists and engineers with deep technical skill sets. That’s why we’re excited to announce the release of Glow version 1.0.0, an open source library that solves key challenges of applying distributed computing to genomics data in the cloud.

Challenges with genetic association studies

As genetic data has grown, processing, storing and analyzing it has become a major bottleneck. Challenges include:

  1. Variety of data. The variety of data types can make managing it a real headache. For instance, Biobank data contains genomics, electronic health records, medical devices and images.
  2. Volume and velocity of data. Genetic data is massive and constantly evolving, and analyses are rerun continually as fresh data comes in.
  3. Inflexible analytics. Single node bioinformatics tools do not allow users to work together interactively on large datasets. Genomics data formats may be optimized for compression and storage, but not for analytics. Bioinformatics scientists filter samples that are either from the same family or of different ethnicities. Hard filtering limits the power to make new discoveries.

Introducing Glow

Glow is an open-source toolkit for working with genomic data at population-level scale. The toolkit is natively built on Apache Spark™, a unified analytics engine for large-scale data processing and machine learning.

  1. Bridges bioinformatics and the big data ecosystem. With Glow, you can ingest variant call format (VCF), bgen, plink and Hail matrix tables under a common variant schema (see the sketch after this list). Variant data can then be written to Delta Lake to create genomics data lakes, which can be linked to a variety of data sources and analyzed with distributed libraries such as GraphFrames.
  2. Built to scale. Glow natively builds on Apache Spark™ and Delta Lake, allowing users to ramp from 1 to 10 to 100 nodes. Scaling computers is faster than optimizing code or hardware.
  3. Natively supports genetic association studies. Glow is concordant with regenie for linear and logistic regression and now supports up to 20 phenotypes simultaneously. The method allows you to include all the data without filtering, and controls for an imbalance of cases and controls. Glow is written using Python and Pandas user-defined functions, allowing computational biologists to extend Glow to gene burden or joint variant analysis, for example.
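As a minimal sketch of the first two points above (the VCF path and Delta location are placeholders, and Glow is assumed to be installed on the cluster):

    import glow

    # Register Glow's functions and data sources with the active Spark session.
    spark = glow.register(spark)

    # Ingest variant data under the common variant schema...
    vcf_df = spark.read.format("vcf").load("/mnt/genomics/samples.vcf.gz")

    # ...and land it in Delta Lake to build a genomics data lake.
    vcf_df.write.format("delta").mode("overwrite").save("/mnt/delta/genomics/variants")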


Figure 1. The Glow library can be run on Databricks on any of the three major clouds; starter notebooks can be found in the documentation.

Figure 2. Glow’s whole genome regression (GloWGR) is orders of magnitude more scalable than existing methods

Conclusions

We have collaborated with the Regeneron Genetics Center to solve key scaling challenges in genomics through project Glow. Bioinformaticians, computational biologists, statistical geneticists and research scientists can work together on the Databricks platform, on any cloud, to scale their genomics data analytics and downstream machine learning applications. The first genomics use case for Apache Spark™ and Delta Lake has been population-scale genetic association studies, and we are now seeing new use cases emerging for cancer and childhood developmental disorders.

Get started

Try out Glow V1.0.0 on Databricks or learn more at projectglow.io.

--

Try Databricks for free. Get started today.

The post Glow V1.0.0, Next Generation Genome Wide Analytics appeared first on Databricks.

Delta Lake: the Foundation of Your Lakehouse


More and more, we have seen the term “lakehouse” referenced in today’s data community. Beyond our own work at Databricks, companies and news organizations alike have increasingly turned to this idea of a data lakehouse as the future for unified analytics, data science, and machine learning. But what is a lakehouse? Join us for our upcoming virtual event: Delta Lake: the Foundation of Your Lakehouse.

Delta Lake -- foundation of your data lakehouse

Businesses are looking to drive strategic initiatives

As the size and complexity of data at organizations grow, businesses are looking to leverage that data to drive strategic initiatives powered by machine learning, data science, and analytics. The companies that manage to leverage this data effectively are driving innovation across industries. But doing so is challenging – the old ways of managing data can’t keep up with the massive volume. Traditional data warehouses, which were first developed in the late 1980s and were built to handle these large and growing data sets, are expensive, rigid and can’t handle the modern use cases most companies are looking to address.

As an attempted solution, companies turned to data lakes – a low-cost, flexible storage option that can handle the variety of data (structured, unstructured, semi-structured) that is required for the strategic priorities of enterprises today. Data lakes use an open format, giving businesses the flexibility to enable many applications to take advantage of the data.

While data lakes are a step in the right direction, a variety of challenges arise with data lakes that slow innovation and productivity. Data lakes lack the necessary features to ensure data quality and reliability. Seemingly simple tasks can drastically reduce a data lake’s performance and with poor security and governance features, data lakes fall short of business and regulatory needs.

The best of both worlds: lakehouse

The answer to the challenges of data warehouses and data lakes is the lakehouse, a next generation data platform that uses similar data structures and data management features to those in a data warehouse but instead runs them directly on cloud data lakes. Ultimately, a lakehouse allows traditional analytics, data science, and machine learning to coexist in the same system, all in an open format.

To build their lakehouse and solve the challenges with data lakes, customers have turned to Delta Lake, an open format storage layer that combines the best of both data lakes and data warehouses. Across industries, enterprises have enabled true collaboration among their data teams with a reliable single source of truth enabled by Delta Lake. By delivering quality, reliability, security and performance on your data lake — for both streaming and batch operations — Delta Lake eliminates data silos and makes analytics accessible across the enterprise. With Delta Lake, customers can build a cost-efficient, highly scalable lakehouse that eliminates data silos and provides self-service analytics to end-users.

Join us for our upcoming virtual event Delta Lake: the Foundation of Your Lakehouse.

In the event, you will learn more about the importance of a lakehouse and how Delta Lake forms its foundation. Through a keynote, demo and story of a customer’s experience with Delta Lake, you will gain a better understanding of what Delta Lake can do for you and how a lakehouse architecture can create a single source of truth at your organization, and unify machine learning, data science, and analytics. We hope you’ll join us and look forward to seeing you soon!

REGISTER NOW!

The post Delta Lake: the Foundation of Your Lakehouse appeared first on Databricks.

Productionize Data Science With Repos on Databricks


Most data science solutions make data teams choose between flexibility for exploration and rigidity for production. As a result, data scientists often need to hand off their work to engineering teams that use a different technology stack and essentially rewrite their work in a new environment. This is not only costly but also delays the time it takes for a data scientist’s work to deliver value to the business.


The next-generation Data Science Workspace on Databricks navigates these trade-offs to provide an open and unified experience for modern data teams. As part of this Databricks Workspace, we are excited to announce public availability of the new Repos feature, which delivers repository-level integration with Git providers, enabling any member of the data team to follow best practices. Databricks Repos integrate with your developer toolkit with support for a wide range of Git providers, including Github, Bitbucket, Gitlab, and Microsoft Azure DevOps.

By integrating with Git, Databricks Repos provide a best-of-breed developer environment for data science and data engineering. You can enforce standards for code developed in Databricks, such as code reviews, tests, etc., before deploying your code to production. Developers will find familiar Git functionality in Repos, including the ability to clone remote Git repos (Figure 1), manage branches, pull remote changes and visually inspect outstanding changes before committing them (Figure 2).


Figure 1: To get started just provide the URL of the Git repository you want to clone


Figure 2: Developers can work on their own development branch and commit code and pull changes. Outstanding changes can be inspected in the UI before committing.

With the public launch of Repos, we are adding functionality to satisfy the most demanding enterprise use-cases:

  • Allow lists enable admins to configure URL prefixes of Git repositories to which users can commit code. This makes sure that code cannot accidentally be pushed to non-allowed repositories.
  • Secret detection identifies clear-text secrets in your source code before they get committed, helping data teams follow best practices of using secret managers.

Repos can also be integrated with your CI/CD pipelines, allowing data teams to take data science and machine learning (ML) code from experimentation to production seamlessly. With the Repos API (currently in private preview), you can programmatically update your Databricks Repos to the latest version of a remote branch. This enables you to easily implement CI/CD pipelines, e.g. the following best-practice workflow:

  1. Development: Developers work on feature branches on personal checkouts of a remote repo in their user folders.
  2. Review & Testing: When a feature is ready for review and a PR is created, your CI/CD system can use the Repos API to automatically update a test environment in Databricks with the changes on the feature branch and then run a set of tests to validate the changes.
  3. Production: Finally, once all the tests have passed and the PR has been approved and merged, your CI/CD system can use the Repos API to update the production environment in Databricks with the changes. Your production jobs will now run against the latest code.
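As a hedged sketch of step 3, a CI/CD job could call the Repos API roughly as follows (the API was in preview at the time of writing, so the endpoint and payload may change; the host, token and repo ID are placeholders):

    import requests

    DATABRICKS_HOST = "https://<your-databricks-instance>"
    TOKEN = "<personal-access-token>"
    REPO_ID = "<repo-id>"

    # Update the production Repo checkout to the head of the main branch.
    resp = requests.patch(
        f"{DATABRICKS_HOST}/api/2.0/repos/{REPO_ID}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"branch": "main"},
    )
    resp.raise_for_status()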

The Repos feature is a part of the Next Generation Workspace and, with this public release, enables data teams to easily follow best practices and accelerate the path from exploration to production.

Get started

The Repos icon will display for Databricks Workspaces enabled with the feature.

Repos are in Public Preview and can be enabled for Databricks Workspaces! To enable Repos, go to Admin Panel -> Advanced and click the “Enable” button next to “Repos.” Learn more in our developer documentation.

--

Try Databricks for free. Get started today.

The post Productionize Data Science With Repos on Databricks appeared first on Databricks.

Segmentation in the Age of Personalization


Quick link to notebooks referenced through this post.

Personalization is heralded as the gold standard of customer engagement. Organizations successfully personalizing their digital experiences are cited as driving 5 to 15% higher revenues and 10 to 30% greater returns on their marketing spend. And now many customer experience leaders are beginning to extend personalization to the in-store experience, revolutionizing how consumers engage brands in the physical world and further setting themselves apart from their competition.

But true personalization is not viable in every aspect of customer engagement. When a retailer decides to construct a store, it does so considering the general needs of the population it intends to serve. Those same considerations carry over into the choice of products with which the store is stocked. Consumer goods manufacturers similarly consider the needs of specific, targeted consumer segments when deciding to launch a new brand or product. And even in the digital world where personalization is most easily deployed, the mix of content, products and services made available through a site or app is designed to fulfill the needs of targeted but still fairly broad groups of consumers.

Why not target the individual?

Fundamentally, it comes down to the cost of delivering a good or service relative to what the consumer is willing to pay. In the earliest applications of segmentation, manufacturers recognized that specialized product lines aligned with the generalized needs and objectives of targeted consumer groups could be used to differentiate their offerings from those of their competitors. By better connecting with these consumers, these products become more attractive, customers shift their spending, and greater value for both the consumer and the manufacturer are obtained. To see the consequence of this way of thinking, simply walk the cooking oil or dairy aisle of any major grocery store and notice the incredible diversity of offerings available for even the most basic of goods.


Figure 1. The recognition of different consumer needs and objectives translates into a variety of product choices.

This mode of thinking, i.e. of considering customers as members of broad groups with similar needs and objectives (aka segments), extends beyond product development and into every business function oriented around the customer. Customer segmentation allows groups to design products, services, messaging and general models of engagement that are more likely to meet the needs of specific consumer groups. But operating in this manner comes at a cost.

Differentiated offerings require differentiated means of production and delivery. Each product, service, advertisement, etc. targeted to a specific segment requires specialized design, engineering, marketing and support efforts to go into it. Because of the greater value delivered by the differentiated product, consumers may be willing to pay more and if the goods can attract customers away from competitors, expanding market share, economies of scale may be accrued. But that’s the gamble.

How do we know we have the right segments?

To describe it as a gamble is not perfectly accurate. The reality is that most organizations spend significant time and resources scrutinizing customers and testing responsiveness before launching a specialized offering. This analysis continues as it is released and becomes established in the marketplace. If successful, the offering may come to occupy a niche from which the organization can derive profits.

But the marketplace is never stable. Shifts in consumer needs and objectives, their willingness or ability to pay, regulatory changes and the actions of competitors may make a particular niche more or less viable over time. Changes in the ability of the organization to produce a differentiated offering may also change how an organization wishes to continue going to market.

As a result, organizations are continually re-examining their customer segments, looking for both threats and opportunities. With the emergence of data science as a key practice in many marketing organizations, more and more data scientists are finding themselves invited into the segmentation dialog.

How does data science fit into segmentation?

Segmentation is frequently described as the foundation of modern marketing. With over 60 years of history behind it, the range of techniques and approaches available for conducting a segmentation exercise can be a bit overwhelming. So, how do we navigate this?

First, let’s acknowledge that segments do not exist as features of the real world. Instead, they are generalizations that we form, allowing us to summarize the unique combination of needs, preferences, objectives, motivations and responses that make up each individual consumer. The value of a segment lies not so much in its absolute truth (though it should be grounded in reality) but instead in its usefulness in dealing with this complexity.

Next, there may be multiple ways for our organization to view consumers, and these may lead to different segment definitions. Ideally, there would be a shared perspective on customers that allows the organization to engage in a consistent and cohesive manner, but sub-segment definitions and even alternative segmentation designs may prove useful in the context of specific business functions.

Finally, a segment definition is useful in that it allows us to focus resources in a manner that is likely to provide a good, predictable return. But because resources are likely already invested in particular segment design, changing our models of customer engagement based on a new segmentation perspective requires careful consideration of organizational change concerns.

A segmentation walk-through

To illustrate how data scientists might engage in a segmentation exercise, let’s imagine a promotions management team for a large grocery chain. This team is responsible for running a number of promotional campaigns, each of which is intended to drive greater overall sales. Today, these marketing campaigns include leaflets and coupons mailed to individual households, manufacturer coupon matching, in-store discounts and the stocking of various private label alternatives to popular national brands.

Recognizing uneven response rates between households, the team is eager to determine if customers might be segmented based on their responsiveness to these promotions. It is anticipated that such segmentation may allow the promotions management team to better target individual households in a way that drives overall higher response rates for each promotional dollar spent.

Using historical data from point of sales systems along with campaign information from their promotions management systems, the team derives a number of features that capture the behavior of various households with regards to promotions. Applying standard data preparation techniques, the data is organized for analysis, and using a variety of clustering algorithms, such as k-means and hierarchical clustering, the team settles on two potentially useful cluster designs.
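The clustering step itself can be as simple as the following hedged sketch (the feature table, its columns and the number of clusters are illustrative; the full workflow is in the notebooks linked at the end of this post):

    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # household_features: pandas DataFrame of per-household promotion-response features.
    X = StandardScaler().fit_transform(household_features)

    # Fit k-means and attach the resulting segment label to each household.
    kmeans = KMeans(n_clusters=4, random_state=42)
    household_features["cluster"] = kmeans.fit_predict(X)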


Figure 2. Overlapping segment designs separating households based on their responsiveness to various promotional offerings.

Applying profiling to these clusters, the team’s marketers can discern that customer households, in general, fall into two groups: those that are responsive to coupons and mailed leaflets and those that are not. Further divisions show differing degrees of responsiveness with other promotional offers.


Figure 3. Profiling of clusters to identify differences in behavior between clusters

Comparing households by demographic factors not used in developing the clusters themselves, some interesting patterns separating cluster members by age and other factors are identified. While this information may be useful in predicting cluster membership and designing more effective campaigns targeted to specific groups of households, the team recognizes the need to collect additional demographic data before putting too much emphasis on these results.


Figure 4. Age-based differences in cluster composition of behavior-based customer segments.

The results of the analysis now drive a dialog between the data scientists and the promotions management team. Based on initial findings, a revised analysis will be performed focused on what appear to be the most critical features differentiating households as a means to simplify the cluster design and evaluate overall cluster stability. Subsequent analyses will also examine the revenue generated by various households to understand how changes in promotional engagement may impact customer spend. Using this information, the team believes they will have the ability to make a case for change to upper management. Should a change in promotions targeting be approved, the team makes plans to monitor household spend, promotions spend and campaign responsiveness rates using much of the same data used in this analysis. This will allow the team to assess the impact of these efforts and identify when the segmentation design needs to be revisited.

If you would like to examine the analytics portion of the workflow described here, please check out the following notebooks written using a publicly available dataset and the Databricks platform:

--

Try Databricks for free. Get started today.

The post Segmentation in the Age of Personalization appeared first on Databricks.


Advertising Fraud Detection at Scale at T-Mobile


This is a guest authored post by Data Scientist Eric Yatskowitz and Data Engineer Phan Chuong, T-Mobile Marketing Solutions.

The world of online advertising is a large and complex ecosystem rife with fraudulent activity such as spoofing, ad stacking and click injection.  Estimates show that digital advertisers lost about $23 billion in 2019, and that number is expected to grow in the coming years. While many marketers have resigned themselves to the fact that some portion of their programmatic ad spend will go to fraudsters, there has also been a strong pushback to try to find an adequate solution to losing such large sums of advertising money.

This blog describes a research project developed by the T-Mobile Marketing Solutions (TMS) Data Science team intended to identify potentially fraudulent ad activity using data gathered from the T-Mobile network. First, we present the platform architecture and production framework that support TMS's internal products and services. Powered by Apache Spark™ technologies, these services operate in a hybrid of on-premise and cloud environments. We then discuss best practices learned from the development of our Advertising Fraud Detection service and give an example of scaling a data science algorithm outside of the Spark MLlib framework. We also cover various Spark optimization tips to improve product performance, the use of MLflow for tracking and reporting, and challenges we faced while developing this fraud prevention tool.

Overall architecture

Tuning Spark for ad fraud prevention

Sharing an on-prem Hadoop environment

Working in an on-premise Hadoop environment with hundreds of users and thousands of jobs running in parallel has never been easy. Unlike in cloud environments, on-premise resources (vCPUs and memory) are typically limited or grow quite slowly, requiring a lot of benchmarking and analyzing of Spark configurations in order to optimize our Spark jobs.

Resource management

When optimizing your Spark configuration and determining resource allocation, there are a few things to consider.

Spark applications use static allocation by default, which means you have to tell the Resource Manager exactly how many executors you want to work with. The problem is you usually don’t know how many executors you actually need before running your job. Often, you end up allocating too few (which makes your job slow, and likely to fail) or too many (so you’re wasting resources). In addition, the resources you allocate are going to be occupied for the entire lifetime of your application, which leads to unnecessary resource usage if demand fluctuates. That’s where the flexible mode called dynamic allocation comes into play. With this mode, executors spin up and down depending on what you’re working with, as you can see from a quick comparison of the following charts.

When designing an ad prevention framework, dynamic allocation of Spark compute resources is recommended over the static method due to job-size variability.

This mode, however, requires a bit of setup (a configuration sketch follows the list below):

  • Set the initial (spark.dynamicAllocation.initialExecutors) and minimum (spark.dynamicAllocation.minExecutors) number of executors. It takes a little while to spin these up, but having them available will make things faster when executing jobs.
  • Set the maximum number of executors (spark.dynamicAllocation.maxExecutors) because without any limitation, your app can take over the entire cluster. Yes, we mean 100% of the cluster!
  • Increase the executor idle timeout (spark.dynamicAllocation.executorIdleTimeout) to keep your executors alive a bit longer while waiting for a new job. This avoids a delay while new executors are started for a job after existing ones have been killed by idling.
  • Configure the cache executor idle timeout (spark.dynamicAllocation.cachedExecutorIdleTimeout) to release the executors that are saved in your cache after a certain amount of time. If you don’t set this up and you cache data often, dynamic allocation is not much better than static allocation.
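Putting those settings together, a configuration might look like the following hedged sketch (the values are illustrative rather than our production settings, and dynamic allocation on YARN also needs the external shuffle service):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("ad-fraud-detection")
             .config("spark.shuffle.service.enabled", "true")   # required for dynamic allocation on YARN
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.dynamicAllocation.initialExecutors", "10")
             .config("spark.dynamicAllocation.minExecutors", "5")
             .config("spark.dynamicAllocation.maxExecutors", "100")
             .config("spark.dynamicAllocation.executorIdleTimeout", "120s")
             .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "600s")
             .getOrCreate())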

Reading

Spark doesn’t read files individually; it reads them in batches. It’ll try to merge files together until a partition is filled up. Suppose you have an HDFS directory that has a gigabyte of data with one million files, each with a size of 1 KB (admittedly, an odd use case). If maxPartitionBytes is equal to 8 MB, you will have 500,000 tasks and the job most likely will fail. The chart below shows the correlation between the number of tasks and partition size.

When designing a Spark ad fraud prevention framework, right-sizing the config partition with respect to tasks can help reduce job failures and delays.

It’s quite simple: the bigger the partition you configure, the fewer tasks you will have. Too many tasks might make the job fail because of driver memory limitations, executor memory errors, or network connection errors. Too few tasks will normally slow down your job for lack of parallelization, and quite often will cause an executor memory error. Fortunately, we’ve found a formula that can calculate the ideal partition size. Plugging in the parameters from the above example will give you a recommended partition size of about 2 GB and 2,000 tasks, a much more reasonable number:

Example equation for right-sizing Spark partitions used for ad fraud prevention.
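As a hedged example of applying that result (the 2 GB value is the one suggested above; the right setting for your own data should come from the formula):

    # Pack the many small files into roughly 2 GB read partitions.
    spark.conf.set("spark.sql.files.maxPartitionBytes", "2g")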

Joining and aggregating

Whenever you join tables or do aggregation, Spark will shuffle data. Shuffling is very expensive because your entire dataset is sent across the network between executors. Tuning shuffle behavior is an art. In the below toy example, there’s a mismatch between the shuffle partition configuration (4 partitions) and the number of currently available threads (8 as a result of 2 cores per executor). This will waste 4 threads.

Example illustrating the importance of Spark dataset shuffle tuning.

The ideal number of partitions to use when shuffling data (spark.sql.shuffle.partitions) can be calculated by the following formula: ShufflePartitions = Executors * Cores. You can easily get the number of executors if you’re using static allocation, but how about with dynamic allocation? In that case, you can use maxExecutors, or, if you don’t know what the max executor configuration is, you can try using the number of subdirectories of the folder you’re reading from.
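For example, with the optimized static configuration shown later in this post (100 executors with 4 cores each), that works out to:

    num_executors = 100        # spark.executor.instances, or maxExecutors under dynamic allocation
    cores_per_executor = 4     # spark.executor.cores

    spark.conf.set("spark.sql.shuffle.partitions", str(num_executors * cores_per_executor))  # 400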

Writing

Writing in Spark is a bit more interesting. Take a look at the example below. If you have thousands of executors, you will have thousands of files. The point is, you won’t know if that’s too many or too few files unless you know the data size.

Example illustrating the importance of limiting file size and number when writing to Spark.

From our experience with huge data, keeping file sizes below 1 GB and the number of files below 2,000 is a good benchmark. To achieve that, you can use either coalesce or repartition (a short sketch follows the list below):

  • The coalesce function works in most cases if you just want to scale down the files, and it’s typically faster because of no data shuffling.
  • The repartition function, on the other hand, does a full shuffle and redistributes the data more or less evenly. It will speed up the job if your data is skewed, as in the following example. However, it’s more expensive.
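A hedged sketch of both options (the output path and partition counts are illustrative):

    # repartition: full shuffle that evens out skew across 400 output files.
    (df.repartition(400)
       .write.format("delta")
       .mode("overwrite")
       .save("/mnt/delta/ad_traffic/curated"))

    # coalesce: no shuffle; simply merges existing partitions down to 200 files.
    (df.coalesce(200)
       .write.format("delta")
       .mode("overwrite")
       .save("/mnt/delta/ad_traffic/curated_small"))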

Python to PySpark

Similarities between Python and PySpark

For day-to-day data science use cases, there are many functions and methods in Spark that will be familiar to someone used to working with the Pandas and scikit-learn libraries. For example, reading a CSV file, splitting data into training and test sets, and fitting a model all have similar syntax.

The similarities between Python and PySpark make it easy for Python developers to perform basic functions in Spark.

If you’re using only predefined methods and classes, you may not find writing code in Spark very difficult—but there are a number of instances where converting your Python knowledge into Spark code is not straightforward. Whether because the code for a particular machine learning algorithm isn’t in Spark yet or because the code you need is highly specific to the problem you are trying to solve, sometimes you will have to write your own Spark code. Note that if you’re comfortable working with Pandas, Koalas could be a good option for you, as the API works very similarly to how you would work with dataframes in native Python. Take a look at this 10-minute intro to Koalas for more info.

Python UDF vs. PySpark

If you’re more comfortable writing code in Python than Spark, you may be inclined to start by writing your code as a Python UDF. Here’s an example of some code we first wrote as a UDF, then converted to PySpark. We started by breaking it down into logical snippets, or sections of code that each perform a fairly basic task, as shown by the colored boxes. This allowed us to find the set of functions that accomplished those tasks. Once they’d been identified, we just had to make sure the bits of code fed into each other and test that the end result was more or less the same. This left us with code that ran natively in Spark rather than a Python UDF.

Python UDF vs. PySpark code
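Since the original snippets appear only as images, here is a generic, hedged illustration of the same idea (not the actual T-Mobile code), assuming a DataFrame df with clicks and impressions columns:

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # Python UDF version: every row is serialized out to a Python worker.
    ctr_udf = F.udf(lambda clicks, imps: clicks / imps if imps else None, DoubleType())
    df_udf = df.withColumn("ctr", ctr_udf("clicks", "impressions"))

    # Native PySpark version: built-in column expressions stay inside the JVM.
    df_native = df.withColumn(
        "ctr",
        F.when(F.col("impressions") > 0, F.col("clicks") / F.col("impressions")))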

And to justify this effort, here are the results showing the difference in performance between the UDF and the PySpark code.

PySpark can be twice as efficient as UDF for running jobs associated with ad fraud prevention.

The PySpark code ran over twice as fast, which for a piece of code  that runs every day adds up to many hours of compute time savings over the course of the year.

Normalized Entropy

Now, let’s  walk through one of the metrics we use to identify potentially fraudulent activity in our network. This metric is called normalized entropy. Its use was inspired by a 2016 paper that found that domains and IPs with very high or low normalized entropy scores based on user traffic were more likely to be fraudulent. We found a similar distribution of normalized entropy scores when analyzing apps using our own network traffic (see the histogram below).

Data scientists familiar with decision trees may know of Shannon entropy, which is a metric commonly used to calculate information gain. The normalized entropy metric we use is the normalized form of Shannon entropy. For those unfamiliar with Shannon entropy, here’s the equation that defines it, followed by the definition of normalized Shannon entropy:
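With p_i taken as the share of an app’s traffic coming from user i, and n the number of distinct users, the standard definitions are:

H = -\sum_{i=1}^{n} p_i \log_2 p_i

H_{norm} = \frac{H}{\log_2 n}

The normalization divides by the maximum possible entropy, \log_2 n, so H_{norm} always falls between 0 and 1.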

The idea behind this metric is fairly simple. Assuming that most apps used by our customers are not fraudulent, we can use a histogram to determine an expected normalized entropy value. Values around the mean of about 0.4 would be considered “normal.” On the other hand, values close to 0 or 1 score higher in terms of potential fraudulence because they are statistically unusual. The meaning behind these extremes also matters: a value of 0 indicates that all of the network traffic for a particular app came from a single user, while a value of 1 means the traffic was spread evenly across users (for example, many users who each opened the app exactly once). This is, of course, not a flawless metric; people often open their banking apps only once every few days, so apps like these will have a higher normalized entropy but are unlikely to be fraudulent. Thus, we include several other metrics in our final analysis.

Using Shannon Normalized Entropy to detect fraudulent ad schemes like ad stacking or click injections.

Static vs. dynamic (revisited)

Here is the code we are currently using to calculate normalized entropy:
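A minimal PySpark sketch of this kind of calculation is below; df, app_id and user_id are assumed names for the traffic records, and the real implementation includes more than this:

from pyspark.sql import functions as F

# Events per (app, user), plus per-app totals and distinct user counts
counts = df.groupBy("app_id", "user_id").agg(F.count("*").alias("events"))
totals = counts.groupBy("app_id").agg(
    F.sum("events").alias("total_events"),
    F.countDistinct("user_id").alias("num_users"),
)

normalized_entropy = (
    counts.join(totals, "app_id")
    .withColumn("p", F.col("events") / F.col("total_events"))
    .groupBy("app_id", "num_users")
    .agg((-F.sum(F.col("p") * F.log(F.col("p")))).alias("shannon_entropy"))
    # Divide by log(n); single-user apps get 0 to avoid dividing by log(1) = 0
    .withColumn(
        "normalized_entropy",
        F.when(F.col("num_users") > 1,
               F.col("shannon_entropy") / F.log(F.col("num_users").cast("double")))
         .otherwise(F.lit(0.0)),
    )
)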

The last row in the table below shows the time difference between running this code with a default configuration and an optimized configuration. The optimized config sets the number of executors to 100, with 4 cores per executor, 2 GB of memory per executor, and shuffle partitions equal to executors × cores, or 400. As you can see, the difference in compute time is significant, showing that even fairly simple Spark code can benefit greatly from an optimized configuration and significantly reduce waiting time. This is a great skill to have for any data scientist analyzing large amounts of data in Spark.

Default vs. “optimized” static configuration

 

Config                          Default    Optimized
spark.executor.instances        2          100
spark.executor.cores            1          4
spark.executor.memory           1g         2g
spark.sql.shuffle.partitions    200        400
Completion time                 8 min      23 sec
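As a rough sketch, the optimized settings above could be supplied when building the Spark session (or passed to spark-submit); the application name is illustrative, and the values would be tuned to your own data volume:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("normalized-entropy")                   # illustrative name
    .config("spark.executor.instances", "100")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "2g")
    .config("spark.sql.shuffle.partitions", "400")   # executors * cores = 100 * 4
    .getOrCreate()
)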

Finally, let’s chat about the difference between static and dynamic allocation. The two previous configs were static, but when we use dynamic allocation with the same optimized config (shown below), it takes almost twice as long to compute the same data. This is because the Spark cluster initially only spins up the designated minimum or initial number of executors and increases the number of executors, as needed, only after you start running your code. Because executors don’t spin up immediately, this can extend your compute time. This concern is mostly relevant when running shorter jobs and when executing jobs in a production environment,  Though, if you’re doing EDA or debugging, dynamic allocation is probably still the way to go.

“Optimized” dynamic configuration

Config                                   Default     Optimized
spark.dynamicAllocation.maxExecutors     infinity    100
spark.executor.cores                     1           4
spark.executor.memory                    1g          2g
spark.sql.shuffle.partitions             200         400
Completion time                          n/a         41 sec
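A comparable sketch for the dynamic configuration; the external shuffle service setting is an assumption about the cluster (it is usually required for dynamic allocation on YARN):

spark = (
    SparkSession.builder
    .appName("normalized-entropy-dynamic")              # illustrative name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "100")
    .config("spark.shuffle.service.enabled", "true")    # typically needed on YARN
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "2g")
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)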

Productionization

Architecture overview

The following image gives an overview of our AdFraud prevention platform. The data pipeline at the bottom of this architecture (discussed in the previous section) is scheduled and deployed in our Hadoop production environment. The final output of the algorithm, along with all the other measurements, is aggregated and sent out of the Hadoop cluster for use by web applications, analytics tools, alerting systems, etc. We’ve also implemented our own anomaly detection model, which, together with MLflow, gives us a complete workflow to automate, monitor and operate this product.

T-Mobile AdFraud prevention architecture

MLflow

MLflow plays an important role in the monitoring phase because of its simplicity and built-in functions for tracking and visualizing KPIs. With these few lines of code, we are able to log hundreds of our metrics with very little effort:
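A minimal sketch of that pattern; the metric dictionary and tag value are illustrative:

import mlflow

# daily_metrics: an assumed dict of KPI name -> value produced by the pipeline
with mlflow.start_run(run_name="adfraud-daily-metrics"):
    mlflow.set_tag("run_date", "2021-03-17")   # tagged by date for visualization
    for name, value in daily_metrics.items():
        mlflow.log_metric(name, value)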

We tag runs by date for visualization purposes, and with an anomaly detection model in place, we’re able to observe our ML model’s output without constant manual monitoring.

MLflow plays an important role in monitoring for ad fraud because of its simplicity and its built-in functions for tracking and visualizing KPIs.

Discussion

As you may know, validation is very difficult for something like ad fraud detection because there is no clear-cut definition of fraud, and from a legal and moral standpoint it is very important to get it right. So we’re currently working with our marketing team to run A/B tests to validate our process and make improvements accordingly. Advertising fraud is a problem that will not go away on its own; it’s going to take effort from a variety of stakeholders and perspectives to keep fraudsters in check and ensure digital advertising spend is used for its intended purpose.

We’ve provided some details on one approach to attempting to solve this problem using Spark and T-Mobile’s network data. There are a number of alternatives, but regardless of the approach, it’s clear that a very large amount of data will need to be monitored on an ongoing basis—making Spark an obvious choice of tool.

As you’ve seen, optimizing your configurations is key—and not just for the initial reading of your data, but for aggregating and writing as well. Learning to tune your Spark configuration will undoubtedly save you hours of expensive compute time that would otherwise be spent waiting for your code to run.

What’s next

Want to learn more or get started with your own use case on Databricks? Contact us to schedule a meeting or sign up for a free trial to get started today.

--

Try Databricks for free. Get started today.

The post Advertising Fraud Detection at Scale at T-Mobile appeared first on Databricks.

Winning With Data + AI: Meet the Databricks 30 Index


Companies that are disrupting entire industries, such as Uber and Airbnb, have a common driver in their competitive edge: data and AI. For these brands, big data analytics and machine learning are central to customer experiences, from predicting when your food will arrive to visualizing your next vacation. In today’s fiercely competitive markets, incumbents are pitted against savvy data-based startups every single day. Companies that thrive in this era will follow Uber and Airbnb’s lead and similarly incorporate data and artificial intelligence into the heart of their products and services.

Introducing the Databricks 30 index

This isn’t just conjecture. One only needs to look at the stock market for proof. Today, success in corporate America is more concentrated than any other period in modern history, with just a few select innovators,specifically the FAAMG stocks (Facebook, Amazon, Apple, Microsoft, and Google),accounting for more than 20% of the S&P 500 by market capitalization. In other words, winners are concentrated around companies where cloud, data and AI capabilities are central to operations.

However, success is not solely the domain of tech giants. In fact, our internal research shows that non-tech companies that also focus on cloud, data and AI win in the stock market as well.

A recent research report from Morgan Stanley, What’s Technology Worth: Introducing Data Era Stocks 2.0,* affirms our observation by analyzing “Data Era” stocks, a list of 38 non-tech US companies that create value through investments in technology. They found that centering investment in data and digitalization “allows businesses to gain insights, improve business outcomes, and drive productivity.” A common thread in Morgan Stanley’s analysis is that these companies are heavily investing in cloud, collaboration and data analytics. And “Data Era” companies substantially outperform their peers via higher margins, lower volatility in stock prices and higher valuations. Oh, and these companies’ stocks outperform both their peers and the S&P 500.

Why does this matter to all of us at Databricks? Cloud, data + AI and data team collaboration are foundational to the Databricks platform. We democratize access to the kinds of data platforms (and predictive and prescriptive capabilities) that only the Airbnbs and Ubers of the world could build for themselves. By leveraging Databricks, our customers across verticals and sizes are making the investments to enable data and AI solutions to automate and streamline business processes, inform decision-making and take a more central role in their operations.

That’s why I’m thrilled to introduce the Databricks 30 Index. Inspired by Morgan Stanely’s research on “Data Era” stocks, this index tracks marquee customers across our top five verticals plus partners. The Databricks 30 is an equal-weight price index composed of 5 marquee customers each across Retail/Consumer Products, Financial Services, Healthcare, Media/Entertainment, Manufacturing/Logistics, in addition to 5 strategic partners.


As of February 16, 2021, our analysis shows that companies in the Databricks 30 Index outperformed the S&P 500 by 21 percentage points (pp) over the last three years. In other words, if the stock market went up by 50% over this time frame, the Databricks 30 Index would have gone up by 71% (an outperformance of 21pp). If we remove tech entirely, the Databricks 30 ex-Tech index outperforms the S&P 500 by even more over the same period: +23pp.

Similar to Morgan Stanley’s analysis, we find that non-tech US companies that are investing in cloud, data analytics and collaboration do in fact win.

Questions about causality and correlation naturally arise. Do these companies win because of Databricks? Or do winners bet on Databricks? I believe the truth is somewhere in the middle.

That being said, I believe most readers would agree with these word associations: Cloud = Agility, Data = Resilience, AI = Competitive Advantage.

Databricks = Cloud + Data + AI.

Visit our customer page to learn more about some of the leading tech and non-tech companies that use Databricks as part of their winning strategy.

* Morgan Stanley Research, What’s Technology Worth: Introducing Data Era Stocks 2.0, Oct. 1, 2020.

--

Try Databricks for free. Get started today.

The post Winning With Data + AI: Meet the Databricks 30 Index appeared first on Databricks.

Tensor Input Now Supported in MLflow


Traditionally, generic MLflow Python models supported only DataFrame input and output. While DataFrames are a convenient interface for classical models built over tabular data, they are not convenient for deep learning models that expect multi-dimensional input and/or produce multi-dimensional output. In the newly-released MLflow 1.14, we’ve added support for tensors, the multi-dimensional array structures used frequently in deep learning (DL).

Previously, in order to support deep learning models, MLflow users had to resort to writing custom adaptors or using the models in their native format. Both approaches have significant shortcomings: using native flavors breaks the abstraction barrier and, more importantly, native flavors cannot be used with MLflow model deployment tools. The new tensor support provides a much better experience for MLflow users working with deep learning models.

New tensor support

To enable deep learning models, MLflow 1.14 introduced the following changes:

  1. Extended mlflow.pyfunc’s predict method input and output types to support tensors.
    predict(input: pd.DataFrame | np.ndarray | Dict[str, np.ndarray]) ->
        pd.DataFrame | pd.Series | np.ndarray | Dict[str, np.ndarray]
    
  2. Updated REST API of served models.
  3. Added tensor data type to model signatures.

These changes will enable users to fully utilize DL models in the MLflow framework and take advantage of MLflow model deployment. In Databricks, users will be able to pass tensor input to models deployed in Databricks as well as view tensor signatures and examples in the UI. The following section will give more details about the individual changes and highlight their use in an example.

Working with tensors

Let’s demonstrate this new tensor support using a simple image classification example.

For this example, we will train a model to classify the MNIST handwritten digit dataset.

import keras
from keras.layers import Dense, Flatten, Dropout
import numpy as np
import mlflow
import mlflow.keras
from mlflow.models.signature import infer_signature

# Let's prepare the training data!
(train_X, train_Y), (test_X, test_Y) = keras.datasets.mnist.load_data()
trainX, testX = train_X / 255.0, test_X / 255.0
trainY = keras.utils.to_categorical(train_Y)
testY = keras.utils.to_categorical(test_Y)

# Let's define the model!
model = keras.models.Sequential(
    [
      Flatten(),
      Dense(128, activation="relu", name="layer1"),
      Dropout(0.2),
      Dense(10, activation='softmax')
    ]
)
opt = keras.optimizers.SGD(lr=0.01, momentum=0.9)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

# Let's fit the model!
model.fit(trainX, trainY, epochs=2, batch_size=32, validation_data=(testX, testY))
 

NEW: We can use MLflow’s infer_signature helper to create a tensor-based model signature.

 

# Create a model signature using the tensor input to store in the MLflow model registry
signature = infer_signature(testX, model.predict(testX))
# Let's check out how it looks
print(signature)
# inputs: 
#    [Tensor('float64', (-1, 28, 28))]
# outputs: 
#    [Tensor('float32', (-1, 10))]
 

NEW: We can use a single tensor data-point from the training data to store as an input_example to the logged model. Note that the input_example is not limited to a single data-point and may contain a subsample of the training data.

 

# Create an input example to store in the MLflow model registry
input_example = np.expand_dims(trainX[0], axis=0)

We can now log the trained model along with the model signature and input_example as metadata.

# Let's log the model in the MLflow model registry
model_name = 'mnist_example'
registered_model_name = "tensor-blog-post"
mlflow.keras.log_model(model, model_name, signature=signature, input_example=input_example, registered_model_name=registered_model_name)

MLflow’s support for tensors includes displaying the tensor signature in the model schema.
 

NEW: Let’s load the trained model using the mlflow.pyfunc wrapper and make predictions using a tensor input.

 

# Let's load the model and make a sample prediction!
model_version = "1"
loaded_model = mlflow.pyfunc.load_model(f"models:/{registered_model_name}/{model_version}")
loaded_model.predict(input_example)

Let’s also check out querying the model via the Databricks UI using MLflow Model Serving. In order to do so, we first enable Model Serving on the Registered Model in Databricks. Once the serving endpoint and version are Ready, we can load the input example that was logged using the log_model API above. Once the input example has loaded, we can send a request to the serving endpoint using the Databricks UI. The returned predictions are now displayed in the Response section of the Model Serving webpage.

User’s can also make REST API calls from their applications to query the model being served.  When making a request with tensor input, the data is expected to be formatted based on TensorFlow Serving’s API documentation.

For more on MLflow’s support for tensors and how to get started using them, see the docs on MLflow Models and MLflow Model Serving. Visit mlflow.org to learn more about open-source MLflow and check out the release page to see what’s new in MLflow 1.14. You can try the newly added tensor support in this notebook containing the above example.

--

Try Databricks for free. Get started today.

The post Tensor Input Now Supported in MLflow appeared first on Databricks.

Introducing Databricks on Google Cloud – Now in Public Preview


Last month, we announced Databricks on Google Cloud, a jointly-developed service that allows data teams (data engineering, data science, analytics, and ML professionals) to store data in a simple, open lakehouse platform for all data, AI and analytics workloads. Today, we are launching the public preview of Databricks on Google Cloud.

When speaking to our customers, one thing is clear: they want to build modern data architectures to drive real business impact, whether that’s by personalizing customer experiences with ML, improving in-product gaming experiences or delivering life-saving medical supplies, just to name a few. But many find themselves bogged down with unmanageable amounts of structured, unstructured and semi-structured data while simultaneously dealing with a variety of applications. For day-to-day work, data teams must stitch together various open source libraries and tools for analytics. Multiple handoffs between data science, ML engineering and deployment teams slow down development. The complexity and cost of transferring data between disparate systems, along with the challenge of managing multiple copies of data and multiple security models, add to the overhead.

With these pain points in mind, we believe the way to build a best-in-class Lakehouse platform is to build with open standards. Open standards, open APIs and an open platform give customers the choice to build their modern data architecture on services with a simple, collaborative experience. Google Cloud shares this vision of openness with an open cloud service, meaning our joint customers are free to choose the right set of tools to solve any problem.

Open data lake with Delta Lake and Google Cloud Storage

The open technology that allows us to unify analytics and artificial intelligence (AI) with a lakehouse on top of your existing data lake is Delta Lake. Data lakes are an affordable way to store large amounts of raw data (structured, unstructured, video, text, audio), often in open source formats such as the widely-used Apache Parquet. Yet, many companies struggle to run their analytics and AI applications in production as they face many of the challenges listed below.

The open technology that allows Databricks to unify analytics and artificial intelligence (AI) with a lakehouse on top of your existing data lake is Delta Lake.

Delta Lake is an open format storage layer that delivers reliability, performance and governance to solve these data lake challenges. Databricks on Google Cloud is based on Delta Lake and the Parquet format, so you can keep all your data in Google Cloud Storage (GCS) without having to move it or copy it to several places. This allows you to store and manage all your data for analytics on your data lake.
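For example, a minimal sketch of reading and writing Delta tables directly on GCS from a Databricks notebook (the bucket and paths are illustrative):

# Read an existing Delta table stored in a GCS bucket
events = spark.read.format("delta").load("gs://example-bucket/bronze/events")

# Write a refined copy back to GCS, still in Delta format
(events
    .filter("event_date >= '2021-01-01'")
    .write.format("delta")
    .mode("overwrite")
    .save("gs://example-bucket/silver/events_2021"))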

Faster experimentation with easy-to-use ML & AI services

Once you have an open data lake, you have laid the foundation for your data science teams to develop and train their machine learning models. With the availability of Databricks on Google Cloud, data scientists and ML engineers can use our collaborative data science and managed MLflow capabilities with the data in Google Cloud Storage or BigQuery Storage.

Databricks on Google Cloud is also integrated with Google Cloud’s suite of AI services. For example, you can deploy MLflow models to AI Platform Predictions for online serving or use AI Platform’s pre-trained ML APIs and AutoML for vision, video, translation and natural language processing.

With the availability of Databricks on Google Cloud, you can use our collaborative data science and managed MLflow capabilities with Google Cloud’s suite of AI services.

Conclusion

Databricks on Google Cloud brings a shared vision to combine the open platform with the open cloud for simplified data engineering, data science and data analytics. Want to learn more about how this joint solution unifies all your analytics and AI workloads? Register for our launch event with Databricks CEO and Co-founder Ali Ghodsi and Google Cloud CEO Thomas Kurian for a deep dive into the benefits of an open Lakehouse platform and how Databricks on Google Cloud drives data team collaboration.

GET STARTED FOR FREE

The post Introducing Databricks on Google Cloud – Now in Public Preview appeared first on Databricks.

Simplifying Data and ML Job Construction With a Streamlined UI


Databricks Jobs make it simple to run notebooks, Jars and Python eggs on a schedule. Our customers use Jobs to extract and transform data (ETL), train models and even email reports to their teams. Today, we are happy to announce a streamlined UI for jobs and new features designed to make your life easier.

The most obvious change is that instead of a single page containing all the information, there are now two tabs: Runs and Configuration. You use the Configuration tab to define the job, while the Runs tab contains active and historical runs. This small change allowed us to make room for new features:

New Databricks Jobs UI has a cleaner look.

While doing the facelift, we added a few more features. You can now easily clone a job, which is useful if you want to, say, change the cluster type without modifying a production job. We also added the ability to pause a job’s schedule, which lets you make changes to a job while preventing new runs until the changes are done. Lastly, as shown above, we added parameter variables, e.g. {{start_date}}, which are interpreted when a job run starts and replaced with an actual value such as “2021-03-17” (see the sketch below). A handful of parameter variables are already available, and we plan on expanding this list.
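A hedged sketch of how a notebook task might consume such a parameter; the widget name, default and table are assumptions:

# In the Job configuration, a notebook parameter named start_date can be given the
# value {{start_date}}; inside the notebook it is read back as a widget.
dbutils.widgets.text("start_date", "")
start_date = dbutils.widgets.get("start_date")   # e.g. "2021-03-17" at run time

daily = spark.read.table("ad_events").filter(f"event_date = '{start_date}'")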

We are excited to be improving the experience of developing jobs while adding useful features for data engineers and data scientists. The update to the Jobs UI also sets the stage for some exciting new capabilities we will announce in the coming months. Finally, we would love to hear from you — please use the feedback button in the UI to let us know what you think. Stay tuned for more updates!

--

Try Databricks for free. Get started today.

The post Simplifying Data and ML Job Construction With a Streamlined UI appeared first on Databricks.
