
A Tale About Vulnerability Research and Early Detection


This is a collaborative post between Databricks and Orca Security. We thank Yanir Tsarimi, Cloud Security Researcher at Orca Security, for their contribution.

 
Databricks’ number one priority is the safeguarding of our customer data. As part of our defense-in-depth approach, we work with security researchers in the community to proactively discover and remediate potential vulnerabilities in the Databricks platform so that they may be fixed before they become a risk to our customers. We do this through our private bug bounty and third-party penetration testing contracts.

We know that security is also top-of-mind for customers across the globe. So, to showcase our efforts we’d like to share a joint blog on a recent experience working with one of our security partners, Orca Security.

Orca Security, as part of an ongoing research effort, discovered a vulnerability in the Databricks platform. What follows below is Orca’s description of their process to discover the vulnerability, Databricks’ detection of Orca’s activities, and vulnerability response.

The vulnerability
In the Databricks workspace, a user can upload data and clone Git repositories to work with them inside a Databricks workspace. These files are stored within Databricks-managed cloud storage (e.g., AWS S3 object storage), in what Databricks refers to as a “file store.” Orca Security’s research was focused on Databricks features that work with these uploaded files – specifically Git repository actions. One specific feature had a security issue: the ability to upload files to Git repositories.

The upload is performed in two different HTTP requests:

  1. A user’s file is uploaded to the server. The server returns a UUID file name.
  2. The upload is “finalized” by submitting the UUID file name(s).

This procedure makes sense for uploading multiple files. Looking at the request sent to the server when confirming the file upload, the researcher noticed that the HTTP request is sent with three parameters:


{
"path": "the path from the first step",
"name": "file name to create in the git repo",
"storedInFileStore": false
}

The last parameter caught the Orca researcher’s eye. They already knew the “file store” is actually the cloud provider’s object storage, so it was not clear what “false” meant.

The researcher experimented with the upload requests and determined that after the first HTTP request, the uploaded file is stored locally on disk under “/tmp/import_xxx/…”; the temporary import directory is prepended to the uploaded file name. The Orca researcher then needed to determine whether a directory traversal attack was possible. This involved sending a request with a relative file name such as “../../../../etc/issue” and checking whether it worked. It did not: the backend checked for traversals and did not allow the researcher to complete the upload.

While Databricks had blocked relative-path traversal, further attempts showed that the system still had a vulnerability: it permitted the researcher to supply an absolute path such as “/etc/issue”. After uploading this file, the researcher was able to view its contents in the Databricks web console, an indication that they might be able to read arbitrary files from the server’s filesystem.
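A common root cause of this class of bug is naive path joining: when a base directory is joined with an attacker-controlled path, an absolute path silently replaces the base directory. Below is a minimal, hypothetical Python sketch of the pattern and of a stricter check; it is an illustration only, not Databricks’ actual backend code.

import os

BASE_DIR = "/tmp/import_xxx"

def naive_resolve(user_path: str) -> str:
    # os.path.join silently discards BASE_DIR when user_path is absolute.
    return os.path.join(BASE_DIR, user_path)

def safe_resolve(user_path: str) -> str:
    # Normalize the combined path, then verify it is still inside the base directory.
    base = os.path.realpath(BASE_DIR)
    candidate = os.path.realpath(os.path.join(base, user_path))
    if candidate != base and not candidate.startswith(base + os.sep):
        raise ValueError("path escapes the import directory")
    return candidate

print(naive_resolve("../../../etc/issue"))  # '/tmp/import_xxx/../../../etc/issue' (escapes once normalized)
print(naive_resolve("/etc/issue"))          # '/etc/issue' -- the base directory is gone

try:
    safe_resolve("/etc/issue")
except ValueError as exc:
    print(f"rejected: {exc}")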

To understand the severity of this issue without compromising customer data, the Orca Security researcher carefully tried reading files under “/proc/self/”. The researcher determined that they could obtain certain information by reading environment variables from “/proc/self/environ”, and a script iterating over “/proc/self/fd/XX” yielded read access to open log files. To ensure that no data was compromised, they paused the attack to alert Databricks of the findings.

Databricks’ detection and vulnerability response
Prior to notification, Databricks had already detected the anomalous behavior and begun to investigate, contain, and take countermeasures against further attempts — and contacted Orca Security even before Orca was able to report the issue.

The Databricks team, as part of its incident response procedures, identified the attack and the vulnerability and deployed a fix within just a few hours. Databricks also determined that the exposed environment information was no longer valid in the system at the time of the research. Databricks even identified the source of the requests and worked diligently with Orca Security to validate its detection and response and further protect customers.

Orca Security would like to applaud the efforts of the Databricks security team; to this day, this is the only time we’ve been detected while researching a system.

--

Try Databricks for free. Get started today.

The post A Tale About Vulnerability Research and Early Detection appeared first on Databricks.


OMB M-21-31: A Cost-Effective Alternative to Meeting and Exceeding Traditional SIEMs With Databricks


On August 29, 2021, the U.S. Office of Management and Budget (OMB) released a memo in accordance with the Biden Administration’s Executive Order (EO) 14028, Improving the Nation’s Cybersecurity. While the EO mandates that Federal Agencies adapt to today’s cybersecurity threat landscape, it doesn’t define specific implementation guidelines. However, the memo (M-21-31) describes a four-tiered maturity model for event management with detailed requirements for implementation. M-21-31 requires Federal Agencies to meet each rising level of maturity using their existing cybersecurity budget.

Early conversations with Federal Agencies have shown that their projected log collection storage requirements will increase by a factor of 4-10x. Since many Agencies use legacy Security Information and Event Management (SIEM) platforms to collect and monitor their logs, they are facing a massive increase in both the licensing and infrastructure cost for these solutions in order to meet the mandate.

Fortunately, there is an alternative architecture using the Databricks Lakehouse Platform for cybersecurity that Agencies can use to quickly, easily, and affordably meet M-21-31 requirements without forklifting operations or filtering the required raw logs. In this blog, we will discuss this architecture and how Databricks can be used to augment existing SIEM and Security Orchestration Automation and Response (SOAR) implementations. We will also provide an overview of M-21-31, the drawbacks of legacy SIEMs for fulfilling the mandate and how the Databricks approach addresses those issues while improving operational efficiency and reducing cost.

Improving investigative and remediation capabilities

Why is M-21-31 being issued now? Recent large-scale cyberattacks including SolarWinds, log4j, Colonial Pipeline, HAFNIUM and Kaseya, highlight the sophistication, complexity and increasing frequency of cyberattacks. In addition to costing the Federal government more than $4 million per incident in 2021, these cyber threats also pose a significant risk to national security. The government believes continuous monitoring of security data from an Agency’s entire attack surface during, and after incidents, is required in the detection, investigation and remediation of cyber threats. Agency-level security operations centers (SOC) also require security data to be democratized to improve collaboration for more effective incident response.

Maturity model for event log management

The maturity model described in M-21-31 guides Agencies through the implementation of requirements across four event logging (EL) tiers: EL0 – EL3.


The expectation is for Agencies to immediately begin to increase performance to reach full compliance with the requirements of EL3 by August 2023. The first deadline came in October 2021 when Agencies had to assess their current maturity against the model and identify resourcing and implementation gaps. From there, Agencies are expected to achieve tiers one through three every six months. Logging requirements and technical details by log category and retention period are provided for each type of data in the memo. Almost across the board, retention period requirements are 12 months for active storage and 18 months for cold data storage.

What’s an agency to do?

How does an agency go about meeting both the M-21-31 and SOC requirements specified in the memo? Generally speaking, M-21-31 is demanding that Chief Information Security Officers (CISOs) grow log collection by what many are measuring as 4-10x current ingest levels. The number of data sources being collected is expanding along with the retention, or lookback, period. In order to fulfill the mandate, the first question you need to answer is, how many terabytes of data does your agency ingest each day? From there, you can determine the increased licensing cost of your current SIEM, increased infrastructure cost and related administration costs. As this Total Cost of Ownership (TCO) for legacy SIEMs is directly related to data ingest, the cost of expansion for an existing architecture could be significant.

Traditional SIEM vs. SIEM augmentation

M-21-31 didn’t come with much warning and is an unfunded mandate. Agencies need a solution that can be implemented with existing resources and budget. Some Agencies are finding that the TCO of expanding their existing SIEM to increase licensing, storage, compute, and integration resources would cost tens of millions of dollars per year. This cost only increases if the legacy architecture is on-premises and requires additional egress costs for new cloud data sources.

SIEM augmentation using a cloud-based data lakehouse takes the benefits of legacy SIEMs and scales them to support the high volume data sources required by M-21-31. Open platforms that can be integrated with the IT and security toolchains provide choice and flexibility. A FedRAMP approved cloud platform allows you to run on the cloud environment you choose with stringent security enforcement for data protection. And integration with a scalable and highly-performant analytics platform, where compute and storage are decoupled, supports end-to-end streaming and batch processing workloads. No overhauling operations, specific expertise or extreme costs. Just an augmentation of the security architecture you’re already using.

The Databricks approach: Lakehouse + SIEM

For government agencies that are ready to modernize their security data infrastructure and analyze data at petabyte-scale more cost-effectively, Databricks provides an open lakehouse platform that helps democratize access to data for downstream analytics and Artificial Intelligence (AI).

The cyber data lakehouse is an open architecture that combines the best elements of data lakes and data warehouses and simplifies onboarding security data sources. The foundation for the lakehouse is Databricks Delta Lake, which supports structured, semi-structured, and unstructured data so Federal Agencies can collect and store all of the required logs from their security infrastructure. These raw security logs can be stored for years, in an open format, in the cloud object stores of Amazon Web Services (AWS), Microsoft Azure (Azure), or Google Cloud (GCP) to significantly reduce storage costs.
Databricks can be used to normalize raw security data to conform with Federal Agency taxonomies. The data can also be further processed to simplify the creation of Agency Security Scorecards and Security Posture reports. In addition, Databricks implements table access controls, a security model that grants different levels of access to security data based on each user’s assigned roles to ensure data access is tightly governed.
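As a rough illustration of that normalization step, the sketch below lands raw JSON security logs in a Delta table, maps a few fields onto an agency taxonomy, and restricts reads with table access controls. The paths, column names, and group name are hypothetical, and the GRANT statement assumes table access control is enabled on the workspace.

from pyspark.sql.functions import col, to_timestamp

# Hypothetical source location and target table; substitute your agency's own.
raw_path = "s3://agency-security-logs/raw/firewall/"
bronze_table = "cyber.firewall_bronze"

# Land raw logs in Delta, normalizing a few common fields to the agency taxonomy.
(spark.read.json(raw_path)
    .withColumn("event_time", to_timestamp(col("ts")))
    .withColumnRenamed("src", "source_ip")
    .withColumnRenamed("dst", "destination_ip")
    .write.format("delta")
    .mode("append")
    .saveAsTable(bronze_table))

# Table access controls: only the SOC analyst group may read the normalized table.
spark.sql(f"GRANT SELECT ON TABLE {bronze_table} TO `soc-analysts`")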

The cyber lakehouse is also an ideal platform for the implementation of detections and advanced analytics. Built on Apache Spark, Databricks is optimized to process large volumes of streaming and historic data for real-time threat analysis and incident response. Security teams can query petabytes of historic data stretching months or years into the past, making it possible to profile long-term threats and conduct deep forensic reviews to uncover infrastructure vulnerabilities. Databricks enables security teams to build predictive threat intelligence with a powerful, easy-to-use platform for developing AI and ML models. Data scientists can build machine-learning models that better score alerts from SIEM tools, reducing reviewer fatigue caused by too many false positives. Data scientists can also use Databricks to build machine learning models that detect anomalous behaviors existing outside of pre-defined rules and known threat patterns. To provide an example, last year Databricks published a blog on Detecting Criminals and Nation States through DNS Analytics. This blog includes a notebook that ingests passive DNS data into Delta Lake and performs advanced analytics to detect threats and find correlations in the DNS data with threat intelligence feeds.

Additionally, Databricks created a Splunk-certified add-on to augment Splunk for Enterprise Security (ES) for cost-efficient log and retention expansion. Designed for cloud-scale security operations, the add-on provides Splunk analysts with access to all data stored in the Lakehouse. Bi-directional pipelines between Splunk and Databricks allow agency analysts to integrate directly into Splunk visualizations and security workflows. Now you can interact with data stored within the lakehouse without leaving the Splunk User Interface (UI). And Splunk analysts can include Databricks data in their searches and Compliance/SOC dashboards.

The following diagram provides an overview of the proposed solution:

A Databricks Cyber "Multi-tier" Architecture

Databricks + Splunk: a cost-saving case study

Databricks integrates with the SIEM/SOAR/UEBA of your choice, but because a lot of agencies use Splunk, the Splunk-certified Databricks add-on can be used to meet both OMB and SOC needs. The following example features a global media telco’s security operation, however, the same add-on can be used by government agencies.

For this use case, the telco company wanted to implement exactly what M-21-31 is requiring agencies to do: expand lookback and data ingestion for better cybersecurity. Unfortunately, with Splunk alone, the more logs retained, the more expensive it gets to maintain. The Databricks add-on solves this problem by increasing the efficiency of Splunk.

Ingesting 35TB/day with 365-day lookbacks can potentially cost 10s of millions per year in Splunk Cloud. Databricks can be leveraged for big resources like DNS, Cloud Native, PCAP — all from the comfort of Splunk — without new personnel skillsets needed and at lower costs.

SIEM throughput comparison of Splunk alone vs. Splunk + Databricks, demonstrating the superior throughput and cost savings of the latter.


The diagram above represents the results of the Databricks add-on for Splunk versus Splunk alone and Splunk expanded. The telco organization grew throughput from 10TB per day with only 90 days look back, to 35TB per day with 365 days lookback using the Databricks SIEM augmentation. Despite the 250% increase in data throughput and more than quadrupling the lookback period, the total cost of ownership, including infrastructure and license, remained the same. Without the Databricks add-on, this expansion would have cost 10s of millions per year in the Splunk Cloud, even with significant discounts or remaining on-prem.
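As a quick sanity check on those figures, the small calculation below reproduces the throughput and lookback multiples quoted above; the per-terabyte cost is left out because it varies by contract.

# Throughput and lookback growth from the case study above.
old_tb_per_day, new_tb_per_day = 10, 35
old_lookback_days, new_lookback_days = 90, 365

throughput_increase = (new_tb_per_day - old_tb_per_day) / old_tb_per_day   # 2.5 -> 250%
lookback_multiple = new_lookback_days / old_lookback_days                  # ~4.1x

# Data retained at any point in time, ignoring compression and cold tiers.
old_retained_tb = old_tb_per_day * old_lookback_days    # 900 TB
new_retained_tb = new_tb_per_day * new_lookback_days    # 12,775 TB

print(f"Throughput increase: {throughput_increase:.0%}")
print(f"Lookback multiple: {lookback_multiple:.1f}x")
print(f"Retained data: {old_retained_tb:,} TB -> {new_retained_tb:,} TB "
      f"({new_retained_tb / old_retained_tb:.1f}x)")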

Because Databricks is an add-on to Splunk, your user interface doesn’t change and the user experience is seamless. With our Splunk-certified Databricks Connector app, integration, use, and adoption is quick and easy. From the comfort of the Splunk UI, agencies can keep existing processes and procedures, improve security posture, and reduce costs, while meeting the M-21-31 mandate.

Meeting the mandate while maximizing value at the lowest TCO

Of course, the nuances of your agency are what will determine TCO to fulfill the mandate within the time requirements. We are positive that the Databricks add-on for Splunk is the most efficient and cost-conscious solution to increasing logs and retention. That’s why Databricks created an editable ROI calculator to personalize your choices and let you weigh your options against your budget and available resources. With our expert resources guiding you through the calculator, you’ll have a clear understanding of how Databricks can help address your most pressing concerns and realize significant operational savings for OMB M-21-31.

Explore your cost-saving opportunities with Databricks as you navigate the M-21-31 mandate.

Sample calculator demonstrating cost-savings opportunities with Databricks for M-21-31 use cases.

What’s next

Contact us today for a demo and ROI exercise focused on helping you remain compliant with the OMB’s required timelines without going over budget or using unnecessary resources.

--

Try Databricks for free. Get started today.

The post OMB M-21-31: A Cost-Effective Alternative to Meeting and Exceeding Traditional SIEMs With Databricks appeared first on Databricks.

Saving Time and Costs With Cluster Reuse in Databricks Jobs


With our launch of Jobs Orchestration, orchestrating pipelines in Databricks has become significantly easier. The ability to separate ETL or ML pipelines over multiple tasks offers a number of advantages with regards to creation and management. With this modular approach, teams can define and work on their respective responsibilities independently, while allowing for parallel processing to reduce overall execution time. This capability was a major step in transforming how our customers create, run, monitor, and manage sophisticated data and machine learning workflows across any cloud. Today, we are excited to share further enhancement in our orchestration capabilities, with the ability to reuse the same cluster across multiple tasks in a job run, saving even more time and money for our customers.

Until now, each task had its own cluster to accommodate the different types of workloads. While this flexibility allows for fine-grained configuration, it can also introduce a time and cost overhead for cluster startup or underutilization during parallel tasks.

In order to maintain this flexibility, but further improve utilization, we are excited to announce cluster reuse. By sharing job clusters over multiple tasks customers can reduce the time a job takes, reduce costs by eliminating overhead and increase cluster utilization with parallel tasks.

When defining a task, customers will have the option to either configure a new cluster or choose an existing one. With cluster reuse, your list of existing clusters will now contain clusters defined in other tasks in the job. When multiple tasks share a job cluster, the cluster will be initialized when the first relevant task is starting. This cluster will stay on until the last task using this cluster is finished. This way there is no additional startup time after the cluster initialization, leading to a time/cost reduction while using the job clusters which are still isolated from other workloads.
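To make this concrete, here is a sketch of a Jobs API 2.1 payload in which a single job cluster is declared once and referenced by two tasks via its key; the cluster spec, notebook paths, and task names are placeholders.

# Sketch of a Jobs API 2.1 payload with a shared job cluster (all names are placeholders).
job_spec = {
    "name": "etl-with-shared-cluster",
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "10.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "ingest",
            "job_cluster_key": "shared_cluster",  # cluster defined once above
            "notebook_task": {"notebook_path": "/Pipelines/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "job_cluster_key": "shared_cluster",  # same cluster, no second startup
            "notebook_task": {"notebook_path": "/Pipelines/transform"},
        },
    ],
}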

We hope you are as excited as we are with this new functionality. Learn more about cluster reuse and start using shared Job clusters now to save startup time and cost. Please reach out if you have any feedback for us.

--

Try Databricks for free. Get started today.

The post Saving Time and Costs With Cluster Reuse in Databricks Jobs appeared first on Databricks.

How Butcherbox Uses Data Insights to Provide Quality Food Tailored to Each Customer’s Unique Taste


This is a guest authored post by Jake Stone, Senior Manager, Business Analytics at ButcherBox

The impact of a legacy data warehouse on business speed and agility

From the outside, the ButcherBox concept is simple – subscribe to our service and each month, receive a shipment of fresh meat and seafood that checks all the boxes: organic, grass-fed, free-range, crate-free, wild-caught, etc. But peel back the curtain, and you’ll find that the day-to-day demands on the team in charge of a monthly subscription-based model is a lot trickier than it seems.

As a young e-commerce company, ButcherBox has to be nimble as our customers’ needs change, which means we’re constantly considering behavioral patterns, distribution center efficiency, a growing list of marketing and communication channels, order processing systems— the list goes on.

With such intricate processes at our foundation, and so much data feeding in from different sources — from email systems to our website — the data team here at ButcherBox quickly discovered that data silos were a significant problem and blocked complete visibility into critical insights needed to make strategic and marketing decisions. Our data team also struggled to deliver reports and accurate insights in a timely manner. We knew we needed to migrate from our legacy data warehouse environment to a data analytics platform that would unify our data and make it easily accessible for quick analysis to improve supply chain operations, forecast demand, and, most importantly, keep up with our growing customer base.

True data visibility in the lakehouse fuels better decision making

Once ButcherBox deployed the Databricks Lakehouse Platform on Azure, analyzing data for business optimization was a breeze. Now, with direct visibility into all of our diverse range of data (e.g., customer, inventory, marketing impact, etc.) and granular permissions, our analytics team can safely and securely view the data as it comes in and export it in the format they need to make smarter decisions.

Additionally, Delta Lake provides us with a single source of truth for all of our data. Now our data engineers are able to build reliable data pipelines that thread the needle on key topics, such as inventory management, allowing us to identify in near real-time what our trends are so we can figure out how to effectively move inventory.

Databricks SQL has empowered our team to modernize our data warehousing capabilities to rapidly analyze data at scale without worrying about infrastructure, performance or data quality issues. It provides a simple yet powerful enterprise-level environment for data warehousing on the lakehouse platform, with a strong focus on visualization tools, which is a lot less intimidating than most analytics solutions and great for everyone— even those who only know SQL.

With data at our fingertips, we are much more confident knowing that we are using the most recent and complete data to feed our Power BI dashboards and reports. In many cases, we’ll also use Databricks SQL visualizations that have proven to be more flexible for the data team.

But key to our ability to get the most out of our data is fueled by the collaborative nature of the Databricks platform. Now, analysts are able to share their builds with whoever else has access to the platform, and collaborators can get in and add to the project without having to get into the code itself, making teamwork from the analyst over to the business user much more streamlined. Now we can generate insightful dashboards on the fly and share them with internal teams and work together to help move the business forward.

Looking ahead, we want to continue to democratize our data strategy across the company. In fact, we have established an analytics COE or Center of Excellence, and Databricks is at the core of that initiative. Our goal is to ensure every one of our analysts and business partners can access the data they need and start being effective as soon as they log on. Our only limiting factor is how quickly we can write SQL queries as opposed to how much time it would take to build out a dashboard and detail it.

Knowing your data means knowing your customers

Presently, ButcherBox has hundreds of thousands of subscribers. While this would have been too much data to dig through in the past, thanks to Databricks Lakehouse, we are able to effectively comb through these massive data sets. Being able to query a table of 18 billion rows would have been problematic with a traditional platform. With Databricks, we can do it in 3 minutes.

Now, with a much better window into who our customers are and what they want, it’s like we know each and every one of them personally. For example, we previously ran into various logistical and delivery issues that cost the company tens of thousands of dollars. With Databricks, we were able to access the data and apply advanced analytics to determine how to address the issues 10x faster, which has enabled us to explore significantly more ways to leverage data to solve complex business challenges.

We wouldn’t be where we are today without Databricks and the business value it provides. Simply put, the view the Databricks platform has given us has fundamentally changed the way we think about our member base and their behavior. It has fundamentally changed how we do business for the better.

--

Try Databricks for free. Get started today.

The post How Butcherbox Uses Data Insights to Provide Quality Food Tailored to Each Customer’s Unique Taste appeared first on Databricks.

Structured Streaming: A Year in Review


As we enter 2022, we want to take a moment to reflect on the great strides made on the streaming front in Databricks and Apache Spark™ ! In 2021, the engineering team and open source contributors made a number of advancements with three goals in mind:

  1. Lower latency and improve stateful stream processing
  2. Improve observability of Databricks and Spark Structured Streaming workloads
  3. Improve resource allocation and scalability

Ultimately, the motivation behind these goals was to enable more teams to run streaming workloads on Databricks and Spark, make it easier for customers to operate mission-critical production streaming applications on Databricks, and simultaneously optimize for cost effectiveness and resource usage.

Goal # 1: Lower latency & improved stateful processing

There are two new key features that specifically target lowering latencies with stateful operations, as well as improvements to the stateful APIs. The first is asynchronous checkpointing for large stateful operations, which improves upon a historically synchronous and higher latency design.

Asynchronous Checkpointing

The new Databricks “asynchronous checkpointing” feature targets lowering latencies in large stateful operations.

In the default synchronous model, state updates are written to a cloud storage checkpoint location before the next microbatch begins. The advantage is that if a stateful streaming query fails, the query can easily be restarted using the information from the last successfully completed batch. In the asynchronous model, the next microbatch does not have to wait for state updates to be written, improving the end-to-end latency of the overall microbatch execution.


You can learn more about this feature in an upcoming deep-dive blog post, and try it in Databricks Runtime 10.3 and above.
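For readers who want to experiment before that post lands, asynchronous checkpointing is enabled through a Spark configuration on the query's cluster. The flag name below is our best recollection of the documented setting and should be verified against the Databricks Runtime 10.3+ documentation; the table and paths are placeholders.

# Assumed configuration name; confirm against the Databricks async checkpointing docs.
spark.conf.set(
    "spark.databricks.streaming.statefulOperator.asyncCheckpoint.enabled", "true"
)

# A stateful aggregation that benefits from asynchronous state checkpoints.
(spark.readStream.format("delta").load("/data/events")
    .groupBy("key").count()
    .writeStream
    .option("checkpointLocation", "/checkpoints/async_demo")
    .outputMode("complete")
    .toTable("event_counts"))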

Arbitrary stateful operator improvements

In a much earlier post, we introduced Arbitrary Stateful Processing in Structured Streaming with [flat]MapGroupsWithState. These operators provide a lot of flexibility and enable more advanced stateful operations beyond aggregations. We’ve introduced improvements to these operators that:

  • Allow initial state, avoiding the need to reprocess all your streaming data.
  • Enable easier logic testing by exposing a new TestGroupState interface, allowing users to create instances of GroupState and access internal values for what has been set, simplifying unit tests for the state transition functions.

Allow Initial State

Let’s start with the following flatMapGroupsWithState operator:

def flatMapGroupsWithState[S: Encoder, U: Encoder](
    outputMode: OutputMode,
    timeoutConf: GroupStateTimeout,
    initialState: KeyValueGroupedDataset[K, S])(
    func: (K, Iterator[V], GroupState[S]) => Iterator[U])

This custom state function maintains a running count of fruit that have been encountered.

val fruitCountFunc = (key: String, values: Iterator[String], state: GroupState[RunningCount]) => {
  val count = state.getOption.map(_.count).getOrElse(0L) + values.size
  state.update(new RunningCount(count))
  Iterator((key, count.toString))
}

In this example, we specify the initial state for this operator by setting starting values for certain fruits:

val fruitCountInitialDS: Dataset[(String, RunningCount)] = Seq(
  ("apple", new RunningCount(1)),
  ("orange", new RunningCount(2)),
  ("mango", new RunningCount(5)),
).toDS()

val fruitCountInitial = fruitCountInitialDS.groupByKey(x => x._1).mapValues(_._2)

fruitStream
  .groupByKey(x => x)
  .flatMapGroupsWithState(Update, GroupStateTimeout.NoTimeout, fruitCountInitial)(fruitCountFunc)

Easier Logic Testing

You can also now test state updates using the TestGroupState API.

import org.apache.spark.sql.streaming._
import org.apache.spark.api.java.Optional

test("flatMapGroupsWithState's state update function") {
  var prevState = TestGroupState.create[UserStatus](
    optionalState = Optional.empty[UserStatus],
    timeoutConf = GroupStateTimeout.EventTimeTimeout,
    batchProcessingTimeMs = 1L,
    eventTimeWatermarkMs = Optional.of(1L),
    hasTimedOut = false)

  val userId: String = ...
  val actions: Iterator[UserAction] = ...

  assert(!prevState.hasUpdated)

  updateState(userId, actions, prevState)

  assert(prevState.hasUpdated)
  
}

You can find these, and more examples in the Databricks documentation.

Native support for Session Windows

Structured Streaming introduced the ability to do aggregations over event-time based windows using tumbling or sliding windows, both of which are windows of fixed-length. In Spark 3.2, we introduced the concept of session windows, which allow dynamic window lengths. This historically required custom state operators using flatMapGroupsWithState.

An example of using dynamic gaps:

from pyspark.sql.functions import session_window, when

# Define the session window having dynamic gap duration based on eventType
session_window_expr = session_window(events.timestamp, \
    when(events.eventType == "type1", "5 seconds") \
    .when(events.eventType == "type2", "20 seconds") \
    .otherwise("5 minutes"))

# Group the data by session window and userId, and compute the count of each group
windowedCountsDF = events \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(events.userID, session_window_expr) \
    .count()

Goal #2: Improve observability of streaming workloads

While the StreamingQueryListener API allows you to asynchronously monitor queries within a SparkSession and define custom callback functions for query state, progress, and terminated events, understanding back pressure and reasoning about where the bottlenecks are in a microbatch were still challenging. As of Databricks Runtime 8.1, the StreamingQueryProgress object reports data source specific back pressure metrics for Kafka, Kinesis, Delta Lake and Auto Loader streaming sources.

An example of the metrics provided for Kafka:

{
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[topic]]",
    "metrics" : {
      "avgOffsetsBehindLatest" : "4.0",
      "maxOffsetsBehindLatest" : "4",
      "minOffsetsBehindLatest" : "4",
      "estimatedTotalBytesBehindLatest" : "80.0"
    }
  } ]
}

Databricks Runtime 8.3 introduces real-time metrics to help understand the performance of the RocksDB state store and debug the performance of state operations. These can also help identify target workloads for asynchronous checkpointing.

An example of the new state store metrics:

{
  "id" : "6774075e-8869-454b-ad51-513be86cfd43",
  "runId" : "3d08104d-d1d4-4d1a-b21e-0b2e1fb871c5",
  "batchId" : 7,
  "stateOperators" : [ {
    "numRowsTotal" : 20000000,
    "numRowsUpdated" : 20000000,
    "memoryUsedBytes" : 31005397,
    "numRowsDroppedByWatermark" : 0,
    "customMetrics" : {
      "rocksdbBytesCopied" : 141037747,
      "rocksdbCommitCheckpointLatency" : 2,
      "rocksdbCommitCompactLatency" : 22061,
      "rocksdbCommitFileSyncLatencyMs" : 1710,
      "rocksdbCommitFlushLatency" : 19032,
      "rocksdbCommitPauseLatency" : 0,
      "rocksdbCommitWriteBatchLatency" : 56155,
      "rocksdbFilesCopied" : 2,
      "rocksdbFilesReused" : 0,
      "rocksdbGetCount" : 40000000,
      "rocksdbGetLatency" : 21834,
      "rocksdbPutCount" : 1,
      "rocksdbPutLatency" : 56155599000,
      "rocksdbReadBlockCacheHitCount" : 1988,
      "rocksdbReadBlockCacheMissCount" : 40341617,
      "rocksdbSstFileSize" : 141037747,
      "rocksdbTotalBytesReadByCompaction" : 336853375,
      "rocksdbTotalBytesReadByGet" : 680000000,
      "rocksdbTotalBytesReadThroughIterator" : 0,
      "rocksdbTotalBytesWrittenByCompaction" : 141037747,
      "rocksdbTotalBytesWrittenByPut" : 740000012,
      "rocksdbTotalCompactionLatencyMs" : 21949695000,
      "rocksdbWriterStallLatencyMs" : 0,
      "rocksdbZipFileBytesUncompressed" : 7038
    }
  } ],
  "sources" : [ {
  } ],
  "sink" : {
  }
}

Goal # 3: Improve resource allocation and scalability

Streaming Autoscaling with Delta Live Tables (DLT)

At Data + AI Summit last year, we announced Delta Live Tables, which is a framework that allows you to declaratively build and orchestrate data pipelines, and largely abstracts the need to configure clusters and node types. We’re taking this a step further and introducing an intelligent autoscaling solution for streaming pipelines that improves upon the existing Databricks Optimized Autoscaling. These benefits include:

  • Better Cluster Utilization: The new algorithm takes advantage of the new back pressure metrics to adjust cluster sizes to better handle scenarios in which there are fluctuations in streaming workloads, which ultimately leads to better cluster utilization.

  • Proactive Graceful Worker Shutdown: While the existing autoscaling solution retires nodes only if they are idle, the new DLT autoscaler will proactively shut down selected nodes when utilization is low, while simultaneously guaranteeing that there will be no failed tasks due to the shutdown.

As of writing, this feature is currently in Private Preview. Please reach out to your account team for more information.

Trigger.AvailableNow

In Structured Streaming, triggers allow a user to define the timing of a streaming query’s data processing. These trigger types can be micro-batch (default), fixed interval micro-batch (Trigger.ProcessingTime("")), one-time micro-batch (Trigger.Once), and continuous (Trigger.Continuous).

Databricks Runtime 10.1 introduces a new type of trigger, Trigger.AvailableNow, which is similar to Trigger.Once but provides better scalability. Like Trigger.Once, all available data will be processed before the query is stopped, but in multiple batches instead of one. This is supported for Delta Lake and Auto Loader streaming sources.

Example:

spark.readStream
  .format("delta")
  .option("maxFilesPerTrigger", "1")
  .load(inputDir)
  .writeStream
  .trigger(Trigger.AvailableNow)
  .option("checkpointLocation", checkpointDir)
  .start()

Summary

As we head into 2022, we will continue to accelerate innovation in Structured Streaming, further improving performance, decreasing latency and implementing new and exciting features. Stay tuned for more information throughout the year!

--

Try Databricks for free. Get started today.

The post Structured Streaming: A Year in Review appeared first on Databricks.

Simplify Your Forecasting With Databricks AutoML


Last year, we announced Databricks AutoML for Classification and Regression and showed the importance of having a glass box approach to empower data teams. Today, we are happy to announce that we’re extending those capabilities to forecasting problems with AutoML for Forecasting.

Data teams can easily create forecasts entirely through a UI. These generated forecasts can be used as is or as starting points for further tweaking. Simplifying and reducing the time to start is particularly important in forecasting because stakeholders are often looking at hundreds or even thousands of different forecasts for different products, territories, stores and so forth, which can lead to a backlog of unstarted forecasts. AutoML for Forecasting augments data teams and helps them to quickly verify the predictive power of a dataset, as well as get a baseline model to guide the direction of a forecasting project.

Let’s take a look at how easy it is to get a forecast with AutoML.

Example: Forecasting candy production

With Valentine’s Day coming up soon, we want to forecast the production of candy in the next few weeks.

How it works

A setup wizard guides us through what we need to configure in order to get started. We choose the “Forecasting” problem type and select the dataset. In this example, we’re using a candy production quantity dataset that we had already created as a table in Databricks, running on Databricks Runtime 10.3. Here we’re also able to specify whether we want to perform univariate or multi-series forecasting.

Follow the setup wizard to easily create your AutoML experiment

Once started, AutoML performs any necessary data prep, trains multiple models using the Prophet and ARIMA algorithms, and performs hyperparameter tuning with Hyperopt for each time series being forecasted, all running fully in parallel with Apache Spark™. When AutoML finishes running, we can see the different models that were trained and their performance metrics (e.g., SMAPE and RMSE) to evaluate the best ones.

Augmenting data teams

Next, we can see that AutoML detected that one of the types of candy, “mixed”, did not have enough data to produce a forecast and notified us through a warning.

AutoML transparently shows you alerts on important steps that were performed in the modeling

The best part about AutoML is that everything is transparent. AutoML will provide warnings on important steps that were performed or even skipped based on our data. This gives us the opportunity to use our knowledge of the data and make any necessary updates to the models.

AutoML makes this easy by also allowing us to look at the full Python notebooks for each of the models trained and a data exploration notebook that highlights insights about the data used for the models. In the data exploration notebook, we’re able to confirm that removing the “mixed” candy type will not impact our forecast as we can see that it only had two data points.

Automatically generated data exploration notebooks allow you to quickly understand your data

These notebooks can be great starting points for data scientists by allowing them to bring in their domain knowledge to make updates to models that were automatically generated.

To see what the predicted production of candy is going to look like, we can select the notebook of the best performing model and view the included plot of the actual candy production vs the forecasts, including those for January 2022 to March 2022.

The model notebooks include a view of your forecasts in context of your actual data

In addition to making predictions, AutoML Forecast provides more analysis of the forecast in the notebooks. Here, we can see how trends and seasonality factored into the predictions. Overall, it looks like candy production tends to peak from October to December, which aligns with Halloween and the holidays, but has a slight spike in production again in February, just in time for Valentine’s Day.

Get additional insights about the generated forecasts

Now that we’ve identified which model to use, we can register it by clicking the model name or start time from the list of runs and then clicking the “Register Model” button. From here, we can set up model serving and deploy our model for inference and predictions.

Register, serve, and deploy models from AutoML

Get started with Databricks AutoML public preview

Databricks AutoML is in Public Preview as part of the Databricks Machine Learning experience. To get started:

In the Databricks UI, simply switch to the “Machine Learning” experience via the left sidebar. Click “(+) Create” and then “AutoML Experiment,” or navigate to the Experiments page and click “Create AutoML Experiment.” Alternatively, use the AutoML API, a single-line call, which can be seen in our documentation.
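As a rough sketch of that single-line API call, assuming a table and column names made up for this example (check the AutoML documentation for the exact parameter names):

import databricks.automl

# Hypothetical table and columns; horizon and frequency chosen for a weekly forecast.
df = spark.table("default.candy_production")

summary = databricks.automl.forecast(
    dataset=df,
    target_col="production",
    time_col="date",
    horizon=8,             # forecast 8 periods ahead
    frequency="W",         # weekly data
    primary_metric="smape",
)

print(summary.best_trial.model_path)  # MLflow path of the best model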

Ready to try Databricks AutoML out for yourself? Read more about Databricks AutoML and how to use it on AWS, Azure, and GCP or take the AutoML Forecasting course (available for Databricks customers with a Databricks Academy login).

If you’re new to AutoML, be sure to join us for a live demo with our friends at Fabletics on Feb 10 at 10AM PT. We’ll be covering the fundamentals of AutoML, and walk you through how – no matter what your role – you can leverage AutoML to jumpstart and simplify your ML projects. Grab a seat!

--

Try Databricks for free. Get started today.

The post Simplify Your Forecasting With Databricks AutoML appeared first on Databricks.

Using Apache Flink With Delta Lake


As with all parts of our platform, we are constantly raising the bar and adding new features to enhance developers’ abilities to build the applications that will make their Lakehouse a reality. Building real-time applications on Databricks is no exception. Features like asynchronous checkpointing, session windows, and Delta Live Tables allow organizations to build even more powerful, real-time pipelines on Databricks using Delta Lake as the foundation for all the data that flows through the Lakehouse.

However, for organizations that leverage Flink for real-time transformations, it might appear that they are unable to take advantage of some of the great Delta Lake and Databricks features, but that is not the case. In this blog we will explore how Flink developers can build pipelines to integrate their Flink applications into the broader Lakehouse architecture.

High-level diagram of Flink application to Delta Lake data flows

A stateful Flink application

Let’s use a credit card company to explore how we can do this.

For credit card companies, preventing fraudulent transactions is table-stakes for a successful business. Credit card fraud poses both reputational and revenue risk to a financial institution and, therefore, credit card companies must have systems in place to remain constantly vigilant in preventing fraudulent transactions. These organizations may implement monitoring systems using Apache Flink, a distributed event-at-a-time processing engine with fine-grained control over streaming application state and time.

Below is a simple example of a fraud detection application in Flink. It monitors transaction amounts over time and sends an alert if a small transaction is immediately followed by a large transaction within one minute for any given credit card account. By leveraging Flink’s ValueState data type and KeyedProcessFunction together, developers can implement their business logic to trigger downstream alerts based on event and time states.

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector
import org.apache.flink.walkthrough.common.entity.Alert
import org.apache.flink.walkthrough.common.entity.Transaction

object FraudDetector {
  val SMALL_AMOUNT: Double = 1.00
  val LARGE_AMOUNT: Double = 500.00
  val ONE_MINUTE: Long     = 60 * 1000L
}

@SerialVersionUID(1L)
class FraudDetector extends KeyedProcessFunction[Long, Transaction, Alert] {

  @transient private var flagState: ValueState[java.lang.Boolean] = _
  @transient private var timerState: ValueState[java.lang.Long] = _

  @throws[Exception]
  override def open(parameters: Configuration): Unit = {
    val flagDescriptor = new ValueStateDescriptor("flag", Types.BOOLEAN)
    flagState = getRuntimeContext.getState(flagDescriptor)

    val timerDescriptor = new ValueStateDescriptor("timer-state", Types.LONG)
    timerState = getRuntimeContext.getState(timerDescriptor)
  }

  override def processElement(
      transaction: Transaction,
      context: KeyedProcessFunction[Long, Transaction, Alert]#Context,
      collector: Collector[Alert]): Unit = {

    // Get the current state for the current key
    val lastTransactionWasSmall = flagState.value

    // Check if the flag is set
    if (lastTransactionWasSmall != null) {
      if (transaction.getAmount > FraudDetector.LARGE_AMOUNT) {
        // Output an alert downstream
        val alert = new Alert
        alert.setId(transaction.getAccountId)

        collector.collect(alert)
      }
      // Clean up our state
      cleanUp(context)
    }

    if (transaction.getAmount < FraudDetector.SMALL_AMOUNT) {
      // set the flag to true
      flagState.update(true)
      val timer = context.timerService.currentProcessingTime + FraudDetector.ONE_MINUTE

      context.timerService.registerProcessingTimeTimer(timer)
      timerState.update(timer)
    }
  }

 override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction[Long, Transaction, Alert]#OnTimerContext,
      out: Collector[Alert]): Unit = {
    // remove flag after 1 minute
    timerState.clear()
    flagState.clear()
  }

  @throws[Exception]
  private def cleanUp(ctx: KeyedProcessFunction[Long, Transaction, Alert]#Context): Unit = {
    // delete timer
    val timer = timerState.value
    ctx.timerService.deleteProcessingTimeTimer(timer)

    // clean up all states
    timerState.clear()
    flagState.clear()
  }
}

In addition to sending alerts, most organizations will want the ability to perform analytics on all the transactions they process. Fraudsters are constantly evolving the techniques they use in the hopes of remaining undetected, so it is quite likely that a simple heuristic-based fraud detection application, such as the above, will not be sufficient for preventing all fraudulent activity. Organizations leveraging Flink for alerting will also need to combine disparate data sets to create advanced fraud detection models that analyze more than just transactional data, but include data points such as demographic information of the account holder, previous purchasing history, time and location of transactions, and more.

Integrating Flink applications using cloud object store sinks with Delta Lake

Diagram showing data flow from a Flink application to cloud object storage for consumption by Auto Loader into Delta Lake

There is a tradeoff between very low-latency operational use-cases and running performant OLAP on big datasets. To meet operational SLAs and prevent fraudulent transactions, records need to be produced by Flink nearly as quickly as events are received, resulting in small files (on the order of a few KBs) in the Flink application’s sink. This “small file problem” can lead to very poor performance in downstream queries, as execution engines spend more time listing directories and pulling files from cloud storage than they do actually processing the data within those files. Consider the same fraud detection application that writes transactions as parquet files with the following schema:

root
 |-- dt: timestamp (nullable = true)
 |-- accountId: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- alert: boolean (nullable = true)

Fortunately, Databricks Auto Loader makes it easy to stream data landed into object storage from Flink applications into Delta Lake tables for downstream ML and BI on that data.

from pyspark.sql.functions import col, date_format

data_path = "/demo/flink_delta_blog/transactions"
delta_silver_table_path = "/demo/flink_delta_blog/silver_transactions"
checkpoint_path = "/demo/flink_delta_blog/checkpoints/delta_silver"

flink_parquet_schema = spark.read.parquet(data_path).schema

# Enable Auto Optimize to handle the small file problem
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

flink_parquet_to_delta_silver = (spark.readStream.format("cloudFiles")
                                 .option("cloudFiles.format", "parquet")
                                 .schema(flink_parquet_schema)
                                 .load(data_path)
                                 .withColumn("date", date_format(col("dt"), "yyyy-MM-dd"))  # use for partitioning the downstream Delta table
                                 .withColumnRenamed("dt", "timestamp")
                                 .writeStream
                                 .format("delta")
                                 .option("checkpointLocation", checkpoint_path)
                                 .partitionBy("date")
                                 .start(delta_silver_table_path)
                                )

Delta Lake tables automatically optimize the physical layout of data in cloud storage through compaction and indexing to mitigate the small file problem and enable performant downstream analytics.

-- Further optimize the physical layout of the table using ZORDER.
OPTIMIZE delta.`/demo/flink_delta_blog/silver_transactions`
ZORDER BY (accountId)

Much like Auto-Loader can transform a static source like cloud storage into a streaming datasource, Delta Lake tables also function as streaming sources despite being stored in object storage. This means that organizations using Flink for operational use cases can leverage this architectural pattern for streaming analytics without sacrificing their real-time requirements.

streaming_delta_silver_table = (spark.readStream.format("delta")
                                .load(delta_silver_table_path)
                                # ... additional streaming ETL and/or analytics here...
                               )

Integrating Flink applications using Apache Kafka and Delta Lake

Let’s say the credit card company wanted to use the fraud detection model they built in Databricks to score the data in real time. Pushing files to cloud storage might not be fast enough for some SLAs around fraud detection, so they can write data from their Flink application to message bus systems like Kafka, AWS Kinesis, or Azure Event Hubs. Once the data is written to Kafka, a Databricks job can read from Kafka and write to Delta Lake.

Focused diagram showing the flow of data from a raw stream of data to Delta Lake using Flink and Kafka

For Flink developers, there is a Kafka Connector that can be integrated with your Flink projects to allow DataStream API and Table API-based streaming jobs to write their results to an organization’s Kafka cluster. Note that as of the writing of this blog, Flink does not come packaged with this connector, so you will need to include the Kafka Connector JAR in your project’s build file (e.g., pom.xml or build.sbt).

Here is an example of how you would write the results of your DataStream in Flink to a topic on the Kafka Cluster:

package spendreport;

import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.walkthrough.common.entity.Transaction;
import org.apache.flink.walkthrough.common.source.TransactionSource;

public class FraudDetectionJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Transaction> transactions = env
            .addSource(new TransactionSource())
            .name("transactions");

        String brokers = "enter-broker-information-here";

        // TransactionSchema is the application's serializer for Transaction records.
        KafkaSink<Transaction> sink = KafkaSink.<Transaction>builder()
            .setBootstrapServers(brokers)
            .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("transactions")
                .setValueSerializationSchema(new TransactionSchema())
                .build())
            .build();

        transactions.sinkTo(sink);

        env.execute("Fraud Detection");
    }
}

Now you can easily leverage Databricks to write a Structured Streaming application that reads from the Kafka topic the Flink DataStream wrote its results to. To establish the read from Kafka...

from pyspark.sql.functions import col, from_json

kafka = (spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_bootstrap_servers_plaintext)
  .option("subscribe", "fraud-events")
  .option("startingOffsets", "latest")
  .load())

kafkaTransformed = (kafka
  .select(from_json(col("value").cast("string"), schema))
  # ...additional transformations
  )

Once the data has been schematized, we can load our model and score the microbatch of data that Spark processes after each trigger. For a more detailed example of Machine Learning models and Structured streaming, check this article out in our documentation.

from pyspark.ml import PipelineModel
from pyspark.sql.functions import col, count, sum, when

# Load the fitted pipeline saved from training.
pipelineModel = PipelineModel.load("/path/to/trained/model")

streamingPredictions = (pipelineModel.transform(kafkaTransformed)
  .groupBy("id")
  .agg(
    (sum(when(col("prediction") == col("label"), 1)) / count(col("label"))).alias("true prediction rate"),
    count(col("label")).alias("count")
  ))

Now we can write to Delta Lake by configuring the writeStream and pointing it to our fraud_predictions Delta Lake table. This will allow us to build important reports on how we track and handle fraudulent transactions for our customers; we can even use the outputs to understand how our model performs over time in terms of how many false positives or accurate assessments it produces.

streamingPredictions.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/location/in/cloud/storage") \
    .table("fraud_predictions")

Conclusion

With both of these options, Flink and Autoloader or Flink and Kafka, organizations can still leverage the features of Delta Lake and ensure they are integrating their Flink applications into their broader Lakehouse architecture. Databricks has also been working with the Flink community to build a direct Flink to Delta Lake connector, which you can read more about here.

--

Try Databricks for free. Get started today.

The post Using Apache Flink With Delta Lake appeared first on Databricks.

Databricks Delta Live Tables Announces Support for Simplified Change Data Capture


As organizations adopt the data lakehouse architecture, data engineers are looking for efficient ways to capture continually arriving data. Even with the right tools, implementing this common use case can be challenging to execute – especially when replicating operational databases into their lakehouse or reprocessing data for each update. Using a reliable ETL framework to develop, monitor, manage and operationalize data pipelines at scale, we have made it easy to implement change data capture (CDC) into Delta Lake with Delta Live Tables (DLT), giving users:

  • Simplicity and convenience: Easy-to-use APIs for identifying changes, making your code simple, convenient and easy to understand.
  • Efficiency: The ability to only insert or update rows that have changed, with efficient merge, update and delete operations.
  • Scalability: The ability to capture and apply data changes across tens of thousands of tables with low-latency support.

Delta Live Tables enables data engineers to simplify data pipeline development and maintenance, enables data teams to self-serve and innovate rapidly, provides built-in quality controls and monitoring to ensure accurate and useful BI, data science and ML, and lets you scale with reliability through deep visibility into pipeline operations, automatic error handling, and auto-scaling capabilities.

With DLT, data engineers can easily implement CDC with a new declarative APPLY CHANGES INTO API, in either SQL or Python. This new capability lets ETL pipelines easily detect source data changes and apply them to data sets throughout the lakehouse. DLT processes data changes into the Delta Lake incrementally, flagging records to be inserted, updated or deleted when handling CDC events. The example below shows how easy it is to identify and delete records from a customer table using the new API:

CREATE STREAMING LIVE TABLE customer_silver;

APPLY CHANGES INTO live.customer_silver
FROM stream(live.customer_bronze)
 KEYS (id)
 APPLY AS DELETE WHEN active = 0
 SEQUENCE BY update_dt
;

The default behavior is to upsert the CDC events from the source by automatically updating any row in the target table that matches the specified key(s) and insert a new row if there’s no preexisting match in the target table. DELETE events may also be handled by specifying the APPLY AS DELETE WHEN condition. APPLY CHANGES INTO is available in all regions. For more information, refer to the documentation (Azure, AWS, GCP) or check out an example notebook.
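For teams working in Python, a roughly equivalent pipeline can be sketched with the DLT Python API (a sketch only; function and parameter names have shifted slightly across DLT releases, so check the documentation for your version):

import dlt
from pyspark.sql.functions import col, expr

# Declare the target streaming table (older releases expose this as dlt.create_target_table)
dlt.create_streaming_table("customer_silver")

# Upsert changes from the bronze stream, deleting rows when active = 0
dlt.apply_changes(
    target = "customer_silver",
    source = "customer_bronze",
    keys = ["id"],
    sequence_by = col("update_dt"),
    apply_as_deletes = expr("active = 0")
)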

--

Try Databricks for free. Get started today.

The post Databricks Delta Live Tables Announces Support for Simplified Change Data Capture appeared first on Databricks.


A Breakup Letter to Data Warehouses

Dear Data Warehouse,

We have been trying to make it work for a long time, some would say too long, and it’s just not working anymore. I want to say “it’s not you, it’s me”, but actually – it is you. You have too many issues that make working with you limited and challenging. It’s time for me to find a new way.

You’re limited

When we started working together, a place for all my structured data was the only thing I needed. But now, I have grown past you. In order for my data teams to innovate with data science and machine learning, I need to work with unstructured and semi-structured data too, and you can’t support me in this. Staying with you would mean I’ll fall behind while others move forward. It’s just too limiting, I can’t take it.

You lock me in

I don’t know how to say this lightly, but I want to work with other systems. Your proprietary lock-in just means that you are gaining, while I am losing. The world is moving to be more open, and I need to too.

You’re expensive

Lastly, you’re too high maintenance – and it’s hurting my wallet. I have other priorities and can’t invest all my budget in you alone, especially when there are better alternatives out there. Since keeping costs low is extremely important, and efficiency matters, you’re just going to have to go.

--

Try Databricks for free. Get started today.

The post A Breakup Letter to Data Warehouses appeared first on Databricks.

Deploy Production Pipelines Even Easier With Python Wheel Tasks

With its rich open source ecosystem and approachable syntax, Python has become the main programming language for data engineering and machine learning. Data and ML engineers already use Databricks to orchestrate pipelines using Python notebooks and scripts. Today, we are proud to announce that Databricks can now run Python wheels, making it easy to develop, package and deploy more complex Python data and ML pipeline code.

Python wheel tasks can be executed on both interactive clusters and on job clusters as part of jobs with multiple tasks. All the output is captured and logged as part of the task execution so that it is easy to understand what happened without having to go into cluster logs.

The wheel package format allows Python developers to package a project’s components so they can be easily and reliably installed in another system. Just like the JAR format in the JVM world, a wheel is a compressed, single-file build artifact, typically the output of a CI/CD system. Similar to a JAR, a wheel contains not only your source code but references to all of its dependencies as well.

To run a job with a wheel, first build the Python wheel locally or in a CI/CD pipeline, then upload it to cloud storage. Specify the path of the wheel in the task and choose the method to be executed as the entry point. Task parameters are passed to your main method via *args or **kwargs.
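As an illustration, the entry point inside the wheel can be as simple as the following (the package, module and function names here are hypothetical, and the setup.py excerpt is included only to show where the entry point gets registered):

# my_package/entry.py -- hypothetical module packaged into the wheel
def main(*args, **kwargs):
    # Parameters configured on the Databricks task arrive here
    print(f"positional parameters: {args}")
    print(f"named parameters: {kwargs}")

# setup.py excerpt (hypothetical) -- registers "main" as an entry point in the wheel metadata:
#
# entry_points={
#     "group_1": ["main=my_package.entry:main"],
# },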

Python Wheel tasks in Databricks Jobs are now Generally Available. We would love for you to try out this capability and tell us how we can better support Python data engineers.

--

Try Databricks for free. Get started today.

The post Deploy Production Pipelines Even Easier With Python Wheel Tasks appeared first on Databricks.

Lakehouse for Financial Services: Paving the Way for Data-Driven Innovation in FSIs

When it comes to “data-driven innovation,” financial services institutions (FSIs) aren’t typically what comes to mind. But with massive amounts of data at their potential disposal, this isn’t for lack of imagination. FSIs want to innovate but are continually slowed down by complex legacy architectures and vendor lock-in that prevent data and AI from becoming material business drivers.

Largely as a result of these challenges, the financial services industry has arguably seen little innovation in recent decades – even as other regulated sectors such as healthcare and education continue to break barriers. Even for the most established incumbents, a lack of innovation can quickly lead to being taken over by a new, digital-native company – a move some of us at Databricks call Tesla-fication. This is where one disruptive, data and AI-driven innovator becomes disproportionately more successful than the incumbents who previously dominated the space. One indication of this success can be found in the stock market. Today, Tesla boasts a $900+ billion market capitalization, making it worth more than the next 10 leading automotive competitors combined. Incumbency is no longer a moat.

In fact, we’re already starting to see Tesla-fication happening in financial services. Nubank, a Brazilian fintech launched in 2014, has quickly changed the competitive dynamics in its home country and beyond. Early on, Nubank disrupted the credit card market, by enabling online applications, as well as by extending credit to those with no credit history. Today, it uses bleeding-edge technology, data and AI to develop new products and services. Data science plays an essential role in every aspect of their business – from customer support to credit lines. Seven years after their launch, in December of 2021, Nubank became one of the largest IPOs in Latin America and briefly eclipsed the market capitalization of Brazil’s largest bank. Signs of Tesla-fication are emerging across all segments of financial services, from banking to insurance to capital markets. For FSIs, this means that the traditional sources of competitive advantage – capital and scale – no longer cut it. Today, transformation requires leaders to focus their investments on two modern sources of competitive advantage: data and people.

Introducing Lakehouse for Financial Services

Today, we’re thrilled to introduce Lakehouse for Financial Services to help bring data and people together for every FSI. Lakehouse for Financial Services addresses the unique requirements of FSIs via industry-focused capabilities, such as pre-built solutions accelerators, data sharing capabilities, open standards and certified implementation partners. With this platform, organizations across the banking, insurance and capital market sectors can increase the impact and time-to-value of their data assets, ultimately enabling data and AI to become central to every part of their business – from lending to insuring.

So, why is Lakehouse for Financial Services critical for success? When speaking with our customers, we identified the biggest challenges around transforming into a data-driven organization (and how Lakehouse addresses them):

  • Risk of vendor lock-in: FSIs are particularly vulnerable to being stuck with proprietary data formats and technologies that stifle collaboration and innovation. Lakehouse is powered by open source and open standards, meaning that data teams can leverage the tools of their choice.
  • No multi-cloud: Increasingly, regulators are asking FSIs to consider systemic risk arising from overreliance on a single vendor. Lakehouse solves this by offering full support for all major cloud vendors.
  • Real-time data access for BI: The most recent data is typically the most valuable, but traditional architectures often make it hard for data analysts to access. With Lakehouse, data teams across functions can always access the most up-to-date, reliable data.
  • Lack of support for all data sets: The fastest-growing data in FSIs is unstructured data sets (text, images, etc), which makes data warehouses less than ideal for critical use cases. Lakehouse handles all types of data – structured, semi-structured and unstructured – and even offers data sharing capabilities with leading providers such as Factset.
  • Driving AI use cases: Although the regulated nature of financial services makes it difficult to embrace and scale AI, the main hurdles are internal policies around risk aversion coupled with siloed infrastructures and legacy processes. Lakehouse makes AI accessible and transparent via MLflow; coupled with Delta Lake’s time travel capability, it has been adopted as a next generation of model risk management for independent validation.

What Makes Lakehouse for Financial Services Equipped to Tackle These Challenges?

We built Lakehouse for Financial Services specifically to tackle these challenges and empower organizations to find new ways to gain a competitive edge, innovate risk management and more, even within highly-regulated environments. Here’s how we’re doing just that:

Pre-built Solution Accelerators for Financial Services Use Cases

Lakehouse for Financial Services aligns with our 14 financial services solution accelerators, fully functional and freely available notebooks that tackle the most common and high-impact use cases that our customers are facing. These use cases include:

  • Post-Trade Analysis and Market Surveillance: Using an efficient time series processing engine for market data, this library combines core market data and disparate alternative data sources, enabling asset managers to backtest investing strategies at scale and efficiently report on transaction cost analysis.
  • Transaction Enrichment: This scalable geospatial data library enables hyper-personalization in retail banking to better understand customer transaction behavior required for next-gen customer segmentation and modern fraud prevention strategies.
  • Regulatory Reporting: This accelerator streamlines the acquisition, processing and transmission of regulatory data following open data standards and open data sharing protocols.
  • GDPR Compliance: Simplify the technical challenges around compliance to the “right to be forgotten” requirement while ensuring strict audit capabilities.
  • Common Data Models: A set of frameworks and accelerators for common data models to address the challenges FSIs have in standardizing data across the organization.

Industry Open Source Projects

As part of this launch, we’re thrilled to announce that we have joined FINOS (FinTech Open Source Foundation) to foster innovation and collaboration in financial services. FINOS includes the world’s leading FSIs such as Goldman Sachs, Morgan Stanley, UBS and JP Morgan as members. Open Source has become a core strategic initiative for data strategies in financial services as organizations look to avoid complex, costly vendor lock-in and proprietary data formats. As part of FINOS, Databricks is helping to facilitate the processing and exchange of financial data throughout the entire banking ecosystem. This is executed via our Delta Lake and Delta Sharing integrations with recent open source initiatives led by major FSIs.

Databricks is working to help empower the standardization of data by significantly democratizing data accessibility and insights. Ultimately, we want to bring data to the masses. That’s why we recently integrated the LEGEND ecosystem with Delta Lake functionalities such as Delta Live Tables. Developed by leading financial services institutions and subsequently open-sourced through the Linux Foundation, the LEGEND ecosystem allows domain experts and financial analysts to map business logic, taxonomy and financial calculations to data. Now integrated into the Lakehouse for Financial Services, those same business processes can be directly translated into core data pipelines to enforce high-quality standards with minimum operational overhead. Coupled with the Lakehouse query layer, this integration provides financial analysts with massive amounts of real-time data directly through the comfort of their business applications and core enterprise services.

Simple deployment of the Lakehouse environment

With Lakehouse for Financial Services, customers can easily automate security standards. More specifically, the utility libraries and scripts we’ve created for financial services deliver automated setup for notebooks and are tailored to help solve security and governance issues important to the financial services industry based on best practices and patterns from our 600+ customers.

A data model framework for standardizing data

In addition to solution accelerators, Lakehouse provides a framework for common data models to address the challenges FSIs have in standardizing data across the organization. For example, one solution accelerator is designed to easily integrate the Financial Regulation (FIRE) Data model to drive the standardization of data, serve data to downstream tools, enable AI quality checks and govern the data using Unity Catalog.

Open data sharing

Last year, we launched Delta Sharing, the world’s first open protocol for securely sharing data across organizations in real-time, independent of the platform on which the data resides. This is largely powered by our incredible ecosystem of partners, which we’re continuing to scale and grow. We are thrilled to announce that we have recently invested in TickSmith, a leading SaaS platform that simplifies the online data shopping experience and was one of the first platforms to implement Delta Sharing. With the TickSmith and Databricks integration, FSIs can now easily create, package and deliver data products in a unified environment.

Implementation Partners

Databricks is working with consulting and SI partner Avanade to deliver risk management solutions to financial institutions. Built on Azure Databricks, our joint solution makes it easier for customers to rapidly deploy data into value-at-risk models to keep up with emerging risks and threats. By migrating to the cloud and modernizing data-driven risk models, financial institutions are able to reduce regulatory and operational compliance risks and scale to meet increased throughput.

Databricks is also partnering with the Deloitte FinServ Governed Data Platform, a cloud-based, curated data platform meeting regulatory requirements that builds a single source of truth for financial institutions to intelligently organize data domains and approved provisioning points, enabling activation of business intelligence, visualization, predictive analytics, AI/ML, NLP and RPA.

Conclusion

Tesla-fication is starting to happen all around us. Lakehouse for Financial Services is designed to help our customers make a leapfrog advancement in their data and AI journey with pre-built solution accelerators, data sharing capabilities, open standards and certified implementation partners. We are on a mission to help every FSI become the Tesla of its industry.

Want to learn more? Check out this overview and see how you can easily get started or schedule a demo.

--

Try Databricks for free. Get started today.

The post Lakehouse for Financial Services: Paving the Way for Data-Driven Innovation in FSIs appeared first on Databricks.

How Gemini Built a Cryptocurrency Analytics Platform Using Lakehouse for Financial Services

This blog has been co-authored by Gemini. We would like to thank the Gemini team, Anil Kovvuri and Sriram Rajappa, for their contributions.

Gemini is one of the top centralized cryptocurrency exchanges in the United States and across the globe, enabling customers to trade cryptocurrency easily and safely on our platform.

Due to the massive volume of external real-time data, we had challenges with our existing data platform when facilitating internal reporting. Specifically, our data team needed to build applications that allow our end users to understand order book data using the following metrics:

  • Spread analysis for each cryptocurrency market comparing Gemini against the competition
  • Cost of liquidity per crypto asset per exchange
  • Market volume and capitalization for stability analytics
  • Slippage and order book depth analysis

In addition to building a dashboard, the team needed to ingest market data from an external data provider and present it in the web application, providing a rich end-user experience that allows users to refresh metrics at any time. With the sheer volume of historical and live data feeds being ingested, and the need for a scalable compute platform for backtesting and spread calculations, our team needed a performant single source of truth on which to build the application dashboards.

Ideation to creation

With these challenges outlined, the team defined three core technical requirements for the order book analytics platform:

  • Performant data marts to support ingestion of complex data types
  • Support for a highly parallelizable analytical compute engine
  • Self-service analytics and integration with hosted applications

First, we evaluated native AWS services to build out the order book analytics platform. However, our internal findings suggested the data team would need to dedicate a significant number of hours toward building a framework for ingesting data and stitching AWS native analytical services to build an end-to-end platform.

Next, we evaluated the data lakehouse paradigm. The core lakehouse foundation and features resonated with the team as an efficient way to build the data platform. With Databricks’ Lakehouse Platform for Financial Services, our data team had the flexibility and ability to engineer, analyze and apply ML from one single platform to support our data initiatives.

Going back to the core technical challenges, the main pain point was data ingestion. Data is sourced daily from 12 major exchanges and their crypto assets, and backfilled whenever new crypto exchanges are added. Below are a few data ingestion questions we posed to ourselves:

  1. How do you efficiently backfill historical order book and trade data at scale that arrives in AWS S3 as a one-time archive file in tar format?
  2. Batch data arrives as compressed CSV files, with each exchange and trade pair in separate buckets. How do you efficiently process new trading pairs or new exchanges?
  3. The external data provider doesn’t send any trigger/signal files, making it a challenge to know when the day’s data has been pushed. How do you schedule jobs without creating external file watchers?
  4. Pre- and post-processing of data is a common challenge when working with data files, but how do you handle failures and address job restarts?
  5. How do you make these data sets easy to consume for a team with a mix of SQL and Python skill sets?

Solving the data ingestion problem

To solve the problem of data ingestion and backfill the historical data for the order book, the team leveraged Databricks’ Auto Loader functionality. Auto Loader is a file source that can perform incremental data loads from AWS S3 as it subscribes to file events from the input directory.

Ingesting third-party data into AWS S3

AWS S3 bucket structure for order book data.

Once the data was in a readable format, another issue was the automatic processing of historical data. Challenges included listing the S3 directories since the beginning of time (2014 in this case), working with large files that were 1GB or more, and handling data volumes that were multiple terabytes per day. To scale processing, the team leveraged Auto Loader with the option to limit the number of files consumed per Structured Streaming trigger, as the number of files to ingest was in the range of one hundred thousand across all 12 major exchanges.

.option("cloudFiles. maxFilesPerTrigger", 1000)

Apart from the historical data, Gemini receives order book data from data providers across the 12 major exchanges on a daily basis. The team leveraged Auto Loader’s ability to integrate with AWS SQS to be notified of and process new files as they arrive. This solution eliminates the need for a time-based process (e.g., a cron job) to check for newly arrived files. As data is ingested into the Lakehouse, it is captured in Delta format, partitioned by date and exchange type, and readily available for further processing or consumption. The example below shows how data is ingested into the Lakehouse:

#### Read raw orderbook data
odf = spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.options(header='true') \
.schema(tradeSchema) \
.load(cloudfile_source)

#### Parse trade data
odf.createOrReplaceTempView("orderbook_df")
odf_final = spark.sql("select trade_date_utc, trade_ts_utc, date as trade_dt_epoc, \
                    exchange_name, regexp_replace(file_indicator,'(?<=.{1})([0-9])','' ) trade_pair, \
                    case when type = 'b' then 'bid' else 'ask' end as bid_ask, price, amount, file_indicator \
                    from orderbook_df")

#### Write to Bronze Delta table (the checkpoint path variable is assumed here, like the source/target paths above)
odf_final.writeStream.format("delta") \
.partitionBy('trade_date_utc', 'exchange_name') \
.option("checkpointLocation", cloudfile_checkpoint) \
.trigger(once=True) \
.start(cloudfile_target)
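As noted above, the same readStream can also run in file-notification mode so that new daily files are picked up via SQS events rather than repeated directory listing. A minimal sketch, reusing the settings from the snippet above (the option name is from the Auto Loader documentation):

#### Read raw orderbook data using file notifications (SQS-backed on AWS)
odf = spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("cloudFiles.useNotifications", "true") \
.options(header='true') \
.schema(tradeSchema) \
.load(cloudfile_source)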

As the data sets would be leveraged by machine learning and analyst teams, the Delta Lake format provided unique capabilities for managing high volume market/tick data — these features were key in developing the Gemini Lakehouse platform:

Auto-compaction: Data providers deliver data in various formats (gz, flat files) and inconsistent file sizes. Delta Lake keeps the data query-ready by compacting smaller files to improve performance. The team used date and exchange name as partitions since they are used for tracking price movements and market share analysis.

Time series-optimized querying: Many downstream queries require a time slice, for example to track historical price changes, which calls for Z-ordering on time (see the sketch after this list).

Unification of batch and streaming: Data feeds ingested at different velocities are combined using a bronze Delta table as a sink. This hugely simplifies the ingestion logic and leaves data engineering teams with less code to maintain over time.

Scalable metadata handling: Given the scale of tick data, Delta Lake's parallelized metadata querying eliminates bottlenecks when scanning the data.

Reproducibility: Storing the ML source data means forecasts are reproducible, and Delta Lake's time travel can be leveraged for audit.
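For the time-series access pattern called out above, the Z-ordering step might look like the following (a sketch; the table path and column name are illustrative rather than Gemini's actual schema):

# Compact small files and co-locate rows on event time so time-sliced queries scan less data
spark.sql("""
  OPTIMIZE delta.`/path/to/bronze/orderbook`
  ZORDER BY (trade_ts_utc)
""")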

With the core business use case being market analysis, answering fundamental questions, such as Gemini’s daily market share, requires real-time analysis. With Databricks’ Lakehouse Platform for Financial Services, the data team used Apache Spark’s Structured Streaming APIs, leveraging key capabilities like Trigger Once to schedule daily jobs that ingest and process data.

Enabling core business use cases with Machine Learning and Computed Features

Going back to the business use cases, the team needed to provide insights into two main areas — price predictions and market analysis. The team leveraged Databricks Lakehouse machine learning runtime capabilities to enable the core use cases in the following way:

Price predictions using machine learning

Price prediction is important for Gemini for a number of reasons:

  • Historical price movements across exchanges allows for time series analysis
  • Can be used as standalone feature for numerous downstream applications
  • Gives measure of predicted risk and volatility

To implement price predictions, the team used order book data along with other computed metrics, for instance market depth, as the input. The team leveraged Databricks’ AutoML, which provided a glassbox approach to performing distributed model experimentation at scale, and experimented with different deep learning architectures, including components from convolutional neural networks (CNNs) typically used in computer vision problems alongside more traditional LSTMs.

Market analysis using computed features

Market analysis is key for Gemini to answer questions like "what is our market share?" The team came up with different ways to compute features that would answer the business problem. Below are a couple of examples that include the problem definition:

Scenario based on weekly trade volumes:

  • Gemini’s share of the market, using Bitcoin as an example, would be calculated as (see the sketch after these scenarios):
    (Gemini BTC traded)/(Market BTC traded)

Scenario based on assets under custody (AUC):

  • Gives Gemini insight into the overall market, using Bitcoin as the example:
    (Gemini BTC held)/(Market BTC held)
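As an illustration of the first scenario, a weekly market-share feature could be computed along these lines (a sketch; the DataFrame and column names are assumptions rather than Gemini’s actual schema):

from pyspark.sql import functions as F

# Weekly BTC market share: Gemini's traded volume divided by total traded volume across exchanges
weekly_share = (trades_df
    .filter(F.col("asset") == "BTC")
    .groupBy(F.window("trade_ts_utc", "7 days"))
    .agg(
        F.sum(F.when(F.col("exchange_name") == "gemini", F.col("amount")).otherwise(0)).alias("gemini_btc"),
        F.sum("amount").alias("market_btc"))
    .withColumn("market_share", F.col("gemini_btc") / F.col("market_btc")))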

A simplified, collaborative data Lakehouse architecture for all users

As illustrated in the diagram below, the data lakehouse architecture enables different personas to collaborate on a single platform. This ranges from designing complex data engineering tasks to making incremental data quality updates and providing easy access to the underlying datasets through R, SQL, Python and Scala APIs for data scientists and data analysts, all on top of a Delta engine powered by Databricks. In this case, the bronze tables ingested via Auto Loader were enriched by computing additional aggregates and the above-mentioned time series forecasts, and finally persisted in gold tables for reporting and ad hoc analytics.

With the Databricks Lakehouse Platform for Financial Services, the Gemini team is able to leverage Databricks’ SQL capabilities to build internal applications and avoid multiple hops and data duplication.

Enabling self-service data analytics

One of the big value propositions of the data Lakehouse for the data team was to leverage the Databricks SQL capabilities to build internal applications and avoid multiple hops and copies of data. The team built an internal web application using flask, which was connected to the Databricks SQL endpoint using a pyodbc connector from Databricks. This was valuable for the team since it eliminated the need for multiple BI licenses for the analysts who could not directly query the data in the Lakehouse.
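A minimal sketch of that connection, assuming the Databricks (Simba Spark) ODBC driver is installed on the application host (the host, HTTP path, token and table name below are placeholders):

import pyodbc

# DSN-less connection to a Databricks SQL endpoint over ODBC; attribute names follow the Simba Spark driver
conn = pyodbc.connect(
    "Driver=Simba Spark ODBC Driver;"
    "Host=<workspace-host>;Port=443;"
    "HTTPPath=<sql-endpoint-http-path>;"
    "AuthMech=3;UID=token;PWD=<personal-access-token>;"
    "SSL=1;ThriftTransport=2",
    autocommit=True,
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM gold.orderbook_metrics LIMIT 10")  # illustrative gold-layer table
rows = cursor.fetchall()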

Once we had the data lakehouse implemented with Databricks, the final presentation layer was a React web application, which is customizable according to the analyst requirements and refreshed on demand. Additionally, the team leveraged the Databricks SQL inbuilt visualizations for ad hoc analytics. An example of the final data product, React Application UI, is shown below:

Gemini’s React application UI for order book analytics with data sourced from the Databricks Lakehouse Platform for Financial Services

Final thoughts

Given the complexity of the requirements, the data team was able to leverage the Databricks Lakehouse Platform for Financial Services architecture to support critical business requirements. The team was able to use Auto Loader for ingestion of complex tick data from the third party data provider while leveraging Delta Lake features such as partitioning, auto compaction and Z-Ordering to support the multi-terabyte scale of querying in the order book analytics platform.

The built-in machine learning and AutoML capabilities meant the team was able to quickly iterate through several models to formulate a baseline model to support spread, volatility and liquidity cost analytics. Further, being able to present key insights through Databricks SQL, while also making the gold data layer available through the React web frontend, provided a rich end-user experience for the analysts. Finally, the data lakehouse not only improved the productivity of data engineers, analysts and AI teams, but our teams are now able to access critical business insights by querying up to six months of data, spanning multiple terabytes and billions of records, in milliseconds thanks to all the built-in optimizations.

--

Try Databricks for free. Get started today.

The post How Gemini Built a Cryptocurrency Analytics Platform Using Lakehouse for Financial Services appeared first on Databricks.

Beyond LDA: State-of-the-art Topic Models With BigARTM

This post follows up on the series of posts in Topic Modeling for text analytics. Previously, we looked at the LDA (Latent Dirichlet Allocation) topic modeling library available within MLlib in PySpark. While LDA is a very capable tool, here we look at a more scalable and state-of-the-art technique called BigARTM. LDA is based on a two-level Bayesian generative model that assumes a Dirichlet distribution for the topic and word distributions. BigARTM (BigARTM GitHub and https://bigartm.org) is an open source project based on Additive Regularization on Topic Models (ARTM), which is a non-Bayesian regularized model and aims to simplify the topic inference problem. BigARTM is motivated by the premise that the Dirichlet prior assumptions conflict with the notion of sparsity in our document topics, and that trying to account for this sparsity leads to overly-complex models. Here, we will illustrate the basic principles behind BigARTM and how to apply it to the Daily Kos dataset.

Why BigARTM over LDA?

As mentioned above, BigARTM is a probabilistic non-Bayesian approach as opposed to the Bayesian LDA approach. According to Konstantin Vorontsov’s and Anna Potapenko’s paper on additive regularization, the assumptions of a Dirichlet prior in LDA do not align with the real-life sparsity of topic distributions in a document. Unlike LDA, BigARTM does not attempt to build a fully generative model of text; instead, it chooses to optimize certain criteria using regularizers. These regularizers do not require any probabilistic interpretation. The formulation of multi-objective topic models is therefore easier with BigARTM.

Overview of BigARTM

Problem statement

We are trying to learn a set of topics from a corpus of documents. The topics would consist of a set of words that make semantic sense. The goal here is that the topics would summarize the set of documents. In this regard, let us summarize the terminology used in the BigARTM paper:

D = collection of texts, each document ‘d’ is an element of D, each document is a collection of ‘nd’ words (w0, w1,…wd)

W = collection of vocabulary

T = a topic; each document ‘d’ is assumed to be made up of a number of topics

We sample from the probability space spanned by words (W), documents (D) and topics(T). The words and documents are observed but topics are latent variables.

The term ‘ndw’ refers to the number of times the word ‘w’ appears in the document ‘d’.

There is an assumption of conditional independence that each topic generates the words independent of the document. This gives us

p(w|t) = p(w|t,d)

The problem can be summarized by the following equation, which models each word in a document as drawn from a mixture of topics:

p(w|d) = Σ t∈T p(w|t) p(t|d)

What we are really trying to infer are the probabilities within the summation term, i.e., the mixture of topics in a document, p(t|d), and the mixture of words in a topic, p(w|t). Each document can be considered a mixture of domain-specific topics and background topics. Background topics are those that show up in every document and have a rather uniform per-document distribution of words. Domain-specific topics, however, tend to be sparse.

Stochastic factorization

Through stochastic matrix factorization, we infer the probability product terms in the equation above. The product terms are now represented as matrices. Keep in mind that this process results in non-unique solutions as a result of the factorization; hence, the learned topics would vary depending on the initialization used for the solutions.

We create a data matrix F = [fwd] of dimension WxD, where each element fwd is the count of word ‘w’ in document ‘d’ normalized by the number of words in document ‘d’. The matrix F can be stochastically decomposed into two matrices ∅ and θ so that:

F ≈ [∅] [θ]

[∅] corresponds to the matrix of word probabilities for topics, WxT

[θ] corresponds to the matrix of topic probabilities for the documents, TxD

All three matrices are stochastic and the columns are given by:

[∅]t which represents the words in a topic and,

[θ]d which represents the topics in a document respectively.

The number of topics is usually far smaller than the number of documents or the number of words.

LDA

In LDA the matrices ∅ and θ have columns, [∅]t and [θ]d that are assumed to be drawn from Dirichlet distributions with hyperparameters given by β and α respectively.

β = [βw], which is a hyperparameter vector corresponding to the number of words

α = [αt], which is a hyperparameter vector corresponding to the number of topics

Likelihood and additive regularization

The log-likelihood we would like to maximize to obtain the solution is given below. This is the same as the objective function in Probabilistic Latent Semantic Analysis (PLSA) and will be the starting point for BigARTM:

L(∅,θ) = Σ d∈D Σ w∈d ndw ln Σ t∈T p(w|t) p(t|d)

We are maximizing the log of the product of the joint probability of every word in each document. Applying Bayes’ theorem results in the summation terms seen on the right side of the equation above. Now for BigARTM, we add ‘r’ regularizer terms, which are the regularizer coefficients τi multiplied by functions Ri of ∅ and θ:

R(∅,θ) = Σ i τi Ri(∅,θ)

where Ri is a regularizer function that can take a few different forms depending on the type of regularization we seek to incorporate. The two common types are:
  1. Smoothing regularization
  2. Sparsing regularization

In both cases, we use the KL divergence as the regularizer function. We can combine these two regularizers to meet a variety of objectives. Some of the other types of regularization techniques are decorrelation regularization and coherence regularization (see http://machinelearning.ru/wiki/images/4/47/Voron14mlj.pdf, eq. 34 and eq. 40). The final objective function then becomes the following:

L(∅,θ) + R(∅,θ) → max

Smoothing regularization

Smoothing regularization is applied to smooth out background topics so that they have a uniform distribution relative to the domain-specific topics. For smoothing regularization, we

  1. Minimize the KL Divergence between terms [∅]t and a fixed distribution β
  2. Minimize the KL Divergence between terms [θ]d and a fixed distribution α
  3. Sum the two terms from (1) and (2) to get the regularizer term

We want to minimize the KL Divergence here to make our topic and word distributions as close to the desired α and β distributions respectively.

Sparsing strategy for fewer topics

To get fewer topics we employ the sparsing strategy. This helps us to pick out domain-specific topic words as opposed to the background topic words. For sparsing regularization, we want to:

  1. Maximize the KL Divergence between the term [∅]t and a uniform distribution
  2. Maximize the KL Divergence between the term [θ]d and a uniform distribution
  3. Sum the two terms from (1) and (2) to get the regularizer term

We are seeking to obtain word and topic distributions with minimum entropy (or less uncertainty) by maximizing the KL divergence from a uniform distribution, which has the highest entropy possible (highest uncertainty). This gives us ‘peakier’ distributions for our topic and word distributions.

Model quality

The ARTM model quality is assessed using the following measures:

  1. Perplexity: This is inversely proportional to the likelihood of the data given the model. The smaller the perplexity, the better the model; however, a perplexity value of around 10 has been experimentally shown to give realistic documents.
  2. Sparsity: This measures the percentage of elements that are zero in the ∅ and θ matrices.
  3. Ratio of background words: A high ratio of background words indicates model degradation and is a good stopping criterion. This could be due to too much sparsing or elimination of topics.
  4. Coherence: This is used to measure the interpretability of a model. A topic is considered coherent if its most frequent words tend to appear together in the documents. Coherence is calculated using pointwise mutual information (PMI); see the formula after this list. The coherence of a topic is measured as follows:
    • Get the ‘k’ most probable words for a topic (usually set to 10)
    • Compute the pairwise PMI for all pairs of words from the word list in the step above
    • Compute the average of all the PMIs
  5. Kernel size, purity and contrast: A kernel is defined as the subset of words in a topic that separates it from the others, i.e. Wt = {w: p(t|w) > δ}, where δ is typically set to about 0.25. The kernel size is usually set to be between 20 and 200. The terms purity and contrast are then defined as:

purity(t) = Σ w∈Wt p(w|t), which is the sum of the probabilities of all the words in the kernel for a topic, and contrast(t) = (1/|Wt|) Σ w∈Wt p(t|w), which is the average probability of the topic over the words in its kernel.

For a topic model, higher values are better for both purity and contrast.
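Written out, the pairwise PMI and the pair-averaged coherence score referenced above correspond to:

\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\,p(w_j)}, \qquad
\mathrm{coherence}(t) = \frac{2}{k(k-1)} \sum_{i < j} \mathrm{PMI}(w_i, w_j)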

Using the BigARTM library

Data files

The BigARTM library is available from the BigARTM website and the package can be installed via pip. Download the example data files and unzip them as shown below. The dataset we are going to use here is the Daily Kos dataset.

wget https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt.gz

wget https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt

gunzip docword.kos.txt.gz

LDA

We will start off by looking at their implementation of LDA, which requires fewer parameters and hence acts as a good baseline. Use the ‘fit_offline’ method for smaller datasets and ‘fit_online’ for larger datasets. You can set the number of passes through the collection or the number of passes through a single document.

import artm

batch_vectorizer = artm.BatchVectorizer(data_path='.', data_format='bow_uci',
                                        collection_name='kos', target_folder='kos_batches')

lda = artm.LDA(num_topics=15, alpha=0.01, beta=0.001, cache_theta=True,
               num_document_passes=5, dictionary=batch_vectorizer.dictionary)

lda.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)

top_tokens = lda.get_top_tokens(num_tokens=10)
for i, token_list in enumerate(top_tokens):
    print('Topic #{0}: {1}'.format(i, token_list))

Topic #0: ['bush', 'party', 'tax', 'president', 'campaign', 'political', 'state', 'court', 'republican', 'states']

Topic #1: ['iraq', 'war', 'military', 'troops', 'iraqi', 'killed', 'soldiers', 'people', 'forces', 'general']

Topic #2: ['november', 'poll', 'governor', 'house', 'electoral', 'account', 'senate', 'republicans', 'polls', 'contact']

Topic #3: ['senate', 'republican', 'campaign', 'republicans', 'race', 'carson', 'gop', 'democratic', 'debate', 'oklahoma']

Topic #4: ['election', 'bush', 'specter', 'general', 'toomey', 'time', 'vote', 'campaign', 'people', 'john']

Topic #5: ['kerry', 'dean', 'edwards', 'clark', 'primary', 'democratic', 'lieberman', 'gephardt', 'john', 'iowa']

Topic #6: ['race', 'state', 'democrats', 'democratic', 'party', 'candidates', 'ballot', 'nader', 'candidate', 'district']

Topic #7: ['administration', 'bush', 'president', 'house', 'years', 'commission', 'republicans', 'jobs', 'white', 'bill']

Topic #8: ['dean', 'campaign', 'democratic', 'media', 'iowa', 'states', 'union', 'national', 'unions', 'party']

Topic #9: ['house', 'republican', 'million', 'delay', 'money', 'elections', 'committee', 'gop', 'democrats', 'republicans']

Topic #10: ['november', 'vote', 'voting', 'kerry', 'senate', 'republicans', 'house', 'polls', 'poll', 'account']

Topic #11: ['iraq', 'bush', 'war', 'administration', 'president', 'american', 'saddam', 'iraqi', 'intelligence', 'united']

Topic #12: ['bush', 'kerry', 'poll', 'polls', 'percent', 'voters', 'general', 'results', 'numbers', 'polling']

Topic #13: ['time', 'house', 'bush', 'media', 'herseth', 'people', 'john', 'political', 'white', 'election']

Topic #14: ['bush', 'kerry', 'general', 'state', 'percent', 'john', 'states', 'george', 'bushs', 'voters']

You can extract and inspect the ∅ and θ matrices, as shown below.

phi = lda.phi_   # size is number of words in vocab x number of topics

theta = lda.get_theta() # number of rows correspond to the number of topics

print(phi)
topic_0       topic_1  ...      topic_13      topic_14

sawyer        3.505303e-08  3.119175e-08  ...  4.008706e-08  3.906855e-08

harts         3.315658e-08  3.104253e-08  ...  3.624531e-08  8.052595e-06

amdt          3.238032e-08  3.085947e-08  ...  4.258088e-08  3.873533e-08

zimbabwe      3.627813e-08  2.476152e-04  ...  3.621078e-08  4.420800e-08

lindauer      3.455608e-08  4.200092e-08  ...  3.988175e-08  3.874783e-08

...                    ...           ...  ...           ...           ...

history       1.298618e-03  4.766201e-04  ...  1.258537e-04  5.760234e-04

figures       3.393254e-05  4.901363e-04  ...  2.569120e-04  2.455046e-04

consistently  4.986248e-08  1.593209e-05  ...  2.500701e-05  2.794474e-04

section       7.890978e-05  3.725445e-05  ...  2.141521e-05  4.838135e-05

loan          2.032371e-06  9.697820e-06  ...  6.084746e-06  4.030099e-08

print(theta)
             1001      1002      1003  ...      2998      2999      3000

topic_0   0.000319  0.060401  0.002734  ...  0.000268  0.034590  0.000489

topic_1   0.001116  0.000816  0.142522  ...  0.179341  0.000151  0.000695

topic_2   0.000156  0.406933  0.023827  ...  0.000146  0.000069  0.000234

topic_3   0.015035  0.002509  0.016867  ...  0.000654  0.000404  0.000501

topic_4   0.001536  0.000192  0.021191  ...  0.001168  0.000120  0.001811

topic_5   0.000767  0.016542  0.000229  ...  0.000913  0.000219  0.000681

topic_6   0.000237  0.004138  0.000271  ...  0.012912  0.027950  0.001180

topic_7   0.015031  0.071737  0.001280  ...  0.153725  0.000137  0.000306

topic_8   0.009610  0.000498  0.020969  ...  0.000346  0.000183  0.000508

topic_9   0.009874  0.000374  0.000575  ...  0.297471  0.073094  0.000716

topic_10  0.000188  0.157790  0.000665  ...  0.000184  0.000067  0.000317

topic_11  0.720288  0.108728  0.687716  ...  0.193028  0.000128  0.000472

topic_12  0.216338  0.000635  0.003797  ...  0.049071  0.392064  0.382058

topic_13  0.008848  0.158345  0.007836  ...  0.000502  0.000988  0.002460

topic_14  0.000655  0.010362  0.069522  ...  0.110271  0.469837  0.607572

ARTM

This API provides the full functionality of ARTM; however, with this flexibility comes the need to manually specify metrics and parameters.

dictionary = batch_vectorizer.dictionary  # reuse the dictionary built earlier for the LDA model

model_artm = artm.ARTM(num_topics=15, cache_theta=True,
                       scores=[artm.PerplexityScore(name='PerplexityScore', dictionary=dictionary)],
                       regularizers=[artm.SmoothSparseThetaRegularizer(name='SparseTheta', tau=-0.15)])

model_artm.scores.add(artm.SparsityPhiScore(name='SparsityPhiScore'))
model_artm.scores.add(artm.TopicKernelScore(name='TopicKernelScore', probability_mass_threshold=0.3))
model_artm.scores.add(artm.TopTokensScore(name='TopTokensScore', num_tokens=6))

model_artm.regularizers.add(artm.SmoothSparsePhiRegularizer(name='SparsePhi', tau=-0.1))
model_artm.regularizers.add(artm.DecorrelatorPhiRegularizer(name='DecorrelatorPhi', tau=1.5e+5))

model_artm.num_document_passes = 1
model_artm.initialize(dictionary=dictionary)
model_artm.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15)

There are a number of metrics available, depending on what was specified during the initialization phase. You can extract any of the metrics using the following syntax.
model_artm.scores
[PerplexityScore, SparsityPhiScore, TopicKernelScore, TopTokensScore]

model_artm.score_tracker['PerplexityScore'].value

[6873.0439453125, 2589.998779296875, 2684.09814453125, 2577.944580078125,
 2601.897216796875, 2550.20263671875, 2531.996826171875, 2475.255126953125,
 2410.30078125, 2319.930908203125, 2221.423583984375, 2126.115478515625,
 2051.827880859375, 1995.424560546875, 1950.71484375]

You can use the model_artm.get_phi() and model_artm.get_theta() methods to get the ∅ and θ matrices, respectively. You can also extract the top tokens of each topic for the corpus of documents.

for topic_name in model_artm.topic_names:

    print(topic_name + ': ',model_artm.score_tracker['TopTokensScore'].last_tokens[topic_name])

topic_0:  ['party', 'state', 'campaign', 'tax', 'political', 'republican']

topic_1:  ['war', 'troops', 'military', 'iraq', 'people', 'officials']

topic_2:  ['governor', 'polls', 'electoral', 'labor', 'november', 'ticket']

topic_3:  ['democratic', 'race', 'republican', 'gop', 'campaign', 'money']

topic_4:  ['election', 'general', 'john', 'running', 'country', 'national']

topic_5:  ['edwards', 'dean', 'john', 'clark', 'iowa', 'lieberman']

topic_6:  ['percent', 'race', 'ballot', 'nader', 'state', 'party']

topic_7:  ['house', 'bill', 'administration', 'republicans', 'years', 'senate']

topic_8:  ['dean', 'campaign', 'states', 'national', 'clark', 'union']

topic_9:  ['delay', 'committee', 'republican', 'million', 'district', 'gop']

topic_10:  ['november', 'poll', 'vote', 'kerry', 'republicans', 'senate']

topic_11:  ['iraq', 'war', 'american', 'administration', 'iraqi', 'security']

topic_12:  ['bush', 'kerry', 'bushs', 'voters', 'president', 'poll']

topic_13:  ['war', 'time', 'house', 'political', 'democrats', 'herseth']

topic_14:  ['state', 'percent', 'democrats', 'people', 'candidates', 'general']

Conclusion

LDA tends to be the starting point for topic modeling for many use cases. In this post, BigARTM was introduced as a state-of-the-art alternative. The basic principles behind BigARTM were illustrated along with the usage of the library. I would encourage you to try out BigARTM and see if it is a good fit for your needs!

Please try the attached notebook.

--

Try Databricks for free. Get started today.

The post Beyond LDA: State-of-the-art Topic Models With BigARTM appeared first on Databricks.

Databricks Ventures Invests in Arcion to Enable Real-Time Data Sync with the Lakehouse

Databricks customers, regardless of size and industry, are increasingly seeking to unify their data onto a single platform. To do this, they need a simple, scalable and performant solution for moving data that resides in systems of record and operational data stores – whether in the cloud or on premises – to the Databricks Lakehouse Platform. They also need to sync this data in real-time to enable the processing of business-critical analytics and machine learning workloads on the freshest possible data. Since traditional databases provide little interconnectivity, customers often have to invest months of time and development effort to build their own custom data pipelines that support business-critical needs.

We recently established Databricks Ventures to support companies that share our view on the future of data and AI. We’ve already announced investments via the Lakehouse Fund in several innovative startups that are helping drive business-critical use cases for our customers, such as TickSmith (open data exchange for e-commerce) and Hunters (advanced security on the lakehouse).

Today, we’re thrilled to add to this portfolio of game-changing companies with our new investment in Arcion, the cloud-native, zero-code data mobility platform. Databricks first got to know the Arcion team a year ago. Since then, our two companies have been building toward a tight partnership focused on enabling real-time data sync with the lakehouse. From the start, we were impressed by Arcion’s fast, low-impact data replication technology and ability to deliver high-performance, high-volume data streaming for some of the most demanding enterprise customers. Arcion’s capabilities have expanded over the past year to support ingestion and replication of more than a dozen popular enterprise-class data stores, including Oracle, Sybase and MySQL, into Databricks. Together, Arcion and Databricks will help customers quickly and scalably connect their data sources to the lakehouse, unifying their data, analytics and AI workloads onto one simple platform.

This is just the beginning of what we know will be a powerful partnership between Databricks and Arcion. In the future, our joint customers can expect to see enhanced integrations and support for the Databricks Lakehouse, including Arcion’s availability within Databricks Partner Connect. Look out for additional announcements later this calendar year.

--

Try Databricks for free. Get started today.

The post Databricks Ventures Invests in Arcion to Enable Real-Time Data Sync with the Lakehouse appeared first on Databricks.

Get to Know Your Queries With the New Databricks SQL Query Profile!

Databricks SQL provides data warehousing capabilities and first class support for SQL on the Databricks Lakehouse Platform – allowing analysts to discover and share new insights faster at a fraction of the cost of legacy cloud data warehouses.

This blog is part of a series on Databricks SQL that covers critical capabilities across performance, ease of use, and governance. In a previous blog post, we covered recent user experience enhancements. In this article, we’ll cover improvements that help our users understand queries and query performance.

Speed up queries by identifying execution bottlenecks

Databricks SQL is great at automatically speeding up queries – in fact, we recently set a world record for it! Even with today’s advancements, there are still times when you need to open up the hood and look at query execution (e.g. when a query is unexpectedly slow). That’s why we’re excited to introduce Query Profile, a new feature that provides execution details for a query and granular metrics to see where time and compute resources are being spent. The UI should be familiar to administrators who have used databases before.

The Databricks SQL UI for its Query Profile feature is purpose-built to be user-friendly for database administrators.

Query Profile includes these key capabilities:

  • A breakdown of the main components of query execution and related metrics: time spent in tasks, rows processed, and memory consumption.
  • Multiple graphical representations. This includes a condensed tree view for spotting the slowest operations at a glance and a graph view to understand how data is flowing between query operators.
  • The ability to easily discover common mistakes in queries (e.g. exploding joins or full table scans).
  • Better collaboration via the ability to download and share a query profile.

A common methodology for speeding up queries is to first identify the longest-running query operators. We are more interested in the total time spent on a task than in the exact “wall clock time” of an operator, as we’re dealing with a distributed system and operators can be executed in parallel.

Databricks SQL Query Profile of a TPC-H query (tree view).

From the Query Profile above of a TPC-H query, it’s easy to identify the most expensive query operator: scan of the table lineitem. The second most expensive operator is the scan of another table (orders).

Databricks SQL Query Profile of a TPC-H query (graph view).

Each query operator comes with a slew of statistics. In the case of a scan operator, metrics include the number of files and amount of data read, time spent waiting for cloud storage, and time spent reading files. As a result, it is easy to answer questions such as which table should be optimized or whether a join could be improved.

With Databricks SQL Query Profile, every query operator comes with a slew of metrics, making it easy to answer questions, such as which table should be optimized or whether or not a join can be improved.

Spring cleaning Query History

We are also happy to announce a few small but handy tweaks in Query History. We have enhanced the details that can be accessed for each query. You can now see a query’s status, SQL statement, duration breakdown and a summary of the most important execution metrics.

With Databricks SQL’s Query Profile feature you can now see a query’s status, SQL statement, duration breakdown and a summary of the most important execution metrics.

To avoid back and forth between the SQL editor and Query History, all the features announced above are also directly available from the SQL editor.

Query performance best practices

Query Profile is available today in Databricks SQL. Get started now with Databricks SQL by signing up for a free trial. To learn how to maximize lakehouse performance on Databricks SQL, join us for a webinar on February 24th. This webinar includes demos, live Q&As and lessons learned in the field so you can dive in and find out how to harness all the power of the Lakehouse Platform.

In this webinar, you’ll learn how to:

  • Quickly and easily ingest business-critical data into your lakehouse and continuously refine data with optimized Delta tables for best performance
  • Write, share and reuse queries with a native first-class SQL development experience on Databricks SQL — and unlock maximum productivity
  • Get full transparency and visibility into query execution, with an in-depth breakdown of operation-level details so you can dive right in

Register here!

--

Try Databricks for free. Get started today.

The post Get to Know Your Queries With the New Databricks SQL Query Profile! appeared first on Databricks.


Databricks Ventures Partners With dbt Labs to Welcome Analytics Engineers to the Lakehouse

Today, we are thrilled to announce Databricks Ventures’ investment in dbt Labs. With this investment, we are proud to support the growth of the company behind a pivotal open source movement. Alongside this announcement, we’re also excited to introduce major enhancements to our partnership including a native Databricks adapter for dbt, automatic query acceleration for dbt workloads in Photon and, soon, one-click connectivity through Partner Connect. We believe these improvements make the Databricks Lakehouse Platform the best place to build your next dbt project.

Why dbt Labs? At the highest level, they share our vision of an open future in which data-driven use cases are accessible and valuable across all data teams. dbt is a transformation framework that enables analytics engineers to easily build data pipelines using SQL. Everything is organized within a project, in readable SQL and YAML files, making version control, deployment, and data quality testing simple. This opinionated approach enables data practitioners to confidently deliver value across the organization while following software engineering best practices.

At Databricks, we believe that open source and open protocols such as Delta Lake and MLflow are not only fundamentally aligned with the interests of our customers, but also foster vibrant communities that build transformative software. The dbt community is a superb example of this dynamic: tens of thousands of analytics engineers collaborate, share best practices, and teach each other how to build better software across Slack, Discord and more. Our investment in dbt Labs anticipates the incredible potential of this community and the role we believe dbt Labs will play in growing it in the coming years.

In addition, we continue to partner closely with dbt Labs and the community to make Databricks the best place to develop and productionalize dbt projects. The dbt-databricks adapter is now generally available with a v1.0.0 release. It offers simplified installation, support for Delta Lake tables, and plenty of idiomatic macros that make developing and testing dbt projects a breeze. Databricks SQL, our record-setting offering for data warehousing workloads, automatically and transparently accelerates SQL expressions generated by dbt.

Lastly, through Databricks Partner Connect, customers will soon be able to connect Databricks to dbt Cloud with a couple of clicks. dbt Labs offers a generous free trial that makes it easy to productionize dbt deployments, including turnkey support for scheduling jobs, CI/CD, serving documentation, monitoring and alerting, and an integrated development environment (IDE). Look out for more updates this spring.

With this investment, Databricks is excited to participate in a movement that aligns so closely with our own values. We are grateful to the dbt community for providing us with invaluable feedback, and invite you to build your next dbt project on Databricks.

To get started, please join the conversation on Slack.

--

Try Databricks for free. Get started today.

The post Databricks Ventures Partners With dbt Labs to Welcome Analytics Engineers to the Lakehouse appeared first on Databricks.

Building a Similarity-based Image Recommendation System for e-Commerce


Why recommendation systems are important

Online shopping has become the default experience for the average consumer – even established brick-and-mortar retailers have embraced e-commerce. To ensure a smooth user experience, multiple factors need to be considered for e-commerce. One core functionality that has proven to improve the user experience and, consequently, revenue for online retailers is a product recommendation system. In this day and age, it would be nearly impossible for a shopper to visit a retail website and not see product recommendations.

But not all recommenders are created equal, nor should they be. Different shopping experiences require different data to make recommendations. Engaging the shopper with a personalized experience requires multiple modalities of data and recommendation methods. Most recommenders concern themselves with training machine learning models on user and product attribute data massaged into a tabular form.

There has been an exponential increase in the volume and variety of data at our disposal to build recommenders and notable advances in compute and algorithms to utilize in the process. Particularly, the means to store, process and learn from image data has dramatically increased in the past several years. This allows retailers to go beyond simple collaborative filtering algorithms and utilize more complex methods, such as image classification and deep convolutional neural networks, that can take into account the visual similarity of items as an input for making recommendations. This is especially important given online shopping is a largely visual experience and many consumer goods are judged on aesthetics.

In this article, we’ll flip the script and show the end-to-end process for training and deploying an image-based similarity model that can serve as the foundation for a recommender system. Furthermore, we’ll show how the underlying distributed compute available in Databricks can help scale the training process and how foundational components of the Lakehouse, Delta Lake and MLflow, can make this process simple and reproducible.

Why similarity learning?

Similarity models are trained using contrastive learning. In contrastive learning, the goal is to make the machine learning (ML) model learn an embedding space where the distance between similar items is minimized and the distance between dissimilar items is maximized. Here, we will use the fashion MNIST dataset, which comprises around 70,000 images of various clothing items. Based on the above description, a similarity model trained on this labeled dataset will learn an embedding space where embeddings of similar products (e.g., boots) are closer together and embeddings of different items (e.g., boots and pullovers) are far apart. In supervised contrastive learning, the algorithm has access to metadata, such as image labels, to learn from, in addition to the raw pixel data itself.

This could be illustrated as follows.

The image depicts how similar items are located in close proximity to one another and far away from dissimilar items in the vector space

Traditional ML models for image classification focus on reducing a loss function that’s geared towards maximizing predicted class probabilities. However, what a recommender system fundamentally attempts to do is suggest alternatives to a given item. These items could be described as closer to one another in a certain embedding space than others. Thus, in most cases, the operating principle of recommendation systems aligns more closely with contrastive learning mechanisms than with traditional supervised learning. Furthermore, similarity models are more adept at generalizing to unseen data based on its similarities. For example, if the original training data does not contain any images of jackets but contains images of hoodies and boots, a similarity model trained on this data would locate the embeddings of an image of a jacket closer to hoodies and farther away from boots. This is very powerful in the world of recommendation methods.

Specifically, we use the Tensorflow Similarity library to train the model and Apache Spark, combined with Horovod to scale the model training across a GPU cluster. We use Hyperopt to scale hyperparameter search across the GPU cluster with Spark in only a few lines of code. All these experiments will be tracked and logged with MLflow to preserve model lineage and reproducibility. Delta will be used as the data source format to track data lineage.

Setting up the environment

The supervised_hello_world example in the Tensorflow Similarity Github repository gives a perfect template to use for the task at hand. What we try to do with a recommender is similar to the manner in which a similarity model behaves. That is, you choose an image of an item, and you query the model to return n of the most similar items that could also pique your interest.

To fully leverage the Databricks platform, it’s best to spin up a cluster with a GPU node for the driver (since we will be doing some single node training initially), two or more GPU worker nodes (as we will be scaling hyperparameter optimization and distributing the training itself), and a Databricks Machine Learning runtime of 10.0 or above. T4 GPU instances are a good choice for this exercise.

The image shows the configurations to be chosen for creating a GPU cluster for the work described here. T4 GPUs are a good choice for the driver and two worker nodes

The entire process should take no more than 5 minutes (including the cluster spin up time).

Ingest data into Delta tables

Fashion MNIST training and test data can be imported into our environment using a sequence of simple shell commands, and the helper function `convert` (modified from the original version at https://pjreddie.com/projects/mnist-in-csv/ to reduce unnecessary file I/O) can be used to convert the image and label files into a tabular format. Subsequently, these tables can be stored as Delta tables.

Storing the training and test data as Delta tables is important because, as we incrementally write new observations (new images and their labels) to these tables, the Delta transaction log keeps track of the changes to the data. This enables us to identify the fresh data we can use to re-index the similarity index we will describe later.
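As a rough sketch of that last step (the DataFrame and table names below are hypothetical), writing the converted training data out as a Delta table and inspecting its history could look like this:

```
# Minimal sketch: persist the converted training data as a Delta table.
# `fashion_mnist_train_df` and the table name are hypothetical; adapt to your schema.
(fashion_mnist_train_df
   .write
   .format("delta")
   .mode("append")               # incremental writes are recorded in the Delta transaction log
   .saveAsTable("fashion_mnist_train"))

# The table history exposes every incremental write, which is what lets us
# identify fresh images to re-index later.
display(spark.sql("DESCRIBE HISTORY fashion_mnist_train"))
```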

Nuances of training similarity models

A neural network used to train a similarity model is quite similar to one used for regular supervised learning. The primary differences are in the loss function we use and the metric embedding layer. Here we use a simple convolutional neural network (CNN) architecture, which is commonly seen in computer vision applications. However, there are subtle differences in the code that enable the model to learn using contrastive methods.

You will see the multi-similarity loss function in place of the softmax loss function for multiclass classification you would see otherwise. Compared to other traditional loss functions used for contrastive learning, Multi-Similarity Loss takes into account multiple similarities. These similarities are self-similarity, positive relative similarity, and negative relative similarity. Multi-Similarity Loss measures these three similarities by means of iterative hard pair mining and weighting, bringing significant performance gains in contrastive learning tasks. Further details of this specific loss are discussed at length in the original publication by Wang et al.

In the context of this example, this loss helps minimize the distance between similar items and maximize the distance between dissimilar items in the embedding space. As explained in the supervised_hello_world example in the Tensorflow_Similarity repository, the embedding layer added to the model with MetricEmbedding() is a dense layer with L2 normalization. For each minibatch, a fixed number of embeddings (corresponding to images) are randomly chosen from randomly sampled classes (the number of classes is a hyperparameter). These are then subjected to hard pair mining and weighting iteratively in the Multi-Similarity Loss layer, where information from the three different types of similarities is used to penalize dissimilar samples in close proximity more heavily.

This can be seen below.

```
def get_model():
    from tensorflow_similarity.layers import MetricEmbedding
    from tensorflow_similarity.models import SimilarityModel
    from tensorflow.keras import layers

    inputs = layers.Input(shape=(28, 28, 1))
    # Scale raw pixel values to [0, 1] before the convolutional stack.
    x = layers.experimental.preprocessing.Rescaling(1/255)(inputs)
    x = layers.Conv2D(32, 3, activation='relu')(x)
    x = layers.MaxPool2D(2, 2)(x)
    x = layers.Dropout(0.3)(x)
    ...
    ...
    x = layers.Dropout(0.3)(x)
    x = layers.Flatten()(x)
    # Metric embedding layer: a dense layer with L2 normalization.
    outputs = MetricEmbedding(128)(x)
    return SimilarityModel(inputs, outputs)
...

...
from tensorflow_similarity.losses import MultiSimilarityLoss
from tensorflow.keras.optimizers import Adam

loss = MultiSimilarityLoss(distance=distance)
model.compile(optimizer=Adam(learning_rate), loss=loss)
```

It is important to understand how a trained similarity model functions in TensorFlow Similarity. During training of the model, we learned embeddings that minimize the distance between similar items. The Indexer class of the library provides the capability to build an index from these embeddings on the basis of the chosen distance metric. For example, if the chosen distance metric is ‘cosine’, the index will be built on the basis of cosine similarity.

The index exists to quickly find items with ‘close’ embeddings. For this search to be quick, the most similar items have to be retrieved with relatively low latency. The query method here uses Fast Approximate Nearest Neighbor Search to retrieve the n nearest neighbors to a given item, which we can then serve as recommendations.

```
from tensorflow_similarity.samplers import select_examples

# Build an index using training data
x_index, y_index = select_examples(x_train, y_train, CLASSES, 20)
tfsim_model.reset_index()
tfsim_model.index(x_index, y_index, data=x_index)

# Query the index using the lookup method
tfsim_model.lookup(x_display, k=5)
...
```

Leveraging parallelism with Apache Spark

This model can be trained on a single node without an issue, and we can build an index to query it. Subsequently, the trained model can be deployed to be queried via a REST endpoint with the help of MLflow. This particularly makes sense since the fashion MNIST dataset used in this example is small and fits easily in the memory of a single GPU-enabled instance. However, in practice, image datasets of products can span several gigabytes in size. Also, even for a model trained on a small dataset, finding the optimal hyperparameters can be very time consuming if done on a single GPU-enabled instance. In both cases, the parallelism enabled by Spark can do wonders with only a few lines of code changed.

Parallelizing hyperparameter optimization with Apache Spark

In the case of a neural network, you could think of weights of the artificial neurons as parameters that are updated during training. This is performed by means of gradient descent and backpropagation of error. However, values such as the number of layers, the number of neurons per layer, and even the activation functions in neurons aren’t optimized during this process. These are termed hyperparameters, and we have to search the space of all such possible hyperparameter combinations in a clever way to proceed with the modeling process.

Traditional model tuning (a shorthand for hyperparameter search) can be done with naive approaches such as an exhaustive grid search or a random search. Hyperopt, a widely adopted open-source framework for model tuning, leverages a far more efficient Bayesian search for this process.

This search can be time consuming, even with intelligent algorithms such as Bayesian search. However, Spark can work in conjunction with Hyperopt to parallelize this process across the entire cluster, resulting in a dramatic reduction in the time consumed. All that has to be done to perform this scaling is to add two lines of Python code to what you would normally use with Hyperopt. Note how the parallelism argument is set to 2 (i.e., the number of cluster GPUs).

```
...
from hyperopt import fmin, SparkTrials
...
# Distribute trials across the cluster's 2 GPU workers.
trials = SparkTrials(parallelism=2)
...
best_params = fmin(
    fn=train_hyperopt,
    space=space,
    algo=algo,
    max_evals=32,
    trials=trials
)
...
```
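For completeness, the elided `space` and `algo` objects could be defined along the following lines; the hyperparameter names and ranges shown are illustrative assumptions rather than the values used in the accompanying notebook:

```
from hyperopt import hp, tpe

# Illustrative search space: the parameter names and ranges are assumptions.
space = {
    "learning_rate": hp.loguniform("learning_rate", -9, -2),
    "embedding_size": hp.choice("embedding_size", [64, 128, 256]),
}

# Tree-structured Parzen Estimators: Hyperopt's Bayesian-style search algorithm.
algo = tpe.suggest
```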

The mechanism by which this parallelism works can be illustrated as follows.

Image describes how hyperopt works at a high level. Hyperopt distributes the Bayesian search for optimal hyperparameters across a cluster.

The article Scaling Hyperopt to Tune Machine Learning Models in Python gives an excellent deep dive on how this works. It is important to use GPU enabled nodes for this process in the case of similarity models, particularly in this example leveraging Tensorflow. Any time savings could be negated by unnecessarily long and inefficient training processes leveraging CPU nodes. A detailed analysis of this is provided in this article.

Parallelizing model training with Horovod

As we saw in the previous section, Hyperopt leverages Spark to distribute hyperparameter search by training multiple models with different hyperparameter combinations, in parallel. The training of each model takes place in a single machine. Distributed model training is yet another way in which distributed processing with Spark can make the training process more efficient. Here, a single model is trained across many machines in the cluster.

If the training dataset is large, it could be yet another bottleneck for training a production-ready similarity model. One approach is to train the model on only a subset of the data on a single machine, but this comes at the cost of a sub-optimal final model. However, with Spark and Horovod, an open-source framework for parallelizing the model training process across a cluster, this problem can be solved. Horovod, in conjunction with Spark, provides a data-parallel approach to model training on large-scale datasets with minimal code changes. Here, a copy of the model is trained on a subset of the data on each node of the cluster, and the weight updates are synchronized across the cluster, resulting in a single final model. Ultimately, you end up with a highly optimized model trained on the entire dataset in a fraction of the time you would spend attempting to do this on a single machine. The article How (Not) To Scale Deep Learning in 6 Easy Steps goes into great detail on how to leverage distributed compute for deep learning. Again, Horovod is most effective when used on a GPU cluster; otherwise, the advantages of scaling model training across a cluster would not bring the desired efficiencies.

Image describes how Horovod works at a high level. A single model is trained across the entire cluster.
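On Databricks, this data-parallel pattern is typically driven through HorovodRunner. The sketch below shows its general shape under the assumption of the 2-GPU cluster described earlier, with a hypothetical training function that reuses `get_model()` and the multi-similarity loss from above; it is not the exact code from the notebook:

```
import horovod.tensorflow.keras as hvd
import tensorflow as tf
from sparkdl import HorovodRunner  # available in the Databricks ML runtime

def train_hvd(learning_rate=0.001):
    # Each worker initializes Horovod and pins itself to a single GPU.
    hvd.init()
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    model = get_model()  # same architecture as the single-node example
    # Wrap the optimizer so gradients are averaged across workers at every step.
    optimizer = hvd.DistributedOptimizer(
        tf.keras.optimizers.Adam(learning_rate * hvd.size()))
    model.compile(optimizer=optimizer, loss=loss)  # `loss` as defined earlier

    # Make sure all workers start from the same initial weights.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x_train_shard, y_train_shard, epochs=10, callbacks=callbacks)  # data sharding elided

# np=2 maps to the two GPU workers in the cluster configuration described above.
hr = HorovodRunner(np=2)
hr.run(train_hvd, learning_rate=0.001)
```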

Handling large image datasets for model training is another important factor to consider. In this example, fashion MNIST is a very small dataset that does not strain the cluster at all. However, large image datasets are often seen in the enterprise, and a use case may involve training a similarity model on such data. Here, Petastorm, a data caching library built with deep learning in mind, will be very useful. The linked notebook will help you leverage this technology for your use case.

Deploying model and index

Once the final model with the optimal hyperparameters is trained, the process of deploying a similarity model is a nuanced one. This is because the model and the index need to be deployed together. However, with MLflow, this process is trivially simple. As mentioned before, recommendations are retrieved by querying the index of data with the embedding inferred from the query sample. This can be illustrated in a simplified manner as follows.

The image recommendation system includes the trained similarity model and the index of embeddings. To generate recommendations, image embeddings are generated by the model and subsequently queried in the index.

One of the key advantages of this approach is that there is no need to retrain the model as new image data is received. Embeddings can be generated with the model and added to the ANN index for querying. Since the original image data is in the Delta format, any increments to the table will be recorded in the Delta transaction log. This ensures reproducibility of the entire data ingestion process.
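A rough sketch of that incremental flow, assuming new images land in the same Delta table and using a hypothetical ingestion-timestamp column to isolate them, might look like the following:

```
import numpy as np

# Hypothetical: read only freshly appended rows from the Delta table
# (here filtered on an assumed `ingest_ts` column) and add them to the index.
new_rows = (spark.table("fashion_mnist_train")
                 .filter("ingest_ts > '2022-03-01'")
                 .toPandas())

x_new = np.stack(new_rows["pixels"].values).reshape(-1, 28, 28, 1)
y_new = new_rows["label"].values

# No retraining required: embed the new images and append them to the ANN index.
tfsim_model.index(x_new, y_new, data=x_new)
```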

In MLflow, there are numerous model flavors for popular (and even obscure) ML frameworks to enable easy packaging of models for serving. In practice, there are numerous instances where a trained model has to be deployed with pre- and/or post-processing logic, as in the case of the queryable similarity model and ANN index. Here we can use the mlflow.pyfunc module to create a custom `recommender model` class (named TfsimWrapper in this case) to encapsulate the inference and lookup logic. This link provides detailed documentation on how this could be done.

```
import mlflow.pyfunc
import numpy as np
import pandas as pd

class TfsimWrapper(mlflow.pyfunc.PythonModel):
    """Model input is a single-row, single-column pandas DataFrame with a base64 encoded
    byte string (i.e., of type bytes); the column name is 'input' in this case.

    Model output is a pandas DataFrame where each row (i.e., element, since there is only
    one column) is a string converted to hexadecimal that has to be converted back to bytes,
    then to a numpy array using np.frombuffer(...), reshaped to (28, 28) and visualized
    (if needed).
    """

    def load_context(self, context):
      # Importing tensorflow_similarity registers its custom layers so the
      # SimilarityModel can be restored with the standard Keras loader.
      import tensorflow_similarity as tfsim
      from tensorflow_similarity.models import SimilarityModel
      from tensorflow.keras import models

      self.tfsim_model = models.load_model(context.artifacts["tfsim_model"])
      self.tfsim_model.load_index(context.artifacts["tfsim_model"])

    def predict(self, context, model_input):
      from PIL import Image
      import base64
      import io

      # Decode the base64 payload into a (28, 28) image array.
      image = np.array(Image.open(io.BytesIO(base64.b64decode(model_input["input"][0].encode()))))
      # The model input has to be of the form (1, 28, 28)
      image_reshaped = image.reshape(-1, 28, 28)/255.0
      # Look up the 5 nearest neighbors in the index and return them hex-encoded.
      images = np.array(self.tfsim_model.lookup(image_reshaped, k=5))
      image_dict = {}
      for i in range(5):
        image_dict[i] = images[0][i].data.tostring().hex()

      return pd.DataFrame.from_dict(image_dict, orient='index')

```

The model artifact can be logged, registered, and deployed as a REST endpoint all within the same MLflow UI or by leveraging the MLflow API. In addition to this functionality, it is possible to define the input and output schema as a model signature in the logging process to assist a swift hand-off to deployment. This is handled automatically by including the following 3 lines of code:

```
from mlflow.models.signature import infer_signature

signature = infer_signature(sample_image, loaded_model.predict(sample_image))
mlflow.pyfunc.log_model(artifact_path=mlflow_pyfunc_model_path, python_model=TfsimWrapper(),
                        artifacts=artifacts, conda_env=conda_env, signature=signature)
```

Once the signature is inferred, the expected input and output schema will be indicated in the UI as follows.

The model signature inferred by the infer_signature function is displayed in the MLflow user interface

Once the REST endpoint has been created, you can conveniently generate a bearer token by going to the user settings on the sliding panel on the left hand side of the workspace. With this bearer token, you can insert the automatically generated Python wrapper code for the REST endpoint in any end user facing application or internal process that relies on model inference.
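As an illustration, a request to the endpoint from Python might look like the sketch below; the endpoint URL, model name, and payload format are assumptions that depend on your workspace and MLflow version:

```
import base64
import pandas as pd
import requests

# Hypothetical endpoint and token; substitute the values from your workspace.
ENDPOINT_URL = "https://<databricks-instance>/model/image_recommender/1/invocations"
TOKEN = "<personal-access-token>"

def score_image(image_path):
    # The wrapper model expects a single-column pandas DataFrame of base64-encoded bytes.
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode()
    payload = pd.DataFrame({"input": [encoded]}).to_json(orient="split")

    resp = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json; format=pandas-split"},
        data=payload,
    )
    resp.raise_for_status()
    return resp.json()

response = score_image("query_image.png")  # hypothetical file name
```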

The following function will help decode the JSON response from the REST call.

```
import numpy as np

def process_response_image(i):
    """response is the returned JSON object. We can loop through this object and return
    the reshaped numpy array for each recommended image, which can then be rendered."""
    single_image_string = response[i]["0"]
    image_array = np.frombuffer(bytes.fromhex(single_image_string), dtype=np.float32)
    image_reshaped = np.reshape(image_array, (28, 28))
    return image_reshaped
```

The code for a simple Streamlit application built to query this endpoint is available in the repository for this blog article. The following short recording shows the recommender in action.

Build your own with Databricks

Typically, the process of ingesting and formatting the data, optimizing the model, training at scale, and deploying a similarity model for recommendations is novel and nuanced for many teams. With the highly optimized managed Spark, Delta Lake, and MLflow foundations that Databricks provides, this process becomes simple and straightforward on the Lakehouse platform. With access to managed compute clusters, provisioning multiple GPUs is seamless and takes only a few minutes. The notebook linked below walks you through the end-to-end process of building and deploying a similarity model in a detailed manner. We welcome you to try it, customize it in a manner that fits your needs, and build your own production-grade ML-based image recommendation system with Databricks.

Try the notebook.

--

Try Databricks for free. Get started today.

The post Building a Similarity-based Image Recommendation System for e-Commerce appeared first on Databricks.

Hyper-Personalization Accelerator for Banks and Fintechs Using Credit Card Transactions


Just as Netflix and Tesla disrupted the media and automotive industry, many fintech companies are transforming the Financial Services industry by winning the hearts and minds of a digitally active population through personalized services, numberless credit cards that promise more security, and frictionless omnichannel experiences. NuBank’s success story as an eight-year-old startup becoming Latin America’s most valuable bank is not an isolated case; over 280 other fintech unicorns are also willing to disrupt the entire payment industry. As noted in the Financial Conduct Authority (FCA) study, “There are signs that some of the historic advantages of large banks may be starting to weaken through innovation, digitization and changing consumer behavior.” Faced with the choice of either disrupting or being disrupted, many traditional financial services institutions (FSIs) like JP Morgan Chase have recently announced significant strategic investments to compete with fintech companies on their own grounds – on the cloud, using data and artificial intelligence (AI).

Given the volume of data required to drive advanced personalization, the complexity of operating AI from experiments (proof of concepts/POCs) to enterprise scale data pipelines, combined with strict data and privacy regulations on the use of customer data on cloud infrastructure, Lakehouse for Financial Services has quickly emerged as the strategic platform for many disruptors and incumbents alike to accelerate digital transformation and provide millions of customers with personalized insights and enhanced banking experiences (see how HSBC is reinventing mobile banking with AI).

In our previous solution accelerator, we showed how to identify brands and merchants from credit card transactions. In our new solution accelerator (inspired from the 2019 study of Bruss et. al. and from our experience working with global retail banking institutions), we capitalized on that work to build a modern hyper-personalization data asset strategy that captures a full picture of the consumer and goes beyond traditional demographics, income, product and services (who you are) and extends to transactional behavior and shopping preferences (how you bank). As a data asset, the same can be applied to many downstream use cases, such as loyalty programs for online banking applications, fraud prevention for core banking platforms or credit risk for “buy now pay later” (BNPL) initiatives.

Transactional context

While the common approach to any segmentation use case is a simple clustering model, only a few off-the-shelf techniques exist. Alternatively, when converting data from its original archetype, one can access a wider range of techniques that often yield unexpected results. In this solution accelerator, we convert our original card transaction data into a graph paradigm and leverage techniques originally designed for Natural Language Processing (NLP).

representing card transactions as a bi-partite graph

Similar to NLP techniques where the meaning of a word is defined by its surrounding context, a merchant’s category can be learned from its customer base and the other brands that their consumers support. In order to build this context, we generate “shopping trips” by simulating customers walking from one shop to another, up and down our graph structure. The aim is to learn “embeddings,” a mathematical representation of the contextual information carried by the customers in our network. In this example, two merchants contextually close to one another would be embedded into large vectors that are mathematically close to one another. By extension, two customers exhibiting the same shopping behavior will be mathematically close to one another, paving the way for a more advanced customer segmentation strategy.
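To make the idea concrete, a heavily simplified way to generate such “shopping trips” is a random walk that alternates between customer and merchant nodes of the bipartite graph. The plain-Python sketch below, with hypothetical adjacency dictionaries built from the transaction table, illustrates the mechanics only; it is not the tuned walk strategy used in the accelerator:

```
import random

def generate_shopping_trip(start_merchant, merchant_to_customers, customer_to_merchants, walk_length=6):
    """Simulate one 'shopping trip' as a random walk on the customer-merchant bipartite graph.

    merchant_to_customers / customer_to_merchants are hypothetical dictionaries mapping
    each node to its neighbors, built from the card transaction table.
    """
    trip = [start_merchant]
    merchant = start_merchant
    for _ in range(walk_length - 1):
        customer = random.choice(merchant_to_customers[merchant])   # hop to a customer of this merchant
        merchant = random.choice(customer_to_merchants[customer])   # hop to another merchant they visit
        trip.append(merchant)
    return trip

# Example: simulate a single trip starting from a given merchant.
# generate_shopping_trip("PAUL SMITH", merchant_to_customers, customer_to_merchants)
```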

Merchant embeddings

Word2Vec was developed by Tomas Mikolov et al. at Google to make the neural network training of the embedding more efficient, and has since become the de facto standard for developing pre-trained word embedding algorithms. In our solution, we use the default Word2Vec model from the Apache Spark™ ML API, which we train against the shopping trips defined earlier.

```
import mlflow
import mlflow.spark
from pyspark.ml.feature import Word2Vec

with mlflow.start_run(run_name='shopping_trips') as run:

  word2Vec_model = Word2Vec() \
    .setVectorSize(255) \
    .setWindowSize(3) \
    .setMinCount(5) \
    .setInputCol('walks') \
    .setOutputCol('vectors') \
    .fit(shopping_trips)

  mlflow.spark.log_model(word2Vec_model, "model")
```

The most obvious way to quickly validate our approach is to eyeball its results and apply domain expertise. In this example, for a brand like “Paul Smith”, our model finds Paul Smith’s closest competitors to be “Hugo Boss”, “Ralph Lauren” or “Tommy Hilfiger.”

merchants that are contextually close to Paul Smith

We did not simply detect brands within the same category (i.e. fashion industry) but detected brands with a similar price tag. Not only could we classify different lines of businesses using customer behavioral data, but our customer segmentation could also be driven by the quality of goods they purchase. This observation corroborates the findings by Bruss et. al.

Merchant clustering

Although the preliminary results were encouraging, there might be groups of merchants more or less similar than others that we may want to identify further. The easiest way to find those significant groups of merchants/brands is to visualize our embedded vector space in a 3D plot. For that purpose, we apply machine learning techniques like Principal Component Analysis (PCA) to reduce our embedded vectors to 3 dimensions.

representing merchant embeddings into 3 dimensions

Using a simple plot, we could identify distinct groups of merchants. Although these merchants may have different lines of business, and may seem dissimilar at first glance, they all have one thing in common: they attract a similar customer base. We can better confirm this hypothesis through a clustering model (KMeans).
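A minimal sketch of those two steps with Spark ML is shown below; it assumes the merchant vectors come straight from the trained Word2Vec model, and the number of clusters is an illustrative choice rather than a tuned value:

```
from pyspark.ml.feature import PCA
from pyspark.ml.clustering import KMeans

# One row per merchant with its learned embedding; Spark's Word2VecModel exposes
# this as a DataFrame with columns `word` and `vector`.
merchant_embeddings = word2Vec_model.getVectors()

# Reduce the embeddings to 3 principal components for the 3D plot.
pca = PCA(k=3, inputCol="vector", outputCol="pca_vector")
merchant_3d = pca.fit(merchant_embeddings).transform(merchant_embeddings)

# Confirm the visually distinct groups with a KMeans model on the full embeddings
# (k=5 is an assumption for illustration).
kmeans = KMeans(k=5, featuresCol="vector", predictionCol="cluster")
merchant_clusters = kmeans.fit(merchant_embeddings).transform(merchant_embeddings)
```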

Transactional fingerprints

One of the odd features of the word2vec model is that sufficiently large vectors could still be aggregated while maintaining high predictive value. To put it another way, the significance of a document could be learned by averaging the vector of each of its word constituents (see whitepaper from Mikolov et. al.). Similarly, customer spending preferences can be learned by aggregating vectors of each of their preferred brands. Two customers having similar tastes for luxury brands, high-end cars and fine liquor would theoretically be close to one another, hence belonging to the same segment.

```
from pyspark.sql import functions as F

# Aggregate each customer's merchants into a "document" of merchant names...
customer_merchants = transactions \
    .groupBy('customer_id') \
    .agg(F.collect_list('merchant_name').alias('walks'))

# ...and average the merchant vectors to obtain one embedding per customer.
customer_embeddings = word2Vec_model.transform(customer_merchants)
```

It is worth mentioning that such an aggregated view would generate a transactional fingerprint that is unique to each of our end consumers. Although two fingerprints may share similar traits (same shopping preferences), these unique signatures can be used to track individual customer behaviors over time.

When a signature drastically differs from previous observations, this could be a sign of fraudulent activities (e.g., a sudden interest in gambling companies). When a signature drifts over time, this could be indicative of life events (e.g., having a newborn child). This approach is key to driving hyper-personalization in retail banking: the ability to track customer preferences against real-time data will help banks provide personalized marketing and offers, such as push notifications, across various life events, positive or negative.

transaction fingerprints over time
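One simple way to quantify that drift is the cosine similarity between a customer's aggregated embeddings in two consecutive periods; a value falling well below 1 flags a behavioral change worth investigating. The numpy sketch below assumes two hypothetical fingerprint vectors for the same customer:

```
import numpy as np

def fingerprint_similarity(fingerprint_a, fingerprint_b):
    """Cosine similarity between two transactional fingerprints (1.0 = identical behavior)."""
    return float(
        np.dot(fingerprint_a, fingerprint_b)
        / (np.linalg.norm(fingerprint_a) * np.linalg.norm(fingerprint_b))
    )

# Hypothetical example: compare a customer's fingerprint month over month.
# drift = 1 - fingerprint_similarity(march_fingerprint, april_fingerprint)
```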

Customer segmentation

Although we were able to generate some signal that offers great predictive value to customer behavioral analytics, we still haven’t addressed our actual segmentation problem. Borrowing from retail counterparts that are often more advanced when it comes to customer 360 use cases including segmentation, churn prevention or customer lifetime value, we can use a different solution accelerator from our Lakehouse for Retail that walks us through different segmentation techniques used by best-in-class retail organizations.

Following retail industry best practices, we were able to segment our entire customer base against 5 different groups exhibiting different shopping characteristics.

segmenting our customer base into 5 spending personas

While cluster #0 seems to be biased towards gambling activities (merchant category 4 in the above graph), another group is more centered around online businesses and subscription-based services (merchant category 6), probably indicative of a younger generation of customers. We invite our readers to complement this view with additional data points they already know about their customers (original segments, products and services, average income, demographics, etc.) to better understand each of those behavioral driven segments and its impact for credit decisioning, next-best action, personalized services, customer satisfaction, debt collection or marketing analytics.

Closing thoughts

In this solution accelerator, we have successfully applied concepts from the world of NLP to card transactions for customer segmentation in retail banking. We also demonstrated the relevance of the Lakehouse for Financial Services to address this challenge where graph analytics, matrix calculation, NLP, and clustering techniques must all be combined into one platform, secured and scalable. Compared to traditional segmentation methods easily addressed through the world of SQL, the disruptive future of segmentation builds a fuller picture of the consumer and can only be solved with data + AI, at scale and in real time.

Although we’ve only scratched the surface of what was possible using off-the-shelf models and data at our disposal, we proved that customer spending patterns can more effectively drive hyper-personalization than demographics, opening up an exciting range of new opportunities from cross-sell/upsell and pricing/targeting activities to customer loyalty and fraud detection strategies.

Most importantly, this technique allowed us to learn from new-to-bank individuals or underrepresented consumers without a known credit history by leveraging information from others. With 1.7 billion adults worldwide who do not have access to a bank account according to the World Economic Forum, and 55 million underbanked in the US alone in 2018 according to the Federal Reserve, such an approach could pave the way towards a more customer-centric and inclusive future for retail banking.

Try the accelerator notebooks on Databricks to test your customer 360 data asset strategy today and contact us to learn more about how we have helped customers with similar use cases.

--

Try Databricks for free. Get started today.

The post Hyper-Personalization Accelerator for Banks and Fintechs Using Credit Card Transactions appeared first on Databricks.

Enabling Zero Trust in the NOC With Databricks and Immuta


This post was written in collaboration with Databricks partner Immuta. We thank Sam Carroll, Partner Solutions Architect, Immuta, for his contributions.

 
Imagine you are a NOC/SOC analyst in a globally distributed organization. You just received an alert, but you can’t access the data because of compliance roadblocks – and as a result your response time lags. But, what if there was a way to enforce data privacy and catch the baddies?

In this blog, you will learn how Databricks and Immuta solve the complicated problem of large scale data analysis and security. Databricks provides a collaborative, cloud-native lakehouse platform for big data analytics and artificial intelligence (AI) workloads. Immuta ensures that data is accessed by the right people, at the right time, for the right reasons. The Databricks Lakehouse Platform, coupled with Immuta’s user and data access controls, empowers organizations to find insights in the most stringent zero trust and compliance environments.

The growing need for Zero Trust architectures

Modern Network Operations Centers (NOCs) and cybersecurity teams collect hundreds of terabytes of data per day, amounting to petabytes of data retention requirements for regulatory and compliance purposes. With ever-growing petabytes that contain troves of sensitive data, organizations struggle to maintain the usability of that data while keeping it secure.

According to Akamai, a Zero Trust security model is a methodology that controls access to information by ensuring a strict identity verification process is followed. To maintain data security, Zero Trust architecture (ZTA) is emerging as a requirement for organizations. The ZTA maturity model is underpinned by visibility, analytics, and governance. But this model has operational implications, as it requires organizations to adopt technologies that enable large scale data access and security.

We will use a representative example of network data to illustrate how a network security team uses Databricks and Immuta to provide a secure and performant big data analytics platform across multiple compliance zones (different countries). If you want to play along and see the data, you can download it from the Canadian Institute for Cybersecurity. By the end of this blog, you will know how to:

  • Register and classify a sensitive data source with Immuta
  • Implement data masking
  • Enforce least privilege for data access
  • Build a policy to prevent accidental data discovery
  • Execute a Databricks notebook in Databricks SQL with least privilege
  • Detect a DDoS threat while maintaining least privilege

Enabling Zero Trust with Databricks and Immuta

In this scenario, there are network analysts in the United States, Canada, and France. These analysts should only see the alerts for their specific region. However, when we need to perform incident analysis, we’ll need to see information on the events in other regions. Due to the sensitive nature of this data set, we should have a least privilege access model applied in our Databricks environment.

Let’s take a look at the packet capture data in Databricks that we will be using for this blog:

Databricks packet capture data

The above data illustrates packet captures during a simulated distributed denial-of-service (DDoS) attack. In order to make this a regional-based simulation, we will generate a country code for each record. Here’s how the data breaks down by country code:

Data by country code

Now that we’ve established what the data set looks like, let’s dive into how Immuta helps enable Zero Trust.

Immuta is an automated data access control solution that was born out of the U.S. Intelligence community. The platform allows you to build dynamic, scalable, and simplified policies to ensure data access controls are enforced appropriately and consistently across your entire data stack.

The first step to protect this PCAP data set is to register the data source in Immuta. This registration simply requires inputting connection details for our Databricks cluster. Once complete, we will see the data sets in Immuta:

Registered data source in Immuta

Once this table has been registered, Immuta will automatically run sensitive data discovery (SDD). This step is critical, as we will need to prevent certain users from seeing sensitive information. On the PCAP table, SDD identified several different types of sensitivity in the data set, including the IP addresses, personally identifiable information (PII), and location data in the region column.

Now that we’ve classified the data using SDD, let’s build data policies to ensure the sensitive information from the packet data is masked. Immuta offers many dynamic data masking techniques, but in this case we will use an irreversible hash. In the real world, it might not be important to see the actual values for an IP address, but the hashed values still must be consistent and unique. This is important for two reasons:

  1. Analysts need to ensure they can determine if a specific IP address is causing a spike in network traffic.
  2. Any data set containing IP addresses must be consistent. For example, imagine we have a table with honeypot server information that will need to be joined on a hashed key. Immuta allows users to perform masked joins while ensuring sensitive information is protected using “projects.” Consistency across data sets ensures no sensitive data slips through the cracks in a collaborative project situation.

The example below shows how easy it is to ensure a user has least privilege access, meaning no user has default access to any data set in a Databricks cluster protected by Immuta. Data owners can specify why someone should get access to the data, whether it be for a legal purpose they need to attest to or based on group membership or user profile attribute. In our example, we will allow any user with a Department of “NOC” on their user profile to see this table in Databricks. With Immuta’s plain English policy building capabilities, a policy is as easy to understand as reading a sentence. Consider the following subscription policy:

Immuta global policy builder

This policy will automatically dictate who can see any table that has been classified as “NOC” transparently in Databricks. Below is the Databricks notebook for someone who doesn’t have the subscription policy applied:

Policy enforced on Databricks query

The user can’t see any tables in Databricks, and if they try to query the table directly, they get a permission error. This least-privilege concept is core to Immuta and one of the reasons it can help enable Zero Trust in Databricks.

Next, let’s enable the subscription policy and see what happens:

As you can see, the user now gets access to the PCAP table and an additional honeypot table because it was classified as a NOC data set. This is beneficial when onboarding new users, groups, or organizations, as it proactively and dynamically lets users see data sets that are relevant to their jobs, while ensuring they can access only those data sets they actually need to see.

Next, let’s build a policy to further protect the data and ensure users don’t accidentally discover any sensitive data in this Databricks environment. Below is a simple data masking policy in Immuta:

Creating a masking policy in Immuta

In the above example, we can see how easy it is to make rules that can protect your NOC data sets. Any data set containing an IP address will now automatically have the IP address or PII masked for anyone in the organization using Databricks. This ensures that your data is protected in a consistent manner.

Now that we’ve defined a single policy that can be applied to all of our Databricks data sets, let’s take a look at what happened to our honeypot table mentioned earlier:

Notice how all the IP address information and PII are dynamically masked. That one policy is now applied consistently across two different tables because their context is the same.

Next, let’s see how we can ensure least privilege access even more granularly by only allowing users to see the regions they are authorized to see. Currently, users can see data from all countries in their data set:

Example of unmasked data set

We will build a segmentation policy in Immuta that uses an attribute derived from an identity manager (in this case, Okta):

Purpose-based data access policy

This policy states that we will use “Authorized Countries” from a user’s profile to dynamically build a filter that matches any column tagged Discovered.Country on any data set containing an IP address field.

Next, let’s dig into the exclusion rule (in green). The exclusion rule states that users won’t see filtered data when they are operating under a “Purpose” context. This purpose context states that if an analyst is working on an incident response, they are authorized to see unfiltered location data. Purposes have many uses in Immuta, including allowing users to elevate their data access privilege when necessary for a specific, approved reason. Changes like this are audited, and users can update contexts directly in a Databricks notebook.

Now that this data policy has been built, let’s dive into what it will do to our PCAP data set:

Example of masked data set

Now this user is seeing only the countries they are authorized to access, thus limiting their access to the least amount of data they should see. Let’s check another user to see what their data looks like:

Example of masked data set

This user is seeing consistent masking, but for the set of countries that they are authorized to access.

Lastly, let’s take a look at the exclusion rule we set earlier. Imagine a cyber attack has just occurred and our analyst needs to view network traffic for the entire company, not just their areas of authorization. To do this, the user is going to switch context into the “Incident Response” purpose:

The user in Databricks updated their “project” context to “Incident Response.” This context gets logged in Immuta, and when the user runs their subsequent query, they will see all regions within the PCAP data set because they are working under this purpose. Now you, as the NOC analyst, can detect DDoS attacks originating outside of your default access level.
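With the “Incident Response” purpose active, a simple aggregation over the (masked) PCAP table is enough to surface a DDoS pattern. The PySpark sketch below uses assumed table and column names for the demo data set; it does not involve any Immuta-specific APIs:

```
from pyspark.sql import functions as F

# Hypothetical schema: one row per captured packet with a (hashed) source IP,
# a destination IP and an event timestamp.
pcap = spark.table("noc.pcap_ddos")

suspected_sources = (
    pcap.groupBy("source_ip", F.window("event_time", "1 minute").alias("window"))
        .agg(F.count("*").alias("packets"),
             F.countDistinct("dest_ip").alias("targets"))
        .filter("packets > 100000")          # illustrative threshold for a traffic spike
        .orderBy(F.desc("packets"))
)

display(suspected_sources)
```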

We’ve walked step-by-step through how quickly Immuta enables least privilege access on Databricks, but to see this process in action, you can watch a demo here. Immuta provides compliance officers assurance that users will only be granted access to the information they should have access to, while maintaining robust audit logs of who is accessing what data source and for what reason. Utilizing both Databricks and Immuta allows NOCs to enable a Zero Trust architecture on the data lake. To get started with Databricks and Immuta, request a free trial below:

https://databricks.com/try-databricks

https://www.immuta.com/try/

--

Try Databricks for free. Get started today.

The post Enabling Zero Trust in the NOC With Databricks and Immuta appeared first on Databricks.

Introducing Lakehouse for Healthcare and Life Sciences


Each of us will likely generate millions of gigabytes of health data in our lifetimes: medical and pharmacy claims, electronic medical records with extensive clinical documentation, medical images; perhaps streaming data from wearable devices, blood biopsy data that can detect cancer, and genomic sequencing. These data sets have enormous potential to uncover new, life-saving treatments, predict disease before it happens, and fundamentally change the way that care is delivered.

For healthcare and life sciences organizations seeking to deliver better patient outcomes, legacy technology is most often the rate-limiting factor. Common challenges include:

  • Data silos and limited support for semi-structured data (like provider notes) and unstructured data (like images) prevent organizations from gaining a holistic view of a patient
  • Rapid growth in data is outpacing the scale of existing infrastructure, preventing population-level research
  • Batch processing and disjointed analytic tools prevent real-time response to challenges such as supply chain constraints and ICU bed capacity
  • Traditional data architectures that don’t support advanced analytics and AI use cases

Sadly, for these reasons, the opportunity to tap into AI-driven innovation is simply out of reach for most organizations on the front lines of developing new drugs and treating patients in need.

Meet Lakehouse for Healthcare and Life Sciences

Well, that’s changing! Today, we’re thrilled to introduce the Lakehouse for Healthcare and Life Sciences — a platform designed to help organizations collaborate with data and AI in service of a unified goal: improving health outcomes. The Lakehouse eliminates the need for legacy data architectures, which have inhibited innovation, by providing a simple, open and multi-cloud platform for all your data, analytics and AI workloads. Building on this foundation are solution accelerators developed by Databricks and our ecosystem of partners for high-value analytics and AI use cases such as disease prediction, medical image classification, and biomarker discovery.

We know that healthcare organizations face a unique, and often painful, set of challenges that can significantly hinder innovation. We’ve designed the Lakehouse to address these challenges and provide the following benefits:

  • Build a 360 degree view of the patient: It’s widely accepted that a vast majority of medical data is unstructured, which makes gaining a holistic patient view that much harder with siloed systems. This problem grows exponentially as healthcare becomes increasingly interconnected between healthcare providers, payers and pharma manufacturers. Lakehouse is open by design and supports all data types, enabling organizations to create a 360 degree view of patient health. Couple this with the pre-built data ingestion and curation solution accelerators to bring health data to your lakehouse, and it’s even easier.
  • Scale analytics for population-level insights: Scale is critical for initiatives like population health analytics and drug discovery, but for years legacy technology has failed to keep up with ballooning health data like genomics and imaging. Built in the cloud and designed for performance, the Lakehouse supports the largest of data jobs at lightning-fast speeds. For example, Regeneron reduced data processing from 3 weeks to 5 hours, and genotype-phenotype queries from 30 minutes to 3 seconds, for workloads that scaled to 1.5M exomes. With the Lakehouse, organizations can quickly and reliably analyze data for millions of patients.
  • Deliver real-time care and operations: Healthcare happens in real-time and requires real-time insights for critical use cases from managing ICU capacity to monitoring the distribution of temperature-sensitive vaccines. Unfortunately, traditional data warehouses aren’t designed to operate in real-time. The Lakehouse enables real-time analysis on streaming data so organizations can deliver care when it’s needed, not after the fact.
  • Leverage predictive health insights: The future of healthcare is predictive, not descriptive. The Lakehouse provides a robust set of analytics and AI tools directly connected to your data so organizations can innovate drug discovery and patient care with machine learning. Additionally, our network of partners has built accelerators for high-value analytics and AI use cases, including drug targeting and repurposing, drug safety monitoring, disease prediction and digital pathology analysis for cancer detection.

With these capabilities, Databricks is empowering a new breed of data and AI innovators in healthcare:

  • Using AI to develop diagnostic and therapeutic products that help children living with behavioral conditions
  • Built personalization models that increased medication adherence by nearly 2%, improving the quality of life for their customers
  • Applied machine learning to 17M+ electronic health records to identify new treatment indications for approved therapies
  • Delivering recommendations to patients using streaming data from connected health wearables for diabetes management

Tailor-made Solutions for Healthcare & Life Sciences

To help organizations realize value from their Lakehouse projects faster, Databricks and our ecosystem of partners have developed solution accelerators and open-source libraries—like Glow for genomics and Smolder for HL7v2 messages—to address common industry use cases.

  • Data Ingestion and Curation Tools – easily ingest structured and unstructured health data (e.g. FHIR/HL7v2, imaging, genomics) into your Lakehouse for analytics at scale with our templates for data ingestion and curation.
  • Analytics and AI Templates – packaged solutions for high-value analytics and AI use cases such as drug target identification, drug repurposing, disease risk prediction, medical image analytics (e.g. detecting cancer in pathology images) and more.

Featured partner solutions

  • Intelligent Drug Repurposing: Identify new therapeutic uses for existing drugs with the power of data and machine learning.
  • Interoperability: Automate the ingestion of streaming FHIR bundles into your lakehouse and standardize with OMOP for patient analytics at scale.
  • Natural Language Processing for Healthcare: Extract insights from unstructured medical text for use cases such as automated PHI removal, adverse event detection, and oncology research.
  • Biomedical Research: Improve biomarker discovery for precision medicine with a highly scalable and extensible whole-genome processing solution.
  • Intelligent Data Management

Check out our full set of solutions on our Lakehouse for Healthcare and Life Sciences page.

Get started building your Lakehouse

You have the data. Now you have the platform. Join the hundreds of healthcare and life sciences organizations innovating on the Lakehouse. Here are some resources to help you get started:

--

Try Databricks for free. Get started today.

The post Introducing Lakehouse for Healthcare and Life Sciences appeared first on Databricks.
