
How YipitData Extracts Insights From Alternative Data Using Delta Lake

This is a guest post from YipitData. We thank Anup Sega, Data Engineering Tech Lead, and Bobby Muldoon, Director of Data Engineering at YipitData, for their contributions.

 
Choosing the right storage format for any data lake is an important responsibility for data administrators. Tradeoffs between storage costs, performance, migration cost, and compatibility are top of mind when evaluating options. One option to absolutely consider for your data lake is Delta Lake, an open-source, performant storage format that can radically change how you interact with datasets of any size.

With Delta Lake and its streaming capabilities, YipitData efficiently analyzes petabytes of raw, alternative data to answer key questions from leading financial institutions and corporations. This blog will outline YipitData’s general design and approach to alternative data using Delta Lake.

How YipitData produces insights for its clients

YipitData specializes in sourcing, productizing, and distributing alternative data to the world’s largest investment funds and corporations. Product teams of data analysts complete deep research on raw datasets, design accurate methodologies to answer client questions, and distribute that research over a variety of mediums. Data engineers operate as administrators to provide tooling on top of the Databricks platform to make data operations convenient, reliable, and secure for product teams. Early on, product teams were focused on analyzing public web data that was relatively small in scale. The data was stored in Parquet and transformed in a series of batch transformations using Apache Spark™ to inform the analyses in reports, charts, and granular data files delivered to clients.


Figure 1: YipitData product teams analyze a variety of data sources
that are cleaned, aggregated, and distributed to clients

Over the last few years, teams have increasingly worked with a variety of new source data, such as card, email, and many other data types, to deliver research on 100+ companies and counting. These new data sources are dramatically increasing YipitData’s ability to provide market insights across a variety of verticals and better serve its clients.

Challenges with Alternative Data

As the number of data sources increased, the scale and volume of new datasets were concurrently growing. Traditional approaches of using batch transformations were not scaling to meet the needs of product teams.

  • Large, frequent data deliveries required constant refreshes of downstream analyses in a short timeframe.
  • Batch transformations would take too long, which was an important consideration in providing timely insights to clients from new data sources.
  • Techniques to understand shifting data trends on production tables became complicated and unreliable for analysts with data at this scale.
  • Attempts to use incremental streaming transformations were plagued by unexpected and unverifiable changes in ETL pipelines.

YipitData needed to update its approach to ingesting, cleaning, and analyzing very large datasets, to continue to fulfill its mission of helping clients answer their key questions.

Delta is a robust data storage solution that improves analytics

YipitData analysts now use Delta Lake as the storage layer for all ETL processes and ad hoc exploration to analyze alternative data. This allows them to:

  • Leverage Delta’s structured streaming APIs that offer intuitive, reliable, and efficient transformations on high volume datasets.
  • Track changes in transformed data and retain past versions using Delta’s transaction layer to reduce the risk of data loss or corruption.
  • Perform QA using Delta time travel on data generated via complex batch transformations.
  • Gain a “source of truth” as Databricks autoloader reliably converts data feeds from third-party providers into “bronze” Delta tables.

Delta streaming facilitates high-volume data ingestion

Product teams are increasingly focused on extracting value from large datasets that YipitData sources externally. These raw datasets can be upwards of several TBs in size and add 10-20 GB of new data across hundreds of files each day. Prior to Delta, applying transformations via batch was time-consuming and costly. Such transformations were inefficiently reprocessing ~99% of the dataset every day to refresh downstream analyses. To generate a clean table of up-to-date records, the following code was used:

# Some licensed data is delivered as a high volume, append-only data feed of records
# records can be flagged as either new ("N"), update ("U"), or delete ("D") from the source party
from pyspark.sql import functions as F, DataFrame, Window

new_or_updated = (
    spark.table("records_feed")
    .where(F.col("record_status").isin(["N", "U"]))
)
deleted = (
    spark.table("records_feed")
    .where(F.col("record_status") == "D")    
)

clean_records = (
    new_or_updated
    .withColumn("rank", F.rank().over(Window.partitionBy("record_id").orderBy(F.desc("record_created_timestamp"))))
    .where(F.col("rank") == 1)
    .drop("rank")
    .join(deleted, ["record_id"], how="left_anti")
)

clean_records.write.format("parquet").mode("overwrite").saveAsTable("records_clean")

With Delta streaming, analysts can reliably and exclusively operate on the incremental data delivered and exponentially increase the efficiency of ETL workflows. Delta’s declarative APIs make it easy to surgically add, replace, or delete data from a downstream Delta target table with transaction guarantees baked in to prevent data corruption:

from pyspark.sql import functions as F, DataFrame
from delta.tables import DeltaTable


def cdc_transformation(batch_df: DataFrame, batch_id: int) -> None:
    new_records = batch_df.where(F.col("record_status") == "N")
    new_records.write.format("delta").mode("append").saveAsTable("records_clean")
    
    table = DeltaTable.forName(spark, "records_clean")
    table_df = table.toDF()
    
    updated_records = batch_df.where(F.col("record_status") == "U")
    (
        table
        .merge(updated_records, updated_records.record_id == table_df.record_id)
        .whenMatchedUpdateAll()
        .execute()
    )
    
    deleted_records = batch_df.where(F.col("record_status") == "D")
    (
        table
        .merge(deleted_records, deleted_records.record_id == table_df.record_id)
        .whenMatchedDelete()
        .execute()
    )

    
(
    spark.readStream.table("records_feed")
    .writeStream.format("delta")
    .option("checkpointLocation", "dbfs://records_clean/_checkpoints")
    .foreachBatch(cdc_transformation)
    .trigger(once=True)
    .start()
)

Using APIs such as .merge, .whenMatchedUpdateAll, and .whenMatchedDelete, data processing costs on this dataset were reduced by 50% and runtime by 75%.

Tracking changes to production tables using Delta History

While some ETL workflows created by YipitData analysts are conducive to structured streaming, many others are only possible via batch transformations. Retaining outdated data from batch transformations is useful to audit and validate the data products that are shipped to clients. With Delta, this functionality comes out of the box as each table operation creates a new version of that table. Analysts can quickly understand what operations were performed to the datasets they publish and even restore versions to revert any unanticipated changes.

Using the HISTORY operation, YipitData analysts gain visibility into the “who, what, where, and when” regarding actions on a table. They can also query past data from a table using Delta Time Travel to understand the state of the table at any point in time.

-- Display all recent transactions on a delta table
DESCRIBE HISTORY records_clean

-- Query data from a past version
SELECT *
FROM records_clean VERSION AS OF 5

-- Alternatively, query data from a specific point in time
SELECT *
FROM records_clean TIMESTAMP AS OF '2021-01-01'

Using a combination of these tools, analysts construct dynamic queries to QA data over time as each table gets overwritten repeatedly:

WITH previous_data AS (
  SELECT
    date_trunc('week', record_date) AS week,
    COUNT(record_id) AS txns,
    SUM(amount) AS total
  FROM
    records_clean TIMESTAMP AS OF '2021-01-01'
  GROUP BY
    1
), 
current_data AS (
  SELECT
    date_trunc('week', record_date) AS week,
    COUNT(record_id) AS txns,
    SUM(amount) AS total
  FROM
    records_clean
  GROUP BY
    1
)
SELECT
  p.week,
  p.txns AS previous_txns,
  c.txns AS current_txns,
  p.total AS previous_total,
  c.total AS current_total,
  ((c.txns - p.txns) * 100.0 / p.txns) AS txns_diff_pct,
  ((c.total - p.total) * 100.0 / p.total) AS total_diff_pct
FROM
  previous_data p
  LEFT JOIN current_data c USING(week)

In scenarios where a table is overwritten incorrectly, Delta offers a handy RESTORE operation to undo the change relatively quickly. This has substantially improved the durability of production data without requiring complex solutions from the engineering team. It also empowers analysts to be more creative and experimental in creating new analyses, as modifying data stored in Delta is far less risky.
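For example, a bad overwrite can be rolled back with a single command; a minimal sketch (the table name and version number are illustrative):

# Roll the table back to a known-good version identified via DESCRIBE HISTORY
spark.sql("RESTORE TABLE records_clean TO VERSION AS OF 5")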

Creating a “source of truth” with Databricks Autoloader

As YipitData increasingly ingests alternative data from a variety of sources (web, card, email, etc.), keeping an organized data lake is paramount to ensuring new data feeds get to the right product owners. Databricks Autoloader has allowed YipitData to standardize the ingestion of these data sources by generating “Bronze Tables” in Delta format. Bronze tables serve as the starting point(s) for analyst-owned ETL workflows that create productized data in new, downstream “Silver” and “Gold” tables. YipitData analysts complete their analysis only on Delta tables and do not have to deal with the challenges of working with raw data formats that typically offer worse read performance, among other drawbacks.


Figure 2: Use the cloudFiles connector in Databricks to stream incremental file deliveries
of any raw format into “Bronze” Delta tables with transactional guarantees

Autoloader specifically manages data ingestion from common file formats (JSON, CSV, Parquet, etc.) and updates Delta tables incrementally as the data lands. The HISTORY of the table is also used to understand how much data has been delivered and to query the past versions of these tables to debug issues with downstream ETL workflows.
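As a rough sketch (the source path, schema location, and table name below are illustrative, not from YipitData's pipelines), an Autoloader stream into a bronze Delta table looks like this:

# Incrementally ingest newly delivered raw JSON files into a "bronze" Delta table
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "dbfs:/bronze/records_feed/_schema")
    .load("s3://vendor-deliveries/records/")
    .writeStream.format("delta")
    .option("checkpointLocation", "dbfs:/bronze/records_feed/_checkpoints")
    .trigger(once=True)
    .toTable("records_feed_bronze")
)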

Migrating YipitData’s data lake to Delta

To fully realize the benefits of Delta, migrating all existing workloads to store data in Delta format instead of Parquet was necessary. YipitData has over 60,000 tables housing petabytes of data, so the migration process was important to consider. Prior to this migration, analysts had an in-house PySpark function developed by the data engineering team to generate tables from SQL queries or dataframes. This “create table” utility standardized table creations in Parquet format by wrapping the PySpark dataframe APIs.

Delta supports all Spark DataFrame APIs, so it was straightforward to switch to writing tables as Delta instead of Parquet. For tables that were already in Parquet format, the CONVERT operation migrates them to Delta in place without duplicating cloud storage files. Through these two features, the create table utility was reimplemented under the hood, and all data in ETL workflows is converted to or written out in Delta automatically. As a result, YipitData’s entire data lake switched to Delta with minimal impact on its analysts.
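A minimal sketch of that in-place conversion (the table name is illustrative), using either SQL or the DeltaTable Python API:

from delta.tables import DeltaTable

# Convert an existing Parquet table to Delta in place; no cloud storage files are duplicated
spark.sql("CONVERT TO DELTA records_clean_parquet")

# Equivalent Python API; a partition schema can be supplied for partitioned tables
DeltaTable.convertToDelta(spark, "records_clean_parquet")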

Conclusion

YipitData’s success is driven by its ability to answer clients’ questions quickly and accurately. With structured streaming ingestion backed by Delta, YipitData is able to quickly and reliably ingest, clean, and analyze very large, valuable alternative datasets.

  • Declarative streaming APIs drastically reduce ETL runtime leading to timely, valuable insights for clients.
  • Delta provides transactions for every table operation, which allows YipitData analysts to create resilient ingestion pipelines that they can monitor.
  • Data streamed into Bronze Delta tables via autoloader can be queried and transformed by any number of downstream users, helping permeate raw data across numerous product teams.

As a result, YipitData analysts can independently incorporate new data sources, stored in Delta, into multiple data products to answer their clients’ questions. These sources even fuel new product development for the company. At the same time, Delta is evolving, and YipitData is excited to continue to unlock business opportunities through its data lakehouse platform.


Read More

How YipitData slashed over $2.5 million off of its AWS bill

Using Databricks as an Analytic Platform at YipitData

Recurring data delivery and ingestion with S3 bucket replication

How YipitData uses the Databricks integration with AWS Glue

The Data Team Effect at YipitData

Customer Story – YipitData turns to its data team to transform financial market information overload into insight.



Managing Model Ensembles With MLflow


In machine learning, an ensemble is a collection of diverse models that provide more predictive power together than any single model would on its own. The outputs of multiple learning algorithms are combined through a process of averaging or voting, resulting in potentially a better prediction for a given set of inputs.

However, there are tradeoffs to the ensemble learning approach; each prediction becomes more difficult to ‘explain’ (model interpretability). In addition, this approach can increase engineering complexity, and it’s often not immediately obvious how to manage ensemble models throughout their lifecycle. Apart from the fact that we are creating N different models, there are several additional concerns around their management such as:

  • If one model changes, how does this impact the ensemble versioning?
  • How do we detect model drift of an ensemble?
  • How do we package the ensemble artifacts and maintain lineage?

This blog post walks through the process of creating and managing ensembles aided by MLflow and Databricks AutoML. If creating and productionizing a single model is hard, then doing the same for an ensemble of models is even harder! Since Databricks AutoML does the heavy lifting of creating all the models, we now have the opportunity of leveraging ensembles with far less effort. A simple stacking strategy using the top N models from some of the architecture types may outperform the single best model.

Ensembles

Some algorithms are natural ensembles (Random forest, AdaBoost), while others are combinations of decision trees and more traditional algorithms like logistic and linear regression. They can even extend into neural networks and deep learning scenarios. Since each algorithm has its own method of modeling the relationships in data, their ensemble can reduce overall variance and bias while improving accuracy.

There are several factors to consider while building an ensemble:

  • What is the size of the dataset?
  • How many models to include in the ensemble?
  • How diverse are the individual models?
  • How are multiple versions of the model maintained?
  • How should they be packaged?
  • Is the model reused across different use cases?

Ensembles usually perform better if there is a lot of variation in the data characteristics. Having a set of diverse learners will help in the overall prediction. However, there is a plateau point, beyond which adding models does not have much impact on the performance. Hence, it is important to balance the cost of creating and managing ensembles with the additional performance gains. Each sub-model in the ensemble will have its own life cycle. Some may have stronger inter-dependencies while others may be more stand-alone. So it is important to consider how the sub-models are trained and packaged for flexible reuse and upgrade.

Let’s take a look at a few use cases that benefit most from an ensemble strategy:

  • Analyzing the ‘Voice of the Customer’ data

Complaint data needs to be addressed as per regulatory guidelines. This requires swift and accurate classification of the complaints as well as human intervention to redress. This data comes along with the regular customer chatter. While it is alright to respond to some customer queries at leisure, the ones which are labelled ‘legal’ or ‘regulatory’ need to be addressed immediately. This is an excellent candidate use case for ensembles as even a small accuracy boost has a magnified impact on business.

Simplifying ensemble creation and management using AutoML and MLflow

  • Finding the best fit in a multi-classification scenario for product recommendations

Prescription data is analyzed to find the appropriate product SKU fit. Models at each layer take the data from the previous layer and refine the classification.
 
Simplifying ensemble creation and management using AutoML and MLflow
Once created and deployed, the model becomes a living artifact that needs to be managed. There are several challenges to consider while managing dependencies and versions of the ensemble and the individual sub-models. In addition, there are various stages (environments), and a model has to perform successfully at each stage to be promoted to the next one. This is further exacerbated when the model is part of an ensemble. Yet another level of complexity arises when the model is shared across different use cases, where the version used by each use case may differ. So pulling the latest version from production may not be the right thing to do for all the dependent use cases.

Simplify ensemble creation and management with Databricks AutoML + MLflow

MLflow is an open source, scalable framework for end-to-end model management. It aids the entire MLOps cycle from artifact development all the way to deployment with reproducible runs.

An ML practitioner can either create models from scratch or leverage Databricks AutoML. For any set of models logged in MLflow,  not only can you take the best one, but you could also see how well a combination of the top N models performs.

Databricks AutoML is a fully automated, glass-box model development solution that democratizes machine learning for rapid prototyping on a selected dataset. Under the hood, it leverages MLflow. AutoML solves two key pain points for data scientists, namely quickly verifying the predictive power of a dataset and getting a baseline model to use as-is or to start refining. It includes:

  • Data pre-processing including Exploratory Data Analytics (EDA) notebooks.
  • Feature engineering & selection.
  • Automated training with hyperparameter tuning and tracking of each run with MLflow Tracking, aiding in the selection of the best model and registering in MLflow Registry.

It is not uncommon for data teams  to spend a lot of time and effort to produce several models of different architecture types in pursuit of optimal model performance. With AutoML, the model creation process has been completely auto-generated, thereby simplifying the subsequent process of model selection.

AutoML currently supports both regression and classification and includes these phases:

  • Configuration: This is where we specify the dataset, problem type, target or label column to predict, the metric for evaluating and scoring the experiment runs, and stopping conditions (such as the number of trials or maximum amount of time to run); a sketch of kicking off such an experiment follows this list.
  • Training: Each ML training run is part of an experiment that we can query and explore later, since all the details (code, parameters, metrics, models, artifacts) are logged.
  • Evaluation: The top model based on our selection criteria is highlighted for scrutiny and subsequent registration. This is where we can use either the single best model (champion) or a combination of top models (challengers) if that outperforms the champion.
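As a minimal sketch (the table name, target column, and parameter values below are illustrative, not taken from the notebook), launching an AutoML classification experiment programmatically looks like this:

from databricks import automl

# Kick off an AutoML classification experiment; every trial is tracked in MLflow
summary = automl.classify(
    dataset=spark.table("telco_churn"),
    target_col="churn",
    primary_metric="f1",
    timeout_minutes=60,
)

# URI of the recommended "champion" model from the best trial
print(summary.best_trial.model_path)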

Let’s examine the Kaggle telco dataset, which is used to predict which customers may churn in the next round. Based on the selection criteria, AutoML recommends not only the single best model but also provides details on all the runs across all the model types. We’ll start by logging the recommended Best Model (Champion) in the MLflow Model Registry, along with the top models in each sub-category (Challengers).

Using a test dataset, we compare the performance between the Champion and the Challengers in this notebook. In the case of the ensemble, a voting strategy was used for final classification. If the ensemble performance is significantly better, that can be the new champion model. Users have different options on how to consume the ensemble model, either individually or collectively.

Flow to determine the best ensemble, log it in the tracking server, promote to registry

Figure: Flow to determine the best ensemble, log it in the tracking server, promote to registry

Option #1

  • Log each model of the ensemble separately in the registry.
  • Promote to staging/production.
  • At inference time, load all registered models and use the ensemble voting strategy to predict.

Option #2

  • Load the individual models.
  • Create an ensemble pyfunc model by passing each individual model to it and log it in the tracking server.
  • Perform test inference using model details from the tracking server.
  • Promote the ensemble model to the registry and transition the model to production.
  • Load the ensemble from the registry to do inference on new data.

In this example, we opt for option #2, which entails logging each model independently and as a single ensemble wrapper model in MLflow.

The ensemble encapsulates all the independent models as a single pickle file. This allows us to deploy the ensemble as one artifact that has a life cycle of its own, separate from the individual contributing models, which can continue to evolve independently. This is very similar to shipping a docker container or an uber jar after combining relevant individual libraries.

Step #1: Fetch the “best” models of each architecture type from the AutoML experiment:

filter_str = "params.classifier LIKE 'DecisionTree%'"

model = (client.search_runs(experiment_ids=experiment_id, filter_string=filter_str,

order_by=["metrics.val_f1_score DESC"]))[0]

best_runId = model.info.run_uuid

DecisionTree_model_uri = f"runs:/{best_runId}/model"

DecisionTree_model = mlflow.sklearn.load_model(DecisionTree_model_uri)


Figure: Best Models of each model type generated by AutoML
(In the example chosen, AutoML identified XGB to be the best model)

Step #2: Build a custom pyfunc model class that encapsulates the best models

This will pickle the different models along with the ensemble. The required functions for the ensemble class are the __init__, load_context, decide, ensembleTopN and predict methods, all of which are fleshed out further down.

class Ensemble(mlflow.pyfunc.PythonModel):

  def __init__(self, DecisionTree, RandomForest, LGBM, XGB):
    # Hold a reference to each contributing model so they are pickled with the ensemble
    self.DecisionTree = DecisionTree
    self.RandomForest = RandomForest
    self.LGBM = LGBM
    self.XGB = XGB

Step #3: Provide a predict function for the ensemble

The predict function for any pyfunc model needs to fit the following paradigm, which is what will be used at inference time to score new data.

The predict function accepts data as a pandas dataframe and returns another pandas dataframe. This allows the model to be interoperable as a web API via MLflow model serving or via Apache Spark™ UDFs/pandas functions.

 # Input is pandas dataframe or series       
  def predict(self, context, model_input):
    dt = self.DecisionTree.predict(model_input)
    rf = self.RandomForest.predict(model_input)
    lgbm = self.LGBM.predict(model_input)
    xgb = self.XGB.predict(model_input)
    ensemble = self.ensembleTopN(dt, rf, lgbm, xgb)

    return pd.DataFrame({
      "DecisionTreePredictions": dt,
      "RandomForestPredictions": rf,
      "LGBMPredictions": lgbm,
      "XGBPredictions": xgb,
      "Ensemble Predictions": ensemble
    })

Step #4: Provide a voting function

The meat of the prediction is determined by the voting algorithm, which can have several variations. Here is an example with the simple approach of majority vote.

  #  Helper function to decide based on the number of models provided
  def decide(self, votes, num_scores):
    # The output and return logic will need to change for multiclass as you need to return 0-N as result.
    if votes >= int(num_scores/2) + 1:
      return 1
    else:
      return 0

  #  Scores is a list of series of predictions from the other classifiers
  def ensembleTopN(self, *scores):    
    # This line needs to change for creating votes for multi class.  
    # Note: requires `import functools` and `import numpy as np` at the notebook level
    votes = functools.reduce(lambda x, y: x + y, scores)
    num_scores = len(scores)
    decide_with_num_scores = functools.partial(self.decide, num_scores=num_scores)
    decide_vec = np.vectorize(decide_with_num_scores)

    # Since this is a binary classification return will be 0 or 1    
    return decide_vec(votes)
Ensemble "champion" model

Figure: Champion Model



Figure: Ensemble challenger models compared to AutoML generated champion model
(In the example chosen, Top4 and Top3 Ensembles are the clear winners.)

Step #5: Package and log the model in MLflow as a custom pyfunc model

Provenance back to the encapsulated models needs to be maintained, and this is where the MLflow tracking server and parameters/tags are used to save the parent model URIs in the ensemble run.

with mlflow.start_run() as ensemble_run:
  mlflow.log_param("DecisionTree", DecisionTree_model_uri)
  mlflow.log_param("RandomForest", RandomForest_model_uri)
  mlflow.log_param("LGBM", LGBM_model_uri)
  mlflow.log_param("XGB", XGB_model_uri)
  
  mlflow.pyfunc.log_model("Ensemble", python_model=
               Ensemble(DecisionTree_model,  RandomForest_model, LGBM_model, XGB_model))

This process becomes very easy to manage and version because there is a single artifact. If something in the pipeline is not functioning, there are significantly fewer moving parts, which makes it easy to debug and validate before the model gets placed in the registry. This paradigm is very similar to shipping a sklearn pipeline, where the pipeline encapsulates all the transformations needed before the predictor. In the end, you also only need to manage a single registered model for prediction.
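As a sketch of that last step (the registry model name below is illustrative), the ensemble logged above can be registered and promoted like this:

from mlflow.tracking import MlflowClient

# Register the logged ensemble artifact and transition it to Production
model_details = mlflow.register_model(
    f"runs:/{ensemble_run.info.run_uuid}/Ensemble", "telco_churn_ensemble"
)

MlflowClient().transition_model_version_stage(
    name=model_details.name,
    version=model_details.version,
    stage="Production",
)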

Step #6: Scoring

The model is now ready to score new data:

import mlflow
# Generate the run uri for the ensemble model from the previous run
single_ensemble_model = f'runs:/{ensemble_run.info.run_uuid}/Ensemble'

# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(single_ensemble_model)
  
# Predict on a Pandas DataFrame.
import pandas as pd
import numpy as np
loaded_model.predict(X_test)

Nuances of ensembles

Multiple models do not necessarily mean an ensemble!

Let us consider the scenario of IoT data sent from different machines across several factories. Each machine has a different operating cycle, so it would be wrong to baseline them together. A model needs to be built per machine. The incoming data is filtered by the type of machine and an appropriate model is applied. Some may argue this is an ensemble. It is a divide-and-conquer approach but the data is trained/scored by a single model. Multiple models are not combined to improve accuracy; hence, this is not an ensemble scenario — it is just N models. The voting strategy discussed earlier can, however, be used on the input data characteristics to invoke the right sub-model.

Summary

The ensemble method is a layering approach in which moderately performant, uncorrelated models are combined into a supermodel that improves accuracy and stability; it is often used as a divide-and-conquer strategy on large, diverse datasets. Beyond the increased engineering complexity and manageability overhead, there is often a tradeoff between accuracy and explainability, which is why people sometimes shy away from ensembles in production, even though they are the preferred approach in Kaggle competitions. AutoML, with its inherent use of MLflow, comes to the rescue by automating and simplifying the creation and management of the underlying models, thereby helping ML practitioners push the boundaries in their quest to extract value from data.
 
Try the notebook  


Related blogs:


AutoML blog

MLflow Model Registry blog


Extracting Oncology Insights From Real-world Clinical Data With NLP


Preview the solution accelerator notebooks referenced in this blog online or get started right away by downloading and importing the notebooks into your Databricks account.

Cancer is a leading cause of death and disease in the U.S., and the numbers are staggering, with nearly 2 million new cases of cancer expected to be diagnosed in the U.S. this coming year. Cancer also represents a significant portion of total U.S. healthcare spending, estimated at more than $200B in 2020. As such, the biopharmaceutical industry is heavily focused on oncology drug development. Nearly 40 new cancer drugs were approved by the FDA in 2019 and 2020 alone, and more than 1,300 new medications and vaccines are in clinical development.

Measuring the efficacy of oncology interventions is critical to matching patients with the right intervention. Oncology data, and related real-world evidence, have the potential to inform clinical research, trial design, regulatory decisions, safety assessments, treatment pathways and more.  Unfortunately, given the highly specialized nature of oncology care, disease criteria and endpoints typically are not available in structured formats and remain locked in data silos, making them hard to aggregate and analyze.

In oncology, pathology reports (often captured in PDF format and siloed in EMR systems), contain critical information, such as tumor size, grade, stage and histology. These variables, once extracted with a natural language processing (NLP) system, can be used to define disease cohorts, assess disease severity and create a baseline for disease progression, which then can be applied to the aforementioned use cases, ranging from clinical trial matching to treatment pathways. But extracting this information from unstructured clinical text data is often a huge pain point for data teams.

John Snow Labs, the leader in healthcare NLP, and Databricks are tackling these challenges head-on and working with many customers across the healthcare ecosystem to translate unstructured oncology data into actionable evidence.

Clinical natural language processing at scale with Databricks & John Snow Labs

The path forward begins with the Databricks Lakehouse Platform, a modern data platform that combines the best elements of a data warehouse—such as data management and performance —with the low cost, flexibility and scale of a cloud data lake. This new, simplified architecture enables health systems to unify all their data—structured (e.g. diagnoses and procedure codes found in EHR databases), semi-structured (e.g. HL7, FHIR messages) and unstructured (e.g. free-text notes and images)— into a single, high-performance platform for both traditional analytics and data science.

Unlocking the power of clinical NLP with Databricks Lakehouse Platform and John Snow Labs.

At the core of the Databricks Lakehouse Platform is Delta Lake, an open-source storage layer that brings performance (via Apache Spark™), reliability and governance to a data lake. Healthcare organizations can land all of their data – including raw provider notes, radiology reports and PDF pathology reports – into Delta Lake. This preserves the original source of truth before applying any data transformations. By contrast, with a traditional data warehouse, transformations occur prior to loading the data, which means that all structured variables extracted from unstructured text are disconnected from the native text.

Building on this foundation is John Snow Labs’ Spark NLP for Healthcare, the most widely-used NLP library in the healthcare and life science industries. Optimized to run on Databricks, Spark NLP for Healthcare seamlessly extracts, classifies and structures clinical and biomedical text data with state-of-the-art accuracy at scale. It is the only native distributed open-source text processing library for Python, Java and Scala, and since every Spark NLP pipeline is a Spark ML pipeline, it is particularly well suited to building unified NLP and machine learning pipelines. Spark NLP provides Python, Java and Scala libraries with the full functionality of traditional NLP libraries (like spaCy, nltk, Stanford CoreNLP and Open NLP) and adds additional functionality, such as spell-checking, sentiment analysis and document classification. You can learn more about the joint Databricks and John Snow Labs solution in our previous blog, Applying Natural Language Processing to Health Text at Scale.

Real-world oncology data abstraction in action

To demonstrate the power of Databricks and John Snow Labs, we created a Solution Accelerator for abstracting real-world data from oncology notes. The solution accelerator contains sample data, prebuilt code and step-by-step instructions for ingesting and preparing oncology reports for downstream analytics and real-world evidence generation. The solution is ready to go in a Databricks notebook and to help you get started, we’ve included a brief walkthrough of the solution below.

The Databricks and John Snow Labs’ solution accelerator provides an end-to-end natural language processing workflow for ingesting and preparing oncology reports for downstream analytics and real-world evidence generation.

For this solution we used the MT ONCOLOGY NOTES dataset. It offers resources primarily in the form of transcribed sample medical reports across medical specialties and common medical transcription words/phrases encountered in specific sections that form part of a medical report  – sections such as physical examination or PE, review of systems or ROS, laboratory data and mental status exam, among others.

We chose 50 de-identified oncology reports from the MT Oncology notes dataset as the source of the unstructured text and landed the raw text data into the Delta Lake bronze layer. For demonstration purposes, we limited the number of samples to 50, but the framework presented in this solution accelerator can be scaled to accommodate millions of clinical notes and text files.

The first step in our accelerator is to extract variables using various models for Named-Entity Recognition (NER). To do that, we first set up our NLP pipeline, which contains annotators such as documentAssembler, sentenceDetector and tokenizer, along with NER models trained specifically for healthcare-related entities. In the example below, we combined bionlp_ner, which is a clinical NER model, and jsl_ner, which is a pre-trained deep NER model for clinical terminology. We see that the mesothelioma patient is experiencing symptoms such as coughing.
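For orientation, here is a rough sketch of this pipeline shape using the open-source Spark NLP API and general-purpose pretrained models (glove_100d and ner_dl); the accelerator itself swaps in the licensed bionlp_ner and jsl_ner healthcare models from Spark NLP for Healthcare, and notes_df is an assumed DataFrame with a "text" column of report text:

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter

# Assemble raw text into annotations, then detect sentences and tokens
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

# General-purpose pretrained embeddings and NER model; the accelerator uses clinical equivalents
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")
ner = NerDLModel.pretrained("ner_dl") \
    .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
ner_converter = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

ner_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer,
                                embeddings, ner, ner_converter])
entities_df = ner_pipeline.fit(notes_df).transform(notes_df)  # ner_chunk holds the extracted entities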

Example from the Databricks and John Snow Labs’ accelerator showing how to use a pre-trained Named-Entity Recognition (NER) model to extract patient symptoms.

Extracting named entities from texts is a great example of AI-assisted ETL: pre-trained deep learning (DL) models enable us to transform unstructured data into a structured format that can be used for downstream clinical analysis.

Once we have the symptoms extracted, we can map to ICD-10 codes, which can be used for coding automation and improving Hierarchical Condition Category (HCC) coding accuracy for Medicare Risk Adjustment. We can further use this data to analyze treatment patterns and analyze the association between symptoms and oncological entities.

Average risk indication for coded symptoms in the clinical dataset

Figure 1: Average risk indication for coded symptoms in the clinical dataset

A visualization of symptom enrichment among most frequent conditions in the dataset.

Figure 2: A visualization of symptom enrichment among most frequent conditions in the dataset

We can also generate a chart to study the assertion status of these symptoms as being present, absent or associated with someone else (for example, a family member).

Example Databricks and John Snow Labs’ Oncology NLP Solution Accelerator visualization depicting the assertion status of the symptoms as being present, absent or associated with someone else (for example, a family member).

Continuing with the same note set, we run descriptive and visual statistics to display the most common oncology entities (example below) stratified by their assertion status.

Example of Databricks and John Snow Labs’ Oncology NLP Solution Accelerator visualization depicting the assertion status of most common symptoms

Figure 3: Assertion Status Of Most Common Symptoms.

Next, we can look at treatments, including drug frequency and duration, which form the basis of oncology regimens. Below is a screenshot of the NLP model included in our solution notebook extracting drug treatment and duration information.

Example Databricks-John Snow Labs Oncology NLP Solution Accelerator text analysis for treatments, including drug frequency and duration.

We can then associate symptoms in relation to treatments, as well as disease statuses such as relapse, with confidence scores.

 Example Databricks-John Snow Labs Oncology NLP Solution Accelerator analysis associating symptoms in relation to treatments, as well as disease statuses such as relapse, with confidence scores.

This data is critical for ensuring both the quality of individual patient care and population-level research, which can help determine the efficacy and safety of interventions in the real world.

Using the Databricks Lakehouse Platform, we can also easily create a database of conditions, symptoms and procedures, along with other relevant extracted information from the unstructured notes, which can then be used for downstream analysis, clinical decision support and research.


With this solution accelerator, Databricks and John Snow Labs have opened the door to extract oncology data at scale with the quality required for real-world evidence generation.

Get started extracting RWD from oncology notes with NLP

To use this solution, preview the notebooks online or get started right away by downloading and importing the notebooks into your Databricks account. The notebooks include guidance for installing the related John Snow Labs NLP libraries and license keys.

You can also visit our industry pages to learn more about our Healthcare and Life Sciences solutions.


Catalog and Discover Your Databricks Notebooks Faster

This is a collaborative post from Databricks and Elsevier. We thank Darin McBeath, Director Disruptive Technologies — Elsevier, for his contributions.

 
As a global leader in information and analytics, Elsevier helps researchers and healthcare professionals advance science and improve health outcomes for the benefit of society. It has supported the work of its research and health partners for more than 140 years. Growing from its roots in publishing, Elsevier provides knowledge and valuable analytics that helps users make breakthroughs and drive societal progress. Digital solutions such as ScienceDirect, Scopus, SciVal, ClinicalKey and Sherpath support strategic research management, R&D performance, clinical decision support, and health education. Researchers and healthcare professionals rely on Elsevier’s 2,500+ digitized journals, including The Lancet and Cell; 40,000 eBook titles; and its iconic reference works, such as Gray’s Anatomy.

Elsevier has been a customer of Databricks for about six years. There are now hundreds of users and tens of thousands of notebooks across their workspace. To some extent, Elsevier’s Databricks users have been victims of their own success, as there are now too many notebooks to search through to find earlier work.

The Databricks workspace does provide a keyword search, but we often find the need to define advanced search criteria, such as creator, last updated, programming language, notebook commands and results.

Interestingly, we managed to achieve this functionality using a 100% notebook-based solution with Databricks functionalities. As you will see, this makes it easy to set up in a customer’s Databricks environment.

API-first approach to scan for notebooks

Databricks provides a robust set of APIs that enables programmatic management of accounts and workspaces. For this solution, we leverage the Workspace API to programmatically list and export notebooks and folders inside our workspace.
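As a minimal sketch (the host, token, and paths below are placeholders, and error handling is simplified), the two Workspace API calls we rely on look like this:

import requests

HOST = "https://<databricks-instance>"
TOKEN = "<personal-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def list_objects(path):
    # List the notebooks and folders directly under `path`
    resp = requests.get(f"{HOST}/api/2.0/workspace/list",
                        headers=HEADERS, params={"path": path})
    resp.raise_for_status()
    return resp.json().get("objects", [])

def export_notebook(path):
    # Export a notebook's source so its command cells can be indexed
    resp = requests.get(f"{HOST}/api/2.0/workspace/export",
                        headers=HEADERS,
                        params={"path": path, "format": "SOURCE", "direct_download": "true"})
    resp.raise_for_status()
    return resp.text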

We also parallelize the API calls to speed up the cataloging process and make the parallelism configurable within Databricks’ rate limit of 30 requests per second. To avoid “429: Too Many Requests” errors, we implemented an exponential retry mechanism inspired by the Delta Sharing Apache Spark™ connector.
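A simplified sketch of such a retry wrapper (the function and parameter names are illustrative):

import random
import time

def with_retries(request_fn, max_attempts=6):
    # Call a Workspace API request function, backing off exponentially on HTTP 429
    for attempt in range(max_attempts):
        resp = request_fn()
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Sleep 1s, 2s, 4s, ... plus jitter before retrying the rate-limited call
        time.sleep((2 ** attempt) + random.random())
    raise RuntimeError("Workspace API call exceeded retry budget")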

Cataloging using Parquet

This solution does not require any external full-text search system like Solr or Elasticsearch. Instead, we leverage Parquet files for our notebook index. The index is simply a data table where each row describes a separate command cell. Each row includes the following fields (a schema sketch follows the list):

  • Notebook information: language (nbLang), name (nbName), folder path (nbFolder), url (nbUrl)
  • Command information: cell text (cText), last-run date (cDateTime), cell language (cLang), url (cUrl)
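A hypothetical Spark schema matching the columns above (a sketch; the column names follow the fields described in this post, not the tool's actual definition):

from pyspark.sql import types as T

# One row per command cell; notebook-level fields are repeated on each row
index_schema = T.StructType([
    T.StructField("nbLang", T.StringType()),        # notebook default language
    T.StructField("nbName", T.StringType()),        # notebook name
    T.StructField("nbFolder", T.StringType()),      # workspace folder path
    T.StructField("nbUrl", T.StringType()),         # link to the notebook
    T.StructField("cText", T.StringType()),         # command cell text
    T.StructField("cDateTime", T.TimestampType()),  # last-run date of the cell
    T.StructField("cLang", T.StringType()),         # cell language (may differ via %magic)
    T.StructField("cUrl", T.StringType()),          # link to the command cell
])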

Screenshot of notebook_index table schema in Databricks Data Explorer

A notebook-based solution for notebook search

The index and search functionalities are provided by three notebooks (NotebookIndex, NotebookSearch and NotebookSimilarity). Two helper notebooks (NotebookIndexRun and NotebookSimilarityRun) make it easy to configure the index and similarity capabilities.

NotebookIndex

The notebook-based solution leverages the Workspace API to export notebooks programmatically and populate our Parquet table.

Most organizations will have Workspace object access control enabled so that a user can only manage their own notebooks and those in the Shared location. NotebookIndex runs only with the permissions of the user and is limited to notebooks that user can view.

If an organization wants a full catalog of all notebooks in their workspace, an administrator must run the indexing to have a workspace-level catalog. In addition, we expect most organizations will have users and workgroups creating their own index files, which will only contain records for notebooks the users are allowed to see.

NotebookIndexRun

This is a helper notebook for users or groups to run the indexing process. It lets them select which user folders will be scanned — such as their own or perhaps the members of their group. Elsevier found this particularly useful for users in the Labs group.

As noted above, only notebooks readable by the user running the indexing notebook will appear in the index. In the following example, the notebooks contained in the /Shared/ folder and in the selected user folders (someone1, someone2, and so on) will be indexed.

Screenshot of NotebookIndexRun notebook

NotebookSearch

Each user that wants to use NotebookSearch should clone this notebook into their own workspace. The notebook provides examples for searching the index table described earlier. We expect users will edit their copy to specify which table to use and then customize the examples to suit their needs. Several examples of such searches are given later on in this blog.

Beyond the examples, we have also provided a displaySearchResults function that displays the search results using HTML to be more user friendly:

  • The language column identifies the language for the command, and the folder indicates the folder location where the notebook (identified in the notebook column) is stored.
  • The notebook links take you to the notebook containing the match.
  • The command links take you to the actual command cell within the notebook.

Screenshot of the output of the displaySearchResults command

In order to view the notebooks (and commands) linked from the filter results, the user must have read permissions for them.

NotebookSimilarity and NotebookSimilarityRun

Now that we’ve captured all commands in all notebooks within our workspace, we can run further analysis on the notebook catalog. One idea is to identify similar notebooks using Natural language processing (NLP) techniques.

Since there is currently no provenance chain to trace the history of notebooks as they are cloned, this helps to identify notebooks that have been potentially cloned from one another. While it’s not possible to identify the initial notebook from where other notebooks were cloned, we can identify notebooks with very similar text based on a threshold.

There are many other measures of similarity, each with its own eccentricities. The NotebookSimilarity notebook demonstrates a simple example using Jaccard distance by representing each notebook as sets of words.

We apply some simple preprocessing to remove markdown, results and empty cells and combine all notebook commands as a string. We then leverage the MinHash function for Jaccard distance in MLlib, which allows for scalability to tens of thousands of notebooks.

This compares notebooks discovered in the index table to produce a similarity score. Instead of keeping a full matrix of all similarities, we specify a maximum similarity distance (e.g., 0.1). For every notebook, a list of notebooks within that distance is kept ready for searching.
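A minimal sketch of that approach with MLlib (notebooks_df is an assumed DataFrame with one row per notebook and a cTextAll column holding its combined command text):

from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, CountVectorizer, MinHashLSH

# Represent each notebook as a binary bag of words, then hash it for Jaccard similarity
tokenizer = RegexTokenizer(inputCol="cTextAll", outputCol="words", pattern="\\W+")
vectorizer = CountVectorizer(inputCol="words", outputCol="features", binary=True)
minhash = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)

model = Pipeline(stages=[tokenizer, vectorizer, minhash]).fit(notebooks_df)
hashed = model.transform(notebooks_df)

# Keep only notebook pairs within the chosen Jaccard distance threshold (e.g., 0.1)
similar_pairs = model.stages[-1].approxSimilarityJoin(
    hashed, hashed, threshold=0.1, distCol="jaccard_distance"
)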

Screenshot of NotebookSimilarity notebook

Example use cases: shell, Spark SQL and Scala

Wouldn’t it be nice to know which notebooks contain cells that leverage shell commands? The following example searches for shell commands, that is, cells that start with the magic command %sh. (Note: while we are looking for specific cells that use shell commands, they will appear within notebooks that have an overall default language. Thus, it makes sense to show the first column, which tells what that default language is.)

Screenshot of the search output for cells that use shell commands

Do you ever run into the situation of trying to remember how to use a specific Spark SQL function – maybe one you or a buddy has used?  Wouldn’t an in-context example be more helpful than resorting to a web search and hoping to find a needle in the web haystack? The following example searches for commands within a specific user’s area containing the string “collect_list.”

Screenshot of the search output for cells that use Scala commands, contain the string “collect_list” and are in a specific user’s area

Do you ever wonder what notebooks have been executed recently? The following example searches for notebooks containing a cell that has been executed since Aug. 12, 2021.  By specifying distinctNotebooks=true, we roll up all of the commands (containing a match) for the same notebook to a single hit for the notebook and only present a link to the notebook.

Screenshot of the search output for cells that use scala commands since Aug. 12, 2021

The above basic examples only scratch the surface of what can be searched in the index table. The following are some representative questions we have seen on the Databricks Community (formerly Forums) over the past couple of years that should easily be addressed by Notebook Discovery:

  • I’m updating a source table and need to find all the notebooks that have that table. I tried using the search function in Databricks UI, but my problem is I’m getting results from every folder, including other users. Is there a way to conditionally limit searches to a certain folder?
  • Can you search notebooks using conditionals, such as exact phrases, contains(not), wildcards or regular expressions?
  • Is there a way to search for a string in all notebooks? I like to find out whether a Delta table is used in any notebooks.
  • Can I search for similar commands in other notebooks for debugging?

Get started

Elsevier Labs has released this solution as the Notebook Discovery tool and is now publishing it as open source under the very permissive MIT License. Notebook Discovery is provided as a DBC (Databricks archive) file, and it is very simple to get started:

  1. Download the archive: Download the Notebook Discovery archive (DBC file) to a location on your machine.
  2. Importing the notebooks: From the Databricks UI, import the downloaded DBC file into a folder. The workspace folder where the archive is imported is unrelated to the workspace folders that you want to index.
  3. Generating the index file: The person generating the index file needs to edit the NotebookIndexRun helper, indicating the folders to index and specifying the location of the index file. The indexing process will start and produce the index file when done.
  4. Searching for notebooks: Other users should clone the NotebookSearch notebook into their area and edit it to use the right index file. They can then edit the searches to their liking. Several example searches are given above in the example use cases section.
  5. Detecting similar notebooks: If users want to look for similar notebooks, they need to edit the NotebookSimilarityRun file and run the job to generate the similarity file.

Over the past couple of months, Elsevier users have found Notebook Discovery to be very useful and decided to share this with the community.  We hope you benefit from using this tool as well.


Shiny and Environments for R Notebooks


At Databricks, we want the Lakehouse ecosystem to be widely accessible to all data practitioners, and R is a great interface language for this purpose because of its rich ecosystem of open source packages and broad use as a computing language in many non-computing scientific disciplines.

The product team at Databricks actively engages with R users to identify pain points and areas for improvement. A highly requested feature is the ability to launch and share Shiny applications inside Databricks notebooks. Previously, users could develop Shiny apps inside a hosted RStudio server on Databricks, but a key limitation was not being able to share the app URL with other users.

In addition, we consistently heard that the existing package management for R code on Databricks, a feature that was introduced in 2017, was not adequate. Users want to simply call the familiar install.packages() function inside a notebook and have the package available on all the workers of the cluster. In addition, the library isolation that was introduced for Python notebooks was attractive for R users.

Shiny inside R Notebooks


Databricks users have been able to interactively develop and test Shiny applications inside the hosted RStudio Server on Databricks. We are taking our support for Shiny to the next level by enabling R notebook users to develop and share live Shiny apps and dashboards.

Using interactive notebooks to build data applications, such as Shiny apps, is an emerging paradigm. Both notebooks and data applications are powerful tools, and data scientists naturally want to use them together. More importantly, a data application running inside a hosted notebook can be easily shared. Users would not need to “publish” Shiny applications to share them. They can simply copy the URL of the app and send it to collaborators. As long as the notebook is attached to the cluster and users have “Can Attach To” permission on the cluster, they can view and interact with the Shiny app.

To try this feature, copy the code for any sample Shiny application into a new R notebook and attach it to a cluster (or single-node) running Databricks Runtime 8.3 or above. The cell that starts the app will display a clickable link, which will open your application in a new tab.

You can use the new Databricks Repositories to check out a Shiny application from a git repository. Simply place the runApp() call in a notebook cell and launch the application.

Streaming output

Another new improvement in Databricks R notebooks is streaming the standard output of long-running commands. If you run long-running functions that print intermediate results (e.g., iterative optimization), you can now see the results as they are being generated. Similarly, if a piece of code generates warning messages before returning, you can view those messages in real time. This means your Shiny application’s log messages will be printed in the results section of the notebook cell that started the app.

Notebook-scoped libraries for R

Previously, all R notebooks running on a Databricks cluster installed packages to a single directory on the driver. This presented two limitations. First, this directory was not shared with the worker nodes, meaning that any library installed on the driver would not be accessible to Apache Spark™ workers. Second, because all notebooks installed the libraries on a shared path, users could run into version conflicts when attempting to install different versions of a package. This is surprisingly common due to transient dependencies.

With notebook-scoped libraries introduced in Databricks Runtime 9.0, each R notebook gets a unique path to install and load libraries. This path exists on a cluster-scoped NFS mount, which allows Spark workers to access libraries installed on the driver by the notebook user. This improves productivity because, as an R user, you do not need to switch out of your notebook to configure a cluster-scoped library — simply install it inside your notebook. This gives you much better isolation on shared clusters.

When the notebook is detached, the system cleans up the per-notebook unique directory. As a result, when a notebook is detached and reattached to a cluster, a new notebook-scoped directory is generated, and the previous library state is reset.

If you are interested in understanding how R manages the package installation and import, as well as how Databricks implements notebook-scoped libraries for R notebooks, please read our user guide.

It is worth noting renv, another approach to isolating R environments. renv is an R package that lets users manage R dependencies specific to a notebook. To learn more about how to use renv inside Databricks notebooks, visit our guide for renv.

 

Try these new features inside R Notebooks


Interning From a Distance


Summer 2021 brought another summer of virtual game nights, pizza parties and team-building events for Databricks interns. In addition to working on impactful projects that ranged from improving our customer user journey to scaling the Databricks authentication services, our interns were also able to build relationships with their peers and create their own community with the rest of Databricks through various events. Whether it was building camaraderie through Intern Olympics, forming an intern running club during Wellness Week, or cheering each other on through project presentations, our intern team members came together for another great virtual summer. Take a look at some highlights for our summer software engineering interns:

Virtual Pizza & Puzzle event

Pizza & Puzzle Night across the globe

Nothing brings people together like good food… especially when it’s fresh pizza that’s delivered right to your doorstep, no matter where you are! We invited our interns from around the world to order their own favorite type of pizza to accompany a Virtual Escape Room on a Friday afternoon, and our interns from California to Serbia, New York to Amsterdam enjoyed a fresh slice together.

“Pizza & Puzzle night was a very interesting event because I had the opportunity to meet other interns before my internship even started. Also, the pizza was great! It arrived [to me in Serbia] just in time, 10 minutes before the event. Besides pizza, of course, I enjoyed the puzzle game as well. That was the first time I played an online room escape game; it was pretty interesting since we weren’t able to go to the real room escapes for a long time.” – Filip Ćosović, Storage & IO Team @ AMS

Second annual Intern Olympics

Virtual Intern Olympics

Collaboration is a big part of our culture at Databricks, and the Intern Olympics provided a perfect opportunity for our interns to not only build relationships but also collaborate as one team! Challenges included things like virtual movie night, group workouts, science experiments, and creating Databricks-themed haikus. Congrats to our Yellow Team for winning the Gold!

“The Intern Olympics was a super fun way of bringing the intern class together! We had a bunch of super fun challenges to tackle with our teams, like cooking together over Google Meet. It was an awesome way of tackling disconnection in a remote work environment.” – Ben Zhang, Data Gov Team

Artly virtual improv class

Team improv class

Our interns participated in an interactive, virtual improv session to build trust and engagement with each other. We used improv and acting exercises to learn how to communicate effectively, show support for one another, and improvise when the unexpected happens.

“One of my favorite intern events this summer was a virtual improv class hosted by Artly Working. It’s typically harder to bond with co-workers in a virtual setting, but the spontaneity of the improv activities allowed for people to be more expressive and engaged. I feel like it really brought the intern class together and made people feel more relaxed and welcome to share their thoughts.” – Zi Gao, Data Team


Wellness Week

Mental and physical health are essential to being your whole self at work — that’s why we hosted a Wellness Week for our interns! We kicked it off by inviting a fellow Brickster and trained mindfulness facilitator to lead a Mindfulness Workshop, and the rest of the week consisted of a virtual volunteering activity, a group stretching session and a company-wide 5k!

“The meditation sessions during our Wellness Week elevated my energy and pushed us to think deeper about our health and well-being! It was also extremely fun completing the 5K challenge with another intern, allowing me to connect with folks in person and build healthy relationships even during the pandemic.” – Kritin Singhal, App Infra Team 

National Intern Day Event: DIY Boba!

Celebrating National Intern Day with boba

On July 29th, we celebrated National Intern Day – a day designed to recognize and celebrate interns – by hosting a virtual DIY Boba Party! Boba is a special tradition at Databricks, and we wanted to share it with our interns, even while they were working from home. Every intern received a custom boba cup, boba straw, and ingredients for a virtual cooking class to make DIY boba!

“I enjoyed the recent Boba event for National Intern Day. We had a lot of fun making tapioca pearls from scratch — from kneading sugared dough to shaping homemade pearls. Besides that, we got a sneak peek into the daily office life at Databricks with stories around Databricks’ Boba traditions. It was also a great way to connect with interns across the globe during the remote internship. Thanks to the university team for planning this awesome event!” – Eugene Koh, BUI Team @ AMS

A huge thank you to our wonderful Summer 2021 intern class for their engagement and positive attitudes! You all made this summer the most memorable one ever, and we’re beyond proud of all the work you contributed to Databricks — see you soon!

Interested in joining us as an intern next year? Check out our University Recruiting page.

--

Try Databricks for free. Get started today.

The post Interning From a Distance appeared first on Databricks.

Bringing Lakehouse to the Citizen Data Scientist: Announcing the Acquisition of 8080 Labs


Transforming into a data-driven organization – which means data has permeated into every facet of your company – is critical for driving meaningful business outcomes. Data literacy is the new buzzword as organizations across industries focus on keeping up with consumer demands and driving innovation, all while meeting ever-evolving compliance measures. Even for organizations that lack large teams of highly-trained data engineers, data scientists or ML engineers, building and productionizing data assets is necessary…but it does not scale without them.

That’s why today, we’re thrilled to announce the acquisition of 8080 Labs, a Frankfurt-based startup behind bamboolib, a no-code data analysis tool built for citizen data scientists. Why are we so excited about this? This is a strategic foray for us into the low-code/no-code space that empowers a broader set of data practitioners and enthusiasts, opening up new pathways to innovation. It builds off our previous acquisition of Redash, which delivers easy-to-use dashboard and visualization capabilities, as we continue to make the power of data and AI accessible to more individuals.

bamboolib delivers an extendable GUI that exports Python code (think of it like recording macros in Excel) for fast, simple data exploration and transformation without requiring users to write any code. The UI-based workflows help make Databricks accessible to citizen data scientists and experts alike and reduce employee onboarding and training costs. These no-code use cases include:

  • Data Preparation: In just a few clicks, clean and organize raw data to make it usable for any downstream use case.
  • Data Transformation: Directly in the UI, easily aggregate and convert highly complex data sets.
  • Data Visualization: Quickly create and export Plotly Express plots and leverage 10x faster data visualizations.
  • Data Exploration: Explore data in minutes with the Explore DataFrame functionality.

In our upcoming roadmap, we will integrate bamboolib’s no-code capabilities across the Databricks Lakehouse Platform.

Empowering citizen data scientists with low-code/no-code

Everyone (not just technical teams) wants to leverage data and AI to drive business impact. This has made the citizen data scientist, which describes anyone who can bring data-driven insights to a discussion, a critical function. They can come from any organization, with any title, and help bridge the gap between their organization and the specialized data science or machine learning team (if one even exists).

The technical reality of most citizen data science tools is that they enable simple data exploration, but not much else – they still require engineering resources to execute on ML use cases. Instead, what organizations need is a solution that democratizes the expertise typically required for building and productionizing ML models, lowering the barrier to entry.

Extending data + AI accessibility with a new approach

We’re solving this dilemma with a unique approach that enables citizen data scientists to perform impactful data science and AI use cases without having to write a single line of code. We pioneered this effort earlier this year with Databricks AutoML, which automates all the heavy lifting of preprocessing, feature engineering and model training and tuning, empowering data enthusiasts to quickly build and deploy ML models at any scale. No coding necessary – AutoML generates baseline models with fully editable notebooks, so citizen data scientists can quickly achieve useful results. Together, bamboolib and Databricks AutoML enable anyone within an organization to prepare data and perform downstream use cases, such as data analysis and ML, without relying on technical experts to implement them.
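To make the AutoML workflow concrete, here is a minimal sketch of launching a classification experiment from a notebook using the Databricks AutoML Python API; the training DataFrame and label column name are illustrative assumptions, not part of this announcement.

from databricks import automl

# train_df is an illustrative Spark DataFrame with a label column named "churn"
summary = automl.classify(
    dataset=train_df,
    target_col="churn",      # hypothetical label column
    timeout_minutes=30       # stop the experiment after 30 minutes
)
# summary links to the generated trial notebooks and the best model found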

What’s next?

We’re thrilled to welcome co-founders Florian Wetschoreck and Tobias Krabel to the Databricks team, especially as we accelerate hiring in our EMEA region.

The integration of bamboolib’s capabilities into the Databricks Lakehouse Platform will be available to customers in early 2022, so stay tuned for an upcoming blog that will dive deeper into this solution and what it means for you.

--

Try Databricks for free. Get started today.

The post Bringing Lakehouse to the Citizen Data Scientist: Announcing the Acquisition of 8080 Labs appeared first on Databricks.

Databricks Repos Is Now Generally Available – New ‘Files’ Feature in Public Preview


Thousands of Databricks customers have adopted Databricks Repos since its public preview and have standardized on it for their development and production workflows. Today, we are happy to announce that Databricks Repos is now generally available.

Databricks Repos was created to solve a persistent problem for data teams: most tools used by data engineering/machine learning practitioners offer poor or no integration with Git version control systems, forcing them to navigate through multiple files, steps and UIs to simply review and commit code. Not only is this time-consuming, but it’s also error-prone.

Repos solves this problem by providing repository-level integration with all popular Git providers directly within Databricks, enabling data practitioners to easily create new or clone existing Git repositories, perform Git operations and follow development best practices.

With Databricks Repos, you get access to familiar Git functionality, including the ability to manage branches, pull remote changes and visually inspect outstanding changes before committing them, so that you can easily follow Git-based development workflows. Furthermore, Repos supports a wide range of Git providers, including GitHub, Bitbucket, GitLab and Microsoft Azure DevOps, and provides a set of APIs for integration with CI/CD systems.
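As a rough illustration of the CI/CD angle, the sketch below calls the Repos REST API from a pipeline step to update a Repo to the head of a branch after a merge; the workspace URL, repo ID and token are placeholders.

import requests

# Placeholders: substitute your workspace URL, repo ID and a valid access token
resp = requests.patch(
    "https://<workspace-url>/api/2.0/repos/<repo-id>",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"branch": "main"},   # pull the Repo to the latest commit of this branch
)
resp.raise_for_status()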

New: Files in Repos

We are also excited to announce new functionality in Repos that allows you to work with non-notebook files, such as Python source files, library files, config files, environment specification files and small data files in Databricks. This feature, called Files in Repos, helps with easy code reuse and automation of environment management and deployments. Users can import (or clone), read, and edit these files within a Databricks Repo just like in any local filesystem. It is now available in a public preview.

Fig 1: Now work with any kind of file in Databricks Repos. Files can be added to Databricks Repos via Git operations or uploaded manually

Files in Repos provides a simplified and standards-compliant development experience. Let’s take a look at how this helps with some of the common development workflows:

Benefits of Files in Repos include:

Easier code reuse

Fig 2: Importing Python modules in a Repo

Python and R modules can be placed in a Repo, and notebooks in that Repo can reference their functions with the ‘import’ statement. You no longer have to create new notebooks for each Python function you reference, or package your module (as a .whl for Python) and install it as a cluster library. Files in Repos helps you replace all of these steps (and more) with a single line of code.
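As a minimal sketch, assume a hypothetical helper module utils/cleaning.py checked into the same Repo (Repos makes the Repo's directories importable for notebooks in that Repo); importing and using it could look like this:

# Hypothetical module utils/cleaning.py in the Repo defines drop_null_rows(df)
from utils.cleaning import drop_null_rows

df = spark.read.table("samples.trips")   # illustrative table name
clean_df = drop_null_rows(df)
display(clean_df)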

Automate environment management and production deployments

  • Store your environment configuration with your code: You can store environment configuration files such as requirements.txt in a Repo and then run the command %pip install -r requirements.txt to install the required library dependencies (see the sketch after this list). This reduces the burden of managing the environment manually and eliminates errors and divergence.
  • Automate deployments: You can store the configuration of Databricks resources such as jobs, clusters, etc. in a Repo and then automate the deployment of these resources, allowing you to tightly control your production environment.
  • Version any config file: In addition to the environment specifications and resource configs, your config files could contain algorithm parameters, data inputs for business logic, etc. With Repos, you can be sure to always use the correct version of the file, say from the ‘main’ branch or a particular tag, to eliminate errors.
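The snippet below is a minimal sketch of this pattern: it reads a hypothetical config file versioned in the Repo and uses it to drive the job (dependencies would be installed first in a notebook cell with %pip install -r requirements.txt); the file name and keys are illustrative assumptions.

import json

# Hypothetical config file versioned alongside the code in the Repo
with open("config/job_params.json") as f:
    params = json.load(f)

# Use a versioned parameter to drive the job's business logic
df = spark.read.format("delta").load(params["input_path"])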
Fig 3: Repos gives you the ability to version any kind of file

In summary, with Databricks, data teams no longer need to build ad hoc processes to version control and productionize their code. Databricks Repos enables data teams to automate Git operations, allowing tighter integration with a company’s established CI/CD pipelines. The new Files feature in Repos enables importing libraries for code portability, versioning environment specification files and working with small data files.

Get started

Repos is now generally available. To get started, click on the ‘Repos’ button in your sidebar or use the Repos API.

The Files in Repos feature is in Public Preview and can be enabled for Databricks workspaces. To enable it, go to Admin Panel -> Advanced and click the “Enable” button next to “Files in Repos.” Learn more in our developer documentation.

To discover how Databricks simplifies development for data teams by enabling automation at each step of the ML lifecycle, check out this on-demand webinar with Databricks architect Rafi Kuralisnik.

--

Try Databricks for free. Get started today.

The post Databricks Repos Is Now Generally Available – New ‘Files’ Feature in Public Preview appeared first on Databricks.


5 Steps to Get Started With Databricks on Google Cloud


Since we launched Databricks on Google Cloud earlier this year, we’ve been thrilled to see stories about the value this joint solution has brought to data teams across the globe. One of our favorite quotes is from Douglas Mettenburg, Vice President Analytics at J. B. Hunt: “Ultimately, Databricks on Google Cloud is now the source of truth for J.B. Hunt. It’s showing the real value of the data we bring to the entire company, as we create more AI solutions that greatly impact our business.”

As Douglas describes it, Databricks on Google Cloud is designed to store all data on a simple, open lakehouse platform that unifies all analytics and AI workloads. It boosts data-driven decision making within organizations by enabling better collaboration across data engineering, data science and analytics teams with a cloud-based lakehouse architecture. And to make it even easier to access, the solution is available within the Google Cloud console along with the rest of its infrastructure.

Taking the first steps with Databricks on Google Cloud is easy, just follow the onboarding guide below that outlines the step-by-step instructions. You can also see these steps in action in the demo video.

1. Subscribe to Databricks from GCP Marketplace

Start by logging into the Google Cloud Platform. If you are a new user, you need to create an account before you subscribe to Databricks. Once in the console, start by selecting an existing Google Cloud project, or create a new project, and confirm your Google Cloud Identity organization object defined within your Google Cloud Console. This step requires permissions from your billing administrator to set up a Google billing account or select an existing account that you may use for Databricks. This can be done using Billing in the left navigation bar in the GCP console.

Find Databricks under Partner Solutions in the GCP console or simply search in the Marketplace. You are now ready to subscribe.

Databricks listing in the Google Cloud Marketplace

Once you confirm the terms, you can sign in using the familiar blue Google SSO. A tight integration with Google IAM allows you to simply authenticate Databricks workspace users with your Google Cloud Identity account via Google’s OAuth 2.0 implementation. This means Databricks does not have access to your login info, eliminating the risk associated with storing or protecting your credentials in Databricks.

2. Prerequisites for Databricks setup in GCP

You are almost ready to create your first Databricks workspace, but first review the prerequisites below.

Ensure adequate resource quotas

You will need to allocate the minimum quotas for the target Google Cloud regions where your Databricks clusters will run. We recommend you verify the entire list of quotas in the user documentation in case your project’s quotas are lower than the GCP defaults.


Size your network

Next, configure the GKE subnets used by your Databricks workspace. You only get to do this once, before creating the first workspace, and it is important because your workspace needs sufficient IP space to successfully run Databricks jobs. For convenience, Databricks provides a calculator that helps you determine if the default IP ranges for your subnets meet your needs.

Review session length constraints

If your IT administrator has set a global constraint on the session length for logged in users, Databricks will not be able to function correctly. In that case, please ask your administrator to add Databricks to the Trusted Apps list in the Google Workspace. See more details here.

3. Create your first workspace

Now you are ready to create the Databricks Workspace. Once you have configured the prerequisites, create your first workspace on the Databricks account console with a name, region, and Google Cloud Project ID.


4. Add users to your workspace

Your Databricks admin can manage user accounts in the admin console. As admins, they can:

  • invite new users or delete existing ones.
  • assign other users as admins or grant them cluster creation permission.

Create groups for role-based access control (RBAC) so that different user groups have different permissions. Again, the native IAM integration makes user authentication very simple.

5. Run your first Databricks job

Now the fun begins! Create a new cluster in your new Databricks workspace so that you have your compute engine instance to run your queries and jobs. When you create a new cluster for the first time, Databricks bootstraps a GKE cluster, which can take up to 20 minutes. Subsequent Databricks clusters will only take a few minutes.

When you create a new cluster in your Databricks workspace, you’ll gain access to a built-in compute engine instance to run your queries and jobs.

Let’s explore a quickstart tutorial notebook to see this all in action. A notebook is a collection of cells that run computations on a Databricks cluster. Once you attach a notebook to a cluster, you can start running your queries in any of the supported languages like Python, SQL, R and Scala and switch between them in the same notebook.

Here, we are creating a table using data from a sample CSV data file available in Databricks datasets, a collection of datasets mounted to Databricks File System (DBFS), a distributed file system installed on Databricks clusters.

Write the CSV data to Delta Lake format and create a Delta table. Delta Lake is an open table format that brings reliability, security and performance to your data lake. The Delta Lake format consists of Parquet files plus a transaction log, and we use Delta Lake to get the best performance on future operations on the table.

Next, read the CSV data into a DataFrame and write it out in Delta Lake format. This command uses a Python language magic command, which allows you to interweave commands in languages other than the notebook’s default language (SQL).

Now you are ready to create a Delta table at the stored location and run a SQL statement to query the table for the average diamond price by color. You can click the bar chart icon to display a chart of the average diamond price by color.
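The snippet below is a minimal sketch of that flow in a Python notebook cell; the sample CSV path, table name and storage location are the commonly used quickstart values and may differ in your workspace.

# Read the sample diamonds CSV from Databricks datasets into a DataFrame
diamonds_df = (spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"))

# Write the data out in Delta Lake format and register a Delta table on top of it
diamonds_df.write.format("delta").mode("overwrite").save("/tmp/delta/diamonds")
spark.sql("CREATE TABLE IF NOT EXISTS diamonds USING DELTA LOCATION '/tmp/delta/diamonds'")

# Query the average diamond price by color; click the bar chart icon to visualize the result
display(spark.sql("SELECT color, avg(price) AS avg_price FROM diamonds GROUP BY color ORDER BY color"))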

That’s it! This is how you set up your Databricks on Google Cloud account and get started as a user by creating a workspace, cluster and notebook, then running SQL commands and displaying results.

Have questions?

Register for a live, instructor-led hands-on workshop to get answers to your questions and learn how to get started with Databricks on Google Cloud. There are multiple dates to choose from – sign up today!

--

Try Databricks for free. Get started today.

The post 5 Steps to Get Started With Databricks on Google Cloud appeared first on Databricks.

Efficient Point in Polygon Joins via PySpark and BNG Geospatial Indexing

This is a collaborative post by Ordnance Survey, Microsoft and Databricks. We thank Charis Doidge, Senior Data Engineer, and Steve Kingston, Senior Data Scientist, Ordnance Survey, and Linda Sheard, Cloud Solution Architect for Advanced Analytics and AI at Microsoft, for their contributions.
 
This blog presents a collaboration between Ordnance Survey (OS), Databricks and Microsoft that explores spatial partitioning using the British National Grid (BNG).

OS is responsible for the design and development of a new National Geographic Database (NGD) data delivery for Great Britain (GB) under the Public Sector Geospatial Agreement.

OS has been working closely with Databricks and Microsoft on the architectural strategy and data engineering capabilities that underpin the NGD as part of a Core Data Services Platform. This platform enables OS to migrate geospatial data processing that has traditionally been carried out on on-prem machines in single-threaded processes and applications, such as FME, to cloud compute that is available and scalable on-demand — thus achieving the processing and analysis of geospatial data at scale. OS is using Azure Databricks to add Apache Spark™ capability to the cloud platform, and this brings the opportunity to re-think how to optimize both the data and the approach to perform geospatial joins at scale using parallelized processing.

Indexing spatial data appropriately is one aspect of such optimization work, and it doesn’t just stop at selecting an index. The focus of this blog is on how we designed a process that makes maximal use of the index to allow the optimizers provided by Azure Databricks to tune the way that data is loaded from disk during scaled geospatial joins.

There are various grid indexes such as BNG, Geohash, Uber’s H3, and Google’s S2 that divide the spatial world into bins with identifiers. While some of these have been developed specifically in the context of modern geoanalytics, and therefore tend to be well supported with associated libraries and practical examples of use in that context, the British National Grid indexing system was defined in 1936 and is deeply embedded in the Great Britain geospatial data ecosystem, but not yet exploited and made accessible for geoanalytics at scale. Our secondary motivation here, therefore, was to show that it can be used directly for optimizing spatial joins, avoiding the need to convert Great Britain’s geospatial datasets to other indexing systems first. Our team implemented a mosaic technique that decomposed polygons into simplified geometries bounded by their presence in a given BNG index. By effectively limiting index space comparisons and spatial predicate evaluations, the approach yielded notable query performance gains.

The point-in-polygon: how hard can it be?

How hard is it to determine whether a point is inside a polygon (PIP)? The question of how to determine whether a point is contained within a polygon has already been answered years ago. This fact can introduce bias, making us jump to conclusions like, “it is easy; it has already been solved.” However, with the advancement of technology and the introduction of parallel systems, we have found ourselves asking this same question but in a new context. That context is using a PIP as a join relation over big (geospatial) data. The new problem is ensuring that we have high levels of parallelism in our approach. Unfortunately, the old answers no longer apply in this new context.

We can think of the join relationship as a pairing problem. We can observe it as having two datasets that contain rows that match with a set of rows from the other dataset while satisfying the join condition. The complexity of join relation is O(n*m) or what is commonly known as the Cartesian Product (complexity). This is the worst-case complexity for a join relation and, in simple terms, means that we need to compare each record from one dataset with each record of the other dataset to resolve all matches. Many systems implement techniques and heuristics to push this complexity to a lower level. However, this is the baseline, and we will start our considerations from this baseline.

In the context of OS’s geospatial data processing, one of the most common PIP joins routinely undertaken is between all address point geometries (approx. 37 million) and all large-scale building polygon geometries (approx. 46 million) in GB.

Point-in-polygon between addresses and buildings

Diagram A

The (not so) hidden cost?

While discussing join relation complexity, we have made an oversight. The traditional complexity assumes a fixed cost for each pair resolution, that is, the cost of arriving at a conclusion of match or no match for each pair of records during the join operation, which we will call O(join). The true cost of the join is O(n*m)*O(join). In the traditional equivalence relationship class, where we are just looking whether a join key on the left matches a join key on the right, we assume O(join) is O(1) or to put it simply, the cost of comparison is one arithmetic operation, and it is constant. This is not always the case; for example, joining on a string comparison is more expensive than an equivalence between two integers.

But what of PIP, how costly is it relatively? The most widely used algorithm to answer PIP is the ray-tracing method. The complexity of this algorithm is O(v), where v is the number of vertices of the polygon in question. The algorithm is applicable to both convex and non-convex shapes, and it maintains the same complexity in both cases.
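For intuition, here is a minimal, illustrative sketch of the ray-casting idea (not code from the benchmarks that follow): a horizontal ray from the point crosses the polygon boundary an odd number of times if and only if the point is inside, and the loop visits every vertex once, giving the O(v) cost.

def point_in_polygon(x, y, vertices):
  """vertices: list of (x, y) tuples describing the polygon ring."""
  inside = False
  n = len(vertices)
  for i in range(n):
    x1, y1 = vertices[i]
    x2, y2 = vertices[(i + 1) % n]
    #does this edge straddle the horizontal line through the point?
    if (y1 > y) != (y2 > y):
      #x coordinate where the edge crosses that horizontal line
      x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
      if x < x_cross:
        inside = not inside
  return inside

#point_in_polygon(0.5, 0.5, [(0, 0), (1, 0), (1, 1), (0, 1)]) returns True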

Adding the comparison cost to our cost model brings our total complexity to cubic form. If we replace O(join) with O(v), where v is the average number of vertices, we have the total complexity of O(n*m)*O(v). And this is both expensive and time-consuming!

Work smarter, not harder!

We can do better than O(n*m)*O(v). We can use Spark to help us beat the Cartesian Product complexity. Spark leverages hash joins under the hood. Depending on the join predicate, Spark can execute one of the following join strategies: broadcast hash join, shuffle hash join, sort merge join, Cartesian product join, or broadcast nested loop join.

Amazing! We can just use Spark, and we will avoid the most costly outcome, can’t we? No! Unfortunately, Spark will default to Cartesian join for PIP joins. Why? Where PIP differs from traditional equi-based joins is that it is based on a general relation. These joins are commonly known as a Theta Join class. These are usually much harder to execute and require the end-user to help the system. While we are starting from a disadvantageous position, we can still achieve the desired performance.

Spatial indices (PIP as a pseudo-equivalence)

Is there a way to make PIP an equivalence relationship? Strictly speaking, no; however, in practice, we can make PIP approach the efficiency of an equivalence relation if we employ spatial indexing techniques.

Spatial indices help us index coordinate space in an efficient way by logically grouping geometries that are close to one another in said space. We achieve this by uniquely associating a point in the coordinate system to an index ID. These systems allow us to represent reference space at different levels of detail, or simply, a different resolution. In addition, geospatial index systems are hierarchical systems; this means that there is a well-defined parent-child relationship between indices on different levels of representation.

How does this help us? If we assign to each geometry an index to which it belongs, we can use index ID to index ID equivalence as an equivalence relation proxy. We will perform PIP (or any other geometry-based relation) only on geometries that belong to the same indices.

It is important to note that while POINT geometries belong to one and only one index, all other geometry types, including LINESTRINGs and POLYGONs, may span over a set of indices. This implies that the cost of resolving a PIP relation via index space is O(k)*O(v), where k is the number of indices used to represent the geometry and v is the number of vertices of such geometry. This indicates that we are increasing the price of each comparison by exploding records of complex geometries into multiple index records carrying the same geometry.

Why is this a wise choice? While we are increasing the price of comparing a single pair of geometries, we are avoiding a full Cartesian Product, our archnemesis in large-scale geospatial joins. As we will show in more detail later, index ID to index ID join will allow us to skip large amounts of unnecessary comparisons.

Lastly, data sources that contain complex geometries do not evolve as fast as point-wise data sources do. Complex geometries usually represent regions, areas of interest, buildings, etc., and these concepts have a fairly stable timeline: such objects change rarely, and those that do change are relatively few. This means that while we do spend extra time to preprocess complex geometries, for the majority of them this preprocessing is a one-off event. This approach is still applicable even for frequently updated data; the amount of data we can skip when joining via the index ID to index ID relationship outweighs the increased number of rows used to represent a single geometry.

The BNG Index System

The BNG is a local coordinate reference system (CRS) (EPSG:27700) established in 1936 and designed for national mapping that covers Great Britain. Unlike global CRSs, BNG has been fitted and shaped to the landmass of Great Britain, projecting coordinates onto a flat, regular square grid with an origin (0, 0) to the southwest of the Isles of Scilly.

Within the grid bounds, geographic grid references (or indices) are used to identify grid squares at different resolutions expressed in meters which can be translated from and to BNG easting (x) and northing (y) coordinates. Given the location of the grid origin, easting and northing values are always positive. BNG serves as the primary reference system for all OS location data captured under their national mapping public task and, therefore, has been widely adopted by public and private users of OS data operating within Great Britain.

Each grid square can be represented as a polygon geometry where the length of each side is equal to the resolution of the grid reference. This makes BNG a much easier starting point for geospatial data partitioning strategies. We are starting with a square as a building block, and it will make a lot of the starting considerations simple while not losing on the generalization of the approach.

By convention, BNG grid references are expressed as strings, using the letters and coordinates of the southwest corner of a given grid square quoted to a particular resolution. The first two characters of any reference are letters (prefixes) (e.g., TQ) identifying one of the 91 grid squares measuring 100,000m (100km) across. Only 55 of the 91 100km grid squares cover some landmass within Great Britain. The remainder of these squares falls into British waters.

British National Grid at 100km resolution

Diagram B

References identifying more granular grid resolutions below 100km have additional x and y integer values appended after the two letters, locating a child grid square within the parent grid square hierarchy. Child squares are numbered from 0 to 9 from the lower-left (southwest) corner, in an easterly (x) and northerly (y) direction.
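To illustrate the convention, the sketch below formats a reference for a point, assuming the 100km letter prefix and its southwest-corner coordinates are already known (the real prefix lookup is part of the BNG helper functions referenced later in this post, which are not shown here); for example, the 100km square TQ has its southwest corner at easting 500,000 and northing 100,000.

import math

#illustrative helper: build a BNG reference string within a known 100km square
def format_bng_reference(easting, northing, prefix, prefix_origin, resolution):
  digits = 5 - int(math.log10(resolution))   #1 digit per axis at 10km, 2 at 1km, ...
  dx = int(easting - prefix_origin[0]) // resolution
  dy = int(northing - prefix_origin[1]) // resolution
  return f"{prefix}{dx:0{digits}d}{dy:0{digits}d}"

#a point in central London (easting 530000, northing 180000):
#format_bng_reference(530000, 180000, "TQ", (500000, 100000), 10000) returns "TQ38"
#format_bng_reference(530000, 180000, "TQ", (500000, 100000), 1000) returns "TQ3080"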

British National Grid at 10 km and 1km resolutions

Diagram C

British National Grid at 10 km and 1km resolutions

Diagram D

Why BNG?

Whilst there are alternative global index systems that we could have adopted for this work, we chose to use BNG because:

  • The BNG system is native to OS’s geospatial data collection, with almost all OS data referenced against the BNG CRS (EPSG:27700). This includes OS aerial imagery tiles and other raster datasets, such as Digital Terrain Models (DTMs) and Digital Surface Models (DSMs).
  • The use of BNG enables the efficient retrieval and colocation of vector and raster data for analysis, including the clipping or masking of raster data for deriving training patches for deep learning applications, as an example.
  • Using BNG avoids the costly transformation to the World Geodetic System 1984 (WGS-84) (EPSG:4326) or European Terrestrial Reference System 1989 (ETRS89) (EPSG:4258) CRSs via the OSTN15 transformation grid. Different CRSs realize their model of the Earth using different parameters, and a global system (e.g., WGS84) will show an offset when compared to a local system (e.g., BNG). The true cost of this conversion is reflected in the fact that OS published OSTN15, a 15MB corrections file containing approx. 1.75 million parameters to transform accurately between satellite-derived coordinates and BNG coordinates.

Due to the GB-local nature of the problems OS is trying to solve, BNG is a natural choice. In a more global context, we would switch our focus to H3 or S2 as more suitable global alternatives.

BNG as a Spatial Partitioning Strategy

A spatial partitioning strategy defines an approach to segmenting geospatial data into non-overlapping regions. BNG grid squares at different resolutions provide the non-overlapping regions across Great Britain in this context. By retrieving the BNG indices that cover our geometries, we can use the index attribute as a join key to collocate rows and then only test a spatial predicate within those collocated rows (e.g., does geometry A intersect geometry B, or does geometry A contain geometry B).

This is very important! Splitting the original data into geospatially collocated portions of data makes our problem “embarrassingly parallel,” and, therefore, very suitable for Spark/PySpark. We can send different chunks of data to different machines and only compare local portions of the data that are likely to join one to another. There is little point in checking if a building in London contains an address in Manchester. Geospatial indices are our way to convey this intuition to the machine.

The baseline

We used Python and PySpark to bring our solution to life. OS provided the logic for converting a pair of coordinates provided as eastings and northings to a unique BNG index ID. Lastly, to ensure an unbiased output, we used a randomized dataset of points and a randomized dataset of polygons: 10 million points were scattered all over the territory of GB, and 1 million polygons were scattered in the same manner. To generate the polygonal data, we loaded a GeoJSON set into a Spark DataFrame and used a random function in conjunction with a generator function (explode) to produce an unbiased dataset. Due to the randomness introduced in the data, one should expect the relationship between points and polygons to be many-to-many.
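For the point side, a hedged sketch of how such a randomized dataset could be generated is shown below; the easting and northing extents used for Great Britain are approximate assumptions.

from pyspark.sql import functions as F

#generate roughly 10 million random points over the approximate BNG extent of GB
points = (
  spark.range(10000)                                                #seed rows
    .withColumn("i", F.explode(F.sequence(F.lit(1), F.lit(1000))))  #1,000 points per seed row
    .withColumn("eastings", F.rand() * 650000)                      #approximate easting extent
    .withColumn("northings", F.rand() * 1210000)                    #approximate northing extent
    .select("eastings", "northings")
)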

The baseline algorithm we used for our considerations is the naive join, which results in an unoptimized theta join. This approach will, at execution time, be evaluated as a Broadcast Nested Loop Join.

Point to Polygon naive join

Diagram E

The broadcast nested loop join runs very slowly, because it is evaluated similarly to a Cartesian join. Each point-polygon pair is evaluated against the PIP relation before the join is resolved. The outcome is that we require one billion comparisons for 100 thousand points to be joined to 10 thousand polygons. Note that neither of these datasets is large enough to be called big data.
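For reference, a minimal sketch of this naive approach is shown below, assuming points and polygons are the randomized DataFrames described above with eastings, northings and wkt_polygon columns; the only join predicate is the PIP check itself, which Spark evaluates as a broadcast nested loop / Cartesian-style join.

from pyspark.sql.functions import udf

@udf("boolean")
def naive_pip(poly_wkt, point_x, point_y):
  from shapely import wkt, geometry
  return wkt.loads(poly_wkt).contains(geometry.Point(point_x, point_y))

naive_join = points.join(
  polygons,
  on=naive_pip(polygons["wkt_polygon"], points["eastings"], points["northings"]),
  how="inner"
)
naive_join.count()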

MLflow output for naive join benchmarks - runtime in seconds

Diagram F

We used MLflow to conduct a series of naive joins to evaluate the baseline performance we are trying to outperform. For the naive approach, the largest join we were able to successfully execute was 10 thousand points to 100 thousand polygons. Any further increase in data volume resulted in our Spark jobs failing without producing the desired outputs. These failures were caused by the unoptimized nature of the workloads we were trying to run.

Let’s frame our problem 

What if we represented all of our geometries, no matter their shape, with a corresponding BNG-aligned bounding box? A bounding box is a rectangular polygon that can fit the entirety of the original geometry within it. And what if we represented said bounding box as a set of BNG indices at a given resolution that together cover the same area?

Bounding box representation of a polygon via British National Grid (BNG)

Diagram G

Now we can execute our joins via a more optimized theta join. We will only check whether a point is inside the polygon via PIP relation if a point falls into one of the BNG indices that are used to represent the polygon. This reduces our join effort by multiple orders of magnitude.

In order to produce the said set of BNG indices, we used the following code; note that the bng_to_geom, bng_to_geom_grid, coords_to_bng and bng_get_resolution functions are not provided with this blog.

import shapely.wkt
from shapely.geometry import box

#auxiliary function to retrieve the first neighbour
#of a BNG index cell to the right
def next_horizontal(bng_index, resolution):
  x, y = bng_to_geom(bng_index)
  return coords_to_bng(x+resolution, y, resolution)

#auxiliary function to retrieve the first neighbour
#of a BNG index cell below
def next_vertical(bng_index, resolution):
  x, y = bng_to_geom(bng_index)
  return coords_to_bng(x, y-resolution, resolution)

#filling function that represents the input geometry as a set of indices
#corresponding to the area of the bounding box of said geometry
def bng_polyfil(polygon, resolution):
  (x1, y1, x2, y2) = polygon.bounds
  bounding_box = box(*polygon.bounds)
  #start from the top-left corner of the bounding box and sweep right and down
  start = coords_to_bng(x1, y2, resolution)
  queue = [start]
  result = set()
  visited = set()
  while queue:
    index = queue.pop()
    index_geom = shapely.wkt.loads(bng_to_geom_grid(index, "WKT"))
    if bounding_box.intersects(index_geom):
      result.add(index)
      n_h = next_horizontal(index, resolution)
      if n_h not in visited and n_h not in queue:
        queue.append(n_h)
      n_v = next_vertical(index, resolution)
      if n_v not in visited and n_v not in queue:
        queue.append(n_v)
    visited.add(index)

  return result

This code ensures that we can represent any shape in a lossless manner. We use the intersects relation between a BNG index candidate and the geometry's bounding box to avoid blind spots in the representation. Note that a more efficient implementation is possible by using the contains relation and a centroid point; that approach is only viable if false positives and false negatives are acceptable. We assume the existence of the bng_to_geom function that, given a BNG index ID, can produce a geometry representation; the bng_get_resolution function that, given a BNG index ID, determines the selected resolution; and the coords_to_bng function that, given the coordinates, returns a BNG index ID.

Polygon bounding box representation benchmark - runtime in seconds

Diagram H

We have run our polygon bounding box representation for different resolutions of the BNG index system and for different dataset sizes. Note that running this process was failing consistently for resolutions below 100. Resolutions are represented in meters in these outputs. The reason for consistent failures at resolutions below 100m can be found in over-representation; some polygons (due to random nature) are much larger than others, and while some polygons would be represented by a set of a dozen indices, other polygons can be represented by thousands of indices, and this can result in a big disparity in compute and memory requirements between partitions in a Spark job that is generating this data.

We have omitted the benchmarks for points dataset transformations since this is a relatively simple operation that does not yield any new rows; only a single column is added, and the different resolutions do not affect execution times.

With both sides of the join being represented with their corresponding BNG representations, all we have to do is to execute the adjusted join logic:

from pyspark.sql import functions as F
from pyspark.sql.functions import udf

@udf("boolean")
def pip_filter(poly_wkt, point_x, point_y):
  from shapely import wkt
  from shapely import geometry
  polygon = wkt.loads(poly_wkt)
  point = geometry.Point(point_x, point_y)
  return polygon.contains(point)

def run_bounding_box_join(polygons_path, points_path):
  polygons = spark.read.format("delta").load(polygons_path)
  polygons = polygons.select(
    F.col("id"),
    F.col("wkt_polygon"),
    F.explode(F.col("bng_set")).alias("bng")
  )
  points = spark.read.format("delta").load(points_path)

  return polygons.join(
    points,
    on=["bng"],
    how="inner"
  ).where(pip_filter("wkt_polygon", "eastings", "northings"))

#run an action on the joined dataset to evaluate join runtime
run_bounding_box_join(polygons_path, points_path).count()

These modifications in our code result in a different Spark execution plan. Spark is now able to first run a sort merge join based on the BNG index ID and vastly reduce the total number of comparisons. In addition, each pair comparison is a string-to-string comparison, which is much cheaper than evaluating a PIP relationship. This first stage generates all the join candidates. We then perform a PIP test on this set of candidates to resolve the final output. This approach ensures that we limit the number of times we have to run the PIP operation.

Point to Polygon bounding box join

Diagram I

From the execution plan, we can see that Spark is performing a very different set of operations in comparison to the naive approach. Most notably, Spark is now executing a Sort Merge Join instead of a Broadcast Nested Loop Join, which brings a lot of efficiency. We are now performing about 186 million PIP operations instead of a billion. This alone allows us to run much larger joins with better response times whilst avoiding the breaking failures we experienced with the naive approach.

Point to Polygon bounding box join benchmark - runtime in seconds

Diagram J

This simple yet effective optimization has enabled us to run a PIP join between 10 million points and 1 million polygons in about 2500 seconds. If we compare that to the baseline execution times, the largest join we were able to successfully execute was 10 thousand points to 100 thousand polygons, and even that join required about 1500 seconds on the same hardware.

Divide and conquer

Being able to run joins between datasets in the million rows domain is great; however, our largest benchmark join took almost 45 minutes (2500 seconds). And in the world where we want to run ad hoc analytics over large volumes of geospatial data, these execution times are simply too slow.

We need to further optimize our approach. The first candidate for optimization is our bounding box representation. If we are representing polygons via bounding boxes, we include too many false positive indices, i.e., indices that do not overlap at all with the original geometry.

BNG indices that do not overlap at all with the original geometry

Diagram K

The way to optimize that portion of the code is simply to use an intersects call against the original geometry (rather than its bounding box) in our polyfill method.

import shapely.wkt

#retrieve all eight neighbouring BNG index cells of the given cell
def k_ring(bng_index):
  x, y = bng_to_geom(bng_index)
  increment = bng_get_resolution(bng_index)
  neighbours = [
    [x-increment, y+increment], [x, y+increment], [x+increment, y+increment],
    [x-increment, y], [x+increment, y],
    [x-increment, y-increment], [x, y-increment], [x+increment, y-increment]
  ]
  neighbours = [coords_to_bng(i[0], i[1], increment) for i in neighbours]
  return neighbours

#filling function that represents the input geometry as a set of indices
#that intersect the original geometry (rather than its bounding box);
#get_starting_point is assumed to return a BNG index cell intersecting the polygon
def bng_polyfil(polygon, resolution):
  start = get_starting_point(polygon, resolution)
  queue = k_ring(start)
  result = set()
  visited = set()
  while queue:
    index = queue.pop()
    if polygon.intersects(shapely.wkt.loads(bng_to_geom_grid(index, "WKT"))):
      result.add(index)
      for n in k_ring(index):
        if n not in visited and n not in queue:
          queue.append(n)
    visited.add(index)

  return result

This optimization, while increasing the preprocessing cost by utilizing the intersects call, results in smaller index sets and makes our joins run faster due to the smaller join surface.

BNG indices that intersect with the original geometry

Diagram L

The second optimization we can employ is splitting the representation into two sets of indices. Not all indices are equal in our representation. Indices that touch the border of the polygon require PIP filtering after an index-to-index join. Indices that do not touch the border and belong to the representation of the polygon do not require any additional filtering. Any point that falls into such an index definitely belongs to the polygon, and in such cases, we can skip the PIP operation.

BNG indices that are contained with the original geometry

Diagram M

The third and final optimization we can implement is the mosaic approach. Instead of associating the complete original geometry with each index that belongs to the set of indices that touch the polygon border (border set), we can only keep track of the section of interest. If we intersect the geometry that represents the index in question and the polygon, we get the local representation of the polygon; only that portion of the original polygon is relevant over the area of the index in question. We refer to these pieces as polygon chips.

Representing polygon border chips

Diagram N

Polygon chips serve two purposes from the optimization perspective. Firstly, they vastly improve the efficiency of the PIP filter that runs after the index-to-index join is executed. This is due to the fact that the ray tracing algorithm runs in O(v) complexity, and individual chips on average have an order of magnitude fewer vertices than the original geometry. Secondly, the representation of chips is much smaller than that of the original geometry; as a result, we shuffle much less data in the shuffle stage of our sort merge join.

Putting all of these together yields the following code:

import shapely.wkt

#add unvisited neighbours of the given index to the processing queue
def add_children(queue, visited, index):
  for n in k_ring(index):
    if n not in visited and n not in queue:
      queue.append(n)
  return queue

#filling function that represents the input geometry as a set of
#(index, is_dirty, chip) tuples; core indices carry an empty chip
def bng_polyfil(polygon, resolution):
  start = get_starting_point(polygon, resolution)
  queue = k_ring(start)
  result = set()
  visited = set()
  while queue:
    index = queue.pop()
    index_geom = shapely.wkt.loads(bng_to_geom_grid(index, "WKT"))
    intersection = polygon.intersection(index_geom)
    if intersection.equals(index_geom):
      #the index cell is fully contained in the polygon: a "core" index, no chip needed
      result.add((index, False, "POLYGON EMPTY"))
      queue = add_children(queue, visited, index)
    elif not intersection.is_empty:
      #the index cell straddles the polygon border: a "dirty" index carrying its chip
      result.add((index, True, intersection.wkt))
      queue = add_children(queue, visited, index)
    visited.add(index)

  return result

This code is very similar to the original bounding box method; we have only made a few minor changes to make sure we are not duplicating portions of the code; hence, we have isolated the add_children helper method.

Mosaic Polygon representation benchmark - runtime in seconds

Diagram O

We performed the same data generation benchmarking as we did for our bounding box polygon representation. One thing in common with the original approach is that resolutions below 100m caused over-representation of the polygons. In this case, however, we were able to generate data for up to 100 thousand polygons at a resolution of 10m, granted the runtime of such a data generation process was too slow to be considered for production workloads.

At the resolution of 100m, we got some very promising results; it took about 600 seconds to generate and write out the dataset of 1 million polygons. For reference, it took about 300 seconds to do the same for the bounding box approach. The bounding box was a simpler procedure, and we are adding some processing time in the data preparation stage. Can we justify this investment?

Mosaics are pretty (fast!)

We have run the same benchmark for PIP joins using our mosaic data. We have adapted our join logic slightly in order to make sure our border set and core set of indices are both utilized correctly and in the most efficient way.

from pyspark.sql import functions as F

#reuses the pip_filter UDF defined in the bounding box section above
def run_polyfill_chipping_join(polygons_path, points_path):
  polygons = spark.read.format("delta").load(polygons_path)
  polygons = polygons.select(
    F.col("id"),
    F.explode(F.col("bng_set")).alias("bng")
  ).select(
    F.col("id"),
    F.col("bng.*")
  )
  points = spark.read.format("delta").load(points_path)

  return polygons.join(
    points,
    on=["bng"],
    how="inner"
  ).where(
    ~F.col("is_dirty") |
    pip_filter("wkt_chip", "eastings", "northings")
  )

#run an action to execute the join
run_polyfill_chipping_join(polygons_path, points_path).count()

The is_dirty column is introduced by our polyfill method. Any index that touches the border of the original geometry will be marked as dirty (i.e., is_dirty=True). These indices require post-filtering in order to correctly determine whether any point that falls into said index is contained within the comparing geometry. It is crucial that the is_dirty filtering happens before the pip_filter call, because the logical operators in Spark have a short-circuiting capability; if the first part of the logical expression is true, the second part won't execute.

Point to Polygon Mosaic join

Diagram P

This code yields a much more efficient execution plan in Spark. Due to better representation in the index space, our join surfaces are much smaller. In addition, our post-filters benefit from the two-set representation and the mosaic splitting of the geometries.

Point to Polygon Mosaic join benchmark - runtime in seconds

Diagram Q

We can finally quantify our efforts. A PIP type join between 10 million points and 1 million polygons via our new mosaic approach has been executed in 37 seconds. To bring this into context, the bounding box equivalent join at the same index resolution was executed in 2549 seconds. This results in a 69X improvement in run time.

This improvement purely focuses on the serving run time. If we include the preparation times, which were 600 seconds for the mosaic approach and 317 seconds for the bounding box approach, we have the total adjusted performance improvement of 4.5X.

The total potential of these improvements largely depends on how often you are updating your geometrical data versus how often you query it.

A general approach

In this post, we have focused on Point in Polygon (PIP) joins using the British National Grid (BNG) as the reference index system. However, the approach is more general than that. The same optimizations can be adapted to any hierarchical geospatial index system; the only differences are the chip shapes and the available resolutions. Furthermore, the same optimizations can help you scale up theta joins between two sets of complex geometries, such as large-volume polygon intersection joins.

Our focus remained on a PySpark-first approach, and we consciously avoided introducing any third-party frameworks. We believe this ensures a low barrier to adopting our solution, which is tailored primarily to Python users.

The solution has proved that, with a few creative optimizations, we can achieve up to a 70x performance improvement over the bounding box approach with a minimal increase in the preprocessing investment.

We have brought large-scale PIP joins into the execution time domain of seconds, and we have unlocked the ad-hoc analytical capabilities against such data.

 

--

Try Databricks for free. Get started today.

The post Efficient Point in Polygon Joins via PySpark and BNG Geospatial Indexing appeared first on Databricks.

Native Support of Session Window in Spark Structured Streaming


Apache Spark™ Structured Streaming allows users to do aggregations on windows over event-time. Before Apache Spark 3.2™, Spark supported tumbling windows and sliding windows. In the upcoming Apache Spark 3.2, we add “session windows” as a new supported window type, which works for both streaming and batch queries.

What is a “session window”?

 
Visualized examples of time windows in Apache Spark 3.2

Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. An input can only be bound to a single window.

Sliding windows are similar to the tumbling windows from the point of being “fixed-sized”, but windows can overlap if the duration of the slide is smaller than the duration of the window, and in this case, an input can be bound to the multiple windows.

Session windows have a different characteristic compared to the previous two types. A session window has a dynamic window length, depending on the inputs. A session window starts with an input and expands itself if the following input is received within the gap duration. A session window closes when no input is received within the gap duration after the latest input. This enables you to group events until there are no new events for a specified time duration (inactivity).

It works similarly to a session on a website with a session timeout: if you log into a website and don’t show any activity for some duration, the website will prompt you to confirm that you want to stay logged in and will log you out if you remain inactive past the timeout. The session timeout is extended whenever you show activity.

Applying this to the session window: a new session window is initiated when a new event arrives, and subsequent events received within the timeout are included in the same session window. Each event extends the session timeout, which introduces a different characteristic compared to the other time windows — the time duration of the session window is not static, whereas both tumbling and sliding windows have a static time duration.

How to implement a query using a session window?

Previously, Spark required you to leverage flatMapGroupsWithState to deal with session windows. You were required to craft your own logic to define the session window and how to aggregate the inputs in the same session. This brought with it several downsides:

  1. You can’t leverage built-in aggregate functions like count, sum, etc., and have to implement them yourself.
  2. It is non-trivial to craft the logic considering various output modes and the lateness of the input.
  3. flatMapGroupsWithState is not available in PySpark; hence, you’re required to craft your queries via Java/Scala.

Now, Spark provides the same user experience as using time windows. The sentence remains true: “In Structured Streaming, expressing such windows on event-time is simply performing a special grouping”. For tumbling and sliding windows, the `window` function is provided. For session windows, a new function, `session_window`, is introduced.

For example, counts over 5-minute tumbling (non-overlapping) windows, and over 10-minute windows sliding every 5 minutes, on the eventTime column can be described as follows.

# tumbling window
windowedCountsDF = \
  eventsDF \
    .withWatermark("eventTime", "10 minutes") \
    .groupBy("deviceId", window("eventTime", "5 minutes")) \
    .count()

# sliding window
windowedCountsDF = \
  eventsDF \
    .withWatermark("eventTime", "10 minutes") \
    .groupBy("deviceId", window("eventTime", "10 minutes", "5 minutes")) \
    .count()

You can simply replace the `window` function with `session_window` to count over session windows with a 5-minute gap on the eventTime column.

# session window
windowedCountsDF = \
  eventsDF \
    .withWatermark("eventTime", "10 minutes") \
    .groupBy("deviceId", session_window("eventTime", "5 minutes")) \
    .count()

Session window with dynamic gap duration

In addition to the session window, which has the same gap duration across sessions, there is another type of session window, which has a different gap duration per session. We call this “dynamic gap duration.”

Visualized examples of session windows with dynamic gap duration in Apache Spark 3.2.

 

The boxes below the time line denote each event with its gap duration. There are four events, and their (event time, gap duration) pairs are (12:04, 4 mins) in blue, (12:06, 9 mins) in orange, (12:09, 5 mins) in yellow, and (12:15, 5 mins) in green.

The box above the line denotes the actual session made from these events. You can consider each event as an individual session, and sessions that intersect are merged into one. As you may notice, the time range of the session is the union of the time ranges of all events included in the session. Note that the end time of the session is no longer the time + gap duration of the latest event in the session.

The new `session_window` function receives two parameters: the event time column and the gap duration.

For dynamic session windows, you can provide an “expression” to the “gap duration” parameter of the `session_window` function. The expression should resolve to an interval, like “5 minutes”. Since the “gap duration” parameter receives an expression, you can also leverage a UDF.

For example, counting over session windows with dynamic gap duration based on the eventType column can be described as follows.

# Define the session window having dynamic gap duration based on eventType
# (session_window and when come from pyspark.sql.functions)
session_window_expr = session_window(events.timestamp, \
    when(events.eventType == "type1", "5 seconds") \
    .when(events.eventType == "type2", "20 seconds") \
    .otherwise("5 minutes"))

# Group the data by session window and userId, and compute the count of each group
windowedCountsDF = events \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(events.userID, session_window_expr) \
    .count()

Native support of session window vs. FlatMapGroupsWithState

`flatMapGroupsWithState` provides more flexibility for implementing session windows, but it requires users to write a significant amount of code. For example, refer to the sessionization example in Apache Spark, which implements session windows via flatMapGroupsWithState. Note that this sessionization example is greatly simplified and only works with the processing time and append mode pair. The overall complexities of dealing with event time and various output modes are abstracted away with native support of session windows.

 

Spark sets a goal for native support of session windows to cover general use cases, as it enables Spark to optimize performance and state store usage. You may still want to leverage flatMapGroupsWithState when your business use case requires a complicated session window, for example, if the session should also be closed on a specific type of event regardless of inactivity.

Conclusion

We have covered the session window in streaming aggregation, which also works for batch queries. By learning how to use the new `session_window` function, you can leverage your knowledge of streaming data aggregation with time windows and handle session windows as well. You can leverage built-in aggregation functions, as well as your own UDAFs, on session window aggregation queries. This also enables SQL/PySpark users to deal with session windows, as the flatMapGroupsWithState API is not available in PySpark and cannot be represented as a SQL statement.
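
Since `session_window` is exposed as a built-in function, the same aggregation should also be expressible in SQL. Below is a minimal sketch, assuming a batch DataFrame of events with eventTime and deviceId columns registered under the (illustrative) view name "events":

# A sketch of the session window count expressed in SQL via PySpark;
# the "events" view name is an assumption for this example.
eventsDF.createOrReplaceTempView("events")

sessionCountsDF = spark.sql("""
  SELECT deviceId, session_window(eventTime, '5 minutes') AS session, count(*) AS cnt
  FROM events
  GROUP BY deviceId, session_window(eventTime, '5 minutes')
""")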

There is still more room to improve time-windowing operations, some of which require you to use the flatMapGroupsWithState API for now. We are planning to look into custom window operations in the near future.

Try out Apache Spark 3.2 in the Databricks Runtime 10.0

If you want to try out the upcoming Apache Spark 3.2 in the Databricks Runtime 10.0, sign up for Databricks Community Edition or Databricks Trial for free and get started in minutes. Using Spark 3.2 is as simple as selecting version “10.0” when launching a cluster.

--

Try Databricks for free. Get started today.

The post Native Support of Session Window in Spark Structured Streaming appeared first on Databricks.

Developing Databricks’ Runbot CI Solution


Runbot is a bespoke continuous integration (CI) solution developed specifically for Databricks’ needs. Originally developed in 2019, Runbot incrementally replaces our aging Jenkins infrastructure with something more performant, scalable, and user friendly for both users and maintainers of the service. This blog post will explore the motivations behind developing Runbot, the core design decisions that went into it, and how we used it to greatly improve the experience of all the developers within the Databricks engineering organization.

Background: Why Runbot?

Jenkins issues

Traditionally, Databricks has used Jenkins as the primary component of our CI pipeline that validates pull requests (PRs) before merging into master. Jenkins is a widely used and battle-tested piece of software, and has served Databricks well for many years. We had over time built up some ancillary services and infrastructure surrounding it:

Databricks ancillary services and infrastructure surrounding the legacy Jenkins continuous integration tool.

However, over time Jenkins’ weaknesses were becoming more apparent. Despite our best efforts at stabilizing and improving the user experience, we faced continual complaints about our Jenkins infrastructure in our internal surveys:

“Test infrastructure is frequently breaking.”

“Jenkins error message is not well presented.“

“The jenkins test results are very hard to explore”

“I would guess that weekly or biweekly I have to adjust my workflows or priorities to work around an issue with build infrastructure.”

Jenkins was becoming a constant thorn in the side of our development teams. Newbies continued to trip over the same known idiosyncrasies that had gone unfixed for years. Veteran developers continued to waste time in inefficient workflows. Our team, tending to the CI system itself, needed to spend time responding to outages, fielding questions due to the poor user experience, or managing chronic issues like flaky tests with inadequate tooling.

Why couldn’t we fix it?

Despite many attempts over the years to improve it, this situation had remained largely unchanged. Weeks to months of work trying to, e.g., better integrate our testing infrastructure with the Github Checks UI, or better manage resource usage on the Jenkins master, had proven unsuccessful. The whole experience remained clunky and unstable, and it was becoming clear that more effort expended in the same way would always have the same outcome. Working around Jenkins’ limitations with auxiliary services and integrations was no longer enough, and we needed to try something different if we wanted to provide a significant improvement to the experience for those around the company using Databricks CI.

The problems with our Jenkins infrastructure could largely be boiled down to two main issues:

  1. Our Jenkins infrastructure was complex and difficult to work on. It was a mix of microservices, open source software (OSS) and bespoke code, JVM and Python, Kubernetes and raw EC2. This made any improvement a slow process, with time spent wrangling services and ETL rather than on the user-facing features we wanted.
  2. Jenkins is operationally difficult to manage. The Jenkins server had long-running stability, usability and complexity issues that we had not managed to solve or mitigate well, and its fundamental architecture made it seem unlikely we would ever solve them to our satisfaction.

I will discuss each in turn.

Complexity preventing improvement

Over the years, we had implemented many common CI features as a collection of separate services specialized to Databricks’ specific needs:

  1. Our Autoscaling service delegates work to our instance pools, autoscaling the pools up and down to balance job latency and cost
  2. A bespoke JJB/Bazel configuration pipeline configures Jenkins via config-as-code, and we try to avoid configuring jobs “manually” through the web UI
  3. Our own Github Event Consumer service handles orchestration of which jobs to trigger based on details about the pull request: files changed, labels, author, owners, etc.
  4. A custom Github web hook, Kinesis queue, and Github Receiver service handles the integration with Github
  5. A Test Explorer service provides useful Web UIs to view and slice and dice test results, to investigate breakage or flakiness

While Jenkins itself provides many of these features already, we often found we had our own needs or specific requirements that Jenkins and its plugins did not satisfy. We would have loved to not have to build and maintain all this stuff, but our hand was forced by the needs of the team and organization.

This sprawling complexity made incremental improvements difficult, for example:

  1. New views to help manage our core jobs and keep our master branch green. e.g., a dashboard showing the history of multiple jobs, to help us distinguish between localized flakiness, single-job breakage, and system-wide breakage either in the infrastructure or in the code under test.
  2. Up-to-date test results in the Test Explorer service. This could not be done due to the asynchronous ETL needed to extract data from Jenkins. This meant that we had one UI for viewing test results that was ugly but current, and one historical view that was prettier but delayed; as a result, there was no single UI that a developer could browse to see test results with a good developer experience.
  3. Seeing which worker a particular job was running on. This is very useful: sometimes a worker (which we re-use for performance) gets into a bad state, causing scattered flaky failures, which are confusing until you notice they’re all on the same node. Other times you need to SSH into the worker to debug a particularly thorny issue that only arises on CI and need to know which one to go into.

In general, if you wanted to make a small change to the Databricks CI experience, there just wasn’t a place to “put things”. Jenkins’ own source code was external and non-trivial to modify for our needs, and none of the various services we had spun up were a good platform for implementing general-purpose CI-related features. Data, UIs, and business logic were scattered across the cluster of microservices and a web of tangled integrations between them.

Now, these issues weren’t insurmountable. For example, to get a multi-job-history dashboard, we ended up opening multiple browser windows on our office-wall dashboard, positioning them painstakingly side by side, and pointing each one at a different Jenkins job:
 
Before 
 
Databricks’ Runbot CI solution, one ended up opening multiple browsers on the office-wall dashboard, positioned them painstakingly side by side, and then opened each one to a different Jenkins job.

Left open for hours or days, Jenkins’ “Blue Ocean” UI would sometimes mysteriously stop updating, so we installed a Chrome plugin to auto-refresh the browsers at an interval. Even then, the “Blue Ocean” web UI would sometimes somehow bring down the whole OSX operating system (!) if left open for too long with too many tabs (n > 4), forcing us to power-cycle the Mac mini running the dashboard! It worked, but it couldn’t be said to have worked well.

Improving this experience with our current service architecture was very difficult.

Apart from our own menagerie of microservices, Jenkins itself is a sprawl of functionality, with multiple HTML UIs, multiple configuration subsystems, and endless plugins, all with their own bugs and interacting in unintuitive ways. While this is all expected from a project grown over a decade and a half, it definitely levies a tax on anyone trying to understand how it all fits together.

Consider the multi-browser Jenkins dashboard shown above. Let’s say we wanted to add a tooltip showing how long each job run was queued before being picked up by a worker, or to make the COMMIT column link back to the commit page on Github. Trivial to describe, but terrible to implement: do we fork Jenkins to patch its Blue Ocean UI? Write a Chrome plugin that we ask everyone to install? Write a separate web service that pulls Jenkins’ data through its API to render as HTML? All of these were bad options for someone just wanting to add a tooltip or a link!

Over the years we had made multiple attempts to improve the CI developer experience, with only marginal success. Essentially, attempts to add user-facing features to our CI system got bogged down in ETL data plumbing, microservice infrastructure deployment, and other details that dwarfed the actual business logic of the feature we wanted to implement.

Architecture and operational stability

Apart from making it difficult to make forward progress, the existing architecture was problematic for us trying to manage the service operationally. In particular, Jenkins was an operational headache, causing frequent outages and instability, due to some of its fundamental properties:

  1. A single stateful “master” process, coordinating one or more pools of worker nodes
  2. The master stores its state either in-memory, or on-disk as a folder tree full of XML files
  3. Each worker is constantly connected to the master via an SSH connection

This architecture has the following consequences:

  1. It is impossible to have more than one master node, or more than one master process, due to the in-memory state
  2. The master node/process is very CPU/memory/disk heavy, managing its in-memory state and on-disk XML datastores
  3. Any downtime in the single master causes all ongoing jobs to fail

These consequences caused us pain on a regular basis:

  1. Our master, in the best-case scenario, was taking 150+ GB of memory in order to work, and this number would occasionally spike high enough to bring the whole process grinding to a halt
  2. Every time the Jenkins master needed to be rebooted, all in-progress job runs failed, resulting in frequent inconveniences to our developers trying to test their code
  3. We couldn’t easily spin up replica Jenkins masters to share the load, and were approaching limits on the Amazon Web Services (AWS) instance size to vertically scale our single Jenkins master
  4. We could not upgrade Jenkins without causing downtime and inconveniencing our users

In experiments, we found Jenkins could manage about 100-200 worker instances before the stability of the master started deteriorating, independent of what those workers actually did. The failure modes were varied: thread explosions, heap explosions, ConnectionClosedExceptions etc. Attempts to isolate the issue via monitoring, profiling, heap dumps, etc. had largely been unsuccessful.

As engineering grew, we found the Jenkins master falling over once every few days, always causing ongoing test runs to fail and sometimes requiring a significant amount of manual effort to recover. These outages even occurred at times when the Jenkins load was minimal (e.g., on weekends). Databricks’ bespoke integrations also sometimes caused issues, e.g., the Test Explorer ETL job caused outages. As engineering continued to grow, we expected system stability to become ever more problematic as the load on the CI system increased further.

Goals, non-goals, and requirements

Goals

In finding a replacement for our Jenkins infrastructure, we had the following high-level goals:

  1. The system should be able to run on its own with minimal manual troubleshooting, and scale up smoothly as load increases without the user experience deteriorating
  2. We should be able to make changes, upgrades, and improvements to our CI experience without causing any downtime or inconvenience to our developers
  3. An intern should be able to contribute a new feature to the CI system in a week, just knowing Scala, SQL, HTML, and CSS, without knowing the intricacies of our cloud infrastructure
  4. Using the above properties, we should be able to quickly improve upon and streamline common workflows, building a coherent developer experience to reduce the quantity of CI-related questions asked

Non-goals

In any plan, what you hope to do and accomplish is only half the story. Below are some explicit non-goals, things we decided early on that we did not want to accomplish:

  1. We did not need to re-implement our entire CI system; some components worked well and without issue. We wanted to be strategic in replacing the ones that caused us the most issues (Test Explorer, Jenkins, etc.) while leaving others in place
  2. We did not need to implement Jenkins’ breadth of functionality and plugins. As mentioned above, we had already extracted various Jenkins’ features, leaving only a very small subset of Jenkins’ features that we actually relied upon
  3. We did not want to support arbitrarily complex build pipelines. The vast majority of CI jobs were simple, run pre-merge on PRs or post-merge on master, and do not have other jobs upstream or downstream. Graph-based execution engines are cool, but out of scope for this project.
  4. We did not need an infinitely scalable system. Something that can handle the CI load at the time (~500 job runs a day, ~50 concurrent runs), along with another 1-2 orders of magnitude growth, would be enough. We could always evolve the system if usage increases
  5. We did not need to replace the Github UI. Databricks uses the Github UI as the “hub” for any user trying to merge a PR, with PR/commit statuses shown for each CI job running. We just intended to improve the experience beyond that, e.g., digging into a job to see what failed, or digging into job/test history to investigate flakiness or breakages; the Github UI side of things worked great and didn’t need replacing

Requirements

Any CI system we ended up picking would need to do the following things:

  1. Run shell commands: in response to github events, triggers, or on schedules,
  2. Be operationally easy: to deploy (not too many moving parts), manage, and update (zero downtime)
  3. Configuration-as-code: so changes to configuration can be version-controlled and code-reviewed. We used Jsonnet-generated YAMLs throughout the rest of the company and were quite happy with the workflow
  4. Use the same EC2/AMI-based test environment we already use, for our Jenkins workers and Devboxes. We had a big and messy codebase, with a significant amount of supporting AWS cloud infrastructure (build caches, kubernetes clusters, etc.), and didn’t want to spend time containerizing it or having to do a cross-cloud migration.
  5. Have a nice UI for viewing the state of the service, jobs, individual runs, or logs; that’s basically all we used Jenkins for anyway.
  6. Be easily extensible with custom features: big (e.g., overview dashboards, flaky test views, etc.) and small (tooltips, links, etc.). We would inevitably want to customize the system to Databricks-specific requirements and workflows, and continue evolving it as usage patterns changed over time.
  7. Autoscale a worker pool, since CI system usage is by nature variable during the workday and workweek.
  8. Allow engineers to competently create or modify jobs, without needing to become an expert in the specific system (not possible with Jenkins’ Groovy config!)

Alternatives

Building your own bespoke CI system is a big project, and not one you want to do if you have any other alternatives. Before even considering this effort, we investigated a number of alternatives:

  • Kubernetes Prow: this was the most thoroughly investigated, including a full test deployment running some sample jobs. We found the usability wasn’t great at the time, e.g., needing to run kubectl to re-trigger job runs, and it didn’t have a streaming log viewer for in-progress logs. There was also some built-in infrastructure that seemed hardcoded to the Google Cloud Platform (GCP) (e.g., storing logs in Google Cloud Storage) that wouldn’t work with our AWS-based cloud infrastructure
  • Github Actions: at the time this did not include bring-your-own-infra, and could only work on one specific class of Microsoft Azure VMs with 2 CPUs/7 GB memory that wouldn’t suffice for us
  • SaaS Jenkins: in theory this would solve the stability issues with Jenkins, but it wouldn’t solve the UI or complexity issues
  • Jenkins X: seemed more like a CD tool than a CI solution, with a focus on deployment CD pipelines and orchestration rather than CI validation
  • Travis CI: no bring-your-own-infra. Also, they had just been acquired at the time, so their future prospects as a company and project were unclear
  • Infrabox: we had difficulty setting it up to do a proof-of-concept deployment

One common thread throughout the investigation of alternatives was the lack of “bare metal” EC2 support or bring-your-own-infra in most “modern” CI systems. They all assumed you were running inside containers, often their specific containers inside their specific infrastructure. As an organization which already had a good amount of cloud infrastructure we were perfectly happy with, running inside someone else’s container inside someone else’s cloud was a non-starter.

There were other alternatives that we did not investigate deeply: Concourse CI, CircleCI, Gitlab CICD, Buildkite, and many others. This is not due to any value judgement, but simply the reality of having to draw a line at some point. After digging into a half-dozen alternatives, some deeply, we felt we had a good sense for what was out there and what we wanted. Anyway, all we wanted was something to run bash commands on EC2 instances and show us logs in the browser; how hard could that be?

Designing Runbot

So that brings us to Runbot; the CI system we developed in house. At its core, Runbot is a traditional “three tier” web application, with a SQL database, stateless application server(s), and a website people can look at:

At its core, Databricks’ Runbot is a traditional "three tier" web application, with a SQL database, stateless application server(s), and a website people can look at

Apart from the backend system that manages CI workers to validate PRs and master commits and reporting statuses to Github, a large portion of Runbot’s value is in its Web UI. This lets a user quickly figure out what’s going on in the system, or what’s going on with a particular job that they care about.

A large portion of Databricks’ Runbot's value is in its Web UI. This lets a user quickly figure out what's going on in the system, or what's going on with a particular job that they care about.

Written in Scala, Runbot originally had about ~7,000 lines of code all-included: database access, Web/API servers, cloud interactions, worker processes, HTML Web UI, etc. It has since grown to about ~10,000 lines with the addition of new features and use cases.

Basic system architecture

The technical design of Runbot can be summarized as “not Jenkins”. All the issues we had with Jenkins, we strove hard to avoid having with Runbot.

Jenkins                                  | Runbot
XML file system datastore                | PostgreSQL database
Stateful server                          | Stateless application servers
Requires constant connection to workers  | Tolerates intermittent connection to workers
Extensible via plugins                   | Extensible by just changing its code
Groovy config language                   | Jsonnet config language
Groovy workflow language                 | No workflow language; it just runs your executable and you do your own thing

Runbot is a much simpler system than Jenkins: as mentioned earlier, it started out at around ~7,000 lines of Scala (smaller than java.util.regex!) and has grown modestly since then.

At its core, Runbot is just a traditional three-tier website. HTTP requests come in (GETs from user browsers or JSON POSTs from workers and API clients), we open a database transaction, do some queries, commit the transaction and return a response. Not unlike what someone may learn in an introductory web programming class.

Interesting design techniques

On top of the common system architecture described above, Runbot does do a few notable things in order to support its featureset:

  • We use PostgreSQL’s LISTEN/NOTIFY capability to implement a basic publish-subscribe workflow; whether it’s a browser waiting for updated logs, or an idle worker waiting for new work to be queued, using LISTEN/NOTIFY lets us notify them immediately without the performance/latency overhead of polling. Combined with SQL tables storing events, this turns Postgres into quite a competent real-time event queue, letting us keep things simple and avoid introducing other infrastructural components like Kafka or RabbitMQ (the general pattern is sketched after this list)
  • A number of “scheduled” jobs run housekeeping logic on a regular basis to keep things running: spawning new EC2 worker instances when too much work is queued, terminating old/unused EC2 instances, etc. These run directly on the Web/API servers, with coordination between the replicas again done through the Postgres database, again keeping our infrastructure simple
  • A small agent process runs on each worker to coordinate the core interactions with the application servers: listening for/acquiring work, running the necessary subprocess commands the job is configured to run, streaming logs, etc. All the worker-server interactions go through the same Web/API servers as JSON/HTTP/Websockets, just like Runbot’s other APIs and Web UIs.
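
For readers unfamiliar with LISTEN/NOTIFY, here is a minimal Python sketch of the general pattern. Runbot itself is written in Scala, and the channel name and payload below are purely illustrative:

import json
import select
import psycopg2

conn = psycopg2.connect("dbname=runbot")
conn.autocommit = True  # notifications are delivered outside explicit transactions
cur = conn.cursor()

# Producer side: after recording an event in a table, notify any listeners on a channel.
cur.execute("SELECT pg_notify('job_run_events', %s)",
            (json.dumps({"job_run_id": 123, "state": "queued"}),))

# Consumer side: an idle worker (or a log-tailing request) blocks until a notification arrives.
cur.execute("LISTEN job_run_events;")
while True:
    if select.select([conn], [], [], 60.0) == ([], [], []):
        continue  # timed out with no notifications; loop around
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        print(note.channel, note.payload)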

Despite being a distributed cluster manager with real-time pub/sub and a live-updating website, at its core Runbot works similarly to any other website you may have seen. We reuse the same database-backed HTTP/JSON web servers as a platform to overlay all other necessary domain-specific systems. This simplicity of implementation has definitely been a boon to maintenance and ease of operating and extending the system over time.

Worker management

One part of Runbot’s design that deserves special discussion is the worker lifecycle. While Runbot manages several elastic clusters of cloud EC2 instances, the actual logic involved is relatively naive. The lifecycle of a job-run and worker can be summarized as follows:

  • Jobs are configured statically via config-as-code: when Runbot is deployed, we bundle the config together with it. Config changes require a re-deploy, though as the Runbot servers are stateless this doesn’t incur downtime
  • Our Github event pipeline POSTs to Runbot’s Web/API server to queue up a job run, with some parameters (e.g. just the sha for a post-merge master job, or pr ID for a pre-merge PR job)
  • If there are already idle workers, one of them will receive the push event from the queued job run and immediately claim it. Otherwise, a scheduled job running a few times a minute will notice that there are more queued job runs than workers, and use the AWS SDK to spin up a new worker EC2 instance (up to a maxWorkers limit)
  • A new worker runs a set of configured initCommands to initialize itself (e.g., cloning the git repo for the first time), and once done it subscribes for an unclaimed job run from the Runbot server.
  • Once the worker claims a job run, it runs a set of configured runCommands to actually perform the job logic: checking out the relevant SHA/PR-branch using the parameters given with the queued job run, bazel build or bazel test or sbt testing things as appropriate.
  • Once runCommands is complete, it runs some pre-configured cleanupCommands to try and get the working directory ready to pick up another job run (bazel shutdown, git clean -xdf, etc.), and re-subscribes with the Runbot server and either receives or waits for more work.
  • If a worker is idle for more than timeouts.waiting (typically configured to be 10 minutes), it calls sudo shutdown -P now and terminates itself. We also have a scheduled background job on the Runbot server that regularly cleans up workers where shutdown fails (which happens!)

The above description is intentionally simplified, and leaves out some details in the interest of brevity. Nonetheless, it gives a good overview of how Runbot manages its workers.
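
As a rough illustration of the “spawn workers when too much work is queued” housekeeping step, a Python/boto3 sketch might look like the following. Runbot’s actual implementation is in Scala, and the AMI ID, instance type, and counting logic here are assumptions made for the example:

import boto3

def scale_up_if_needed(queued_runs, live_workers, max_workers, ami_id, instance_type):
    # Spawn one new worker per unclaimed queued run, capped at the configured pool size.
    deficit = min(queued_runs - live_workers, max_workers - live_workers)
    if deficit <= 0:
        return
    ec2 = boto3.client("ec2")
    ec2.run_instances(
        ImageId=ami_id,             # pre-baked AMI containing the test environment
        InstanceType=instance_type,
        MinCount=deficit,
        MaxCount=deficit,
    )

# Example: 12 queued runs, 5 live workers, pool capped at 50 workers.
scale_up_if_needed(12, 5, 50, ami_id="ami-0123456789abcdef0", instance_type="c5.4xlarge")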

Some things worth noting about the above workflow include:

  • The behavior of the worker cluster is an emergent property of the system config: the maxWorkers, the frequency of running the spawn-instance background job, the timeout.waiting. While not precise, it does give us plenty of knobs we can turn to make long-lived single-worker jobs or short-lived multi-worker jobs. timeout.waiting can be tweaked to trade off between idle workers and job-run queue times: longer worker idle times mean it’s more likely there will be a waiting worker to pick up your job run immediately upon queuing it.
  • We re-use workers aggressively between job runs within a single job. While this introduces potential interference between job runs, it increases performance dramatically due to local build caches, long-lived daemon processes, etc. Every once in a while we have to investigate and deal with a worker getting into a bad state.
  • We do not reuse workers between jobs. We actually had this feature in our Jenkins autoscaling system, and could in theory improve utilization of workers by sharing them between jobs, but we didn’t think the increased complexity was worth it. We didn’t end up implementing it on Runbot.
  • We don’t do any clever CPU/Memory/Disk-level resource optimization. This is partly because that kind of system normally requires running everything inside containers, and our jobs typically run on “raw” EC2. With different jobs being configured to use different instance types, and us scaling the size of our instance pool up and down as usage fluctuates, we’re effectively letting AWS perform a coarse-grained resource optimization on our behalf.

Runbot’s worker management system can be thought of as similar to “objects” in object-oriented programming: you “allocate” a worker by calling the EC2 API, invoke the “constructor” by running initCommands, and then call “methods” on it with parameters by running runCommands. Like “objects,” Runbot workers use internal mutability and data locality for performance, but try to preserve some invariants between runs.

It’s almost a bit surprising how well Runbot’s worker management works. “ask for instances when there’s work, have them shut themselves down when there isn’t” isn’t going to win any awards for novelty. Nevertheless, it has struck a good balance of efficiency, utilization, simplicity, and understandability. While we have definitely encountered problems as usage has grown over the past two years, the naivete of its worker-management system isn’t one of them.

User-facing design

Scalability and stability were not the only motivations for Runbot. The other half of the picture was to provide a vastly better user experience, based on the learnings from the Jenkins-related microservices that had organically grown over time.

Part of the goal of Runbot is information density: every inch of screen real estate should be something that a user of the system may want to see. Consider again the home page dashboard shown earlier: it’s a relatively straightforward page, with each job having a number of workers (small squares on the left) and a number of job runs (histogram of durations on the right), with workers and job runs both colored to show their status (yellow/grey = initializing, green = idle/success, blue = in progress, red/black = failed). At a glance, we’re able to draw many conclusions about the nature of the jobs we are looking at:

Part of the goal of the Databricks’ Runbot UI is information density: every inch of screen real estate should be something that a user of the system may want to see.

Even when things are going well, we can already notice some things. Why is compilation on Compile-MacOS-Master flaky? Why was Compile-Master broken for a while, and could that have been avoided?

But this ability is even more important when things are going wrong. Maybe a job started becoming super flaky, maybe a worker instance got corrupted and is in a bad state, maybe AWS is out of instances in that region and new EC2 workers cannot start. All of those are things that can often be seen at a glance, and every element on the page has tooltips and links to further information:

Databricks Runbot UI is designed to provide helpful system information when things are going well, and, more importantly, not so well.
 
Example drill-down into system issue using the Databricks Runbot UI.

It’s worth contrasting Runbot’s main dashboard with the equivalent Jenkins UI:

In comparison to Databricks Runbot UI, Jenkins dashboard requires a lot more drilling to find the information you need.

While Jenkins theoretically has all the same data if you drill in, it generally requires a lot more clicking around to find the information you need. Something obvious at a glance on Runbot may take a dozen clicks and multiple side-by-side browser windows to notice on Jenkins. (Above I show the traditional UI, but the newer “Blue Ocean” UI isn’t much better.) For example, while at a glance Jenkins may show only the most basic summary information, Runbot is able to show you the whole history of the job and its variation over time, giving a visceral feel for how a job is behaving.

Using Runbot, someone investigating an issue is thus able to quickly drill down to information about the job, about the job run, about the worker, all to aid them in figuring out what’s going wrong. Front-end UI design is often overlooked when building these internal backend systems, but Runbot’s UI is a core part of its value proposition and why people at Databricks like to use it.

Static HTML UI

By and large, Runbot’s UI is a simple server-side HTML/CSS UI. There is hardly any interactivity (any configuration is done through updating and deploying the Jsonnet config files), and what little interactivity there is (starting, cancelling, and re-starting a job run) is implemented via simple HTML forms/buttons and POST endpoints. We have some custom logic to allow live-updating dashboards and streaming logs without a page refresh, but that’s purely a progressive enhancement, and the UI is perfectly usable without it.

One interesting consequence of this static HTML UI is how performant it is. The Github Actions team had published a blog post about how they used clever virtualization techniques to render large log files of up to 50k lines. Runbot, with its static server-side rendered logs UI (including server-side syntax highlighting), is able to render 50k lines without issue in a browser like Chrome, without any client-side Javascript at all!

Perhaps the larger benefit of static HTML is how simple it is for anyone to contribute. Virtually any programmer knows some HTML and CSS, and would be able to bang out some text and links and colored-rectangles if necessary. The same can’t be said for modern Javascript frameworks, which while powerful, are deep and complex and require front-end expertise to set up and use correctly.

While many projects these days use a front-end framework of some sort, Runbot, being mostly static HTML, has its own benefits. We do not have any front-end specialists maintaining Runbot, and any software engineer is able to bang out a quick page querying the data they want and rendering it in a presentable format.

Rolling out Runbot

Scale

Since its introduction in 2019, we have been gradually offloading jobs from Jenkins and moving them to Runbot. Usage of Runbot has grown considerably, and it now processes around ~15,000 job runs a day.

Usage of Databricks’ Runbot has grown considerably, and it now processes around ~15,000 job runs a day.

This gradual migration has allowed us to reap benefits immediately. By moving the most high-load/high-importance jobs from Jenkins to Runbot, not only do those jobs get the improved UX and stability that Runbot provides, but those jobs left behind on Jenkins also benefit from the reduced load and improved stability. Right now we are at about 1/3 on Jenkins and 2/3 on Runbot, and while the old Jenkins master still has occasional issues, the reduced load has definitely given it a new lease on life. We expect to gradually move more jobs from Jenkins to Runbot as time progresses.

The growth of usage on Runbot has not been without issue. From the original plan of 50 concurrent workers, we’re now approaching 500 concurrent workers with 15,000 CPU cores during peak hours:

Databricks Runbot instances per worker pool.

Databricks Runbot cores per worker pool.

Problems

While we performed load testing before the initial rollout to make sure things worked at the expected scale, we did end up hitting some speed bumps as the usage of our system grew 10x over the past two years. Given that the Runbot application servers are stateless and easily replaceable, most of the problems ended up being with the Postgres RDS database, which is a single point of failure:

  1. Sometimes the query planner can be mercurial and uncooperative. Nothing like having Postgres spontaneously decide that a millisecond-long query no longer needs the index and should instead start doing minutes-long full table scans, at 1am (local time), just as the rest of engineering at HQ (in a different time zone) is starting to wake up and try to use your CI system!
  2. The Postgres read replica could sometimes get overloaded, fall behind on replication, and bring the master down with it. It turns out the transaction logs that the master keeps around while waiting for the replica to catch up grew enough to consume all disk space on the master, not something we expected the replica to be able to do.
  3. By default, RDS disk throughput is throttled based on your configured disk size. That means if you have a small database with a lot of disk traffic, you may do well to configure a larger disk even if you don’t expect to use it all.

Despite being problems that caused outages, it is comforting to see that these issues are common problems shared by basically everyone who is operating a database-backed web service. Common problems have common solutions, which sure beats trying to untangle the interactions between Jenkins threadpool explosions, its multi-terabyte XML disk store, and our menagerie of custom microservices surrounding it.

Extensibility

Part of Runbot’s sales pitch was how much it simplified development on our CI system. This has largely been borne out over the past two years it’s been running.

We have had several individuals contribute useful features to the Runbot codebase, from smaller UI improvements like improved test-name truncation or more informative tooltips to bigger features like binary artifact storage or job run history search. We’ve had experienced folks join the company and immediately tweak the UI to smooth over some rough edges. We’ve had interns contribute significant features as a small part of their main internship project.

None of the individuals contributing these improvements were experts in the Runbot system, but because Runbot runs as a simple database-backed web application, they were able to find their way around without issue and modify Runbot to their liking. This ease of making improvements would have been unheard of with the old Jenkins infrastructure.

This has validated our original hope that we’d be able to make Runbot simple enough that anyone would be able to contribute just by writing some Scala/SQL/HTML/CSS. Runbot may not have the huge ecosystem of plugins that Jenkins has, but its ability to be easily patched to do whatever we want more than makes up for it!

Going forward

So far we have talked about what we have done with Runbot in the past; what about the future? There are a few major efforts going into Runbot now and in the future.

Scale and stability

Runbot’s usage has grown around 10x in the past two years, and we expect it to continue to grow rapidly in future. Databricks’ engineering team continues to grow and, as we mentioned, will continue to port jobs off our old Jenkins master. Not just that, but as the company matures, people’s expectations continue to grow: a level of reliability that was acceptable in 2019 Databricks is no longer acceptable in 2021.

This means we will have to continue streamlining our processes, hardening our system, and scaling out the Runbot system to handle ever more capacity. From the current scale of ~500 concurrent workers, we could expect that to double or triple over the next year, and Runbot needs to be able to handle that. Scaling database-backed web services is a well-trodden path, and we just need to execute on this to continue supporting Databricks’ engineering as it grows.

Supporting new use cases

One of Runbot’s selling points was that as a fully-bespoke system, we can make it do whatever we want by changing its (relatively small) codebase. As the CI workflows within Databricks evolve, with new integration testing workflows and pre/post-merge workflows and flaky test management, we will need to adapt Runbot with new UI and new code paths to support these workflows.

One interesting example of this is using Runbot as a general-purpose job runner in restricted environments. While we’re familiar with both Jenkins and Runbot, the fact that Runbot is such a small system means that it has a minimal attack surface area. No complex permissions system, no user account management, no UI for re-configuring jobs at runtime, a tiny codebase that’s easily audited. That makes it much easier to be confident you can deploy it securely without any room for vulnerabilities.

Another use case may be building out an experience for managing the deployment validation pipeline. Currently per-commit Master tests, N-hourly integration tests, and others are handled in an ad-hoc manner when deciding whether or not a commit is safe to promote to staging and beyond. This is a place where Runbot’s extensibility really shines, as we could build out any workflow imaginable on Runbot’s simple web-application platform.

Self-service

From its conception, Runbot was developed as a product managed and operated by the team that developed it. As time goes on, other teams are adopting it more and more, and the system is making the transition from a bespoke piece of infrastructure to a self-service commodity, like Github Actions or Travis CI.

Over the past two years, we have already polished over a lot of the rough edges from when Runbot was operated only by the team that developed it, but there is still a lot of work to do to turn it into an “appliance” that someone can operate without ever needing to peek under the hood to understand what’s going on. The goal here is to make Runbot as simple to use as Travis CI or Github Actions, turning it from “internal infrastructure” into a “product” or “appliance” anyone can just pick up and start using without help.

Conclusion

In this blog post, we have discussed the motivation, design, and rollout of Databricks’ Runbot CI system. As a bespoke CI system tailored to Databricks’ specific requirements, Runbot has several advantages over our old Jenkins infrastructure: stability, simplicity, improved UX, and easier development and evolution over time.

Runbot is an interesting challenge: a mix of cloud infrastructure, distributed systems, full-stack web, front-end UI, and developer experience all mixed into one project. Despite the common adage about re-inventing the wheel, over the past two years Runbot has proven itself a worthy investment, cutting through a decade and a half of organic growth to provide a streamlined CI system and experience that does exactly what we need and nothing more. Runbot has proved crucial to supporting Databricks as it has grown over the past two years, and we expect it will play a pivotal role in supporting Databricks as it continues to grow in the future.

If you’re interested in joining Databricks engineering, whether as a user or developer of our world class internal systems, we’re hiring!

 

--

Try Databricks for free. Get started today.

The post Developing Databricks’ Runbot CI Solution appeared first on Databricks.

Creating an IP Lookup Table of Activities in a SIEM Architecture


When working with cyber security data, one thing is for sure: there is no shortage of available data sources. If anything, there are too many data sources with overlapping data. Your traditional SIEM (security information and event management) tool is not really fit to handle the complex transformation and stitching of multiple data sources in an efficient manner. Besides, it is not cost-effective to process terabytes or petabytes of event data on a daily basis in a SIEM system.

One common network security use case requiring the marriage of multiple data sources is the attribution of an IP address to a certain user. When working with a traditional SIEM tool, to determine malicious or suspicious activities on the network, SOC (security operation center) analysts have to launch multiple queries against different data sources (VPN, DHCP, etc.) and manually stitch them together to find out the timeline and actions of an IP address. This can take up to 15 minutes per IP address: precious time in the event of a security incident that could be saved by automating the aggregation and curation of data before landing it in the SIEM solution.

In this blog, we will cover a simple approach to data collection, combining multiple data sources and automation to create an IP lookup table. This table is a fundamental building block of threat intelligence for creating a holistic picture of the activities in your network. It will enable you to query IP addresses in a given time window, attribute them to users/MAC addresses, and track events in the order they happened.

We will use Cisco ISE Posture and Profiler events as VPN logs and Infoblox DHCP logs for off-VPN activities. We are using Databricks Labs Data Generator to simulate these event logs and push them into an S3 bucket. In a real-world scenario, you can use a real-time streaming service such as Kinesis Firehose to push such logs from on-premise servers into S3. These logs are then ingested using the Databricks Lakehouse Platform. We have built the entire end-to-end pipeline using Delta Live Tables, which allows us to ensure data quality in each stage of the incremental data curation, as well as lineage for the end-to-end pipeline. Once data is curated into a master IP lookup table, a SIEM tool such as Splunk (using Databricks Splunk add-on) or a BI tool or dashboard such as Tableau, PowerBI or Looker can be used as the presentation layer.

Cyber security lakehouse architecture

Figure 1 illustrates an example of a typical cyber security ecosystem. It is an entangled web of different data sources and systems with a SIEM tool in the mix. The SOC analyst has to query different sources and stitch results together to get meaningful insights, which gets even more complicated with increasing volumes of event data. An average enterprise will have petabytes of data to comb through, and without the right tools to handle these massive datasets, this task could be quite tedious, if not impossible.

an example of a typical Cyber Security ecosystem, source Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity)

Figure 1: An example of a standard Cyber Security ecosystem (source Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity)

We propose an alternative cyber security lakehouse architecture, as illustrated in figure 2 below. The key steps to realize this architecture are:

  • Land data from all sources into cheap object storage
  • Shift the heavy lifting into a highly-optimized Big Data platform: process, curate, and stitch data using Delta Lake
  • Only move curated data once it is ready to be consumed into your SIEM solution

Among other advantages, the Lakehouse architecture delivers the following benefits:

  • Breaking silos: As data is available to everyone on the same platform, each data persona can use their favorite tools (Notebooks, DBSQL, etc.) on top of Delta Lake to access a single source of truth. Furthermore, they can easily review code, pair program and collaborate on the same platform without having to export code or data somewhere else.
  • Combining batch and streaming: Batch and streaming queries are made almost identical using Delta Lake. You can build your pipelines once and make them future-proof in case processing mode changes in the future.
  • Quality assurance: You can set data quality constraints in different stages of your processing pipeline using EXPECTATIONS. This will allow you to choose how to react to unexpected data.
Databricks Cyber Security Lakehouse architecture

Figure 2: Cyber security Lakehouse architecture

Building cyber security data pipelines with Delta Live Tables

From Delta Live Tables official documentation:

Delta Live Tables is a framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling.

The diagram below shows the end-to-end pipeline to create an IP lookup table from VPN and DHCP logs. We chose to use Delta Live Tables (DLT) to build the pipeline because of its simplicity, the data quality assurance measures it provides, and the ability to track the lineage of the entire pipeline. With DLT, you can easily build your ETL pipelines using either Python or SQL. Diagram 1 below shows the DAG of the ETL pipeline we built to create the IP lookup table.

With Delta Live Tables, you can easily build your ETL pipelines using either Python or SQL.

Diagram 1: Delta Live Table DAG

Code walkthrough

You can find the notebooks here. As illustrated in diagram 1, we land each data source in a raw (Bronze) layer. Then we parse the raw logs using appropriate parsers into a Silver layer. Finally, we get the columns of interest (IpAddress, MacAddress, Username, etc.) from the parsed tables and combine them into a curated, ready to consume IP Lookup table (Gold).

To parse Cisco ISE Profiler events, which come in a semi key-value style, we can use PySpark’s from_json method. In addition, we have extracted event timestamps from the metadata (header) accompanying the logs. We have used a similar method to parse Cisco ISE Posture events.

import dlt
from pyspark.sql import functions as F

# profiler_schema and parse_string_as_json are defined in the accompanying notebooks
@dlt.table
def ise_profiler_silver():
    return (
        dlt.read_stream("ise_profiler_bronze")
        .withColumn("value", F.concat(F.lit(",Metadata="), F.col("value")))
        .withColumn("parsed", parse_string_as_json(F.col("value")))
        .withColumn("data", F.from_json(F.col("parsed"), profiler_schema))
        .select(F.col("data.*"))
        .withColumn(
            "MetadataTimestamp",
            F.concat_ws(
                " ", F.array([F.split(F.col("Metadata"), " ")[i] for i in [8, 9]])
            ),
        )
    )
 
We use regex mapping (PySpark’s regexp_extract method) to parse DHCP logs. Each message type (DHCPACK, DHCPREQUEST, etc.) is parsed using its expected regex pattern. For simplicity, we have chosen to parse DHCP messages of types ACK, NAK, OFFER, REQUEST, RELEASE, DECLINE, and EXPIRE.
 
def parse_dhcp_message_with_regex(source_table_name="dhcp_bronze"):
    return (
        dlt.read_stream(source_table_name)
        .select(
            "value", F.regexp_extract("value", REGEX_DHCPOFFER, 0).alias("EventType")
        )
        .where(
            F.col("EventType").isin(
                [
                    "DHCPREQUEST",
                    "DHCPACK",
                    "DHCPNAK",
                    "DHCPOFFER",
                    "DHCPRELEASE",
                    "DHCPDECLINE",
                    "DHCPEXPIRE",
                ]
            )
        )
        # Parse different types of messages with different regex patterns into IP address column
        .withColumn(
            "IpAddress",
            F.when(
                F.col("EventType").isin(["DHCPREQUEST"]),
                F.regexp_extract("value", "(?<=for )(.*)(?= from)", 0),
            )
            .when(
                F.col("EventType").isin(
                    ["DHCPACK", "DHCPNAK", "DHCPOFFER", "DHCPEXPIRE"]
                ),
                F.regexp_extract("value", "(?<=on )(.*)(?= to)", 0),
            )
            .when(
                F.col("EventType").isin(["DHCPRELEASE", "DHCPDECLINE"]),
                F.regexp_extract("value", "(?<=of )(.*)(?= from)", 0),
            )
            .otherwise(F.lit("n/a")),
        )
        # Parse different types of messages with different regex patterns into Mac address column
        .withColumn(
            "MacAddress",
            F.when(
                F.col("EventType").isin(["DHCPREQUEST"]),
                F.regexp_extract("value", "(?<=from )(.*)(?= via)", 0),
            )
            .when(
                F.col("EventType").isin(["DHCPACK", "DHCPNAK", "DHCPOFFER"]),
                F.regexp_extract("value", "(?<=to )(.*)(?= via)", 0),
            )
            .when(
                F.col("EventType").isin(["DHCPRELEASE", "DHCPDECLINE"]),
                F.regexp_extract("value", "(?<=from )(.*)(?= via)", 0),
            )
            .when(
                F.col("EventType").isin(["DHCPEXPIRE"]),
                F.regexp_extract("value", "(?<=to )(.*)", 0),
            )
            .otherwise(F.lit("n/a")),
        )
        .select(
            # Extract event timestamp, add missing year value to the start of the string
            F.to_timestamp(F.concat(F.lit("2021 "), F.substring("value", 0, 15)), 'yyyy MMM dd HH:mm:ss').alias("EventTimestamp"),
            "IpAddress",
            "MacAddress",
            "EventType",
        )
        # sometimes logs appear to have patterns like mac-address (some-string) or ip-address (mac-address)
        # this block unpacks these strings into separate mac and ip address columns
        .withColumn("Address_part1", F.split(F.col("MacAddress"), " ")[0])
        .withColumn("Address_part2", F.split(F.col("MacAddress"), " ")[1])
        .withColumn(
            "MacAddress",
            F.when(F.col("IpAddress") == "", F.col("Address_part2")).otherwise(
                F.col("Address_part1")
            ),
        )
        .withColumn("MacAddress", F.regexp_replace(F.col("MacAddress"), "[()]", ""))
        .withColumn(
            "IpAddress",
            F.when(F.col("IpAddress") == "", F.col("Address_part1")).otherwise(
                F.col("IpAddress")
            ),
        )
        .drop("Address_part1", "Address_part2")
    )

To ensure the data quality of parsed tables, we have set expectations on the incoming data. As expressed by @dlt.expect_all_or_drop, we are expecting EventTimestamp, IpAddress, and MacAddress not to have any missing values.

valid_dhcp_parsed_record = {
    "valid_timestamp": "EventTimestamp IS NOT NULL",
    "valid_IpAddress": "IpAddress IS NOT NULL",
    "valid_MacAddress": "MacAddress IS NOT NULL",
}


@dlt.table
@dlt.expect_all_or_drop(valid_dhcp_parsed_record)
def dhcp_silver():
    return parse_dhcp_message_with_regex()

This is what DHCP records look like before parsing:

Jul 13 09:08:00 12.99.46.134 dhcpd[3761]: DHCPACK to 12.88.0.15 (f0:5e:0c:b2:35:bf) via eth2

And here is what the schema of the parsed Delta table looks like:

EventTimestamp:string
IpAddress:string
MacAddress:string
EventType:string

Finally, once the three data sources are parsed and combined, we have access to IP activities and their timelines, as illustrated in figure 3:

An IP Lookup table which consists of DHCP and VPN logs

Figure 3: Gold IP lookup table
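
As an illustration of how an analyst might use this table, the following sketch pulls the activity timeline of a single IP address within a time window. The table name ip_lookup_gold is an assumption for this example, and the columns follow the silver schema shown above (with EventTimestamp stored as a string):

from pyspark.sql import functions as F

ip_events = (
    spark.table("ip_lookup_gold")
    .where(F.col("IpAddress") == "12.88.0.15")
    .where(F.col("EventTimestamp").between("2021-07-13 00:00:00", "2021-07-14 00:00:00"))
    .orderBy("EventTimestamp")
)
display(ip_events)  # Databricks notebook display; use .show() outside Databricks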

Next steps

In this blog post, we have walked you through the steps to build a simple IP attribution table. This table is the first building block to construct a holistic picture of activities in your network for threat detection and incident response at all times. A sensible next step is to onboard more data sources and add more columns to this table.

As you have more and more information about IP addresses in your table, you can start extracting patterns of “normal” behavior for an IP in your network using machine learning algorithms. Then you can detect anomalies and flag an IP if it behaves outside the normal boundaries.

You can run the accompanying notebooks by following the links posted directly below:

For more Cyber Security content, check out Building a Cybersecurity Lakehouse for CrowdStrike Falcon Events and Detecting Criminals and Nation States through DNS Analytics blogs.


--

Try Databricks for free. Get started today.

The post Creating an IP Lookup Table of Activities in a SIEM Architecture appeared first on Databricks.

MLflow for Bayesian Experiment Tracking

$
0
0

This post is the third in a series on Bayesian inference ([1], [2]). Here we illustrate how to use managed MLflow on Databricks to perform and track Bayesian experiments using the Python package PyMC3. The result is systematic, reproducible ML experimentation pipelines that can be shared across data science teams, thanks to MLflow's version control and variable tracking features. The data tracked by MLflow can be accessed through the managed service on Databricks using either the UI or the API. Data scientists who are not using the managed MLflow service can use the API to access the experiments and the associated data. On Databricks, access to the data and the different models is managed through the ACLs that MLflow provides. The models can then be easily productionized and deployed through a variety of frameworks.

Tracking Bayesian experiments

What does MLflow do?

MLflow is an open-source framework for managing your ML lifecycle. It can be used as a managed service on Databricks or installed as a stand-alone deployment using the open-source libraries. This post primarily deals with experiment tracking, but we will also share how MLflow can help with storing trained models in a central repository and with model deployment. In the context of tracking, MLflow allows you to store the following (a minimal logging sketch follows the list):

  1. Metrics — usually related to the model performance, such as deviance or Rhat.
  2. Parameters — variables that help to define your model or run. In a Bayesian setting, this can be your hyperparameter, prior or hyperprior distribution parameters. Note that these are always stored as string values.
  3. Tags — key-value pairs to keep track of information regarding your run, such as the information regarding a major revision of the code to add a feature.
  4. Notes — any information regarding your run that you can enter in the MLflow UI. This can be a qualitative evaluation of the run results and can be quite a useful tool for systematic experimentation.
  5. Artifacts  — this stores a byproduct or output of your experiment such as files, images, etc.
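
As a quick orientation, the sketch below logs one item of each kind in a single run; the run name, metric name and file name are illustrative rather than taken from the accompanying notebook.

import mlflow

# Illustrative run: one tag, one parameter, one metric and one text artifact
with mlflow.start_run(run_name="bayesian-sir-example"):
    mlflow.set_tags({"Version notes": "illustrative run"})             # tags
    mlflow.log_param("nsamples", 2000)                                 # parameters
    mlflow.log_metric("rhat_max", 1.01)                                # metrics
    mlflow.log_text("free-form notes or a run summary", "notes.txt")   # artifact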

Setting up a store for open-source MLflow

This section only applies to the open-source deployment of MLflow, since this is automatically taken care of with the hosted MLflow on Databricks. MLflow has a backend store and an artifact store. As the name indicates, the artifact store holds all the artifacts (including metadata) associated with a model run and everything else exists in the backend store. If you are running MLflow locally, you can configure this backend store, which can be a file store or a database-backed store.  You can run a tracking server anywhere if you so choose, as shown below:

mlflow server \
    --backend-store-uri /mnt/persistent-disk \
    --default-artifact-root s3://my-mlflow-bucket/ \
    --host 0.0.0.0

You can then specify the tracking server to be the one you set above as:

mlflow.set_tracking_uri("http://YOUR-SERVER:4040")

The workflow for tracking a Bayesian experiment

On Databricks, all of this is managed for you, minimizing the configuration time needed to get started on your model development workflow. However, the following is applicable to both managed and open-source MLflow deployments. MLflow creates an experiment, identified by an experiment ID, and each experiment consists of a series of runs identified by a run ID. Each run has its associated parameters and artifacts logged. Here are the steps to create a workflow:

  1. Create an experiment by passing the path to the experiment folder; this returns an experiment ID. You can also provide a path to store your artifacts, such as files, images, etc.
  2. Start the experiment with the experiment ID returned from the above step. The PyMC3 inference code is under this context manager.
  3. Use tags to version your code and data used.
  4. Log the model/run parameters, specifically the prior and hyperprior distribution parameters, the number of samples and tuning samples, and the likelihood distribution, as shown in the snippet below.
mlflow.set_tags({"Version notes": "Full run",
                 "Start date": covid_data.data_begin,
                 "End date": covid_data.data_end})

nsamples = 2000
ntune = 2000
Hyperprior = {"Lambda mean": 0.75, "Lambda std": 2, "Mu mean": 0.75, "Mu std": 2}
Prior = {"Lambda std": 1.0, "Mu std": 1.0}
Likelihood = {"Name": "Normal", "Parameters": {"std": 0.01}}

prior_lam = pm.Lognormal('prior_lam', Hyperprior['Lambda mean'], Hyperprior['Lambda std'])
prior_mu = pm.Lognormal('prior_mu', Hyperprior['Mu mean'], Hyperprior['Mu std'])
prior_lam_std = pm.HalfNormal('prior_lam_std', Prior['Lambda std'])
prior_mu_std = pm.HalfNormal('prior_mu_std', Prior['Mu std'])
lam = pm.Lognormal('lambda', prior_lam, prior_lam_std, shape=2)
mu = pm.Lognormal('mu', prior_mu, prior_mu_std, shape=2)

mlflow.log_param("Hyperprior", Hyperprior)
mlflow.log_param("Prior", Prior)
mlflow.log_param("Samples", nsamples)
mlflow.log_param("Tuning samples", ntune)
mlflow.log_param("Likelihood", Likelihood)
  5. Once the model has finished sampling, the results contained in the trace can be saved as an artifact using the log_artifacts() method. This will be a folder called ‘trace’ that contains all the information regarding the samples that were drawn by each of the chains. The trace information can be summarized by invoking the PyMC3 summary() method on the trace object. The trace summary is a data frame that can be saved as a JSON string object using the log_text() method from MLflow.
# Summarize the trace and log the summary as a JSON text artifact
trace_summary = az.summary(trace)
res = trace_summary.to_json(orient='index')
read_json = json.dumps(res)
mlflow.log_text(read_json, artifact_file='trace_summary')

# Save the full trace locally, copy it to DBFS and log the folder as an artifact
pm.save_trace(trace, directory="trace", overwrite=True)
os.system('cp -R trace /dbfs/Users/USERNAME/mlflow')
mlflow.log_artifacts('/dbfs/Users/USERNAME/mlflow/trace', artifact_path='trace')
mlflow.end_run()

Inspecting an experiment

Once the experiment has completed, you can go back and inspect the MLflow UI or programmatically extract the run information. For example, if the current experiment ID is ‘10618537’, you can extract the information about the experiment:

client = mlflow.tracking.MlflowClient()
experiment = mlflow.get_experiment(10618537)
print(experiment)
<Experiment: artifact_location='dbfs:/databricks/mlflow-tracking/10618537', experiment_id='10618537', lifecycle_stage='active', name='/Users/USERNAME/mlflow_hier_bayesian', tags={'mlflow.experimentType': 'MLFLOW_EXPERIMENT',
 'mlflow.ownerEmail': 'USEREMAIL',
 'mlflow.ownerId': 'USERID'}>


Search for an experiment run

Assuming that you know your experiment ID, you can search for all the runs within an experiment and extract the data stored for this run, as indicated below:

>>> mlflow.search_runs(experiment_ids = ["10618537"])

MLflow for Bayesian Experiment Tracking

run = client.get_run('393287b22e59466dadbaea85052782a9')
print(run.data)
<RunData: metrics={}, params={'Hyperprior': "{'Lambda mean': 0.75, 'Lambda std': 2, 'Mu mean': 0.75, 'Mu "
               "std': 2}",
 'Prior': "{'Lambda std': 1.0, 'Mu std': 1.0}"}, tags={'End date': '4/1/20',
 'Start date': '3/1/20',
 'Version notes': 'Added artifacts',
 'mlflow.databricks.cluster.id': '0512-212345-arbor51',
 'mlflow.databricks.cluster.info': '{"cluster_name":"test","spark_version":"7.6.x-cpu-ml-scala2.12","node_type_id":"i3.xlarge","driver_node_type_id":"i3.xlarge","autotermination_minutes":120,"disk_spec":{},"num_workers":0}',
 'mlflow.databricks.cluster.libraries': '{"installable":[{"pypi":{"package":"pymc3"}}],"redacted":[]}',
 'mlflow.databricks.notebookID': '10564146',
 'mlflow.databricks.notebookPath': '/Users/USERNAME/Hierarchical_downsampled_mlflow',
 'mlflow.databricks.notebookRevisionID': '1626229409220',
 'mlflow.databricks.webappURL': 'https://demo.cloud.databricks.com',
 'mlflow.source.name': '/Users/USERNAME/Hierarchical_downsampled_mlflow',
 'mlflow.source.type': 'NOTEBOOK',
 'mlflow.user': 'USEREMAIL'}>
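
Because parameters are stored as strings (as noted earlier), a logged dictionary such as Hyperprior comes back as its string representation. Here is a small sketch of turning it back into a Python dict, assuming the run object retrieved above:

import ast

# Parameters are logged as strings, so literal_eval restores the original dict
hyperprior = ast.literal_eval(run.data.params["Hyperprior"])
print(hyperprior["Lambda mean"])  # 0.75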

Accessing the artifacts from a run

The artifacts associated with this run can be listed as shown below. The file size and path are shown for each file.

f = client.list_artifacts(run_id)
print(f)

[<FileInfo: file_size=1808, is_dir=False, path='trace_summary'>,
 <FileInfo: file_size=None, is_dir=True, path='trace'>]

MLflow manages the artifacts for each run; you can view and download them using the UI, or access them through the API. In the example below, we load the trace information and the trace summary from a prior run.

local_path_summary = client.download_artifacts(run_id, "trace_summary", './')
local_path_trace = client.download_artifacts(run_id, "trace", './')
print("Artifacts downloaded in: {}".format(local_path_summary))
print("Artifacts downloaded in: {}".format(local_path_trace))
with open(local_path_summary,'r') as f:
  data = json.load(f)
  trace_summary = pd.read_json(data)
with pm.Model() as model:
    trace2 = pm.load_trace(local_path_trace)
    data_load = az.from_pymc3(trace=trace2)
az.plot_posterior(trace2.get_values('R0')[:,0])
az.plot_posterior(trace2.get_values('R0')[:,1])

If you run the above, you will notice that the trace summary contains the same information as before. The parameter estimates loaded from the artifacts file or the trace summary, as described by their distributions, now become the parameters of the current model. If desired, you can continue to fit new data by using the currently estimated posteriors as the priors for a future training cycle, as sketched below.
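
Here is a rough sketch of that pattern, under the assumption that the loaded trace_summary indexes the variables by name and that a simple Normal prior is an acceptable approximation of the previous posterior; the rest of the model body is elided.

import pymc3 as pm

# Reuse a previous run's posterior estimates as priors for a new model
lam_mean = trace_summary.loc["prior_lam", "mean"]
lam_sd = trace_summary.loc["prior_lam", "sd"]

with pm.Model() as updated_model:
    prior_lam = pm.Normal("prior_lam", mu=lam_mean, sigma=lam_sd)
    # ... rebuild the remaining priors and likelihood, then call pm.sample() on new data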

Conclusion

In this post, we have seen how one can use MLflow to systematically perform Bayesian experiments using PyMC3. The logging and tracking functionality provided by MLflow can be accessed either through the managed MLflow provided by Databricks or for open-source users through the API. Models and model summaries can be saved as artifacts and can be shared or reloaded into PyMC3 at a later time.

To learn more, please check out the attached notebook.

Check out the notebook to learn more about managed MLflow for Bayesian experiments. Learn more about Bayesian inference in my Coursera courses:

--

Try Databricks for free. Get started today.

The post MLflow for Bayesian Experiment Tracking appeared first on Databricks.

Introducing Apache Spark™ 3.2

$
0
0

We are excited to announce the availability of Apache Spark™ 3.2 on Databricks as part of Databricks Runtime 10.0. We want to thank the Apache Spark community for their valuable contributions to the Spark 3.2 release.

The number of monthly Maven downloads of Spark has rapidly increased to 20 million, a year-over-year doubling of monthly Spark downloads. Spark has become the most widely-used engine for executing data engineering, data science and machine learning on single-node machines or clusters.

Apache Spark has become the most widely-used engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Continuing with the objectives to make Spark even more unified, simple, fast and scalable, Spark 3.2 extends its scope with the following features:

  • Introducing pandas API on Apache Spark to unify small data API and big data API (learn more here).
  • Completing the ANSI SQL compatibility mode to simplify migration of SQL workloads.
  • Productionizing adaptive query execution to speed up Spark SQL at runtime.
  • Introducing the RocksDB state store to make state processing more scalable.

In this blog post, we summarize some of the higher-level features and improvements. Keep an eye out for upcoming posts that dive deeper into these features. For a comprehensive list of major features across all Spark components and JIRA tickets resolved, please see the Apache Spark 3.2.0 release notes.

Unifying small data API and big data API

Python is the most widely used language on Spark. To make Spark more Pythonic, the pandas API was introduced to Spark, as part of Project Zen (see also Project Zen: Making Data Science Easier in PySpark from Data + AI Summit 2021). Now, the existing users of pandas can scale out their pandas applications with one line change. As shown below, performance can be greatly improved in both single-node machines [left] and multi-node Spark clusters [right], thanks to the sophisticated optimizations in the Spark engine.
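
For readers new to the API, the sketch below illustrates the intended one-line change; the file path and column names are made up for illustration.

# import pandas as pd          # before
import pyspark.pandas as ps    # after: pandas API on Spark (Spark 3.2+)

# Same pandas-style code, now backed by the Spark engine
psdf = ps.read_csv("/path/to/orders.csv")         # illustrative path
psdf["total"] = psdf["price"] * psdf["quantity"]  # illustrative columns
print(psdf.head())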

pandas performance can be greatly improved in both single-node machines [left] and multi-node clusters [right], thanks to the sophisticated optimizations in the Spark engine.

Figure. pandas vs. pandas API on Spark

At the same time, Python users can also seamlessly leverage the unified analytics functionality provided in Spark, including querying data via SQL, streaming processing and scalable machine learning (ML). The new pandas API also provides interactive data visualization powered by the plotly backend.

The new Databricks pandas API also provides interactive data visualization, which is powered by the plotly backend.

For more details, see the blog post “Pandas API on Upcoming Apache Spark™ 3.2”.

Simplifying SQL migration

More ANSI SQL features (e.g., lateral join support) were added. After more than one year of development, the ANSI SQL mode is GA in this release. To avoid massive behavior-breaking changes, the mode `spark.sql.ansi.enabled` is still disabled by default. The ANSI mode includes the following major behavior changes (a minimal example follows the list):

  • Runtime errors are thrown instead of silently returning null results when the inputs to a SQL operator/function are invalid (SPARK-33275): for example, integer overflow errors in arithmetic operations, or parsing errors when casting strings to numeric/timestamp types.
  • Standardized type coercion syntax rules (SPARK-34246). The new rules define whether values of a given data type can be promoted to another data type implicitly based on the data type precedence list, which is more straightforward than the default non-ANSI mode.
  • New explicit cast syntax rules (SPARK-33354). When Spark queries contain illegal type casting (e.g., date/timestamp types are cast to numeric types) compile-time errors are thrown informing the user of invalid conversions.
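
As a quick illustration of the first change, the hedged sketch below toggles the mode on an active SparkSession (assumed to be available as `spark`); with ANSI mode on, the invalid cast raises a runtime error instead of returning NULL.

# With ANSI mode enabled, an invalid cast fails at runtime
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST('abc' AS INT)").show()   # raises a runtime error

# With ANSI mode disabled (the default), the same cast silently returns NULL
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT)").show()   # prints a single NULL row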

This release also includes some new initiatives that are not yet complete, for example, standardizing exception messages in Spark (SPARK-33539), introducing ANSI interval types (SPARK-27790) and improving the coverage of correlated subqueries (SPARK-35553).

Speeding up Spark SQL at runtime

Adaptive Query Execution (AQE) is enabled by default in this release (SPARK-33679). To improve performance, AQE can re-optimize query execution plans based on accurate statistics collected at runtime. In big data, the maintenance and pre-collection of statistics is expensive, and the lack of accurate statistics often causes inefficient plans, no matter how advanced the optimizer is. In this release, AQE becomes fully compatible with all the existing query optimization techniques (e.g., Dynamic Partition Pruning) to re-optimize join strategies, handle skew joins and coalesce shuffle partitions.

Both small data and big data should be processed in a highly efficient manner in a unified data analytics system, so short query performance also becomes critical. The overhead of Spark query compilation is significant for complex queries when the volume of processed data is relatively small. To further reduce query compilation latency, Spark 3.2.0 prunes unnecessary query plan traversals in analyzer/optimizer rules (SPARK-35042, SPARK-35103) and speeds up the construction of new query plans (SPARK-34989). As a result, the compile time of TPC-DS queries is reduced by 61% compared to Spark 3.1.2.
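
Since AQE is governed by ordinary Spark SQL configuration, here is a small sketch of inspecting and tuning it from a notebook (assuming an active SparkSession named `spark`):

# AQE is on by default in Spark 3.2
print(spark.conf.get("spark.sql.adaptive.enabled"))

# The related optimizations can be toggled individually if needed
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")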

Apache Spark 3.2 is now generally available on Databricks as part of Databricks Runtime 10.0

More scalable state processing streaming

The default implementation of the state store in Structured Streaming is not scalable, since the amount of state that can be maintained is limited by the heap size of the executors. In this release, Databricks contributed its RocksDB-based state store implementation, which has been used in Databricks production for more than four years, to the Spark community. This state store can avoid full scans by sorting keys and serve data from disk without relying on the executor heap size.

The high-level architecture of RocksDB state store

In addition, the state store APIs are enriched with an API for prefix match scan (SPARK-35861) to efficiently support event-time-based sessionization (SPARK-10816), which allows users to run aggregations on session windows over event time. For more details, please read the blog post “Native support of session window in Apache Spark’s Structured Streaming“.
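
To make this concrete, here is a hedged sketch of switching a streaming query to the RocksDB state store and aggregating over session windows; the `events` DataFrame and its `userId`/`eventTime` columns are assumptions for illustration.

from pyspark.sql import functions as F

# Use the RocksDB-backed state store provider introduced in Spark 3.2
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)

# Event-time sessionization with the new session_window function
sessionized = (
    events  # a streaming DataFrame with userId and eventTime columns (assumed)
    .withWatermark("eventTime", "10 minutes")
    .groupBy(F.session_window("eventTime", "5 minutes"), "userId")
    .count()
)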

Other updates in Spark 3.2

In addition to these new features, the release focuses on usability, stability, and refinement, resolving around 1700 JIRA tickets. It’s the result of contributions from over 200 contributors, including individuals as well as companies such as Databricks, Apple, LinkedIn, Facebook, Microsoft, Intel, Alibaba, NVIDIA, Netflix, Adobe and many more. We’ve highlighted a number of key SQL, Python and streaming advancements in Spark in this blog post, but there are many additional capabilities in the 3.2 milestone, including codegen coverage improvements and connector enhancements, which you can learn more about in the release notes.

The Apache Spark 3.2 release includes a long list of major and minor enhancements, focused on usability, stability and refinement, and reflects the work of 200+ contributors across around 1700 JIRA tickets.

Get started with Spark 3.2 today

If you want to try out Apache Spark 3.2 in the Databricks Runtime 10.0, sign up for the Databricks Community Edition or Databricks Trial, both of which are free, and get started in minutes. Using Spark 3.2 is as simple as selecting version “10.0” when launching a cluster.

Databricks Runtime version selection when creating a cluster.

 

--

Try Databricks for free. Get started today.

The post Introducing Apache Spark™ 3.2 appeared first on Databricks.


Introducing SQL User-Defined Functions

$
0
0

A user-defined function (UDF) is a means for a user to extend the native capabilities of Apache Spark™ SQL. Spark SQL has supported external user-defined functions written in Scala, Java, Python and R programming languages since 1.3.0. While external UDFs are very powerful, they also come with a few caveats:

  • Security. A UDF written in an external language can execute dangerous or even malicious code. This requires tight control over who can create UDFs.
  • Performance. UDFs are black boxes to the Catalyst Optimizer. Given Catalyst is not aware of the inner workings of a UDF, it cannot do any work to improve the performance of the UDF within the context of a SQL query.
  • SQL Usability. For a SQL user, it can be cumbersome to write UDFs in a host language and then register them in Spark. Also, many of the extensions users want to make to SQL are simple enough that developing an external UDF is overkill.

To cope with the above limitations, we are thrilled to introduce a new form of UDF: SQL UDFs. Available in DBR 9.1 LTS, the SQL UDF is completely defined with the expressive power of SQL and also completely transparent to the SQL Compiler.

Benefits of using SQL UDFs

SQL UDFs are simple yet powerful extensions to Spark SQL. As functions, they provide a layer of abstraction to simplify query construction – making SQL queries more readable and modularized. Unlike UDFs that are written in a non-SQL language, SQL UDFs are more lightweight for SQL users to create. SQL function bodies are transparent to the query optimizer thus making them more performant than external UDFs. SQL UDFs can be created as either temporary or permanent functions, be reused across multiple queries, sessions and users, and be access-controlled via Access Control Language (ACL). In this blog, we will walk you through some key use cases of SQL UDFs with examples.

SQL UDFs as constants

Let’s start with the most simplistic function imaginable: a constant. We all know we’re not supposed to use literals in our code because it harms readability and, who knows, maybe the constant doesn’t remain constant after all. So we want to be able to change it in one place only:

CREATE FUNCTION blue()
  RETURNS STRING
  COMMENT 'Blue color code'
  LANGUAGE SQL
  RETURN '0000FF'

If you are familiar with external UDFs, you can see there are some differences that stand out:

  1. A SQL UDF must define its parameter list, even if it’s empty. A constant takes no parameters.
  2. The function also declares the data type it will return. In this case that’s a STRING.
  3. The implementation of the function is part of the function definition.
  4. You specify LANGUAGE SQL to say that it’s a SQL UDF. But really, that’s not needed. The RETURN clause is enough of a giveaway, so we decided to make this clause optional.

Beyond these differences there are many other things that are the same as external UDF:

  • You can replace a function. More on that later.
  • You can add a comment that describes the function – as shown above.
  • You can even create a temporary function that you can use within the current session, only.

Let’s use the function:

SELECT blue();
0000FF

Unsurprisingly this works. But what is happening under the hood?

EXPLAIN SELECT blue();
== Physical Plan ==
*(1) Project [0000FF AS default.blue()#9]
+- *(1) Scan OneRowRelation[]

This is neat! The SQL compiler replaced the function invocation with the constant itself.
That means at least this SQL UDF comes at zero cost to performance.

Now, let’s have a look at another common usage pattern.

SQL UDF encapsulating expressions

Imagine you don’t like the naming of some built-in functions. Maybe you are migrating lots of queries from another product, which has different function names and behaviors. Or perhaps you just can’t stand copy-pasting some lengthy expressions over and over again in your SQL queries. So, you want to fix that.

With SQL UDF, we can simply create a new function with the name we like:

CREATE FUNCTION to_hex(x INT COMMENT 'Any number between 0 - 255')
  RETURNS STRING
  COMMENT 'Converts a decimal to a hexadecimal'
  CONTAINS SQL DETERMINISTIC
  RETURN lpad(hex(least(greatest(0, x), 255)), 2, 0)

Let’s have a look at what new syntax was used here:

  • This function takes an argument, and the parameter is defined by a name, a type and an optional comment.
  • The CONTAINS SQL clause is optional, but tells us the function does not read or modify any data in a table. It is the default setting, so you normally wouldn’t specify it.
  • DETERMINISTIC is also optional and tells us that the function will always return the same result set given the same arguments. The clause is for documentation only at this point. But at some point in the future it may be used to block non deterministic functions in certain contexts.
  • In the RETURN clause the parameter has been referred to by name. In more complex scenarios below you will see that the parameter can get disambiguated with the function name. Naturally you can use arbitrarily complex expressions as the function body.

Not only does it work …

SELECT to_hex(id) FROM range(2);
00
01

… but it works well:

EXPLAIN SELECT to_hex(id) FROM range(2);
== Physical Plan ==
*(1) Project [lpad(hex(cast(least(greatest(0, cast(id#0 as int)), 255) as bigint)), 2, 0) AS default.to_hex(id)#1]
+- *(1) Range (0, 2, step=1, splits=4)

We can see that the physical plan shows a straight application of the functions lpad, hex, least and greatest. This is the same plan you get invoking the series of functions directly.

You can also compose SQL functions out of SQL functions:

CREATE FUNCTION rgb_to_hex(r INT, g INT, b INT)
  RETURNS STRING
  COMMENT 'Converts an RGB color to a hex color code'
  RETURN CONCAT(to_hex(r), to_hex(g), to_hex(b))


SELECT rgb_to_hex(0, 0, 255);
0000FF

SQL UDF reading from tables

Another common usage of SQL UDF is to codify lookups. A simple lookup may be to decode RGB color codes into English color names:

CREATE FUNCTION from_rgb(rgb STRING 
                             COMMENT 'an RGB hex color code') 
   RETURNS STRING
   COMMENT 'Translates an RGB color code into a color name' 
   RETURN DECODE(rgb, 'FF00FF', 'magenta',
                      'FF0080', 'rose');

SELECT from_rgb('FF0080');
rose

OK, but there are a lot more than two colors in this world. And we want this translation both ways, so these should really be in a lookup table:

CREATE TABLE colors(rgb STRING NOT NULL, name STRING NOT NULL);
INSERT INTO colors VALUES
  ('FF00FF', 'magenta'),
  ('FF0080', 'rose'),
  ('BFFF00', 'lime'),
  ('7DF9FF', 'electric blue');

CREATE OR REPLACE FUNCTION
from_rgb(rgb STRING COMMENT 'an RGB hex color code') 
   RETURNS STRING
   READS SQL DATA SQL SECURITY DEFINER
   COMMENT 'Translates an RGB color code into a color name'
   RETURN SELECT FIRST(name) FROM colors WHERE rgb = from_rgb.rgb;

 
SELECT from_rgb(rgb) 
  FROM VALUES('7DF9FF'),
  ('BFFF00') AS codes(rgb);
electric blue
lime

There are multiple new concepts applied here:

  • You can REPLACE a SQL UDF. To be allowed to do that, the new function must match the old function’s signature. The signature of a function is defined as the number of its parameters and their types.
  • This function looks up information in a table, so you can optionally document that using READS SQL DATA. If you state nothing the SQL Compiler will derive the correct value, but you must not lie and state CONTAINS SQL.
  • SQL SECURITY DEFINER is another optional clause, which states that the query accessing the colors table will use the authorization of the function owner. So the function could be executed by the public without compromising the security of the table.
  • Just as the function operates under the authorization of its owner it will always be parsed using the current database at time of creation.
  • `rgb` is also the name of a column in the colors table. By qualifying the parameter as `from_rgb`.`rgb` you clarify that you mean the parameter reference, and not the column.

What does the physical plan look like now? It is easy to see that an external UDF, which itself performs a query for every row (effectively a nested loop join), would be an awful way to burn precious resources.

EXPLAIN SELECT from_rgb(rgb) 
  FROM VALUES ('7DF9FF'), 
              ('BFFF00') AS codes(rgb);

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [first(name)#1322268 AS default.from_rgb(rgb)#1322259]
   +- BroadcastHashJoin [rgb#1322261], [rgb#1322266], LeftOuter, BuildRight, false
      :- LocalTableScan [rgb#1322261]
      +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(input[1, string, false]),false), [id=#1437557]
         +- SortAggregate(key=[rgb#1322266], functions=[finalmerge_first(merge first#1322271, valueSet#1322272) AS first(name#1322267)()#1322260])
            +- Sort [rgb#1322266 ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(rgb#1322266, 200), ENSURE_REQUIREMENTS, [id=#1437553]
                  +- SortAggregate(key=[rgb#1322266], functions=[partial_first(name#1322267, false) AS (first#1322271, valueSet#1322272)])
                     +- Sort [rgb#1322266 ASC NULLS FIRST], false, 0
                        +- FileScan parquet default.colors[rgb#1322266,name#1322267]

In this case, Catalyst has chosen a broadcast hash join instead of a nested loop join. It can do this because it understands the content of the SQL UDF.

Thus far, all examples discussed used scalar-valued functions – ones that return a single value. That result may be of any type, even complex combinations of structs, arrays, and maps. There is also another type of UDF to discuss – the table-valued UDF.

SQL Table UDF

Imagine if views took arguments! You could encapsulate complex predicates even if they rely on user-provided values. A SQL Table UDF is just that: a view by any other name, except with parameters.

Let’s assume that the color mapping above is not unique. At the very least, we can assert the color names differ across languages.

Therefore the `from_rgb` function needs to be modified to return either an array of names or a relation.

INSERT INTO colors VALUES ('BFFF00', 'citron vert');


CREATE OR REPLACE FUNCTION 
     from_rgb(rgb STRING COMMENT 'an RGB hex color code') 
   RETURNS TABLE(name STRING COMMENT 'color name')
   READS SQL DATA SQL SECURITY DEFINER
   COMMENT 'Translates an RGB color code into a color name'
   RETURN SELECT name FROM colors WHERE rgb = from_rgb.rgb;

As you can see, the only difference compared to a scalar function is a more complex RETURNS clause. Unlike views, SQL UDFs mandate a declaration of the returned relation’s signature:

  • TABLE specifies that the function returns a relation.
  • The TABLE clause must include a name for each return column and the column’s data type.
  • You may optionally specify a comment for any return column.

User-defined table functions are new to DBR. Let’s have a look at how to invoke them.

SELECT * FROM from_rgb('7DF9FF'); 
electric blue

In its simplest form, a table function is invoked in the same way, and in the same places, a view is referenced. The only difference is the mandatory parentheses, which enclose the function’s arguments. This function is invoked with literal arguments, but the arguments can be any expression, even scalar subqueries.

Most powerful, however, is the usage of SQL table UDF in a join, typically a correlated cross join:

SELECT rgb, from_rgb.name 
    FROM VALUES('7DF9FF'),
               ('BFFF00') AS codes(rgb),
         LATERAL from_rgb(codes.rgb);  
7DF9FF	electric blue
BFFF00	lime
BFFF00	citron vert

Here the arguments refer (correlate) to a preceding (lateral) relation in the FROM clause. The new LATERAL keyword gives Catalyst permission to resolve these columns. Also note that you can refer to the result of the table function by naming the columns as defined in the result signature and optionally qualified by the function name.

Administration

Naturally, SQL UDFs are fully supported by the existing GRANT, REVOKE, SHOW, DESCRIBE and DROP statements.

The statement worth pointing out in more detail is DESCRIBE.

DESCRIBE FUNCTION from_rgb;
Function: default.from_rgb
Type:     TABLE
Input:    rgb STRING 
Returns:  name STRING

The basic describe returns what you might expect, but the extended DESCRIBE adds significantly more detail:

DESCRIBE FUNCTION EXTENDED from_rgb;
Function:    default.from_rgb
Type:        TABLE
Input:       rgb STRING 'an RGB hex color code'
Returns:     name STRING 'color name'
Comment:     Translates an RGB color code into a color name
Data Access: READS SQL DATA
Configs:     spark.sql.datetime.java8API.enabled=true
             spark.sql.hive.version=2.3.7
Owner:       serge.rielau
Create Time: Wed Sep 08 08:59:53 PDT 2021
Body:        SELECT name FROM colors WHERE rgb = from_rgb.rgb

Outlook

What we have described represents the initial functionality for SQL UDF. Future extensions we are pondering include support for:

  • SQL PATH, so you can create a library of functions in a database and subscribe to them from another, just as you would do in your file system.
  • Overloading of UDFs.
  • UDFs with default values for parameters.

SQL UDFs are a big step forward in SQL usability and can be used in many different ways as outlined in this blog. We encourage you to think of even more creative ways to leverage SQL UDFs, be it in Databricks SQL or using Photon for Data Engineering jobs. Try the notebook here and see the documentation for more information.

--

Try Databricks for free. Get started today.

The post Introducing SQL User-Defined Functions appeared first on Databricks.

Simplifying Data + AI, One Line of TypeScript at a Time

$
0
0

Today, Databricks is known for our backend engineering, building and operating cloud systems that span millions of virtual machines processing exabytes of data each day. What’s not as obvious is the focus on crafting user experiences that make data more accessible and usable.

We thought it’s time to highlight that via a series of posts on the people, the work, and impact they have had on our customers and the ecosystem. In this first post, we cover the founders’ stories in this area and some of the technical and product challenges we experience. In the future, we will cover newer works such as visualization and micro frontends.

Let’s get started.

Databricks’ founding thesis: can’t simplify data without great UI/UX

Ever since we started doing research on large-scale computing at UC Berkeley, our goal was to make it accessible to many more people. The state-of-the-art back then required a team of engineers working in Java (or C++) for weeks to process terabytes of data. Our work on Apache Spark made it possible for everyone to run distributed computations with just a few lines of Python or SQL.

But that wasn’t enough. From the early days, we saw that many of the exciting early applications of Spark were interactive — for example, one of the most mind-blowing was a group of neuroscientists visualizing zebrafish brain activity in real time to understand how the brain worked. Seeing these applications, we realized a great compute engine could only get us so far: further expanding access to big data would also require new, high-quality user interfaces for both highly-trained developers and more citizen data consumers.

As our first product, we built the world’s first collaborative and interactive notebook for data science, designing a frontend that could display and visualize large amounts of data, a backend that automatically sliced and recomputed data as users manipulated their visualizations, and a full collaborative editing system that allowed our users to work on the same notebook simultaneously and to visualize streaming updates of the data.

Our funding pitch demo to Andreessen Horowitz contained no changes to Spark — it just showed how an interactive, cloud-based interface based on it could make terabytes of data usable in seconds. Our pitches to customers were the same, and they loved it!

Everybody, even the “business person”, had to learn JavaScript

The founding team had strong pedigrees in backend systems (all of us had systems/database PhDs), but we didn’t know how to attract frontend engineers. So we had to take matters into our own hands; the founders each got copies of “JavaScript: The Definitive Guide” and “JavaScript: The Good Parts,” and read them cover-to-cover over summer break before we started the company. (We were happy that the “Good Parts” was only 176 pages.)

Among our 7 cofounders, Arsalan was our “business guy.” He had received his PhD in computer networks from Berkeley a few years back, and had been working at McKinsey as a partner. We thought negotiating partnerships and striking deals wouldn’t quite fill 100% of his time, so we asked him to get up to speed on JavaScript before the company started.

Imagine this: after meeting some CEOs and CFOs, this McKinsey consultant in a suit boards first class, stows his Briggs & Riley carry-on, and then pulls out his copy of JavaScript: The Good Parts.

The whole team pitched in: our CEO wrote the original visualization in D3, Matei wrote the initial file browser, and Arsalan implemented the commenting feature in notebooks that our users still love today.

Although the founding team did not shy away from picking up JavaScript and building the initial product, over the past few years we have significantly expanded the team to bring on more frontend and UX experts. They have taught us a lot and have completely modernized our frontend stack (e.g. with Jest, React, Next.js, TypeScript, Yarn).

Challenges in UI/UX for data and AI

We have also found that we had many UX and engineering challenges that most frontend applications do not run into. These challenges include:

  • Displaying large amounts of data efficiently. Our users want to explore and visualize massive datasets, showing as many records as possible on their screens. This meant that our table and plot controls all had to be as fast and robust as possible in the face of large, possibly irregular datasets. We also tested them heavily to make them robust — early on, we found many customer workloads that could easily crash their web browser, from the table with 2000 columns to the row with a 100 MB text field. Our frontend and backend now handle all these cases. Even today, we are constantly pushing the boundary of what’s possible in the browser as our customers’ workloads are becoming ever more demanding.
  • Designing UI for long-running parallel tasks. Sometimes users ask for something that will take a while to compute (e.g. running on petabytes of data), so how do we ensure they feel that the system is fast and responsive? By giving them meaningful progress or even letting them see approximate results before the query completes. One example is our plot control’s ability to quickly render data based on a frontend sample, and then push large queries to the backend on all data.
  • Letting users rapidly create shareable production applications. We found that most users who do an analysis interactively then want to turn it into a dashboard and publish it to their team– and they don’t want to leave their data analysis product to do it. Thus, we’ve built publish workflows in notebooks that let users combine their results into a usable, publishable report as quickly as possible. Our dashboards now reach hundreds of thousands of users worldwide. For example, when COVID started, our Amsterdam team spotted a Databricks dashboard tracking cases on their TV news.
  • Integrating with engineering workflows. The data products built on Databricks are increasingly powering mission-critical applications. As a result, while data scientists and analysts want to explore their data quickly, they also want to follow engineering best practices to introduce rigor, such as managing code in Git or running CI/CD. A lot of our work focuses on enabling less technical users to leverage similar tools or concepts for engineering rigor in their own workflows.
We’ve started designing the new Databricks workspace, with spaces for each of the main user personas. These spaces address the critical user journeys [CUJs] for each persona, defined in collaboration with the UX team, engineers and product management.

We’ve started designing the new Databricks workspace, with spaces for each of the main user personas. These spaces address the critical user journeys [CUJs] for each persona, defined in collaboration with the UX team, engineers and product management.

We’re just getting started

We are humbled by the impact Databricks has had on our customers. Among them are neuroscientists trying to understand how the brain works, energy engineers reducing energy consumption for whole continents, and pharmaceutical researchers speeding up the discovery of the next important drugs.

But we haven’t solved all the problems. Our explosive growth has created even more challenges to solve, and we feel we are just getting started here. For too long, our industry has built the most sophisticated technologies for data behind code-based interfaces. The first step towards democratizing data and AI is to create graphical user interfaces to significantly simplify critical user journeys.

Our team needs to build interfaces to make it easier to present and consume data, but we also need to allow our users to build dashboards that make their data easier for downstream users to explore and understand.


As a recent example, we built a new data explorer UI for easier exploration of data (with zero backend changes). Right after we shipped it, we received a message from our customer Jake: “Data Explorer is night and day better. Whatever witchcraft happened here is heavenly.” We know that we can do the same in many other parts of users’ workflows.

Come build the future of data and AI with us. Your work might be used to create the next cancer drug, catch the next cyberattack, or even explain the next big story on the evening news.

JOIN OUR TEAM!


The post Simplifying Data + AI, One Line of TypeScript at a Time appeared first on Databricks.

Curating More Inclusive and Safer Online Communities With Databricks and Labelbox

$
0
0

This is a guest authored post by JT Vega, Support Engineering Manager, Labelbox.

While video games and digital content are a source of entertainment, connection, and fun for many around the world, they are also frequently a destination for toxic behavior that can include flaming, trolling, cyberbullying, and hate speech in the form of user-generated content. Social media platforms and video game developers are both aiming to fight online toxic behavior with the latest advances in AI. However, the reality is that AI training often begins with manual and laborious labeling efforts, where teams sift through piles of toxic and benign user comments in order to categorize content for model training.

Finding faster and more cost-effective ways to convert unstructured text data into structured data is highly beneficial for supporting more advanced use cases that identify and remove unwanted content. The business benefits include enhancing the work and efficiency of human moderators while creating online communities where people can engage with each other free from harassment.

Easily upload text data from Databricks into Labelbox for annotation


In-game toxicity models can actually hurt the gamer experience if they have high false-negative or false-positive rates. False negatives allow toxic behavior to continue unabated, and false positives can flag healthy players for removal. Active Learning is an efficient process that helps to reduce false positives and false negatives. To facilitate Active Learning, Labelbox allows you to quickly inspect predictions from your model and approve or correct them. You can then use your corrected labels to retrain your model so it will not make the same mistake in the future.

Use the Labelbox Connector to load annotations into Databricks


(Disclaimer: the content provided is used for illustrative purposes that can be considered offensive or objectionable)

An example of labeling unstructured text data imported from Databricks in Labelbox to classify toxicity

The Labelbox Connector supports unstructured data workflows

You can also store model embeddings in Labelbox to facilitate analysis through dimensionality reduction. For example, your model embeddings may reveal new groupings of data that you had not previously thought of before. Perhaps you’ll also find specific types of data where you have a high false negative or false positive rate.

You can check out the project featured in this blog post here. While these demo notebooks are tailored for battling toxic content, you can broadly apply them to other NLP use-cases where high-quality training data is required to train AI models. You can learn more about the Databricks and Labelbox integration by watching this talk from Data & AI Summit 2021. Questions? Reach out to us at ecosystem+databricks@labelbox.com.

Download the Toxicity Solutions Accelerator and try Model Assisted Labeling with Labelbox today! To enable Model Assisted Labeling on your free trial of Labelbox, please reach out to ecosystem+databricks@labelbox.com.

--

Try Databricks for free. Get started today.

The post Curating More Inclusive and Safer Online Communities With Databricks and Labelbox appeared first on Databricks.

How Bread Standardized on the Lakehouse With Databricks & Delta Lake

$
0
0

This is a collaborative post from Bread Finance and Databricks. We thank co-author Christina Taylor, Senior Data Engineer–Bread Finance, for her contribution.

Bread, a division of Alliance Data Systems, is a technology-driven payments company that integrates with merchants and partners to personalize payment options for their customers. The Bread platform allows merchants to offer more ways to pay over time, serving up the right options at the right time, empowering merchants to improve conversion rates and lift average-order-value. Bread currently services over 400 merchants — notably GameStop (Canada), SoulCycle, and Equinox (US) — and continues to grow. The platform is driven by big data use cases such as financial reporting, fraud detection, credit risk, loss estimation and a full-funnel recommendation engine.

The Bread platform, running on Amazon Web Services (AWS) Cloud, consists of several dozen microservices. Each microservice represents a part of a user or merchant’s journey, for example: A customer selects a payment option or applies for a loan; a merchant manages the transaction lifecycle or tracks settlement details. Every microservice writes to its own Postgres database, and the databases are isolated from each other by design. For internal business analysis and external partnership reporting, we need a centralized repository where all data from different services come together for the first time.

Existing implementation

Our first iteration of ingestion was a data sync Python module that dumped all databases and tables nightly as CSV files, copied the files into the Snowflake warehouse’s raw schema, and overwrote the existing tables every night. We then used dbt (Data Build Tool) and a custom decryption module — also run as Python containers — to transform the data and make it reporting-ready. See the diagram below.

Legacy Bread data ingestion workflow

Challenges

While the above ingestion workflow successfully enabled reporting, there were a few significant challenges. The most pressing one was scalability. Our Python module was run by a KubernetesPodOperator on an Airflow cluster in our AWS Cloud. It was subjected to the compute resources (~1GB CPU, ~500 MB memory; 3 times the default) allocated to the pod, and overall extra capacity provisioned by the Airflow cluster. As total data volume grew from Gigabytes to Terabytes, the time it took to run the data sync job in one deployment had also grown from minutes to hours, straining pod resources and creating latency for downstream data transformation. We needed a better solution that could scale with our business as the number of transactions and partners increases.

The second challenge was schema evolution. Microservices continue to evolve and schema changes can occur every week. While we could “automatically” respond by dropping and recreating the tables in Snowflake, we had neither knowledge of the change nor time to update the downstream data models. As a result, the transformation jobs often error on schema change. We needed a solution that could warn us of schema changes and was more fault-tolerant.

The last challenge was velocity. As our team and usage both grew, there was an increasing need for timely ingestion. While Day -1 updates may be sufficient for reporting, internal BI functionalities — especially risk and fraud analytics — required fresh data from our applications. We needed a solution that provides near real-time data.

Proposal

To summarize, we needed a platform to provide:

  1. Scalable computing independent of memory restrictions of kubernetes pods
  2. Storage option which offered simple but safe schema evolution
  3. Ability to transition from batch to streaming with one-line code changes

Fortunately, Delta Lake running on Databricks provided the solution to all of the above. We set out to build: 1) a formal change data capture process instead of a naive data dump, 2) Apache Spark™ instead of Python modules for ingestion, and 3) Databricks instead of Snowflake for computing. We also wanted to continue supporting data models and users on Snowflake until we could fully migrate to Databricks.

Lakehouse for a transaction enrichment pipeline

The vision of a lakehouse is to deliver on the business use cases described in the first paragraph in this article. The key is to set a foundation in Delta Lake, empowering data science and analytics engineering to run jobs and analyze data where it exists without incurring costs on egress/ingress. Additionally, since Bread is always looking to optimize for data freshness, the core capabilities had to involve a robust engine for reliable and speedy ingestion.

The key for Bread to select the Databricks Lakehouse Platform was its foundation in Delta Lake, which empowers its data science and analytics engineering to run jobs and analyze data where it exists without incurring costs on egress/ingress.

DMS & Auto Loader for change data ingestion

Inspired by this blog, we chose AWS DMS (Database Migration Services) for database snapshotting and change data capture. The source was our microservices which are backed by Postgres databases (RDS); the target was a collection of Amazon S3 buckets. We then ingested the DMS data with Auto Loader and continuously upserted change sets into Delta Lake. We also refactored external jobs using the newly available Databricks SQL Connector. The following sections explain our rationale and implementation in greater technical detail.

DMS configuration

In our setup, for each microservice, there is a corresponding DMS task and S3 bucket. The migration consists of 3 major phases:

  1. The snapshot of existing data (full load)
  2. The application of cached changes
  3. Ongoing replication (CDC)

We configured the extra connection attributes as such

cdcInsertsOnly=false;compressionType=GZIP;dataFormat=parquet;datePartitionEnabled=true;DatePartitionSequence=YYYYMMDD;includeOpForFullLoad=true;parquetTimestampInMillisecond=true;timestampColumnName=timestamp;DatePartitionDelimiter=NONE;

Given the above configuration, full load files are written to <microservice_bucket>/<schema_name>/<table_name>/LOAD*.parquet

CDC files are written to <microservice_bucket>/<schema_name>/<table_name>/yymmdd/*.parquet

The extra connection attributes partition the change data by date and add an “Op” column with “I”, “U”, or “D” possible values, indicating if the change is an insert, update, or delete operation.

An important customization for us involves limitations when using S3 as a DMS target. Some of our source table columns store large objects (LOBs). When using S3 as a target, full LOB mode is not supported; we must specify a LOB MaxSize in the DMS task settings, and DMS LOB columns will appear as Spark StringType. The MaxLobSize parameter is 32 KB by default. Based on our calculation, we needed to increase the value to prevent string truncation.

SELECT max(pg_column_size(col_name)) from source_table;
-------
17268

DMS replication handles each character as a double-byte character. Therefore, find the length of the longest text value in the column (max_num_chars_text) and multiply it by 2 to get the value for Limit LOB size. Since our data also includes 4-byte characters, we multiply by 2 again: 17268 * 4 ≈ 70 KB.
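
As a quick sanity check on the arithmetic (the 17268 figure comes from the SQL query above):

# Longest text value in the column, from the SQL query above
max_num_chars_text = 17268

# Double-byte handling (x2), then 4-byte characters (x2 again)
lob_limit_kb = max_num_chars_text * 2 * 2 / 1024
print(round(lob_limit_kb))  # ~67, so ~70 KB is a safe "Limit LOB size" setting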

Spark jobs

For each microservice, there is a snapshot Spark job that traverses the S3 DMS directory, finds all tables, loads the data into Databricks Delta Lake and creates the initial tables. This is followed by CDC ingestion Spark jobs that locate all tables, find the latest state of each record, and merge the changed data into the corresponding Delta tables. Each time we run CDC ingestion, we also keep track of the schema and store the current version on S3.

When ingesting DMS change data, it is critical to identify the primary key of the source tables. For most of our microservices, the primary key is “ID”. Some tables do not observe this naming convention, and others use composite primary keys. Therefore, the key columns to merge on must be declared explicitly or created; for composite primary keys, we concatenate the key columns into a single merge key (a small sketch follows the ingestion code below).

// Snapshot data ingestion
snapshotData
 .withColumn("latest", to_timestamp(col("timestamp")))
 .drop("timestamp")
 .write
 .format("delta")
 .mode("overwrite")
 .option("overwriteSchema", "true")
 .option("path",deltaTablePath)
 .saveAsTable(deltaTableName)

// Change data ingestion
changeData.writeStream
 .format("delta")
 .foreachBatch(
   Platform.cdcMerge(
     tableName = deltaTableName,
     pkColName = pkColName,
     timestampColName = "timestamp"
   ) _
 )
 .outputMode("update")
 .option("checkpointLocation", checkpointPath)
 .option("mergeSchema", "true")
 .trigger(Trigger.Once())
 .start()

// Merge function
def cdcMerge(tableName: String, pkColName: String, timestampColName: String)(
 microBatchChangeData: DataFrame,
 batchId: Long
): Unit = {
 DeltaTable
   .forName(tableName)
   .as("t")
   .merge(
     microBatchChangeData
       .transform(
         findLatestChangeByKey(
           windowColName = pkColName,
            strTimeColName = timestampColName
         )
       )
       .as("c"),
     s"c.${pkColName} = t.${pkColName}"
   )
   .whenMatched("c.Op == 'D'")
   .delete()
   .whenMatched()
   .updateAll()
   .whenNotMatched("c.Op != 'D'")
   .insertAll
   .execute()
}
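
For the composite-key case mentioned above, the idea is simply to concatenate the key columns into a single merge key before the CDC merge. Here is a PySpark sketch of that step (the actual jobs are written in Scala, and the column names here are hypothetical):

from pyspark.sql import functions as F

# Build a single surrogate merge key from a composite primary key
change_data = change_data.withColumn(
    "merge_key", F.concat_ws("::", F.col("order_id"), F.col("line_item_id"))
)
# "merge_key" can then be passed as pkColName to the CDC merge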

Note: Depending on how microservices perform updates — for instance, when records are replaced in place — there can be concurrent inserts and updates. In this case, finding the latest change by key may require custom ordering. Additionally, change data can arrive out of order. We may receive a DMS file containing the eventual delete operation before the file with insert or update. Special handling such as CDC timestamp marking and using a “premature delete flag” may be needed to prevent insertion of actually deleted data.

Why use Auto Loader?

Databricks Auto Loader can automatically ingest files on cloud storage into Delta Lake. It allows us to take advantage of the bookkeeping and fault-tolerant behavior built-in Structured Streaming, while keeping the cost down close to batching.
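
Here is a hedged sketch of what such an ingestion could look like with Auto Loader; the S3 paths are placeholders, not Bread's actual buckets.

# Incrementally pick up new DMS parquet files with Auto Loader (directory listing mode)
change_data = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .load("s3://microservice-bucket/schema_name/table_name/")  # placeholder path
)

# Write to Delta with streaming bookkeeping but a batch-like cost profile
(change_data.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://checkpoints/table_name/")  # placeholder path
    .trigger(once=True)
    .start("s3://delta/table_name/"))  # placeholder path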

Cost savings

Why not a traditional structured streaming job? Streaming job clusters are on 24/7. The cluster can scale up, but not down. During testing, we called the cluster API and forced the cluster to scale down every 2 hours. In comparison, when we use the run once trigger to process files at desired intervals (every 2 hours), our compute cost decreased by more than 90%, even with our naive scaler in place.

Streaming vs batch

How is using Auto Loader different from simply running a batch job? We do have batch jobs that load daily partitioned files from S3. In the batch processing scenario, we set up an S3 sensor and a replace where logic to reprocess when necessary. Structured Streaming, on the other hand, commits all files created by the job to a log after each successful trigger. In the event of failure, we can simply pick up where we left off without having separate processes to remove incorrect or duplicated data.

Notification vs directory listing mode

We have seen DMS output many small files in the change data partition — typically several hundred in each day’s partition. Auto Loader’s notification mode can reduce the time each Spark job spends listing files prior to ingestion. However, due to AWS limitations, file notification does not have a definite SLA. We have observed that some files landed on S3 did not get discovered until the next day. As each business day’s transactions must be reported to our partners before a cutoff time, notification mode is not a reliable option for us.

Fortunately, in Databricks Runtime 9.0 and above, file listing has been greatly optimized; more details on this improvement can be found here. In our scenario, each job run now takes only about two-thirds of the time it took on DBR 8.4, and the gap compared to using notification mode on 8.4 is negligible. We no longer need to sacrifice performance to guarantee data freshness.

Use Databricks SQL connector to decrypt PII for data scientists

To fully migrate to the Lakehouse, we need to refactor several jobs running on external systems connected to Snowflake, most notably PII decryption on Amazon ECS. A subset of transformations relies on decrypted data and is critical to BI work, so we must minimize migration risk and prevent disruption to business functions.

The ECS cluster is configured with access to the private keys for decryption; the keys are shared with microservices and stored in Vault. The job writes pandas DataFrames to Snowflake, replacing the existing data each night. We still need to solve the following challenges:

  1. How do we keep the existing ECS setup and secrets management strategy?
  2. Is it possible to write to Delta Lake without installing Apache Spark as a dependency?

Thanks to the Databricks SQL Connector, we are able to add the databricks-sql-connector Python library to ECS and open a connection to a Databricks SQL endpoint, enabling a simple data flow that writes pandas DataFrames to Delta Lake. More details on this connector can be found here.

from os import getenv
from time import sleep

from databricks import sql

# Retry the connection up to conn_timeout times (30s apart) before giving up
connection = None
for attempt in range(conn_timeout):
    try:
        connection = sql.connect(
            server_hostname=getenv("DATABRICKS_HOST"),
            http_path=getenv("SQL_CONN"),
            access_token=getenv("DATABRICKS_TOKEN"),
        )
        break
    except Exception as err:
        logger.warning(err)
        logger.info("retrying in 30s")
        sleep(30)

The Databricks SQL Connector, newly released at the time of writing, is a good fit for remote connections to Databricks SQL endpoints or clusters.

The connector provides enough flexibility for us to decrypt in chunks and upsert the data into Delta Lake, which improves performance compared to decrypting all records and replacing the entire table in Snowflake.

import math

num_records = df.shape[0]
batch_num = math.ceil(num_records / batch_size)  # number of chunks, including the final partial one
cursor = connection.cursor()

for i in range(batch_num):
    pdf = df.iloc[i * batch_size : (i + 1) * batch_size]
    insert_values = pdf.to_records(index=False).tolist()
    values_clause = ",".join([str(row) for row in insert_values])
    query = f"""MERGE INTO {database_name}.{delta_table_name} AS Target
            USING (SELECT * FROM (VALUES {values_clause}) AS s ({key_col}, {",".join(val_cols)})) AS Source
            ON Target.id = Source.id
            WHEN MATCHED THEN UPDATE SET *
            WHEN NOT MATCHED THEN INSERT *"""
    cursor.execute(query)

cursor.close()

Spark connector vs external tables

To support Snowflake reporting work and user queries during the migration, we tested Delta Lake integration with Snowflake external tables. Eventually, we opted to use the Spark connector to copy Delta tables into Snowflake prior to high-profile, time-sensitive reporting tasks (a sketch of this copy job follows the list below). Here are our main reasons for moving off external tables:

  • Frequent schema changes: Although we configured auto refresh using S3 notifications and a queue, Snowflake cannot merge or update schemas automatically the way Delta Lake does, and CREATE OR REPLACE became incompatible with external table auto refresh.
  • Performance concerns: External tables proved to be roughly 20% slower than copying the data over with the Spark connector.
  • Inconsistent views of partitioned, vacuumed, and optimized tables: Maintaining external tables became a blocker for Delta Lake optimization.
  • Lack of documentation and references: External table configuration is complex and experimental in nature, and comprehensive, accurate documentation proved hard to find.
  • Loss of functionality within Snowflake: Very limited ability to audit and debug external table freshness and validity issues.
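
As referenced above, the copy into Snowflake is a straightforward Spark connector write. The sketch below shows the general shape of such a job in PySpark; the connection options and table names are placeholders, and in practice credentials are resolved from a secret scope rather than hardcoded.

# Hedged sketch: publish a Delta table to Snowflake with the Spark connector
sf_options = {
    "sfUrl": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

(
    spark.table("silver.transactions")   # placeholder Delta table to publish
    .write.format("snowflake")
    .options(**sf_options)
    .option("dbtable", "TRANSACTIONS")
    .mode("overwrite")                   # replace the Snowflake copy before reporting
    .save()
)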

Future data science directions

As we productionize DMS/CDC ingestion and the Databricks SQL connector, we are centralizing all of our raw data in Delta Lake, forming a single source of company facts. We are now ready to build out the Lakehouse vision, moving computation and queries to Databricks SQL and paving the way for near real-time data science and analytics work. Below is an illustration of our platform pipeline (solid lines for the current state; dotted lines for the future state):

Bread’s Lakehouse vision, moving computation and query to Databricks SQL, and paving the way for near real-time data science and analytics work.

Delta Live Tables + expectations for rapid prototyping

Our current BI analysis flow requires data engineers to write Spark jobs and deploy dbt models. To accelerate ML development, we explored Delta Live Tables (DLT) running on Photon, the next-generation query engine. Data engineers and analysts collaborated closely and effectively, combining Python and SQL. We were particularly excited by how quickly we were able to ingest raw loan data stored on S3, join external data sets (e.g., consumer sentiment), validate data quality, experiment with ML models in a notebook environment, and visualize the results in our BI tools; a sketch of such a pipeline is shown below.
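
The sketch below illustrates the general shape of such a pipeline in Python. The table names, S3 path, expectation rule, and the upstream consumer_sentiment dataset are illustrative assumptions rather than our production definitions; the code is meant to run inside a DLT pipeline, where the dlt module and spark session are available.

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw loan events ingested from S3 with Auto Loader")
@dlt.expect_or_drop("valid_loan_id", "loan_id IS NOT NULL")  # data quality expectation
def raw_loans():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://<bucket>/raw/loans/")
    )

@dlt.table(comment="Daily loan volume joined with external consumer sentiment")
def loan_features():
    sentiment = dlt.read("consumer_sentiment")  # assumed upstream dataset
    return (
        dlt.read("raw_loans")
        .groupBy("loan_date")
        .agg(F.count("*").alias("loan_count"))
        .join(sentiment, "loan_date", "left")
    )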

Below is an illustration of our pipeline, from S3 files to Looker dashboards delivered by a Slackbot. The following are the main reasons we want to use DLT for future data science work:

Speed

In just a matter of hours, we can move from raw data to actionable insights and predictions. We can even continuously stream data from S3, and build in expectations for validation.

Democratization

Analysts and data scientists can work directly on an end-to-end pipeline without extensive support from engineering. We can also collaborate and mix languages in one pipeline.

Unification

All stages of the deployment exist in one place, from data loading and orchestration to machine learning. The pipeline lives with its execution engine.

Illustration of the Bread Pipeline, from S3 files to Looker Dashboards delivered by Slackbot.

Conclusion

In this blog, we demonstrated how Bread is building a resilient, scalable data platform with Databricks and Delta Lake. We use AWS DMS and Databricks Auto Loader jobs to incrementally capture changes from RDS data sources and continuously merge the CDC data into Delta Lake, and we showed how to migrate jobs external to Databricks using the native Databricks SQL connector. Once we complete the centralized data lake, our next step will be to take advantage of Photon SQL Analytics endpoints and DLT pipelines to enable near real-time BI and ML work with simpler configurations and less engineering dependency.

--

Try Databricks for free. Get started today.

The post How Bread Standardized on the Lakehouse With Databricks & Delta Lake appeared first on Databricks.

GPU-accelerated Sentiment Analysis Using Pytorch and Huggingface on Databricks

Sentiment analysis is commonly used to analyze the sentiment present within a body of text, which could be anything from a review to an email or a tweet. Deep learning-based techniques are among the most popular ways to perform such an analysis. However, these techniques tend to be very computationally intensive and often require the use of GPUs, depending on the architecture and the embeddings used. Huggingface (https://huggingface.co) has put together a framework with the transformers package that makes accessing these embeddings seamless and reproducible. In this work, I illustrate how to perform scalable sentiment analysis by using the Huggingface package within PyTorch and leveraging the ML runtimes and infrastructure on Databricks.

Sentiment analysis

Sentiment analysis is the process of estimating the polarity of a user's sentiment, i.e., whether a user feels positively or negatively about a document or piece of text. The sentiment can also have a third, neutral category to account for the possibility that one may not have expressed a strong positive or negative sentiment regarding a topic. Sentiment analysis is a form of opinion mining but differs from stance or aspect detection, where a user's stance regarding a particular aspect or feature is extracted.

For example, the sentiment in the sentence below is overwhelmingly positive:

“The restaurant was great” 

However, consider the sentence below:

“The restaurant was great but the location could be better”

It is harder to estimate the overall sentiment, but the user's stance regarding the restaurant can be seen as generally positive even though their stance regarding the location was negative. To summarize, sentiment analysis provides coarse-grained information, while stance detection provides more information about particular aspects.

Sentiment analysis can be used to ascertain a customer's sentiment regarding a particular product, the public's reaction to an event, etc.

Types of sentiment analysis

Sentiment analysis can be performed using lexicon-based techniques or machine learning-based techniques. Lexicon-based techniques use a pre-labeled vocabulary to estimate the sentiment from text, and a variety of techniques are used to aggregate the sentiment scores assigned to the individual tokenized words. Some of the popular frameworks in this category are SentiNet and AFINN. VADER, an open-source package within NLTK, is another example that is used specifically for analyzing social media posts; a small lexicon-based example is sketched below for contrast. Machine learning-based sentiment analysis uses pre-trained embeddings along with a deep learning (DL) architecture to infer the sentiment in a body of text. In this blog, we will only cover ML-based techniques through the embeddings available from Huggingface. The sentiment analysis model, composed of the architecture and the embeddings, can then optionally be fine-tuned if domain-specific labels are available for the data. It is often the case that such supervised training can improve performance even when only a small amount of labeled data is available. Embeddings such as ELMo, BERT and RoBERTa are some of the popular language embeddings for this purpose.
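
As a quick contrast with the ML-based approach used in the rest of this post, the snippet below is a minimal lexicon-based example using VADER from NLTK; the scores are aggregated from the pre-labeled word-level sentiment in the VADER lexicon.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the pre-labeled lexicon
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The restaurant was great but the location could be better"))
# Returns a dict of neg/neu/pos/compound scores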

Introduction to transformers

Huggingface has made available a framework that aims to standardize the process of using and sharing models. This makes it easy to experiment with a variety of different models via an easy-to-use API. The transformers package is available for both PyTorch and TensorFlow; however, we use PyTorch in this post. The easiest way to perform inference using the transformers package is shown below.

import torch.nn as nn
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
tokenized_text = tokenizer(["Hello world"], padding=True, return_tensors='pt')
output = model(tokenized_text['input_ids'])
pt_predictions = nn.functional.softmax(output.logits, dim=1)

Looking at the example above, we notice two imports: one for a tokenizer and one for a model class. We can instantiate these by specifying a pre-trained model such as BERT (you can search for a model here). We then pass a sequence of strings to the tokenizer and specify that the result should be padded and returned as PyTorch tensors. The tokenized result is an object from which we extract the encoded text and pass it to the model. The model's outputs are then passed through a softmax layer, which, in the case of sentiment analysis, normalizes the results into sentiment scores.
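
As a small follow-on to the snippet above, the scores can be mapped back to human-readable labels via the model's id2label configuration; for this particular SST-2 checkpoint, index 0 corresponds to NEGATIVE and index 1 to POSITIVE.

# Map the softmax scores back to the model's label names
scores = pt_predictions.detach().numpy()[0]
labels = model.config.id2label            # e.g. {0: 'NEGATIVE', 1: 'POSITIVE'}
print(labels[int(scores.argmax())], float(scores.max()))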

(Multi) GPU-enabled inference

The process of inference consists of the following components:

  1. Dataloader for serving batches of tokenized data
  2. Model class that performs the inference
  3. Parallelization of the model on the GPU devices
  4. Iterating through the data for inference and extracting the results

Dataloader

PyTorch uses the DataLoader abstraction for extracting batches of data to be used either for training or inference. It takes as input an object of a class that extends the 'Dataset' class; here we call that class 'TextLoader'. The class must implement at least two methods:

(a) __len__(): returns the length of the entire dataset
(b) __getitem__(): extracts and returns a single element of the data

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
class TextLoader(Dataset):
    def __init__(self, file=None, transform=None, target_transform=None, tokenizer=None):
        self.file = pd.read_json(file, lines=True)
        self.file = self.file
        self.file = tokenizer(list(self.file['full_text']), padding=True, truncation=True, max_length=512, return_tensors='pt')
        self.file = self.file['input_ids']
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.file)

    def __getitem__(self, idx):
        data = self.file[idx]
        return(data)

The DataLoader accepts an instance of this class (named 'data' here), along with the batch size of data to be loaded in a single iteration. Note that the 'shuffle' flag is set to False, since we want to preserve the order of the data.

The DataLoader automatically handles dividing the data it receives so that it can be consumed by each of the GPU devices. If the data is not evenly divisible, it offers the option to either drop elements or pad a batch with duplicate data points; this is something to keep in mind, especially during inference or prediction.

from torch.utils.data import DataLoader

tokenizer = AutoTokenizer.from_pretrained(MODEL)
data = TextLoader(file='/PATH_TO/FILE.txt', tokenizer=tokenizer)
train_dataloader = DataLoader(data, batch_size=120, shuffle=False)  # shuffle must be False to preserve order

Model class

The model class is fairly similar to the code we saw above, the only difference being that it is now wrapped in an nn.Module subclass. The model is initialized within __init__, and the forward method applies the model loaded from Huggingface.

class SentimentModel(nn.Module):
    def __init__(self):
        super(SentimentModel, self).__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(MODEL)

    def forward(self, input):
        output = self.model(input)
        pt_predictions = nn.functional.softmax(output.logits, dim=1)
        return pt_predictions

model3 = SentimentModel()

Model parallelization and GPU dispatch

In PyTorch, a model or variable must be explicitly dispatched to the GPU, which can be done with the .to('cuda') method. If you have multiple GPUs, you can specify a device id, as in .to('cuda:0'). Additionally, in order to benefit from data parallelism and run training or inference across all the GPU devices on your cluster, the model must be wrapped in 'DataParallel'.

While this code assumes that you have more than one GPU on your cluster, if that is not the case the only change required is to set 'device_ids' to [0], or simply to omit the parameter (the default GPU device will be selected automatically).

dev = 'cuda'
if dev == 'cpu':
  device = torch.device('cpu')
  device_staging = 'cpu:0'
else:
  device = torch.device('cuda')
  device_staging = 'cuda:0'

try:
  # Wrap the model for data parallelism across the four GPU devices, then move it to the first device
  model3 = nn.DataParallel(model3, device_ids=[0,1,2,3])
  model3.to(device_staging)
except:
  torch.set_printoptions(threshold=10000)

Iteration loop

The following loop iterates over the batches of data, transferring each batch to the GPU device before passing it through the model. The results are then concatenated so that they can be exported to a data store.

out = torch.empty(0, 0)
with torch.no_grad():  # gradients are not needed for inference
    for data in train_dataloader:
        input = data.to(device_staging)
        if len(out) == 0:
            out = model3(input)
        else:
            output = model3(input)
            out = torch.cat((out, output), 0)

file = '/PATH_TO/FILE.txt'
df = pd.read_json(file, lines=True)['full_text']
res = out.cpu().numpy()
df_res = pd.DataFrame({"text": df, "negative": res[:, 0], "positive": res[:, 1]})
display(df_res)

Example table displaying results of Huggingface GPU-accelerated sentiment analysis.

Scalable inference for lots of files

In the example above, the data was read in from a single file; however, when dealing with large amounts of data, it is unlikely that all of it will be available in a single file. The following shows the entire code, with the changes needed to use the DataLoader with multiple files.

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
def get_all_files():
  file_list = ['/PATH/FILE1',
               '/PATH/FILE2',
               '/PATH/FILE3']
  return(file_list)

class TextLoader(Dataset):
    def __init__(self, file=None, transform=None, target_transform=None, tokenizer=None):
        self.file = pd.read_json(file, lines=True)
        self.file = self.file
        self.file = tokenizer(list(self.file['full_text']), padding=True, truncation=True, max_length=512, return_tensors='pt')
        self.file = self.file['input_ids']
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.file)

    def __getitem__(self, idx):
        data = self.file[idx]
        return(data)
      
class SentimentModel(nn.Module):
      def __init__(self):
        super(SentimentModel, self).__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(MODEL)

    def forward(self, input):
        output = self.model(input)
        pt_predictions = nn.functional.softmax(output.logits, dim=1)
        return(pt_predictions)
      
dev = 'cuda'
if dev == 'cpu':
  device = torch.device('cpu')
  device_staging = 'cpu:0'
else:
  device = torch.device('cuda')
  device_staging = 'cuda:0'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
all_files = get_all_files()
model3 = SentimentModel()
try:
      model3 = nn.DataParallel(model3, device_ids=[0,1,2,3])
      model3.to(device_staging)
except:
      torch.set_printoptions(threshold=10000)

for file in all_files:
    data = TextLoader(file=file, tokenizer=tokenizer)
    train_dataloader = DataLoader(data, batch_size=120, shuffle=False) # Shuffle should be set to False
    
   
    out = torch.empty(0,0)
    for data in train_dataloader:
        input = data.to(device_staging)
        if(len(out) == 0):
          out = model3(input)
        else:
          output = model3(input)
          with torch.no_grad():
            out = torch.cat((out, output), 0)
            
    df = pd.read_json(file, lines=True)['full_text']
    res = out.cpu().numpy()
    df_res = pd.DataFrame({ "text": df, "negative": res[:,0], "positive": res[:,1]})

Conclusion

We discussed how the Huggingface framework can be used to perform sentiment analysis with PyTorch, and showed how GPUs can be used to accelerate the inference process. The Databricks platform, with its readily available ML runtimes and state-of-the-art GPUs, makes it easy to experiment with and deploy these solutions.

For more details, please check out the attached notebook!

--

Try Databricks for free. Get started today.

The post GPU-accelerated Sentiment Analysis Using Pytorch and Huggingface on Databricks appeared first on Databricks.
