How YipitData Extracts Insights From Alternative Data Using Delta Lake
This is a guest post from YipitData. We thank Anup Sega, Data Engineering Tech Lead, and Bobby Muldoon, Director of Data Engineering, at YipitData for their contributions. Choosing the right storage...
Managing Model Ensembles With MLflow
In machine learning, an ensemble is a collection of diverse models that provide more predictive power together than any single model would on its own. The outputs of multiple learning algorithms are...
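The core idea can be sketched in plain Python. This is an illustrative toy, not the article's MLflow workflow: the three "models" below are simple functions standing in for trained models (which in practice would be tracked as separate MLflow runs), and averaging their outputs is the simplest ensembling scheme.

```python
# Toy regression ensemble: three deliberately different "models" whose
# averaged prediction is more stable than any single member's.

def model_a(x: float) -> float:
    return 2.0 * x + 1.0      # slightly overestimates

def model_b(x: float) -> float:
    return 1.8 * x + 1.5      # different bias

def model_c(x: float) -> float:
    return 2.1 * x + 0.4      # yet another bias

def ensemble_predict(x: float) -> float:
    """Average the member predictions -- the simplest ensembling scheme."""
    members = (model_a, model_b, model_c)
    return sum(m(x) for m in members) / len(members)

print(ensemble_predict(10.0))
```

In a real pipeline, each member would be logged and loaded via MLflow's tracking and model registry APIs rather than defined inline.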
Extracting Oncology Insights From Real-world Clinical Data With NLP
Preview the solution accelerator notebooks referenced in this blog online or get started right away by downloading and importing the notebooks into your Databricks account. Cancer is the leading cause...
Catalog and Discover Your Databricks Notebooks Faster
This is a collaborative post from Databricks and Elsevier. We thank Darin McBeath, Director of Disruptive Technologies at Elsevier, for his contributions. As a global leader in information and analytics,...
Shiny and Environments for R Notebooks
At Databricks, we want the Lakehouse ecosystem to be widely accessible to all data practitioners, and R is a great interface language for this purpose because of its rich ecosystem of open source packages...
Interning From a Distance
Summer 2021 brought another summer of virtual game nights, pizza parties and team-building events for Databricks interns. In addition to working on impactful projects that ranged from improving our...
Bringing Lakehouse to the Citizen Data Scientist: Announcing the Acquisition...
Transforming into a data-driven organization – which means data has permeated into every facet of your company – is critical for driving meaningful business outcomes. Data literacy is the new buzzword...
Databricks Repos Is Now Generally Available – New ‘Files’ Feature in Public...
Thousands of Databricks customers have adopted Databricks Repos since its public preview and have standardized on it for their development and production workflows. Today, we are happy to announce that...
5 Steps to Get Started With Databricks on Google Cloud
Since we launched Databricks on Google Cloud earlier this year, we’ve been thrilled to see stories about the value this joint solution has brought to data teams across the globe. One of our favorite...
Efficient Point in Polygon Joins via PySpark and BNG Geospatial Indexing
This is a collaborative post by Ordnance Survey, Microsoft and Databricks. We thank Charis Doidge, Senior Data Engineer, and Steve Kingston, Senior Data Scientist, Ordnance Survey, and Linda Sheard,...
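The article pairs grid indexing (the British National Grid) with the geometric point-in-polygon test to prune candidates at scale. The geometric test itself can be sketched in plain Python; this is the standard ray-casting algorithm, not the authors' implementation, and the square polygon is illustrative:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: cast a ray from (x, y) to the right and count
    edge crossings. `polygon` is a list of (x, y) vertices in order;
    an odd number of crossings means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))   # inside -> True
print(point_in_polygon(5, 2, square))   # outside -> False
```

A spatial index like BNG avoids running this O(vertices) check against every polygon: only polygons sharing a grid cell with the point need the exact test.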
Native Support of Session Window in Spark Structured Streaming
Apache Spark™ Structured Streaming has long allowed users to perform aggregations on windows over event time. Before Apache Spark™ 3.2, Spark supported tumbling windows and sliding windows. In the upcoming Apache...
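Unlike tumbling and sliding windows, a session window has a dynamic length: it keeps extending while events arrive within a gap of each other, and closes once the gap is exceeded. A plain-Python sketch of that grouping rule (conceptual only; in Spark 3.2 this is exposed natively via the `session_window` function in Structured Streaming):

```python
def sessionize(event_times, gap):
    """Group sorted event timestamps into sessions: a new session starts
    whenever the time since the previous event is >= `gap`."""
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] < gap:
            sessions[-1].append(t)       # within the gap: extend session
        else:
            sessions.append([t])         # gap exceeded: start a new one
    return sessions

print(sessionize([1, 2, 3, 10, 11, 30], gap=5))
# -> [[1, 2, 3], [10, 11], [30]]
```

The native Spark implementation does this per grouping key over an unbounded stream, with watermarks deciding when a session can be finalized.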
Developing Databricks’ Runbot CI Solution
Runbot is a bespoke continuous integration (CI) solution developed specifically for Databricks’ needs. Originally developed in 2019, Runbot incrementally replaces our aging Jenkins infrastructure with...
Creating an IP Lookup Table of Activities in a SIEM Architecture
When working with cyber security data, one thing is for sure: there is no shortage of available data sources. If anything, there are too many data sources with overlapping data. Your traditional SIEM...
MLflow for Bayesian Experiment Tracking
This post is the third in a series on Bayesian inference ([1], [2]). Here we will illustrate how to use managed MLflow on Databricks to perform and track Bayesian experiments using the Python package...
Introducing Apache Spark™ 3.2
We are excited to announce the availability of Apache Spark™ 3.2 on Databricks as part of Databricks Runtime 10.0. We want to thank the Apache Spark community for their valuable contributions to the...
Introducing SQL User-Defined Functions
A user-defined function (UDF) is a means for a user to extend the native capabilities of Apache Spark™ SQL. Spark SQL has supported external user-defined functions written in Scala, Java, Python and R...
Simplifying Data + AI, One Line of TypeScript at a Time
Today, Databricks is known for our backend engineering, building and operating cloud systems that span millions of virtual machines processing exabytes of data each day. What’s not as obvious is the...
Curating More Inclusive and Safer Online Communities With Databricks and...
This is a guest authored post by JT Vega, Support Engineering Manager, Labelbox. While video games and digital content are a source of entertainment, connecting with others, and fun for many around...
How Bread Standardized on the Lakehouse With Databricks & Delta Lake
This is a collaborative post from Bread Finance and Databricks. We thank co-author Christina Taylor, Senior Data Engineer at Bread Finance, for her contribution. Bread, a division of Alliance Data...
GPU-accelerated Sentiment Analysis Using Pytorch and Huggingface on Databricks
Sentiment analysis is commonly used to analyze the sentiment present within a body of text, such as a review, an email or a tweet. Deep learning-based techniques are one of the most...