Databricks Examples

This repository contains a collection of notebooks demonstrating various features in Azure Databricks.

Working With Pandas: a notebook demonstrating the pandas_udf feature in Spark 2.3, which allows you to distribute processing of pandas dataframes across a cluster

Plotting Distributions: a notebook demonstrating how to plot the distribution of all numeric columns in a Spark dataframe using matplotlib

Write to a Single CSV File: if you have a small dataset in Spark, you can write the data into a single CSV file (instead of Spark’s default behavior of writing to multiple files)
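As a rough illustration of the idea (not necessarily the approach the notebook takes), a common way to get one output file is to collapse the DataFrame to a single partition before writing. The helper name `write_single_csv` and the output path are hypothetical; `df` is assumed to be an existing Spark DataFrame small enough to fit on one executor.

```python
def write_single_csv(df, output_dir):
    """Write a small Spark DataFrame as one CSV part file (hypothetical helper).

    Assumes `df` is a pyspark.sql.DataFrame that fits comfortably in a
    single partition; coalescing a large dataset this way removes parallelism.
    """
    # coalesce(1) merges all partitions into one, so Spark emits a single part file
    single_partition = df.coalesce(1)
    # header=True writes column names; mode="overwrite" replaces any earlier output
    single_partition.write.csv(output_dir, header=True, mode="overwrite")
```

Spark still creates a directory at `output_dir`; it contains one `part-*.csv` file plus marker files such as `_SUCCESS`, so a final rename or copy step is needed if a specific file name is required.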

NYC Taxi Data: Do you need a big dataset for experimenting with Spark? The NYC Taxi dataset is free and publicly available. This pair of notebooks downloads all of the raw data and then converts it into a Delta table.

Notebook 1: Download Raw Parquet Files
Notebook 2: Convert Raw Parquet to Delta

Custom Delimiter: a brief example showing how you can use Spark to read data from a flat file that uses a non-standard delimiter
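A minimal sketch of what reading such a file can look like, assuming the file has a header row and a single-character delimiter. The helper name `read_delimited` and the default `|` delimiter are illustrative assumptions, and the notebook itself may handle the problem differently.

```python
def read_delimited(spark, path, delimiter="|"):
    """Read a delimited text file into a Spark DataFrame (hypothetical helper).

    Assumes `spark` is an active SparkSession and that the first line of the
    file holds the column names.
    """
    # `sep` overrides the default comma; `header=True` takes column names from line 1
    return spark.read.csv(path, sep=delimiter, header=True)
```

In a Databricks notebook, where `spark` is already defined, a call would look like `read_delimited(spark, "/path/to/file.txt", delimiter="|")`, with the path replaced by a real location.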

Stream to Kafka: an example of how you can use Spark Streaming to send data to Azure Event Hubs using the Kafka API
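The sketch below shows the kind of connection settings the Kafka sink typically needs in order to talk to Event Hubs. The helper name `event_hubs_kafka_options`, its parameters, and the `shaded` flag are assumptions made for illustration; the notebook may configure the stream differently.

```python
def event_hubs_kafka_options(namespace, connection_string, topic, shaded=True):
    """Build Kafka sink options for an Event Hubs namespace (hypothetical helper).

    `namespace` is the Event Hubs namespace name, `connection_string` its
    connection string, and `topic` the event hub to write to.
    """
    # Assumption: Databricks runtimes shade the Kafka client, so the login module
    # class carries a "kafkashaded." prefix there; plain Spark uses the bare name.
    prefix = "kafkashaded." if shaded else ""
    jaas = (
        f"{prefix}org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{connection_string}";'
    )
    return {
        # Event Hubs exposes its Kafka endpoint on port 9093
        "kafka.bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
        "topic": topic,
    }
```

Given a streaming DataFrame `df` with a `value` column, the options could be applied with `df.writeStream.format("kafka").options(**opts).option("checkpointLocation", checkpoint_path).start()`, where `opts` is the returned dictionary and `checkpoint_path` is a location you choose.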

DBUtils in Parallel: a demonstration of the performance gains from using dbutils in parallel on a cluster instead of running it only on the driver

UDF Speed Testing: a comparison of the performance of Scala UDFs vs. Python UDFs
