{ "cells": [ { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "a7362132-9224-4e56-9e1b-669456247aec", "showTitle": false, "title": "" } }, "source": [ "# UDF Speed Testing\n", "\n", "When using Spark, you have a few different options for manipulating data. However, some of these options are more performant than others.\n", "In this notebook, we'll try out a few options and measure their performance.\n", "\n", "Here are the methods we will be evaluating:\n", " - **Spark Commands**: PySpark is a wrapper around the Spark API. By using PySpark, we can use Python code to send commands directly to Spark.\n", " - **Spark SQL**: Similar to PySpark, Spark SQL allows us to use SQL-like syntax to send commands directly to Spark.\n", " - **Scala UDF**: We can extend Spark by writing functions in Scala and registering them with Spark. This allows Spark to run our custom function\n", " against the data it is storing. Since the Scala code can be executed within the same JVM in which Spark runs, the data does not need to be\n", " serialized to a different process.\n", " - **Python UDF**: We can also extend Spark by writing functions in Python and leverage packages from Python's vast ecosystem. However, unlike\n", " a Scala UDF, this Python code is executed in a separate process (a Python runtime environment). Therefore, all of the data needed by the\n", " function must be serialized and sent from the JVM to the Python process. This incurs a performance penalty.\n", " \n", "With this understanding of how Spark works, we would expect the native Spark commands to be the most performant. We would expect the Spark SQL\n", "commands to have almost identical performance since they are functionally the same as the native commands. A Scala UDF should perform well\n", "but might be slightly slower than a native command. 
Lastly, we would expect a Python UDF to be the slowest since it requires the data\n", "to be serialized between processes.\n", "\n", "For this test, we will do some arbitrary string manipulation. We will take a `publicationuuid`, which is a string of 64 characters. We will\n", "take the last 16 characters of the string, append a dash, and then append the first 48 characters. Therefore:
\n", "`63dc895fe59caecedeae490969eee8a061fa1e02e5ea354fe6d1e353b7b8ece4`
\n", "becomes:
\n", "`e6d1e353b7b8ece4-63dc895fe59caecedeae490969eee8a061fa1e02e5ea354f`
\n", "\n", "We will implement this same transformation using the four techniques listed above. We will apply each implementation to our DataFrame\n", "and compare how they perform." ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "f5684573-77f0-4612-96ac-9630cf40ae78", "showTitle": false, "title": "" } }, "source": [ "## Create Our UDFs\n", "\n", "Now we can implement our UDFs in Python and in Scala." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "e3ff1bde-78aa-4fe7-aa4c-9f5693259430", "showTitle": false, "title": "" } }, "outputs": [ { "output_type": "display_data", "data": { "text/html": [ "\n", "
Out[1]: <function __main__.<lambda>(x)>
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
Out[1]: <function __main__.<lambda>(x)>
", "datasetInfos": [], "metadata": {}, "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "from pyspark.sql import functions as F\n", "from pyspark.sql import types as T\n", "\n", "def chop_uuid(text:str) -> str:\n", " return text[-16:] + '-' + text[:48]\n", "\n", "spark.udf.register(\"python_chop_uuid\", F.udf(lambda x: chop_uuid(x), T.StringType()))" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "7067883a-2771-47d6-b89e-de10c8fa7836", "showTitle": false, "title": "" } }, "outputs": [ { "output_type": "display_data", "data": { "text/html": [ "\n", "
import org.apache.spark.sql.functions.udf\n", "chop_uuid: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$14887/1823763002@732e70ae,StringType,List(Some(class[value[0]: string])),Some(class[value[0]: string]),None,true,true)\n", "res0: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$14887/1823763002@732e70ae,StringType,List(Some(class[value[0]: string])),Some(class[value[0]: string]),Some(scala_chop_uuid),true,true)\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
import org.apache.spark.sql.functions.udf\nchop_uuid: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$14887/1823763002@732e70ae,StringType,List(Some(class[value[0]: string])),Some(class[value[0]: string]),None,true,true)\nres0: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$14887/1823763002@732e70ae,StringType,List(Some(class[value[0]: string])),Some(class[value[0]: string]),Some(scala_chop_uuid),true,true)\n
", "datasetInfos": [], "metadata": { "isDbfsCommandResult": false }, "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "%scala\n", "import org.apache.spark.sql.functions.udf\n", "\n", "val chop_uuid = udf((x: String) => x.takeRight(16) + \"-\" + x.substring(0, 48))\n", "spark.udf.register(\"scala_chop_uuid\", chop_uuid)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "35e25957-3e52-4ff5-a0ba-b74864b2ad11", "showTitle": false, "title": "" } }, "source": [ "## Get Some Data\n", "\n", "We just need a DataFrame with a bunch of double UUIDs in it that we can run our functions against. Half a billion rows should be plenty." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "5acc274c-9c2b-4576-b015-ee961314f907", "showTitle": false, "title": "" } }, "outputs": [ { "output_type": "display_data", "data": { "text/html": [ "\n", "
Loading data...\n", "Number of Rows: 596,879,324\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
Loading data...\nNumber of Rows: 596,879,324\n
", "datasetInfos": [], "metadata": {}, "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "print(\"Loading data...\")\n", "df = spark.read.format(\"delta\").load(\"/mnt/silver/patent/family\").select(\"publicationuuid\").cache()\n", "df.foreach(lambda x: True)\n", "print(f\"Number of Rows: {df.count():,}\")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "0659755e-26e6-43f3-847e-1993d17d3fb7", "showTitle": false, "title": "" } }, "source": [ "## Run the Tests in an Automated Fashion\n", "\n", "The code below is a loop. On each iteration, we run one randomly selected implementation of the transformation.\n", "We track how long each run takes and then compare the results across the different implementations." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "91dc485c-aca1-4ad5-afd7-9492ab3ae12d", "showTitle": false, "title": "" } }, "outputs": [ { "output_type": "display_data", "data": { "text/html": [ "\n", "
Loading data...\n", "Number of Rows: 596,879,324\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.5722526333333333 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.66192225 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.6883847000000001 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.9080985166666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.6843588833333333 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6569827833333334 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6609970000000001 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.70247645 minutes\n", "\n", 
"--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.88462395 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6639201 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.9518759166666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.8990546333333334 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6615791166666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.9054351166666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.6839297833333333 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.6884910666666666 minutes\n", "\n", "--------------------------------------------------\n", "\n", 
"Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6600648166666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6625185666666666 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.6877886666666666 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.7021473 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.66032295 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.6847340000000001 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.68502905 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6608457166666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering 
action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6599224166666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.8943572166666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.6901039999999999 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6619532666666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.7694913666666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.68570345 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.8861696 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.8893653333333333 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " 
Completed in 0.8844189333333333 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.69114995 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6523360666666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.7854270666666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.6847059166666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.6773334166666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.6823809833333333 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.8989084833333333 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.7176299833333334 minutes\n", "\n", 
"--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6539309666666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.6848512666666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6529501 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.6820191333333334 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.8936225833333334 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.6794774 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.6780650500000001 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.6780807666666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 
'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.8965125166666666 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.8845585 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.6794598500000001 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.8950145333333333 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.66432635 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.6887118333333333 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.6911382833333334 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.68218 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering 
action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.8925327500000001 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.7096297 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6574899833333333 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.8974934666666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.68909975 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.68592095 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.6956132166666666 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.6936092833333334 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 
0.6917988 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.68727475 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6540273166666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.6807229166666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.6859080666666666 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.8940838000000001 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.9484269 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.8810638833333333 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'spark'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'spark_command']\n", " Completed in 0.6777511833333333 minutes\n", "\n", 
"--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.68640415 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.6846533 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.68872115 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'scala'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'scala_udf']\n", " Completed in 0.6610048166666667 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'python'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'python_udf']\n", " Completed in 0.9051501833333333 minutes\n", "\n", "--------------------------------------------------\n", "\n", "Trying 'sql'\n", " Applying transformation...\n", " Triggering action...\n", " Columns: ['publicationuuid', 'sql_command']\n", " Completed in 0.6878159166666666 minutes\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
Loading data...\nNumber of Rows: 596,879,324\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.5722526333333333 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.66192225 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.6883847000000001 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.9080985166666667 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.6843588833333333 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6569827833333334 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6609970000000001 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.70247645 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.88462395 minutes\n\n--------------------------------------------------\n\nTrying 
'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6639201 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.9518759166666667 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8990546333333334 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6615791166666667 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.9054351166666667 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.6839297833333333 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.6884910666666666 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6600648166666667 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6625185666666666 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 
'spark_command']\n Completed in 0.6877886666666666 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.7021473 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.66032295 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.6847340000000001 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.68502905 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6608457166666667 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6599224166666667 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8943572166666667 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.6901039999999999 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6619532666666667 minutes\n\n--------------------------------------------------\n\nTrying 
'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.7694913666666667 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.68570345 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8861696 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8893653333333333 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8844189333333333 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.69114995 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6523360666666667 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.7854270666666667 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.6847059166666667 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed 
in 0.6773334166666667 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.6823809833333333 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8989084833333333 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.7176299833333334 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6539309666666667 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.6848512666666667 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6529501 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.6820191333333334 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8936225833333334 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.6794774 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying 
transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.6780650500000001 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.6780807666666667 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8965125166666666 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8845585 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.6794598500000001 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8950145333333333 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.66432635 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.6887118333333333 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.6911382833333334 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 
0.68218 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8925327500000001 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.7096297 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6574899833333333 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8974934666666667 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.68909975 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.68592095 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.6956132166666666 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.6936092833333334 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.6917988 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering 
action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.68727475 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6540273166666667 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.6807229166666667 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.6859080666666666 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8940838000000001 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.9484269 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.8810638833333333 minutes\n\n--------------------------------------------------\n\nTrying 'spark'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'spark_command']\n Completed in 0.6777511833333333 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.68640415 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.6846533 
minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.68872115 minutes\n\n--------------------------------------------------\n\nTrying 'scala'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'scala_udf']\n Completed in 0.6610048166666667 minutes\n\n--------------------------------------------------\n\nTrying 'python'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'python_udf']\n Completed in 0.9051501833333333 minutes\n\n--------------------------------------------------\n\nTrying 'sql'\n Applying transformation...\n Triggering action...\n Columns: ['publicationuuid', 'sql_command']\n Completed in 0.6878159166666666 minutes\n
", "datasetInfos": [], "metadata": {}, "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "import random\n", "from datetime import datetime\n", "from pyspark.sql import functions as F\n", "\n", "results = {}\n", "\n", "# Run each method 20 times in random order so that cluster drift affects all methods equally\n", "run_types = ['spark', 'sql', 'scala', 'python'] * 20\n", "random.shuffle(run_types)\n", "\n", "for run_type in run_types:\n", "  print(\"\\n--------------------------------------------------\\n\")\n", "  print(f\"Trying '{run_type}'\")\n", "\n", "  print(\" Applying transformation...\")\n", "  # Spark string positions are 1-based: take the last 16 characters, a dash, then the first 48\n", "  if run_type == 'spark':\n", "    df = df.withColumn(\"spark_command\", F.concat(F.substring(\"publicationuuid\", -16, 16), F.lit(\"-\"), F.substring(\"publicationuuid\", 1, 48)))\n", "\n", "  elif run_type == 'sql':\n", "    df = df.withColumn(\"sql_command\", F.expr(\"CONCAT(SUBSTRING(publicationuuid, -16, 16), '-', SUBSTRING(publicationuuid, 1, 48))\"))\n", "\n", "  elif run_type == 'scala':\n", "    df = df.withColumn(\"scala_udf\", F.expr(\"scala_chop_uuid(publicationuuid)\"))\n", "\n", "  elif run_type == 'python':\n", "    df = df.withColumn(\"python_udf\", F.expr(\"python_chop_uuid(publicationuuid)\"))\n", "\n", "  print(\" Triggering action...\")\n", "  # foreach is an action, so it forces the new column to be computed for every row\n", "  start = datetime.now()\n", "  df.foreach(lambda x: True)\n", "  end = datetime.now()\n", "\n", "  # Drop the column we just added so that every run starts from the same DataFrame\n", "  columns = df.columns\n", "  print(f\" Columns: {columns}\")\n", "  df = df.drop(columns[-1])\n", "\n", "  run_time = (end - start).total_seconds() / 60\n", "  results.setdefault(run_type, []).append(run_time)\n", "  print(f\" Completed in {run_time} minutes\")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "8d0a32fc-567b-435d-8043-268145023a60", "showTitle": false, "title": "" } }, "source": [ "## Results\n", "The loops below print the time of each run as a table, with one column per method and the average in the last row.\n", "\n", "**NOTE**: These results are expressed in **_minutes_**, not seconds."
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "6f68e3cc-ee8b-4681-89c8-537c40460571", "showTitle": false, "title": "" } }, "outputs": [ { "output_type": "display_data", "data": { "text/html": [ "\n", "
spark scala python sql\n", " 0 0.57 0.66 0.91 0.70\n", " 1 0.69 0.66 0.88 0.69\n", " 2 0.68 0.66 0.95 0.69\n", " 3 0.68 0.66 0.90 0.68\n", " 4 0.69 0.66 0.91 0.68\n", " 5 0.69 0.66 0.89 0.68\n", " 6 0.68 0.66 0.89 0.68\n", " 7 0.69 0.70 0.89 0.68\n", " 8 0.77 0.66 0.88 0.68\n", " 9 0.69 0.66 0.90 0.68\n", " 10 0.79 0.66 0.89 0.71\n", " 11 0.68 0.66 0.90 0.69\n", " 12 0.68 0.65 0.88 0.70\n", " 13 0.69 0.72 0.90 0.69\n", " 14 0.69 0.65 0.89 0.69\n", " 15 0.68 0.65 0.90 0.69\n", " 16 0.69 0.66 0.89 0.69\n", " 17 0.69 0.66 0.95 0.68\n", " 18 0.68 0.65 0.88 0.69\n", " 19 0.68 0.66 0.91 0.69\n", "--- -------- -------- -------- --------\n", "Avg 0.69 0.66 0.90 0.69\n", "
" ] }, "metadata": { "application/vnd.databricks.v1+output": { "addedWidgets": {}, "arguments": {}, "data": "
spark scala python sql\n 0 0.57 0.66 0.91 0.70\n 1 0.69 0.66 0.88 0.69\n 2 0.68 0.66 0.95 0.69\n 3 0.68 0.66 0.90 0.68\n 4 0.69 0.66 0.91 0.68\n 5 0.69 0.66 0.89 0.68\n 6 0.68 0.66 0.89 0.68\n 7 0.69 0.70 0.89 0.68\n 8 0.77 0.66 0.88 0.68\n 9 0.69 0.66 0.90 0.68\n 10 0.79 0.66 0.89 0.71\n 11 0.68 0.66 0.90 0.69\n 12 0.68 0.65 0.88 0.70\n 13 0.69 0.72 0.90 0.69\n 14 0.69 0.65 0.89 0.69\n 15 0.68 0.65 0.90 0.69\n 16 0.69 0.66 0.89 0.69\n 17 0.69 0.66 0.95 0.68\n 18 0.68 0.65 0.88 0.69\n 19 0.68 0.66 0.91 0.69\n--- -------- -------- -------- --------\nAvg 0.69 0.66 0.90 0.69\n
", "datasetInfos": [], "metadata": {}, "removedWidgets": [], "type": "html" } }, "output_type": "display_data" } ], "source": [ "from statistics import mean\n", "\n", "keys = list(results.keys())\n", "\n", "# Header row: one right-aligned column per method\n", "line = \"   \"\n", "for k in keys:\n", "  line = line + f\"{k:>20}\"\n", "\n", "print(line)\n", "\n", "# One row per run index; every method was run the same number of times\n", "num_runs = min(len(times) for times in results.values())\n", "for i in range(num_runs):\n", "  line = f\"{i:>3d}\"\n", "  for k in keys:\n", "    line = line + f\"{results[k][i]:>20.2f}\"\n", "\n", "  print(line)\n", "\n", "line = \"---\"\n", "for k in keys:\n", "  line = line + f\"{'--------':>20}\"\n", "\n", "print(line)\n", "\n", "# Final row: average run time per method\n", "line = \"Avg\"\n", "for k in keys:\n", "  line = line + f\"{mean(results[k]):>20.2f}\"\n", "\n", "print(line)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "47af6bcd-1b79-473a-8b94-45141a151ea5", "showTitle": false, "title": "" } }, "source": [ "Average run time per method:\n", "\n", " - Spark Commands: 0.69 minutes\n", " - Spark SQL: 0.69 minutes\n", " - Scala UDF: 0.66 minutes\n", " - Python UDF: 0.90 minutes\n", "\n", "---\n", "\n", " - Spark Commands and Spark SQL averaged the same run time, as expected.\n", " - The Python UDF took about 36% longer than the Scala UDF (0.90 / 0.66 is about 1.36)." ] } ], "metadata": { "application/vnd.databricks.v1+notebook": { "dashboards": [], "language": "python", "notebookMetadata": { "pythonIndentUnit": 2 }, "notebookName": "UDF Speed Testing", "widgets": {} } }, "nbformat": 4, "nbformat_minor": 0 }