{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "a7362132-9224-4e56-9e1b-669456247aec",
"showTitle": false,
"title": ""
}
},
"source": [
"# UDF Speed Testing\n",
"\n",
"When using Spark, you have a few different options for manipulating data. However, some of these options are more performant than others.\n",
"In this notebook, we'll try out a few options and measure their peformance.\n",
"\n",
"Here are the methods we will be evaluating:\n",
" - **Spark Commands**: PySpark is a wrapper around the Spark API. By using PySpark, we can use Python code to send commands directly to Spark.\n",
" - **Spark SQL**: Similar to PySpark, Spark SQL allows us to use SQL-like syntax to send commands directly to Spark.\n",
" - **Scala UDF**: We can extend Spark by writing functions in Scala and registering them with Spark. This allows Spark to run our custom function\n",
" against the data it is storing. Since the Scala code can be executed within the same JVM in which Spark runs, the data does not need to be\n",
" serialized to a different process.\n",
" - **Python UDF**: We can also extend Spark by writing functions in Python and leverage packages from Python's vast eco-system. However, unlike\n",
" a Scala UDF, this Python code is executed in a separate process (a Python runtime environment). Therefore, all of the data needed by the\n",
" function must be serialized and sent from the JVM to the Python process. This incurs a performance penalty.\n",
" \n",
"With this understanding of how Spark works, we would expect the native Spark commands to be the most performant. We would expect the Spark SQL\n",
"commands to have almost identical performance since they are functionaly the same as the native commands. A Scala UDF should perform well\n",
"but might be slightly slower than a native command. Lastly, we would expect a Python UDF to perform the slowest since it requires the data\n",
"to be serialized between processes.\n",
"\n",
"For this test, we will do some arbitrary string manipulation. We will take a `publicationuuid`, which is a string of 64 characters. We will\n",
"take the last 16 characters of the string, append a dash, and then append the remaining 48 characters. Therefore:
\n",
"`63dc895fe59caecedeae490969eee8a061fa1e02e5ea354fe6d1e353b7b8ece4`
\n",
"becomes:
\n",
"`e6d1e353b7b8ece4-63dc895fe59caecedeae490969eee8a061fa1e02e5ea354f`
\n",
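"\n",
"In plain Python (outside of Spark), the reordering itself is just string slicing. A small sketch, using a hypothetical helper name `reorder_uuid`:\n",
"\n",
"```python\n",
"def reorder_uuid(publicationuuid: str) -> str:\n",
"    # Move the last 16 characters to the front, joined to the first 48 with a dash\n",
"    return publicationuuid[-16:] + \"-\" + publicationuuid[:48]\n",
"\n",
"assert (\n",
"    reorder_uuid(\"63dc895fe59caecedeae490969eee8a061fa1e02e5ea354fe6d1e353b7b8ece4\")\n",
"    == \"e6d1e353b7b8ece4-63dc895fe59caecedeae490969eee8a061fa1e02e5ea354f\"\n",
")\n",
"```\n",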
"\n",
"We will implement this same transformation using the four techniques listed above. We will apply each implementation to our data frame\n",
"and see how they perform."
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "f5684573-77f0-4612-96ac-9630cf40ae78",
"showTitle": false,
"title": ""
}
},
"source": [
"## Create Our UDF Functions\n",
"\n",
"Now we can implement our UDF functions in Python and in Scala"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "e3ff1bde-78aa-4fe7-aa4c-9f5693259430",
"showTitle": false,
"title": ""
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/html": [
"\n",
"