{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
"byteLimit": 2048000,
"rowLimit": 10000
},
"inputWidgets": {},
"nuid": "4f6a9629-547c-4083-856c-20a52db30fd8",
"showTitle": false,
"title": ""
}
},
"source": [
"# Using Spark to Parse Text Files With a Custom Delimiter\n",
"\n",
"Suppose you have a text file with data that uses non-standard delimiters. Can we still use Spark to parse that data? Yes! We just have to use some custom string maniuplation code.\n",
"That won't be quite as efficient as using native Spark, but it will get the job done!\n",
"\n",
"Let's make a file that has data where some fields use a pipe delimiter `|` and other fields use a combination of pipe and caret `|^my data value^|`"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
"byteLimit": 2048000,
"rowLimit": 10000
},
"inputWidgets": {},
"nuid": "90bc3e01-5ca6-47ab-90dd-706e941a3d47",
"showTitle": false,
"title": ""
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"output_type": "stream",
"text": [
"Wrote 205 bytes.\nOut[1]: True"
]
}
],
"source": [
"# Make a small data file\n",
"\n",
"testData = \"\"\"Name|Color|Size|Origin\n",
"Rabbit^|^Brown^|7|^Warehouse\n",
"Horse^|^Green^|28|^Dock\n",
"Pig^|^Orange^|17|^Port\n",
"Cow^|^Blue^|23|^Warehouse\n",
"Bird^|^Yellow^|2|^Dock\n",
"Dog^|^Brown^|10|^Port\n",
"Carrot^Man^|^Orange^|22|^Warehouse\n",
"\"\"\"\n",
"\n",
"dbutils.fs.put(\"data.txt\", testData, True)"
]
},
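{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, we can peek at the raw file and confirm the mixed delimiters landed on disk as intended. This is a minimal sketch using `dbutils.fs.head`, which returns the start of the file (up to 65536 bytes by default), plenty for our small test file."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Peek at the raw file contents (sanity check only)\n",
"print(dbutils.fs.head(\"/data.txt\"))"
]
},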
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
"byteLimit": 2048000,
"rowLimit": 10000
},
"inputWidgets": {},
"nuid": "57428e77-8523-436b-93f4-554a9a6c1ccb",
"showTitle": false,
"title": ""
}
},
"source": [
"First, let's just read in the data using the `|` as a separator. This should work, but some of our fields will start and end with a caret."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
"byteLimit": 2048000,
"rowLimit": 10000
},
"inputWidgets": {},
"nuid": "386b452e-d252-4add-9159-a62bbf7a6c8d",
"showTitle": false,
"title": ""
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/html": [
"
Name | Color | Size | Origin |
---|
Rabbit^ | ^Brown^ | 7 | ^Warehouse |
Horse^ | ^Green^ | 28 | ^Dock |
Pig^ | ^Orange^ | 17 | ^Port |
Cow^ | ^Blue^ | 23 | ^Warehouse |
Bird^ | ^Yellow^ | 2 | ^Dock |
Dog^ | ^Brown^ | 10 | ^Port |
Carrot^Man^ | ^Orange^ | 22 | ^Warehouse |
"
]
},
"metadata": {
"application/vnd.databricks.v1+output": {
"addedWidgets": {},
"aggData": [],
"aggError": "",
"aggOverflow": false,
"aggSchema": [],
"aggSeriesLimitReached": false,
"aggType": "",
"arguments": {},
"columnCustomDisplayInfos": {},
"data": [
[
"Rabbit^",
"^Brown^",
"7",
"^Warehouse"
],
[
"Horse^",
"^Green^",
"28",
"^Dock"
],
[
"Pig^",
"^Orange^",
"17",
"^Port"
],
[
"Cow^",
"^Blue^",
"23",
"^Warehouse"
],
[
"Bird^",
"^Yellow^",
"2",
"^Dock"
],
[
"Dog^",
"^Brown^",
"10",
"^Port"
],
[
"Carrot^Man^",
"^Orange^",
"22",
"^Warehouse"
]
],
"datasetInfos": [],
"dbfsResultPath": null,
"isJsonSchema": true,
"metadata": {},
"overflow": false,
"plotOptions": {
"customPlotOptions": {},
"displayType": "table",
"pivotAggregation": null,
"pivotColumns": null,
"xColumns": null,
"yColumns": null
},
"removedWidgets": [],
"schema": [
{
"metadata": "{}",
"name": "Name",
"type": "\"string\""
},
{
"metadata": "{}",
"name": "Color",
"type": "\"string\""
},
{
"metadata": "{}",
"name": "Size",
"type": "\"string\""
},
{
"metadata": "{}",
"name": "Origin",
"type": "\"string\""
}
],
"type": "table"
}
},
"output_type": "display_data"
}
],
"source": [
"display( spark.read.option(\"header\", \"true\").option(\"sep\", \"|\").csv(\"/data.txt\") )"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
"byteLimit": 2048000,
"rowLimit": 10000
},
"inputWidgets": {},
"nuid": "2394acac-c29b-4be9-95a6-98417b05d20d",
"showTitle": false,
"title": ""
}
},
"source": [
"### Option 1: Split the Data and Remove the Carets\n",
"\n",
"One option is to split the data (as we did in our quick example above) and then loop through each column, removing any carets at the beginning and end of the data."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
"byteLimit": 2048000,
"rowLimit": 10000
},
"inputWidgets": {},
"nuid": "7e84a372-8b10-4952-94fb-f179f04150e4",
"showTitle": false,
"title": ""
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/html": [
"Name | Color | Size | Origin |
---|
Rabbit | Brown | 7 | Warehouse |
Horse | Green | 28 | Dock |
Pig | Orange | 17 | Port |
Cow | Blue | 23 | Warehouse |
Bird | Yellow | 2 | Dock |
Dog | Brown | 10 | Port |
Carrot^Man | Orange | 22 | Warehouse |
"
]
},
"metadata": {
"application/vnd.databricks.v1+output": {
"addedWidgets": {},
"aggData": [],
"aggError": "",
"aggOverflow": false,
"aggSchema": [],
"aggSeriesLimitReached": false,
"aggType": "",
"arguments": {},
"columnCustomDisplayInfos": {},
"data": [
[
"Rabbit",
"Brown",
"7",
"Warehouse"
],
[
"Horse",
"Green",
"28",
"Dock"
],
[
"Pig",
"Orange",
"17",
"Port"
],
[
"Cow",
"Blue",
"23",
"Warehouse"
],
[
"Bird",
"Yellow",
"2",
"Dock"
],
[
"Dog",
"Brown",
"10",
"Port"
],
[
"Carrot^Man",
"Orange",
"22",
"Warehouse"
]
],
"datasetInfos": [],
"dbfsResultPath": null,
"isJsonSchema": true,
"metadata": {},
"overflow": false,
"plotOptions": {
"customPlotOptions": {},
"displayType": "table",
"pivotAggregation": null,
"pivotColumns": null,
"xColumns": null,
"yColumns": null
},
"removedWidgets": [],
"schema": [
{
"metadata": "{}",
"name": "Name",
"type": "\"string\""
},
{
"metadata": "{}",
"name": "Color",
"type": "\"string\""
},
{
"metadata": "{}",
"name": "Size",
"type": "\"string\""
},
{
"metadata": "{}",
"name": "Origin",
"type": "\"string\""
}
],
"type": "table"
}
},
"output_type": "display_data"
}
],
"source": [
"# Option 1\n",
"dff = spark.read.option(\"header\", \"true\").option(\"inferSchema\", \"true\").option(\"delimiter\", \"|\").csv(\"/data.txt\")\n",
"\n",
"from pyspark.sql.functions import regexp_replace\n",
"\n",
"dffs_headers = dff.dtypes\n",
"for i in dffs_headers:\n",
" columnLabel = i[0]\n",
" newColumnLabel = columnLabel.replace('^','').replace('^','') \n",
" dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\\\^|\\\\^$', ''))\n",
" if columnLabel != newColumnLabel:\n",
" dff = dff.drop(columnLabel)\n",
" \n",
"display(dff)"
]
},
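{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, the same cleanup can be written without a Python loop by building every cleaned column in a single `select`. Here is a minimal sketch of that variant (the `dff2` name is just for illustration); it should produce the same table as the loop above."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Loop-free variant of Option 1: clean every column in one select\n",
"from pyspark.sql.functions import regexp_replace, col\n",
"\n",
"dff2 = spark.read.option(\"header\", \"true\").option(\"sep\", \"|\").csv(\"/data.txt\")\n",
"dff2 = dff2.select(\n",
"  [regexp_replace(col(c), '^\\\\^|\\\\^$', '').alias(c.replace('^', '')) for c in dff2.columns]\n",
")\n",
"display(dff2)"
]
},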
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
"byteLimit": 2048000,
"rowLimit": 10000
},
"inputWidgets": {},
"nuid": "38a388ce-51cf-40fa-926e-1bfcd98c56c8",
"showTitle": false,
"title": ""
}
},
"source": [
"### Option 2: Replace Carets Near Pipes Then Split\n",
"\n",
"A different approach is to read each line of the text file as a row in an RDD. Then replace any occurances of `|^` or `^|` with a plain pipe. Then we can split the data based on the pipe.\n",
"\n",
"Note that with this approach, we have no way to tell Spark that our first row contains column headers. So we do an extra step at the end where we filter out the rows that contain\n",
"the header information. We then reapply that header information when we convert the RDD to a Spark DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
"byteLimit": 2048000,
"rowLimit": 10000
},
"inputWidgets": {},
"nuid": "925b47a7-ae42-42a3-97ec-701987ecd53d",
"showTitle": false,
"title": ""
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"text/html": [
"Name | Color | Size | Origin |
---|
Rabbit | Brown | 7 | Warehouse |
Horse | Green | 28 | Dock |
Pig | Orange | 17 | Port |
Cow | Blue | 23 | Warehouse |
Bird | Yellow | 2 | Dock |
Dog | Brown | 10 | Port |
Carrot^Man | Orange | 22 | Warehouse |
"
]
},
"metadata": {
"application/vnd.databricks.v1+output": {
"addedWidgets": {},
"aggData": [],
"aggError": "",
"aggOverflow": false,
"aggSchema": [],
"aggSeriesLimitReached": false,
"aggType": "",
"arguments": {},
"columnCustomDisplayInfos": {},
"data": [
[
"Rabbit",
"Brown",
"7",
"Warehouse"
],
[
"Horse",
"Green",
"28",
"Dock"
],
[
"Pig",
"Orange",
"17",
"Port"
],
[
"Cow",
"Blue",
"23",
"Warehouse"
],
[
"Bird",
"Yellow",
"2",
"Dock"
],
[
"Dog",
"Brown",
"10",
"Port"
],
[
"Carrot^Man",
"Orange",
"22",
"Warehouse"
]
],
"datasetInfos": [],
"dbfsResultPath": null,
"isJsonSchema": true,
"metadata": {},
"overflow": false,
"plotOptions": {
"customPlotOptions": {},
"displayType": "table",
"pivotAggregation": null,
"pivotColumns": null,
"xColumns": null,
"yColumns": null
},
"removedWidgets": [],
"schema": [
{
"metadata": "{}",
"name": "Name",
"type": "\"string\""
},
{
"metadata": "{}",
"name": "Color",
"type": "\"string\""
},
{
"metadata": "{}",
"name": "Size",
"type": "\"string\""
},
{
"metadata": "{}",
"name": "Origin",
"type": "\"string\""
}
],
"type": "table"
}
},
"output_type": "display_data"
}
],
"source": [
"rdd = (sc\n",
" .textFile(\"data.txt\")\n",
" .map(lambda x: x.replace('|^','|').replace('^|','|').split('|'))\n",
" )\n",
"\n",
"header = rdd.first()\n",
"\n",
"dff = rdd.filter(lambda row : row != header).toDF(header)\n",
"\n",
"display(dff)"
]
}
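{
"cell_type": "markdown",
"metadata": {},
"source": [
"For completeness, here is a sketch of the same idea using only the DataFrame API, so we never drop down to RDDs: read the raw lines with `spark.read.text`, normalize the delimiters with `regexp_replace`, split on the pipe, and promote the first row to column names. As with the RDD version, this assumes the header is the first line of the file; the `lines`, `parts`, and `dff3` names are just for illustration."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# DataFrame-only take on Option 2\n",
"from pyspark.sql.functions import regexp_replace, split, col\n",
"\n",
"lines = spark.read.text(\"/data.txt\")\n",
"# Replace '|^' and '^|' with a plain '|', then split each line on the pipe\n",
"parts = lines.select(split(regexp_replace(\"value\", '\\\\|\\\\^|\\\\^\\\\|', '|'), '\\\\|').alias(\"fields\"))\n",
"\n",
"header = parts.first()[\"fields\"]\n",
"dff3 = parts.select([col(\"fields\")[i].alias(name) for i, name in enumerate(header)])\n",
"# Drop the row holding the header values themselves\n",
"dff3 = dff3.filter(col(header[0]) != header[0])\n",
"\n",
"display(dff3)"
]
}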
],
"metadata": {
"application/vnd.databricks.v1+notebook": {
"dashboards": [],
"language": "python",
"notebookMetadata": {
"pythonIndentUnit": 2
},
"notebookName": "Custom Delimiter",
"widgets": {}
}
},
"nbformat": 4,
"nbformat_minor": 0
}