{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "4f6a9629-547c-4083-856c-20a52db30fd8",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "# Using Spark to Parse Text Files With a Custom Delimiter\n",
    "\n",
    "Suppose you have a text file with data that uses non-standard delimiters.  Can we still use Spark to parse that data?  Yes!  We just have to use some custom string maniuplation code.\n",
    "That won't be quite as efficient as using native Spark, but it will get the job done!\n",
    "\n",
    "Let's make a file that has data where some fields use a pipe delimiter `|` and other fields use a combination of pipe and caret `|^my data value^|`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "90bc3e01-5ca6-47ab-90dd-706e941a3d47",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wrote 205 bytes.\nOut[1]: True"
     ]
    }
   ],
   "source": [
    "# Make a small data file\n",
    "\n",
    "testData = \"\"\"Name|Color|Size|Origin\n",
    "Rabbit^|^Brown^|7|^Warehouse\n",
    "Horse^|^Green^|28|^Dock\n",
    "Pig^|^Orange^|17|^Port\n",
    "Cow^|^Blue^|23|^Warehouse\n",
    "Bird^|^Yellow^|2|^Dock\n",
    "Dog^|^Brown^|10|^Port\n",
    "Carrot^Man^|^Orange^|22|^Warehouse\n",
    "\"\"\"\n",
    "\n",
    "dbutils.fs.put(\"data.txt\", testData, True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "57428e77-8523-436b-93f4-554a9a6c1ccb",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "First, let's just read in the data using the `|` as a separator.  This should work, but some of our fields will start and end with a caret."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "386b452e-d252-4add-9159-a62bbf7a6c8d",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "output_type": "display_data",
     "data": {
      "text/html": [
       "<style scoped>\n",
       "  .table-result-container {\n",
       "    max-height: 300px;\n",
       "    overflow: auto;\n",
       "  }\n",
       "  table, th, td {\n",
       "    border: 1px solid black;\n",
       "    border-collapse: collapse;\n",
       "  }\n",
       "  th, td {\n",
       "    padding: 5px;\n",
       "  }\n",
       "  th {\n",
       "    text-align: left;\n",
       "  }\n",
       "</style><div class='table-result-container'><table class='table-result'><thead style='background-color: white'><tr><th>Name</th><th>Color</th><th>Size</th><th>Origin</th></tr></thead><tbody><tr><td>Rabbit^</td><td>^Brown^</td><td>7</td><td>^Warehouse</td></tr><tr><td>Horse^</td><td>^Green^</td><td>28</td><td>^Dock</td></tr><tr><td>Pig^</td><td>^Orange^</td><td>17</td><td>^Port</td></tr><tr><td>Cow^</td><td>^Blue^</td><td>23</td><td>^Warehouse</td></tr><tr><td>Bird^</td><td>^Yellow^</td><td>2</td><td>^Dock</td></tr><tr><td>Dog^</td><td>^Brown^</td><td>10</td><td>^Port</td></tr><tr><td>Carrot^Man^</td><td>^Orange^</td><td>22</td><td>^Warehouse</td></tr></tbody></table></div>"
      ]
     },
     "metadata": {
      "application/vnd.databricks.v1+output": {
       "addedWidgets": {},
       "aggData": [],
       "aggError": "",
       "aggOverflow": false,
       "aggSchema": [],
       "aggSeriesLimitReached": false,
       "aggType": "",
       "arguments": {},
       "columnCustomDisplayInfos": {},
       "data": [
        [
         "Rabbit^",
         "^Brown^",
         "7",
         "^Warehouse"
        ],
        [
         "Horse^",
         "^Green^",
         "28",
         "^Dock"
        ],
        [
         "Pig^",
         "^Orange^",
         "17",
         "^Port"
        ],
        [
         "Cow^",
         "^Blue^",
         "23",
         "^Warehouse"
        ],
        [
         "Bird^",
         "^Yellow^",
         "2",
         "^Dock"
        ],
        [
         "Dog^",
         "^Brown^",
         "10",
         "^Port"
        ],
        [
         "Carrot^Man^",
         "^Orange^",
         "22",
         "^Warehouse"
        ]
       ],
       "datasetInfos": [],
       "dbfsResultPath": null,
       "isJsonSchema": true,
       "metadata": {},
       "overflow": false,
       "plotOptions": {
        "customPlotOptions": {},
        "displayType": "table",
        "pivotAggregation": null,
        "pivotColumns": null,
        "xColumns": null,
        "yColumns": null
       },
       "removedWidgets": [],
       "schema": [
        {
         "metadata": "{}",
         "name": "Name",
         "type": "\"string\""
        },
        {
         "metadata": "{}",
         "name": "Color",
         "type": "\"string\""
        },
        {
         "metadata": "{}",
         "name": "Size",
         "type": "\"string\""
        },
        {
         "metadata": "{}",
         "name": "Origin",
         "type": "\"string\""
        }
       ],
       "type": "table"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "display( spark.read.option(\"header\", \"true\").option(\"sep\", \"|\").csv(\"/data.txt\") )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "2394acac-c29b-4be9-95a6-98417b05d20d",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "### Option 1:  Split the Data and Remove the Carets\n",
    "\n",
    "One option is to split the data (as we did in our quick example above) and then loop through each column, removing any carets at the beginning and end of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "7e84a372-8b10-4952-94fb-f179f04150e4",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "output_type": "display_data",
     "data": {
      "text/html": [
       "<style scoped>\n",
       "  .table-result-container {\n",
       "    max-height: 300px;\n",
       "    overflow: auto;\n",
       "  }\n",
       "  table, th, td {\n",
       "    border: 1px solid black;\n",
       "    border-collapse: collapse;\n",
       "  }\n",
       "  th, td {\n",
       "    padding: 5px;\n",
       "  }\n",
       "  th {\n",
       "    text-align: left;\n",
       "  }\n",
       "</style><div class='table-result-container'><table class='table-result'><thead style='background-color: white'><tr><th>Name</th><th>Color</th><th>Size</th><th>Origin</th></tr></thead><tbody><tr><td>Rabbit</td><td>Brown</td><td>7</td><td>Warehouse</td></tr><tr><td>Horse</td><td>Green</td><td>28</td><td>Dock</td></tr><tr><td>Pig</td><td>Orange</td><td>17</td><td>Port</td></tr><tr><td>Cow</td><td>Blue</td><td>23</td><td>Warehouse</td></tr><tr><td>Bird</td><td>Yellow</td><td>2</td><td>Dock</td></tr><tr><td>Dog</td><td>Brown</td><td>10</td><td>Port</td></tr><tr><td>Carrot^Man</td><td>Orange</td><td>22</td><td>Warehouse</td></tr></tbody></table></div>"
      ]
     },
     "metadata": {
      "application/vnd.databricks.v1+output": {
       "addedWidgets": {},
       "aggData": [],
       "aggError": "",
       "aggOverflow": false,
       "aggSchema": [],
       "aggSeriesLimitReached": false,
       "aggType": "",
       "arguments": {},
       "columnCustomDisplayInfos": {},
       "data": [
        [
         "Rabbit",
         "Brown",
         "7",
         "Warehouse"
        ],
        [
         "Horse",
         "Green",
         "28",
         "Dock"
        ],
        [
         "Pig",
         "Orange",
         "17",
         "Port"
        ],
        [
         "Cow",
         "Blue",
         "23",
         "Warehouse"
        ],
        [
         "Bird",
         "Yellow",
         "2",
         "Dock"
        ],
        [
         "Dog",
         "Brown",
         "10",
         "Port"
        ],
        [
         "Carrot^Man",
         "Orange",
         "22",
         "Warehouse"
        ]
       ],
       "datasetInfos": [],
       "dbfsResultPath": null,
       "isJsonSchema": true,
       "metadata": {},
       "overflow": false,
       "plotOptions": {
        "customPlotOptions": {},
        "displayType": "table",
        "pivotAggregation": null,
        "pivotColumns": null,
        "xColumns": null,
        "yColumns": null
       },
       "removedWidgets": [],
       "schema": [
        {
         "metadata": "{}",
         "name": "Name",
         "type": "\"string\""
        },
        {
         "metadata": "{}",
         "name": "Color",
         "type": "\"string\""
        },
        {
         "metadata": "{}",
         "name": "Size",
         "type": "\"string\""
        },
        {
         "metadata": "{}",
         "name": "Origin",
         "type": "\"string\""
        }
       ],
       "type": "table"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Option 1\n",
    "dff = spark.read.option(\"header\", \"true\").option(\"inferSchema\", \"true\").option(\"delimiter\", \"|\").csv(\"/data.txt\")\n",
    "\n",
    "from pyspark.sql.functions import regexp_replace\n",
    "\n",
    "dffs_headers = dff.dtypes\n",
    "for i in dffs_headers:\n",
    "    columnLabel = i[0]\n",
    "    newColumnLabel = columnLabel.replace('^','').replace('^','') \n",
    "    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\\\^|\\\\^$', ''))\n",
    "    if columnLabel != newColumnLabel:\n",
    "      dff = dff.drop(columnLabel)\n",
    "    \n",
    "display(dff)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "38a388ce-51cf-40fa-926e-1bfcd98c56c8",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "### Option 2: Replace Carets Near Pipes Then Split\n",
    "\n",
    "A different approach is to read each line of the text file as a row in an RDD.  Then replace any occurances of `|^` or `^|` with a plain pipe.  Then we can split the data based on the pipe.\n",
    "\n",
    "Note that with this approach, we have no way to tell Spark that our first row contains column headers.  So we do an extra step at the end where we filter out the rows that contain\n",
    "the header information.  We then reapply that header information when we convert the RDD to a Spark DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {
      "byteLimit": 2048000,
      "rowLimit": 10000
     },
     "inputWidgets": {},
     "nuid": "925b47a7-ae42-42a3-97ec-701987ecd53d",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [
    {
     "output_type": "display_data",
     "data": {
      "text/html": [
       "<style scoped>\n",
       "  .table-result-container {\n",
       "    max-height: 300px;\n",
       "    overflow: auto;\n",
       "  }\n",
       "  table, th, td {\n",
       "    border: 1px solid black;\n",
       "    border-collapse: collapse;\n",
       "  }\n",
       "  th, td {\n",
       "    padding: 5px;\n",
       "  }\n",
       "  th {\n",
       "    text-align: left;\n",
       "  }\n",
       "</style><div class='table-result-container'><table class='table-result'><thead style='background-color: white'><tr><th>Name</th><th>Color</th><th>Size</th><th>Origin</th></tr></thead><tbody><tr><td>Rabbit</td><td>Brown</td><td>7</td><td>Warehouse</td></tr><tr><td>Horse</td><td>Green</td><td>28</td><td>Dock</td></tr><tr><td>Pig</td><td>Orange</td><td>17</td><td>Port</td></tr><tr><td>Cow</td><td>Blue</td><td>23</td><td>Warehouse</td></tr><tr><td>Bird</td><td>Yellow</td><td>2</td><td>Dock</td></tr><tr><td>Dog</td><td>Brown</td><td>10</td><td>Port</td></tr><tr><td>Carrot^Man</td><td>Orange</td><td>22</td><td>Warehouse</td></tr></tbody></table></div>"
      ]
     },
     "metadata": {
      "application/vnd.databricks.v1+output": {
       "addedWidgets": {},
       "aggData": [],
       "aggError": "",
       "aggOverflow": false,
       "aggSchema": [],
       "aggSeriesLimitReached": false,
       "aggType": "",
       "arguments": {},
       "columnCustomDisplayInfos": {},
       "data": [
        [
         "Rabbit",
         "Brown",
         "7",
         "Warehouse"
        ],
        [
         "Horse",
         "Green",
         "28",
         "Dock"
        ],
        [
         "Pig",
         "Orange",
         "17",
         "Port"
        ],
        [
         "Cow",
         "Blue",
         "23",
         "Warehouse"
        ],
        [
         "Bird",
         "Yellow",
         "2",
         "Dock"
        ],
        [
         "Dog",
         "Brown",
         "10",
         "Port"
        ],
        [
         "Carrot^Man",
         "Orange",
         "22",
         "Warehouse"
        ]
       ],
       "datasetInfos": [],
       "dbfsResultPath": null,
       "isJsonSchema": true,
       "metadata": {},
       "overflow": false,
       "plotOptions": {
        "customPlotOptions": {},
        "displayType": "table",
        "pivotAggregation": null,
        "pivotColumns": null,
        "xColumns": null,
        "yColumns": null
       },
       "removedWidgets": [],
       "schema": [
        {
         "metadata": "{}",
         "name": "Name",
         "type": "\"string\""
        },
        {
         "metadata": "{}",
         "name": "Color",
         "type": "\"string\""
        },
        {
         "metadata": "{}",
         "name": "Size",
         "type": "\"string\""
        },
        {
         "metadata": "{}",
         "name": "Origin",
         "type": "\"string\""
        }
       ],
       "type": "table"
      }
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "rdd = (sc\n",
    "         .textFile(\"data.txt\")\n",
    "         .map(lambda x: x.replace('|^','|').replace('^|','|').split('|'))\n",
    "      )\n",
    "\n",
    "header = rdd.first()\n",
    "\n",
    "dff = rdd.filter(lambda row : row != header).toDF(header)\n",
    "\n",
    "display(dff)"
   ]
  }
 ],
 "metadata": {
  "application/vnd.databricks.v1+notebook": {
   "dashboards": [],
   "language": "python",
   "notebookMetadata": {
    "pythonIndentUnit": 2
   },
   "notebookName": "Custom Delimiter",
   "widgets": {}
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}