![NYC Logo](https://www.nyc.gov/assets/tlc/images/content/pages/home/nyc-tlc-logo.svg)

# Downloading NYC Taxi Data

When working on a Spark project, it's often handy to have a really big dataset to work with.  One very popular, very large dataset is the
[NYC Taxi](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) dataset.  This data is provided by the New York City Taxi & Limousine Commission.  It contains data
on individual taxi trips taken between 2009 and 2023.  The data is provided as Parquet files and is stored in a public Amazon S3 bucket.

To use the data in Spark, you will need to download it to your own HDFS-compatible storage.  The script below assumes that you've mounted your target storage location in DBFS (the Databricks File System).

Note that you can run this code multiple times.  It will skip any files that you've already downloaded and only add new files that it discovers.
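
The skip-on-rerun behavior can be sketched as a small helper.  This is an illustrative sketch only, not the code this notebook runs: `download_if_missing` and its `fetch` parameter are hypothetical names.  Writing to a temporary name and then renaming is an added assumption, meant to keep an interrupted transfer from leaving a partial file that a later run would skip.

```python
import os
import urllib.request

def download_if_missing(url, filename, fetch=urllib.request.urlretrieve):
    """Download url to filename unless filename already exists.

    Returns True if a download happened, False if it was skipped.
    `fetch` is any callable taking (url, path); it is injectable so the
    helper can be exercised without network access.
    """
    if os.path.exists(filename):
        return False

    # Download under a temporary name so a failed transfer never leaves
    # a file that looks complete to the existence check above.
    tmp_name = filename + ".part"
    try:
        fetch(url, tmp_name)
        os.replace(tmp_name, filename)
    except Exception:
        # Clean up any partial file, then let the caller see the error.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return True
```

Calling the helper twice with the same target downloads once and skips the second time.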

In [0]:
# The location where you've mounted the target storage
# IMPORTANT: This script will be accessing the files using native Python libraries, NOT Spark libraries.  Therefore, the path must start with "/dbfs/"
output_path = "/dbfs/mnt/taxi/parquet-raw"

# The most recent month for which data is available.  (Check the web site linked above to see if new data has been made available.)
max_month = "2023-02"
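
When you later read these files with Spark, the same location is addressed with a `dbfs:/` URI rather than the `/dbfs/` local path.  A minimal sketch of that conversion, assuming the standard Databricks mapping between the two forms; `to_spark_path` is a hypothetical helper name and is not used elsewhere in this notebook.

```python
def to_spark_path(local_path):
    """Convert a '/dbfs/...' local path into the 'dbfs:/...' URI form."""
    prefix = "/dbfs/"
    if not local_path.startswith(prefix):
        raise ValueError(f"Expected a path starting with {prefix!r}: {local_path!r}")
    # Drop the local mount prefix and re-attach the remainder to the URI scheme.
    return "dbfs:/" + local_path[len(prefix):]

print(to_spark_path("/dbfs/mnt/taxi/parquet-raw"))  # dbfs:/mnt/taxi/parquet-raw
```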

Next, we build a Python list of all possible year/month combinations.  We'll filter this down to the exact range of dates that we need later.

In [0]:
all_months = []
for y in range(2000, 2050):
  for m in range(1, 13):
    month = "{0}-{1:02d}".format(y, m)
    all_months.append(month)
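
The later filtering step works because zero-padded "YYYY-MM" strings sort in chronological order, so ordinary string comparison is enough to select a date range.  A small self-contained sketch of that idea (the `months_between` name is hypothetical and used only for illustration):

```python
def months_between(first, last):
    """Return every 'YYYY-MM' string from first to last, inclusive."""
    candidates = [
        "{0}-{1:02d}".format(y, m)
        for y in range(2000, 2050)
        for m in range(1, 13)
    ]
    # String comparison matches date order because both fields are zero-padded.
    return [c for c in candidates if first <= c <= last]

print(months_between("2022-11", "2023-02"))
# ['2022-11', '2022-12', '2023-01', '2023-02']
```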

The NYC Taxi dataset is divided by taxi type:
 - "yellow" taxicabs can be hailed by people on the street
 - "fhv" (or "For-Hire Vehicles") are ride-sharing services, black car services, or luxury limo services.  Rides in these vehicles must be pre-arranged with a dispatching service.  They cannot pick up street hails.
 - "green" taxicabs were introduced in 2013.  They are a hybrid between yellow taxis and FHVs.  They can accept street hails in certain parts of the city that are often under-served by yellow taxis.  They can also be pre-arranged by a dispatching service.
 - "fhvhv" (or "High Volume For-Hire Vehicles") are for-hire services that dispatch more than 10,000 trips per day, such as Uber and Lyft.  Their trips have been reported separately from the other FHV data since February 2019.

Each taxi type has a different schema and covers a different range of months.  For example, data for yellow taxis goes back to 2009, but FHV data is only provided from 2015 to the present.

Below, we declare the types of taxis and the date ranges for which their files are available.  If a new type of taxi data becomes available, just add it here, and the rest of
this script should pick it up.

In [0]:
taxi_types = {
  'yellow': ("2009-01", max_month),
  'green':  ("2013-08", max_month),
  'fhv':    ("2015-01", max_month),
  'fhvhv':  ("2019-02", max_month)
}
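
As a quick sanity check, the ranges above determine how many monthly files to expect for each taxi type.  The sketch below is illustrative only: it redefines `max_month` and `taxi_types` so that it runs on its own, and `month_count` is a hypothetical helper that is not part of this notebook.

```python
max_month = "2023-02"
taxi_types = {
    'yellow': ("2009-01", max_month),
    'green':  ("2013-08", max_month),
    'fhv':    ("2015-01", max_month),
    'fhvhv':  ("2019-02", max_month),
}

def month_count(first, last):
    """Number of months from first to last ('YYYY-MM'), inclusive."""
    first_year, first_month = map(int, first.split("-"))
    last_year, last_month = map(int, last.split("-"))
    return (last_year - first_year) * 12 + (last_month - first_month) + 1

expected = {name: month_count(first, last) for name, (first, last) in taxi_types.items()}
print(expected)
# {'yellow': 170, 'green': 115, 'fhv': 98, 'fhvhv': 49}
```

Comparing these counts with the number of files in each output folder is one way to spot months that failed to download.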

## Downloading the Data

Using Python, we now loop through the different types of taxis and download each type's monthly Parquet files.

In [0]:
import urllib.request
import os

for k in taxi_types.keys():
  print("Processing \"{0}\"  ({1}  through  {2})".format(k, taxi_types[k][0], taxi_types[k][1]))
  # "YYYY-MM" strings sort chronologically, so string comparison selects the date range
  months = [x for x in all_months if x >= taxi_types[k][0] and x <= taxi_types[k][1]]
  
  # Create the output folder for this taxi type if it does not exist yet
  type_path = "{0}/{1}".format(output_path, k)
  os.makedirs(type_path, exist_ok=True)

  for m in months:
    url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{k}_tripdata_{m}.parquet"
    filename = f"{type_path}/{m}.parquet"

    # Do not download the file if we already have a copy of it
    if not os.path.exists(filename):
      try:
        urllib.request.urlretrieve(url, filename)
        print(f"   Downloaded {m}")
      except Exception as err:
        print(f" !!Could not download {m}\n      {err}")
    else:
      print("   Skipped {} (file already exists)".format(m))
    
  print("============================================================")

Processing "yellow"  (2009-01  through  2023-02)
   Skipped 2009-01 (file already exists)
   Skipped 2009-02 (file already exists)
   Skipped 2009-03 (file already exists)
   Skipped 2009-04 (file already exists)
   Skipped 2009-05 (file already exists)
   Skipped 2009-06 (file already exists)
   Skipped 2009-07 (file already exists)
   Skipped 2009-08 (file already exists)
   Skipped 2009-09 (file already exists)
   Skipped 2009-10 (file already exists)
   Skipped 2009-11 (file already exists)
   Skipped 2009-12 (file already exists)
   Skipped 2010-01 (file already exists)
   Skipped 2010-02 (file already exists)
   Skipped 2010-03 (file already exists)
   Skipped 2010-04 (file already exists)
   Skipped 2010-05 (file already exists)
   Skipped 2010-06 (file already exists)
   Skipped 2010-07 (file already exists)
   Skipped 2010-08 (file already exists)
   Skipped 2010-09 (file already exists)
   Skipped 2010-10 (file already exists)
   Skipped 2010-11 (file already exists)
   Skipp