Task
This project is designed to give you an opportunity to gain experience in programming systems in the Hadoop ecosystem. In this case, we use Spark to analyze taxi rides within New York.
The data set covers one month and records the time and location of each trip’s start (pickup) and end (dropoff). In the following, a trip refers to one such record.
The general question is: Can we match trips and return trips? For a given trip a, we consider another trip b a return trip if and only if all of the following hold:
- b’s pickup time is within 8 hours after a’s dropoff time
- b’s pickup location is within r meters of a’s dropoff location
- b’s dropoff location is within r meters of a’s pickup location
where r is a distance in meters between 50 and 200.
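The three conditions can be written as a single predicate. The following is a minimal sketch in plain Python, not a Spark implementation: the Trip class and its field names are hypothetical (the real data set's columns may differ), distances use the haversine formula with an assumed mean Earth radius, and the choice to require a strictly positive time gap and to include the 8-hour boundary is an assumption about how "within 8 hours after" should be read.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumption of this sketch)
MAX_GAP_S = 8 * 3600          # "within 8 hours", expressed in seconds


@dataclass(frozen=True)
class Trip:
    # Hypothetical field names; times are seconds since an arbitrary epoch.
    pickup_time: float
    dropoff_time: float
    pickup_lat: float
    pickup_lon: float
    dropoff_lat: float
    dropoff_lon: float


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def is_return_trip(a, b, r):
    """True if b meets all three return-trip conditions with respect to a."""
    gap = b.pickup_time - a.dropoff_time
    return (
        0 < gap <= MAX_GAP_S  # boundary handling is an assumption
        and haversine_m(b.pickup_lat, b.pickup_lon, a.dropoff_lat, a.dropoff_lon) <= r
        and haversine_m(b.dropoff_lat, b.dropoff_lon, a.pickup_lat, a.pickup_lon) <= r
    )
```

Checking the time gap first keeps the two more expensive distance computations off the common path where b is simply too late or too early.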
To compute the return trips, you may want to break the problem down into the following series of problems:
- Given the (lat, lon) coordinates
  • a(40.79670715332031, −73.97093963623047)
  • b(40.789649963378906, −73.94803619384766)
  • c(40.73122024536133, −73.9823226928711)
  which trips have dropoff locations within r meters of a, b, or c?
- For each trip a in the dataset, compute the trips that have a pickup location within r meters of a’s dropoff location. These are the return trip candidates.
- For all trips a in the dataset, compute all trips that may have been return trips for a.
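Comparing every trip with every other trip becomes expensive for a month of data, so the candidate step is a natural place to cut down the search space. The sketch below shows one possible approach, grid bucketing, in plain Python; it is not prescribed by the task and is not a Spark implementation. Trips are dicts with hypothetical keys, MAX_ABS_LAT is an assumed bound comfortably above any latitude in the data, and the Earth radius is an assumed mean value. Each pickup is placed in a grid cell whose sides are at least r meters long, so any pickup within r meters of a dropoff must lie in the dropoff's cell or one of its eight neighbors.

```python
from collections import defaultdict
from math import asin, cos, floor, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0                  # assumed mean Earth radius
M_PER_DEG_LAT = radians(1.0) * EARTH_RADIUS_M  # meters per degree of latitude
MAX_ABS_LAT = 45.0  # assumption: every latitude in the data is well below this


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dlmb = radians(lon2 - lon1)
    h = sin((phi2 - phi1) / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def cell_of(lat, lon, r):
    """Index of the grid cell containing (lat, lon); cell sides are >= r meters."""
    dlat = r / M_PER_DEG_LAT
    # Longitude degrees shrink with latitude, so size cells for the worst case.
    dlon = r / (M_PER_DEG_LAT * cos(radians(MAX_ABS_LAT)))
    return floor(lat / dlat), floor(lon / dlon)


def return_candidates(trips, r):
    """Map each trip index i to the sorted indices of other trips whose pickup
    lies within r meters of trip i's dropoff."""
    buckets = defaultdict(list)
    for j, t in enumerate(trips):
        buckets[cell_of(t["pickup_lat"], t["pickup_lon"], r)].append(j)

    result = {}
    for i, a in enumerate(trips):
        ci, cj = cell_of(a["dropoff_lat"], a["dropoff_lon"], r)
        hits = []
        # Only the dropoff's own cell and its 8 neighbors can contain matches.
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                for j in buckets.get((ci + di, cj + dj), ()):
                    b = trips[j]
                    if j != i and haversine_m(
                        b["pickup_lat"], b["pickup_lon"],
                        a["dropoff_lat"], a["dropoff_lon"],
                    ) <= r:
                        hits.append(j)
        result[i] = sorted(hits)
    return result
```

The cell index is an exact-match key, which is why this idea may carry over to a distributed setting: a distance condition cannot be used directly as an equality join key, but a cell index can, with the exact distance check applied afterwards as a filter. The remaining time and pickup-location conditions would then be checked only on the surviving candidates.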