Skip to content

Latest commit

 

History

History
26 lines (16 loc) · 5.68 KB

proposal.md

File metadata and controls

26 lines (16 loc) · 5.68 KB

Leading Question

In our final project we plan to use openFlights data set to find the shortest distance between two airports and the possible paths between all the potential destinations. We plan to create a graph with the data from airports.dat with each node representing an airport. We also plan on using routes.dat which contains the details of the path of each flight which is going to be used to direct the nodes of the graph and each edge has the distance between two places using the latitude and longitude information from airports.dat. We plan to traverse this graph using a DFS traversal algorithm and use A* Search algorithm to find the shortest path between the airports. Lastly, we will use the Kruskal Algorithm to find the possible paths between all the potential destinations.

Dataset Acquisition

We plan to use the data from airport.dat from openFlights which is in the form of a DAT file. This data set contains information regarding each airport's location, timezone, latitude, longitude, etc. In this data set we only plan to use the airport id, latitude and longitude. We plan to use the data from routes.dat from openFlights which is in the form of a DAT file. This file contains information regarding the route each flight takes. From this data set we only plan to use source airport id, destination airport id and stops.

We would convert both the DAT files into CSV files as it is easier to select and delete columns and access data in general. In the airport dataset, we would only use the airport ID, latitude, and longitude column and delete the rest. We would have a check to ensure that the latitude and longitude values are valid. These columns don't have any empty entries in the dataset so we would not have to deal with that. Similarly, we would only be using source airport ID, destination airport ID, and stops from routes dataset and we would delete the rest of the data. Also, these columns don't have any empty entries in the dataset so we would not have to deal with that.

As far as the data structure for storing the data we considered a few different solutions. Since we are using a directed graph we initially thought of using an adjacency matrix to store all the data. However, this would be O(N^2) space complexity which is pretty inefficient. The other option was to use an adjacency list which would require less memory. However, we decided to instead create our own classes as it provides more flexibility and allows us to come up with our data structure. There will be a node class which will store the source and information about the airport. The node will also have a vector of a list of edge objects pointing to the various destination airports. It will also include the distance as weights. The space complexity is O(V+E) where V is the vertices and E is the edges.

Algorithm

We first would take the data for the list of routes and convert that to a graph. By parsing through every entry, we get a source airport as well as a destination airport. We will then check if a node for these airports exists already and connect them with an edge structure which includes information like the distance between two airports which serves as weight between them. If the nodes do not exist we will create a node for each of them with information from the airport location data set and add the coordinates and airport code in that node.

The first feature we will be implementing is finding the shortest route distance wise between any two airports. We will be using the A* algorithm to achieve this. The heuristic we would use is the direct distance between the airports which we can compute from their coordinates. For this algorithm we would need only the start node, the destination node, and the heuristic value. The output of this algorithm would be a set of all of the edges of the shortest path from the source to the destination airport. The runtime would be O(b^d) where b is the number of children or possible destination airports in our case and d is the depth of the final solutions. The space complexity would also be the same.

The second feature we would include is requesting the user to input a set of airports/destinations that they would like to visit and we would then return the minimum spanning tree visiting all the airports. The algorithm that would help us with this would be Kruskal's algorithm. This algorithm needs the entire graph which would include a vector of all possible edges and nodes in the graph. The algorithm would then return a set of edges of the final spanning tree. The runtime would then be O(ElogV) where E is the number of edges and V is the number of nodes. The space complexity would be O(logE) as a sort and union is required for the edges.

Finally, Kruskal's algorithm gives us the minimum spanning tree so we would then use DFS to then traverse the tree to print output to the console that shows the airports visited on the trip and other information about each airport and route. The output would be simple text displaying all the information about the airports on the routes.

Timeline

In the first week, we hope to complete the dataset acquisition and process data format, as we already have our data sources, but more time will be allotted to data correction and data storage. Our first major goal is to find the shortest distance between two select airports. This is done by using the A* algorithm, and we hope to have this implemented by the THIRD week. The next target is implementing Kruskal's algorithm to analyze user input and provide a minimum spanning tree for all airports requested, and we hope to have this done by week FOUR. The final portion of our timeline is a DFS to then traverse the tree and output all relevant information to the user, and this should take ONE week.