Skip to content
Erfan Mahmoudi edited this page Dec 3, 2024 · 2 revisions

Delivery Fare Estimation - Project Wiki

Overview

This project implements a high-performance delivery fare estimation system in Go (Golang). The program processes large datasets of GPS data logs, filters out invalid GPS points, calculates distances using the Haversine formula, and estimates the fare for each delivery. It outputs the results in CSV format.

Key features:

  • Efficient fare calculation based on GPS data.
  • Handling of large datasets (multiple gigabytes).
  • Optimized for minimal memory usage and execution time.
  • Written in Go to maximize performance.

Project GitHub Repository


Features & Key Requirements

  • Data Ingestion: The program reads large CSV files containing GPS data logs.
  • Invalid Data Filtering: Filters out invalid GPS points (e.g., unrealistic speeds).
  • Fare Calculation: Computes the delivery fare based on various parameters (distance, time of day, idle time).
  • Handling Large Datasets: Optimized to process large datasets efficiently using chunk-based processing.
  • Single-pass Data Processing: The program processes data in a single pass (O(n)), eliminating the need for concurrency, thus reducing unnecessary overhead.
  • Output: Produces CSV files with filtered GPS data and fare estimates.

Design Approach

1. Data Ingestion & File Reading

The program reads input from a CSV file that contains GPS coordinates and delivery data. Each row in the CSV corresponds to a GPS point with the format:

(id_delivery, lat, lng, timestamp)

Since the input datasets can be quite large (several gigabytes), we avoid loading the entire dataset into memory. Instead, the program reads and processes the data line-by-line using buffered file I/O with a custom buffer size of 1e9 to ensure we read the data efficiently while minimizing memory consumption. This approach minimizes the system memory overhead and maximizes read performance.

2. Processing in One Pass (O(n))

The program reads the CSV file in a single pass, processes each line of data, filters invalid GPS points, calculates the fare, and writes the results to an output file. By eliminating the use of goroutines and reducing unnecessary function calls, the code now runs in O(n) time complexity, which is crucial for efficiently processing large datasets.

3. Filtering Invalid Points

The program filters out GPS points that have unrealistic speeds, which may occur due to sensor errors or data corruption. A GPS point is considered invalid if the speed between two consecutive points exceeds 100 km/h. This check is performed using the Haversine formula to calculate the distance between two latitude/longitude points.

  • Haversine Formula: The Haversine formula is used to compute the distance between two points on the Earth's surface, given their latitude and longitude.

4. Fare Calculation

After filtering out invalid GPS points, the program calculates the delivery fare based on the following parameters:

  • Base fare: A base charge of 1.30 for each delivery.
  • Distance-based fare: Calculated using the Haversine formula for the distance between consecutive GPS points.
  • Time-of-day adjustment: A dynamic rate based on the time of day:
    • Daytime rate (5:00 AM to Midnight)
    • Nighttime rate (Midnight to 5:00 AM)
  • Idle time: Additional charges are applied if the vehicle is stationary (speed <= 10 km/h).
  • Minimum fare: The minimum fare for any delivery is 3.47.

5. File Writing

The program produces two CSV files:

  1. Filtered Data: A file (filtered_data.csv) containing only the valid GPS points for each delivery. This file is located in the output_dataset/ folder.
  2. Fare Estimates: A file (fares.csv) with the format:

Since the input datasets can be quite large (several gigabytes), we avoid loading the entire dataset into memory. Instead, the program reads and processes the data line-by-line using buffered file I/O with a custom buffer size of 1e9 to ensure we read the data efficiently while minimizing memory consumption. This approach minimizes the system memory overhead and maximizes read performance.

2. Processing in One Pass (O(n))

The program reads the CSV file in a single pass, processes each line of data, filters invalid GPS points, calculates the fare, and writes the results to an output file. By eliminating the use of goroutines and reducing unnecessary function calls, the code now runs in O(n) time complexity, which is crucial for efficiently processing large datasets.

3. Filtering Invalid Points

The program filters out GPS points that have unrealistic speeds, which may occur due to sensor errors or data corruption. A GPS point is considered invalid if the speed between two consecutive points exceeds 100 km/h. This check is performed using the Haversine formula to calculate the distance between two latitude/longitude points.

  • Haversine Formula: The Haversine formula is used to compute the distance between two points on the Earth's surface, given their latitude and longitude.

4. Fare Calculation

After filtering out invalid GPS points, the program calculates the delivery fare based on the following parameters:

  • Base fare: A base charge of 1.30 for each delivery.
  • Distance-based fare: Calculated using the Haversine formula for the distance between consecutive GPS points.
  • Time-of-day adjustment: A dynamic rate based on the time of day:
    • Daytime rate (5:00 AM to Midnight)
    • Nighttime rate (Midnight to 5:00 AM)
  • Idle time: Additional charges are applied if the vehicle is stationary (speed <= 10 km/h).
  • Minimum fare: The minimum fare for any delivery is 3.47.

5. File Writing

The program produces two CSV files:

  1. Filtered Data: A file (filtered_data.csv) containing only the valid GPS points for each delivery. This file is located in the output_dataset/ folder.
  2. Fare Estimates: A file (fares.csv) with the format:

id_delivery, fare_estimate

To ensure thread-safe file writing and avoid race conditions, a mutex (sync.Mutex) is used to lock the file during write operations. This prevents data corruption when multiple operations attempt to write to the file simultaneously.

6. Performance and Scalability

The program is optimized for processing large datasets, even those exceeding several gigabytes. By reading the CSV file in a single pass and applying chunk-based processing, the program minimizes memory consumption and ensures that the program can handle large datasets efficiently.

7. Testing

The project includes comprehensive unit tests and end-to-end tests:

  • Unit tests: These tests verify the correctness of individual functions, such as the haversine() function, filterInvalidPoints(), and calculateFare().
  • End-to-end tests: Simulate the entire flow of reading data, filtering invalid points, calculating fares, and writing the results to CSV files. Special cases (e.g., very short deliveries, idle time, and invalid GPS points) are covered.

How to Run the Project

Requirements

  • Go (Golang) version 1.19 or higher
  • An input dataset (located in the input_dataset/ folder)

Running the Program

  1. Clone the repository:
git clone https://github.com/Erfanm83/delivery-fare-estimation.git
cd delivery-fare-estimation
  1. Install dependencies and build the project:
go mod tidy
go build
  1. Run the program on a dataset:
./delivery-fare-estimation <input_file.csv>

The results will be saved in the output_dataset/ folder as:

  • filtered_data.csv
  • fares.csv

Test Datasets

  • medium_data.csv (50 deliveries)
  • large_data.csv (200 deliveries)
  • huge_data.csv (over 390,000 deliveries; file size: 1.19 GB)

Note: To test with the large dataset (huge_data.csv), you will need to extract the huge_data.zip file first, as the dataset is compressed.

Conclusion

This project offers an efficient solution to delivery fare estimation for large-scale datasets, leveraging the Go programming language’s performance capabilities. By using optimized file I/O operations, minimal memory usage, and O(n) processing, it can handle datasets that would otherwise be cumbersome to process in memory.

This repository can serve as a foundation for anyone looking to develop or extend a system for delivery logistics, fare estimation, or other GPS-based data processing applications.

For any questions or contributions, please feel free to open an issue or submit a pull request on the GitHub repository.


Best Practices

  • Modular Code: The code is designed to be modular, with each function focused on a specific task (e.g., filtering data, calculating fare). This ensures maintainability and extensibility.
  • Efficient Memory Use: The program processes the data in a single pass, avoiding the need to load large datasets into memory.
  • Concurrency Safety: By using synchronization mechanisms such as mutexes, the program ensures thread-safe operations when writing to files.