Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: GTFS feed import #1013

Merged
merged 36 commits into from
Oct 18, 2024
Merged

Conversation

jzimbel-mbta
Copy link
Member

@jzimbel-mbta jzimbel-mbta commented Sep 19, 2024

Summary of changes

Asana Ticket: [extra] 🏹 Implement /import-GTFS Arrow endpoint

This adds the rest of the code for GTFS import, started in #1006.

I had intended to split the remaining work into more than 1 PR, but hopefully it will be easy enough to review it commit-by-commit. Happy to break it apart if that would be preferable to the reviewer.

Test-running the import

With Arrow running locally and a GTFS archive saved locally, you can run the import with the following:

# Change to 'validate' to hit the validate endpoints instead
ACTION=import
ARCHIVE_PATH="$HOME/Downloads/MBTA_GTFS.zip"

# Enqueue the import job
job_id=$(
  curl -X POST localhost:4000/api/gtfs/$ACTION \
    --header "x-api-key: $ARROW_API_KEY" \
    --header "Content-Type: application/zip" \
    --data-binary "@$ARCHIVE_PATH" \
  | jq .id
)

# Check status of the job until it's something other than "executing"
while true; do
  result=$(
    curl -s -X GET localhost:4000/api/gtfs/$ACTION/${job_id}/status \
      --header "x-api-key: $ARROW_API_KEY" \
    | jq .status
  )

  echo $result
  if [[ $result != "\"executing\"" ]]; then break; else sleep 1; fi
done

(It assumes the import job is successfully enqueued in the first request, so you'll need to adjust things a bit if you're testing import of an invalid archive)

Reviewer Checklist

  • Meets ticket's acceptance criteria
  • Any new or changed functions have typespecs
  • Tests were added for any new functionality (don't just rely on Codecov)
  • This branch was deployed to the staging environment and is currently running with no unexpected increase in warnings, and no errors or crashes.

@jzimbel-mbta jzimbel-mbta requested review from a team and Whoops and removed request for a team September 19, 2024 18:35
@jzimbel-mbta jzimbel-mbta changed the title Jz import gtfs endpoint ecto schemas feat: GTFS feed import Sep 19, 2024
@Whoops
Copy link
Collaborator

Whoops commented Sep 19, 2024

Drive by commentary based on a skimming of the implementation:

  • On directions, how would you feel about a quick GTFS PR to clean up directions that don't have routes in the feed? I'm pretty confident you'd just have to remove them here, right after we remove routes that aren't scheduled.
  • Ecto has insert_all/3 that I believe will outperform doing multiple inserts, since it can load everything up, then check the constraints and update indexes instead of having to do it multiple times for individual insters. It takes maps instead of changesets, but I think we're relying on the database for validations, so I think that's a worthwhile tradeoff.
  • If we really wanted to golf this, there are other things we could do like dropping constraints and re-adding them
    after import (SQL Server actually has a way of disabling constraints during a transaction and rechecking them at the end, but sadly Postgres does not), but I think increasing the timeout is fine. This'll only run on deploys and we already wait ~15 minutes for Alerts UI to validate. We can always try to bring it down later if it becomes annoying.
  • It might be good to run ANALYZE on these tables after import.

@jzimbel-mbta
Copy link
Member Author

jzimbel-mbta commented Sep 23, 2024

@Whoops Re: directions.txt rows referencing nonexistent routes,

Based on what I see here, it seems like at least for shuttle routes we make an exception and keep rows in directions.txt that reference nonexistent Shuttle-* routes. Is there a reason for that?

(I'm not sure why directions referencing bus route 602 doesn't get removed by this line though, they definitely should be)

@Whoops
Copy link
Collaborator

Whoops commented Sep 23, 2024

Based on what I see here, it seems like at least for shuttle routes we make an exception and keep rows in directions.txt that reference nonexistent Shuttle-* routes. Is there a reason for that?

Unless I'm misreading something that doesn't protect Shuttle-* just Shuttle-Generic (it does the same for routes just above). I'm not sure why, but since the route is kept, it shouldn't be causing this issue. On the flip side, we're already truncating out these shapes I'm not sure why they are ending up in the feed. I'm between tickets at the moment, I'll probably drop in a ticket and look into this today.

jzimbel-mbta and others added 4 commits September 26, 2024 14:45
…constraints; define and use Importable behaviour
…1014)

* fix: increase length to 100MB for multipart uploads in Plug.Parsers

* ci: upgrade Github Action upload-artifact to fix deprecation error
Copy link
Collaborator

@Whoops Whoops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very non-blocking commentary: if we went all-in on CSV import, I think that would allow for the removal of a lot of code. But of course that conflicts with what we just discussed around services.

lib/arrow/gtfs.ex Outdated Show resolved Hide resolved
lib/arrow/gtfs.ex Show resolved Hide resolved
lib/arrow/gtfs/import_helper.ex Show resolved Hide resolved
lib/arrow/gtfs/route.ex Show resolved Hide resolved
lib/arrow/gtfs/types/enum.ex Outdated Show resolved Hide resolved
lib/arrow/gtfs/stop_time.ex Outdated Show resolved Hide resolved
lib/arrow_web/controllers/api/gtfs_import_controller.ex Outdated Show resolved Hide resolved
lib/arrow_web/controllers/api/gtfs_import_controller.ex Outdated Show resolved Hide resolved
@jzimbel-mbta jzimbel-mbta force-pushed the jz-import-gtfs-endpoint--ecto-schemas branch from 23c2405 to 119887a Compare October 1, 2024 20:57
Copy link
Collaborator

@Whoops Whoops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

@jzimbel-mbta jzimbel-mbta merged commit e502c02 into master Oct 18, 2024
12 checks passed
@jzimbel-mbta jzimbel-mbta deleted the jz-import-gtfs-endpoint--ecto-schemas branch October 18, 2024 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants