Push-based model for consuming (realtime) GBFS data #630

Open
1 of 3 tasks
testower opened this issue Apr 22, 2024 · 11 comments

@testower
Contributor

testower commented Apr 22, 2024

What is the issue and why is it an issue?

Using poll-based consumption (the current situation) for real-time data has several challenges.

  • The specification states that real-time feeds should be updated as often as possible.
    • This means they should have a ttl (time-to-live) value of 0. Ideally, this means consumers should poll infinitely often, to stay up-to-date.
    • This is, of course, not possible. Consumers will necessarily poll on a finite interval. Choosing the appropriate interval will depend on various factors, mostly related to available computing and bandwidth resources, as well as the total number of feeds that the consumer needs to poll.
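The trade-off in the bullets above can be made concrete with a minimal sketch. It assumes a parsed GBFS feed with the standard `ttl` field; the helper names and the 5-second floor are hypothetical and only illustrate that, with `ttl` at 0, the polling interval ends up being chosen by the consumer rather than by the feed.

```python
SECONDS_PER_DAY = 86_400


def next_poll_delay(feed: dict, min_interval_s: float = 5.0) -> float:
    """Return how long to wait before polling again.

    `feed` is a parsed GBFS file; `ttl` is read from it. `min_interval_s`
    is a consumer-side floor (an assumption, not part of the spec) that
    keeps a ttl of 0 from turning into a busy loop.
    """
    ttl = float(feed.get("ttl", 0))
    return max(ttl, min_interval_s)


def polls_per_day(delay_s: float) -> float:
    """Number of requests per feed per day at a fixed polling delay."""
    return SECONDS_PER_DAY / delay_s
```

With a `ttl` of 0 and a 5-second floor, a single feed already costs 17,280 requests per day, whether or not anything changed between polls.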

Consumer side

There is an inherent conflict within this decision process: Consumers don’t want to poll too infrequently, because that increases the likelihood that data will be stale and that incorrect information is shown to users.

At the same time, polling too frequently is a potential waste of resources, depending on how often data is refreshed. Consumers may also face rate-limiting policies from producers (I have first-hand experience with this).

In the end, we have to decide between over-fetching and stale data, and it will never be better than a mere compromise.

Producer side

Frequent polling of large payloads hogs resources and pushes producers to introduce complexity like caching and CDNs. Having consumers poll at an interval close to 0 seconds is resource-intensive and costly for the data producer, and they face the risk of lost revenue if consumers poll too infrequently.

We must further consider that large payloads often contain only minor changes relative to the totality of the information, causing an additional waste of resources as unchanged data has to be computed again.

Cloud computing contributes to greenhouse gas emissions on a massive scale. Allocated resources are generally underutilised and unnecessary computing is extremely wasteful on the financial side, as well as damaging on the environmental side.

Potential solutions

I would like to open up a community discussion on how to solve this challenge by generic and scalable means. Individual arrangements between consumers and producers are not sustainable, and finding a common solution will benefit the community as a whole and help the standard grow.

I don’t want to constrain the solutions from the outset, but I think potential solutions fall into the following 3 broad categories:

  1. Continue to use a polling-based model but encourage better use of cache headers and not-modified responses.
  2. Use a push-based model without an intermediary, with technologies like WebSocket or Server-Sent Events.
  3. Use a push-based model with an intermediary message broker, with technologies like AMQP, Pub/Sub, Kafka, MQTT etc.

Personally, I think the second category offers the right trade-off between added complexity and added value. In particular, Server-Sent Events seem promising as a theoretical extension of existing endpoints. It should also be noted that options 1 and 2 can co-exist, i.e. producers can continue to support the polling-based method for real-time feeds, and improve upon it, while at the same time supporting a push-based model.
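To give a feel for what consuming Server-Sent Events involves on the client side, here is a minimal sketch of a parser for the `text/event-stream` wire format. It only handles the `event`, `data` and `id` fields and ignores `retry`; the event name and payload fields used in the usage note are hypothetical, since nothing in this issue defines them yet.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per event from the lines of a text/event-stream body."""
    event_type, data_lines, last_id = "message", [], None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it has any data.
            if data_lines:
                yield {
                    "event": event_type,
                    "data": "\n".join(data_lines),
                    "id": last_id,
                }
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)
        elif field == "id":
            last_id = value  # persists across events until overwritten
```

A consumer could feed this with the decoded lines of a long-lived HTTP response, for example a stream carrying `event: vehicle_status` followed by one or more `data:` lines of JSON, and parse each event's `data` with `json.loads`.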

Still, there is another axis to consider: for any given update, what is the size of the delta of that update? There is potentially a very large upside to precomputing and shipping only what has actually changed, rather than always transferring everything. On the other hand, it requires us to introduce new semantics to communicate the contents of the delta to consumers, e.g. what has been added, what has changed and what was removed.
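As a rough illustration of what such delta semantics could look like, the sketch below diffs two snapshots keyed by an identifier such as a vehicle ID and applies the result on the consumer side. The `added` / `changed` / `removed` envelope is an assumption made for this example, not a proposed format.

```python
from typing import Any, Dict

# A snapshot maps an entity id (e.g. a vehicle id) to its current record.
Snapshot = Dict[str, Dict[str, Any]]


def compute_delta(old: Snapshot, new: Snapshot) -> dict:
    """Producer side: describe how to get from `old` to `new`."""
    return {
        "added": {k: v for k, v in new.items() if k not in old},
        "changed": {k: v for k, v in new.items() if k in old and old[k] != v},
        "removed": sorted(k for k in old if k not in new),
    }


def apply_delta(state: Snapshot, delta: dict) -> Snapshot:
    """Consumer side: return a new snapshot with the delta applied."""
    result = dict(state)
    result.update(delta["added"])
    result.update(delta["changed"])
    for key in delta["removed"]:
        result.pop(key, None)
    return result
```

In this sketch a changed record is shipped whole; a finer-grained format could send only the changed fields, at the cost of more complex semantics. Either way, a consumer that misses a delta needs some way to resynchronise from a full snapshot.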

I’m looking forward to hearing what the community has to say about this. I will use your feedback to work on a proposal for a standard way to deal with the problems outlined here.

Is your potential solution a breaking change?

  • Yes
  • No
  • Unsure
@leonardehrenfried
Contributor

This is a great proposal and I think this would be very interesting for aggregators and their consumers.

I also think that the best cost/benefit ratio would be to have some form of HTTP-based event system, like WebSockets.

@skinkie

skinkie commented May 22, 2024

The problem with WebSocket and Server-Sent Events is that they still require a non-native implementation as a backend. Having a single (preferably well standardized) interface like MQTT (ISO/IEC 20922:2016) gives, in my opinion, a much better basis for a standardisation effort. That said, it would require a topic structure that allows for partial updates. In addition, because retained messages stay available on the broker, it also supports connecting to a server and getting back the clean state.

As a producer we are willing to provide an MQTT implementation for evaluation.
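To make the idea of a topic structure with retained state more tangible, here is a small in-memory model of the relevant MQTT behaviour: one topic per entity, the last message on each topic retained, an empty payload clearing it, and `+` / `#` wildcards in subscriptions. This is not a broker or a client, and the `gbfs/<system_id>/vehicle_status/<vehicle_id>` layout in the usage note is purely hypothetical.

```python
from typing import Dict


def topic_matches(topic_filter: str, topic: str) -> bool:
    """MQTT-style matching: '+' matches one level, '#' matches the rest."""
    f_parts, t_parts = topic_filter.split("/"), topic.split("/")
    for i, part in enumerate(f_parts):
        if part == "#":
            return True
        if i >= len(t_parts):
            return False
        if part != "+" and part != t_parts[i]:
            return False
    return len(f_parts) == len(t_parts)


class RetainedStore:
    """Keeps the last retained payload per topic, as a broker would."""

    def __init__(self) -> None:
        self._retained: Dict[str, bytes] = {}

    def publish(self, topic: str, payload: bytes) -> None:
        if payload == b"":
            # A zero-length retained message clears the topic.
            self._retained.pop(topic, None)
        else:
            self._retained[topic] = payload

    def snapshot(self, topic_filter: str) -> Dict[str, bytes]:
        """What a newly connected subscriber would receive for a filter."""
        return {
            topic: payload
            for topic, payload in self._retained.items()
            if topic_matches(topic_filter, topic)
        }
```

Under these assumptions, a producer would publish each vehicle update to `gbfs/oslo/vehicle_status/<vehicle_id>` and publish an empty payload when a vehicle disappears. A consumer subscribing to `gbfs/oslo/vehicle_status/+` would first receive the current state of every vehicle and then only the partial updates.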

@testower
Contributor Author

Thanks @leonardehrenfried and @skinkie, I think it's great that we have some opposing views here.

@skinkie could you perhaps elaborate what you mean by "non-native implementation", because I didn't quite understand the argument.

@skinkie

skinkie commented May 22, 2024

@skinkie could you perhaps elaborate what you mean by "non-native implementation", because I didn't quite understand the argument.

Imagine you would need a scalable solution for distribution. Internally that will be a publish-subscribe pattern. WebSockets and SSE are web technologies and not per se the transport protocol used within an enterprise-grade publish-subscribe system. Surely you could run your own protocol over WebSockets, including MQTT, but why not go for the native route?

Our experience with "GTFS-RT Differential", where we implemented WebSockets because they were mentioned as being a standardised web technology, is that it has not resulted in any operational commercial GTFS-RT client in the past ten years. My personal preference would be going for MQTT, since other transport organisations such as VDV (Germany) have also embraced MQTT in favor of their own distribution protocols. Our own implementation is using ZeroMQ, so it is also not that we are pushing our own choices.

@testower
Contributor Author

Thanks for your insight @skinkie

@matt-wirtz

matt-wirtz commented Jul 4, 2024

Thx @testower for bringing this up.
As a consumer we do have the problem of stale data from time to time, as described already: users might be looking for vehicles which we still show as available. Or the opposite: a shared vehicle is not shown as an option in trip search results because the newly available vehicle is not yet part of our data.
So we would also be interested in a good, push-based approach. Using MQTT as the transport mechanism also sounds reasonable.
@skinkie could you provide an example where MQTT is already applied in a VDV-defined interface?

@skinkie

skinkie commented Jul 4, 2024

@matt-wirtz VDV-435-IoM.

@mobilitydataio
Contributor

This discussion has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs. Thank you for your contributions.

@skinkie

skinkie commented Sep 3, 2024

Keep open.

@richfab
Contributor

richfab commented Sep 26, 2024

This will be the topic of a workshop at the Mobility Data Summit in Montreal, Oct 30-31 2024.
We will discuss possible technical developments.
Workshop title: Achieving Optimal Efficiency in GBFS Data Exchanges.

Please contact me at fabien@mobilitydata.org if you have any questions about the Summit.

cc @skinkie @leonardehrenfried @matt-wirtz for visibility

@mobilitydataio
Contributor

This discussion has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs. Thank you for your contributions.
