
Limitation of Streams in the past [General discussion] #23

Open
perellonieto opened this issue Aug 22, 2017 · 1 comment

Comments

@perellonieto
Member

perellonieto commented Aug 22, 2017

What are the implications of using Streams of data with a timestamp in the future (or at the present moment)?

Currently, if you ask for a Stream to be computed for the current time, it raises the following exception:

File "some_path/site-packages/hyperstream/stream/stream_instance.py", line 39, in __new__
    raise ValueError("Timestamp {} should not be in the future!".format(timestamp))
ValueError: Timestamp 2017-08-22 14:05:00.308326+00:00 should not be in the future!
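
For reference, here is a minimal self-contained sketch of the kind of guard that produces this error. It is illustrative only, not the actual code in stream_instance.py; the namedtuple layout and the utcnow helper are assumptions.

```python
from collections import namedtuple
from datetime import datetime, timedelta, timezone


def utcnow():
    """Current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


class StreamInstance(namedtuple("StreamInstance", ["timestamp", "value"])):
    """Illustrative (timestamp, value) pair that rejects future timestamps."""

    def __new__(cls, timestamp, value):
        # The guard under discussion: timestamps beyond "now" are refused.
        if timestamp > utcnow():
            raise ValueError(
                "Timestamp {} should not be in the future!".format(timestamp))
        return super(StreamInstance, cls).__new__(cls, timestamp, value)


# Reproduces the error above:
# StreamInstance(utcnow() + timedelta(hours=1), value=42)
```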

I can think of three cases where it could be interesting to allow Streams in the future:

  1. Asking a classifier tool to train from now until 1 hour in the future (a rough sketch follows this list).
    • It would be nice to tell a classifier: given the data arriving from a specific source stream, keep training until the specified time.
    • E.g. real-time data from stock exchanges arriving on a real-time stream that keeps yielding values. The model could consume this data whenever it becomes available and train on it.
  2. Some dataset where the timestamps are in the future.
    • I am not sure how plausible this scenario is.
    • But I can imagine someone wanting to use a stream that outputs data carrying some particular future timestamps.
  3. Asking a tool for predictions in the future.
    • A model that makes predictions for the future, given that it has already been trained on past data.
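
For case 1, a rough sketch of what "keep training until a future time" could look like, assuming a hypothetical fetch_latest_batch() source and scikit-learn's SGDClassifier.partial_fit rather than any HyperStream API:

```python
import time
from datetime import datetime, timedelta, timezone

import numpy as np
from sklearn.linear_model import SGDClassifier


def fetch_latest_batch():
    """Hypothetical stand-in for a real-time source stream (e.g. stock ticks)."""
    X = np.random.randn(32, 4)           # 32 new feature vectors
    y = (X[:, 0] > 0).astype(int)        # toy labels
    return X, y


model = SGDClassifier()
end_time = datetime.now(timezone.utc) + timedelta(hours=1)  # train until 1 hour from now

while datetime.now(timezone.utc) < end_time:
    X, y = fetch_latest_batch()
    # Incremental update each time new data arrives, instead of waiting
    # until all the data up to end_time exists before the tool can run.
    model.partial_fit(X, y, classes=[0, 1])
    time.sleep(1)  # poll interval; a push-based stream would remove this
```
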
@tdiethe
Member

tdiethe commented Sep 20, 2017

Interesting cases:
1. If the classifier is incremental/online, then this isn't a problem, as it can just be called with the last weights/latent variable estimates as a parameter (probably stored in a parameter stream; see the sketch below). If it's an offline algorithm that requires a batch of data but is designed to iterate over the data (e.g. SGD), then it could indeed make sense to do this kind of thing rather than wait until all the data is available.
2.+3. I hadn't thought of the forecasting case, but yes, this makes sense too.
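
A minimal sketch of the call pattern in point 1, with a hypothetical online_update function standing in for the tool; the parameter stream is only implied by the weights-in/weights-out signature, and nothing here is HyperStream API:

```python
import numpy as np


def online_update(weights, X, y, lr=0.01):
    """One incremental pass of a plain linear SGD update (squared error).

    `weights` is whatever was stored after the previous call (e.g. in a
    parameter stream); the return value is what would be written back.
    """
    if weights is None:
        weights = np.zeros(X.shape[1])
    for xi, yi in zip(X, y):
        error = xi.dot(weights) - yi
        weights = weights - lr * error * xi   # per-instance gradient step
    return weights


# Each invocation reads the last estimate and writes the new one, so the
# tool never needs the full (partly future) batch of data at once.
weights = None
for _ in range(3):                 # three arrivals of new data
    X = np.random.randn(16, 4)
    y = X @ np.array([1.0, -2.0, 0.5, 0.0])
    weights = online_update(weights, X, y)
```
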

Perhaps you could come up with a couple of simple use cases ... e.g. predicting tomorrow's weather (it can be a dumb predictor that just predicts the same as today), and see what the consequence of removing that constraint would be.
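
A toy version of that use case, assuming a hypothetical persistence_forecast function (again, no HyperStream API involved):

```python
from datetime import datetime, timedelta, timezone


def persistence_forecast(today_timestamp, today_temperature):
    """Dumb forecaster: tomorrow's prediction is just today's observation.

    The interesting part is the returned timestamp, one day in the future --
    exactly the kind of StreamInstance the current check would reject.
    """
    return today_timestamp + timedelta(days=1), today_temperature


now = datetime.now(timezone.utc)
ts, temp = persistence_forecast(now, 21.5)
print("Predicted {}C for {}".format(temp, ts))
```
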

To be honest, I think the main reason the check is there at the moment is just to protect the user from unintentional mistakes, since that was the most common cause of errors.

It's worth noting that by default I think all of the channels are valid up to now(), so you might be able to write to a stream but not read from it.
