TadGAN does not work with the default setup #5

Open
rruizdeaustri opened this issue Jun 22, 2021 · 17 comments

Comments

@rruizdeaustri

Hi,

I have tried to run the code with the current setup (number of epochs is 30) but I get

File TadGAN/anomaly_detection.py", line 129, in find_scores
precision = tp / (tp + fp)
ZeroDivisionError: division by zero

Any ideas about what is going on ?

With Kind Regards,
Roberto

@arunppsg
Owner

arunppsg commented Jun 22, 2021

Hi Roberto,

The dataset example-2_cpc_results.csv does not contain any positive (anomalous) points. Hence, tp=0. The model also classifies all points as negative. Hence, fp=0, and tp / (tp + fp) becomes 0 / 0.

The attached dataset is not the right one for evaluating the model (sorry for the unnecessary hurdle), since it does not contain any anomalous points. I need to replace it with some other time series anomaly detection dataset. You can see here how to use the code with another dataset.
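
For illustration, here is a minimal sketch of how the metric computation could be guarded against this case. It is not the code in anomaly_detection.py; the function name `precision_recall_f1` and the choice to report 0.0 for an undefined ratio are assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1), using 0.0 where a ratio is undefined.

    Hypothetical helper: reporting 0.0 for an undefined ratio is a convention
    chosen for this sketch, not something the repository defines.
    """
    # precision = tp / (tp + fp) is undefined when no point is predicted positive
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # recall = tp / (tp + fn) is undefined when the labels contain no anomalies
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# No anomalies in the labels and none predicted: tp = fp = fn = 0
print(precision_recall_f1(0, 0, 0))
```

With a guard like this the script would finish instead of raising, but the metrics are still uninformative on a dataset without labeled anomalies.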

Thanks,
Arun

@rruizdeaustri
Author

Hi Arun,

Ok, then I'll try with another dataset.

Thanks a lot !

Best,
Rbt

@rruizdeaustri
Author

Hi Arun,

I have labeled the nyc_taxi.csv dataset from NAB and I have a question about how your code splits the data.
As it stands, 70% of the data is used for training and 30% for testing, but this way the training data contains anomalies for this particular dataset. Since the method is unsupervised, shouldn't anomalies be excluded from training? I guess we want to learn the distribution of the normal samples, right?

Thanks a lot !!

All the best,
Roberto

@arunppsg
Owner

Hi Roberto,

The anomalies are excluded from the training process. The anomaly values are used only for evaluation, not during training. Training uses only the time series signal. The generator learns the distribution of normal samples.

Cheers,
Arun.

@rruizdeaustri
Author

Hi Arun,

Yes, this is what I expected, though in a blog post about the model in Orion I have seen that they use the whole time series (including anomalous timesteps). That is why I got confused.

I will split the data, pick out just the normal data, and let you know whether the code works with this dataset as it does with the "official" implementation in Orion.

BTW, have you tried this dataset? I could send it to you in the right format for your code.
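
A minimal sketch of the split I have in mind, in plain Python. The 70/30 ratio and the `anomaly` column come from this thread; the function name `split_train_test`, the `value` key, and the toy rows are assumptions for illustration.

```python
def split_train_test(rows, train_fraction=0.7, drop_anomalies_from_train=True):
    """Chronological split; optionally keep only rows labeled normal for training.

    `rows` is a list of dicts with an integer "anomaly" key (1 = anomalous).
    """
    cut = int(len(rows) * train_fraction)
    train, test = rows[:cut], rows[cut:]
    if drop_anomalies_from_train:
        # Keep only normal samples so the model sees no labeled anomalies
        train = [r for r in train if r["anomaly"] == 0]
    return train, test


# Toy series of 10 points with anomalies labeled at positions 3 and 8
rows = [{"value": float(i), "anomaly": 1 if i in (3, 8) else 0} for i in range(10)]
train, test = split_train_test(rows)
print(len(train), len(test))
```

Note that dropping rows leaves gaps in the time axis, so windows built over the filtered training data can straddle a removed anomaly.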

Thanks a lot !!

Best,
Rbt

@arunppsg
Owner

arunppsg commented Jul 2, 2021

Hi Roberto,

Thanks for your interest. Training GANs is highly unstable and requires a lot of computational power, which I currently do not have access to.

Best,
Arun.

@rruizdeaustri
Author

Hi Arun,

In fact, I have been training the model, and the performance is really poor for this dataset compared with what is reported on the Orion webpage for the, say, official version.

I have used the default hyperparameters, which are identical to the ones used in the report by the Orion guys:

Accuracy 0.79
Precision 1.00
Recall 0.07
F1 Score 0.13
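
As a quick consistency check on these numbers, the F1 score is the harmonic mean of precision and recall. The sketch below (the helper name `f1_score` is hypothetical) recomputes it from the two reported values.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values: precision 1.00, recall 0.07
print(round(f1_score(1.00, 0.07), 2))  # 0.13
```

So the reported F1 matches: every point that was flagged was a true anomaly, but only about 7% of the labeled anomalies were found.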

Any advice to improve this ?

Thanks a lot !!!

Rbt

@arunppsg
Owner

arunppsg commented Jul 2, 2021

Hi Rbt,

I observed the same result in my runs, although the loss does move in the right direction over successive epochs. I don't have any particular advice other than the following:

  • Try Orion
  • Use other time series modelling approaches, such as fbprophet

Best,
Arun.

@amanuel2

amanuel2 commented Jul 8, 2021

Can one of you send a CSV file that works with this source code? (I get the same error) I can't find any online.

@natkhosh

natkhosh commented Jul 9, 2021

> Hi Arun,
>
> Yes, this is what I expected, though in a blog post about the model in Orion I have seen that they use the whole time series (including anomalous timesteps). That is why I got confused.
>
> I will split the data, pick out just the normal data, and let you know whether the code works with this dataset as it does with the "official" implementation in Orion.
>
> BTW, have you tried this dataset? I could send it to you in the right format for your code.
>
> Thanks a lot !!
>
> Best,
> Rbt

Hi, could you please send me your dataset? I'll try to use it in my diploma work.
I have the same problem with datasets (I tried NAB too).

@rruizdeaustri
Author

Hi Arun,

Maybe I can send you the data and you can add it to the repo?

Rbt

@arunppsg
Owner

arunppsg commented Jul 12, 2021 via email

@rruizdeaustri
Author

Hi Arun,

I have created a branch called rruiz-branch, added the file nyc_taxi_new.csv to it, and made a pull request.
Could you please merge it?

Best,
Rbt

@arunppsg
Owner

You need to create a pull request. I don't see any pull request currently.

@AugustComte

AugustComte commented Sep 2, 2021

Hi @arunppsg,

Firstly, thank you for this, it's super cool. I am new to this and have a few questions, which I hope are not too stupid, if you can indulge me?

Looking through this, I notice that both this and the Orion examples only use a value and a date column. Is it possible to make this work with additional regressors/columns, so-called Xregs, i.e. temperature, sales price, etc.?

Secondly, is it necessary to have labelled anomalies? The anomaly labels in my datasets were obtained from the deviation between the true value and the value predicted by an RNN, and I am expecting TadGAN to be better. So it does not seem appropriate to measure the GAN's performance against the results of the RNN; I was under the impression that TadGAN was unsupervised. All I really want is to get the anomaly scores. Does that mean I would need to delete the evaluation section of the code, or will it run regardless and output the outlier scores? Where can I get these?

Again, sorry if these are poor questions. I'm not sure I entirely understand the code.

Best
August

@arunppsg
Owner

arunppsg commented Sep 3, 2021

Hello August,

  1. You can also use other variables, but for that you might need to change the model architecture. I am not sure how to change it; maybe I will think it through and get back to you after some time. In the current architecture, there is only one regressor; it is normalized first, and the input is then a window of data points (window size: 100 * 1). Consider reading this paper on using multivariate time series with RNNs.
  2. Labelled anomalies are not necessary since it is an unsupervised approach. Labels are only required to evaluate the model. Anomaly scores are computed as the product of the reconstruction error and the critic score. See the test function in anomaly_detection.py for the anomaly scores. To use it without labels, just create a dummy column called anomaly or modify the code in main.py and anomaly_detection.py.
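
A rough sketch of the steps described in the two points above, in plain Python and independent of the actual code in main.py and anomaly_detection.py. The window size of 100, the product of reconstruction error and critic score, and the dummy `anomaly` column come from this comment; the function names, the min-max scaling, the stride of 1, and filling the dummy column with zeros are assumptions for illustration.

```python
def normalize(values):
    """Min-max scale a series to [0, 1] (the scaling method is an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant series: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def rolling_windows(values, window_size=100):
    """Cut a univariate series into overlapping windows with stride 1."""
    return [values[i:i + window_size]
            for i in range(len(values) - window_size + 1)]


def anomaly_scores(reconstruction_errors, critic_scores):
    """Element-wise product of reconstruction error and critic score."""
    return [r * c for r, c in zip(reconstruction_errors, critic_scores)]


def add_dummy_anomaly_column(rows):
    """Return copies of the rows with a placeholder `anomaly` label of 0."""
    return [dict(row, anomaly=0) for row in rows]


series = normalize([float(i % 7) for i in range(250)])
windows = rolling_windows(series, window_size=100)
print(len(windows), len(windows[0]))
```

With an all-zero dummy column, the evaluation metrics are not meaningful; only the anomaly scores themselves are of interest in that case.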

Thanks!

@The-Boyy

> Hi Arun,
>
> I have labeled the nyc_taxi.csv dataset from NAB and I have a question about how your code splits the data. As it stands, 70% of the data is used for training and 30% for testing, but this way the training data contains anomalies for this particular dataset. Since the method is unsupervised, shouldn't anomalies be excluded from training? I guess we want to learn the distribution of the normal samples, right?
>
> Thanks a lot !!
>
> All the best, Roberto

Excuse me, can you send me your dataset? Thanks!

6 participants