
Notebooks implementations #9

Closed
RahimKh opened this issue Mar 18, 2021 · 2 comments
RahimKh commented Mar 18, 2021

Hello,

I have some concerns regarding the notebooks provided.
Why is the training done on files that contain anomalies instead of on the anomaly-free csv file, with testing on the other files?
Also, the results take into account the whole file, including the training samples, but as far as I know, results should be computed only on data that is unknown to the model. Am I missing something?
Thank you.

YKatser (Collaborator) commented Mar 21, 2021

Hello, @RahimKh!
Thank you for your remarks! Since we are still working on the methodology for evaluating the algorithms, your comment is helpful.
We see three possible ways of model fitting (fault-free train set selection):

  1. (priority option) Use a separate file (data/anomaly-free/anomaly-free.csv) with a relatively long fault-free operating mode. Trained once and applied to all datasets. The problem here is that some issues during data collection made the fault-free dataset too different from most of the other datasets, which caused the anomaly detection algorithms to learn the wrong patterns. We are currently working on collecting a proper fault-free dataset for model fitting in the future.
  2. Use the beginning of one dataset as the fault-free mode. Trained once and applied to all datasets.
  3. Use the beginning of each dataset as the fault-free mode. Trained and applied separately for every single dataset.

We have selected the 3rd way for now, using the first 400 points of each dataset (approximately 1/3 of the total number of points) as the train set. It is not entirely fair (by doing so, we decrease the number of unknown points, making the problem easier to solve), but it is still acceptable for a changepoint detection problem. As for the outlier detection problem: although you are generally right that "results need to be done on only data that is unknown to the model" for metric (FAR, MAR, F1) calculation, it can still be an option. In that case, the results are just slightly overstated.
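A minimal sketch of the 3rd option is below. The file path, column names, and the IsolationForest detector are illustrative assumptions, not the notebooks' actual code; any unsupervised detector could be plugged in the same way.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical file path and column names, for illustration only.
df = pd.read_csv("data/valve1/1.csv", sep=";", index_col="datetime", parse_dates=True)
features = df.drop(columns=["anomaly", "changepoint"], errors="ignore")

# Option 3: the first 400 points of each dataset serve as the fault-free train set,
# the remaining points are treated as the test set.
train = features.iloc[:400]
test = features.iloc[400:]

# IsolationForest is just an example detector here.
model = IsolationForest(random_state=0).fit(train)
pred = model.predict(test)  # +1 = normal, -1 = outlier
```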

We definitely want to switch to the 1st way of model fitting. Until a proper separate fault-free dataset is available, we will probably switch to the 2nd way.

YKatser pinned this issue Mar 24, 2021
YKatser (Collaborator) commented Mar 24, 2021

The answer has been moved to the slides about the project.

YKatser closed this as completed Mar 24, 2021