Share how I transformed the logs into lines of IDs here #35

ying1016 opened this issue Mar 30, 2020 · 3 comments

@ying1016

Hey guys,
I used Drain3 to transform the HDFS logs into lines of IDs here: https://github.com/ying1016/Drain3.git. Hope it helps if you don't know what to do.
One thing to note: the raw data is ordered by log timestamp, not by block ID. If you want to transform the logs, you need the data grouped by block ID, not my test data at the URL. But I think it might not be a problem.
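
Roughly, the grouping by block ID I mean looks like this in Python (just a sketch; the file names and the block-ID regex are placeholders, not necessarily exactly what my repo does):

```python
import re
from collections import defaultdict

from drain3 import TemplateMiner

template_miner = TemplateMiner()
# HDFS block IDs look like blk_-1608999687919862906
block_id_pattern = re.compile(r"blk_-?\d+")

# block ID -> ordered list of Drain cluster (log key) IDs
sequences = defaultdict(list)

with open("HDFS.log") as f:  # placeholder input file
    for line in f:
        result = template_miner.add_log_message(line.rstrip())
        # a line can mention several blocks, e.g. replication messages
        for blk in block_id_pattern.findall(line):
            sequences[blk].append(result["cluster_id"])

# One line of log key IDs per block, no matter how the raw log was ordered.
with open("IDblks.log", "w") as out:
    for blk, keys in sequences.items():
        out.write(" ".join(str(k) for k in keys) + "\n")
```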

@DuoweiPan

@ying1016 Thank you for your implementation! I noticed that in IDblks.log there are a lot of very short sessions like `06 01` or `01`, which are shorter than the window size and quite different from hdfs_train. Those sessions will be detected as abnormal if I use the model trained on hdfs_train. Correct me if I'm wrong, but I think the original log data you used is the same as the log data DeepLog used, so why are the log keys so different between them? Any hint would be helpful! Thank you!

@edocorallo

edocorallo commented Sep 25, 2020

Hello, @DuoweiPan

> I noticed that in IDblks.log there are a lot of very short sessions like `06 01` or `01`, which are shorter than the window size and quite different from hdfs_train. Those sessions will be detected as abnormal if I use the model trained on hdfs_train.

From what I understood, the minimal length of a session should never be less than the window size during the training stage (e.g. window_size=9, len(session) >= 9). Could be wrong though.
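
Something like this is what I mean (just a sketch of my own preprocessing idea, not DeepLog's official code; `sessions` is assumed to map block IDs to lists of log keys):

```python
window_size = 9  # same value as in the example above

def filter_short_sessions(sessions, window_size):
    """Drop block sessions that are shorter than the window size before training."""
    return {blk: keys for blk, keys in sessions.items() if len(keys) >= window_size}

train_sessions = filter_short_sessions(sessions, window_size)
```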

> so why are the log keys so different between them?

Also, from what I understood, the log keys are somewhat arbitrary.
I numbered them by order of appearance using a simple dictionary and saved the dictionary for later parsing. But if I did the parsing starting from some random lines, I would still obtain a good training set containing the same sequences of logs, just named differently.
(e.g. the sequence [2 5 2 5 4 7 8] is equivalent to [6 8 6 8 1 9 13] and, as long as the enumeration of the log keys is consistent through the entire dataset, DeepLog obtains similar results on both enumerations)
Obviously, if you use one enumeration for training, the same one has to be used for prediction.
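
The enumeration idea is basically this (my own illustration; the file name is a placeholder):

```python
import json

key_map = {}  # template (or Drain cluster ID) -> integer log key

def to_log_key(template):
    """Assign the next integer the first time a template is seen."""
    if template not in key_map:
        key_map[template] = len(key_map) + 1
    return key_map[template]

# ... call to_log_key() on every parsed line while building the sequences ...

# Save the mapping so prediction uses exactly the same enumeration as training.
with open("log_key_map.json", "w") as f:
    json.dump(key_map, f)
```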

I hope this is helpful. Bye

@OneStepAndTwoSteps

This is very helpful to me, thank you!
