
data conversion #1

Open
Athiq opened this issue Feb 18, 2019 · 21 comments
Comments

@Athiq

Athiq commented Feb 18, 2019

Do you have a script that converts the log files (HDFS text files) to numbers?

https://github.com/wuyifan18/DeepLog/blob/master/data/hdfs_train

How did you get the above -- using Spell? After running the parser I still have text data. How did you convert it to numbers (vectors)? Do you have a script, and could you please upload it?

https://github.com/logpai/logparser/tree/master/logs/HDFS

Is hdfs_train this data converted to numbers?

thanks in advance

@wuyifan18
Owner

wuyifan18 commented Feb 19, 2019

I use the dataset provided by the author of the paper.
For more details, please refer to the web page.

@sotiristsak

Hello. Thanks @wuyifan18 for the great job. I think what @Athiq is asking about can be found in paragraph 4.3 of the published paper. I'm also trying to figure out how this could be implemented! Any help would be much appreciated.

@Athiq
Author

Athiq commented Feb 25, 2019

@sotiristsak exactly ... I want the raw text that was converted into the numbers in the provided data (I think it's TF-IDF) -- if so, it shouldn't be a problem to implement. Please let me know if that's the case, @wuyifan18.

@wuyifan18
Owner

@Athiq the raw text can be found on the web page.

@Athiq
Author

Athiq commented Feb 25, 2019

@wuyifan18 thanks for the response. I am looking for the text data so that I can use Spell and DeepLog. Where I get stuck is that after Spell I have parsed text data. I want to train DeepLog on this parsed data, but I am not sure how to convert the parsed output from Spell into numbers (is it TF-IDF?).

@wuyifan18
Owner

@Athiq you mean converting the data to numbers according to the log keys you have parsed with Spell?
If so, I have no idea. Maybe @sotiristsak can give a hand.

@sotiristsak

Sorry for the delayed reply. Unfortunately, I also don't have a clue. I'm thinking I have to implement paragraph 4.3.1 of the paper on my own, because, I think, this is where the logs are split into tasks in order to be grouped into workflows. @Athiq what do you mean by TF-IDF? Also, is anyone interested in collaborating on this?

@sotiristsak

Btw, @Athiq, the numbers are not TF-IDF. They are the ids of the different log types (log keys). So a sequence of such numbers denotes the workflow of a specific task pattern. The hdfs_train file contains the workflows extracted from the raw log file of the normal execution.
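For illustration, here is a minimal sketch (not code from this repo) of that id assignment, assuming keys are numbered in order of first appearance:

```python
# Sketch: assign an integer id (a "log key") to each distinct log template
# and represent one session (block) as the sequence of those ids.
template_to_id = {}

def log_key(template):
    # assign a new id the first time a template is seen
    if template not in template_to_id:
        template_to_id[template] = len(template_to_id) + 1
    return template_to_id[template]

# the templates of one parsed session's log lines, in order
session = [
    "PacketResponder * for block * terminating",
    "Received block * of size * from *",
    "PacketResponder * for block * terminating",
]
print(" ".join(str(log_key(t)) for t in session))  # -> "1 2 1", one hdfs_train-style line
```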

@wuyifan18
Owner

@sotiristsak You're right.

@Athiq
Author

Athiq commented Mar 5, 2019

@sotiristsak @wuyifan18 what I am trying to do is run DeepLog on the data below:

https://github.com/logpai/loghub/blob/master/Hadoop/Hadoop_2k.log

I have successfully run Spell (the parser) on this data, and I get the two files below.



Sample structured_file.csv:

| LineId | Date | Time | Pid | Level | Component | Content | EventId | EventTemplate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 81109 | 203615 | 148 | INFO | dfs.DataNode$PacketResponder | PacketResponder 1 for block blk_38865049064139660 terminating | ead21f08 | PacketResponder * for block * terminating |
| 2 | 81109 | 203807 | 222 | INFO | dfs.DataNode$PacketResponder | PacketResponder 0 for block blk_-6952295868487656571 terminating | ead21f08 | PacketResponder * for block * terminating |
| 3 | 81109 | 204005 | 35 | INFO | dfs.FSNamesystem | BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.251.73.220:50010 is added to blk_7128370237687728475 size 67108864 | 54e007d2 | BLOCK* NameSystem.addStoredBlock blockMap updated * 50010 is added to * size * |
| 4 | 81109 | 204015 | 308 | INFO | dfs.DataNode$PacketResponder | PacketResponder 2 for block blk_8229193803249955061 terminating | ead21f08 | PacketResponder * for block * terminating |
| 5 | 81109 | 204106 | 329 | INFO | dfs.DataNode$PacketResponder | PacketResponder 2 for block blk_-6670958622368987959 terminating | ead21f08 | PacketResponder * for block * terminating |
| 6 | 81109 | 204132 | 26 | INFO | dfs.FSNamesystem | BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.251.43.115:50010 is added to blk_3050920587428079149 size 67108864 | 54e007d2 | BLOCK* NameSystem.addStoredBlock blockMap updated * 50010 is added to * size * |


Sample template_file.csv:

| EventId | EventTemplate | Occurrences |
| --- | --- | --- |
| ead21f08 | PacketResponder * for block * terminating | 311 |
| 54e007d2 | BLOCK* NameSystem.addStoredBlock blockMap updated * 50010 is added to * size * | 314 |
| 74cae9fd | Received block * of size * from * | 292 |
| dd632e5d | Receiving block * src * * dest * 50010 | 292 |


Now the big question: how do I run DeepLog on these structured and template files? Is this possible, or am I missing something?

thanks in advance

@wuyifan18
Owner

@Athiq You should convert the structured_file to numbers according to the template file you got from Spell.
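For example, here is a rough sketch of that conversion with pandas, assuming the two files are comma-separated CSVs with the columns shown above (the file names and output path are placeholders):

```python
import pandas as pd

# Map Spell's EventId (e.g. "ead21f08") to a small integer log key,
# numbered by the order in which templates appear in template_file.csv.
templates = pd.read_csv("template_file.csv")
event_to_key = {eid: i + 1 for i, eid in enumerate(templates["EventId"])}

# Replace each parsed line's EventId with its key, then group lines
# into sessions by the block id mentioned in the Content column.
structured = pd.read_csv("structured_file.csv")
structured["Key"] = structured["EventId"].map(event_to_key)
structured["BlockId"] = structured["Content"].str.extract(r"(blk_-?\d+)", expand=False)

# Write one space-separated key sequence per session, hdfs_train-style.
with open("train_sequences.txt", "w") as f:
    for _, group in structured.groupby("BlockId"):
        f.write(" ".join(group["Key"].astype(str)) + "\n")
```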

@Hammadtcs

@wuyifan18: Thanks for your response. Sorry, but I am also struggling with how to convert structured files into numbers. Can you guide us with an example of how to do it, please? Any example would help.

@williamceli

Hello! From my understanding, once raw text logs have been parsed (using Spell or any other parsing tool), they should be converted into sequences of log templates to be fed to the LSTM model.

@hzxGoForward

hzxGoForward commented Mar 21, 2019

> Hello! From my understanding, once raw text logs have been parsed (using Spell or any other parsing tool), they should be converted into sequences of log templates to be fed to the LSTM model.

I agree with your opinion; that's why I am confused about their training data format. I am also confused about why the paper's author divided the log into lines where each line has a different length. I don't think that is the correct format of training data according to the paper. Do you have any idea?

@williamceli

williamceli commented Mar 21, 2019

@hzxGoForward I think there is a preprocessing step missing, which is, for each line (block/session), building sequences of the same length. I guess that is not the actual final input for training.
My problem is that I don't get the same number of block lines. If I group by block in the first 100K log lines, I get a different number of sessions. Maybe I am extracting the wrong block id from each line.
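In case it helps, here is a small sketch of the grouping step (the file name is a placeholder); one possible cause of mismatched session counts is a block-id regex that misses negative ids:

```python
import re
from collections import defaultdict

# Note the optional minus sign: HDFS block ids can be negative
# (e.g. blk_-6952295868487656571); a pattern without "-?" will
# silently drop every line whose block id is negative.
BLOCK_RE = re.compile(r"blk_-?\d+")

sessions = defaultdict(list)
with open("HDFS.log") as f:
    for line in f:
        # a single log line can mention more than one block id
        for block_id in set(BLOCK_RE.findall(line)):
            sessions[block_id].append(line.rstrip("\n"))

print(len(sessions), "sessions")
```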

@wuyifan18
Owner

wuyifan18 commented Mar 21, 2019

@williamceli exactly, the actual final input for training needs to be padded to the length given by the hyperparameter window_size.
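As a rough illustration (a sketch, not this repo's exact preprocessing), each session's key sequence can be padded and sliced into windows of window_size keys, with the key that follows each window as its training label:

```python
# Sketch: turn one session's key sequence into fixed-length training pairs.
# Each window of `window_size` keys predicts the key that follows it;
# sessions shorter than window_size + 1 are left-padded with a pad key.
def make_pairs(sequence, window_size, pad=-1):
    if len(sequence) < window_size + 1:
        sequence = [pad] * (window_size + 1 - len(sequence)) + sequence
    return [
        (sequence[i:i + window_size], sequence[i + window_size])
        for i in range(len(sequence) - window_size)
    ]

for inputs, label in make_pairs([5, 5, 22, 11, 9, 26], window_size=4):
    print(inputs, "->", label)
# [5, 5, 22, 11] -> 9
# [5, 22, 11, 9] -> 26
```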

@hzxGoForward

> @hzxGoForward I think there is a preprocessing step missing, which is, for each line (block/session), building sequences of the same length. I guess that is not the actual final input for training.
> My problem is that I don't get the same number of block lines. If I group by block in the first 100K log lines, I get a different number of sessions. Maybe I am extracting the wrong block id from each line.

Maybe you can use the numbering of the log keys extracted in the following dataset:
http://iiis.tsinghua.edu.cn/~weixu/sospdata.html
DeepLog's authors cited this dataset, and it contains the log keys and their numbers.

@Hammadtcs

@wuyifan18 @hzxGoForward: Can you add the preprocessing showing how you converted lines to numbers for the LSTM, using the hyperparameter window_size or timestamps?

We are referring to the OpenStack logs; for your reference, I have attached a log:
https://github.com/logpai/logparser/blob/master/logs/OpenStack/OpenStack_2k.log

We are able to convert unstructured logs to structured logs using Spell or logparser, but after that we are unable to feed the data into training. I understand that you do the conversion using the hyperparameter window_size. Can you add those details or some sample source code?

@Huhu-ooo

Huhu-ooo commented Jun 3, 2020

> Btw, @Athiq, the numbers are not TF-IDF. They are the ids of the different log types (log keys). So a sequence of such numbers denotes the workflow of a specific task pattern. The hdfs_train file contains the workflows extracted from the raw log file of the normal execution.

@Athiq Hi, thanks for your response; it also helps me a lot! And I have something to verify: do you mean that I can verify my workflow-construction code against the hdfs_train file? Thank you so much!

@stuti-madaan

@Athiq hi! I am going through the same issue: I have parsed the logs and I am clueless about how to convert them into numbers for processing. Were you able to find a solution?

@Nightmare2334

> @sotiristsak @wuyifan18 what I am trying to do is run DeepLog on the data below:
>
> https://github.com/logpai/loghub/blob/master/Hadoop/Hadoop_2k.log
>
> I have successfully run Spell (the parser) on this data, and I get the two files below.
>
> Sample structured_file.csv:
>
> | LineId | Date | Time | Pid | Level | Component | Content | EventId | EventTemplate |
> | --- | --- | --- | --- | --- | --- | --- | --- | --- |
> | 1 | 81109 | 203615 | 148 | INFO | dfs.DataNode$PacketResponder | PacketResponder 1 for block blk_38865049064139660 terminating | ead21f08 | PacketResponder * for block * terminating |
> | 2 | 81109 | 203807 | 222 | INFO | dfs.DataNode$PacketResponder | PacketResponder 0 for block blk_-6952295868487656571 terminating | ead21f08 | PacketResponder * for block * terminating |
> | 3 | 81109 | 204005 | 35 | INFO | dfs.FSNamesystem | BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.251.73.220:50010 is added to blk_7128370237687728475 size 67108864 | 54e007d2 | BLOCK* NameSystem.addStoredBlock blockMap updated * 50010 is added to * size * |
> | 4 | 81109 | 204015 | 308 | INFO | dfs.DataNode$PacketResponder | PacketResponder 2 for block blk_8229193803249955061 terminating | ead21f08 | PacketResponder * for block * terminating |
> | 5 | 81109 | 204106 | 329 | INFO | dfs.DataNode$PacketResponder | PacketResponder 2 for block blk_-6670958622368987959 terminating | ead21f08 | PacketResponder * for block * terminating |
> | 6 | 81109 | 204132 | 26 | INFO | dfs.FSNamesystem | BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.251.43.115:50010 is added to blk_3050920587428079149 size 67108864 | 54e007d2 | BLOCK* NameSystem.addStoredBlock blockMap updated * 50010 is added to * size * |
>
> Sample template_file.csv:
>
> | EventId | EventTemplate | Occurrences |
> | --- | --- | --- |
> | ead21f08 | PacketResponder * for block * terminating | 311 |
> | 54e007d2 | BLOCK* NameSystem.addStoredBlock blockMap updated * 50010 is added to * size * | 314 |
> | 74cae9fd | Received block * of size * from * | 292 |
> | dd632e5d | Receiving block * src * * dest * 50010 | 292 |
>
> Now the big question: how do I run DeepLog on these structured and template files? Is this possible, or am I missing something?
>
> thanks in advance

@Athiq Hello buddy, I have already obtained the template file and the templated log file, but how can I turn them into numeric sequence files like the author's hdfs_train data? Do you have a way? I hope you can reply when you see this; it is very important to me. Thank you!
