This repository has been archived by the owner on May 22, 2019. It is now read-only.

preprocrepos: what is this for and who is dzhigurda? #304

Open
campoy opened this issue Aug 15, 2018 · 2 comments

campoy commented Aug 15, 2018

I'm reading this document and wondering what this command is for.

The description says "preprocess your data before passing it to any command you need", but this is too vague to be useful. What are the common use cases of the tool? Why was it created?

Finally, the last flag is dzhigurda... is that Nikita Dzhigurda?


vmarkovtsev (Collaborator) commented Aug 15, 2018

The description is out of date; the real one is at https://github.com/src-d/ml/blob/master/sourced/ml/__main__.py#L34. In short, we cache UASTs and/or file contents so that we do not have to extract them again for downstream tasks (especially because extraction is typically the trickiest and most unreliable step).
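To illustrate the caching idea in general terms, here is a minimal sketch of "extract once, reuse later". It is not the src-d/ml implementation: `cached_extract`, `toy_extract`, and the JSON-on-disk layout are hypothetical names and choices made up for this example, and the real tool stores its output differently.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_extract(content, extract, cache_dir):
    """Return extract(content), reusing a result cached on disk if present.

    Hypothetical helper: the cache key is the SHA-256 of the file content,
    and results are assumed to be JSON-serializable.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(content).hexdigest()
    path = cache_dir / (key + ".json")
    if path.exists():
        # Cache hit: skip the expensive, unreliable extraction step.
        return json.loads(path.read_text(encoding="utf-8"))
    result = extract(content)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


# Toy stand-in for an expensive extractor (e.g. parsing a file into a UAST).
calls = []


def toy_extract(content):
    calls.append(content)
    return {"n_bytes": len(content)}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_extract(b"print('hi')", toy_extract, Path(tmp))
    second = cached_extract(b"print('hi')", toy_extract, Path(tmp))
    # The second call is served from disk, so the extractor ran only once.
    assert first == second
    assert len(calls) == 1
```

Keying the cache on a content hash means identical files are extracted only once, however many times downstream tasks ask for them.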

Regarding Nikita: yep. He is a legendary Russian freak, and his surname sounds funny even to us. Developers at Mail.Ru Group (thousands of them) have an internal convention of calling the conditions for A/B tests "dzhigurdas". The goal of a dzhigurda is to select the proper configuration depending on the context. I thought it would be funny to continue the tradition, so I used that name for the dirty hack that artificially extends the dataset in src-d/ml. So dzhigurda chooses which commits to process.


sakalouski commented Oct 9, 2018

Is there a way to access commits from a particular date? I am trying to convert a 440 MB repo with 6k commits. The Siva file is 1.2 GB, but I wonder how large the .parquet output would be...
It takes forever on a cluster node (dzhigurda -1) and then crashes; apparently 200 GB of RAM is not enough for this task.

I think I should use gitbase for that...
