
Build a Corpus Inventory #100

Open
PonteIneptique opened this issue Jun 20, 2017 · 3 comments

Comments

@PonteIneptique
Member

To speed up parsing, it might be interesting to build a generic file that merges all metadata at build time. That would reduce loading time by replacing many small metadata reads with a single file access.

I can also see a point in grouping them by a maximum of X entries (say 1,000?):

```
data
   |-- phi1294
   |-- phi1295
   |-- phi1296
   |-- ...
   |-- phi3300
   |-- __capitains_fastload_0__.xml
   |-- __capitains_fastload_1__.xml
```
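The build step above could be sketched roughly as follows. This is a minimal, hypothetical sketch: the function name `build_fastload`, the `__cts__.xml` glob pattern, and the plain text concatenation are assumptions for illustration; a real implementation would merge parsed XML trees into a valid inventory rather than concatenating file contents.

```python
# Hypothetical sketch: merge per-text metadata files into chunked
# "__capitains_fastload_N__.xml" inventories of at most chunk_size entries.
import glob
import os

def build_fastload(data_dir, pattern="*/__cts__.xml", chunk_size=1000):
    """Group metadata files into chunks and write one merged file per chunk."""
    files = sorted(glob.glob(os.path.join(data_dir, pattern)))
    outputs = []
    for index in range(0, len(files), chunk_size):
        chunk = files[index:index + chunk_size]
        out_path = os.path.join(
            data_dir,
            "__capitains_fastload_{}__.xml".format(index // chunk_size),
        )
        with open(out_path, "w", encoding="utf-8") as out:
            # Naive concatenation under a single root; a real build would
            # merge the parsed XML trees instead of raw text.
            out.write("<inventory>\n")
            for path in chunk:
                with open(path, encoding="utf-8") as src:
                    out.write(src.read())
                    out.write("\n")
            out.write("</inventory>\n")
        outputs.append(out_path)
    return outputs
```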

So, from there, if the Nautilus resolver detects such files, it could default to parsing these instead of using glob. It would be a production trick, let's say...
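The resolver-side detection could look something like this. Again a sketch under assumptions: the helper name `metadata_sources` and both glob patterns are illustrative, not part of the Nautilus API.

```python
# Hypothetical resolver-side check: prefer merged fastload inventories
# when present, otherwise fall back to globbing individual metadata files.
import glob
import os

def metadata_sources(data_dir):
    fastload = sorted(
        glob.glob(os.path.join(data_dir, "__capitains_fastload_*__.xml"))
    )
    if fastload:
        # Production trick: a handful of merged files instead of thousands.
        return fastload
    return sorted(glob.glob(os.path.join(data_dir, "*/__cts__.xml")))
```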

What do you think @balmas @sonofmun

@balmas
Contributor

balmas commented Jun 20, 2017

It seems reasonable, although I wonder if we are starting to go down the path of reinventing a wheel that already exists. Would the addition of an indexing solution, such as Elasticsearch, be another approach?

@PonteIneptique
Member Author

It's not about reinventing the wheel. Even though there are solutions like ES and Solr, it's still more efficient to load all the information from one file rather than from 1,000. :)
I am looking at situations like the Pompei Corpus, where it would be painful to parse 12k metadata files every time the rest is updated...

@balmas
Contributor

balmas commented Jun 20, 2017

OK, just wanted to be sure we weren't overlooking something.
