Train quantified object recognition #84

Open
ahmed-elyoussefi opened this issue Mar 4, 2019 · 6 comments
@ahmed-elyoussefi

Hi there,
thank you for such a great project.

I have looked through the documentation for a way to train quantified object recognition (I know it's still experimental). Can anyone point me to where to start training grobid-quantities to recognize the quantified object, please?

@lfoppiano
Owner

lfoppiano commented Mar 6, 2019

Dear @ahmed-elyoussefi, thank you for your interest in grobid-quantities.

The current implementation of the quantified object recognition uses a dependency parser and applies some heuristics to find the quantified object in the sentence.
The plan was to migrate this implementation, at some point, to a fully machine-learning-based approach (and to get rid of clearnlp, see issue #67).

I started using the current implementation to pre-annotate training data, and then began annotating some documents in order to set up annotation guidelines and find problems. I didn't get very far with that, actually.

If you download the master branch you can pre-annotate data with all the available models using

-gH ../grobid-home/ -gP ../grobid-home/config/grobid.properties -dIn input_directory -dOut output_directory -r -exe createTrainingQuantities

and, for each PDF, you will generate training data for each model. To correct the quantified object data, look for the set of *.quantifiedObject.xml files in the output directory.
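As a small convenience, the files to correct can be collected with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes the files end in `.quantifiedObject.xml` as described above and may sit in subfolders of the output directory.

```python
from pathlib import Path


def find_quantified_object_files(output_dir):
    """Return every *.quantifiedObject.xml file under output_dir, sorted by path.

    Hypothetical helper, not part of grobid-quantities.
    """
    # rglob searches output_dir and all of its subdirectories
    return sorted(Path(output_dir).rglob("*.quantifiedObject.xml"))


if __name__ == "__main__":
    # "output_directory" is the -dOut value from the command above
    for path in find_quantified_object_files("output_directory"):
        print(path)
```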

Give it a try and let me know if you have any issue.

Regarding the annotation, this part is very experimental because we didn't really think a lot about it.

The annotations are in the form of
<measure type="value" ptr="#03f98156-4d76-4de0-84fc-c641b112b232">Nine</measure> <quantifiedObject id="03f98156-4d76-4de0-84fc-c641b112b232">cases</quantifiedObject> are selected as true freak wave events

with the <measure> tag wrapping the measurement and pointing, through the ptr attribute, to the id of the linked <quantifiedObject>. Have a look at what you get; I can anticipate that correcting such annotations won't be so easy ;-).
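To make the linking concrete, here is a minimal Python sketch that reads the example above and pairs each measure with the object it points to. It assumes the fragment has no XML namespace and wraps it in a hypothetical <p> root so that it parses; the generated files may be structured differently, and `linked_pairs` is an illustrative name, not something from grobid-quantities.

```python
import xml.etree.ElementTree as ET

# The annotated sentence from the example above
SNIPPET = (
    '<measure type="value" ptr="#03f98156-4d76-4de0-84fc-c641b112b232">Nine</measure> '
    '<quantifiedObject id="03f98156-4d76-4de0-84fc-c641b112b232">cases</quantifiedObject> '
    'are selected as true freak wave events'
)


def linked_pairs(fragment):
    """Return (measure text, quantified object text) pairs linked via ptr -> id."""
    # Assumption: wrap the fragment in a made-up <p> root to get well-formed XML
    root = ET.fromstring(f"<p>{fragment}</p>")
    # Index the quantified objects by their id attribute
    objects = {q.get("id"): (q.text or "") for q in root.iter("quantifiedObject")}
    pairs = []
    for measure in root.iter("measure"):
        # ptr carries a leading '#', the id does not
        target = (measure.get("ptr") or "").lstrip("#")
        if target in objects:
            pairs.append((measure.text or "", objects[target]))
    return pairs


print(linked_pairs(SNIPPET))  # [('Nine', 'cases')]
```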

The idea, at the beginning of this task, was to use a CRF, so it is unlikely to cover distant links between measurements and objects; however, I haven't spent enough time to figure out whether an alternative exists.
I will probably need this model for a project I'm working on, so I might get back to work on it. In the meantime, I can support you if you need any further help.

I hope I haven't forgotten anything, in any case feel free to ask.

Regards
Luca

@ahmed-elyoussefi
Author

Thank you for your replies,
but if I have the annotated quantified objects, how can I trigger the training?

regards

@lfoppiano
Owner

lfoppiano commented Mar 13, 2019

Good question... well, right now you can't, because that part hasn't been written yet.
You need to wait or try to implement it 😅

Annotating data for this kind of task is very time-consuming, so I haven't had time to work on it yet. I believe I will have time at some point in the next few weeks, but I cannot say precisely when.

@lfoppiano
Owner

OK, @ahmed-elyoussefi I've managed to write the trainer for the quantifiedObject. 😅

It can process data, though the training data haven't been checked by anybody else, so everything is at a very alpha stage. I'm planning to plug in the parser soon.

To run the trainer you need to type ./gradlew train_quantifiedObjects and cross your fingers :-)

This particular trainer validates the consistency of the training data, so if there are missing links between measurements and objects it will raise an exception, like:

Caused by: org.grobid.core.exceptions.GrobidException: [GENERAL] The training data is inconsistent and should be corrected. 
	measureId: null
	quantifiedObjectId: fbf9a8de-3763-4200-90eb-ae5e2fc33eff
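Before launching the trainer, a rough pre-flight check along these lines might catch broken links early. This is not the trainer's own validation, only a sketch assuming the ptr/id convention from the annotation example earlier in the thread, no XML namespace, and a made-up <p> wrapper around each fragment; `find_unlinked` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def find_unlinked(fragment):
    """Return (orphan_object_ids, dangling_pointers) for one annotated fragment.

    orphan_object_ids: <quantifiedObject> ids that no <measure> points to.
    dangling_pointers: <measure> ptr targets with no matching <quantifiedObject>.
    """
    # Assumption: wrap the fragment in a made-up <p> root to get well-formed XML
    root = ET.fromstring(f"<p>{fragment}</p>")
    object_ids = {
        q.get("id") for q in root.iter("quantifiedObject") if q.get("id")
    }
    pointers = {
        m.get("ptr").lstrip("#") for m in root.iter("measure") if m.get("ptr")
    }
    return sorted(object_ids - pointers), sorted(pointers - object_ids)


if __name__ == "__main__":
    # A quantified object without any measure pointing at it
    broken = '<quantifiedObject id="fbf9a8de-3763-4200-90eb-ae5e2fc33eff">cases</quantifiedObject>'
    orphans, dangling = find_unlinked(broken)
    print("orphan objects:", orphans)
    print("dangling pointers:", dangling)
```

Running it over each annotated paragraph and fixing whatever it reports should reduce the chance of hitting the exception above, though the trainer may apply stricter rules than this sketch does.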

Happy annotating!! 😆

@ahmed-elyoussefi
Author

Awesome,
thank you very much.
I'll give it a try.

@lfoppiano lfoppiano self-assigned this Mar 19, 2019
lfoppiano added a commit that referenced this issue Apr 1, 2019