Measurement type recognition #120
Comments
Dear @thorge, thanks for the feedback. I checked the process and what you did looks fine. Just one question: the notation in the lexicon (JSON file) should contain the "official formula" for the unit. In your case, you have the notation with … Meanwhile, I will have a deeper look, because the data flow is a bit complicated due to the complexity and noise that usually come with quantity and unit recognition. Also, would you be able to share the references of the paper you are using?
Dear @lfoppiano, thanks for having a look at it. You're right, it should be only … By the way, you can find my work in the feature branch of my forked repo. Regarding the references of the paper: what exactly do you mean? So far I have only worked with text excerpts (.txt) from preprocessed papers. The files are in the …
@thorge I will check in more depth this week. Thanks for the text you used for testing.
Hey @lfoppiano, I found out why the labelling of MARs did not work properly. I had to add:
{
  "raw": "kg/(s·m²)",
  "product": [
    {
      "prefix": "k",
      "base": "g"
    },
    {
      "base": "m",
      "pow": "-2"
    },
    {
      "base": "s",
      "pow": "-1"
    }
  ]
}
Before, I had notations like … For the normalized units of sedimentation rates, which are a kind of velocity, I still have trouble differentiating between the two. As far as I can see, the first (or last, I'm not sure right now) unit defined in the unit lexicon that matches the normalized unit is used for labeling.
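To make the structure of the entry above easier to follow, here is a minimal Python sketch that joins the factors of a `product` list into a single notation string. The function name `product_to_notation` and the output format (`·` as separator, `^` for powers) are assumptions chosen for illustration; they are not taken from grobid-quantities.

```python
from typing import Dict, List


def product_to_notation(product: List[Dict[str, str]]) -> str:
    """Join the factors of a decomposed unit into one string.

    Each factor has a mandatory "base", an optional "prefix" and an
    optional "pow" (a missing "pow" is treated as 1).
    """
    parts = []
    for factor in product:
        symbol = factor.get("prefix", "") + factor["base"]
        power = factor.get("pow", "1")
        parts.append(symbol if power == "1" else f"{symbol}^{power}")
    return "·".join(parts)


# The lexicon entry quoted in the comment above.
mar_unit = {
    "raw": "kg/(s·m²)",
    "product": [
        {"prefix": "k", "base": "g"},
        {"base": "m", "pow": "-2"},
        {"base": "s", "pow": "-1"},
    ],
}

notation = product_to_notation(mar_unit["product"])
print(notation)  # kg·m^-2·s^-1
```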
Dear @thorge, thanks for the feedback on this issue. I'm trying to find some time to work on it, but it's very difficult at the moment. I've managed to release version 0.7.0, which contains quite a lot of improvements (and, luckily, not many bugs ^^). The overlapping problem should be solved in issue #96, but prior to that the lexicon needs to be updated to allow multiple definitions to be returned. Regarding your question, we used the normalised unit to pull out the unit definition because we use the base normalised units to get the data from the Lexicon.
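The ambiguity discussed here can be illustrated with a small Python sketch: when two measurement types share the same normalised unit, a single-valued lookup can only return one of them, while a multi-valued lookup returns both. The type names and the choice of keeping the first match are assumptions for illustration only; the actual lexicon is implemented in Java and may behave differently.

```python
from collections import defaultdict

# Hypothetical lexicon entries: (measurement type, normalised unit).
entries = [
    ("VELOCITY", "m·s^-1"),
    ("SEDIMENTATION_RATE", "m·s^-1"),
    ("MASS_ACCUMULATION_RATE", "kg·m^-2·s^-1"),
]

# Single-valued lookup: only the first definition per unit is kept,
# so the sedimentation rate is shadowed by the velocity.
first_match = {}
for measurement_type, unit in entries:
    first_match.setdefault(unit, measurement_type)

# Multi-valued lookup: every definition for a unit is kept, leaving
# the final choice to a later step (for example, the context).
all_matches = defaultdict(list)
for measurement_type, unit in entries:
    all_matches[unit].append(measurement_type)

print(first_match["m·s^-1"])   # VELOCITY
print(all_matches["m·s^-1"])   # ['VELOCITY', 'SEDIMENTATION_RATE']
```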
Normally, the fix in 7b95705 should solve the issue of fetching the unit definition when the normalised unit contains superscript numbers.
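As an illustration of the kind of mismatch superscripts can cause (this is not the change made in 7b95705, just a sketch of the idea), the following Python snippet rewrites Unicode superscripts as ASCII exponents so that `m²` and `m^2` compare equal. The function name and the `^` convention are assumptions.

```python
import re

# Map superscript digits and signs to their ASCII counterparts.
SUPERSCRIPTS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹⁻⁺", "0123456789-+")
SUPERSCRIPT_RUN = re.compile(r"[⁰¹²³⁴⁵⁶⁷⁸⁹⁻⁺]+")


def superscripts_to_ascii(unit: str) -> str:
    """Rewrite each run of superscript characters as ^<ascii digits>."""
    return SUPERSCRIPT_RUN.sub(
        lambda match: "^" + match.group(0).translate(SUPERSCRIPTS), unit
    )


print(superscripts_to_ascii("kg/(s·m²)"))  # kg/(s·m^2)
print(superscripts_to_ascii("m⁻²"))        # m^-2
```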
I'm closing this issue, feel free to reopen it if you want me to look more into it.
Sorry for keeping you busy @lfoppiano. I'm not sure where exactly to discuss this as it is more of a question than an issue. If this is the wrong place, just let me know.
As I mentioned before, I'm going to add more measurement types to the module. More precisely, I want to identify mass accumulation rates (e.g. g cm^-2 yr^-1) and sedimentation rates (e.g. cm yr^-1).
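For orientation, the two example units can be converted to SI base units by hand. The sketch below does the arithmetic in Python; it assumes a Julian year of 365.25 days (the length of "yr" is a convention, not something stated in this thread), and the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # Julian year (assumption)


def mar_to_si(value_g_per_cm2_per_yr: float) -> float:
    """Convert g cm^-2 yr^-1 to kg m^-2 s^-1."""
    kilograms = value_g_per_cm2_per_yr * 1e-3   # g  -> kg
    per_square_metre = kilograms / 1e-4         # cm^-2 -> m^-2
    return per_square_metre / SECONDS_PER_YEAR  # yr^-1 -> s^-1


def sedimentation_rate_to_si(value_cm_per_yr: float) -> float:
    """Convert cm yr^-1 to m s^-1."""
    return value_cm_per_yr * 1e-2 / SECONDS_PER_YEAR


print(mar_to_si(1.0))                 # kg m^-2 s^-1
print(sedimentation_rate_to_si(1.0))  # m s^-1
```

After normalisation a sedimentation rate has the same base units as any other velocity (m s^-1), which is why the two cannot be told apart by the normalised unit alone.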
First I enhanced the unit lexicon with …
Note that I also added yr as an inflection.
Since the generation of training data now works great, I removed the training corpus for the quantities model as a test and replaced it with a single training set covering my use case.
Annotated quantities
Annotated units
Annotated values
Here I also noticed that, to make the generated data available for model training, it's not enough to add the corresponding files to the corpus directories. You also have to use the file ending .tei or .tei.xml (e.g. see QuantitiesTrainer.java, line 78). The default output of the training data generation is .xml. I don't think I have seen any information on that in the docs.
Then I successfully trained all models and tested them with the web interface. As a first step, it is important to me that the measurements are recognized at all, so I test the corresponding models with the same input that is also used in the training data.
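The file-ending requirement mentioned above can be handled with a small helper script. The following is a minimal Python sketch, assuming the generated files sit directly in one directory and that renaming `name.xml` to `name.tei.xml` is all that is needed; the function name is hypothetical and the corpus directory has to be supplied by the caller.

```python
from pathlib import Path
from typing import List


def rename_generated_files(corpus_dir: Path) -> List[Path]:
    """Rename *.xml files to *.tei.xml so the trainer picks them up.

    Files that already end in .tei.xml are left alone, and a file is
    skipped if the target name already exists.
    """
    renamed = []
    for path in sorted(corpus_dir.glob("*.xml")):
        if path.name.endswith(".tei.xml"):
            continue
        target = path.with_name(path.stem + ".tei.xml")
        if target.exists():
            continue
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling `rename_generated_files(Path(...))` with the corpus directory of a model returns the list of new file names, which makes it easy to check what was changed before starting the training.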
The output is:
As you can see in the web app output, it can handle (recognize and normalize) the different measurements, but I have two problems regarding the recognition of the quantity type:
Any ideas what I'm doing wrong?