Specification of the format for the values of the @unit attribute #39

Open
everzeni opened this issue Aug 24, 2017 · 6 comments

@everzeni
Collaborator

everzeni commented Aug 24, 2017

To know what form a @unit attribute's value must take (for example unit="min" or unit="minute"?), we can use this page: http://cdsarc.u-strasbg.fr/cgi-bin/Unit?%3f

Is that ok?
Are there other sources we could use?

Edit: the page mentioned above does not always settle the question, since a unit sometimes has several symbols (e.g. year may be a or yr, and in grobid-quantities it comes out as unit="year").
How should we proceed?

@kermitt2
Collaborator

We need to use the src/main/resources/en/units.json file and not an external resource, otherwise the system will not be able to interpret the unit value unambiguously.
We can use the unique raw value, which is always disambiguated by the measure type.
I realize for instance that we used unit="percentage" (so a lemma form) instead of the raw form.

Then, if a unit is missing, it currently has to be declared in src/main/resources/en/units.json.
The dynamic update of this file based on the training data has not been implemented.
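
To illustrate how a value is disambiguated by the measure type, here is a minimal sketch. The `UNITS` dictionary is a hypothetical stand-in for units.json (the real file layout may differ), and `resolve_unit` is an illustrative helper, not part of grobid-quantities:

```python
from typing import Optional

# Hypothetical stand-in for units.json: measure type -> raw value -> other notations.
# The layout and the entries are assumptions made for this example only.
UNITS = {
    "TIME": {"min": ["minute", "minutes"], "a": ["yr", "year", "years"]},
    "AREA": {"a": ["are", "ares"], "m^2": ["m2", "square meter"]},
}


def resolve_unit(measure_type: str, value: str) -> Optional[str]:
    """Return the raw value declared for `value` under `measure_type`, or None."""
    declared = UNITS.get(measure_type, {})
    if value in declared:
        # The value is already a raw form for this measure type.
        return value
    for raw, notations in declared.items():
        if value in notations:
            # The value is another notation (e.g. a lemma) of a declared raw form.
            return raw
    # Not declared for this measure type: it would have to be added to the file.
    return None
```

With this toy data, "a" resolves to itself under both TIME and AREA, so the same symbol is only unambiguous once the measure type is known, while a value that is not declared for the given type returns None.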

@lfoppiano
Owner

We could actually verify units.json against the Strasbourg reference and complete it if something is missing ;-)

@lfoppiano
Owner

Regarding the unit reference, we could either use any of the notation forms (when there is more than one), or define an additional attribute that is the unique reference for the unit.

By the way, we would also need to update TEMPERATURE, using unit="°C".

@kermitt2
Collaborator

Actually, we need to clarify/remember what we wanted to do with this attribute's value. I think it was introduced to disambiguate the annotated unit, so normally the value here should correspond to the one used in the text: 'minute' makes sense when minute is annotated, and min when it is min. As long as the unit is referenced in units.json for the measurement type, it would be fine.

We could then use this annotation for the unit parser, its training and evaluation I think.

@lfoppiano
Owner

lfoppiano commented Aug 25, 2017

I agree with you on the usage; however, if we use the value that is in the text, what is the point of having the attribute at all?

If we want to use this for the unit parser, then it is better to have an already (semi-)normalised version:

  • if the unit is a simple unit, we use the notation as defined in the raw entry in units.json
  • if the unit is complex, we use the normalised version, e.g. m2 or meters^2 would be written in @unit as m^2, i.e. something that we can easily read/transform. A more complex example: mm * kg / minutes can already be rewritten as mm * kg * min^-1

What do you think?
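
The rewriting proposed in the list above could be sketched roughly as follows. This is only an illustration under assumptions: `RAW_FORMS` is a hypothetical lemma-to-raw table (in practice it would be derived from units.json), and `normalise_unit` handles only flat products and quotients of simple factors:

```python
import re

# Hypothetical lemma -> raw form table; an assumption for this example.
RAW_FORMS = {"minutes": "min", "minute": "min", "meters": "m", "meter": "m"}

# A factor is a run of letters, optionally followed by "^" and/or an integer exponent.
_FACTOR = re.compile(r"^([^\W\d_]+)\^?(-?\d+)?$")


def normalise_factor(token: str, sign: int) -> str:
    """Normalise one factor; `sign` is -1 when the factor follows a division."""
    match = _FACTOR.match(token.strip())
    if match is None:
        raise ValueError(f"cannot parse unit factor: {token!r}")
    name = RAW_FORMS.get(match.group(1), match.group(1))
    exponent = int(match.group(2) or 1) * sign
    return name if exponent == 1 else f"{name}^{exponent}"


def normalise_unit(expression: str) -> str:
    """Rewrite e.g. 'mm * kg / minutes' as 'mm * kg * min^-1'."""
    # The capturing group keeps the operators in the split result.
    parts = re.split(r"\s*([*/])\s*", expression.strip())
    factors = [normalise_factor(parts[0], 1)]
    for operator, token in zip(parts[1::2], parts[2::2]):
        factors.append(normalise_factor(token, -1 if operator == "/" else 1))
    return " * ".join(factors)
```

Under these assumptions, m2 and meters^2 both become m^2, and a division is turned into a negative exponent so that the result is a plain product that is easy to read or transform.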

@kermitt2
Collaborator

The attribute is used to build training data, but its value is not used for the moment.

I would agree with you: the value could be not really a "normalized" form, but a "valid" form of the unit as it appears, which might be degraded because it comes from the PDF. Some examples:

<measure type="LENGTH" unit="µm">&#xB5;m</measure>
<measure type="VELOCITY" unit="km.h^-1">km &#xB7; h &#x2212;1</measure>
<measure type="VO2_MAX" unit="ml.kg^-1.min^-1">ml &#xB7; kg &#x2212;1 &#xB7; min &#x2212;1</measure>

(these cleaning/"transliterations" are not always obvious)
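
As a rough illustration of such a cleaning step, the sketch below reproduces the three examples above. The rules in `clean_unit` (a hypothetical helper) are assumptions derived only from those examples and would certainly not cover every degraded form coming out of a PDF:

```python
import html
import re


def clean_unit(raw_text: str) -> str:
    """Turn a degraded unit string into a 'valid' form, e.g. 'km.h^-1'."""
    # Decode character references such as &#xB7; if the text still contains them.
    text = html.unescape(raw_text)
    # Replace the typographic minus sign (U+2212) with an ASCII hyphen-minus.
    text = text.replace("\u2212", "-")
    # A middle dot (U+00B7) or dot operator (U+22C5), with surrounding spaces, becomes ".".
    text = re.sub(r"\s*[\u00B7\u22C5]\s*", ".", text)
    # An integer exponent, possibly detached by a space, is attached with "^".
    text = re.sub(r"\s*\^?\s*(-?\d+)", r"^\1", text)
    return text.strip()
```

Note that this treats any trailing integer as an exponent, which is an assumption that fails for units whose symbol legitimately contains a digit.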

The advantage is that we stick to what appears in the text, and the annotator does not need to use a reference list of units and normal forms.

We could also choose a more advanced/normalized form (e.g. put m instead of meter when meter appears in a text), but this adds work for the annotator, who has to check the normalized form against a reference, and it requires updating the reference list whenever a new unit appears (and there are always new units built/composed from other units).

I will look again at the unit parser, which most likely requires a new iteration, and see what kind of information would be most needed.


3 participants