Specification of the format for the values of the @unit attribute #39

everzeni · 2017-08-24T16:20:23Z

To know what form a @unit attribute's value must take (for example, unit="min" or unit="minute"?), we can use this page: http://cdsarc.u-strasbg.fr/cgi-bin/Unit?%3f

Is that ok?
Are there other sources we could use?

Edit: the page mentioned above does not always settle the question, since a unit sometimes has several symbols (e.g. year may be a or yr, and in grobid-quantities it comes out as unit="year").
How should we proceed?

kermitt2 · 2017-08-25T04:18:57Z

We need to use the src/main/resources/en/units.json file rather than an external resource; otherwise the system will not be able to interpret the unit value unambiguously.
We can use the unique raw value, which is always disambiguated by the measure type.
I realize, for instance, that we used unit="percentage" (a lemma form) instead of the raw form.

If a unit is missing, it currently has to be declared in src/main/resources/en/units.json.
The dynamic update of this file from the training data has not been implemented.
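
To illustrate how a raw value is disambiguated by the measure type, here is a minimal sketch. The entry layout (`type`, `raw`, `lemmas`), the sample entries, and the helper `find_unit` are assumptions made for illustration; the actual schema of units.json may differ.

```python
from typing import Optional

# Hypothetical excerpt: the real units.json may use different keys and types.
UNITS = [
    {"type": "TIME", "raw": ["min"], "lemmas": ["minute"]},
    {"type": "TIME", "raw": ["a", "yr"], "lemmas": ["year"]},
    {"type": "AREA", "raw": ["a"], "lemmas": ["are"]},
    {"type": "FRACTION", "raw": ["%"], "lemmas": ["percentage", "percent"]},
]


def find_unit(raw_value: str, measure_type: str) -> Optional[dict]:
    """Return the entry declaring raw_value for measure_type, or None."""
    for entry in UNITS:
        if entry["type"] == measure_type and raw_value in entry["raw"]:
            return entry
    return None


# The same raw value "a" resolves differently depending on the measure type.
print(find_unit("a", "TIME")["lemmas"][0])   # year
print(find_unit("a", "AREA")["lemmas"][0])   # are
# A lemma form such as "percentage" is not a raw value, so it is rejected.
print(find_unit("percentage", "FRACTION"))   # None
```

A check like this could flag annotations that use a lemma form where a raw form is expected.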

lfoppiano · 2017-08-25T05:22:37Z

We could actually check units.json against the Strasbourg reference and complete it if something is missing ;-)

lfoppiano · 2017-08-25T05:28:17Z

Regarding the unit reference, we could either use any one of the notation forms (when there is more than one) or define an additional attribute that serves as the unique reference for the unit.

Btw, we would also need to update TEMPERATURE, using unit=°C

kermitt2 · 2017-08-25T06:15:56Z

Actually, we need to clarify/remember what we wanted to do with this attribute's value. I think it was introduced to disambiguate the annotated unit, so normally the value here should correspond to the one used in the text: 'minute' makes sense when minute is annotated, and 'min' when it's min. As long as the unit is referred to in 'units.json' for the measurement type, it would be fine.

We could then use this annotation for the unit parser, its training and evaluation I think.

lfoppiano · 2017-08-25T06:43:13Z

I agree with you on the usage; however, if we use the value which is in the text, what's the point of having the attribute at all?

If we want to use this for the unit parser, then it is better to have an already (semi-)normalised version:

if the unit is simple, we use the notation as defined in the raw entry in units.json
if the unit is complex, we use the normalised version, e.g. m2 or meters^2 would be written in @unit as m^2, something that we can easily read/transform. A more complex example would be mm * kg / minutes, which we can already rewrite as mm * kg * min^-1

What do you think?
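
The rewriting of complex units proposed above can be sketched as follows. This is only an illustration under assumptions: the lemma-to-raw table `RAW_FORMS` and the function `normalise_unit` are hypothetical, and a real table would be derived from units.json.

```python
import re

# Hypothetical lemma -> raw notation table (would come from units.json).
RAW_FORMS = {"meter": "m", "meters": "m", "minute": "min", "minutes": "min"}

# A factor is a unit name optionally followed by an exponent: m2, m^2, min^-1.
FACTOR = re.compile(r"([^\d\^\s\-]+)\s*\^?\s*(-?\d+)?")


def normalise_unit(expr: str) -> str:
    """Rewrite a product/quotient of units as a product with signed exponents."""
    parts = re.split(r"\s*([*/])\s*", expr.strip())
    factors = []
    sign = 1  # becomes -1 for the single factor that follows a "/"
    for part in parts:
        if part in ("*", "/"):
            sign = -1 if part == "/" else 1
            continue
        match = FACTOR.fullmatch(part)
        if match is None:
            raise ValueError(f"cannot parse unit factor: {part!r}")
        name = RAW_FORMS.get(match.group(1), match.group(1))
        exponent = sign * int(match.group(2) or 1)
        factors.append(name if exponent == 1 else f"{name}^{exponent}")
        sign = 1
    return " * ".join(factors)


print(normalise_unit("m2"))                 # m^2
print(normalise_unit("meters^2"))           # m^2
print(normalise_unit("mm * kg / minutes"))  # mm * kg * min^-1
```

The sketch treats "/" as applying only to the factor that immediately follows it, and it does not handle parentheses.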

kermitt2 · 2017-08-25T14:39:26Z

The attribute is used to build training data, but its value is not used for the moment.

I would agree with you: the value could be not really a "normalized" form but a "valid" form of the unit as it appears, which might be degraded because it comes from the PDF. Some examples:

<measure type="LENGTH" unit="µm">µm</measure>
<measure type="VELOCITY" unit="km.h^-1">km · h −1</measure>
<measure type="VO2_MAX" unit="ml.kg^-1.min^-1">ml · kg −1 · min −1</measure>

(these cleaning/"transliterations" are not always obvious)
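
For the simple cases shown in the examples above, the cleaning could look like the sketch below. The function `clean_unit` is hypothetical and only covers middle dots, Unicode minus signs, and exponents detached by whitespace; real PDF output will need more cases.

```python
import re


def clean_unit(text: str) -> str:
    """Turn a degraded unit string from PDF text into a 'valid' form."""
    # Unicode minus sign (U+2212) -> ASCII hyphen-minus
    cleaned = text.strip().replace("\u2212", "-")
    # Middle dot (U+00B7) or dot operator (U+22C5), with spaces -> "."
    cleaned = re.sub(r"\s*[\u00b7\u22c5]\s*", ".", cleaned)
    # Exponent separated from its unit by whitespace -> "^exponent"
    cleaned = re.sub(r"\s+(-?\d+)", r"^\1", cleaned)
    return cleaned


print(clean_unit("km \u00b7 h \u22121"))                     # km.h^-1
print(clean_unit("ml \u00b7 kg \u22121 \u00b7 min \u22121"))  # ml.kg^-1.min^-1
print(clean_unit("\u00b5m"))                                 # µm
```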

The advantage is that we stick to what appears in the text, and the annotator does not need to use a reference list of units and normal forms.

We could also choose a more advanced/normalized form (like putting m instead of meter when meter appears in a text), but it adds work for the annotator, who has to check the normalized form against a reference, and the reference list has to be updated whenever a new unit appears (and there are always new units built/composed from other units).

I will look again at the unit parser, which most likely requires a new iteration, and see what kind of information would be most needed.

everzeni added the question label Aug 24, 2017

everzeni assigned kermitt2 Aug 24, 2017

lfoppiano added a commit that referenced this issue Aug 25, 2017

Correcting training data and guidelines using the raw form of the unit in the @Unit attribute #39 (13f0251)

kermitt2 mentioned this issue Aug 25, 2017

Different formats for a @unit value #40

Closed
