Measurement type recognition #120

thorge · 2021-05-11T12:09:12Z

Sorry for keeping you busy @lfoppiano. I'm not sure where exactly to discuss this as it is more of a question than an issue. If this is the wrong place, just let me know.

As I mentioned before, I'm going to add more measurement types to the module. More precisely, I want to identify mass accumulation rates (e.g. g cm^-2 yr^-1) and sedimentation rates (e.g. cm yr^-1).

First, I extended the unit lexicon with:

[
     {
      "notations": [
        {
          "raw": "kg m^-2 yr^-1",
          "product": [
            {
              "base": "g"
            },
            {
              "base": "m",
              "pow": "-2"
            },
            {
              "base": "yr",
              "pow": "-1"
            }
          ]
        },
        {
          "raw": "kg/(m2*yr)",
          "product": [
            {
              "base": "g"
            },
            {
              "base": "m",
              "pow": "-2"
            },
            {
              "base": "yr",
              "pow": "-1"
            }
          ]
        },
        {
          "raw": "kg/m2/yr",
          "product": [
            {
              "base": "g"
            },
            {
              "base": "m",
              "pow": "-2"
            },
            {
              "base": "yr",
              "pow": "-1"
            }
          ]
        }
      ],
      "type": "MASS_ACCUMULATION_RATE",
      "system": "SI_DERIVED",
      "supportsPrefixes": true,
      "names": [
        {
          "lemma": "mass accumulation rate",
          "inflection": "mass accumulation rates"
        }
      ]
    },
    {
      "notations": [
        {
          "raw": "cm yr^-1",
          "product": [
            {
              "prefix": "c",
              "base": "m"
            },
            {
              "base": "yr",
              "pow": "-1"
            }
          ]
        },
        {
          "raw": "cm/yr",
          "product": [
            {
              "prefix": "c",
              "base": "m"
            },
            {
              "base": "yr",
              "pow": "-1"
            }
          ]
        }
      ],
      "type": "SEDIMENTATION_RATE",
      "system": "SI_DERIVED",
      "supportsPrefixes": true,
      "names": [
        {
          "lemma": "sedimentation rate",
          "inflection": "sedimentation rates"
        }
      ]
    }
]

Note that I also added yr as an inflection of year:

    {
      "type": "TIME",
      "system": "NON_SI",
      "names": [
        {
          "lemma": "year",
          "inflections": [
            "years",
            "yr"
          ]
        }
      ]
    }

Since the generation of training data now works great, I removed the training corpus for the quantities model as a test and replaced it with a single training set covering my use case.

Annotated quantities

<tei xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc xml:id="_1" />
    <encodingDesc>
      <appInfo>
        <application version="0.6.2" ident="GROBID" when="2021-05-08T10:40+0000">
          <ref target="https://github.com/kermitt2/grobid">A machine learning software for extracting information from scholarly documents</ref>
        </application>
      </appInfo>
    </encodingDesc>
  </teiHeader>
  <text xml:lang="lexicon/en">
    <p>They corresponded to sedimentation rates (acc) of <measure type="list"><num>0.45</num> and <num>0.3</num> <measure type="SEDIMENTATION_RATE" unit="cm yr^-1">cm yr-1</measure></measure> (Table 2). Beyond the middle shelf, MAR decreased sharply to <measure type="list"><num>132</num> and <num>44</num> <measure type="MASS_ACCUMULATION_RATE" unit="g m^-2 yr^-1">g m-2 yr-1</measure></measure> at St. 2 (11 S) and St. 8 (12 S), respectively. These latter values are associated with measurable <measure type="value"><num>210</num></measure>Pbxs in the upper <measure type="value"><num>3</num> <measure type="LENGTH" unit="cm">cm</measure></measure> only and thus indicative of sediment winnowing or resuspension, as mentioned above. A relatively low MAR of <measure type="value"><num>128</num> <measure type="MASS_ACCUMULATION_RATE" unit="kg m^-2 yr^-1">kg m-2 yr-1</measure></measure> was also determined for St. 4 at 12 S compared to the neighbouring stations. MAR and acc tended to be higher at the deep oxygenated stations compared to the OMZ stations, with acc of <measure type="value"><num>0.06</num> <measure type="SEDIMENTATION_RATE" unit="cm yr^-1">cm yr-1</measure></measure> at St. 10 (12 S) and <measure type="value"><num>0.05</num> <measure type="SEDIMENTATION_RATE" unit="cm yr^-1">cm yr-1</measure></measure> at St. 5 (11 S). Aluminium accumulation showed similar trends to MAR, with highest values on the shelf and a pronounced increase below the OMZ (Fig. 3b).</p>
  </text>
</tei>

Annotated units

<units xmlns="http://www.tei-c.org/ns/1.0">
  <unit><prefix>c</prefix><base>m</base> <base>yr</base><pow>-1</pow></unit>
  <unit><base>g</base> <base>m</base><pow>-2</pow> <base>yr</base><pow>-1</pow></unit>
  <unit><prefix>c</prefix><base>m</base></unit>
  <unit><prefix>k</prefix><base>g</base> <base>m</base><pow>-2</pow> <base>yr</base><pow>-1</pow></unit>
  <unit><prefix>c</prefix><base>m</base> <base>yr</base><pow>-1</pow></unit>
  <unit><prefix>c</prefix><base>m</base> <base>yr</base><pow>-1</pow></unit>
</units>

Annotated values

<values xmlns="http://www.tei-c.org/ns/1.0">
  <value><number>0.45</number></value>
  <value><number>0.3</number></value>
  <value><number>132</number></value>
  <value><number>44</number></value>
  <value><number>210</number></value>
  <value><number>3</number></value>
  <value><number>128</number></value>
  <value><number>0.06</number></value>
  <value><number>0.05</number></value>
</values>
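A quick sanity check for the three files, as a minimal sketch: in the example above, every inner measure carrying a unit attribute has one entry in the units file (six each) and every num has one entry in the values file (nine each). Whether the trainer relies on this correspondence is an assumption. The embedded strings are shortened, hypothetical samples in the same shape as the data above, and count_annotations is an illustrative helper, not part of grobid-quantities.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # namespace used by all three files

# Shortened, hypothetical samples in the same shape as the annotated data.
QUANTITIES = """<tei xmlns="http://www.tei-c.org/ns/1.0"><text><p>
rates of <measure type="list"><num>0.45</num> and <num>0.3</num>
<measure type="SEDIMENTATION_RATE" unit="cm yr^-1">cm yr-1</measure></measure>
in the upper <measure type="value"><num>3</num>
<measure type="LENGTH" unit="cm">cm</measure></measure> only.
</p></text></tei>"""

UNITS = """<units xmlns="http://www.tei-c.org/ns/1.0">
<unit><prefix>c</prefix><base>m</base> <base>yr</base><pow>-1</pow></unit>
<unit><prefix>c</prefix><base>m</base></unit>
</units>"""

VALUES = """<values xmlns="http://www.tei-c.org/ns/1.0">
<value><number>0.45</number></value>
<value><number>0.3</number></value>
<value><number>3</number></value>
</values>"""


def count_annotations(quantities_xml, units_xml, values_xml):
    """Count unit and number annotations in each of the three files."""
    quantities = ET.fromstring(quantities_xml)
    units = ET.fromstring(units_xml)
    values = ET.fromstring(values_xml)
    return {
        # inner <measure> elements are the ones that carry a unit attribute
        "unit_mentions": sum(
            1 for m in quantities.iter(TEI + "measure") if m.get("unit")
        ),
        "unit_entries": len(units.findall(TEI + "unit")),
        "num_mentions": sum(1 for _ in quantities.iter(TEI + "num")),
        "value_entries": len(values.findall(TEI + "value")),
    }


counts = count_annotations(QUANTITIES, UNITS, VALUES)
print(counts)
```

If unit_mentions and unit_entries (or num_mentions and value_entries) differ, one of the files was probably edited without the others.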

Here I also noticed that, to make the generated data available for model training, it is not enough to add the corresponding files to the corpus directories. The files must also end in .tei or .tei.xml (see line 78 of QuantitiesTrainer.java), whereas the training data generation writes .xml by default. I don't think I have seen this mentioned in the docs.
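The renaming step can be scripted. The following is a minimal sketch, assuming the generated files sit flat in one corpus directory; rename_for_training is a hypothetical helper name and the directory passed to it is a placeholder, not a path taken from the project.

```python
from pathlib import Path


def rename_for_training(corpus_dir):
    """Rename generated *.xml files to *.tei.xml so the trainer picks them up.

    Files that already end in .tei.xml are left untouched.
    Returns the list of new paths.
    """
    renamed = []
    for path in sorted(Path(corpus_dir).glob("*.xml")):
        if path.name.endswith(".tei.xml"):
            continue  # already has an accepted ending
        target = path.with_name(path.stem + ".tei.xml")
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling rename_for_training("path/to/corpus") turns, for example, sample.xml into sample.tei.xml and leaves other file types alone.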

Then I successfully trained all models and tested them with the web interface. As a first step, it is important to me that the measurements are recognized at all, so I tested the models with the same input that is used in the training data.

The output is:

As you can see in the web app output, it can handle (recognize and normalize) the different measurements, but I have two problems regarding the recognition of the quantity type:

The quantity type for a mass accumulation rate is not recognized. Type recognition for simple units (I tested with length) works as expected. I'm not sure what I'm doing wrong.
The type of a sedimentation rate is recognized as a velocity, which is not wrong, since sedimentation rates are velocities. Of course, I could label all velocities as sedimentation rates, which would help my use case, but the module would lose its generic approach.

Any ideas what I'm doing wrong?

lfoppiano · 2021-05-14T07:43:55Z

Dear @thorge thanks for the feedback.
I did not have time to check the whole problem. We might need a couple of iterations on this. The lexicon needs to be improved anyway (#97), and this is a good opportunity.

I checked the process and what you did looks fine. Just one question: the notation in the lexicon (JSON file) should contain the "official formula" for the unit. In your case, the notation uses kg but the product uses g instead of kg. Which one should be used?

Meanwhile I will have a deeper look, because the data flow is a bit complicated due to the complexity and the noise that usually come with quantity and unit recognition.

Also, would you be able to share the references of the paper you are using?

thorge · 2021-05-14T10:45:23Z

Dear @lfoppiano thanks for having a look at it.

You're right, it should be only g, otherwise the prefix k should be added to the unit lexicon. I tried different values and must have forgotten to add the prefix. Unfortunately, I get the same results with the correct notation.
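The kg versus g mismatch can be caught mechanically for notations written in the "base^pow" style. The following is a minimal sketch: render_product is a hypothetical helper, and the comparison only makes sense for raw strings such as kg m^-2 yr^-1, not for slash forms such as kg/(m2*yr).

```python
def render_product(product):
    """Render a lexicon "product" list as a space-separated unit string."""
    parts = []
    for factor in product:
        symbol = factor.get("prefix", "") + factor["base"]
        if "pow" in factor:
            symbol += "^" + factor["pow"]
        parts.append(symbol)
    return " ".join(parts)


# Notation as originally posted: raw says kg, product says g.
posted = {
    "raw": "kg m^-2 yr^-1",
    "product": [
        {"base": "g"},
        {"base": "m", "pow": "-2"},
        {"base": "yr", "pow": "-1"},
    ],
}

# Same notation with the prefix added to the product.
with_prefix = {
    "raw": "kg m^-2 yr^-1",
    "product": [
        {"prefix": "k", "base": "g"},
        {"base": "m", "pow": "-2"},
        {"base": "yr", "pow": "-1"},
    ],
}

for notation in (posted, with_prefix):
    rendered = render_product(notation["product"])
    status = "ok" if rendered == notation["raw"] else "MISMATCH"
    print(status, notation["raw"], "vs", rendered)
```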

Btw, you can find my work in the feature branch of my forked repo.

Regarding the references of the paper: what exactly do you mean? So far I have only worked with text excerpts (.txt) from preprocessed papers. The files are in the test/resources folder of my fork, see test5.txt and test6.txt.

lfoppiano · 2021-05-24T08:14:00Z

@thorge I will check this in more depth this week. Thanks for the text you used for testing.

thorge · 2021-07-18T17:11:39Z

Hey @lfoppiano I found out why the labelling of MARs did not work properly. I had to add kg/(s·m²) to the unit lexicon in the notation field and now the recognition works fine.

{
   "raw":"kg/(s·m²)",
   "product":[
      {
         "prefix":"k",
         "base":"g"
      },
      {
         "base":"m",
         "pow":"-2"
      },
      {
         "base":"s",
         "pow":"-1"
      }
   ]
}

Before, I had notations like kg m^-2 yr^-1 because I was hoping I could define a normalized unit for MARs using this field. Apparently, though, a derived unit is always normalized using the normalized forms of its subunits: mass -> kilogram, area -> square meter, time -> second. So in the end a MAR will always be normalized to kg/(s·m²), I guess, and that's why I had to add that notation as well.
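To make the normalization concrete, here is a minimal sketch of reducing a product of units to base units. The TO_SI table and the normalise function are hypothetical and not the module's actual code, and the length of a year (365.25 days) is an assumption; the library may use a different value.

```python
import math

# Assumption: Julian year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Hypothetical table: unit symbol -> (base unit symbol, conversion factor).
TO_SI = {
    "g": ("kg", 1e-3),
    "kg": ("kg", 1.0),
    "cm": ("m", 1e-2),
    "m": ("m", 1.0),
    "yr": ("s", SECONDS_PER_YEAR),
    "s": ("s", 1.0),
}


def normalise(value, product):
    """Convert a value with a unit product to base units.

    product is a list of (symbol, power) pairs, e.g. [("kg", 1), ("m", -2)].
    Returns the converted value and a dict of base unit -> total power.
    """
    factor = 1.0
    dims = {}
    for symbol, power in product:
        base, to_si = TO_SI[symbol]
        factor *= to_si ** power
        dims[base] = dims.get(base, 0) + power
    return value * factor, dims


mar_kg = normalise(128, [("kg", 1), ("m", -2), ("yr", -1)])
mar_g = normalise(132, [("g", 1), ("m", -2), ("yr", -1)])
print(mar_kg)
print(mar_g)
```

Both the kg and the g variants end up with the same base dimensions (kg to the power 1, m to the power -2, s to the power -1), which matches the observation that a MAR always normalizes to kg/(s·m²) regardless of the notation used in the text.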

For the normalized units of sedimentation rates, which are a kind of velocity, I still have trouble differentiating between the two. As far as I can see, the first (or last, I'm not sure right now) unit defined in the unit lexicon that matches the normalized unit is used for labeling.
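The collision can be illustrated with a toy lookup. This is a sketch only: LEXICON, first_match and all_matches are hypothetical, and whether the real lookup keeps the first or the last match is not settled here.

```python
# Toy lexicon: two types share the same normalized unit.
LEXICON = [
    {"type": "VELOCITY", "normalised": "m/s"},
    {"type": "SEDIMENTATION_RATE", "normalised": "m/s"},
    {"type": "LENGTH", "normalised": "m"},
]


def first_match(lexicon, normalised_unit):
    """Return the type of the first entry matching the normalized unit."""
    for entry in lexicon:
        if entry["normalised"] == normalised_unit:
            return entry["type"]
    return None


def all_matches(lexicon, normalised_unit):
    """Return every type whose normalized unit matches."""
    return [e["type"] for e in lexicon if e["normalised"] == normalised_unit]


print(first_match(LEXICON, "m/s"))
print(all_matches(LEXICON, "m/s"))
```

With a single-result lookup the second type can never be returned for m/s; returning all candidates leaves the choice to a later step that can use context, such as the words "sedimentation rate" near the measurement.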

lfoppiano · 2021-08-10T05:55:46Z

Dear @thorge thanks for the feedback on this issue. I'm trying to find some time to work on it, but it's very difficult at the moment. I've managed to release version 0.7.0, which contains quite a lot of improvements (and, luckily, not many bugs ^^).

The overlapping problem should be solved in issue #96 but prior to that, the lexicon needs to be updated to allow multiple definitions to be returned.

Regarding your question, we used the normalised unit to pull out the unit definition because we use the base normalised units to get the data from the lexicon.
However, looking at it, I think we should use the product form instead of the formula actually used, so that the lookup does not depend on the specific way the normalised units are formatted.

lfoppiano · 2023-04-12T03:20:02Z

Normally, the fix in 7b95705 should solve the issue of fetching the unit definition when the normalised unit contains superscript numbers.
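What the commit actually changes is not shown in this thread. The sketch below only illustrates the kind of mismatch involved: a lexicon key written with a caret (m^2) will not match a normalised unit written with a Unicode superscript (m²) unless one form is converted. superscripts_to_caret is a hypothetical helper.

```python
import re

# Map Unicode superscript digits and minus sign to their ASCII counterparts.
_SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹⁻"
_TABLE = str.maketrans(_SUPERSCRIPTS, "0123456789-")
_RUN = re.compile("[" + _SUPERSCRIPTS + "]+")


def superscripts_to_caret(unit):
    """Rewrite runs of superscript characters as ^<ascii>, e.g. m² -> m^2."""
    return _RUN.sub(lambda m: "^" + m.group().translate(_TABLE), unit)


print(superscripts_to_caret("kg/(s·m²)"))
print(superscripts_to_caret("g m⁻² yr⁻¹"))
```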

lfoppiano · 2023-12-15T02:03:47Z

I'm closing this issue, feel free to reopen if you want me to look more into it.

lfoppiano added the enhancement label May 12, 2021

lfoppiano mentioned this issue May 23, 2022

Add mass accumulation rates and sedimentation rates #135

Closed

lfoppiano closed this as completed Dec 15, 2023
