Measurement type recognition #120

Closed
thorge opened this issue May 11, 2021 · 7 comments


@thorge
Contributor

thorge commented May 11, 2021

Sorry for keeping you busy @lfoppiano. I'm not sure where exactly to discuss this as it is more of a question than an issue. If this is the wrong place, just let me know.

As I mentioned before, I'm going to add more measurement types to the module. More precisely, I want to identify mass accumulation rates (e.g. g cm^-2 yr^-1) and sedimentation rates (e.g. cm yr^-1).

First I enhanced the unit lexicon with

[
     {
      "notations": [
        {
          "raw": "kg m^-2 yr^-1",
          "product": [
            {
              "base": "g"
            },
            {
              "base": "m",
              "pow": "-2"
            },
            {
              "base": "yr",
              "pow": "-1"
            }
          ]
        },
        {
          "raw": "kg/(m2*yr)",
          "product": [
            {
              "base": "g"
            },
            {
              "base": "m",
              "pow": "-2"
            },
            {
              "base": "yr",
              "pow": "-1"
            }
          ]
        },
        {
          "raw": "kg/m2/yr",
          "product": [
            {
              "base": "g"
            },
            {
              "base": "m",
              "pow": "-2"
            },
            {
              "base": "yr",
              "pow": "-1"
            }
          ]
        }
      ],
      "type": "MASS_ACCUMULATION_RATE",
      "system": "SI_DERIVED",
      "supportsPrefixes": true,
      "names": [
        {
          "lemma": "mass accumulation rate",
          "inflection": "mass accumulation rates"
        }
      ]
    },
    {
      "notations": [
        {
          "raw": "cm yr^-1",
          "product": [
            {
              "prefix": "c",
              "base": "m"
            },
            {
              "base": "yr",
              "pow": "-1"
            }
          ]
        },
        {
          "raw": "cm/yr",
          "product": [
            {
              "prefix": "c",
              "base": "m"
            },
            {
              "base": "yr",
              "pow": "-1"
            }
          ]
        }
      ],
      "type": "SEDIMENTATION_RATE",
      "system": "SI_DERIVED",
      "supportsPrefixes": true,
      "names": [
        {
          "lemma": "sedimentation rate",
          "inflection": "sedimentation rates"
        }
      ]
    }
]

Note that I also added yr as an inflection of year:

    {
      "type": "TIME",
      "system": "NON_SI",
      "names": [
        {
          "lemma": "year",
          "inflections": [
            "years",
            "yr"
          ]
        }
      ]
    }

Since the generation of training data now works great, I removed the training corpus for the quantities model as a test and replaced it with a single training set covering my use case.

Annotated quantities

<tei xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc xml:id="_1" />
    <encodingDesc>
      <appInfo>
        <application version="0.6.2" ident="GROBID" when="2021-05-08T10:40+0000">
          <ref target="https://github.com/kermitt2/grobid">A machine learning software for extracting information from scholarly documents</ref>
        </application>
      </appInfo>
    </encodingDesc>
  </teiHeader>
  <text xml:lang="lexicon/en">
    <p>They corresponded to sedimentation rates (acc) of <measure type="list"><num>0.45</num> and <num>0.3</num> <measure type="SEDIMENTATION_RATE" unit="cm yr^-1">cm yr-1</measure></measure> (Table 2). Beyond the middle shelf, MAR decreased sharply to <measure type="list"><num>132</num> and <num>44</num> <measure type="MASS_ACCUMULATION_RATE" unit="g m^-2 yr^-1">g m-2 yr-1</measure></measure> at St. 2 (11 S) and St. 8 (12 S), respectively. These latter values are associated with measurable <measure type="value"><num>210</num></measure>Pbxs in the upper <measure type="value"><num>3</num> <measure type="LENGTH" unit="cm">cm</measure></measure> only and thus indicative of sediment winnowing or resuspension, as mentioned above. A relatively low MAR of <measure type="value"><num>128</num> <measure type="MASS_ACCUMULATION_RATE" unit="kg m^-2 yr^-1">kg m-2 yr-1</measure></measure> was also determined for St. 4 at 12 S compared to the neighbouring stations. MAR and acc tended to be higher at the deep oxygenated stations compared to the OMZ stations, with acc of <measure type="value"><num>0.06</num> <measure type="SEDIMENTATION_RATE" unit="cm yr^-1">cm yr-1</measure></measure> at St. 10 (12 S) and <measure type="value"><num>0.05</num> <measure type="SEDIMENTATION_RATE" unit="cm yr^-1">cm yr-1</measure></measure> at St. 5 (11 S). Aluminium accumulation showed similar trends to MAR, with highest values on the shelf and a pronounced increase below the OMZ (Fig. 3b).</p>
  </text>
</tei>

Annotated units

<units xmlns="http://www.tei-c.org/ns/1.0">
  <unit><prefix>c</prefix><base>m</base> <base>yr</base><pow>-1</pow></unit>
  <unit><base>g</base> <base>m</base><pow>-2</pow> <base>yr</base><pow>-1</pow></unit>
  <unit><prefix>c</prefix><base>m</base></unit>
  <unit><prefix>k</prefix><base>g</base> <base>m</base><pow>-2</pow> <base>yr</base><pow>-1</pow></unit>
  <unit><prefix>c</prefix><base>m</base> <base>yr</base><pow>-1</pow></unit>
  <unit><prefix>c</prefix><base>m</base> <base>yr</base><pow>-1</pow></unit>
</units>

Annotated values

<values xmlns="http://www.tei-c.org/ns/1.0">
  <value><number>0.45</number></value>
  <value><number>0.3</number></value>
  <value><number>132</number></value>
  <value><number>44</number></value>
  <value><number>210</number></value>
  <value><number>3</number></value>
  <value><number>128</number></value>
  <value><number>0.06</number></value>
  <value><number>0.05</number></value>
</values>

Here I also noticed that, to make the generated data available for model training, it's not enough to add the corresponding files to the corpus directories. You also have to use the file ending .tei or .tei.xml (e.g. see QuantitiesTrainer.java, line 78). The default output of the training data generation is .xml. I don't think I have seen any information on that in the docs.
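A minimal sketch of how the generated files could be renamed so that they carry the expected ending. The function name and the directory passed to it are hypothetical; only the .xml, .tei and .tei.xml endings come from the description above.

```python
from pathlib import Path


def rename_for_training(corpus_dir):
    """Rename generated *.xml files to *.tei.xml and return the new paths.

    Files that already end in .tei.xml are left untouched.
    """
    renamed = []
    for path in sorted(Path(corpus_dir).glob("*.xml")):
        if path.name.endswith(".tei.xml"):
            continue  # already has the ending the trainer looks for
        target = path.with_name(path.name[: -len(".xml")] + ".tei.xml")
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling `rename_for_training("path/to/corpus")` on a copy of the generated output first is the safer option, since the rename happens in place.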

Then I successfully trained all models and tested them with the web interface. As a first step, it is important to me that the measurements are recognized at all, so I tested the corresponding models with the same input that is used in the training data.

The output is:

[Screenshot of the web app output: Screen Shot 2021-05-12 at 13 54 04]

As you can see in the web app output, it can handle (recognize and normalize) the different measurements, but I have two problems regarding the recognition of the quantity type:

  1. The quantity type for a mass accumulation rate is not recognized. The type recognition of simple units (I tested with length) works as expected. I'm not sure what I'm doing wrong.
  2. The type of a sedimentation rate is recognized as a velocity, which is not wrong, since sedimentation rates are velocities. Of course I could label all velocities as sedimentation rates, which would help my use case, but the module would lose its generic approach.

Any ideas what I'm doing wrong?

@lfoppiano
Owner

lfoppiano commented May 14, 2021

Dear @thorge thanks for the feedback.
I have not had time to check the whole problem. We might need to do a couple of iterations on this. The lexicon needs to be improved anyway (#97) and this is a good opportunity.

I checked the process and what you did looks fine. Just one question: the notation in the lexicon (JSON file) should contain the "official formula" for the unit. In your case, the notation uses kg but the product uses g instead of kg. Which one should be used?
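A mismatch of this kind can be caught mechanically. The sketch below is not part of grobid-quantities; it is a hypothetical helper that renders each `product` in caret notation (for example `g m^-2 yr^-1`) and reports notations whose rendered product does not appear among the entry's `raw` strings. It assumes every entry lists at least one `raw` in that caret form.

```python
def render_product(product):
    """Render a product list as caret notation, e.g. 'cm yr^-1'."""
    parts = []
    for factor in product:
        symbol = factor.get("prefix", "") + factor["base"]
        if "pow" in factor:
            symbol += "^" + factor["pow"]
        parts.append(symbol)
    return " ".join(parts)


def inconsistent_notations(entry):
    """Return (raw, rendered product) pairs whose product matches no raw string."""
    raws = {notation["raw"] for notation in entry["notations"]}
    issues = []
    for notation in entry["notations"]:
        rendered = render_product(notation["product"])
        if rendered not in raws:
            issues.append((notation["raw"], rendered))
    return issues
```

Applied to the MASS_ACCUMULATION_RATE entry above, every notation would be reported, because the products render as `g m^-2 yr^-1` while the raw form says `kg m^-2 yr^-1`.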

Meanwhile I will have a deeper look, because the data flow is a bit complicated, due to the complexity and the noise that usually come with quantity and unit recognition.

Also, would you be able to share the references of the paper you are using?

@thorge
Contributor Author

thorge commented May 14, 2021

Dear @lfoppiano thanks for having a look at it.

You're right, it should be only g; otherwise the prefix k would have to be added in the unit lexicon. I tried different values and must have forgotten to add the prefix. Unfortunately, I get the same results with the correct notation.

Btw, you can find my work in the feature branch of my forked repo.

Regarding the references of the paper: what exactly do you mean? So far I have only worked with text excerpts (.txt) from preprocessed papers. The files are in the test/resources folder of my fork, see test5.txt and test6.txt.

@lfoppiano
Owner

@thorge I will check in more depth this week. Thanks for the text you used for testing.

@thorge
Contributor Author

thorge commented Jul 18, 2021

Hey @lfoppiano I found out why the labelling of MARs did not work properly. I had to add kg/(s·m²) to the unit lexicon in the notation field and now the recognition works fine.

{
   "raw":"kg/(s·m²)",
   "product":[
      {
         "prefix":"k",
         "base":"g"
      },
      {
         "base":"m",
         "pow":"-2"
      },
      {
         "base":"s",
         "pow":"-1"
      }
   ]
}

Before, I had notations like kg m^-2 yr^-1, because I was hoping I could define a normalized unit for MARs using this field. Apparently, though, a derived unit is always normalized using the normalized forms of its subunits: mass -> kilogram, area -> square meter, and time -> second. So in the end a MAR will always be normalized to kg/(s·m²), I guess, and that's why I had to add it as well.
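To make the normalization concrete, here is a worked conversion from g m^-2 yr^-1 (or a prefixed variant) to kg/(s·m²). It is an illustrative sketch, not the library's code: the function name is hypothetical, and the length of a year is an assumption (a Julian year of 365.25 days), since the thread does not say which value the module uses.

```python
# Assumption: one year is a Julian year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# Multipliers relative to the gram for the mass prefixes used in this sketch.
MASS_PREFIX_FACTORS = {"": 1.0, "k": 1e3, "m": 1e-3}


def mar_to_si(value, mass_prefix=""):
    """Convert a value in [prefix]g m^-2 yr^-1 to kg/(s·m²)."""
    grams = value * MASS_PREFIX_FACTORS[mass_prefix]
    kilograms = grams / 1e3  # mass -> kilogram
    return kilograms / SECONDS_PER_YEAR  # per year -> per second


if __name__ == "__main__":
    # 132 g m^-2 yr^-1 from the annotated paragraph, expressed in kg/(s·m²)
    print(mar_to_si(132))
```

The area term needs no conversion here because the input is already per square meter; a cm^-2 input would need an extra factor of 1e4.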

Sedimentation rates are a kind of velocity, and I still have trouble differentiating between the two by their normalized units. As far as I can see, the first (or last, I'm not sure right now) unit defined in the unit lexicon that matches the normalized unit is used for labeling.
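The first-or-last ambiguity can be illustrated with a toy lookup. This is a hypothetical sketch with made-up definition pairs, not the actual lookup in grobid-quantities: it only shows how iteration order decides the label when two types share a normalized unit, and how a multi-valued index would expose both.

```python
from collections import defaultdict

# Hypothetical (type, normalized unit) pairs in lexicon order.
DEFINITIONS = [
    ("VELOCITY", "m/s"),
    ("SEDIMENTATION_RATE", "m/s"),
    ("LENGTH", "m"),
]


def first_match(definitions, normalized):
    """Return the type of the first definition with this normalized unit."""
    for unit_type, unit in definitions:
        if unit == normalized:
            return unit_type
    return None


def last_match(definitions, normalized):
    """Build a plain dict index; later definitions overwrite earlier ones."""
    index = {unit: unit_type for unit_type, unit in definitions}
    return index.get(normalized)


def all_matches(definitions, normalized):
    """Build a multi-valued index so every candidate type is returned."""
    index = defaultdict(list)
    for unit_type, unit in definitions:
        index[unit].append(unit_type)
    return index.get(normalized, [])
```

With the pairs above, a linear scan labels `m/s` as VELOCITY while a dict-based index labels it SEDIMENTATION_RATE; only the multi-valued index keeps both candidates for a later disambiguation step.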

@lfoppiano
Owner

Dear @thorge thanks for the feedback on this issue. I'm trying to find some time to work on it but it's very difficult at the moment. I've managed to release version 0.7.0, which contains quite a lot of improvements (and, luckily, not many bugs ^^).

The overlapping problem should be solved in issue #96, but before that the lexicon needs to be updated so that multiple definitions can be returned.

Regarding your question: we look up the unit definition by the normalised unit because the base normalised units are what we use to fetch data from the lexicon.
However, looking at it, I think we should match on the product form rather than on the formula string, so that the lookup does not depend on the specific way the normalised units are formatted.
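A minimal sketch of what matching on the product form could look like. It is an assumption about one possible design, not the change that was made in the repository: the product list is turned into an order-independent key, so two notations with different raw strings but the same factors resolve to the same definition. The helper name `product_key` is hypothetical.

```python
def product_key(product):
    """Build an order-independent, hashable key from a product list.

    Each factor becomes (prefix, base, power); a missing prefix is '' and a
    missing power is 1.
    """
    factors = [
        (factor.get("prefix", ""), factor["base"], int(factor.get("pow", "1")))
        for factor in product
    ]
    return tuple(sorted(factors))


# Two differently formatted notations for the same unit.
superscript_form = {
    "raw": "kg/(s·m²)",
    "product": [
        {"prefix": "k", "base": "g"},
        {"base": "m", "pow": "-2"},
        {"base": "s", "pow": "-1"},
    ],
}
caret_form = {
    "raw": "kg s^-1 m^-2",
    "product": [
        {"prefix": "k", "base": "g"},
        {"base": "s", "pow": "-1"},
        {"base": "m", "pow": "-2"},
    ],
}

# Index definitions by product key instead of by raw string.
index = {product_key(superscript_form["product"]): "MASS_ACCUMULATION_RATE"}
```

Looking up `index.get(product_key(caret_form["product"]))` then finds the definition even though the raw strings differ in ordering and in how the exponent is written.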

@lfoppiano
Owner

Normally, the fix in 7b95705 should solve the issue of fetching the unit definition when the normalised unit contains superscript numbers.

@lfoppiano
Owner

I'm closing this issue, feel free to reopen if you want me to look more into it.
