Distinguish extraction problem from post-processing problem #80

kensei-te · 2022-04-15T08:06:27Z

To obtain a clean, ready-to-use dataset for machine learning from text data mining, two steps are needed.

First, the item of interest has to be properly extracted.
Second, it has to be properly post-processed.

During the curation process, I want to clearly distinguish extraction problems from post-processing problems. Every "status" or "error-type" already falls into one or the other, but I want to make the distinction explicit.

Luca is already kindly performing several post-processing steps on the extracted items, but the data are still not fully ready to use. I also want to discuss which part will be taken care of by Luca and which part might be our task.

I mean, every curated item will fall into one of 3 groups:

1. will be solved by improving extraction
2. will be solved by a post-processing method by Luca (therefore this may be provided in the open version of supercon2)
3. will be solved by a post-processing method by the user (this might be Takano-Gr's own)

It would be great if we could distinguish them during the curation. I hope we can discuss this in the coming meeting.
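
To make the idea concrete, here is a rough sketch of how each curated item could carry one of the three groups above. This is only an illustration: `ProblemCategory`, `CuratedItem`, `count_by_category` and the sample items are hypothetical names and made-up data, not part of supercon2.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ProblemCategory(Enum):
    """Hypothetical labels for the three groups listed above."""
    EXTRACTION = "solved by improving extraction"
    POSTPROCESSING_UPSTREAM = "solved by post-processing on Luca's side"
    POSTPROCESSING_USER = "solved by post-processing on the user's side"


@dataclass
class CuratedItem:
    """One curated record; the field names are illustrative only."""
    raw_text: str
    category: ProblemCategory


def count_by_category(items):
    """Count how many curated items fall into each category."""
    return Counter(item.category for item in items)


# Made-up items, only to show how the tally works.
items = [
    CuratedItem("doped La Fe", ProblemCategory.EXTRACTION),
    CuratedItem("La Fe", ProblemCategory.POSTPROCESSING_UPSTREAM),
    CuratedItem("La Fe", ProblemCategory.EXTRACTION),
]
counts = count_by_category(items)
```

A tally like this would show at a glance how much of the remaining work sits on the extraction side and how much on each post-processing side.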

lfoppiano · 2022-04-18T01:11:41Z

Good point! This is one of the goals of the guidelines.

For case 1) we can define these as "invalid boxes": the box misses some information or contains too much information.

Here are some examples:

Input: "In the doped La Fe we noticed that..."
Example 1: the extracted material is "La Fe", missing "doped"
Example 2: the extracted material is "doped La Fe we noticed"
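
The two examples above could be told apart with a simple comparison between the span the curator expects and the span that was extracted. This is a minimal sketch under that assumption; `classify_box` is a hypothetical helper that uses plain substring checks and is not how the tool currently works.

```python
def classify_box(expected: str, extracted: str) -> str:
    """Compare an extracted span with the span the curator expects.

    Hypothetical helper: plain substring checks, no tokenisation.
    """
    if extracted == expected:
        return "valid"
    if extracted in expected:
        # The box is shorter than it should be.
        return "invalid: missing information"
    if expected in extracted:
        # The box swallowed surrounding text.
        return "invalid: too much information"
    return "invalid: other mismatch"


expected = "doped La Fe"
example_1 = classify_box(expected, "La Fe")
example_2 = classify_box(expected, "doped La Fe we noticed")
```

Here `example_1` is flagged as missing information and `example_2` as containing too much information.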

Case 2) is a special case of 1). For example:

Input: "In the doped La Fe we noticed that...", where we assume that the material is correctly extracted as "doped La Fe".

Example 1: the post-processed formula is "La Fe", and this is correct
Example 2: the post-processed formula is "La" or anything else, which is not correct.
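
One possible way to keep the two kinds of problem apart is to judge post-processing only when the box itself is valid. The sketch below assumes the curator supplies the expected span and the expected formula; `triage` and its return labels are hypothetical.

```python
def triage(expected_span, extracted_span, expected_formula, processed_formula):
    """Decide which stage a curated item's problem belongs to.

    Hypothetical helper: extraction is checked first, so a wrong
    formula coming from a wrong box is not blamed on post-processing.
    """
    if extracted_span != expected_span:
        return "extraction problem"
    if processed_formula != expected_formula:
        return "post-processing problem"
    return "ok"


# The material is correctly extracted as "doped La Fe" in both calls.
result_1 = triage("doped La Fe", "doped La Fe", "La Fe", "La Fe")
result_2 = triage("doped La Fe", "doped La Fe", "La Fe", "La")
```

With this ordering, `result_1` is accepted and `result_2` is reported as a post-processing problem, matching Example 1 and Example 2 above.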

For case 3) we will have to work out the post-processing by picking up information scattered across the paper. An example was already discussed.

I think 1 and 3 are clear. 2 could be tricky because it requires the curator to know which types of post-processing are performed.

kensei-te assigned lfoppiano Apr 15, 2022

lfoppiano added the documentation and question labels Apr 15, 2022

lfoppiano added the need further clarification label Sep 30, 2022

lfoppiano added the curation label and removed the documentation label Nov 2, 2022
