Distinguish extraction problem from post-processing problem #80

Open
kensei-te opened this issue Apr 15, 2022 · 1 comment
Labels
curation Discussions related to the curation need further clarification question Further information is requested

Comments

@kensei-te
Collaborator

Obtaining a clean, ready-to-use dataset for machine learning from text and data mining takes two steps.

First, the item of interest has to be properly extracted.
Second, it has to be properly post-processed.

During the curation process, I want to clearly distinguish extraction problems from post-processing problems. Every "status" or "error type" already falls into one of the two, but I want to make this explicit.

Luca is already kindly performing several post-processing steps on the extracted items, but the data are still not fully ready to use. I would also like to discuss which parts will be taken care of by Luca and which parts might be our task.

I mean, every curated item would fall into one of three groups:

  1. will be solved by improving the extraction
  2. will be solved by a post-processing method by Luca (so it may be provided in the open version of supercon2)
  3. will be solved by a post-processing method on the user side (this might be original to Takano-Gr)

It would be great if we could distinguish them during curation. I hope we can discuss this in the coming meeting.
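As a rough illustration of how the three groups could be recorded during curation, here is a minimal sketch. The names `Resolution`, `CuratedItem`, and `count_by_resolution` are hypothetical and are not part of supercon2; the sample items are made up for the example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Resolution(Enum):
    """Hypothetical labels for where a curated problem should be fixed."""
    EXTRACTION = "improve extraction"
    UPSTREAM_POSTPROCESSING = "post-processing by Luca"
    USER_POSTPROCESSING = "post-processing by user"


@dataclass
class CuratedItem:
    """One curated record: the extracted text and the assigned group."""
    text: str
    resolution: Resolution


def count_by_resolution(items):
    """Count how many curated items fall into each group."""
    return Counter(item.resolution for item in items)


# Made-up records, only to show the bookkeeping.
items = [
    CuratedItem("La Fe", Resolution.EXTRACTION),
    CuratedItem("doped La Fe we noticed", Resolution.EXTRACTION),
    CuratedItem("doped La Fe", Resolution.UPSTREAM_POSTPROCESSING),
]
counts = count_by_resolution(items)
```

Keeping the group as an explicit field would make it possible to report, per curation round, how many problems belong to each side.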
@lfoppiano lfoppiano added documentation Improvements or additions to documentation question Further information is requested labels Apr 15, 2022
@lfoppiano
Owner

Good point! This is one of the goals of the guidelines.

For case 1) we can define these cases as "invalid boxes": the box misses some information or contains too much information.

Here are some examples:

Input: "In the doped La Fe we noticed that..."
Example 1: the extracted material is "La Fe", missing "doped"
Example 2: the extracted material is "doped La Fe we noticed"
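
The two kinds of invalid box can be told apart with a simple string comparison. This is only a sketch: `classify_box` is a hypothetical helper, and it assumes the curator supplies the expected span ("doped La Fe" here), which the real workflow may obtain differently.

```python
def classify_box(extracted: str, expected: str) -> str:
    """Compare an extracted span with the span the curator expected."""
    if extracted == expected:
        return "valid"
    if extracted in expected:
        # The box covers only part of the expected span.
        return "missing information"
    if expected in extracted:
        # The box covers the expected span plus extra text.
        return "too much information"
    return "other"


expected = "doped La Fe"  # assumed correct span for the input above
example_1 = classify_box("La Fe", expected)
example_2 = classify_box("doped La Fe we noticed", expected)
```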

Case 2) is a special case of 1). For example:

Input: "In the doped La Fe we noticed that...", where we assume that the material is correctly extracted as "doped La Fe".

Example 1: the post-processed formula is "La Fe", and this is correct
Example 2: the post-processed formula is "La" or anything else that is not correct.
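
One way to keep the two problem types separate is to check the extraction first and only then look at the post-processed formula. The sketch below assumes the curator knows both the expected span and the expected formula; `diagnose` and its arguments are hypothetical names used for illustration.

```python
def diagnose(extracted: str, expected_span: str,
             formula: str, expected_formula: str) -> str:
    """Attribute a problem to extraction first, then to post-processing."""
    if extracted != expected_span:
        # The box itself is wrong, so the formula is not judged.
        return "extraction"
    if formula != expected_formula:
        # The box is right but the derived formula is not.
        return "post-processing"
    return "ok"


correct = diagnose("doped La Fe", "doped La Fe", "La Fe", "La Fe")
wrong_formula = diagnose("doped La Fe", "doped La Fe", "La", "La Fe")
wrong_box = diagnose("La Fe", "doped La Fe", "La Fe", "La Fe")
```

Checking in this order means a wrong formula is never blamed on post-processing when the underlying box was already invalid.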

For case 3) we will have to sort out the post-processing by picking up information scattered across the paper. An example has already been discussed.

1 and 3 are clear, I think. 2 could be tricky because it requires the curator to know which types of post-processing are performed.

@lfoppiano lfoppiano added curation Discussions related to the curation and removed documentation Improvements or additions to documentation labels Nov 2, 2022