best way to use gnfinder to find names in tabulated data and get results tabulated as in origin #120

Open
abubelinha opened this issue Apr 4, 2022 · 7 comments

Comments

@abubelinha

abubelinha commented Apr 4, 2022

Hello

I am planning to use gnfinder to process a column from a table with about 2500 rows.

  • In the first column I have a museum specimen ID.
  • In the second column I have the whole unprocessed old specimen label, which contains one or several species names, locality, collector and maybe some comments.

| ID | LABEL |
|------|-------|
| 1 | Blah blah blah Scientificname_A blah blah Scientificname_B blah blah |
| 2 | Scientificname_C bleh blah blah Scientificname_A |
| ... | ... |
| 2500 | Blah blah blih blah Scientificname_X bluh blah blah Scientificname_F blah blah |

So, in fact, what I need to pass to gnfinder is each cell of the second column, to extract names from it and return matches against some preferred name sources. But of course, I need to keep the returned info associated with each specimen ID (1st column in my table).

  • Is it possible to somehow pass the 2nd column to gnfinder in just one call, so that gnfinder returns an array of 2500 responses, one for each row of my original table?
  • Or do I need to make 2500 separate gnfinder calls?

I was planning to use the API, but I suppose I could try the CLI if it is more suitable for this purpose.

Thanks a lot

EDIT: not sure if this is related to #56, but I am not using R dataframes. Just processing a CSV file in Python.

@dimus
Member

dimus commented Apr 4, 2022

Hi @abubelinha, one way you can do it locally is to set up a pipe in Python to talk to the command-line gnfinder on your computer. It would be similar to https://github.com/gnames/gnparser#pipes

2500 separate calls to the API also do not sound too strenuous for the service.
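
For illustration, a minimal sketch of the one-call-per-row approach that keeps every result tied to its specimen ID. The function `names_per_row` and the stand-in `fake_find_names` are hypothetical names, not part of gnfinder; in practice `find_names` would wrap whichever real call is used (an API request or a local pipe to the CLI).

```python
import csv
import io


def names_per_row(rows, find_names):
    """Call find_names once per label and keep each result tied to its ID."""
    results = []
    for specimen_id, label in rows:
        for name in find_names(label):
            results.append((specimen_id, name))
    return results


def fake_find_names(text):
    # Hypothetical stand-in for a real gnfinder call; it only picks out
    # the placeholder names used in the example table.
    return [w for w in text.split() if w.startswith("Scientificname_")]


# Two rows of the example table as tab-separated text (ID, LABEL).
tsv = (
    "1\tBlah blah blah Scientificname_A blah blah Scientificname_B blah blah\n"
    "2\tScientificname_C bleh blah blah Scientificname_A\n"
)
rows = list(csv.reader(io.StringIO(tsv), delimiter="\t"))
found = names_per_row(rows, fake_find_names)
```

Because the ID travels with each call, no position bookkeeping is needed; the cost is one call (and, with verification on, one online request) per row.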

@abubelinha
Author

abubelinha commented Apr 4, 2022

Thanks @dimus
But I guess even using pipes, this would imply 2500 local gnfinder pipe calls, wouldn't it? (which again means 2500 online requests when verification is turned on, correct?)
I would prefer to use one call, just in case I end up using this technique for something much bigger in the future.

Anyway, I had not realized that gnfinder returns start/end position of each name found in the long text string. That could be so useful for my use case.
Perhaps by creating a couple of new calculated columns in my table, label_length and cumulative_labels_length, and then concatenating all label cells and passing them to gnfinder as a single long string, I might be able to match the found names to the correct rows by comparing the returned start and end values of each name against these two columns' values.
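
A minimal sketch of this cumulative-offset idea, under two assumptions: the labels are joined with a separator (here `" | "`, an arbitrary choice), and the start positions reported by gnfinder count characters of the exact string that was passed in (if they count bytes instead, a conversion step would be needed). The helper names are hypothetical, and `text.find` only stands in for a position that gnfinder would report.

```python
from bisect import bisect_right


def concatenate_labels(labels, sep=" | "):
    """Join labels into one string; also return each label's start offset."""
    starts = []
    pos = 0
    for label in labels:
        starts.append(pos)
        pos += len(label) + len(sep)
    return sep.join(labels), starts


def row_for_offset(starts, offset):
    """Return the index of the label whose span contains the given offset."""
    return bisect_right(starts, offset) - 1


ids = [1, 2, 2500]
labels = [
    "Blah blah blah Scientificname_A blah blah Scientificname_B blah blah",
    "Scientificname_C bleh blah blah Scientificname_A",
    "Blah blah blih blah Scientificname_X bluh blah blah Scientificname_F blah blah",
]
text, starts = concatenate_labels(labels)

# Stand-in for a start position returned by gnfinder for one found name.
start = text.find("Scientificname_C")
specimen_id = ids[row_for_offset(starts, start)]
```

Storing only the start offset of each label is enough: a binary search over those offsets finds the row, so a separate label_length column is not strictly required.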

@dimus
Member

dimus commented Apr 4, 2022

If you do not mind using the start/end positions, it should all work in one go. However, take #38 into account. If your file is tab-separated, everything will work; if it is comma-separated, you would probably need to preprocess the file and add a space after commas.
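
The comma preprocessing could be sketched as follows. This is only an illustration with a hypothetical helper name; note that any start/end positions returned afterwards would refer to the preprocessed text, not to the original file.

```python
import re


def space_after_commas(text):
    """Insert a space after any comma directly followed by a non-space character."""
    return re.sub(r",(?=\S)", ", ", text)


line = "1,Blah blah Scientificname_A,Scientificname_B blah"
print(space_after_commas(line))
# prints: 1, Blah blah Scientificname_A, Scientificname_B blah
```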

@abubelinha
Author

abubelinha commented Apr 4, 2022

Good point!
As I am generating the original CSV, I can control its format and make it tab-separated.
Anyway, what I am passing to gnfinder is only one column (see the LABEL column in the table above), with all rows concatenated, like this (so no column separators are involved here):

"Blah blah blah Scientificname_A blah blah Scientificname_B blah blah|Scientificname_C bleh blah blah Scientificname_A|Blah blah blih blah Scientificname_X bluh blah blah Scientificname_F blah blah"

I use | symbols here to show you the boundaries between the original column cells (from top to bottom). But if I simply concatenate them, those symbols are not present in the text passed to gnfinder ... or would it be better to include them? Which character would you use (if any) to separate the content of contiguous cells before feeding gnfinder?

I am trying to figure out what will happen if a taxon name is right at the end or beginning of a cell (if no separator is added, the two names will run together).

Perhaps a space before and after the separator would be better? (so 3 characters instead of just one)

@dimus
Member

dimus commented Apr 4, 2022

Originally gnfinder was made to detect names in BHL, so it uses a space of any kind as a separator between words. The | characters should not affect anything, as long as there is a space after them.

@dimus
Member

dimus commented Apr 4, 2022

Several spaces are OK.

@dimus
Member

dimus commented May 10, 2022

CSV and TSV files should work fine, because they are going to be normalized to plain text with spaces.
