Consolidate ami regex tools #32

Open
remkop opened this issue Apr 18, 2020 · 0 comments

remkop commented Apr 18, 2020

This is a follow-up item from discussion on #15:

@petermr I noticed there are two classes named AMIRegexTool and they both extend AbstractAMISearchTool. Which one should be the subcommand for ami? Or do you want both?

I picked org.contentmine.ami.tools.AMIRegexTool but it looks like that was the wrong one...

Note that org.contentmine.ami.plugins.regex.RegexPlugin is still available as a top-level command with a separate ami-regex launcher script. I can make that one the subcommand for ami if you want, but then what to do with org.contentmine.ami.tools.AMIRegexTool? (That one did not have a launcher script so perhaps you don't care too much about that one...)

Peter's reply:

What happened was that there was a primitive pre-picocli command line which
supported something I called Plugins. (They weren't actually plugins, as the
links were hardcoded, but they were designed to become plugins if and when I
worked out how! I even looked at OSGi at one stage.)
AMIRegex is currently "broken" - i.e. it isn't linked in, but it should be.
It would be great to have the following:

  • AMIRegex which is very useful for lots of things, such as identifiers.
    (There's a separate AMIIdentifier which is just a specialisation of regex.)
  • AMISpecies which uses style (...) to detect a possible species and
    then regex ([A-Z][a-z]+\s+[a-z]*) to pick up Tyrannosaurus rex. (The
    regex is more complex but ..). There's also more lookup logic - not just
    lexical.
  • Gene and Sequence (biological) are also useful. Word frequencies also
    suffer from this mess.
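
To make the species bullet concrete, here is a minimal Java sketch of the regex step only. It is not the AMISpecies code: the class name `SpeciesRegexSketch` and the helper `filterBinomials` are hypothetical, and the candidate strings stand in for whatever the style-based step produces, which is not specified above. It uses the simplified pattern quoted in the list, not the more complex real one. Run unanchored over plain text, that pattern also matches ordinary phrases such as "We found", which is presumably why a pre-filter is needed, so the sketch requires each candidate to match in full.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: NOT the AMISpecies implementation.
class SpeciesRegexSketch {
    // Simplified binomial pattern quoted in the discussion above.
    static final Pattern BINOMIAL = Pattern.compile("[A-Z][a-z]+\\s+[a-z]*");

    // Hypothetical helper: 'candidates' stands in for text spans that an
    // earlier style-based step has already flagged as possible species names.
    static List<String> filterBinomials(List<String> candidates) {
        List<String> hits = new ArrayList<>();
        for (String candidate : candidates) {
            String trimmed = candidate.trim();
            // matches() requires the whole candidate to fit the pattern.
            if (BINOMIAL.matcher(trimmed).matches()) {
                hits.add(trimmed);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> candidates =
                Arrays.asList("Tyrannosaurus rex", "in vitro", "Homo sapiens", "et al.");
        // prints [Tyrannosaurus rex, Homo sapiens]
        System.out.println(filterBinomials(candidates));
    }
}
```

Trimming before matching keeps a stray trailing space from letting a lone genus such as "Homo " through, since `[a-z]*` can match the empty string.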

If you look at org.contentmine.ami.plugins you can see these all had
pre-picocli commands (you can see why picocli saved the project!!).

I have forgotten exactly how the commands linked in, but it should be
relatively easy to reconstruct a prototype. I started doing this but got stuck
about halfway through (I think when I broke my leg). Somewhere in there I
have used a Bloom filter for rapid searching (I think it's still linked in).
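
For readers unfamiliar with the technique mentioned here, the following is a minimal, generic Bloom filter sketch in Java. It is not the filter used in ami: the class name `BloomFilterSketch`, the sizes, and the choice of hash functions (`String.hashCode` combined with an FNV-1a hash via double hashing) are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Generic illustration only: NOT the Bloom filter used in ami.
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // 32-bit FNV-1a hash, used as the second base hash.
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // Derive the i-th bit index from two base hashes (double hashing).
    private int index(String word, int i) {
        return Math.floorMod(word.hashCode() + i * fnv1a(word), size);
    }

    void add(String word) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(word, i));
        }
    }

    // false: the word was definitely never added.
    // true: the word was probably added (false positives are possible).
    boolean mightContain(String word) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(word, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1 << 16, 3);
        filter.add("tyrannosaurus");
        filter.add("rex");
        System.out.println(filter.mightContain("rex")); // prints true
    }
}
```

The appeal for dictionary-style searching is that a negative answer is exact and cheap, so only words that pass the filter need a slower exact lookup.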

But it may be that for general word searching and frequency it's better to
use Lucene/Solr and write results back into the tree.
