copy/paste of queries at the bottom of the query_SDGX.xml pages into Scopus fails #6

Open
erikkemperman opened this issue Mar 16, 2021 · 4 comments

Comments

@erikkemperman
Contributor

Describe the bug
Copy/pasting the entire query at the bottom of the pages into Scopus advanced search gives syntax errors.
To Reproduce
Steps to reproduce the behavior:

  1. Go to https://aurora-network-global.github.io/sdg-queries/query_SDG1.xml
  2. Copy the query at the bottom of the page
  3. Paste it into Scopus advanced search
  4. See error

Expected behavior
These queries should work, or the "usage" text should be adapted?

Desktop (please complete the following information):

  • OS: Linux
  • Browser: Chromium
  • Version: 89.0.4389.82

Additional context
The problem seems to be that the concatenation of subqueries with "\nOR\n" is disregarded by the Scopus query editor, which then tries to execute queries containing "ORTITLE-ABS-KEY".
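A minimal sketch of this failure mode, assuming the Scopus editor simply drops line breaks from pasted text (that behavior is inferred from the error, not documented). The subquery strings and the helper names `join_with_newlines`, `join_with_spaces` and `drop_line_breaks` are made up for illustration:

```python
# Two illustrative subqueries; the real pages contain many more.
subqueries = [
    'TITLE-ABS-KEY("poverty line*")',
    'TITLE-ABS-KEY("poverty indicator*")',
]


def join_with_newlines(parts):
    """Join subqueries the way the rendered page appears to: newline, OR, newline."""
    return "\nOR\n".join(parts)


def join_with_spaces(parts):
    """Join subqueries with plain spaces around OR, so no line break is needed."""
    return " OR ".join(parts)


def drop_line_breaks(text):
    """Model an editor that discards line breaks on paste (an assumption)."""
    return text.replace("\n", "")


broken = drop_line_breaks(join_with_newlines(subqueries))
safe = drop_line_breaks(join_with_spaces(subqueries))

print(broken)  # the OR is fused to the next field name
print(safe)    # the OR keeps a space on both sides
```

If the assumption holds, any separator that keeps a real space on each side of OR should survive the paste.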

@mosart
Contributor

mosart commented May 18, 2021

Hi Erik,

The origin of the XML-to-HTML rendering can be found in the XSL:
https://github.com/Aurora-Network-Global/sdg-queries/blob/master/queries.xsl

I tried to fix that by adding a non-breaking space after the OR statement in the XSL template.
f160f07#diff-34b02004b91728a359a521c831ffef74d95aebdd3b9c03788288b8e9aaa4bcb6

However, when I tried this, the rendering of the HTML broke completely. (I could not test it on my laptop: modern web browsers do not seem to render XML to HTML when the file is opened on localhost, only when it is accessed via HTTPS.)
So I changed it back.

If you have a -tested- solution for the forced space after the OR statement in the XSL, please help me out.

Warm regards,
Maurice

@erikkemperman
Contributor Author

Hi Maurice,

Thanks for taking a look at the issue -- I am not sure how to remedy it, I'm afraid, and I have changed my approach since I reported this. I am now just parsing the raw XML files and composing the Scopus queries in a Python script. This suits me better anyway, since my goal is to transform the queries to work on Postgres.

Regards,
Erik

@mosart
Contributor

mosart commented May 18, 2021

Nice! That is why we put it in XML: for automation and human readability.
If you want, you can let me know more about your project, and perhaps also share the transformation script (like IDfuse did for the Elasticsearch DSL).
You are working at Erasmus University, right?

@erikkemperman
Contributor Author

erikkemperman commented May 20, 2021

Yes, I am an RSEC at Erasmus!

Agreed that XML is a nice format for this kind of thing -- although to facilitate translation of the queries to other languages, it might be worthwhile to consider making things a bit finer-grained, and perhaps slightly less Scopus-centric (although I understand those are the origins).

Just as an example,

<aqd:query-line field="TITLE-ABS-KEY">
  ("poverty line*") OR ("poverty indicator*")
</aqd:query-line>

To write a script that translates this to other query languages, I first need to parse the XML and then the Scopus query (for which, to my knowledge, no explicit grammar is publicly available, so I've had to cobble something together myself using ANTLR).

Suppose, instead, the XML looked something like this:

<aqd:query-line field="TITLE-ABS-KEY">
  <aqd:query-or>
    <aqd:query-parens>
      "poverty line*"
    </aqd:query-parens>
    <aqd:query-parens>
      "poverty indicator*"
    </aqd:query-parens>
  </aqd:query-or>
</aqd:query-line>

That way the tree structure of the query is reflected explicitly in the XML, and it would be much easier to transform to other query languages. Of course, the XSLT to render the Scopus queries would become a bit more complicated. Now that I have an ANTLR grammar that appears to parse the Scopus queries correctly, I suppose it would be pretty easy to use that to automatically transform the former XML to the latter, so that wouldn't have to be done manually.
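As a rough illustration of why the nested form is easier to process, here is a sketch that walks such a tree with Python's standard `xml.etree.ElementTree` and renders it back to a Scopus-style string. The namespace URI `urn:example:aqd`, the single `aqd` prefix for all elements, and the `query-or` / `query-parens` element names are assumptions taken from the proposal above, not part of the existing Aurora schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the real one is defined in the Aurora XML files.
AQD = "urn:example:aqd"

SAMPLE = f"""
<aqd:query-line xmlns:aqd="{AQD}" field="TITLE-ABS-KEY">
  <aqd:query-or>
    <aqd:query-parens>"poverty line*"</aqd:query-parens>
    <aqd:query-parens>"poverty indicator*"</aqd:query-parens>
  </aqd:query-or>
</aqd:query-line>
"""


def local_name(element):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return element.tag.split("}", 1)[-1]


def render_scopus(element):
    """Recursively render one element of the proposed tree as Scopus syntax."""
    name = local_name(element)
    children = list(element)
    if name == "query-line":
        inner = " ".join(render_scopus(child) for child in children)
        return f'{element.get("field")}({inner})'
    if name == "query-or":
        return " OR ".join(render_scopus(child) for child in children)
    if name == "query-parens":
        if children:
            inner = " ".join(render_scopus(child) for child in children)
        else:
            inner = (element.text or "").strip()
        return f"({inner})"
    raise ValueError(f"unexpected element: {name}")


scopus_query = render_scopus(ET.fromstring(SAMPLE.strip()))
print(scopus_query)
```

A renderer for another target language would only need a different set of string templates per element, without having to parse Scopus syntax first.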

As an aside, I'm beginning to regret the choice (not mine) for Postgres. The argument at the time was that it supports something like Scopus' W/N proximity operator. But playing around with this, and reading up on Postgres' <N> operator, it's actually subtly different.

For one thing, the Scopus proximity operator is not directional, i.e. A W/3 B matches the same documents as B W/3 A. This is not true in Postgres, so to get an equivalent query I have to emit extra clauses, e.g. (A <3> B) || (B <3> A). (*)

Another complication is that Scopus' W/3 means "within 3 or fewer words/lexemes", whereas the Postgres operator is exact. So the equivalent of A W/3 B would actually be something like (A <1> B) || (B <1> A) || (A <2> B) || (B <2> A) || (A <3> B) || (B <3> A).

Of course, these problems compound very quickly if multiple proximity operators occur in a single query: if I am given A W/3 B W/3 C, I will have to emit clauses for each permutation of A, B, and C (6 of them), each combined with the cartesian product of the two distance ranges 1, 2, 3 (9 combinations), for a total of 6 × 9 = 54 (!) clauses. And this is a trivial example; you can imagine I am ending up with some gigantic queries for the real thing!
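The counting above can be reproduced with a short sketch. It enumerates every ordering of the terms and every combination of exact distances, following the expansion scheme described in this comment; it is not a verified translation of Scopus semantics, and the function name `expand_proximity` is made up for illustration:

```python
from itertools import permutations, product


def expand_proximity(terms, max_distance):
    """Expand a chain of undirected 'within N' operators into exact, directed clauses.

    Emits one clause per ordering of the terms and per combination of
    exact distances 1..max_distance between neighbouring terms.
    """
    clauses = []
    gaps = len(terms) - 1
    for ordering in permutations(terms):
        for distances in product(range(1, max_distance + 1), repeat=gaps):
            parts = [ordering[0]]
            for distance, term in zip(distances, ordering[1:]):
                parts.append(f"<{distance}>")
                parts.append(term)
            clauses.append("(" + " ".join(parts) + ")")
    return clauses


two_terms = expand_proximity(["A", "B"], 3)
three_terms = expand_proximity(["A", "B", "C"], 3)

print(len(two_terms))    # 2 orderings x 3 distances
print(len(three_terms))  # 6 orderings x 9 distance pairs
print(" || ".join(two_terms))
```

In general the number of clauses is k! × N^(k-1) for k terms and maximum distance N, which is why the real queries grow so quickly.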

Finally, I end up not using the more advanced features of Postgres text search, and in fact I have to force it to "simple" mode in order to make the Scopus wildcards work. Postgres would like to help me with this, stemming words in the documents and queries for me, ignoring stop words, and leveraging a built-in thesaurus for synonyms.

But the way the Scopus queries are given here defeats this. For example, eradicat* occurs in the Scopus queries, but since that isn't a known word, Postgres doesn't know how to stem it. So unless I force it to simple mode, a document with the word eradicate or eradication will not match this query, because Postgres will have stemmed the valid word in the document but not the term in the query...

I can imagine, although it will be a lot of work, enriching the Aurora XML with a few valid expansions of the wild-carded terms. That way I can use those in my Postgres queries and have it do its magic.

Anyway, I have to get on with the next phase and unfortunately can't linger on these issues. Just thought I'd mention these observations while they are fresh on my mind. If I have a bit more time, I might revisit this if you are interested and try to come up with some more constructive / concrete proposals.

(*) Incidentally, Scopus does also have a directed variant, PRE/N and I wonder if some of the Aurora queries would be more precisely expressed that way.
