Unification error after harvesting LDES feed #66

Open
cedricdcc opened this issue Jul 10, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@cedricdcc
Contributor

A Docker container running a unification task encountered an error while processing a task identified by the URL http://redpencil.data.gift/id/task/668BE59D8B5E665623EBAC90, specifically for the operation http://mu.semte.ch/vocabularies/ext/ContentUnificationJob. The following error was logged:

/usr/local/lib/python3.8/site-packages/SPARQLWrapper/Wrapper.py:794: RuntimeWarning: Sending Accept header '*/*' because unexpected returned format 'json' in a 'CONSTRUCT' SPARQL query form
  warnings.warn(
Traceback (most recent call last):
  File "/app/task.py", line 240, in run_tasks
    generated = runner_func(used)
  File "/usr/src/app/ext/app/web.py", line 192, in <lambda>
    lambda sources: [run_vocab_unification(sources[0])],
  File "/usr/src/app/ext/app/web.py", line 120, in run_vocab_unification
    batch_res = query_sudo(get_batch_qs)
  File "/app/sudo_query.py", line 24, in query_sudo
    return sparqlQuery.query().convert()
  File "/usr/local/lib/python3.8/site-packages/SPARQLWrapper/Wrapper.py", line 960, in query
    return QueryResult(self._query())
  File "/usr/local/lib/python3.8/site-packages/SPARQLWrapper/Wrapper.py", line 926, in _query
    response = urlopener(request)
  File "/usr/local/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/local/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/local/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python3.8/urllib/request.py", line 1383, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/local/lib/python3.8/urllib/request.py", line 1358, in do_open
    r = h.getresponse()
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without response"
http.client.RemoteDisconnected: Remote end closed connection without response
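For reference, a minimal sketch (endpoint URL and query are illustrative, not the actual ones used by the service) of telling SPARQLWrapper to expect an RDF serialization for a CONSTRUCT query, which avoids the '*/*' Accept-header fallback flagged by the RuntimeWarning above:

from SPARQLWrapper import SPARQLWrapper, TURTLE

# Hypothetical endpoint; the deployment queries its own SPARQL endpoint.
sparql = SPARQLWrapper("http://database:8890/sparql")
sparql.setQuery("""
    CONSTRUCT { ?s ?p ?o }
    WHERE { ?s ?p ?o }
    LIMIT 10
""")
# A CONSTRUCT query returns RDF, not a JSON result set; requesting TURTLE
# explicitly avoids the "unexpected returned format 'json'" warning.
sparql.setReturnFormat(TURTLE)
graph_bytes = sparql.query().convert()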

Context

To reproduce this issue, the LDES feed located at https://vocab.nerc.ac.uk/ldes/P02/ was manually harvested (not via a TTL config) at a rate of 60 requests per minute, without dereferencing the members. Additionally, I hadn't filled out the mapping for the LDES feed prior to the initial unification.
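For illustration, a rough sketch of such a manual harvest (all names hypothetical; the actual harvesting is done by the stack's LDES consumer): fetch each fragment at a fixed rate, follow tree:node relations to the next pages, and never dereference the individual members:

import time
from rdflib import Graph, Namespace

TREE = Namespace("https://w3id.org/tree#")

def harvest_fragments(start_url, rate_per_minute=60):
    # Hypothetical helper: walk the feed fragment by fragment at a fixed
    # request rate (60/min here), collecting only the fragment pages.
    delay = 60.0 / rate_per_minute
    seen, queue = set(), [start_url]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        g = Graph()
        g.parse(url, format="turtle")  # fetches and parses one fragment
        for node in g.objects(None, TREE.node):
            queue.append(str(node))  # follow relations to further fragments
        time.sleep(delay)
    return seen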

From the container spawned for the LDES feed:

2024-07-10T07:20:00.169Z [EventStream] info: done
finished processing stream
Finished processing https://vocab.nerc.ac.uk/ldes/P02/

Error analysis

The key error message is:

http.client.RemoteDisconnected: Remote end closed connection without response

This indicates that the remote server closed the connection unexpectedly while the Docker container was attempting to process a SPARQL query.

Potential issues

  • Network issues: Intermittent network problems might have caused the connection to drop (see the retry sketch below).
  • SPARQL query format: The warning about the unexpected returned format 'json' in a 'CONSTRUCT' query form might indicate a mismatch between the requested and returned result formats.
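As a sketch of the first point (function name and parameters are made up for illustration), transient disconnects could be retried with a simple backoff around the SPARQLWrapper call:

import http.client
import time

from SPARQLWrapper import SPARQLWrapper, TURTLE

def query_with_retry(endpoint, query, attempts=3, backoff=2.0):
    # Hypothetical helper: retry a CONSTRUCT query when the remote end
    # drops the connection, waiting a little longer before each attempt.
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(TURTLE)
    for attempt in range(1, attempts + 1):
        try:
            return sparql.query().convert()
        except http.client.RemoteDisconnected:
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)  # linear backoff: 2s, 4s, ...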
cedricdcc added the bug label Aug 2, 2024
@MikiDi
Contributor

MikiDi commented Sep 12, 2024

Hi Cedric,

Could you please clarify whether you experience this issue with all LDES streams you commonly use, or just with this specific one? We know the pipeline isn't as efficient as it could be with regard to loading data, which might impact really large datasets such as P01; I don't expect any of this to happen for smaller datasets like P02, though. I'm going to try to reproduce this issue locally with a fresh setup using current master (e576025) sources.
To me this stack trace looks like the SPARQL server responded with something the client didn't expect. The most likely causes for this are a malformed query or some kind of timeout.

@cedricdcc
Contributor Author

I can confirm that I get the same problem for marineregions. I've used this config

@cedricdcc
Contributor Author

cedricdcc commented Sep 18, 2024

@MikiDi if you initiate a vocab with the following config https://github.com/vlizBE/vocabserver-deploy/blob/main/vocab-config/bodc_p02.vocab-config.ttl you will get an error in the LDES feed saying that the 2012 fragment cannot be parsed. This triggers a cascade of events that causes mu-search to take up all the processing power of the machine. Below is a snapshot of the docker stats:
[image: docker stats snapshot]

I've also added the logs of the mu-search container.
output_logs_search.txt
At about 09:18:40 I introduced the P02 LDES feed, after which I kept getting:

INFO [#1] UPDATE HANDLER -- Persisting update queue to disk (length: 0)

@MikiDi
Contributor

MikiDi commented Sep 20, 2024

Hi Cedric,
It indeed seems like the TTL provided by BODC is malformed. As the consumer logs indicate, parsing goes wrong at line 74 of http://vocab.nerc.ac.uk/ldes/P02/2012_01_01_00_00_00_2012_12_31_23_59_59.ttl: no string escaping was done on the quote in 'fluff'. You can read more about string escaping in Turtle literals here: https://www.w3.org/TR/turtle/#turtle-literals .
I could reproduce this parsing problem locally by directly plugging that specific LDES page in as the LDES stream when creating a new vocabulary in vocabsearch.
The excessive CPU/memory usage from search, however, I couldn't reproduce.
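To make the escaping point concrete, a small demo (the triples are made up, not the actual BODC data) of how an unescaped double quote inside a Turtle literal breaks parsing, while the escaped form succeeds:

from rdflib import Graph

# Invalid: the inner double quotes around fluff are not escaped.
bad_ttl = '<urn:ex:s> <urn:ex:p> "particles ("fluff")" .'
# Valid: the inner quotes are escaped as \" per the Turtle grammar.
good_ttl = '<urn:ex:s> <urn:ex:p> "particles (\\"fluff\\")" .'

try:
    Graph().parse(data=bad_ttl, format="turtle")
except Exception as err:
    print("parse failed as expected:", err)

g = Graph()
g.parse(data=good_ttl, format="turtle")
print(len(g), "triple parsed")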

@MikiDi
Contributor

MikiDi commented Oct 9, 2024

We presume the excessive memory usage (which snowballed into other problems) will be resolved with cccb4ef
