grobid does not return anything #1134

Open
naarkhoo opened this issue Jun 25, 2024 · 10 comments

Comments

@naarkhoo

I am using Grobid through langchain and have observed some odd behavior.
I hope you have the privileges to access the following papers: pubmed.ncbi.nlm.nih.gov/8440333 and pubmed.ncbi.nlm.nih.gov/18628819. For some reason, if I use

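from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import GrobidParser

loader = GenericLoader.from_filesystem(
    path='/Users/alka/Devel/LiteGrave/data/all/8440333/',
    suffixes=[".pdf"],
    glob="**/[!.]*",
    parser=GrobidParser(segment_sentences=True),
    show_progress=True,
)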

documents = loader.load()

it does not return anything, but it works with PyPDFParser:

from langchain.document_loaders.parsers.pdf import PyPDFParser

loader = GenericLoader.from_filesystem(
    path='/Users/alka/Devel/LiteGrave/data/all/8440333/',
    glob="**/*.pdf",
    parser=PyPDFParser(),
)

I wonder what the underlying reason could be?

@lfoppiano
Collaborator

Hi @naarkhoo,
the default parameters of the langchain parser assume that you're running Grobid locally at localhost:8070. See: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.parsers.grobid.GrobidParser.html

If this is the case, then to investigate further we would need to see the Grobid logs.
If it's not the case, you should follow the instructions at https://python.langchain.com/v0.2/docs/integrations/document_loaders/grobid/

The best approach is to install Grobid via docker, see https://grobid.readthedocs.io/en/latest/Grobid-docker/.

(Note: additional instructions can be found [here](https://python.langchain.com/v0.2/docs/integrations/providers/grobid/).)

Once grobid is up-and-running you can interact as described below.
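
Before debugging the parser, it can help to confirm that the server is actually reachable. The following is only a minimal sketch using the standard library: the helper names grobid_endpoint and grobid_is_alive are made up for illustration, and the base URL assumes the default local setup mentioned above. It relies on Grobid's /api/isalive endpoint.

```python
from urllib.request import urlopen


def grobid_endpoint(base_url: str, service: str) -> str:
    """Build a Grobid REST URL such as http://localhost:8070/api/isalive."""
    return f"{base_url.rstrip('/')}/api/{service.lstrip('/')}"


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if GET /api/isalive answers with HTTP 200, False otherwise."""
    try:
        with urlopen(grobid_endpoint(base_url, "isalive"), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts and HTTP errors (all OSError subclasses).
        return False
```

Calling grobid_is_alive() before loader.load() separates "the server is down or on another port" from "the server answered but the parser returned nothing".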

@naarkhoo
Author

naarkhoo commented Jun 25, 2024 via email

@lfoppiano
Collaborator

Hi @naarkhoo,
I cannot access the document, so for the moment, could you please share the log here?

@naarkhoo
Author

18628819.pdf
sure; thanks for asking

here is the log for 8440333

ERROR [2024-06-26 07:18:39,587] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! org.grobid.core.exceptions.GrobidException: [NO_BLOCKS] PDF parsing resulted in empty content
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:417)
! at org.grobid.core.engines.Segmentation.processing(Segmentation.java:95)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:150)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:119)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:587)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:577)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:290)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:291)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:240)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
! at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:256)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:358)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
! at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
! at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1665)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:36)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:46)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:40)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:313)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:267)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1382)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1304)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at io.dropwizard.metrics.jetty11.InstrumentedHandler.handle(InstrumentedHandler.java:307)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:822)
! at io.dropwizard.jetty.ZipExceptionHandlingGzipHandler.handle(ZipExceptionHandlingGzipHandler.java:26)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:173)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.Server.handle(Server.java:563)
! at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:505)
! at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:762)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:497)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:282)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
! at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:416)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:385)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:272)
! at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.lambda$new$0(AdaptiveExecutionStrategy.java:140)
! at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:936)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1080)
! at java.base/java.lang.Thread.run(Thread.java:833)

I also attached the PDF file

18628819.pdf
8440333.pdf

@lfoppiano
Collaborator

Thanks. I checked them and:

  • 18628819.pdf works fine.
  • 8440333.pdf is an image, so it's normal that there is no output. The error message says exactly that: [NO_BLOCKS] PDF parsing resulted in empty content 😄 Maybe the langchain parser needs to handle these cases.
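
One way a client could handle such cases is to look at the HTTP status before treating the body as TEI. The sketch below is an illustration, not the langchain code: check_grobid_response is a hypothetical helper, the status meanings follow the Grobid REST documentation for processFulltextDocument, and it assumes (unverified in this thread) that the exception text such as [NO_BLOCKS] is echoed in the response body.

```python
from typing import Optional, Tuple

# Meanings of non-200 status codes, per the Grobid REST documentation.
_REASONS = {
    204: "processed, but no content could be extracted",
    400: "bad request (missing or wrong parameters)",
    500: "internal Grobid error",
    503: "Grobid is busy, retry later",
}


def check_grobid_response(status_code: int, body: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (tei_xml, None) on success, or (None, reason) on failure."""
    if status_code == 200 and body.strip():
        return body, None
    if status_code == 200:
        reason = "empty response body"
    else:
        reason = _REASONS.get(status_code, f"unexpected HTTP status {status_code}")
    # Assumption: the server-side exception message is present in the body.
    if "[NO_BLOCKS]" in body:
        reason += "; the PDF has no extractable text (image-only scan?)"
    return None, reason
```

With a requests response r, this would be called as check_grobid_response(r.status_code, r.text), so an image-only PDF is reported with a reason instead of silently producing an empty list.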

@naarkhoo
Author

naarkhoo commented Jun 26, 2024

Thank you for looking into them.

So you mean Grobid doesn't have an OCR engine and is only a layout parser?!

Interesting that you say 18628819 works. I am running it through langchain and it does not return any output, so there must be some issue within langchain then.

loader = GenericLoader.from_filesystem(
    path='/data/all/18628819/',
    suffixes=[".pdf"],
    glob="**/[!.]*",
    parser=GrobidParser(segment_sentences=True),
    show_progress=True,
)

documents = loader.load()

I can open an issue on their repo and refer to this conversation.

@lfoppiano
Collaborator

@naarkhoo One possibility is that you hit the timeout. Could you please confirm that you are not getting any error message from langchain?

Something like: GROBID server timed out. Return None.?

@naarkhoo
Author

naarkhoo commented Jul 4, 2024

Not really. I increased the timeout to 180 in their loader:

def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        file_path = blob.source
        if file_path is None:
            raise ValueError("blob.source cannot be None.")
        pdf = open(file_path, "rb")
        files = {"input": (file_path, pdf, "application/pdf", {"Expires": "0"})}
        try:
            data: Dict[str, Union[str, List[str]]] = {}
            for param in ["generateIDs", "consolidateHeader", "segmentSentences"]:
                data[param] = "1"
            data["teiCoordinates"] = ["head", "s"]
            files = files or {}
            r = requests.request(
                "POST",
                self.grobid_server,
                headers=None,
                params=None,
                files=files,
                data=data,
                timeout=180,
            )
            xml_data = r.text
        except requests.exceptions.ReadTimeout:
            logger.error("GROBID server timed out. Return None.")
            xml_data = None

        if xml_data is None:
            return iter([])
        else:
            return self.process_xml(file_path, xml_data, self.segment_sentences)

I tried to run the Grobid Python client without langchain using

from grobid_client.grobid_client import GrobidClient

host = 'http://localhost:8070/api/processFulltextDocument/'
port = '8070'

client = GrobidClient(host, port)
client.process("processFulltextDocument", 
               '/Users/alka/Devel/LiteGrave/data/all/16369034/16369034.pdf', 
               output="/Users/alka/Devel/",
               consolidate_citations=True, 
               tei_coordinates=True, 
               force=True)

but it didn't succeed: it complains "GROBID server does not appear up and running 400". This is perhaps another issue for another ticket.

@lfoppiano
Collaborator

lfoppiano commented Jul 13, 2024

I finally found time to check this.

Two comments:

  1. I'm not sure which client and which version you are using. The GrobidClient does not take host and port as parameters, but a different parameter called grobid_server, e.g. grobid_server="http://localhost:8070".

  2. The langchain parser ignores all sections without <head>. That is a rather strong assumption, and the document 18628819.pdf is an example where it fails: basically everything is ignored (line 58 of grobid.py).
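
To make the second point concrete, here is a small sketch of how requiring a <head> can drop an entire section. It is not the langchain implementation: it uses the standard library's ElementTree on a hand-written TEI fragment, and the function name extract_sections, the require_head flag and the "No title" fallback are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_sections(tei_xml: str, require_head: bool) -> List[Tuple[str, str]]:
    """Return (section title, text) pairs from the <body> of a TEI document."""
    root = ET.fromstring(tei_xml)
    sections = []
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        if head is None and require_head:
            continue  # strict behaviour: sections without a heading are skipped
        title = "".join(head.itertext()).strip() if head is not None else "No title"
        text = " ".join(
            "".join(p.itertext()).strip() for p in div.iterfind("tei:p", TEI_NS)
        )
        sections.append((title, text))
    return sections


# Hand-written sample: the first section has no <head>, the second has one.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><p>Paragraph in a section without a heading.</p></div>
<div><head>Methods</head><p>Paragraph under a heading.</p></div>
</body></text></TEI>"""

strict = extract_sections(SAMPLE, require_head=True)
lenient = extract_sections(SAMPLE, require_head=False)
```

Here strict contains only the "Methods" section, while lenient also keeps the untitled one under the placeholder title. A document whose sections all lack headings would come back empty in strict mode, which matches the behaviour reported for 18628819.pdf.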

@naarkhoo
Author

Thanks so much for looking into this. I will see if I can fix it myself and/or file it as an issue on langchain ...
