Added support for image URLs #526
for more information, see https://pre-commit.ci
We should probably ensure that the installed Tesseract version can actually handle URLs. This requires a minimum Tesseract version and a binary linked against libcurl (which can be disabled at build time).
I added checks for the Tesseract version and the presence of libcurl. I'm not sure why the PR checks are still failing, though. Sources:
As previously mentioned, URL support is a compile-time feature: https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/CMakeLists.txt#L101 Ubuntu releases before 23.04 simply do not link against libcurl at build time: https://packages.ubuntu.com/jammy/tesseract-ocr https://packages.ubuntu.com/lunar/tesseract-ocr (See
Ah, I see what you mean. The logic I added should correctly check whether Tesseract was built with libcurl, so I can add it to the test case as well. Granted, the test would then never run until GitHub offers Ubuntu 23.04 as an Actions host and the action config is updated, but at least it would let people who compiled Tesseract themselves use the functionality.
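To make the check discussed above concrete, here is a minimal sketch, not the code from this PR. It assumes that a libcurl-enabled build mentions libcurl in the output of `tesseract --version`, and it uses a placeholder minimum version. The names `parse_version`, `supports_urls`, `tesseract_version_output` and `ASSUMED_MIN_URL_VERSION` are hypothetical.

```python
import re
import subprocess

# Assumption: placeholder threshold for illustration only; check the
# Tesseract release notes for the version that actually introduced URL input.
ASSUMED_MIN_URL_VERSION = (4, 1, 1)


def parse_version(version_output):
    """Extract (major, minor, patch) from the first line of the version banner."""
    first_line = version_output.splitlines()[0] if version_output else ''
    match = re.search(r'(\d+)\.(\d+)(?:\.(\d+))?', first_line)
    if not match:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def supports_urls(version_output, min_version=ASSUMED_MIN_URL_VERSION):
    """Return True if the banner suggests a recent, libcurl-enabled build."""
    version = parse_version(version_output)
    if version is None or version < min_version:
        return False
    # Assumption: builds linked against libcurl mention it in the banner.
    return 'libcurl' in version_output.lower()


def tesseract_version_output(cmd='tesseract'):
    """Run the binary and return its banner, or '' if it cannot be started."""
    try:
        result = subprocess.run(
            [cmd, '--version'], capture_output=True, text=True, check=False,
        )
    except OSError:
        return ''
    # Older releases print the banner to stderr, newer ones to stdout.
    return result.stdout + result.stderr
```

Calling `supports_urls(tesseract_version_output())` would then decide whether to accept a URL or raise a descriptive error. Keeping the parsing separate from the subprocess call makes the logic testable without a Tesseract installation.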
We should probably use the image from GitHub (for example by uploading it in a comment here) so we do not rely on external services. For testing: I am not sure whether we already want to test this here or add a conditional skip. Compiling Tesseract on GitHub Actions would work, but would probably add considerable overhead to each build.
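A conditional skip along those lines could look like the following sketch. It uses the standard library's `unittest.skipUnless` so it stays self-contained; with pytest, `pytest.mark.skipif` plays the same role. The helper `url_supported` is a hypothetical stand-in for whatever capability check the library ends up exposing, hard-coded to `False` here.

```python
import unittest


def url_supported():
    """Hypothetical stand-in for the real libcurl/version capability check."""
    return False


class ImageUrlTest(unittest.TestCase):
    @unittest.skipUnless(url_supported(), 'Tesseract was built without libcurl')
    def test_image_from_url(self):
        # Placeholder body: a real test would pass an image URL to the
        # wrapper and compare the recognized text with the expected string.
        self.fail('only reached when URL support is available')
```

On hosts without a libcurl-enabled build, the test is reported as skipped rather than failed, so the suite stays green until a suitable runner image is available.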
@stefan6419846 Please let me know your thoughts on my previous comment.
pytesseract/pytesseract.py
@@ -209,7 +218,14 @@ def save(image):
    try:
        with NamedTemporaryFile(prefix='tess_', delete=False) as f:
            if isinstance(image, str):
                yield f.name, realpath(normpath(normcase(image)))
                if image.startswith('http:') or image.startswith('https:'):
Couldn't this be shortened to image.startswith(('http:', 'https:'))?
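For reference, `str.startswith` accepts a tuple of prefixes, which must be passed positionally because the method takes no keyword arguments. A small sketch follows; the helper name `is_url` is hypothetical and not part of the PR.

```python
def is_url(image):
    """Return True if `image` is a string that starts with an HTTP(S) scheme."""
    # A tuple argument means "starts with any of these prefixes".
    return isinstance(image, str) and image.startswith(('http:', 'https:'))
```

This keeps the check to a single call and makes it easy to add further schemes later.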
Good point, fixed @stefan6419846
@marosstruk Thank you very much for your contribution. I really like the VCR.py approach, but the request is made by the external tesseract executable itself, so VCR.py, which patches Python HTTP libraries, cannot record it. PS: I almost forgot that we need at least a documentation entry in the Quickstart guide, so people know about this feature.
Fixes #415
In the spirit of the repo being a wrapper, I left URL validation to the tesseract-ocr process. If you think it is something that should be handled here, let me know.