TLDR-749 fast auto textual layer detection #481

Merged on Aug 9, 2024 (9 commits)
Changes from 2 commits
2 changes: 2 additions & 0 deletions dedoc/api/api_args.py
@@ -28,6 +28,8 @@ class QueryParameters:
# pdf handling
pdf_with_text_layer: str = Form("auto_tabby", enum=["true", "false", "auto", "auto_tabby", "tabby"],
description="Extract text from a text layer of PDF or using OCR methods for image-like documents")
fast_auto: str = Form("false", enum=["true", "false"], description="Use non-ML solution to detect textual layer if selected auto or"
" auto_tabby in pdf_with_text_layer option. Much faster but less accurate.")
language: str = Form("rus+eng", description="Recognition language ('rus+eng', 'rus', 'eng', 'fra', 'spa')")
pages: str = Form(":", description='Page numbers range for reading PDF or images, "left:right" means read pages from left to right')
is_one_column_document: str = Form("auto", enum=["auto", "true", "false"],
4 changes: 4 additions & 0 deletions dedoc/api/web/index.html
@@ -128,6 +128,10 @@ <h4>PDF handling</h4>
</label>
</p>

<p>
<label><input name="fast_auto" type="checkbox" value="true"> fast_auto</label>
</p>

<p>
<label> language
<input name="language" list="language" size="8" placeholder="rus+eng">
8 changes: 6 additions & 2 deletions dedoc/readers/pdf_reader/pdf_auto_reader/txtlayer_detector.py
@@ -29,8 +29,12 @@ def detect_txtlayer(self, path: str, parameters: dict) -> PdfTxtlayerParameters:
         """
         try:
             lines = self.__get_lines_for_predict(path=path, parameters=parameters)
-            is_correct = self.txtlayer_classifier.predict(lines)
-            first_page_correct = self.__is_first_page_correct(lines=lines, is_txt_layer_correct=is_correct)
+            if parameters["fast_auto"] == "true":
+                is_correct = any(line._line.strip() for line in lines)
+                first_page_correct = True
+            else:
+                is_correct = self.txtlayer_classifier.predict(lines)
+                first_page_correct = self.__is_first_page_correct(lines=lines, is_txt_layer_correct=is_correct)
             return PdfTxtlayerParameters(is_correct_text_layer=is_correct, is_first_page_correct=first_page_correct)
 
         except Exception as e:
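The difference between the two branches of `detect_txtlayer` can be sketched in isolation. This is a minimal illustration under stated assumptions, not Dedoc's code: `TxtlayerResult`, `detect_txtlayer_sketch`, and `toy_classifier` are hypothetical stand-ins for `PdfTxtlayerParameters`, `detect_txtlayer`, and `txtlayer_classifier.predict`; lines are plain strings rather than Dedoc line objects; and the first-page check is reduced to the classifier verdict.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TxtlayerResult:
    # Hypothetical stand-in for PdfTxtlayerParameters.
    is_correct_text_layer: bool
    is_first_page_correct: bool


def detect_txtlayer_sketch(lines: List[str], fast_auto: str,
                           classifier: Callable[[List[str]], bool]) -> TxtlayerResult:
    if fast_auto == "true":
        # Fast path: any non-whitespace text counts as a correct textual layer,
        # and the first page is trusted without a separate check.
        has_text = any(line.strip() for line in lines)
        return TxtlayerResult(is_correct_text_layer=has_text, is_first_page_correct=True)
    # Slow path: delegate to a classifier.
    # Simplification (assumption): the first-page check reuses the classifier verdict.
    is_correct = classifier(lines)
    return TxtlayerResult(is_correct_text_layer=is_correct, is_first_page_correct=is_correct)


def toy_classifier(lines: List[str]) -> bool:
    # Toy rule, NOT Dedoc's model: accept text only if more than 80% of its
    # characters are alphanumeric or whitespace.
    text = "".join(lines)
    good = sum(ch.isalnum() or ch.isspace() for ch in text)
    return bool(text) and good / len(text) > 0.8


# A textual layer made of replacement characters: present, but unusable.
garbled = ["\ufffd\ufffd\ufffd \ufffd\ufffd", "\ufffd\ufffd\ufffd\ufffd"]
fast = detect_txtlayer_sketch(garbled, "true", toy_classifier)
slow = detect_txtlayer_sketch(garbled, "false", toy_classifier)
print(fast.is_correct_text_layer, slow.is_correct_text_layer)  # True False
```

The garbled input shows the accuracy trade-off: the fast path accepts a broken textual layer because it only checks that some text exists, while a content-aware check can reject it.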
9 changes: 9 additions & 0 deletions docs/source/dedoc_api_usage/api.rst
@@ -210,6 +210,15 @@ Api parameters description
If the document doesn't have a textual layer (it is an image or a scanned document), PDF document parsing works as it does with ``need_pdf_table_analysis=false``.
It is highly recommended to use this option value for any PDF document parsing.

* - fast_auto
- true, false
- false
- Enable fast textual layer detection. Takes effect only when **pdf_with_text_layer** is set to **auto** or **auto_tabby**.

* **true** -- if any text is detected in a PDF file, Dedoc assumes that a textual layer is present and correct. Much faster but less accurate.
* **false** -- use :class:`dedoc.readers.TxtlayerClassifier` to detect the textual layer and verify its correctness.


* - language
- rus, eng, rus+eng, fra, spa
- rus+eng
11 changes: 10 additions & 1 deletion docs/source/parameters/pdf_handling.rst
@@ -41,13 +41,22 @@ PDF and images handling
If the document has a textual layer (is copyable), :class:`dedoc.readers.PdfTxtlayerReader` will be used for parsing.
If the document doesn't have a textual layer (it is an image, scanned document), :class:`dedoc.readers.PdfImageReader` will be used.


* **auto_tabby** -- automatic detection of textual layer presence in the PDF document.
This option is used to choose :class:`dedoc.readers.PdfAutoReader` for parsing.
If the document has a textual layer (is copyable), :class:`dedoc.readers.PdfTabbyReader` will be used for parsing.
If the document doesn't have a textual layer (it is an image, scanned document), :class:`dedoc.readers.PdfImageReader` will be used.
It is highly recommended to use this option value for any PDF document parsing.

* - fast_auto
- true, false
- false
- * :meth:`dedoc.readers.PdfAutoReader.read`
* :meth:`dedoc.readers.PdfAutoReader.can_read`
- Enable fast textual layer detection. Takes effect only when **pdf_with_text_layer** is set to **auto** or **auto_tabby**.

* **true** -- if any text is detected in a PDF file, Dedoc assumes that a textual layer is present and correct. Much faster but less accurate.
* **false** -- use :class:`dedoc.readers.TxtlayerClassifier` to detect the textual layer and verify its correctness.

* - language
- rus, eng, rus+eng, fra, spa
- rus+eng
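The documented interaction between `fast_auto` and `pdf_with_text_layer` can be illustrated with a small helper that builds the parameter dictionary. This is a sketch only: `build_pdf_parameters` is a hypothetical name, not part of Dedoc, and raising an error for an ineffective combination is this helper's own choice (the docs above only say the option has no effect in that case). The allowed values are copied from the `enum` lists in `api_args.py`.

```python
from typing import Dict

# Allowed values, copied from the enum in dedoc/api/api_args.py shown in this PR.
TEXT_LAYER_MODES = ("true", "false", "auto", "auto_tabby", "tabby")
# fast_auto is documented to take effect only for these two modes.
AUTO_MODES = ("auto", "auto_tabby")


def build_pdf_parameters(pdf_with_text_layer: str = "auto_tabby", fast_auto: bool = False) -> Dict[str, str]:
    # Hypothetical helper: returns string-valued parameters, matching the
    # "true"/"false" string convention used by the API form fields.
    if pdf_with_text_layer not in TEXT_LAYER_MODES:
        raise ValueError(f"unknown pdf_with_text_layer value: {pdf_with_text_layer!r}")
    if fast_auto and pdf_with_text_layer not in AUTO_MODES:
        # Stricter than the documented behaviour, to surface a likely mistake early.
        raise ValueError("fast_auto only takes effect with pdf_with_text_layer set to auto or auto_tabby")
    return {"pdf_with_text_layer": pdf_with_text_layer, "fast_auto": "true" if fast_auto else "false"}


params = build_pdf_parameters("auto_tabby", fast_auto=True)
print(params)  # {'pdf_with_text_layer': 'auto_tabby', 'fast_auto': 'true'}
```

The resulting dictionary is meant to be sent as the form fields of an API request or passed as reader parameters; how it is submitted depends on the deployment and is not shown here.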