TLDR-749 fast auto textual layer detection #481

Merged 9 commits on Aug 9, 2024
2 changes: 1 addition & 1 deletion dedoc/api/api_args.py
@@ -26,7 +26,7 @@ class QueryParameters:
description='Set cells orientation in table headers, "90" means 90 degrees counterclockwise cells rotation')

# pdf handling
-pdf_with_text_layer: str = Form("auto_tabby", enum=["true", "false", "auto", "auto_tabby", "tabby"],
+pdf_with_text_layer: str = Form("auto_tabby", enum=["true", "false", "auto", "fast_auto", "auto_tabby", "tabby"],
description="Extract text from a text layer of PDF or using OCR methods for image-like documents")
language: str = Form("rus+eng", description="Recognition language ('rus+eng', 'rus', 'eng', 'fra', 'spa')")
pages: str = Form(":", description='Page numbers range for reading PDF or images, "left:right" means read pages from left to right')
3 changes: 2 additions & 1 deletion dedoc/api/web/index.html
@@ -122,12 +122,13 @@ <h4>PDF handling</h4>
<option value="true">true</option>
<option value="false">false</option>
<option value="auto">auto</option>
+<option value="fast_auto">fast_auto</option>
<option value="auto_tabby" selected>auto_tabby</option>
<option value="tabby">tabby</option>
</select> pdf_with_text_layer
</label>
</p>

<p>
<label> language
<input name="language" list="language" size="8" placeholder="rus+eng">
@@ -43,7 +43,7 @@ def can_read(self, file_path: Optional[str] = None, mime: Optional[str] = None,
         You can look to :ref:`pdf_handling_parameters` to get more information about `parameters` dictionary possible arguments.
         """
         from dedoc.utils.parameter_utils import get_param_pdf_with_txt_layer
-        return super().can_read(file_path=file_path, mime=mime, extension=extension) and get_param_pdf_with_txt_layer(parameters) in ("auto", "auto_tabby")
+        return super().can_read(file_path=file_path, mime=mime, extension=extension) and get_param_pdf_with_txt_layer(parameters) in ("auto", "fast_auto", "auto_tabby")

     def read(self, file_path: str, parameters: Optional[dict] = None) -> UnstructuredDocument:
         """
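The gate widened by the hunk above can be sketched in isolation. This is a stand-alone illustration, not dedoc code: `ALLOWED_MODES`, `AUTO_MODES`, and `routes_to_auto_reader` are hypothetical names, the fallback to `"auto_tabby"` is an assumption based on the `Form("auto_tabby", ...)` default in `api_args.py`, and the real `can_read` additionally requires the `super().can_read(...)` check to pass.

```python
from typing import Optional

# Values the API accepts for pdf_with_text_layer after this PR (copied from api_args.py).
ALLOWED_MODES = ("true", "false", "auto", "fast_auto", "auto_tabby", "tabby")

# Values accepted by the changed can_read() condition (hypothetical constant name).
AUTO_MODES = ("auto", "fast_auto", "auto_tabby")


def routes_to_auto_reader(parameters: Optional[dict] = None) -> bool:
    """Return True when the requested mode asks for automatic text-layer detection."""
    parameters = parameters or {}
    # Assumption: a missing key falls back to the API default, "auto_tabby".
    mode = str(parameters.get("pdf_with_text_layer", "auto_tabby")).lower()
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported pdf_with_text_layer value: {mode!r}")
    return mode in AUTO_MODES
```

With this sketch, `{"pdf_with_text_layer": "fast_auto"}` is routed to automatic detection, while `"true"`, `"false"`, and `"tabby"` are not.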
8 changes: 6 additions & 2 deletions dedoc/readers/pdf_reader/pdf_auto_reader/txtlayer_detector.py
@@ -29,8 +29,12 @@ def detect_txtlayer(self, path: str, parameters: dict) -> PdfTxtlayerParameters:
         """
         try:
             lines = self.__get_lines_for_predict(path=path, parameters=parameters)
-            is_correct = self.txtlayer_classifier.predict(lines)
-            first_page_correct = self.__is_first_page_correct(lines=lines, is_txt_layer_correct=is_correct)
+            if parameters["pdf_with_text_layer"] == "fast_auto":
+                is_correct = any(line._line.strip() for line in lines)
+                first_page_correct = True
+            else:
+                is_correct = self.txtlayer_classifier.predict(lines)
+                first_page_correct = self.__is_first_page_correct(lines=lines, is_txt_layer_correct=is_correct)
             return PdfTxtlayerParameters(is_correct_text_layer=is_correct, is_first_page_correct=first_page_correct)

         except Exception as e:
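The `fast_auto` branch in the hunk above reduces to a single non-ML check: the text layer counts as correct if any extracted line contains non-whitespace text. A minimal stand-alone sketch of that check follows. The `Line` class and `has_text_layer_fast` are hypothetical stand-ins for illustration, not dedoc types; the diff itself reads the text from `line._line`.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Line:
    """Hypothetical stand-in for an extracted line; only its text matters here."""
    text: str


def has_text_layer_fast(lines: Iterable[Line]) -> bool:
    """Treat the text layer as correct if any line has non-whitespace text."""
    return any(line.text.strip() for line in lines)
```

Under this check, a scanned PDF that yields no lines or only blank lines is sent to OCR, while any PDF with extractable text is accepted, including one whose text layer is garbled. That is the accuracy cost of skipping the classifier.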
2 changes: 2 additions & 0 deletions docs/source/parameters/pdf_handling.rst
@@ -41,6 +41,8 @@ PDF and images handling
If the document has a textual layer (is copyable), :class:`dedoc.readers.PdfTxtlayerReader` will be used for parsing.
If the document doesn't have a textual layer (it is an image, scanned document), :class:`dedoc.readers.PdfImageReader` will be used.

+* **fast_auto** -- the pipeline is the same as **auto** except for the detection of the textual layer. It is much faster but less accurate
+because it does not use a machine-learning classifier.

* **auto_tabby** -- automatic detection of textual layer presence in the PDF document.
This option is used to choose :class:`dedoc.readers.PdfAutoReader` for parsing.