
dedoc_lib_eng

IlyaKozlov edited this page Feb 18, 2021 · 4 revisions

!!! Code not in the master yet !!!

Dedoc as library

Suppose we are writing a project in Python and want to parse some documents without sending files over HTTP. In that case, we can use Dedoc as a library.

Installation

Prepare your system environment

The easiest way is to look at the Dockerfile and install the packages it lists in your OS.

(In particular, you need LibreOffice, which is used to convert documents.)

Install dedoc

 virtualenv -p python3 .
 source bin/activate
 pip install -e git+https://github.com/ispras/dedoc@cf479c6ed2497d4fc7b088cbddcf0a9b0db47e82#egg=dedoc

Verify the installation

python -c "from dedoc.utils import get_unique_name; print(get_unique_name('some.txt'))"

It should run without errors and print something like 1613578571_895.txt
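If the check has to run from inside a script rather than from the shell, a minimal sketch using only the standard library could look like this. The helper name `is_importable` is hypothetical and not part of dedoc. It only tests whether a top-level package can be found, not whether system dependencies such as LibreOffice are present.

```python
import importlib.util


def is_importable(package: str) -> bool:
    """Return True if a top-level package can be located on the import path."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # "dedoc" is the package installed above; the result depends on your environment.
    print("dedoc importable:", is_importable("dedoc"))
```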

Read the file content

If we don't want to do a full analysis, but just want to read a specific file, we can use one of the readers.

The list of readers can be viewed here

For example, to read a file in docx format, we can use the docx reader:

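# Assumed module path; check the list of readers if the import fails.
from dedoc.readers.docx_reader.docx_reader import DocxReader

reader = DocxReader()
document, _ = reader.read("/home/padre/ТЗ_медики.docx")
document.lines   # document text
document.tables  # document tables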

The complete file processing

If we want to use the full functionality of Dedoc, we need a manager.

Get configs

There are two configs: config and manager_config.

config sets various operating parameters; manager_config sets the tools the manager uses to process documents. Both configs are Python dicts, so you can use your own instead (try not to remove keys from the config).

Standard configs can be simply imported:

from dedoc.config import get_config
from dedoc.manager_config import get_manager_config

config = get_config()
manager_config = get_manager_config(config)
print(sorted(config.keys()))  # optional: check that the import succeeded
print(sorted(manager_config.keys()))

We get something like:

['api_port', 'import_path_init_api_args',...

['attachments_extractor', 'converter', ...
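Since both configs are plain dicts, a custom config can be derived from the standard one by overriding selected keys while keeping all the others. The sketch below is one way to do that under stated assumptions: `override_config` is a hypothetical helper, not part of dedoc, and the dict stands in for the result of `get_config()` (the key `api_port` comes from the output above, its value is a placeholder).

```python
from typing import Any, Dict


def override_config(base: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Return a shallow copy of base with selected keys replaced; no key is removed."""
    unknown = set(overrides) - set(base)
    if unknown:
        # Refuse keys that the base config does not know about.
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    result = dict(base)
    result.update(overrides)
    return result


# Stand-in for get_config(); the value is a placeholder, not the real default.
base_config = {"api_port": 1231}
custom_config = override_config(base_config, api_port=8080)
```

The original dict is left untouched, so the standard config can still be reused elsewhere.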

Create manager

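# Assumed module path; adjust if DedocManager lives elsewhere in your version.
from dedoc.manager.dedoc_manager import DedocManager

manager = DedocManager.from_config(config=config, manager_config=manager_config, version="1")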

Parse the document using the manager:

parsed_document = manager.parse_file(
    file_path="path to docx",
    parameters={"with_attachments": "True"},
)

The list of parameters can be found in the online documentation.
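In the call above the flag is passed as the string "True" rather than as a Python bool. Assuming parameter values are expected as strings (as they would be when sent over HTTP), a hypothetical helper such as `to_string_parameters` could build the dict from ordinary Python values. It is not part of dedoc.

```python
from typing import Any, Dict


def to_string_parameters(parameters: Dict[str, Any]) -> Dict[str, str]:
    """Convert every parameter value to its string form, e.g. True -> "True"."""
    return {key: str(value) for key, value in parameters.items()}


parameters = to_string_parameters({"with_attachments": True})
```

The resulting dict can then be passed as the `parameters` argument of `manager.parse_file`.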

The structure of the parsed document

ParsedDocument

It is returned by manager.parse_file

  1. metadata: DocumentMetadata (metadata of the document, depending on its type; for example, the size of the document)
  2. content: Optional[DocumentContent] (the content of the document, described below; for nested documents it can be None)
  3. version: Optional[str] (the version we passed to the manager)
  4. warnings: List[str] (list of problems encountered during processing)
  5. attachments: Optional[List[ParsedDocument]] (list of documents attached to this document)
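Because attachments are themselves parsed documents, the result forms a tree. The sketch below shows one way to walk it and gather warnings from every level. `ParsedDocumentSketch` is a stand-in that mirrors only the `warnings` and `attachments` fields listed above. It is not the dedoc class, and the helper names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class ParsedDocumentSketch:
    """Stand-in with only the fields used below; not the dedoc class."""
    warnings: List[str] = field(default_factory=list)
    attachments: Optional[List["ParsedDocumentSketch"]] = None


def iter_documents(document: ParsedDocumentSketch) -> Iterator[ParsedDocumentSketch]:
    """Yield the document itself, then every attachment, depth first."""
    yield document
    # attachments may be None, so fall back to an empty list.
    for attachment in document.attachments or []:
        yield from iter_documents(attachment)


def collect_warnings(document: ParsedDocumentSketch) -> List[str]:
    """Gather warnings from the document and all nested attachments."""
    return [w for doc in iter_documents(document) for w in doc.warnings]


nested = ParsedDocumentSketch(warnings=["inner"])
root = ParsedDocumentSketch(warnings=["outer"], attachments=[nested])
all_warnings = collect_warnings(root)
```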

DocumentContent

The structure of the parsed document is equivalent to the structure of the document described in the online documentation.
