dedoc_lib_eng

IlyaKozlov edited this page Feb 18, 2021 · 4 revisions

Dedoc as a library

Suppose we are writing a project in Python and want to parse some documents without sending files over HTTP. In this situation, we can use Dedoc as a library.

Installation

Prepare your system environment

The easiest way is to look at the Dockerfile and install the packages it lists in your OS (in particular, you need LibreOffice to convert documents). The Dockerfile installs these packages for Ubuntu; on other systems, the sequence of actions may differ.
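
Since document conversion depends on LibreOffice, it can help to check that it is reachable before going further. The following is a minimal sketch using only the standard library; the executable names `soffice` and `libreoffice` are assumptions (they vary by OS and package), and `find_libreoffice` is a hypothetical helper, not part of Dedoc.

```python
import shutil
from typing import Optional, Sequence

# Assumption: LibreOffice is commonly exposed under one of these names,
# depending on the OS and on how it was packaged.
CANDIDATE_NAMES = ("soffice", "libreoffice")


def find_libreoffice(names: Sequence[str] = CANDIDATE_NAMES) -> Optional[str]:
    """Return the full path of the first matching executable on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    location = find_libreoffice()
    if location is None:
        print("LibreOffice not found on PATH; document conversion will likely fail")
    else:
        print(f"LibreOffice found at {location}")
```

If the check reports nothing, install LibreOffice with your system's package manager as the Dockerfile does for Ubuntu.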

Install dedoc

 virtualenv -p python3 .
 source bin/activate
 pip install -e git+https://github.com/ispras/dedoc@cf479c6ed2497d4fc7b088cbddcf0a9b0db47e82#egg=dedoc

Verify the installation

python -c "from dedoc.utils import get_unique_name; print(get_unique_name('some.txt'))"

It should run without errors and print something like 1613578571_895.txt
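
To make the shape of that output easier to read, here is a rough sketch of a function that produces names of the same form. It is a hypothetical stand-in inferred only from the sample output above (a Unix timestamp, an underscore, a small number, and the original extension); it is not Dedoc's actual `get_unique_name` implementation.

```python
import os
import random
import time


def make_unique_name(filename: str) -> str:
    """Build a name like '<unix timestamp>_<random number><original extension>'."""
    _, extension = os.path.splitext(filename)
    timestamp = int(time.time())     # seconds since the epoch
    suffix = random.randint(0, 999)  # assumed range, chosen for illustration
    return f"{timestamp}_{suffix}{extension}"


if __name__ == "__main__":
    print(make_unique_name("some.txt"))
```

The printed value changes on every call, so any output of this general shape indicates that the real check succeeded.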

Read the file content

If we don't need a full document analysis and only want to read a specific file, we can use one of the readers.

The list of readers can be viewed here

For example, if we want to read a file in docx format, we can use the docx reader:

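# The import path below is an assumption; check the dedoc source tree for your version.
from dedoc.readers.docx_reader.docx_reader import DocxReader

reader = DocxReader()
document, _ = reader.read("/home/padre/ТЗ_медики.docx")
document.lines   # document text
document.tables  # document tables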

The complete file processing

If we want to use the full functionality of Dedoc, we need a manager.

Get configs

There are two configs: config and manager_config.

config sets various runtime parameters; manager_config is used by the manager to process documents. Both configs are Python dicts, so you can use your own instead (try not to remove keys from the standard configs).

The standard configs can simply be imported:

from dedoc.config import get_config
from dedoc.manager_config import get_manager_config

config = get_config()
manager_config = get_manager_config(config)
print(sorted(config.keys()))  # optional: check that the import succeeded
print(sorted(manager_config.keys()))

We should get something like this:

['api_port', 'import_path_init_api_args',...

['attachments_extractor', 'converter', ...
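
Because both configs are plain dicts, a custom config can be built by copying a standard one and overriding selected keys, which keeps every original key in place. The sketch below shows one way to do that; `override_config` is a hypothetical helper, and `base_config` with its port value is a stand-in for the dict returned by `get_config()`, used here so the example runs without Dedoc installed.

```python
def override_config(base: dict, overrides: dict) -> dict:
    """Return a shallow copy of `base` with `overrides` applied.

    No keys are removed, and unknown keys are rejected to catch typos.
    """
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    result = dict(base)  # shallow copy: the original dict is left untouched
    result.update(overrides)
    return result


# Stand-in for get_config(); the value is illustrative only.
base_config = {"api_port": 1231}
custom_config = override_config(base_config, {"api_port": 8080})
print(custom_config)  # {'api_port': 8080}
```

Rejecting unknown keys is a design choice made for this sketch; drop that check if you need to add keys of your own. The copy is shallow, so nested values are still shared with the original dict.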

Create manager

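from dedoc.manager.dedoc_manager import DedocManager  # import path is an assumption; verify it for your version
manager = DedocManager.from_config(config=config, manager_config=manager_config, version="1")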

Parse the document using the manager:

parsed_document = manager.parse_file(
    file_path="path to docx",
    parameters={"with_attachments": "True"},
)

The list of parameters can be found in the online documentation.

The structure of the parsed document

ParsedDocument

It is returned by the manager.parse_file method and has the following fields:

  1. metadata: DocumentMetadata (metadata of the document, depending on its type; for example, the size of the document)
  2. content: Optional[DocumentContent] (the content of the document, described below; it can be None for nested documents)
  3. version: Optional[str] (the version passed to the manager)
  4. warnings: List[str] (list of problems encountered during processing)
  5. attachments: Optional[List[ParsedDocument]] (list of documents attached to this document)
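
Since attachments are themselves ParsedDocument objects, the result forms a tree. The sketch below walks such a tree and gathers all warnings. `ParsedDocumentSketch` is a stand-in dataclass that only mirrors the fields listed above so the example runs without Dedoc; it is not Dedoc's own class, and `iter_documents` and `collect_warnings` are hypothetical helpers.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Optional


@dataclass
class ParsedDocumentSketch:
    """Stand-in mirroring the fields listed above; not Dedoc's own class."""
    metadata: Any
    content: Optional[Any] = None
    version: Optional[str] = None
    warnings: List[str] = field(default_factory=list)
    attachments: Optional[List["ParsedDocumentSketch"]] = None


def iter_documents(document: ParsedDocumentSketch) -> Iterator[ParsedDocumentSketch]:
    """Yield the document itself, then every nested attachment, depth first."""
    yield document
    for attachment in document.attachments or []:  # attachments may be None
        yield from iter_documents(attachment)


def collect_warnings(document: ParsedDocumentSketch) -> List[str]:
    """Gather warnings from the document and all of its attachments."""
    return [w for doc in iter_documents(document) for w in doc.warnings]


# Illustrative data only: a root document with one attachment.
attachment = ParsedDocumentSketch(metadata={"size": 10}, warnings=["attachment warning"])
root = ParsedDocumentSketch(
    metadata={"size": 100},
    version="1",
    warnings=["root warning"],
    attachments=[attachment],
)
print(collect_warnings(root))  # ['root warning', 'attachment warning']
```

The helpers rely only on the `warnings` and `attachments` attributes, so, assuming the real object exposes them as attributes, the same traversal should apply to the value returned by manager.parse_file.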

DocumentContent

The structure of the parsed document content is equivalent to the document structure described in the online documentation.
