dedoc_lib_eng
!!! This code is not in the master branch yet !!!
Suppose we are writing a project in Python and want to parse some documents without sending files over HTTP. In this situation, we can use Dedoc as a library.
The easiest way to set it up is to look at the Dockerfile and install the same packages in your OS (in particular, you need LibreOffice to convert documents). These packages were installed for Ubuntu; on other systems, the sequence of actions may differ.
virtualenv -p python3 .
source bin/activate
pip install -e git+https://github.com/ispras/dedoc@cf479c6ed2497d4fc7b088cbddcf0a9b0db47e82#egg=dedoc
Verify the installation:
python -c "from dedoc.utils import get_unique_name; print(get_unique_name('some.txt'))"
It should run without errors and print something like 1613578571_895.txt
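If the one-liner above fails, it can help to first check whether the package is visible to the active interpreter at all. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of Dedoc.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("dedoc"):
        print("dedoc is importable")
    else:
        print("dedoc was not found; check that the virtualenv is activated")
```

A negative result usually means the virtualenv is not activated or the pip install step did not complete.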
If we don't need a full document analysis and only want to read a specific file, we can use one of the readers.
The list of readers can be viewed here
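For example, to read a file in docx format, we can use the docx reader:
# The import path is an assumption; check the readers package of your installed version.
from dedoc.readers.docx_reader.docx_reader import DocxReader
reader = DocxReader()
document, _ = reader.read("/home/padre/ТЗ_медики.docx")
document.lines # document text
document.tables # document tables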
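As a quick sanity check on what a reader returned, one might summarize the two attributes shown above. This is only a sketch: it assumes `lines` and `tables` are sequences that support `len()`, and it uses a `SimpleNamespace` as a stand-in for the real document object so that it runs without Dedoc installed. The function name `summarize` is hypothetical.

```python
from types import SimpleNamespace


def summarize(document) -> str:
    """Describe how many lines and tables a read document contains."""
    return f"{len(document.lines)} lines, {len(document.tables)} tables"


# Stand-in for the object returned by reader.read(); the values are placeholders.
fake_document = SimpleNamespace(lines=["Title", "First paragraph"], tables=[])
print(summarize(fake_document))  # 2 lines, 0 tables
```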
If we want to use the full functionality of Dedoc, we need a manager.
There are two configs: config and manager_config.
config sets various operating parameters, while manager_config is used by the manager for document processing. Both configs are Python dicts, so you can supply your own instead (try not to remove keys from the config).
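One way to follow the advice about not removing keys is to copy the standard config and change only the values you need. The sketch below uses a plain dict as a stand-in for the standard config; the helper name `override_config` and the values are placeholders, and only the key names are taken from this page.

```python
from copy import deepcopy
from typing import Any, Dict


def override_config(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `overrides` applied; no key is removed."""
    custom = deepcopy(base)  # leave the original dict untouched
    custom.update(overrides)
    return custom


# Stand-in for the standard config dict; the values are placeholders.
base_config = {"api_port": 8000, "import_path_init_api_args": "placeholder"}
custom_config = override_config(base_config, {"api_port": 9000})

# Every key of the base config is still present in the custom one.
assert set(base_config) <= set(custom_config)
```

Working on a deep copy keeps the standard config reusable if several differently configured managers are needed.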
The standard configs can simply be imported:
from dedoc.config import get_config
from dedoc.manager_config import get_manager_config
config = get_config()
manager_config = get_manager_config(config)
print(sorted(config.keys())) # optional: check that the import succeeded
print(sorted(manager_config.keys()))
We should get something like this:
['api_port', 'import_path_init_api_args',...
['attachments_extractor', 'converter', ...
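Now we can create the manager:
# The import path is an assumption; check your installed version.
from dedoc.manager.dedoc_manager import DedocManager
manager = DedocManager.from_config(config=config, manager_config=manager_config, version="1")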
Parse the document using the manager:
parsed_document = manager.parse_file(
file_path="path to docx",
parameters={"with_attachments": "True"},
)
The list of parameters can be found in the online documentation.
The manager.parse_file method returns a ParsedDocument with the following fields:
- metadata: DocumentMetadata (metadata of the document depending on the type, for example, the size of the document)
- content: Optional[DocumentContent] (the content of the document, covered in more detail later; it can be None for nested documents)
- version: Optional[str] (the version passed to manager)
- warnings: List[str] (list of problems encountered during the operation)
- attachments: Optional[List[ParsedDocument]] (list of documents attached to this document)
The structure of the parsed document is equivalent to the structure of the document described in the online documentation.
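Because attachments are themselves parsed documents, the result forms a tree. The sketch below shows one way to walk that tree and gather warnings from every level. It relies only on the field names listed above; `ParsedDocumentStub` is a hypothetical stand-in so the example runs without Dedoc, and the metadata dicts are placeholders for the real DocumentMetadata objects.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Optional


@dataclass
class ParsedDocumentStub:
    """Stand-in that mirrors the fields listed above (not the real class)."""
    metadata: Any
    content: Optional[Any] = None
    version: Optional[str] = None
    warnings: List[str] = field(default_factory=list)
    attachments: Optional[List["ParsedDocumentStub"]] = None


def iter_documents(document) -> Iterator[Any]:
    """Yield the document itself, then every nested attachment, depth first."""
    yield document
    for attachment in document.attachments or []:  # attachments may be None
        yield from iter_documents(attachment)


def collect_warnings(document) -> List[str]:
    """Gather warnings from the document and all of its attachments."""
    return [w for doc in iter_documents(document) for w in doc.warnings]


attachment = ParsedDocumentStub(
    metadata={"file_name": "inner.txt"}, warnings=["placeholder warning"]
)
root = ParsedDocumentStub(
    metadata={"file_name": "outer.docx"}, content={}, version="1",
    attachments=[attachment],
)
print(collect_warnings(root))
```

The same two functions should work on any object that exposes `warnings` and `attachments` attributes with these meanings.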