dedoc_lib_eng
!!! This code is not in the master branch yet !!!
Suppose we are writing a project in Python and want to parse some documents without sending files over HTTP. In that case, we can use dedoc as a library.
The easiest way to set it up is to look at the Dockerfile and install the same packages in your OS
(in particular, you need LibreOffice, which converts documents).
Then create a virtual environment and install dedoc:
virtualenv -p python3 .
source bin/activate
pip install -e git+https://github.com/ispras/dedoc@cf479c6ed2497d4fc7b088cbddcf0a9b0db47e82#egg=dedoc
Verify the installation:
python -c "from dedoc.utils import get_unique_name; print(get_unique_name('some.txt'))"
It should run without errors and print something like 1613578571_895.txt
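If the check above fails, it can help to see which prerequisite is missing. The following is a minimal sketch that uses only the standard library; the LibreOffice launcher names (soffice, libreoffice) are an assumption and may differ on your OS.

```python
import importlib.util
import shutil


def check_environment() -> dict:
    """Report which prerequisites are visible from this interpreter."""
    return {
        # True if the dedoc package can be found in the active environment
        "dedoc": importlib.util.find_spec("dedoc") is not None,
        # True if a LibreOffice launcher is on PATH (binary names are an assumption)
        "libreoffice": any(shutil.which(name) for name in ("soffice", "libreoffice")),
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Run it inside the activated virtual environment, otherwise the dedoc check looks at the wrong interpreter.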
If we don't want to do a full analysis, but just want to read a specific file, we can use one of the readers.
The list of readers can be viewed here
For example, to read a file in docx format, we can use the docx reader:
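# import path as in the dedoc source tree; it may differ between versions
from dedoc.readers.docx_reader.docx_reader import DocxReader

reader = DocxReader()
document, _ = reader.read("/home/padre/ТЗ_медики.docx")
document.lines  # document text
document.tables  # document tables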
To use the full functionality of dedoc, we need a manager.
The manager is built from two configs: config and manager_config.
config sets the general working parameters, while manager_config sets the tools the manager uses to process documents. Both configs are Python dicts, so you can pass your own instead of the standard ones (try not to remove keys from them).
The standard configs can simply be imported:
from dedoc.config import get_config
from dedoc.manager_config import get_manager_config
config = get_config()
manager_config = get_manager_config(config)
print(sorted(config.keys()))  # optional: check that the import succeeded
print(sorted(manager_config.keys()))
We get something like:
['api_port', 'import_path_init_api_args',...
['attachments_extractor', 'converter', ...
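Since both configs are plain dicts, a custom config can be derived from the standard one without dropping any keys. The sketch below uses a stand-in dict instead of get_config() so that it runs without dedoc; the helper name with_overrides and the port values are hypothetical.

```python
from copy import deepcopy


def with_overrides(base_config: dict, overrides: dict) -> dict:
    """Return a copy of base_config with some values replaced; no key is removed."""
    custom = deepcopy(base_config)  # leave the standard config untouched
    custom.update(overrides)
    return custom


# Stand-in for get_config(); the key name is taken from the output above, the values are made up.
standard_config = {"api_port": 1231}
custom_config = with_overrides(standard_config, {"api_port": 8080})

print(custom_config)
```

With real configs, the result of with_overrides(get_config(), {...}) would be passed wherever config is used.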
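Create the manager from the two configs:
from dedoc.manager.dedoc_manager import DedocManager  # import path may differ between versions
manager = DedocManager.from_config(config=config, manager_config=manager_config, version="1")
Parse a document with the manager: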
parsed_document = manager.parse_file(
file_path="path to docx",
parameters={"with_attachments": "True"},
)
The list of parameters can be found in the online documentation.
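When several files have to be parsed, the same call can be wrapped in a loop that keeps going after a failure. This is only a sketch: parse_many and FakeManager are hypothetical names, FakeManager stands in for a real manager so the code runs without dedoc, and a broad except is used because the exception types raised by parse_file are not assumed here.

```python
from typing import Any, Dict, Iterable, Tuple


def parse_many(manager: Any, paths: Iterable[str], parameters: Dict[str, str]) -> Tuple[dict, dict]:
    """Parse each path with manager.parse_file; return (parsed, failed) keyed by path."""
    parsed, failed = {}, {}
    for path in paths:
        try:
            # same keyword arguments as in the call above
            parsed[path] = manager.parse_file(file_path=path, parameters=dict(parameters))
        except Exception as error:  # concrete exception types are not assumed
            failed[path] = str(error)
    return parsed, failed


class FakeManager:
    """Hypothetical stand-in so the sketch runs without dedoc installed."""

    def parse_file(self, file_path: str, parameters: Dict[str, str]) -> str:
        if not file_path.endswith(".docx"):
            raise ValueError(f"unsupported file: {file_path}")
        return f"parsed {file_path}"


parsed, failed = parse_many(FakeManager(), ["a.docx", "b.xyz"], {"with_attachments": "True"})
print(parsed)
print(failed)
```

In real use, the manager created earlier would be passed instead of FakeManager().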
manager.parse_file returns a ParsedDocument with the following fields:
- metadata: DocumentMetadata (metadata of the document, which depends on its type; for example, the size of the document)
- content: Optional[DocumentContent] (the content of the document, we will write more about it; for nested documents it can be None)
- version: Optional[str] (the version we passed to the manager)
- warnings: List[str] (list of problems encountered during processing)
- attachments: Optional[List[ParsedDocument]] (list of documents attached to this document)
The structure of the parsed document is equivalent to the structure of the document described in the online documentation.
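Because attachments are themselves parsed documents, the result is a tree, and per-document fields such as warnings can be gathered recursively. The sketch below uses a hypothetical stand-in class with only the warnings and attachments fields, and made-up warning texts, so it runs without dedoc.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ParsedDocumentStub:
    """Hypothetical stand-in with only the fields used below."""
    warnings: List[str] = field(default_factory=list)
    attachments: Optional[List["ParsedDocumentStub"]] = None


def collect_warnings(document: ParsedDocumentStub, depth: int = 0) -> List[Tuple[int, str]]:
    """Gather (nesting depth, warning) pairs from a document and all its attachments."""
    found = [(depth, warning) for warning in document.warnings]
    for attachment in document.attachments or []:  # attachments may be None
        found.extend(collect_warnings(attachment, depth + 1))
    return found


inner = ParsedDocumentStub(warnings=["table skipped"])
outer = ParsedDocumentStub(warnings=["unknown font"], attachments=[inner])
print(collect_warnings(outer))  # [(0, 'unknown font'), (1, 'table skipped')]
```

The same traversal should work on a real result of manager.parse_file, since it only reads the warnings and attachments fields listed above.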