Releases · microsoft/markitdown

New Features and Formats

Renamed mlm_client and mlm_model arguments to llm_client and llm_model, and added appropriate deprecation warnings.
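A rename like this is usually handled with a small compatibility shim that still accepts the old keyword arguments, warns, and forwards the values to the new names. The sketch below shows that general pattern only. It is not the actual markitdown implementation; the `Converter` class and the `_resolve_renamed` helper are hypothetical names chosen for illustration.

```python
import warnings


def _resolve_renamed(new_value, old_value, new_name, old_name):
    """Return the value to use for an argument that was renamed (hypothetical helper)."""
    if old_value is None:
        return new_value
    if new_value is not None:
        # Both spellings were supplied; refuse to guess which one is intended.
        raise ValueError(f"Pass '{new_name}' only; '{old_name}' is deprecated.")
    warnings.warn(
        f"'{old_name}' is deprecated; use '{new_name}' instead.",
        DeprecationWarning,
        stacklevel=3,  # attribute the warning to the code that called __init__
    )
    return old_value


class Converter:
    """Hypothetical stand-in for a class whose arguments were renamed."""

    def __init__(self, llm_client=None, llm_model=None, mlm_client=None, mlm_model=None):
        # Old names keep working, but each use emits a DeprecationWarning.
        self.llm_client = _resolve_renamed(llm_client, mlm_client, "llm_client", "mlm_client")
        self.llm_model = _resolve_renamed(llm_model, mlm_model, "llm_model", "mlm_model")
```

Callers who still pass `mlm_client` or `mlm_model` would keep working under this pattern and see a warning pointing at their own call site. Passing both the old and the new spelling raises an error instead of silently picking one.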

See:

Remove invalid classifiers by @simonw in #10
Add installation instructions from haesleinhuepf:patch-1 by @gagb in #27
Update README.md by @gagb in #28
Improve the readme with contributing guidelines by @gagb in #7
Add installation instructions by @haesleinhuepf in #24
Update README.md by @pawarbi in #26
Update README.md by @gagb in #29
CLI usage instructions by @simonw in #11
Fix character decoding issues with text-like files by @brc-dd in #19
Catching pydub's warning of ffmpeg or avconv missing by @SH4DOW4RE in #39
Exclude test files from language statistics using linguist-vendored by @Y-Kim-64 in #44
Support specifying YouTube transcript language by @narumiruna in #50
Add passing style_map kwarg to Mammoth when converting docx to allow keeping comments by @VillePuuska in #38
Fix: pass the kwargs to _convert method when converting an url file by @Soulter in #48
Added Dockerfile by @madduci in #60
fix issue #65 by @DIMAX99 in #67
Cybernobie/main by @gagb in #75
Ensure hatch is installed before running tests by @cybernobie in #63
Kevinclb/main by @gagb in #77
feature: add argument parsing for cli tool capability by @kevinclb in #46
Added llm tests to the local test set. by @afourney in #100

Full Changelog: v0.0.1a2...v0.0.1a3

Initial Release of markitdown

The MarkItDown library is a utility for converting various file formats to Markdown (e.g., for indexing or text analysis).

It presently supports:

PDF (.pdf)

PowerPoint (.pptx)

Word (.docx)

Excel (.xlsx)

Images (EXIF metadata, and OCR)

Audio (EXIF metadata, and speech transcription)

HTML (special handling of Wikipedia, etc.)

Various other text-based formats (csv, json, xml, etc.)

The API is simple:

from markitdown import MarkItDown

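# Create a converter instance.
markitdown = MarkItDown()
# Convert a local file; the result's text_content holds the Markdown output.
result = markitdown.convert("test.xlsx")
print(result.text_content)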
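Building on the snippet above, here is a minimal sketch of converting several files in one pass and saving each result as a `.md` file. It relies only on the `convert()` method and `text_content` attribute shown above. The `convert_all` helper and its parameters are hypothetical and not part of the library.

```python
from pathlib import Path


def convert_all(converter, paths, out_dir):
    """Convert each path and write the Markdown to <out_dir>/<stem>.md (hypothetical helper).

    `converter` is any object with a convert(path) method whose result exposes
    `text_content`, such as a MarkItDown instance.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in paths:
        result = converter.convert(str(path))
        # Inputs that share a stem (a.pdf, a.docx) would overwrite each other here.
        target = out_dir / (Path(path).stem + ".md")
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written
```

With the library installed, a call might look like `convert_all(MarkItDown(), ["test.xlsx"], "out")`. Passing the converter in as an argument keeps the helper easy to exercise with a stub object.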