Changelog

All notable changes to this project will be documented in this file. The format is based on Keep a Changelog.

[0.5.29] - [unreleased]

Development Changes

Add CONTRIBUTING.md (#428)

[0.5.28] — 2021-05-08

Added

Add --laparams flag to CLI. (#407)

Changed

Change .convert_csv(...) to order objects first by page number, rather than object type. (#407)
Change .convert_csv(...), .convert_json(...), and CLI so that, by default, they return all available object types, rather than those in a predefined default list. (#407)

Fixed

Fix .extract_text(...) so that it can accept generator objects as its main parameter. (#385) [h/t @alexreg]
Fix page-parsing so that LTAnno objects (which have no bounding-box coordinates) are not extracted. (Was only an issue when setting laparams.) (#388)
Fix Page.extract_table(...) so that it honors text tolerance settings (#415) [h/t @trifling]

[0.5.27] — 2021-02-28

Fixed

Fix regression (introduced in 0.5.26/b1849f4) in closing files opened by PDF.open
Reinstate access to higher-level layout objects (such as textboxhorizontal) when laparams is passed to pdfplumber.open(...). Had been removed in 0.5.24 via 1f87898. (#359 + #364)

Development Changes

Add a python setup.py build sdist test to main GitHub action. (#365)

[0.5.26] — 2021-02-10

Added

Add Page.close/__enter__/__exit__ methods, by generalizing that behavior through the Container class (b1849f4)

Changed

Change TableFinder to return tables in order of topmost-and-then-leftmost, rather than leftmost-and-then-topmost (#336)
Change Page.to_image()'s handling of alpha layer, to remove aliasing artifacts (#340) [h/t @arlyon]
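To illustrate the TableFinder ordering change above, here is a minimal sketch in plain Python. It does not use pdfplumber itself: the bounding boxes are hypothetical (x0, top, x1, bottom) tuples, and the two key functions are illustrative names, not part of the library.

```python
# Hypothetical table bounding boxes as (x0, top, x1, bottom) tuples.
# These are plain tuples for illustration, not pdfplumber Table objects.
bboxes = [
    (300, 50, 500, 200),  # top-right table
    (50, 400, 250, 600),  # lower-left table
    (50, 50, 250, 200),   # top-left table
]


def topmost_then_leftmost(bbox):
    """Sort key for the ordering described in 0.5.26: top first, then x0."""
    x0, top, _x1, _bottom = bbox
    return (top, x0)


def leftmost_then_topmost(bbox):
    """Sort key for the earlier ordering: x0 first, then top."""
    x0, top, _x1, _bottom = bbox
    return (x0, top)


new_order = sorted(bboxes, key=topmost_then_leftmost)
old_order = sorted(bboxes, key=leftmost_then_topmost)

print("topmost-then-leftmost:", new_order)
print("leftmost-then-topmost:", old_order)
```

With the topmost-first key, both tables in the top row come before the lower-left table. With the leftmost-first key, the lower-left table comes before the top-right one.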

Development Changes

Enforce psf/black and flake8 on tests/ (#327)

[0.5.25] — 2020-12-09

Added

Add new boolean argument strict_metadata (default False) to pdfplumber.open(...) method for handling metadata resolution failures (f2c510d)

Fixed

Fix metadata extraction to handle integer/floating-point values (cb32478) (#297)
Fix metadata extraction to handle nested metadata values (2d9415) (#316)
Explicitly load text as utf-8 in setup.py (7854328) (#304)
Fix pdfplumber.open(...) so that it does not close file objects passed to it (408605f) (#312)

[0.5.24] — 2020-10-20

Added

Add extra_attrs=[...] parameter to .extract_text(...) (c8b200e) (#28)
Add utils/page.dedupe_chars(...) (04fd56a + b132d45) (#71)

Changed

Change character attribute upright from int to bool (per original pdfminer.six representation) (1f87898)
Remove access and reference to Container.figures, given that they are not fundamental objects (8e74cb9)

Fixed

Decimalize "simple" explicit_horizontal_lines/explicit_vertical_lines descs passed to TableFinder methods (bc40779) (#290)

Development Changes

Refactor/simplify Page.process_objects (1f87898), utils.extract_words (c8b200e), and convert.serialize (a74d3bc)
Remove test_issues.py:test_pr_77 (917467a) and narrow test_ca_warn_report:test_objects (6233bbd) to speed up tests

[0.5.23] — 2020-08-15

Added

Add utils.resolve (non-recursive .resolve_all) (7a90630)
Add page.annots and page.hyperlinks, replacing non-functional page.annos, and mirroring pdfminer's language ("annot" vs. "anno"). (aa03961)
Add page/pdf.to_json and page/pdf.to_csv (cbc91c6)
Add relative=True/False parameter to .crop and .within_bbox; those methods also now raise exceptions for invalid and out-of-page bounding boxes. (047ad34) [h/t @samkit-jain]
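As a rough sketch of what a relative bounding box means for .crop and .within_bbox, the plain-Python helper below offsets a box by its parent's top-left corner and rejects invalid or out-of-bounds boxes. The function name, the (x0, top, x1, bottom) tuple layout, and the use of ValueError are assumptions for illustration; this is not pdfplumber's actual implementation.

```python
def to_absolute_bbox(bbox, parent_bbox, relative=False):
    """Return bbox in absolute coordinates, validating it against parent_bbox.

    Hypothetical helper: both boxes are (x0, top, x1, bottom) tuples.
    When relative=True, bbox is measured from the parent's top-left corner.
    """
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = parent_bbox

    if relative:
        # Shift the box by the parent's origin.
        x0, x1 = x0 + px0, x1 + px0
        top, bottom = top + ptop, bottom + ptop

    # An invalid box has zero or negative width or height.
    if x0 >= x1 or top >= bottom:
        raise ValueError(f"Invalid bounding box: {(x0, top, x1, bottom)}")

    # An out-of-page box extends beyond the parent on any side.
    if x0 < px0 or top < ptop or x1 > px1 or bottom > pbottom:
        raise ValueError(
            f"Bounding box {(x0, top, x1, bottom)} is not fully within "
            f"parent bounding box {parent_bbox}"
        )

    return (x0, top, x1, bottom)


parent = (100, 200, 400, 600)
absolute = to_absolute_bbox((10, 10, 50, 50), parent, relative=True)
print(absolute)
```

The same (10, 10, 50, 50) box passed with relative=False would be rejected here, because in absolute coordinates it lies outside the parent box.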

Changed

Remove pdfplumber.from_path and pdfplumber.load as deprecated; now pdfplumber.open is the canonical way to load a PDF. (00e789b)
Simplify the logic in "text" table-finding strategies; in edge cases, may result in changes to results. (d224202)
Drop support for Python 3.5 (baf1033)

Fixed

Fix .extract_words, which had been returning incorrect results when horizontal_ltr = False (d16aa13)
Fix utils.resize_object, which had been failing in various permutations (d16aa13)
Fix lines_strict table-finding strategy, which a typo had prevented from being usable (f0c9b85)
Fix utils.resolve_all to guard against two known sources of infinite recursion (cbc91c6)

Development Changes

Rename default branch to "stable," to clarify its purpose
Reformat code with psf/black (1258e09)
Add code linting via psf/black and flake8 (1258e09)
Switch from nosetests to pytest (1ac16dd)
Switch from pipenv to standard requirements.txt + python -m venv (48eaa51)
Add GitHub action for tests + codecov (b148fd1)
Add Makefile for building development virtual environment and running tests (4c69c58)
Add badges to README.md (9e42dc3)
Add Trove classifiers for Python versions to setup.py (6946e8d)
Add MANIFEST.in (eafc15c)
Add GitHub issue templates (c4156d6)
Remove pandas from dev requirements and tests (a5e7d7f)

[0.5.22] — 2020-07-18

Changed

Upgrade pdfminer.six requirement to ==20200517 (cddbff7) [h/t @youngquan]

Added

Add support for non_stroking_color attribute on char objects (0254da3) [h/t @idan-david]

[0.5.21] — 2020-05-27

Fixed

Fix Page.extract_table(...) to return None instead of crashing when no table is found (d64afa8) [h/t @stucka]

[0.5.20] — 2020-04-29

Fixed

Fix .get_page_image to prefer paths over streams, when possible (ab957de) [h/t @ubmarco]
Local-fix pdfminer.six's .resolve_all to handle tuples and simplify (85f422d)

Changed

Remove support for Python 2 and Python <3.3

[0.5.19] — 2020-04-16

Changed

Add utils.decimalize performance improvement (830d117) [h/t @ubmarco]

Fixed

Fix un-referenced method when using "text" table-finding strategy (2a0c4a2)
Add missing object type rect_edge to obj_to_edges() (0edc6bf)

[0.5.18] — 2020-04-01

Changed

Allow rect and curve objects also to be passed to "explicit_..._lines" setting when table-finding. (And disallow other types of dicts to be passed.)

Fixed

Fix utils.extract_text bug introduced in prior version

[0.5.17] — 2020-04-01

Fixed

Fix and simplify obj-in-bbox logic (see commit 25672961)
Improve/fix the way utils.extract_text handles vertical text (see commit 8a5d858b) [h/t @dwalton76]
Have Page.to_image use bytes stream instead of file path (Issue #124 / PR #179) [h/t @cheungpat]
Fix issue #176, in which Page.extract_tables did not pass kwargs to Table.extract [h/t @jsfenfen]

[0.5.16] — 2020-01-12

Fixed

Prevent custom LAParams from raising exception (Issue #168 / PR #169) [h/t @frascuchon]
Add six as explicit dependency (for now)

[0.5.15] — 2020-01-05

Changed

Upgrade pdfminer.six requirement to ==20200104
Upgrade pillow requirement >=7.0.0
Remove Python 2.7 and 3.4 from tox tests

[0.5.14] — 2019-10-06

Fixed

Fix sorting bug in page.extract_table()
Fix support for password-protected PDFs (PR #138)

[0.5.13] — 2019-08-29

Fixed

Fix PDF object resolution for rotation (PR #136)

[0.5.12] — 2019-04-14

Added

cdecimal support for Python 2
Support for password-protected PDFs

[0.5.11] — 2018-11-13

Added

Caching for .decimalize() method

Changed

Upgrade to pdfminer.six==20181108
Make whitespace checking more robust (PR #88)

Fixed

Fix issue #75 (.to_image() custom arguments)
Fix issue raised in PR #77 (PDFObjRef resolution), and general class of problems
Fix issue #90, and general class of problems, by explicitly typecasting each kind of PDF Object

[0.5.10] — 2018-08-03

Fixed

Fix bug in which, when calling get_page_image(...), the alpha channel could make the whole page black out.

[0.5.9] — 2018-07-10

Fixed

Fix issue #67, in which bool-type metadata were handled incorrectly

[0.5.8] — 2018-03-06

Fixed

Fix issue #53, in which non-decimalize-able (non_)stroking_color properties were raising errors.

[0.5.7] — 2018-01-20

Added

.travis.yml, but failing on .to_image()

Changed

Move from defunct pycrypto to pycryptodome
Update pdfminer.six to 20170720

[0.5.6] — 2017-11-21

Fixed

Fix issue #41, in which PDF-object-referenced cropboxes/mediaboxes weren't being fully resolved.

[0.5.5] — 2017-05-10

Added

Access to __version__ from main namespace

Fixed

Fix issue #33, by checking decode_text's argument type

[0.5.4] — 2017-04-27

Fixed

Pin pdfminer.six to version 20151013 (for now), fixing incompatibility

[0.5.3] — 2017-02-27

Fixed

Allow import pdfplumber even if ImageMagick not installed.

[0.5.2] — 2017-02-27

Added

Access to curve points. (E.g., page.curves[0]["points"].)
Ability for .draw_line to draw curve points.

Changed

Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
Internally, made utils.decimalize a bit more robust; now throws errors on non-decimalizable items.
Now explicitly ignoring some (obscure) pdfminer object attributes.
Changed raw input for .draw_line from a bounding box to ((x, y), (x, y)), for consistency with curve["points"] and with Pillow's underlying method.

Fixed

Fixed typo bug when .rect_edges is called before .edges

[0.5.1] — 2017-02-26

Added

Quick-draw PageImage methods: .draw_vline, .draw_vlines, .draw_hline, and .draw_hlines.
Boolean parameter keep_blank_chars for .extract_words(...) and TableFinder settings.

Changed

Increased default text_tolerance and intersection_tolerance TableFinder values from 1 to 3.

Fixed

Properly handle conversion of PDFs with transparency to pillow images.
Properly handle pandas DataFrames as inputs to multi-draw commands (e.g., PageImage.draw_rects(...)).

[0.5.0] - 2017-02-25

Added

Visual debugging features, via Page.to_image(...) and PageImage. (Introduces wand and pillow as package requirements.)
More powerful options for extracting data from tables. See changes below.

Changed

Entirely overhaul the table-extraction methods. Now based on Anssi Nurminen's master's thesis.
Disentangle .crop from .intersects_bbox and .within_bbox.
Change default x_tolerance and y_tolerance for word extraction from 5 to 3

Fixed

Fix bug stemming from non-decimalized page heights. [h/t @jsfenfen]

[0.4.6] - 2017-01-26

Added

Provide access to Page.page_number

Changed

Use .page_number instead of .page_id as primary identifier. [h/t @jsfenfen]
Change default x_tolerance and y_tolerance for word extraction from 0 to 5

Fixed

Provide proper support for rotated pages

[0.4.5] - 2016-12-09

Fixed

Fix bug stemming from when metadata includes a PostScript literal. [h/t @boblannon]

[0.4.4] - Mistakenly skipped

Whoops.

[0.4.3] - 2016-04-12

Changed

When extracting table cells, use chars' midpoints instead of top-points.

Fixed

Fix find_gutters — should ignore " " chars
