Extracting (text) color as rgb #675

petermr · 2022-06-29T08:07:52Z


I am extracting text characters using pdfplumber and wish to have the color as rgb (as it's supported by many tools including CSS). Currently I get a 4-tuple for non-stroking-color which I can't find documented ~~and I guess is CMKY (sic) . CMKY is, I think, CMYK with K and Y reversed.~~

Is there a direct way of getting RGB instead?
If not, is there a Python library for converting to RGB? (Note that this isn't trivial, as there seem to be complex discussions about different sorts of CMYK, printer inks, etc.)

(When I use PDFBox (Java) for the same file I can extract RGB).

My code:

        import pdfplumber

        # PMC1421 is the path to the PDF file linked below
        with pdfplumber.open(PMC1421) as pdf:
            first_page = pdf.pages[0]

            assert first_page.chars[0] == {'matrix': (9, 0, 0, 9, 319.74, 797.4203),
                                           'fontname': 'KAAHHD+Calibri,Italic', 'adv': 0.319,
                                           'upright': True, 'x0': 319.74, 'y0': 795.1703, 'x1': 322.611, 'y1': 804.1703,
                                           'width': 2.870999999999981, 'height': 9.0, 'size': 9.0,
                                           'object_type': 'char', 'page_number': 1,
                                           'text': 'J', 'stroking_color': None,
                                           'non_stroking_color': (0.86667, 0.26667, 1, 0.15294),
                                           'top': 37.8297, 'bottom': 46.8297, 'doctop': 37.8297}

The non-stroking color appears to be a light green/gray (the character "J" at the start of the top line).
The file (PMC1421) is available at https://github.com/petermr/pyami/blob/pmr4/py4ami/resources/projects/liion4/PMC4391421/fulltext.pdf

More generally, why is this color system being used? Is it defined by the author/software or converted by PDFMiner/pdfplumber?

UPDATE:
With a different document I can extract colour = (1, 0, 0), which corresponds to bright red on the screen, i.e. I guess that RGB (scaled to 1) is probably being used. So the question morphs to:
"How can I find out what colour model is used for characters? And if not RGB, how can I convert it?"

Thanks

jsvine · 2022-06-29T21:27:21Z

Maintainer

Great set of questions and observations on a tricky topic, @petermr! This has led me down an enjoyable rabbit hole, though I'm not 100% confident in my findings. My current understanding:

The PDF specification allows documents to use a wide range of color spaces/systems, as described in section 4.5 of the official reference.
Those color spaces are defined explicitly in the document, and it appears that pdfminer.six attempts to pick up that information. (See here for the parsing, and here for the internal interface.)
pdfminer.six exposes the non-stroking color space as the .ncs attribute of LTChar objects (which are ultimately what we convert in pdfplumber to "char" objects). Currently, pdfplumber does not pull in that information, but doing so would be relatively trivial.
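To check what pdfminer.six reports for a given file, something like the following rough sketch could tally the color-space names found on the `.ncs` attribute of each character. The helper names `iter_chars` and `count_color_spaces` are made up for illustration, and the sketch assumes that `.ncs` carries a `name` attribute (which is how pdfminer.six's color-space objects appear to be defined); it falls back to `None` if not.

```python
from collections import Counter


def iter_chars(obj):
    """Recursively yield layout objects that carry an `ncs` attribute.

    Duck-typed on purpose: in pdfminer.six those objects are LTChar
    instances, nested inside text lines, text boxes and pages.
    """
    if isinstance(obj, (str, bytes)):
        return  # never descend into plain strings
    if hasattr(obj, "ncs"):
        yield obj
        return
    try:
        children = iter(obj)
    except TypeError:
        return  # leaf object without children (e.g. a line or rect)
    for child in children:
        yield from iter_chars(child)


def count_color_spaces(pdf_path):
    """Count non-stroking color-space names over all chars in a PDF.

    Hypothetical helper; requires pdfminer.six to be installed.
    """
    from pdfminer.high_level import extract_pages

    counts = Counter()
    for page in extract_pages(pdf_path):
        for char in iter_chars(page):
            # Assumption: the color-space object has a `name` attribute.
            counts[getattr(char.ncs, "name", None)] += 1
    return counts
```

Calling `count_color_spaces("fulltext.pdf")` on a local copy of the file should return a `Counter` keyed by color-space name, which makes it easy to see whether everything is being reported as one space.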

However... when I examine the .ncs attributes of the characters in your PDF (as well as with another colorful PDF I have on hand), all color spaces were listed as DeviceGray, which would be incorrect. I ran a similar test with pdfminer.six directly (to try to rule out pdfplumber bugs) across the files in that repo's samples/ directory (to get a broader view). There, too, most of the listed color spaces were DeviceGray ... but not all. At first, I wasn't quite sure what was happening. But after digging through the pdfminer.six code and reading the PDF spec more closely, I have a hunch:

This seems to be a key paragraph in the spec: "Color values are interpreted according to the current color space, another parameter of the graphics state. A PDF content stream first selects a color space by invoking the CS operator (for the stroking color) or the cs operator (for the non-stroking color). It then selects color values within that color space with the SC operator (stroking) or the sc operator (nonstroking). There are also convenience operators—G, g, RG, rg, K, and k—that select both a color space and a color value within it in a single step." [Emphasis added.]
As far as I can tell, pdfminer.six handles the CS operators correctly. But when it sees those "convenience operators", it only changes the color value, not the color space: https://github.com/pdfminer/pdfminer.six/blob/43c8fc8557528463c99598049b7005ae96ab8084/pdfminer/pdfinterp.py#L652-L690
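To make the hunch concrete, here is a toy model of what the spec paragraph describes: each convenience operator sets both the color space and the color value. This is not pdfminer.six code; the state dictionary, `CONVENIENCE_OPERATORS`, and `apply_color_operator` are hypothetical names used only for illustration.

```python
# Operator -> (which color it affects, implied color space, operand count),
# following the spec's description of G, g, RG, rg, K and k.
CONVENIENCE_OPERATORS = {
    "G": ("stroking", "DeviceGray", 1),
    "g": ("non_stroking", "DeviceGray", 1),
    "RG": ("stroking", "DeviceRGB", 3),
    "rg": ("non_stroking", "DeviceRGB", 3),
    "K": ("stroking", "DeviceCMYK", 4),
    "k": ("non_stroking", "DeviceCMYK", 4),
}

# The spec's initial graphics state: DeviceGray, black, for both colors.
INITIAL_STATE = {
    "stroking_space": "DeviceGray",
    "stroking_color": (0,),
    "non_stroking_space": "DeviceGray",
    "non_stroking_color": (0,),
}


def apply_color_operator(state, operator, operands):
    """Return a new state with BOTH the space and the value updated."""
    target, space, n_operands = CONVENIENCE_OPERATORS[operator]
    if len(operands) != n_operands:
        raise ValueError(f"{operator} expects {n_operands} operand(s)")
    new_state = dict(state)
    new_state[f"{target}_space"] = space
    new_state[f"{target}_color"] = tuple(operands)
    return new_state
```

An interpreter that updated only the `*_color` entry would leave the space at its initial DeviceGray, which would be consistent with the DeviceGray results described above.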

I've filed an issue in pdfminer.six with my hunch: pdfminer/pdfminer.six#779

If this ends up getting fixed there, I'll aim to incorporate the color space information into pdfplumber.

In the meantime, with many PDFs you can probably assume that an individual integer or float represents a monochrome value (0=black -> 1=white), a 3-value color is RGB, and a 4-value color is CMYK. Not foolproof, but usually a decent first guess.
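That first guess could be sketched roughly as below. The function names `guess_rgb` and `to_css_hex` are invented for this example, and the CMYK branch uses the naive device formula (it ignores ICC profiles and ink behavior), so treat the result as an approximation rather than a faithful conversion.

```python
def guess_rgb(color):
    """Guess an RGB triple (floats in 0..1) from a pdfplumber color value.

    Heuristic only: 1 component = gray, 3 = RGB, 4 = CMYK.
    """
    if color is None:
        return None
    if isinstance(color, (int, float)):
        color = (color,)  # a bare number is treated as a gray level
    values = tuple(float(v) for v in color)
    if len(values) == 1:
        gray = values[0]
        return (gray, gray, gray)
    if len(values) == 3:
        return values
    if len(values) == 4:
        c, m, y, k = values
        # Naive CMYK -> RGB; real conversions depend on color profiles.
        return ((1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k))
    raise ValueError(f"cannot guess a color space for {len(values)} components")


def to_css_hex(rgb):
    """Format an RGB triple (0..1 floats) as a CSS hex string."""
    clamped = (max(0.0, min(1.0, v)) for v in rgb)
    return "#" + "".join(f"{round(v * 255):02x}" for v in clamped)
```

For example, `to_css_hex(guess_rgb(char["non_stroking_color"]))` would give a CSS-ready string for a char dictionary like the one in the question; under the naive formula, the 4-tuple there comes out as a green.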


petermr Jul 2, 2022
Author

Thanks!

> Great set of questions and observations on a tricky topic, @petermr! This has led me down an enjoyable rabbit hole, though I'm not 100% confident in my findings.

Yes, it can be fun (as long as it's not under time pressure!). My main experience has been with (Java) PDFBox and I believe that that code manages color-spaces. It protects the end-consumer from the low-level PDF codes that you mention.

At the moment I am happy to "guess" the color space. I can get the images by cropping the page display and compare so that gives a check if necessary. PDF is fiendishly complex. I also understand the challenge of dealing with a cascade of not-very-well-documented legacy code.

I want to extract images automatically and am making some progress, but will file a separate discussion.
