Allow defaults to be folded into YAML metadata #5870

Open
jgm opened this issue Nov 3, 2019 · 51 comments

Comments

@jgm
Owner

jgm commented Nov 3, 2019

The proposal is to provide a way to do pandoc --defaults source.md or pandoc -d source.md, and have pandoc read its default options from a particular part of the markdown document, then process the document accordingly. This would handle #4627 .

One approach would be to put the defaults in a defaults_: field in YAML metadata at the start of the file. Pandoc does not try to parse fields ending in _, so this would ensure that the content was only used for defaults.

This would look like:

---
defaults_:
  toc: true
  standalone: true
  variables:
    documentclass: book
  output-file: doc.pdf
title: My document
author: Me, etc.
---

Another possible approach would be to use a YAML anchor to mark out the defaults. This would look like:

---
&defaults
standalone: true
columns: 78
...
---
title: My document
author: Me, etc.
...

On this approach the defaults could be flush left, but you'd need a separate YAML block for them, with the keyword &defaults.

On both approaches it would be desirable to set the document as input file, unless input-files is specifically specified. This feature could be triggered by the presence of whichever feature we used (above) -- either the special tag or the defaults_ key. So, the logic for --defaults would be:

  1. see if the file starts with a YAML block with defaults_ key (or with a tagged block).
  2. if so, parse this part as defaults, and set input file to the file unless it's specified otherwise.
  3. if not, parse the whole thing as defaults, as we do now.
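
The three steps above can be sketched as follows. This is only an illustration, not pandoc's implementation: the function name `resolve_defaults` is hypothetical, and the leading YAML block is assumed to have been parsed into a dict already.

```python
from typing import Optional


def resolve_defaults(path: str, front_matter: Optional[dict]) -> dict:
    """Sketch of the proposed --defaults logic (hypothetical helper).

    front_matter is the already-parsed leading YAML block of the file,
    or None if the file does not start with one.
    """
    # Step 1: does the leading YAML block carry a defaults_ key?
    if front_matter is not None and "defaults_" in front_matter:
        # Step 2: use only that part as defaults, and make the document
        # its own input unless input-files is given explicitly.
        defaults = dict(front_matter["defaults_"])
        defaults.setdefault("input-files", [path])
        return defaults
    # Step 3: otherwise treat the whole file as a defaults file, as now.
    return dict(front_matter or {})
```

With the first example header, `resolve_defaults("source.md", ...)` would yield the `defaults_` fields plus `input-files: [source.md]`, while `title` and `author` stay behind as ordinary metadata.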
@mb21
Collaborator

mb21 commented Nov 6, 2019

So you would have to do pandoc -d source.md source.md ? or just pandoc source.md would be enough?

Also copying my comment from #4627:

I don't think a lot of people know about YAML anchors and it will unnecessarily confuse them. But I like having the options as subfields (e.g. under defaults_:). Maybe defaults_ is not the most descriptive name though; what about something like options_ or output_?

@jgm
Owner Author

jgm commented Nov 6, 2019

With this proposal you'd have to do pandoc -d source.md.
If input-files isn't set in source.md and there is content beyond the YAML document at the beginning, pandoc could be trained to treat source.md itself as the input file. This might be a bit too fancy, but it would handle a common request.

@iandol
Contributor

iandol commented Nov 6, 2019

I also think defaults_ is preferable to the anchor (it is consistent with pandocomatic that uses pandocomatic_ for parsing).

On issue #5790 there was also quite a bit of discussion about a better name than defaults. John suggested using and with for the command-line equivalents, other options were project, action, recipe. One strong benefit of defaults is that -d is still available, which makes everything [--defaults | -d | defaults_] all consistent. I think -u for using_ and -w for with_, and -O for options_ are also all available...

@brainchild0

brainchild0 commented Nov 7, 2019

I feel that a principal benefit of #5790 was to enforce a clearer separation of concerns, the lack of which had previously created a variety of issues, including those discussed in #4627. I am hesitant to support moving in any direction that appears to reverse these gains.

Ideally, an input file contains only textual content and related metadata.

@iandol
Contributor

iandol commented Nov 7, 2019

A unified file which specifies how its content is to be parsed without hierarchies of accessory configuration files is useful IMO. As an example, the wonderful writing App Scrivener supports Pandoc output, but compiles its metadata and contents into a single .md file. Many [most] users who are not programmers or command-line specialists find it really challenging to conceptualise technical solutions that "elegantly" separate semantic parts[1]. They just want to be able to take their work and have an elegant output.

Nothing is stopping more technical purists from separating default settings, metadata and documents in this proposal; it just enables a workflow that would benefit casual users.


[1] I provide a premade Pandoc compiling workflow for Scrivener, using pandocomatic to deal with the configuration (using a separate pandocomatic.yaml config, something like defaults, but where different templates are collected in one file), and this still really confuses lots of people that could nevertheless benefit from all that Pandoc can do. Several users I've tried to help use Pandoc in this way just gave up, and many more got put off before they even started. Having a defaults blended into their compiled file would really simplify the workflow, and benefit this class of user.

@brainchild0

brainchild0 commented Nov 7, 2019

Having only heard of Scrivener in passing, am I correct that a core design feature is separating the physical layout features of a document from the text, to help authors focus on the development of the text? And if so, would such a model not accord more directly with the "semantic separation"?

@iandol
Contributor

iandol commented Nov 7, 2019

Yes, Scrivener is a GUI-based app that encourages separation of content and presentation. But we have had decades of people using things like Word or Powerpoint where they just press buttons to make stuff look how they want while they create. The styles system in Word is really well developed by now, and still all Word docs I have to deal with almost exclusively ignore semantic styles and use inline styling. Many users are drawn to the organisational features of Scrivener but they have a really hard time trying to conceptualise and adjust to the idea of compiling content into output formats. Scrivener can hide a lot of this, but nevertheless requires a lot of adjustment from users. This does make them more likely to then take a step towards something like Pandoc to really enhance the transformation of their work. But overall most users are only willing to step so far…

YAML is structured using whitespace to identify that structure; I think this is intuitive or at least intelligible for users. Utilising the structured nature of YAML directly was why I preferred using a named defaults_ section. I think your objection is that you would prefer to have an entirely separate keyed metadata block, to visually reinforce that these are separate domains? My personal feeling is that this will not enlighten someone who will not understand (or even care) that reference-links and link-citations are different classes of things any more than having them in separate sections of the metadata block. &keys are not something most users will use in the metadata otherwise, and it is somewhat more visually "noisy" having separate metadata blocks.

Pandoc is a swiss-army knife, and that sometimes means using the screwdriver to open jars 😃

@brainchild0

brainchild0 commented Nov 7, 2019

@iandol, largely I think your premises and reasoning are sound, but I have some difficulty identifying precisely where I might diverge from your conclusions because of confusion over which particular group of users are or should be considered in each of a variety of cases you outline. For example, that group that has a really hard time trying to conceptualise and adjust to the idea of compiling content into output formats would seem to be disjoint from that group who uses Pandoc. I suspect you would agree, but I may not fully follow the train of ideas you have given.

I think it is good to consider the various ways that Pandoc might be integrated with different tools, invoked using different strategies, and understood by different groups of users. But for each case one needs to consider the actual combinations along these axes that have representation.

@iandol
Contributor

iandol commented Nov 7, 2019

confusion over which particular group of users are or should be considered in each of a variety of cases

Yes, this is always a conundrum that makes the clear waters of any design turbid. But I suppose precisely for this I prefer pluralism, and think the flexibility of being able to use one file or several will appeal to a broader range of use-cases than strictly enforcing the separation of pandoc settings from metadata, variables and content. John and others probably know the numbers better than I, but the fact that there are numerous issues and forum posts asking for the ability to put settings into some form of YAML section suggests that there are pure Pandoc users who would prefer this option for dealing with their workflow. Scrivener users were one example class that would benefit, but previous requests for something like this point to there being different users to whom this would appeal.

To invert the question, apart from conceptual "cleanliness"[1], what are the problems with allowing documents that contain a defaults_ or even ---&defaults... section?


[1] which is relative, because as I mentioned above, it is not always clear what the distinction is between settings and at least some metadata options. A real "clean" solution would separate settings, metadata, variables and content into their own accessory and main documents.

@brainchild0

brainchild0 commented Nov 8, 2019

My preliminary ranking of four possibilities in order of ascending preference, looks as shown below. I am listing the negative features of each idea. I am not rejecting that they may have merits, but I’ll leave the task of listing them to enthusiasts. I little doubt they will appear, as I rarely pick the popular side.

  1. Use a defaults_ key.

    Introduces unnecessary problems, right down to shifting the indent level of pasted blocks, and promotes a slack attitude toward modular design and conceptual separation. Complicates code development and structure by processing the same YAML stream, and extracting separate components from it, in two distinctive stages of processing pipeline. Carries greater possibility that trivial future code changes will complicate subsequent maintenance.

  2. Use a &defaults anchor.

    May confuse many users, even if familiar with other Pandoc metaphors and processes, because of unfamiliarity with anchor syntax. Also, may be slightly a misuse of the anchor construct, which could lead to practical issues, but YAML experts would have to offer a more certain appraisal.

  3. Use a plain sequence of streams, the first indicating, using fields in the table, that the source comprises the remainder of the same file.

    Like all of the above, unnecessarily integrates what data with how data. Presents obstacles against exchange of the source file with other applications and users. Prevents thinking in a way that promotes full benefit of the tools and methods.

  4. Do not add this feature.

    My preference for now. To be clear, I am more apprehensive about adding it before the surrounding questions of the defaults file in general have been assessed, than I am closed to it in the future.

@brainchild0

On issue #5790 there was also quite a bit of discussion about a better name than defaults. John suggested using and with for the command-line equivalents, other options were project, action, recipe. One strong benefit of defaults is that -d is still available, which makes everything [--defaults | -d | defaults_] all consistent. I think -u for using_ and -w for with_, and -O for options_ are also all available...

Was options considered yet? It seems conspicuously straightforward.

@brainchild0

Possibly I've missed an important point, but trying to think through the scenarios and mutations, it seems that we could have, at least in principle, some very roundabout flow of data the way I understand that the feature is currently described.

Taking what seems to be currently the most popular variation, and considering the following header:

---
defaults_:
  variables:
    title: This gets clobbered
    subtitle: The real subtitle
title: The real title
---

Then the result is a document with title The real title, and subtitle The real subtitle, correct?

@jgm
Owner Author

jgm commented Nov 8, 2019

Was options considered yet? It seems conspicuously straightforward.

It's less fitting than defaults, I think. For one thing, the file won't exhaust the options, as additional options may be specified on the command line, and these may even override options specified in the defaults file. For another thing, some of the things in the defaults file correspond to command-line arguments, not options (namely, input files).

@brainchild0

brainchild0 commented Nov 8, 2019

Was options considered yet? It seems conspicuously straightforward.

It's less fitting than defaults, I think. For one thing, the file won't exhaust the options, as additional options may be specified on the command line, and these may even override options specified in the defaults file. For another thing, some of the things in the defaults file correspond to command-line arguments, not options (namely, input files).

So you consider "options" without qualification to indicate command-line options? I suppose some might find it misleading. For me it's not a problem. I don't think of options in file as either more, less, or equally comprehensive necessarily than options from the command line. I think of options as the top-level concept, of which those from the command line, or any other source, are just special cases.

@iandol
Contributor

iandol commented Nov 9, 2019

Out of curiosity, when you write that the differentiation among the sets of data fields is "not always obvious", do you refer principally to the ambiguity a developer might face deciding where a field should be placed, or to the challenge a user might face deciding where a field has been placed?

My perspective is that of a general user.

Then the result is a document with title The real title, and subtitle The real subtitle, correct?

I personally hope so, but I think John is working on standardising how all these potential sources 'cascade' together in what order...

@brainchild0

Then the result is a document with title The real title, and subtitle The real subtitle, correct?

I personally hope so, but I think John is working on standardising how all these potential sources 'cascade' together in what order...

Are you referring to the logic that allowed handling of multiple defaults files on the same command line, or something different I might not know about?

@iandol
Contributor

iandol commented Nov 9, 2019

Yes.

@brainchild0

brainchild0 commented Nov 9, 2019

That logic would be separate from how metadata from multiple sources is merged. I think this process has always been a flat, key-wise update. Or do I misunderstand?

@iandol
Contributor

iandol commented Nov 9, 2019

John is better placed to answer how all of this will fit together than I 😊 …

@jgm
Owner Author

jgm commented Nov 11, 2019

Then the result is a document with title The real title, and subtitle The real subtitle, correct?

No, because setting anything in variables directly sets the relevant template variable, which will clobber any default value the variable acquires from metadata.
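
To make the precedence concrete, here is a minimal sketch of how the header from the question would resolve under this rule. The helper `template_context` is hypothetical and only models "metadata supplies default values, explicit variables win"; it is not pandoc's code.

```python
def template_context(metadata: dict, variables: dict) -> dict:
    """Hypothetical model: metadata gives defaults, variables override them."""
    context = dict(metadata)
    context.update(variables)  # anything set in variables clobbers metadata
    return context


# The header from the question, written out as an already-parsed mapping.
front_matter = {
    "defaults_": {
        "variables": {
            "title": "This gets clobbered",
            "subtitle": "The real subtitle",
        }
    },
    "title": "The real title",
}

# Fields ending in "_" are not treated as document metadata.
metadata = {k: v for k, v in front_matter.items() if not k.endswith("_")}
variables = front_matter["defaults_"]["variables"]

context = template_context(metadata, variables)
print(context["title"])     # This gets clobbered
print(context["subtitle"])  # The real subtitle
```

So, despite the naming in the example, the title set under variables is the one the template sees.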

@brainchild0

brainchild0 commented Apr 9, 2020

As discussed, the idea on offer would seem in a large sense to defeat the earlier objective of separating the two kinds of data.

However, one context in which combining them might be uniquely useful is programmatic invocation of the application, with a dynamically-generated source, as might be desired for example by a MarkDown editor that supports publishing and printing through Pandoc functionality.

For such a purpose, however, I would be far less inclined to advocate folding the defaults fields into the metadata header than adding support directly in the defaults file that could instruct the application to acquire the document source from the data in the same stream that follows the YAML block:

---
reader: markdown
writer: html
output-file: document.html
input-method: inline
---

# Chapter 1

@the-solipsist
Contributor

One other option to consider is to be able to indicate defaults files' locations in the in-file YAML:

---
defaults-files: 
  - project.yaml # document-specific defaults
  - pdf.yaml # output format defaults
title: Document Title
bibliography: refs.json
...

There are text editors that support exporting via Pandoc. However, currently they can't really use defaults files, since defaults files have to be indicated outside of the document. Being able to specify defaults file paths would greatly help a user of a text editor like Zettlr to be able to use defaults files using its in-built Pandoc-based export functionality.

Does this make any sense as an alternative/additional approach? What would be the drawbacks?

@tarleb
Collaborator

tarleb commented Oct 14, 2020

I'm worried about the security implications of a feature like this. It is currently very safe to run pandoc on any arbitrary input file -- the worst possible attack right now is denial of service by crafting input such that pandoc takes a long time to parse. DoS like that is comparatively easy to defend against by setting resource limits for the process. However, allowing methods to overwrite command line parameters could give a document author a lot of control over input files, output location, filters, pdf engine, etc. YAML can be hidden anywhere in a long Markdown document, so it can become difficult to find when skimming the document.

If we do this, it might be wise to define an explicit allow list for command line options which are deemed safe.
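
As a rough illustration of the allow-list idea, embedded defaults could be split into accepted and rejected keys before use. The set `SAFE_OPTIONS` and the function name are hypothetical; which options would actually count as safe is exactly the open question.

```python
from typing import Dict, List, Tuple

# Hypothetical allow list, for illustration only.
SAFE_OPTIONS = {"toc", "standalone", "columns"}


def filter_embedded_defaults(embedded: Dict[str, object]) -> Tuple[Dict[str, object], List[str]]:
    """Keep only allow-listed options; report everything else as rejected."""
    accepted = {key: value for key, value in embedded.items() if key in SAFE_OPTIONS}
    rejected = sorted(key for key in embedded if key not in SAFE_OPTIONS)
    return accepted, rejected
```

A caller could then warn about (or fail on) the rejected keys, so that something like `filters` or `output-file` hidden in a document never takes effect silently.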

@brainchild0

brainchild0 commented Oct 15, 2020

If we do this, it might be wise to define an explicit allow list for command line options which are deemed safe.

Another option is to disable the feature except if some option is given on the command line. (Following this design, the application probably should fail if the option is not given and the input seems to expect it.)

Partitioning the options into two categories, according to which are safe for use in input, may create challenges for maintenance. It seems to present the threat of features creeping across the boundary, due to carelessness or poor decisions in code revisions. The potential for unexpected interaction among options presents an even more serious challenge to maintaining a list of safe features.

Ultimately, the difference between input files with defaults values and defaults files with input content is in many ways a small distinction, but may have significance in resolving questions such as these security concerns. A file cannot change a project for the worse if that file is itself the entire project.
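
The opt-in variant described at the start of this comment might look roughly like this. Both the exception and the `allow_embedded` parameter (standing in for some command-line switch) are hypothetical names, not existing pandoc options.

```python
class EmbeddedDefaultsError(Exception):
    """Raised when a document carries defaults that were not enabled."""


def embedded_defaults(front_matter: dict, allow_embedded: bool) -> dict:
    """Return in-document defaults only if the caller opted in.

    allow_embedded stands in for a hypothetical command-line switch.
    """
    embedded = front_matter.get("defaults_")
    if embedded is None:
        return {}  # nothing embedded, nothing to decide
    if not allow_embedded:
        # Fail rather than silently ignore what the input seems to expect.
        raise EmbeddedDefaultsError(
            "document contains embedded defaults, but they were not enabled"
        )
    return dict(embedded)
```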

@tarleb
Collaborator

tarleb commented Oct 16, 2020

Another option is to disable the feature except if some option is given on the command line.

I think this is difficult to pull off while balancing usability and security, and naming matters a lot. It would presumably not be too difficult to get a user to run pandoc --allow-defaults https://example.com/my-file.md while a command like pandoc --allow-document-to-run-arbitrary-code https://example.com/my-file.md should at least raise some eyebrows.

@iandol
Contributor

iandol commented Oct 16, 2020

@the-solipsist — yes this would be well received by many of us, it is what tools like pandocomatic or @mb21's panrun do, by specifying the options and accessory files as a "recipe" in a manner similar (but potentially more powerful than) defaults files. The whole purpose is to enable users who detest, fear, or remain ignorant of the CLI, or just want to automate Pandoc from another program to use Pandoc flexibly. Zettlr / Scrivener and many other editors would benefit from being able to modify the defaults used within the document being edited. No fiddling with a CLI or config files. Write + compile.

@tarleb: the distinction between defaults called from the CLI or from the document in a GUI seems minor[1]. I assume filters are the biggest concern? If security concerns block defaults settings only for within-document invocation, that would really hinder most benefits of being able to specify how a document is processed from within the document itself. Pushing on the security concern further, filters themselves are "opaque" unless you understand the source, and obviously pose a security concern. If the security issue is a filter, whether it is called from the CLI or a defaults set yields the same end result. What if the filter looks safe but calls another library that is compromised, or what if it exploits a CPU flaw in otherwise innocuous looking code? I realise that security is a gradual continuum. One possible option would be for Pandoc to always process defaults "verbosely", detailing every filter and file used to STDOUT by default; that would at least help somewhat with "transparency"?

Forcing a CLI option to enable in-document processing defeats the purpose of having in-document settings.


[1] For GUI users the document metadata is visible and a non-expert user would be more aware of a defaults file being present as part of the document itself rather than anything buried in a terminal invocation.

@tarleb
Collaborator

tarleb commented Oct 16, 2020

Some possible attacks are described here: #5999 (comment)

The point about filters is well taken. Maybe we should link to the manual's "Security" section more often to highlight that danger? It currently contains this:

  1. Although pandoc itself will not create or modify any files other
    than those you explicitly ask it create (with the exception
    of temporary files used in producing PDFs), a filter or custom
    writer could in principle do anything on your file system. Please
    audit filters and custom writers very carefully before using them.

@brainchild0

I think this is difficult to pull off while balancing usability and security, and naming matters a lot.

Yes, I agree that the idea has problems, including acceptability to users.

I worry that adequate security would be infeasible simply through a whitelist of allowed options, because of the difficulty of maintaining that list such that the application is never vulnerable. It would be difficult enough, perhaps not even strictly possible, to construct an original list that would guarantee security, but even once the list is created, any modification to the application may in principle remove the guarantee.

@the-solipsist
Contributor

I must admit that I don't fully understand the security concern. I'm not sure whether the concerns are very obvious and I'm still missing them somehow, or whether they are subtle (in which case, I don't blame myself :-)

The concerns with allowing arbitrary URLs as described in #5999 (comment) seem well-placed and justified. As I understand that comment, the risk arises mainly because a user can be deceived into executing a defaults file that (a) she has not read and vetted, or (b) has changed since she last vetted it. This is indeed a problem. And there are easy workarounds for this when dealing with scripted shell commands. Hence, there is no real loss of functionality in Pandoc not providing URLs for defaults, while there is an enlarged security risk.

However, with local defaults files that are specified in the input file, I don't see quite how those same risks play out. I don't think the threat model should include having an attacker providing a user an input file and a defaults file (neither of which the user reads), and then getting them to run Pandoc with a flag (which if the user is unfamiliar with Pandoc & its defaults files, would mean getting the user to install Pandoc). The input markdown files and the defaults files aren't complex code and ought to be easy enough to read. (I feel I might be missing something obvious.) And if one is using a GUI text editor (Zettlr/Atom/Scrivener) that uses Pandoc, there is no clear way to customize conversion on a per-document basis. So there is a loss of functionality in not having this feature, unlike in the case of disallowing URLs for defaults files. (I personally use pandoc directly for conversions, so I won't personally be affected by this even if the developers decide not to go ahead with this.)

And as I see it, the problem comes not from defaults files per se, but from --filters and from allowing options like --shell-escape for pdflatex (especially when Pandoc already provides ways to do citations using biblatex, biber, natbib, etc., code syntax highlighting, and to do through filters many of the other things that \write18 would be used for). So rather than an allowlist, a blocklist of cli options that either won't be permitted at all, or won't be permitted in defaults files seems more sensible to me. And filters remain a problem even without defaults files, since it is tougher for lay persons to audit code than for them to read markdown files / defaults files both of which are rather simple.

I think there are all kinds of unexpected results that can occur by allowing defaults in the input file unless this is thought through carefully. For instance, if a user is converting multiple files, each of which refers to different defaults options that are in conflict with one another, which would take precedence and how will the user be warned about this? But I don't really see that as a security issue. I feel I'm clearly missing something, but am unsure what.
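
To illustrate the multi-file case, one way a tool could at least detect such conflicts before deciding on precedence is sketched below. The function name and the input shape (a mapping from file path to that file's embedded defaults) are assumptions for the example.

```python
from typing import Dict


def find_conflicts(per_file_defaults: Dict[str, dict]) -> Dict[str, dict]:
    """Map each conflicting option to the value each file asks for."""
    seen: Dict[str, dict] = {}
    for path, defaults in per_file_defaults.items():
        for key, value in defaults.items():
            seen.setdefault(key, {})[path] = value

    conflicts = {}
    for key, by_file in seen.items():
        values = list(by_file.values())
        # An option conflicts when at least two files disagree on its value.
        if any(value != values[0] for value in values[1:]):
            conflicts[key] = by_file
    return conflicts
```

For two files that agree on `columns` but disagree on `toc`, only `toc` would be reported, along with which file requested which value, which is the information a warning would need.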

@alerque
Contributor

alerque commented Oct 16, 2020

It is currently very safe to run pandoc on any arbitrary input file

I don't believe this is true currently. It depends very much on the exact combination of input and output formats. In particular nothing about running LaTeX is safe. A document can very easily be rigged to execute any arbitrary code when converting to PDF. Filters and custom writers have already been mentioned. I suspect it would be possible to at least rig conversion to some other targets such that the final output would not be safe to open even if the conversion process itself didn't trigger anything.

If "very safe to run on any arbitrary file" is the goal then there are a number of things that need to be changed. In its current condition I think it would be better to warn that untrusted input could potentially do bad things, not pretend that it's currently very safe on arbitrary input/output pairs.

@jgm
Owner Author

jgm commented Oct 16, 2020

See also #5045 for an approach that would provide a high degree of security.

@brainchild0

And if one is using a GUI text editor (Zettlr/Atom/Scrivener) that uses Pandoc, there is no clear way to customize conversion on a per-document basis.

Yes, but superior interaction between Pandoc and interactive editors may be best achieved through features entirely distinct from the one currently proposed.

@iandol
Contributor

iandol commented Oct 17, 2020

Yes, but superior interaction between Pandoc and interactive editors may be best achieved through features entirely distinct from the one currently proposed.

Being able to write a document and specify how it is processed within that document is the "superior" option, it doesn't require any extra tooling or complexity. It is intuitive to use:

---
author: Jane Doe
title: Test
use-defaults: latex-letter
...

My content here.

Single document + identical command, no fussing. Pandoc already allows users to download and use templates, filters and LaTeX that all already generate potential security issues. Specifying defaults adds one additional abstraction, but for non-technical users it makes no difference (Python filters, LaTeX templates or YAML settings), they cannot security audit these because they don't speak these languages. This is like the walled garden of Apple, where individual liberty is removed to satisfy potential threat mitigations (many programs have crippled functionality in the App Store, limiting utility). Do we really want Pandoc to turn into a walled garden that limits its use? Shall we remove templates and filters, surely keeping a Pandoc user safer but limiting the scope of what this utility provides? IMO the current model, warn the user of the potential but do not restrict them, should apply.

@brainchild0

brainchild0 commented Oct 17, 2020

Yes, but superior interaction between Pandoc and interactive editors may be best achieved through features entirely distinct from the one currently proposed.

Being able to write a document and specify how it is processed within that document is the "superior" option, it doesn't require any extra tooling or complexity. It is intuitive to use:

The two statements are not at odds with one another, I believe, in the way they may appear to you.

Yours targets the broader discussion, mine the narrow concern of interaction between Pandoc and an interactive editor, which was one of the subjects of the comment to which mine was a reply.

My comment was prompted by the observation that whatever might be an agreeable user interface, for direct invocation of the application by the user, is not necessarily, nor likely, the optimal interface between the application and a tool such as an interactive editor, which would have its own user interface, and would invoke the application programmatically in response to requests provided to it directly by the user. In plain terms, how a user talks directly to Pandoc, versus how another application talks to Pandoc, are separate questions, which should be considered separately, before reaching any conclusions about how their resolutions might be alike.

The value of such a distinction is not derived from some assumption that use of Pandoc should require extra tooling, only that in some environments, other tools would be implicated for sound reasons.

Again, the comment does not negate the broader premise of the discussion, but only addresses one specific remark.

@brainchild0

brainchild0 commented Oct 17, 2020

Specifying defaults adds one additional abstraction, but for non-technical users it makes no difference... they cannot security audit these because they don't speak these languages.

I'm not sure I see how a security audit being possible for some particular user demographic relates to the theme of the security concerns. Perhaps I'm just missing it.

This is like the walled garden of Apple, where individual liberty is removed to satisfy potential threat mitigations...

The concern is serious, but might be a caricature of the core issue. Separation of procedural logic and pure information has been a central topic since the earliest days of computing. If some design assumes that some inputs contain only the latter, and if such assumption is wrong, then the results are unpredictable, even detrimental. Meanwhile, opening all inputs to the possibility of procedural logic produces the liability of no safe location remaining for pure information that is unable to cause harm. This constraint then produces an environment that is fully open, making it convenient for users in the case that all individuals are behaving soundly, and dangerous in the case that even only one is behaving maliciously or even just carelessly.

Tendencies that appear paranoid to the end user have a legitimate place in security appraisals, because end users are not the ones that are impelled day and night to calculate ingenious methods of causing misery for others.

@iandol
Contributor

iandol commented Oct 19, 2020

I'm not sure I see how a security audit being possible for some particular user demographic relates to the theme of the security concerns. Perhaps I'm just missing it.

Well, as I understood it, the original point was that by containing Pandoc settings in defaults metadata, we have "potentially" added one layer of abstraction, and that if this was invoked from the CLI it would somehow be more transparent for a user to understand/audit and therefore safer? But this supposes the user can understand what the CLI and its options represent (it certainly isn't intuitive), that the CLI itself isn't abstracted by a calling program (script, editor etc.), and that the abstractions that templates / filters represent are the major threat irrespective of being called from a CLI or being present in the document itself. So why should CLI invocation merit a "user responsibility to security audit themselves" but using metadata merit a "let's hard-limit Pandoc functionality"?

@brainchild0

brainchild0 commented Oct 19, 2020

@iandol: Again, I may be missing a critical piece. Nevertheless, I'm not following a train of connections leading to your recent comments from the earliest comments referring to security. Those comments pointed to the threat from an attack model in which the input document is altered to change or augment the set of files involved in input or output.

The following is quoted from above:

However, allowing methods to overwrite command line parameters could give a document author a lot of control over input files, output location, filters, pdf engine, etc. YAML can be hidden anywhere in a long Markdown document, so it can become difficult to find when skimming the document.

The emphasis is not on the demand for the user to perform an audit, but rather the danger to the user owing to the user being unlikely to consider meticulously the security implications of each transaction.

So why should CLI invocation merit a "user responsibility to security audit themselves" but using metadata merit a "let's hard-limit Pandoc functionality"?

Being the target of a security audit is not essentially the relevant distinction between input files and arguments given directly on the CLI. The distinction is rather who provides the data. Input files are passed between users (or to a user from an attacker posing as a user). Arguments, however, are supplied by the same user whom they affect.

@iandol
Contributor

iandol commented Oct 19, 2020

Sorry if I was not clear, we may be speaking from very different perspectives. My point is that a "user being unlikely to consider meticulously the security implications of each transaction", also applies to (1) someone who does not really understand the CLI or (2) any language a filter or template is written in, not only for the topic of this issue (3) processing instructions in a document or being allowed to reference default.yaml files. A potential safety restriction has been proposed only for the latter (3), not former (1,2) cases. I think it is valid to question why case (3) should be treated differently.

Arguments, however, are supplied by the same user whom they affect.

People copy and paste commands from online tutorials or Stack Overflow into the CLI all the time[1]. At what level is a bad actor (who has weaponised the myscript.lua) asking a naive user to copy-and-paste the command into their terminal:

> pandoc -L myscript mydocument.md

Or reference a file in metadata:

---
&defaults
  lua-filter: myscript
...
> pandoc mydocument.md

…different? This assumes somehow the user will understand the CLI and -L as potentially dangerous but not the metadata in their document. A malicious actor can convince a naive user to run commands in a CLI as easily as download a YAML metadata file and reference it in their markdown.

Benefits vs. Costs

By specifying the options to build our documents within the document metadata we have a practical solution to build documents that would work for users today. Abstract discussions about logic vs. information purity are interesting, but do not satisfy any practical or working result in the near future[2]. And the security implications/costs are not isolated to the metadata scenario alone.


[1] I help quite a few users with their Pandoc workflows and tell them what terminal commands to use; they are trustingly following a recipe without fully understanding their actions.
[2] i.e. what is your practical alternative for working with Scrivener or Zettlr as an example, given that these apps may or may not want to redesign for Pandoc specifically.

@brainchild0

brainchild0 commented Oct 19, 2020

Abstract discussions about logic vs. information purity are interesting, but do not satisfy any practical or working result in the near future.

Framing the distinction abstractly may appear boring or obtuse to the end user, but the practical effects to this group of such a discussion occurring in a design context are not insubstantial. Might you agree that processing a file framed as a text document carries a rather different intuitive feel to many users, even if not all, compared to copying commands into the console, in terms of the sense of risk and uncertainty?

At what level is a bad actor (who has weaponised the myscript.lua) asking a naive user to copy-and-paste the command into their terminal... different?

First, a small point. The central target of the immediate concern is not a Lua file, but a Markdown one.

Imagine finding a file called tolstoy_warandpeace.md. Many would imagine that the contents of the file are some textual representation of Leo Tolstoy's iconic 1869 novel. Within this group, how many would imagine, or even consider, that the file also contained instructions that would surreptitiously alter the data on the user's system?

Consider an analogy. As many may recall, in the years that straddled the change of the millennium, the Windows world was haunted by viruses taking the form of Office macros distributed in files intended to appear benign. Eventually, the design of Office (see reference) was revised to restrict the operation of macros to contexts that the user might consider safe. These revisions forced attackers to adopt either of two strategies. The first was to distribute documents that appeared to unsuspecting users as regular documents, but were in fact executable programs. The other was to utilize an "exploit" vulnerability in the software's protective layer. Over the years, many such exploits have surfaced, and have caused damage before being resolved (see reference). Both strategies had some success, but much more limited than that which was possible when the software design was fully open to attack. The legacy of these oversights remains even today, putting aside the preventable damage that occurred during the period of complete vulnerability.

Plainly, implementing the protections outright would have been a wiser choice.

Numerous differences between the Office and Pandoc cases are obvious, and perhaps make the outlook in the latter much more sanguine. Yet losing the broader context in haste toward an immediate payoff remains the serious cautionary consideration.

Sorry if I was not clear, we may be speaking from very different perspectives.

Yes, I agree, but the difference is less a flaw in the dialog or failure of communication than a natural consequence of the tension between security and convenience. Putting my cards on the table, I am not categorically opposed to the feature on security grounds. (My original comments on the feature were related entirely to other considerations, and even those comments expressed no strict opposition.) Yet I would be reluctant to dismiss summarily the security considerations recently raised, because they do point to a serious dilemma.

@tarleb
Collaborator

tarleb commented Oct 19, 2020

The discussions, while interesting, seem to swerve into very fundamental issues of security, user education, and interface design. I think this would best be moved into another issue, to the mailing list, or maybe even a place like security.SE. Thanks.

xaviervalarino added a commit to xaviervalarino/portfolio that referenced this issue Oct 31, 2020
Pandoc doesn't have a way to specify which template file should be used
within the markdown front-matter, so I'm using `sed` to see if a
template key exists, and if it does, using that as the argument for the
`--template` option.

Hopefully this will get added in the future. Following Pandoc
Issue #5870 to see what happens.

jgm/pandoc#5870
@hoclun-rigsep

I have been lurking in this and related threads for a long while. Here's one
request for clarification on the security issues and one perspective on the
merits of the proposal.

@brainchild0 writes:

First, a small point. The central target of the immediate concern is not a
Lua file, but a Markdown one.

Imagine finding a file called tolstoy_warandpeace.md. Many would imagine
that the contents of the file are some textual representation of Leo
Tolstoy's iconic 1869 novel. Within this group, how many would imagine, or
even consider, that the file also contained instructions that would
surreptitiously alter the data on the user's system?

I've read this and the related issues several times but still fear I may have
missed this: How is this not dealt with by requiring a command line option to
enable explicitly the feature under discussion? Upthread @brainchild0 writes:

Ideally, an input file contains only textual content and related metadata.

I found this compelling the first several times I read it. But the user
could use this CLI option to declare, "I, in this invocation anyway, disclaim
reliance on that ideal in favor of the advantages of the one-file project
structure." If the option is not given, tolstoy_warandpeace.md goes back to
being just textual content and related metadata (including some mere
suggestions about how to build an output target).
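The opt-in behaviour described above can be sketched in Python. This is only an illustration under stated assumptions: the wrapper name `pandoc-wrapper` and the flag `--allow-document-defaults` are hypothetical and are not pandoc options, and `document_defaults` stands in for whatever options were parsed out of the document. Without the flag the embedded options are ignored; with it they apply, but never override what the user typed.

```python
import argparse


def effective_options(argv: list[str], document_defaults: dict) -> dict:
    """Merge options embedded in a document only when the user opts in.

    `--allow-document-defaults` is a hypothetical flag, used here purely
    for illustration; it is not part of pandoc.
    """
    parser = argparse.ArgumentParser(prog="pandoc-wrapper")
    parser.add_argument("input")
    parser.add_argument("-o", "--output")
    parser.add_argument("--allow-document-defaults", action="store_true")
    args = parser.parse_args(argv)

    # Options given explicitly by the invoking user.
    options = {"input": args.input, "output": args.output}

    if args.allow_document_defaults:
        # Embedded options only fill gaps; explicit arguments always win.
        for key, value in document_defaults.items():
            if options.get(key) is None:
                options[key] = value
    return options
```

With this shape, running the wrapper on tolstoy_warandpeace.md without the flag treats any embedded options as inert metadata.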

I agree with @iandol and @mb21 and others about the motivations at play
here and believe it's worth quoting @mb21 in #4627:

The motivation is really that for one-off documents, I want to save the
necessary pandoc options right in the file. (Just like rmarkdown users can
simply open the file and hit that 'convert' button.) I don’t want to remember
which document-class/style/theme I had decided to convert this document with.
I don’t want to litter my filesystem with runpandoc.sh or template.html files
for each one-off document. Finally, I didn’t want to “parse” YAML with sed,
or use a complex tool that only works for certain options.

If there's one thing these lengthy discussions have established, it's that
this is a common motivation. Though confident that I understand the principles
of separation of concerns, separating content from presentation, etc., I
have felt this motivation myself. I've also observed in this
and other contexts that separating concerns can be more of a concession to the
inhuman UI than a reflection of reality: there are many situations in the
messy real world where we-as-users find things more tightly coupled than
we-as-devotees-of-abstraction would like them to be, and it may even be that
these situations are particularly common in typesetting. I believe I could
provide pithy examples.

For some time before writing today's comment I thought that I should reject
the one-file approach entirely and embrace the directory as the working unit
for every document "no-matter-what." @iandol's discussion of other users and my
own experience turned me around on that. While with or without the feature
under discussion I will continue to use project directories and makefiles
where I think they do the most good, I do consider the one-file goal
legitimate, and prefer that the implementation lie within pandoc proper.

@brainchild0

brainchild0 commented Dec 12, 2020

Not sure whether to engage the new comment, given the earlier request by @tarleb, so I will try to be brief then hopefully let go.

How is this not dealt with by requiring a command line option to enable explicitly the feature under discussion?

It is addressed, I believe at least adequately, by such an option. Hence the suggestion.

Ideally, an input file contains only textual content and related metadata.

I found this compelling the first several times I read it.

I do submit that such is the ideal, without the intention to demand a completely inflexible adherence to it, as long as the essential concerns are balanced. Use cases and usability certainly capture a range of valid concerns, which I have not intended to dismiss.

I do consider the one-file goal
legitimate, and prefer that the implementation lie within pandoc proper.

Also note that the earlier suggestion of putting the text content inside the defaults file, rather than putting the operational instructions inside a text document, accomplishes the same objective, and though such an approach may seem counterintuitive to those familiar with the historic design of Pandoc, it may also be a more all-around sound approach moving forward.

@jgm
Owner Author

jgm commented Dec 12, 2020

putting the text content inside the defaults file, rather than putting the operational instructions inside a text document

I may have missed that comment above. Is the idea something like this?

---
from: markdown
to: html
standalone: true
...
Everything after the end of the YAML document gets used as the
input, unless an input file has been specified.
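To make the idea concrete, here is a minimal Python sketch of how an external wrapper might treat such a combined file today. It assumes the file opens with a `---` line and that the YAML document ends at the first following `...` or `---` line; the function names are hypothetical, and nothing here reflects how pandoc itself would implement the feature. The wrapper writes the YAML part to a temporary defaults file and returns the rest as text to feed to pandoc on standard input.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def split_defaults_and_body(text: str) -> tuple[str, str]:
    """Split a combined file into (defaults YAML, document body)."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text  # no leading YAML document: everything is body
    for i in range(1, len(lines)):
        if lines[i].strip() in ("...", "---"):
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return "", text  # unterminated block: treat the whole file as body


def build_invocation(combined_path: str) -> tuple[list[str], str]:
    """Return (argv, stdin_text) for a hypothetical wrapper around pandoc."""
    text = Path(combined_path).read_text(encoding="utf-8")
    defaults, body = split_defaults_and_body(text)
    # Write the YAML part to a temporary defaults file that pandoc can read.
    tmp = tempfile.NamedTemporaryFile(
        "w", suffix=".yaml", delete=False, encoding="utf-8"
    )
    with tmp:
        tmp.write(defaults)
    # With no input file named, pandoc reads the document from stdin.
    return ["pandoc", "--defaults", tmp.name], body
```

A caller could then run something like `subprocess.run(argv, input=body, text=True)`; that step is left out so the sketch does not depend on pandoc being installed.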

@brainchild0

I may have missed that comment above. Is the idea something like this?

Yes, this one, but expanding the objective to human usability, in light of the sensibilities revealed by ongoing discussion, rather than keeping it constrained to programmatic invocation, as suggested originally.

xaviervalarino added a commit to xaviervalarino/portfolio that referenced this issue Jan 8, 2021
@mb21
Collaborator

mb21 commented Jun 5, 2021

This discussion has now spanned several different aspects. But I'm particularly interested in achieving this goal (yes, also for PanWriter):

The whole purpose is to enable users who detest, fear, or remain ignorant of the CLI, or just want to automate Pandoc from another program to use Pandoc flexibly. Zettlr / Scrivener and many other editors would benefit from being able to modify the defaults used within the document being edited. No fiddling with a CLI or config files. Write + compile.

Do we think this is basically achievable somehow, or do we think the security risks are fundamentally too great?

I can think of two different kinds of security risks when converting an untrusted markdown file:

  1. Side-effects that modify the user's computer (e.g. file system).
  2. Side-effects that read out sensitive parts of the user's computer (e.g. file system) and send that information to an attacker-controlled web-server.

Regarding threat 1, --output and --extract-media options could simply be limited to the same directory as the input markdown file resides in. Or when reading in myFileName.md, perhaps even forcing the output to go to myFileName.html, and myFileName/* for extract-media.
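As a rough illustration of that restriction, the following Python sketch checks whether a requested output path stays inside the directory of the input file, and derives the forced names mentioned above. The function names are hypothetical and the check is an assumption about how such a limit could work, not a description of anything pandoc does.

```python
from __future__ import annotations

from pathlib import Path


def is_inside_input_dir(input_file: str, requested: str) -> bool:
    """True if `requested` resolves to a location under the input file's directory."""
    base = Path(input_file).resolve().parent
    # Relative paths are interpreted against the input directory; an absolute
    # `requested` replaces `base` entirely when joined.
    target = (base / requested).resolve()
    return base in target.parents


def forced_outputs(input_file: str) -> tuple[Path, Path]:
    """Derive myFileName.html and myFileName/ from myFileName.md."""
    source = Path(input_file)
    return source.with_suffix(".html"), source.with_suffix("")
```

Resolving both paths before comparing them is what catches `../` segments and absolute paths; a plain string-prefix comparison would not.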

But securing filters and pdf-engines (like pdflatex) seems like an impossible task (although we convert to PDF with LaTeX in a tempdir, so accessing anything outside usually runs into the openout_any = p limitation). But that's also the case when opening an untrusted .tex file and compiling it... so we can either:

  • say "when a user has TeX installed, they probably know what they're doing" (especially when they're running pandoc from the command-line), or
  • have a one-time confirmation dialog informing the users of the risks, or
  • just not support filters and pdf generation in this mode (probably what #5045, "Add a "sandboxed mode" that limits IO", would do).

I wonder what the RStudio, Zettlr etc. devs think about this use-case though...

Regarding threat 2, while it's easy to craft a command that leaks information into the output file, which the attacker would have to trick the user into sending back to him (e.g. echo '![](id_rsa)' | pandoc --self-contained --resource-path ~/.ssh or echo '\input{~/.ssh/id_rsa}' | pandoc -o output.pdf), I'm not sure there's a way to trick pandoc into making an HTTP request containing such information (which is possible when opening untrusted CSV files in spreadsheet software) – well, except using --pdf-engine=lualatex, context, or wkhtmltopdf, since you then can make network requests.

@brainchild0

brainchild0 commented Jun 5, 2021

To my thinking, this discussion now has evolved into a false dilemma between either 1) combining text and instructions into files that look like text files versus 2) completely forbidding such a combination in any single file.

Before it was typical for desktop systems to have ZIP-archive software installed, archives were commonly distributed in a self-extracting form. Archives targeted at Windows users would end in the .exe extension.

Suppose a debate had occurred for the choice between giving all ZIP archives the extension .exe, or not allowing the creation of self-extracting archives.

The current consensus overwhelmingly affirms the value of some method to build a recipe for generating a target document through Pandoc, represented as a self-contained package containing both document text and processing options.

The relevant question, to my mind, is a choice between a generalization of text documents to include processing options, or a generalization of defaults files to include text. It strikes me as clear at this stage that the latter is preferable. Let's keep text documents as documents that contain text, and let's develop a definition for a self-contained package of instructions for Pandoc to process conveniently. Such a package may be equally suited for file storage, IPC operations, or network requests, and would help achieve all of the objectives already identified in this thread without any significant security threat or other serious side effects.

@allefeld
Contributor

allefeld commented Feb 3, 2022

Sorry for hijacking this issue, but I believe people who commented here might be interested in this project of mine:
Pandoc/Defaults
I would definitely be interested in your comments on it, via Issues or Discussions.

It does not propose to change the way Pandoc operates, but rather a minimalistic extension of the YAML header, to be interpreted by an external processor which calls Pandoc. The repository includes a description and motivation of the proposal, a Python implementation and a VS Code extension.

@camoz

camoz commented Feb 14, 2024

I have read this and a few related issues in search of a way to do something like this:

---
author: Jane Doe
title: Test
use-defaults: latex-letter
...

My content here.

A few thoughts regarding what I have read:

  • Two main features are discussed here: (1) being able to specify defaults directly in the input file (i.e. the current issue description), and (2) linking to a defaults file from within the input file (like in the code snippet above). Regarding the second feature: it would be nice to either reopen #7977 ("Allow defaults file to be specified in YAML header"), or to modify the description of this issue to also include feature (2), so that it does not get lost, and so that people searching for this (much-requested) feature can easily find the relevant issue for it.
  • I'm perhaps lacking some details, but, while I like the idea of a defaults_: field for specifying defaults directly in the input file, the man page states in the part Extension: yaml_metadata_block that "Fields with names ending in an underscore will be ignored by pandoc. (They may be given a role by external processors.)" So I imagine that quite a few external processors and user scripts currently use defaults_:, since it's a descriptive name for the functionality. Could this lead to unexpected behavior? Why not choose a field name that does not end in an underscore, to keep that "ends-with-underscore" namespace completely unused?
  • I find it more intuitive to think of adding/processing defaults: and defaults-file: keys in the input files (possibly only if a global --read-defaults is specified), than to use defaults files which happen to also contain input data that is used. Here, the first half of this issue's description differs from the second half IMO. It is not clear to me what is meant.

@mboyea

mboyea commented Jun 25, 2024

Given the simple bash script solution (using sed) to interpret template_: name-of-file here: #1958 (comment)
I'm sure it would be easy to write a script to interpret defaults_:\n- name-of-file until a proper solution is implemented.
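A minimal sketch of such a script, written in Python rather than bash and sed, might look like the following. It assumes a top-level `defaults_:` key followed by a plain list of names in the first YAML block, and it uses a deliberately naive line-based reader (an assumption made to keep the example dependency-free; a real script would be better served by a proper YAML parser). The function names are hypothetical.

```python
from __future__ import annotations

import re


def read_defaults_names(markdown_text: str) -> list[str]:
    """Collect list items under a top-level `defaults_:` key in the first YAML block."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no YAML header at the start of the file
    names: list[str] = []
    in_key = False
    for line in lines[1:]:
        if line.strip() in ("---", "..."):
            break  # end of the YAML header
        if re.match(r"^defaults_:\s*$", line):
            in_key = True
            continue
        if in_key:
            item = re.match(r"^\s*-\s+(.+?)\s*$", line)
            if item:
                names.append(item.group(1).strip("'\""))
                continue
            in_key = False  # any other line ends the list
    return names


def pandoc_command(input_path: str, markdown_text: str) -> list[str]:
    """Build an argv that passes each listed name to pandoc's --defaults option."""
    cmd = ["pandoc"]
    for name in read_defaults_names(markdown_text):
        cmd += ["--defaults", name]
    return cmd + [input_path]
```

The resulting list can be handed to `subprocess.run`; pandoc then treats each name exactly as if it had been passed with `--defaults` on the command line.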

For now, I'll stick to declaring my own template_ in the YAML header. In my template, I strip parts that allow YAML header styling options (like geometry) altogether anyways. I would rather create custom LaTeX (or HTML/CSS) templates than implement pandoc-specific YAML header options. I need to create templates for my use case anyways.

Edit: I implemented this solution. Now I find that declaring a default_ file is extremely useful. Keeping the --template as default allows my scripts to support new features as they're added to Pandoc.

Repo link: https://github.com/mboyea/pandoc-scripts

@allefeld
Contributor

Coming back to this because of notifications: I have used my own solution Pandoc/Defaults mentioned above for a while, and was quite happy with it. However, in the meantime I discovered Quarto, which is mainly a system for processing computational documents using Pandoc in the background (like R Markdown, but extending it to languages other than R). However, it has extensive support for setting options, including Pandoc's, in the YAML header (or an external YAML file), also for non-computational documents. And it comes with nice HTML templates and a set of themes. For this reason I rarely use Pandoc directly anymore. Pandoc is the amazingly powerful underlying engine, but it is imho easiest to use through Quarto.

@iandol
Contributor

iandol commented Jun 26, 2024

Quarto is a great project, with a useful focus and extended feature set. I do appreciate that you can perform configuration in the YAML header metadata. But it is worth noting that the Quarto project explicitly requires the project folder as the basic unit, not the file. They don't enable something like the pandoc data directory, which stores compilation-related templates/filters in a shared space. You must install e.g. filters for each project folder, then if a filter changes, reinstall for each project folder etc. While there are other advantages to Quarto (beautiful HTML outputs, great cross-referencing, Mermaid etc. and easy install of TinyTeX), Pandoc is still a more flexible tool for ad-hoc writing, whether it is one quick file or some major project of multiple files... (not that this is an either/or, you can easily use both as needed)

12 participants