Allow defaults to be folded into YAML metadata #5870
Comments
So you would have to do Also copying my comment from #4627: I don't think a lot of people know about YAML anchors and it will unnecessarily confuse them. But I like having the options as subfields (e.g. under |
With this proposal you'd have to do |
I also think On issue #5790 there was also quite a bit of discussion about a better name than |
I feel that a principal benefit of #5790 was to enforce a clearer separation of concerns, the lack of which had previously created a variety of issues, including those discussed in #4627. I am hesitant to support moving in any direction that appears to reverse these gains. Ideally, an input file contains only textual content and related metadata. |
A unified file which specifies how its content is to be parsed without hierarchies of accessory configuration files is useful IMO. As an example, the wonderful writing app Scrivener supports Pandoc output, but compiles its metadata and contents into a single .md file. Many [most] users who are not programmers or command-line specialists find it really challenging to conceptualise technical solutions that "elegantly" separate semantic parts[1]. They just want to be able to take their work and have an elegant output. Nothing is stopping more technical purists from separating default settings, metadata and documents in this proposal; it just enables a workflow that would benefit casual users. [1] I provide a premade Pandoc compiling workflow for Scrivener, using pandocomatic to deal with the configuration (using a separate |
Having only heard of Scrivener in passing, am I correct that a core design feature is separating the physical layout features of a document from the text, to help authors focus on the development of the text? If so, would such a model not accord more directly with the "semantic separation"? |
Yes, Scrivener is a GUI-based app that encourages separation of content and presentation. But we have had decades of people using things like Word or PowerPoint where they just press buttons to make stuff look how they want while they create. The styles system in Word is really well developed by now, and still almost all Word docs I have to deal with ignore semantic styles and use inline styling. Many users are drawn to the organisational features of Scrivener but they have a really hard time trying to conceptualise and adjust to the idea of compiling content into output formats. Scrivener can hide a lot of this, but nevertheless requires a lot of adjustment from users. This does make them more likely to then take a step towards something like Pandoc to really enhance the transformation of their work. But overall most users are only willing to step so far… YAML is structured using whitespace to identify that structure; I think this is intuitive or at least intelligible for users. Utilising the structured nature of YAML directly was why I preferred using a named Pandoc is a swiss-army knife, and that sometimes means using the screwdriver to open jars 😃 |
Largely I think your premises and reasoning are sound, @iandol, but I have some difficulty identifying precisely where I might diverge from your conclusions, because I am unsure which particular group of users is or should be considered in each of the cases you outline. For example, the group that has a really hard time trying to conceptualise and adjust to the idea of compiling content into output formats would seem to be disjoint from the group that uses Pandoc. I suspect you would agree, but I may not fully follow the train of ideas you have given. I think it is good to consider the various ways that Pandoc might be integrated with different tools, invoked using different strategies, and understood by different groups of users. But for each case one needs to consider the actual combinations along these axes that have representation. |
Yes, this is always a conundrum that makes the clear waters of any design turbid. But I suppose precisely for this reason I prefer pluralism, and think the flexibility of being able to use one file or several will appeal to a broader range of use-cases than strictly enforcing the separation of pandoc settings from metadata, variables and content. John and others probably know the numbers better than I, but the fact that there are numerous issues and forum posts asking for the ability to put settings into some form of YAML section suggests that there are pure Pandoc users who would prefer this way of dealing with their workflow. Scrivener users were one example class that would benefit, but previous requests for something like this point to there being different users to whom this would appeal. To invert the question, apart from conceptual "cleanliness"[1], what are the problems with allowing documents that contain a [1] which is relative, because as I mentioned above, it is not always clear what the distinction is between settings and at least some metadata options. A real "clean" solution would separate settings, metadata, variables and content into their own accessory and main documents. |
My preliminary ranking of four possibilities in order of ascending preference, looks as shown below. I am listing the negative features of each idea. I am not rejecting that they may have merits, but I’ll leave the task of listing them to enthusiasts. I little doubt they will appear, as I rarely pick the popular side.
|
Was |
Possibly I've missed an important point, but trying to think through the scenarios and mutations, it seems that we could have, at least in principle, some very roundabout flow of data the way I understand that the feature is currently described. Taking what seems to be currently the most popular variation, and considering the following header:
Then the result is a document with title The real title, and subtitle The real subtitle, correct? |
It's less fitting than |
So you consider "options" without qualification to indicate command-line options? I suppose some might find it misleading. For me it's not a problem. I don't think of options in file as either more, less, or equally comprehensive necessarily than options from the command line. I think of options as the top-level concept, of which those from the command line, or any other source, are just special cases. |
My perspective is that of a general user.
I personally hope so, but I think John is working on standardising how all these potential sources 'cascade' together in what order... |
Are you referring to the logic that allowed handling of multiple defaults files on the same command line, or something different I might not know about? |
Yes. |
That logic would be separate from how metadata from multiple sources is merged. I think this process has always been a flat, key-wise update. Or do I misunderstand? |
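If merging is indeed a flat, key-wise update as suggested in this comment, its behaviour can be illustrated with a minimal Python sketch. The function name `merge_flat` and the sample keys are hypothetical; this is not Pandoc's implementation, only a model of the behaviour being described.

```python
def merge_flat(*sources):
    """Merge metadata mappings left to right with a flat, key-wise update.

    Assumption for illustration: a later source replaces the whole value of
    any top-level key it defines; nested maps are NOT merged recursively.
    """
    merged = {}
    for source in sources:
        merged.update(source)  # later sources win, one top-level key at a time
    return merged


document_meta = {"title": "A", "author": {"name": "X", "email": "x@example.org"}}
extra_meta = {"author": {"name": "Y"}}
print(merge_flat(document_meta, extra_meta))
```

Under this model the `email` field is lost, because the whole `author` value is replaced rather than merged field by field.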
John is better placed to answer how all of this will fit together than I 😊 … |
No, because setting anything in |
As discussed, the idea on offer would seem in a large sense to defeat the earlier objective of separating the two kinds of data. However, one context in which combining them might be uniquely useful is programmatic invocation of the application, with a dynamically-generated source, as might be desired for example by a MarkDown editor that supports publishing and printing through Pandoc functionality. For such a purpose, however, I would be far less inclined to advocate folding the defaults fields into the metadata header than adding support directly in the defaults file that could instruct the application to acquire the document source from the data in the same stream that follows the YAML block:
|
One other option to consider is to be able to indicate a defaults file's location in the in-file YAML:
There are text editors that support exporting via Pandoc. However, currently they can't really use defaults files, since defaults files have to be indicated outside of the document. Being able to specify defaults file paths would greatly help a user of a text editor like Zettlr to be able to use defaults files using its in-built Pandoc-based export functionality. Does this make any sense as an alternative/additional approach? What would be the drawbacks? |
I'm worried about the security implications of a feature like this. It is currently very safe to run pandoc on any arbitrary input file -- the worst possible attack right now is denial of service by crafting input such that pandoc takes a long time to parse. DoS like that is comparatively easy to defend against by setting resource limits for the process. However, allowing methods to overwrite command line parameters could give a document author a lot of control over input files, output location, filters, pdf engine, etc. YAML can be hidden anywhere in a long Markdown document, so it can become difficult to find when skimming the document. If we do this, it might be wise to define an explicit allow list for command line options which are deemed safe. |
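The allow-list idea can be sketched in a few lines of Python. The set `SAFE_OPTIONS` below is purely hypothetical: which options would actually count as safe is exactly the open question, and nothing like this exists in Pandoc today.

```python
# Hypothetical allow-list, for illustration only. Deciding which options are
# really safe to accept from an untrusted document is the open question.
SAFE_OPTIONS = frozenset({"toc", "number-sections", "standalone", "columns"})


def partition_defaults(requested, allowed=SAFE_OPTIONS):
    """Split in-document defaults into accepted and rejected keys.

    `requested` maps option names found in the document to their values.
    Returns (accepted_mapping, sorted_list_of_rejected_keys).
    """
    accepted = {key: value for key, value in requested.items() if key in allowed}
    rejected = sorted(key for key in requested if key not in allowed)
    return accepted, rejected


accepted, rejected = partition_defaults(
    {"toc": True, "filters": ["untrusted.lua"], "output-file": "/tmp/out.html"}
)
print(accepted)  # only the keys on the allow-list survive
print(rejected)  # everything else is reported rather than silently applied
```

Reporting the rejected keys, instead of dropping them quietly, would at least let a caller warn the user that the document asked for more than it was given.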
Another option is to disable the feature except if some option is given on the command line. (Following this design, the application probably should fail if the option is not given and the input seems to expect it.) Partitioning the options into two categories, according to which are safe for use in input, may create challenges for maintenance. It seems to present the threat of features creeping across the boundary, due to carelessness or poor decisions in code revisions. The potential for unexpected interaction among options presents an even more serious challenge to maintaining a list of safe features. Ultimately, the difference between input files with defaults values and defaults files with input content is in many ways a small distinction, but may have significance in resolving questions such as these security concerns. A file cannot change a project for the worse if that file is itself the entire project. |
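The opt-in design from the first paragraph of this comment could look roughly like the sketch below. The trigger key `use-defaults` is borrowed from examples elsewhere in this thread, and the function and exception names are hypothetical; no such switch exists in Pandoc.

```python
class EmbeddedDefaultsError(Exception):
    """Raised when a document expects embedded defaults but the user did not opt in."""


def check_embedded_defaults(metadata_keys, opted_in, trigger_key="use-defaults"):
    """Return True if embedded defaults should be applied.

    Fails loudly, rather than ignoring the key, when the document asks for
    embedded defaults and the (hypothetical) command-line opt-in is absent.
    """
    wants_defaults = trigger_key in metadata_keys
    if wants_defaults and not opted_in:
        raise EmbeddedDefaultsError(
            f"document sets '{trigger_key}' but embedded defaults were not enabled"
        )
    return wants_defaults


print(check_embedded_defaults({"title", "use-defaults"}, opted_in=True))
print(check_embedded_defaults({"title"}, opted_in=False))
```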
I think this is difficult to pull off while balancing usability and security, and naming matters a lot. It would presumably not be too difficult to get a user to run |
@the-solipsist: yes, this would be well received by many of us, it is what tools like
@tarleb: the distinction between defaults called from the CLI or from the document in a GUI seems minor[1]. I assume filters are the biggest concern? If security concerns block defaults settings only for within-document invocation, that would really hinder most benefits of being able to specify how a document is processed from within the document itself. Pushing on the security concern further, filters themselves are "opaque" unless you understand the source, and obviously pose a security concern. If the security issue is a filter, whether it is called from the CLI or a defaults set yields the same end result. What if the filter looks safe but calls another library that is compromised, or what if it exploits a CPU flaw in otherwise innocuous-looking code? I realise that security is a gradual continuum. One possible option would be for Pandoc to always process defaults "verbosely", detailing every filter and file used to STDOUT by default; that would at least help somewhat with "transparency"? Forcing a CLI option to enable in-document processing defeats the purpose of having in-document settings. [1] For GUI users the document metadata is visible and a non-expert user would be more aware of a defaults file being present as part of the document itself rather than anything buried in a terminal invocation. |
Some possible attacks are described here: #5999 (comment) The point about filters is well taken. Maybe we should link to the manual's "Security" section more often to highlight that danger? It currently contains this:
|
Yes, I agree that the idea has problems, including acceptability to users. I worry that adequate security would be infeasible simply through a whitelist of allowed options, because of the difficulty of maintaining that list such that the application is never vulnerable. It would be difficult enough, perhaps not even strictly possible, to construct an original list that would guarantee security, but even once the list is created, any modification to the application may in principle remove the guarantee. |
I must admit that I don't fully understand the security concerns. I'm not sure if they are very obvious and I'm still missing them somehow, or whether they are subtle (in which case, I don't blame myself :-)). The concerns with allowing arbitrary URLs as described in #5999 (comment) seem well-placed and justified. As I understand that comment, the risk arises mainly because a user can be deceived into executing a defaults file that (a) she has not read and vetted, or (b) has changed since she last vetted it. This is indeed a problem. And there are easy workarounds for this when dealing with scripted shell commands. Hence, there is no real loss of functionality in Pandoc not providing URLs for defaults, while there is an enlarged security risk. However, with local defaults files that are specified in the input file, I don't quite see how those same risks play out. I don't think the threat model should include an attacker providing a user with an input file and a defaults file (neither of which the user reads), and then getting them to run Pandoc with a flag (which, if the user is unfamiliar with Pandoc and its defaults files, would mean getting the user to install Pandoc). The input markdown files and the defaults files aren't complex code and ought to be easy enough to read. (I feel I might be missing something obvious.) And if one is using a GUI text editor (Zettlr/Atom/Scrivener) that uses Pandoc, there is no clear way to customize conversion on a per-document basis. So there is a loss of functionality in not having this feature, unlike in the case of disallowing URLs for defaults files. (I personally use pandoc directly for conversions, so I won't personally be affected by this even if the developers decide not to go ahead with this.) And as I see it, the problem comes not from defaults files per se, but from I think there are all kinds of unexpected results that can occur by allowing defaults in the input file unless this is thought through carefully. 
For instance, if a user is converting multiple files, each of which refers to different defaults options that are in conflict with one another, which would take precedence and how will the user be warned about this? But I don't really see that as a security issue. I feel I'm clearly missing something, but am unsure what. |
I don't believe this is true currently. It depends very much on the exact combination of input and output formats. In particular, nothing about running LaTeX is safe. A document can very easily be rigged to execute arbitrary code when converting to PDF. Filters and custom writers have already been mentioned. I suspect it would be possible to at least rig conversion to some other targets such that the final output would not be safe to open even if the conversion process itself didn't trigger anything. If "very safe to run on any arbitrary file" is the goal then there are a number of things that need to be changed. In its current condition I think it would be better to warn that untrusted input could potentially do bad things, not pretend that it's currently very safe on arbitrary input/output pairs. |
See also #5045 for an approach that would provide a high degree of security. |
Yes, but superior interaction between Pandoc and interactive editors may be best achieved through features entirely distinct from the one currently proposed. |
Being able to write a document and specify how it is processed within that document is the "superior" option; it doesn't require any extra tooling or complexity. It is intuitive to use:
---
author: Jane Doe
title: Test
use-defaults: latex-letter
...
My content here.
Single document + identical command, no fussing. Pandoc already allows users to download and use templates, filters and LaTeX, all of which already create potential security issues. Specifying defaults adds one additional abstraction, but for non-technical users it makes no difference (Python filters, LaTeX templates or YAML settings): they cannot security-audit these because they don't speak these languages. This is like the walled garden of Apple, where individual liberty is removed to satisfy potential threat mitigations (many programs have crippled functionality in the App Store, limiting utility). Do we really want Pandoc to turn into a walled garden that limits its use? Shall we remove templates and filters, surely keeping a Pandoc user safer but limiting the scope of what this utility provides? IMO the current model, warn the user of the potential but do not restrict them, should apply. |
The two statements are not at odds with one another, I believe, in the way they may appear to you. Yours targets the broader discussion, mine the narrow concern of interaction between Pandoc and an interactive editor, which was one of the subjects of the comment to which mine was a reply. My comment was prompted by the observation that whatever might be an agreeable user interface for direct invocation of the application by the user is not necessarily, nor likely, the optimal interface between the application and a tool such as an interactive editor, which would have its own user interface, and would invoke the application programmatically in response to requests provided to it directly by the user. In plain terms, how a user talks directly to Pandoc, versus how another application talks to Pandoc, are separate questions, which should be considered separately, before reaching any conclusions about how their resolutions might be alike. The value of such a distinction is not derived from some assumption that use of Pandoc should require extra tooling, only that in some environments, other tools would be implicated for sound reasons. Again, the comment does not negate the broader premise of the discussion, but only addresses one specific remark. |
I'm not sure I see how a security audit being possible for some particular user demographic relates to the theme of the security concerns. Perhaps I'm just missing it.
The concern is serious, but might be a caricature of the core issue. Separation of procedural logic and pure information has been a central topic since the earliest days of computing. If some design assumes that some inputs contain only the latter, and if such assumption is wrong, then the results are unpredictable, even detrimental. Meanwhile, opening all inputs to the possibility of procedural logic produces the liability of no safe location remaining for pure information that is unable to cause harm. This constraint then produces an environment that is fully open, making it convenient for users in the case that all individuals are behaving soundly, and dangerous in the case that even only one is behaving maliciously or even just carelessly. Tendencies that appear paranoid to the end user have a legitimate place in security appraisals, because end users are not the ones that are impelled day and night to calculate ingenious methods of causing misery for others. |
Well, as I understood it, the original point was that by containing Pandoc settings in defaults metadata, we have "potentially" added one layer of abstraction, and that if this was invoked from the CLI it would somehow be more transparent for a user to understand/audit and therefore safer? But this supposes the user can understand what the CLI and its options represent (it certainly isn't intuitive), that the CLI itself isn't abstracted by a calling program (script, editor etc.), and that the abstractions that templates / filters represent are the major threat irrespective of being called from a CLI or being present in the document itself. So why should CLI invocation merit a "user responsibility to security audit themselves" but using metadata merits a "let's hard-limit Pandoc functionality"? |
@iandol: Again, I may be missing a critical piece. Nevertheless, I'm not following a train of connections leading to your recent comments from the earliest comments referring to security. Those comments pointed to the threat from an attack model in which the input document is altered to change or augment the set of files involved in input or output. The following is quoted from above:
The emphasis is not on the demand for the user to perform an audit, but rather the danger to the user owing to the user being unlikely to consider meticulously the security implications of each transaction.
Being the target of a security audit is not essentially the relevant distinction between input files and arguments given directly on the CLI. The distinction is rather who provides the data. Input files are passed between users (or to a user from an attacker posing as a user). Arguments, however, are supplied by the same user whom they affect. |
Sorry if I was not clear, we may be speaking from very different perspectives. My point is that a "user being unlikely to consider meticulously the security implications of each transaction", also applies to (1) someone who does not really understand the CLI or (2) any language a filter or template is written in, not only for the topic of this issue (3) processing instructions in a document or being allowed to reference default.yaml files. A potential safety restriction has been proposed only for the latter (3), not former (1,2) cases. I think it is valid to question why case (3) should be treated differently.
People copy and paste commands from online tutorials or Stack Overflow into the CLI all the time[1]. At what level is a bad actor (who has weaponised the
Or reference a file in metadata: ---
&defaults
lua-filter: myscript
...
…different? This assumes somehow the user will understand the CLI and
Benefits vs. Costs
By specifying the options to build our documents within the document metadata we have a practical solution to build documents that would work for users today. Abstract discussions about logic vs. information purity are interesting, but do not satisfy any practical or working result in the near future[2]. And the security implications/costs are not isolated to the metadata scenario alone. [1] I help quite a few users with their Pandoc workflows and tell them what terminal commands to use; they are trustingly following a recipe, not fully understanding their actions. |
Framing the distinction abstractly may appear boring or obtuse to the end user, but the practical effects to this group of such a discussion occurring in a design context are not insubstantial. Might you agree that processing a file framed as a text document carries a rather different intuitive feel to many users, even if not all, compared to copying commands into the console, in terms of the sense of risk and uncertainty?
First, a small point. The central target of the immediate concern is not a LUA file, but a Markdown one. Imagine finding a file called Consider an analogy. As many may recall, in the years that straddled the change of the millennium, the Windows world was haunted by viruses taking the form of Office macros distributed in files intended to appear benign. Eventually, the design of Office (see reference) was revised to restrict the operation of macros to contexts that the user might consider safe. These revisions forced attackers to adopt either of two strategies. The first was to distribute documents that appeared to unsuspecting users as regular documents, but were in fact executable programs. The other was to utilize an "exploit" vulnerability in the software's protective layer. Over the years, many such exploits have surfaced, and have caused damage before being resolved (see reference). Both strategies had some success, but much more limited than that which was possible when the software design was fully open to attack. Even today remains the legacy of these oversights, putting aside the preventable damage that occurred during the period of complete vulnerability. Plainly, implementing the protections outright would have been a wiser choice. Numerous differences are obvious between the Office and Pandoc cases, which make perhaps the outlook in the latter much more sanguine. Yet losing the broader context in haste toward an immediate payoff remains the serious cautionary consideration.
Yes, I agree, but the difference is less a flaw in the dialog or failure of communication than a natural consequence of the tension between security and convenience. Putting my cards on the table, I am not categorically opposed to the feature on security grounds. (My original comments on the feature were related entirely to other considerations, and even those comments expressed no strict opposition.) Yet I would be reluctant to dismiss summarily the security considerations recently raised, because they do point to a serious dilemma. |
The discussions, while interesting, seem to swerve into very fundamental issues of security, user education, and interface design. I think this would best be moved into another issue, to the mailing list, or maybe even a place like security.SE. Thanks. |
Pandoc doesn't have a way to specify which template file should be used within the markdown front-matter, so I'm using `sed` to see if a template key exists, and if it does, using that as the argument for the `--template` option. Hopefully this will get added in the future. Following Pandoc Issue #5870 to see what happens. jgm/pandoc#5870
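The workaround described here (look for a `template` key in the front matter and pass it to `--template`) can be sketched in Python instead of `sed`. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the front matter is scanned line by line rather than parsed as real YAML, and only a simple top-level `template: value` line is recognised.

```python
import re


def front_matter(text):
    """Return the lines of a leading YAML block, or [] if there is none.

    The block must start with '---' on the first line and end with '---' or '...'.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() in ("---", "..."):
            return lines[1:index]
    return []  # unterminated block: treat as no front matter


def template_from(text):
    """Find a top-level 'template: value' line (naive scan, not a YAML parser)."""
    for line in front_matter(text):
        match = re.match(r"^template:\s*(.+?)\s*$", line)
        if match:
            return match.group(1).strip("'\"")
    return None


def pandoc_args(path, text):
    """Build the argument list, adding --template only when the key is present."""
    args = ["pandoc", path]
    template = template_from(text)
    if template:
        args += ["--template", template]
    return args


doc = "---\ntitle: Test\ntemplate: letter.latex\n...\nMy content here.\n"
print(pandoc_args("source.md", doc))
```

The resulting list could be handed to `subprocess.run`; building a list rather than a shell string avoids quoting problems when the template path contains spaces.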
I have been lurking in this and related threads for a long while. Here's one @brainchild0 writes:
I've read this and the related issues several times but still fear I may have
I found this compelling the first several times I read it. But the user I agree with @iandol and @mb21 and others about the motivations at play
If there's one thing these lengthy discussions have established, it's that For some time before writing today's comment I thought that I should reject |
Not sure whether to engage the new comment, given the earlier request by @tarleb, so I will try to be brief then hopefully let go.
It is addressed, I believe at least adequately, by such an option. Hence the suggestion.
I do submit that such is the ideal, without the intention to demand a completely inflexible adherence to it, as long as the essential concerns are balanced. Use cases and usability certainly capture a range of valid concerns, which I have not intended to dismiss.
Also note that the earlier suggestion of putting the text content inside the defaults file, rather than putting the operational instructions inside a text document, accomplishes the same objective, and though such an approach may seem counterintuitive to those familiar with the historic design of Pandoc, it may also be a more all-around sound approach moving forward. |
I may have missed that comment above. Is the idea something like this?
|
Yes, this one, but expanding the objective to human usability, in light of the sensibilities revealed by ongoing discussion, rather than keeping it constrained to programmatic invocation, as suggested originally. |
This discussion has now spanned several different aspects. But I'm particularly interested in achieving this goal (yes, also for PanWriter):
Do we think this is basically somehow achievable or we think the security risks are fundamentally too great? I can think of two different kinds of security risks when converting an untrusted markdown file:
Regarding threat 1, But securing filters and pdf-engines (like pdflatex) seems like an impossible task (Although, we convert to pdf with latex in a tempdir, so accessing anything outside usually runs into the
I wonder what the RStudio, Zettlr etc. devs think about this use-case though... Regarding threat 2, while it's easy to craft a command that leaks information into the output file, which the attacker would have to trick the user into sending back to him (e.g. |
To my thinking, this discussion now has evolved into a false dilemma between either 1) combining text and instructions into files that look like text files versus 2) completely forbidding such a combination in any single file. Before it was typical for desktop systems to have ZIP-archive software installed, archives were commonly distributed in a self-extracting form. Archives targeted at Windows users would end in the Suppose a debate had occurred for the choice between giving all ZIP archives the extension The current consensus overwhelmingly affirms the value of some method to build a recipe for generating a target document through Pandoc, represented as a self-contained package containing both document text and processing options. The relevant question, to my mind, is a choice between a generalization of text documents to include processing options, or a generalization of defaults files to include text. It strikes me as clear at this stage that the latter is preferable. Let's keep text documents as documents that contain text, and let's develop a definition for a self-contained package of instructions for Pandoc to process conveniently. Such a package may be equally suited for a file storage, IPC operations, or network requests, and would help achieve all of the objectives already identified in this thread without any significant security threat or other serious side effects. |
Sorry for hijacking this issue, but I believe people who commented here might be interested in this project of mine: It does not propose to change the way Pandoc operates, but rather a minimalistic extension of the YAML header, to be interpreted by an external processor which calls Pandoc. The repository includes a description and motivation of the proposal, a Python implementation and a VS Code extension. |
I have read this and a few related issues in search of a way to do something like this: ---
author: Jane Doe
title: Test
use-defaults: latex-letter
...
My content here. A few thoughts regarding what I have read:
|
Given the simple bash script solution (using
Edit: I implemented this solution. Now I find that declaring a Repo link: https://github.com/mboyea/pandoc-scripts |
Coming back to this because of notifications: I have used my own solution Pandoc/Defaults mentioned above for a while, and was quite happy with it. However, in the meantime I discovered Quarto, which is mainly a system for processing computational documents using Pandoc in the background (like R Markdown, but extending it to languages other than R). However, it has extensive support for setting options, including Pandoc's, in the YAML header (or an external YAML file), also for non-computational documents. And it comes with nice HTML templates and a set of themes. For this reason I rarely use Pandoc directly anymore. Pandoc is the amazingly powerful underlying engine, but it is imho easiest to use through Quarto. |
Quarto is a great project, with a useful focus and extended feature set. I do appreciate that you can perform configuration in the YAML header metadata. But it is worth noting that the Quarto project explicitly requires the project folder as the basic unit, not the file. They don't enable something like the pandoc data directory, which stores compilation-related templates/filters in a shared space. You must install e.g. filters for each project folder, then if a filter changes, reinstall for each project folder etc. While there are other advantages to Quarto (beautiful HTML outputs, great cross-referencing, mermaid / etc. and easy install of TinyTeX), Pandoc is still a more flexible tool for ad-hoc writing, whether it is one quick file or some major project of multiple files... (not that this is an either/or, you can easily use both as needed) |
The proposal is to provide a way to do `pandoc --defaults source.md` or `pandoc -d source.md`, and have pandoc read its default options from a particular part of the markdown document, then process the document accordingly. This would handle #4627.
One approach would be to put the defaults in a `defaults_:` field in YAML metadata at the start of the file. Pandoc does not try to parse fields ending in `_`, so this would ensure that the content was only used for defaults. This would look like:
Another possible approach would be to use a YAML anchor to mark out the defaults. This would look like:
On this approach the defaults could be flush left, but you'd need a separate YAML block for them, with the keyword `&defaults`.
On both approaches it would be desirable to set the document as input file, unless `input-files` is specifically specified. This feature could be triggered by the presence of whichever feature we used (above) -- either the special tag or the `defaults_` key. So, the logic for `--defaults` would be: `defaults_` key (or with a tagged block).
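One possible reading of the `defaults_` variant is sketched below in Python. Everything here is an assumption made for illustration: the function name `split_defaults`, the shape of the example document, and the naive indentation-based scan (which handles only flat `key: value` pairs and keeps every value as a string). It is not how Pandoc works or would necessarily work.

```python
def split_defaults(text, source_path):
    """Collect options nested under 'defaults_:' in the leading YAML block.

    If no 'input-files' entry is given, the document itself becomes the input
    file, as the proposal suggests. Naive scan, not a YAML parser.
    """
    lines = text.splitlines()
    defaults = {}
    if lines and lines[0].strip() == "---":
        in_block = False
        for line in lines[1:]:
            if line.strip() in ("---", "..."):
                break  # end of the metadata block
            if line.startswith("defaults_:"):
                in_block = True
                continue
            if in_block and line[:1] in (" ", "\t") and ":" in line:
                key, _, value = line.strip().partition(":")
                defaults[key.strip()] = value.strip()
            else:
                in_block = False  # a non-indented line ends the defaults_ block
    defaults.setdefault("input-files", [source_path])
    return defaults


doc = (
    "---\n"
    "title: Test\n"
    "defaults_:\n"
    "  to: latex\n"
    "  toc: true\n"
    "author: Jane Doe\n"
    "...\n"
    "My content here.\n"
)
print(split_defaults(doc, "source.md"))
```

A real implementation would use a proper YAML parser and would still have to settle the precedence and security questions raised in the comments.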