discuss defaults files #5790
Advantages of using Makefiles:
Disadvantages of Makefiles:
I'd prefer to leave to 'make' the things it does so well, but there may be good reasons to allow users to define "bundles" of pandoc options (template, output format, other command-line options) that can be put either in the working directory or in the user data directory. I think this is close to what you're describing, though perhaps not exactly the same. A simple implementation would involve a file like
Invoke with
It would be easy to implement this, and it wouldn't require YAML. A variant, using YAML, would be something like
but I don't really see too much advantage to this. Disadvantages: another format to learn, more rules to document, and a more complex implementation. (Note however that we currently have ToJSON/FromJSON instances for the Opt structure used to store the options + input and output filenames. So we could already read a JSON file in this format for free, and I believe we could also read the equivalent YAML. But then in addition to documenting the command line options, we'd have to document the format of this structure, and I just don't see a lot to be gained.)
By the way, here's the JSON representation of the default value of the Opt data structure:

```json
{
  "optTabStop": 4,
  "optPreserveTabs": false,
  "optStandalone": false,
  "optReader": null,
  "optWriter": null,
  "optTableOfContents": false,
  "optShiftHeadingLevel": 0,
  "optBaseHeaderLevel": 1,
  "optTemplate": null,
  "optVariables": [],
  "optMetadata": [],
  ...
  "optStripComments": false
}
```
I've changed the JSON encoding. For example:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE OverloadedStrings #-}

import Text.Pandoc.App
import Data.Aeson
import qualified Data.ByteString.Lazy as B
import qualified Data.YAML.Aeson as Y

main :: IO ()
main = do
  inp <- B.getContents
  let defaults = Y.encode1 defaultOpts
  case Y.decode1 (defaults <> inp) of
    Right (foo :: Opt) -> print foo
    Left err -> print err
```
One difficulty is that some of the fields of this option structure expect strings where the corresponding pandoc options take filenames, e.g. highlighting-style or include-in-header.
A great many wonderful benefits and conveniences could be realized through a comprehensive internal solution for managing projects. I understand that such a solution is not currently a priority. Yet I disagree that Pandoc users, even those who never use Windows, currently derive from make files a benefit comparable to what the proposal would offer. I could write exhaustively on this discrepancy, but as I would sooner develop the ideas we share than argue over the ones we do not, let it suffice for now to observe the following:
Looking forward from the present, I feel it would be of great benefit if Pandoc could process a file of this form:
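A hypothetical sketch of such a file, using only fields that appear elsewhere in this discussion (the names and layout are illustrative assumptions, not a settled schema):

```yaml
# Illustrative sketch only; field names are assumptions, not a settled schema.
input-files:
  - body.md
reader: markdown
writer: latex
output-file: result.pdf
options:
  standalone: true
  number-sections: true
  extensions:
    add:
      - empty-paragraphs
```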
In this case, some of the fields appearing under "options" are familiar from how metadata blocks have been used. Others are adapted, similar to your JSON structure, from the command-line arguments. The relaxed layout of a file opens possibilities such as separate fields for items like a list of extensions.

Currently the list of fields that affect a transformation is large and growing. It is mostly infeasible to support all of them on the CLI, as I think you have said yourself, and some at the moment are only supported in metadata fields. If a feature like the above is available, then it can become the preferred way to set these options. I see little need to maintain or to document an equally expansive list of CLI options, as no one, practically speaking, writes a command line with 20 options and then repeats exactly the same one the next day. Complicated, repeated operations should be described in files. CLI options can be limited to a shorter list of common ones, such as column count for wrapping plain text, that would likely be used for one-time operations. For more specific cases, a single CLI option whose value is itself a field and value could be used.

Supporting interpolation of the file's base name would be helpful, as would a variety of other features, including interpolation of metadata fields such as author or title. Such enhancements can be added progressively later without disrupting the immediate design choices. In fact, the earlier-suggested concept of project files might encompass a wide variety of conveniences that could be considered one at a time once the basic functionality is available.

You commented about file names and strings being different types, which I fail to understand, because I think they are the same.
But more generally, YAML and JSON have richer semantics than the CLI, so, separate from the concern of reusing existing functionality as much as possible, I would not see any systemic issue in using files to represent what is currently given on the command line.
Makefiles have no such limitation. I've found it easy to create elegant Makefiles for constructing documents using pandoc. Perhaps if you've been struggling with this, you could post what you're trying to do and someone could help. Yes, Makefile is a syntax that must be learned, but the same would be true of "project files."
Nobody types them out manually like this, no. But commands with 20 options are often used in the context of shell scripts, Makefiles, etc. It's quite handy to be able to do this without providing an external file with the instructions. And that's how nearly all standard unix tools work. I suspect it would be an unpopular suggestion to remove the current configurability of pandoc via command line options. But if this is kept and project files added, and the project files have an ad hoc syntax that is only sort of overlapping with the CLI options, then...it's one more thing to learn, one more thing to get confused about, one more thing to document, one more thing to maintain. This kind of complexity has a cost (especially for maintainers!).
As I said, I'm not against this proposal. I just think the most sensible way to implement it would be as a way to set default values for command-line options. This would probably only be slightly different from what you outlined above. Note that all variables and metadata fields can already be set from the command line.
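The semantics being described here, a file that supplies default values which any explicit command-line option then overrides, can be sketched as follows. This is a minimal Python illustration of the merging behavior, not pandoc's implementation, and the option names are assumptions:

```python
# Sketch of "defaults file" semantics: values from the file are applied
# first; options given explicitly on the command line override them.
# Option names here are hypothetical, not pandoc's actual schema.
defaults_from_file = {
    "standalone": True,
    "number-sections": True,
    "writer": "latex",
}
explicit_cli_options = {"writer": "html"}  # e.g. the user passed a format flag

# Later entries win, so explicit options override file-supplied defaults.
effective = {**defaults_from_file, **explicit_cli_options}
print(effective["writer"])      # html
print(effective["standalone"])  # True
```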
The following changes to
Gmake defaults to running a make file found in the current working directory. The operation of a make file, in the base case, is tied to its location: in this case, separate files for each of four locations, even if everything is the same but for a few file names. This assumption can be overridden, but not cleanly, by boilerplate logic inside the file. In the case of Pandoc, since some options are given as metadata fields, they must either be referenced in separate metadata files or supplied through a sequence of command-line options. If you have some examples that you believe are elegant and easy, then maybe it would be more natural and efficient for you to share those examples, instead of my posting the example that I consider problematic.
Yes and no. Surely it would be absurd in some cases. But a project file that is application-specific should not be considered contrary to Unix design patterns.
I agree with your conclusions, but only given your premise, which differs from mine, about what is being proposed. I do not propose by any measure that "the project files have an ad hoc syntax that is only sort of overlapping with the CLI options". The original analysis intended to demonstrate that currently, whatever the actual history or design, a general Pandoc operation is determined by what appear to be two collections of partially overlapping, ad hoc option sets, namely metadata fields and command-line options. The proposed idea would rather be a well-planned, extensible, flexible option set, expressed in YAML (or JSON) in some way that is clear to understand and to document. Such a solution largely obviates the need for an expansive set of options either in metadata or on the command line, though some support might be maintained for legacy use, and simple operations should of course always be possible directly from the command line.

If you consider the proposal as moving the number of things for users to learn and developers to maintain from two to three, then it seems without any doubt like a bad proposal. But the actual idea is to move it from two to one, the single solution being chosen to meet all the needs not currently met by either individually or by both together, and to meet even more needs besides.
Are you referring to the idea that the YAML structure is defined so that each key in the table is simply one of the CLI options? Having said as much, I don't have any problem with CLI options, when used, overriding or augmenting the values in the new file type. Source and target files, for example, might be easier to keep outside the file in certain cases.
FWIW, I still dream of having a pandoc equivalent to texlua, i.e. to have pandoc act as a Lua interpreter when called as pandoclua. That would be portable, scriptable, and not too difficult to use. But I won't have time to work on that anytime soon.
Yes, that's my idea, and I do understand you had in mind something more comprehensive. However, it's an important goal not to break existing workflows unnecessarily, so I favor a smaller and less breaking change. When I've got it worked out, I think you will see that it achieves most or all of your stated goals quite nicely (including the goal of not putting formatting-related things in metadata). By the way, this approach could also solve the underlying problem in #3732, giving us a way to specify structured template variable values, which is currently impossible on the command line and leads to people using metadata for things that aren't metadata. Also relevant: #4627
Just to be clear, I never suggested the imminent suspension of support for any command-line option, which would break backward compatibility. I only thought that, because the new file type, unlike the CLI options, is of course not currently in active use, the namespace it encapsulates might be filled according to whatever design we see as best suited at this time.
HTMLMathMethod, CiteMethod, ObfuscationMethod, TrackChanges, WrapOption, TopLevelDivision, ReferenceLocation, HTMLSlideVariant. In each case we use lowercase (or hyphenated lowercase) for constructors to line up more closely with command-line option values. This is a breaking change for those who manually decode or encode JSON for these data types (e.g. for ReaderOptions or WriterOptions). See #5790.
And yes, I agree about #3732 and #4627. The latter was mentioned in the original post. A wrapping application such as a CMS would be able to control an entire pandoc invocation by piping in a single dynamically generated data structure encoded in a YAML stream. This interface not only supports larger and more structured data, but also saves the developer the tiresome work of interpolating the values into command-line arguments. Giving a hyphen as the filename, to represent standard input, would offer a platform-independent way to support such use.
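The wrapper scenario above can be sketched like this, in Python with the stdlib json module; the field names follow the YAML examples in this thread, but the exact schema is an assumption:

```python
import json

# A wrapping application (e.g. a CMS) builds the entire set of options as a
# data structure and serializes it in one step, rather than interpolating
# each value into command-line arguments. Field names follow the YAML
# examples in this thread; the exact schema is an assumption.
options = {
    "input-files": ["body.md", "header.yaml"],
    "writer": "latex",
    "output-file": "result.pdf",
    "variables": {"documentclass": "book", "fontsize": "12pt"},
}

# The serialized form would be piped to pandoc's standard input, with "-"
# given as the options-file name to mean stdin.
payload = json.dumps(options)
print(json.loads(payload)["output-file"])  # result.pdf
```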
After reflecting on the concerns you raised earlier about the maintainability and transparency of distinct sets of options, which I may have dismissed too hastily, I offer a suggestion intended to preserve much of the benefit of my earlier thinking while avoiding bloat in the design or implementation. It is slightly more complex than your approach, but once understood, it might, I hope, be seen to strike an appropriate balance of elegance and manageability. Consider the following steps:
Thus, four sets are defined:
The first, at least for a while, will overwhelmingly be the largest set. Thus it represents most of the maintenance cost, which will be the same as before this set of enhancements was considered. The final two have the same function, in different contexts. This overlap adds overhead, though the amount should be acceptable, as both sets are small. The rationale for the separation is that some CLI options have functions that are incoherent as top-level fields, while the top-level fields can be designed from scratch as determined appropriate within the context of a YAML structure.

When a new formatting option is added, it can be designated as simple or structured. Simple options are available on the CLI and in the YAML table, whereas structured ones are available only in the YAML table. The main utility of the structured options is to facilitate more sophisticated functionality that does not express well in primitive values. However, the overall idea is still viable even if they are not included; in that case, there are only three sets. I would suggest that it is feasible to document all four sets, and to explain their use, as well as to maintain them in the codebase. As no regression would occur, no immediate update to documentation is needed. The sixth post in this series shows an example of how the YAML block would appear with this design.
I've made some progress. This now works.
where the defaults file contains:

```yaml
input-files:
  - body.md
  - header.yaml
reader: markdown
output-file: result.pdf
writer: latex
number-sections: true
standalone: true
variables:
  documentclass: book
  classoptions:
    - oneside
  fontsize: 12pt
  fontfamily: times
  geometry: 'margin=1in'
  headerincludes: |
    \usepackage{indentfirst}
```
I like the suggestion of adding unknown fields at the top level to
It looks nice, and is a big step forward.
Not sure what "unknown fields" means here. The suggestion I intended to convey was putting fields such as Originally I thought that
A few fine details to consider:
As I think I said, I agreed that project wouldn't be an accurate word for what is happening now, but I also think that defaults is not the best choice. It sounds like a set of user preferences, rather than an item associated with a particular document or group of documents.
Might want to change the CWD to the directory containing the file, if it is a real file.
Agree with the idea for a search path. But an extension also helps simplify the command line:
You wouldn't have to require the extension, just know what to do when you see it. You can still keep the verbose syntax for arbitrary names.
I wouldn't want to do this by default, but we could consider adding a variable to allow you to specify paths relative to that directory. But note, there can be multiple input files, and they may live in different directories. All in all, I think this kind of thing is better left to scripts and tools like make.
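The two resolution policies under discussion can be sketched as follows; this is an illustrative Python sketch, and the function names are hypothetical, not pandoc's API:

```python
from pathlib import Path

# Two policies for resolving a relative path named in a defaults file.
# Both functions are hypothetical illustrations, not pandoc's API.

def resolve_against_cwd(relpath: str) -> Path:
    # Current behavior under discussion: relative to wherever pandoc runs.
    return Path.cwd() / relpath

def resolve_against_defaults_file(defaults_file: str, relpath: str) -> Path:
    # Proposed alternative: relative to the defaults file's own directory,
    # so the file behaves the same regardless of the caller's CWD.
    return Path(defaults_file).parent / relpath

p = resolve_against_defaults_file("/home/user/.pandoc/letter.yaml", "logo.png")
print(p)  # /home/user/.pandoc/logo.png
```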
I see. So the suggestion concerns the default treatment of a file bearing the designated extension. Currently the search procedure is as follows:
I'd also be interested in suggestions about the option name.
But that's exactly what it is: a set of user preferences for the values you can normally specify on the command line. These preferences can include an input and output file, in which case you have "an item associated with a particular document," but they need not. Other possibilities:
Note also that
I was referring to the directory of the YAML file, not any document source file. Setting the CWD to this location ensures consistent behavior, creating more convenience for experienced users and fewer surprises for inexperienced ones. My idea is to move incrementally toward a model of "it just works".

I agree that adding some kind of template substitution for a variety of variables would eventually be useful. I just don't think that relying on the author of the YAML file to use the prefix consistently is the best way to give users easy and unobstructed access to reproducible behavior. If the prefix is used in certain file references but not all, and the CWD is not changed, a user might test the file without realizing that the observed success was simply due to the luck of having a particular CWD. Meanwhile, if the CWD is not changed, even the diligent user is burdened with testing that the file works equally well in any CWD. Ideally the objective is to look for ways that users can test once, run anywhere.
My particular thinking right now leads me to suggest the following procedure for resolving the file:
Adding to the above would be a preemptive check in the input file names for one with the particular extension. There are two options:
Also, using a particular extension allows a file's type to be identified directly from a directory listing, and facilitates integration with shells and editors. If a desktop environment adopts a wrapper, then users could run the file simply by clicking it in their shell, and editors can scan the local directory for files to run.
I suppose it can be used as such, but it generalizes well to many other contexts. Most options in a transformation engine are not determined by what a user prefers as a matter of personal taste universal to all contexts, but rather by what is appropriate for a document, in many cases determined in part by other members of a group. The choice of key bindings in an editor, or of background color on a desktop, is quite personal, and users want complete consistency across sessions, usually without exception. Choices such as how many columns to use, which output format, which font size, and so on, might be ones that certain users repeat in some cases according to taste, but they are largely determined by the broader circumstances surrounding the document itself. A user who thinks that books and letters should have identical formatting is one whose judgment is questionable.

Moreover, once a transformation is defined to behave in a consistent way, sharing the definition becomes a compelling possibility. Naturally, the same file can be used repeatedly for different documents of the same type; this case, as you indicate, is handled by giving the file names on the command line rather than within the file. There are three general cases, which might see varied levels of use:
To me, the label defaults does not capture the full breadth of generality and potential. As I said before, I don't currently have a name that wildly excites me, but action seems to capture the concept to some degree. Another possibility is recipe, but again, I have yet to discover a term I really like. Maybe with time nicer ideas will emerge, especially in a broader discussion.
By the way, I tried using this feature, and was generally quite happy with how much easier it made managing documents. At one point, commit ff1df24, I noticed that
@brainchild0 The
Yes, they are used for the CLI arguments, which often use informal semantics. I think that in the context of the file, nouns (and verbs) are nicer than prepositions. As discussed, it is infeasible and unnecessary to maintain two separate sets of names for all the formatting options, but for the few top-level fields it creates a nice effect to choose names that are appropriately descriptive. Remember there is no CLI argument for input files.

But this reminds me of an earlier comment, representing a less petty issue, about extensions perhaps being expressed separately from the reader, using the structured features of YAML:

```yaml
extensions:
  reader:
    add:
      - empty-paragraphs
    drop:
      - raw_html
```

Could also consider the following:

```yaml
reader:
  name: markdown
  add-ext:
    - empty-paragraphs
  drop-ext:
    - raw_html
```

Then the shorthand could be:

```yaml
reader: markdown+empty-paragraphs-raw_html
```

But the former is easier to edit in a file and to build programmatically. Also possible, in case of not wanting any extensions to be implicitly added:

```yaml
reader:
  name: markdown
  use-only-ext:
    - empty-paragraphs
```

Or:

```yaml
reader:
  name: markdown
  drop-def-ext: yes
  add-ext:
    - empty-paragraphs
```
I see. That's fine.
In most cases probably. But: let's say you have a defaults file
Are you saying that the YAML file might represent some input file that has the same base name for all invocations but resides in some project-specific directory, so its path needs to be resolved per invocation? This case would be an advanced one, eventually good to consider, but in my view only after resolving the fundamentals.
Yes, exactly. In my user data directory I have a defaults file
In my project directory I have a file letter.md |
Oh, and we don't necessarily need the same base name. (Actually, I think this should probably work already. I'll need to test later...)
This is already implemented.
I'm open to concrete suggestions, but I haven't heard one yet that seems better to me than 'defaults'.
Does the (existing) ability to set
Personally I think this kind of thing is a job for scripts or Makefiles. I'd rather give this
@denismaier - sorry, I deleted your comment by accident. |
Oh. No worries. My comment was just that I am fine with a simple version of the defaults option. |
Ok, good, it seems the current implementation mostly works correctly. I find a few fringe cases with
I don't feel that
Maybe you could give an example where it's not enough? |
Perhaps I should first see that I understand what use you propose. Suppose a user has files. What is the best way to use
OK,
I've made a small change. Now you can specify With options like
Another change: you can specify multiple files:
Awesome. How can I test this?
Compile from source, or get a nightly after tonight (see under the Actions tab).
I'd be curious to understand how close to this scenario
Right. I'm not sure what to say that hasn't already been said. All I can think is that it seems in this example we have two possibilities:
Since file locations are clearly important, the latter seems to be a huge benefit to the user, even at the cost of a small escalation in code size in the application. Edit: Setting aside my feeling that the proposed functionality is more appropriate outside of Pandoc, what reasons, if any, do you see for preferring the working directory for relative paths? Is the main issue the implementation and maintenance of resolving paths relative to the file location?
Commas can appear in file names, so I would avoid these nonstandard semantics. |
Well, that's true; maybe I'll revert this for now.
For one thing, the defaults file might come from a central location (like the user data directory); we certainly wouldn't want to interpret paths relative to it in this case. Even when it's not coming from the user data directory, someone might put the default files in a common place.
I'm not sure that it would be so terrible even in the case of the file being stashed in some data directory. For example, if some options set for generating a company letter referred to an image file representing a corporate logo, then naturally the image would be near the options file. But in any case where the options files are kept separate from the document source, isn't the normal use that the source files would be provided outside those files? It seems the issue reduces to the question of in which cases paths ought to be resolved relative to the current working directory.
Although there are some additional changes we may want to consider further at a later date, I'm content with the current state of this feature for 2.8, so I'm going to close this issue. |
Introduction
I have used Pandoc for several months, and have read many users’ successes and frustrations in the issue tracker, the discussion group, external articles, and documentation provided in external templates and tools. In this experience, I identify a recurring problem.
What is the best way for a user to create a workflow of the following form?
Indeed, topics such as Issues #4627 and #5584, the emergence of Panzer and Pandocomatic, reports of improvised solutions based on shell scripts and make files, and a myriad of circulating questions and suggestions, all provide evidence that much of the community seeks a level of automation, reproducibility, and separation of concerns greater than what Pandoc natively provides in its current form.
While (1), above, has long been the central function of Pandoc, the need among users, and the lack of compelling options, for (2), (3), and especially (4), beg for a review of the current status and for proposals for a solution.
Background
Pandoc currently provides moderate automation and reproducibility through the use of metadata blocks, which can occur in multiple input files listed in a single command, and which can include fields that affect the transformation of input text into output documents. Using multiple input files, the user is able to enforce a separation of concerns, formatting versus text, by placing formatting-related metadata in auxiliary files, without affecting files containing textual source. The user is also free to merge both concerns into the same source files. In either case, the fields in the metadata blocks will be processed with the same effect for recurring invocations as long as no changes are made to those fields, to the list of source files, or to any command-line options that would interfere.
Automation and reproducibility for multiple output documents from a common textual source finds more limited support within Pandoc compared to the base case of a single target document. Metadata fields relating to different output format types can occur in the same block, or group of blocks, in some sequence of input files, and any particular output format type can be selected per invocation through command-line options. One weakness in this basic approach is that any metadata field that affects multiple output format types cannot have a different value applied to output for each type. A more glaring vulnerability is the impossibility of applying different field values to distinct output targets of the same output format type. For cases involving either problem, it becomes necessary to create a distinct file, with a metadata block, for each output target, and to specify the desired one through the command line per invocation, along with the corresponding options for output format type. Such a solution is viable in many cases, but not optimally convenient or transparent.
Analysis
Many options affecting a Pandoc transformation are possible on the command line but cannot currently be provided in metadata fields. The inadequacy of this restriction with respect to automation and reproducibility has been noticed. Issue #4627 proposes that options currently limited to the command line simply be made available as metadata fields. But as discussion within reveals, because many of the command-line options relate to which files are read and how they are processed, the effect of those options within an input file would be unclear, and would certainly not be equivalent to their effect as currently defined for the command line. In principle, many options currently limited to the command line also could be processed in metadata fields, but metadata alone cannot entirely determine a Pandoc transformation of some textual source into an output document.
Any attempt to offer ever-increasing control of a Pandoc transformation through metadata fields would eventually be stymied by an insoluble category error. The current design encapsulates a tension, because of a poorly defined boundary, between that which controls the processing of metadata blocks and that which is controlled by them. As a complete description of a Pandoc transformation includes options that affect processing of source files, such a description can never be limited merely to the contents of source files. And yet, in the current design, aside from source files, the command line provides the only source of data that determines an operation. In the general case, a Pandoc invocation must include some options on the command line beyond only the source files, output file, and format types. Inevitably, an automatic and reproducible process is only possible through an external wrapper, such as a shell script or make file, that can provide command-line options to Pandoc.
Reliance on metadata fields to describe a transformation places strain on a clean separation of concerns, and is impossible to realize to the full exclusion of options given through other means. Meanwhile, reliance on command-line options, except in certain very simple cases, depends on an external wrapper to achieve adequate automation and reproducibility.
Alternatives
For a recurring operation, a static shell script may be sufficiently simple for many users and many cases, but lacking flexibility, portability, and transparency, ultimately is difficult for the user to manage cleanly.
Some external projects have emerged providing wrappers that augment functionality from Pandoc with respect to automation and reproducibility. These projects include Panzer, which has ceased development, and Pandocomatic. Yet neither offers the breadth of flexibility or depth of capabilities that seem demanded. Indeed, more users appear to be using Pandoc in conjunction with make files than to be using either of the above utilities.
Make files are designed to automate and optimize operations involving multiple interdependent tasks, particularly the build processes of software projects. The main benefit generally is to automate the discovery of steps in an operation that can be skipped in a particular invocation, as when some intermediate results were already written to files more recently than their dependencies were last changed. While the features of make files can be understood without full knowledge of software development, effective use is largely limited to engineers and other software specialists. Meanwhile, as the structure of make files is generally best suited to the patterns of building software projects, make files supporting Pandoc operations will generally have either high complexity or limited reusability. Even in the best case, the maintainer of a make file for a Pandoc project must construct a command for each output by improvising a sequence of command-line options and metadata source files that unite to achieve a desired result.
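The pattern described, one improvised pandoc command per output target, might look like the following minimal Makefile sketch (the file names and option choices are hypothetical):

```make
# Hypothetical minimal Makefile wrapping pandoc: one hand-built command
# per output target, combining CLI options and metadata source files.
report.pdf: report.md meta-pdf.yaml
	pandoc report.md meta-pdf.yaml --pdf-engine=xelatex -o report.pdf

report.html: report.md meta-html.yaml
	pandoc report.md meta-html.yaml --standalone --toc -o report.html

.PHONY: all
all: report.pdf report.html
```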
In a very general assessment, make files might appear well suited to wrap one or more invocations of Pandoc into a general operation, since a make file represents an automatic and reproducible process that uses a set of source files to generate a set of output files. Assessment below the surface, unfortunately, reveals that such suitability is superficial if not illusory.
Broadly, the benefit currently realized by Pandoc users as a whole from make files or other wrapper tools, at least among solutions that have appeared to date in discussions, is quite limited. In contrast, a design tightly coupled to the features of Pandoc could support basic and advanced use cases with high flexibility and transparency.
Concept
Support for project files in Pandoc is proposed to resolve the difficulties under current review.
Succinctly, a project file would be construed as a plan of activity, represented in a file, for an invocation of Pandoc, able to include in one place a variety of different kinds of information currently given on the command line or in metadata fields. The object is to automate, by fully expressing in a clean structure, the complete set of actions important to the user with respect to one or more files comprising a source document. Such a set of actions would be able to include multiple output documents to be generated by a single invocation of Pandoc. A source document, meanwhile, rather than including details related to formatting or output type, would be needed only for its textual content, and possible document metadata, such as title, author, and citations.
A minimal set of features for project files might be as follows:
To support the description of various actions, or output targets, each including a particular combination of options, source files, filters, templates, output format type, and output file name.
To allow all such actions to be performed together by a trivial invocation:
To allow a subset of such actions to be performed together, by only a slightly less trivial invocation, one naming the actions to be performed.
To support various cases, in which source files are given:
To use files structured in a meta-format familiar to Pandoc users, such as JSON, or, as many may prefer, YAML. Specifying a particular file extension, such as `.pd`, not simply using one for the meta-format, such as `.yaml`, would simplify invocation on the command line, and integration with other environments, such as desktops and editors.

Ultimately, the most compelling case may be one that looks similar to the following:
In the above examples, each invocation refers to two files. One is the textual source, which varies for each invocation, and the other a project file, encapsulating the summation of all choices, except the textual source, relating to the larger ecosystem of options that determine Pandoc’s behavior. The user is saying, as though to a human assistant, “And, oh, Miss Pandoc, take a letter, and prepare it in the usual way”.
Remarks
Project files solve the problem of where to place the data Pandoc uses to determine how to produce output in a particular context. In the past, discussion focused on where among the available locations to place existing and new options, without addressing the broader question of whether additional locations might be made available.
The current tension over the division and overlap in the options given through metadata fields or the command line is elegantly resolved by introducing project files, where any option fits naturally and freely. Use of a project file allows the command line to be clear but for a few parameters the user finds useful in controlling each invocation, and allows the source files to be clear but for the textual content entering into processing.
With the project file nimbly carrying the weight from concerns that previously added burdens elsewhere, the user experience becomes portable, predictable, clean, direct, dynamic, and customized.
Whether the appetite exists for developing a feature of this type, I regard as useful to consider. Even if realization is infeasible or undesirable given current goals and resources, the ideas and distinctions exposed during this conversation will hopefully lead to the discovery of useful solutions within current limits.