
discuss defaults files #5790

Closed
brainchild0 opened this issue Oct 5, 2019 · 66 comments

@brainchild0

brainchild0 commented Oct 5, 2019

Introduction

I have used Pandoc for several months, and have read many users’ successes and frustrations in the issue tracker, the discussion group, external articles, and documentation provided in external templates and tools. In this experience, I identify a recurring problem.

What is the best way for a user to create a workflow of the following form?

  1. To take input text and options, and then to create one or more output documents of some types, then
  2. To make changes to the original input text and options, then
  3. To repeat (1) exactly as before except for the changes from (2), then
  4. To continue indefinite iterations of (2) and (3).

Indeed, topics such as Issues #4627 and #5584, the emergence of Panzer and Pandocomatic, reports of improvised solutions based on shell scripts and make files, and a myriad of circulating questions and suggestions, all provide evidence that much of the community seeks a level of automation, reproducibility, and separation of concerns greater than what Pandoc natively provides in its current form.

While (1), above, has long been the central function of Pandoc, the need among users for (2), (3), and especially (4), together with the lack of compelling options, begs for a review of the current status and for proposals toward a solution.

Background

Pandoc currently provides moderate automation and reproducibility through the use of metadata blocks, which can occur in multiple input files listed in a single command, and which can include fields that affect the transformation of input text into output documents. Using multiple input files, the user is able to enforce a separation of concerns, formatting versus text, by placing formatting-related metadata in auxiliary files, without affecting files containing textual source. The user is also free to merge both concerns into the same source files. In either case, the fields in the metadata blocks will be processed with the same effect for recurring invocations as long as no changes are made to those fields, to the list of source files, or to any command-line options that would interfere.

Automation and reproducibility for multiple output documents from a common textual source find more limited support within Pandoc than the base case of a single target document. Metadata fields relating to different output format types can occur in the same block, or group of blocks, in some sequence of input files, and any particular output format type can be selected per invocation through command-line options. One weakness in this basic approach is that any metadata field that affects multiple output format types cannot have a different value applied to the output for each type. A more glaring deficiency is the impossibility of applying different field values to distinct output targets of the same output format type. For cases involving either problem, it becomes necessary to create a distinct file, with a metadata block, for each output target, and to specify the desired one on the command line per invocation, along with the corresponding options for the output format type. Such a solution is viable in many cases, but not optimally convenient or transparent.
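As an illustration of the workaround just described, a user might maintain one metadata file per output target, each containing a delimited YAML metadata block, and select the appropriate pair per invocation. All file names here are hypothetical:

```
pandoc meta-print.yaml body.md -o book.pdf
pandoc meta-web.yaml body.md -t html -s -o book.html
```

Each invocation must repeat the combination of metadata file, format options, and output name by hand.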

Analysis

Many options affecting a Pandoc transformation are available on the command line but cannot currently be provided in metadata fields. The inadequacy of this restriction with respect to automation and reproducibility has been noticed. Issue #4627 proposes that options currently limited to the command line simply be made available as metadata fields. But as the discussion there reveals, because many of the command-line options determine which files are read and how they are processed, the effect of those options within an input file would be unclear, and would certainly not be equivalent to their effect as currently defined for the command line. In principle, many options currently limited to the command line could also be processed in metadata fields, but metadata alone cannot entirely determine a Pandoc transformation of some textual source into an output document.

Any attempt to offer ever-increasing control of a Pandoc transformation through metadata fields would eventually be stymied by an insoluble category error. The current design encapsulates a tension, owing to a poorly defined boundary between that which controls the processing of metadata blocks and that which is controlled by them. As a complete description of a Pandoc transformation includes options that affect the processing of source files, such a description can never be limited merely to the contents of source files. And yet, in the current design, aside from source files, the command line provides the only source of data that determines an operation. In the general case, a Pandoc invocation must include some options on the command line beyond only the source files, output file, and format types. Inevitably, an automatic and reproducible process is only possible through an external wrapper, such as a shell script or make file, that can supply command-line options to Pandoc.

Reliance on metadata fields to describe a transformation places strain on a clean separation of concerns, and is impossible to realize to the full exclusion of options given through other means. Meanwhile, reliance on command-line options, except in certain very simple cases, depends on an external wrapper to achieve adequate automation and reproducibility.

Alternatives

For a recurring operation, a static shell script may be sufficiently simple for many users and many cases, but, lacking flexibility, portability, and transparency, it is ultimately difficult for the user to manage cleanly.

Some external projects have emerged providing wrappers that augment functionality from Pandoc with respect to automation and reproducibility. These projects include Panzer, which has ceased development, and Pandocomatic. Yet neither offers the breadth of flexibility or depth of capabilities that seem demanded. Indeed, more users appear to be using Pandoc in conjunction with make files than to be using either of the above utilities.

Make files are designed to automate and optimize operations involving multiple interdependent tasks, particularly the build processes of software projects. The main benefit is generally automatic discovery of the steps in an operation that can be skipped in a particular invocation, as when intermediate results were written to files more recently than their dependencies were last changed. While the features of make files can be understood without full knowledge of software development, effective use is largely limited to engineers and other software specialists. Meanwhile, as the structure of make files is generally best suited to the patterns of building software projects, make files supporting Pandoc operations will generally have either high complexity or limited reusability. Even in the best case, the maintainer of a make file for a Pandoc project must construct a command for each output by improvising a sequence of command-line options and metadata source files that together achieve the desired result.
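A sketch of the kind of make file rule under discussion, with invented file names and option combinations, shows how each target must hand-assemble its own mix of metadata files and command-line options:

```makefile
# Hypothetical rules; every target repeats its own improvised option mix.
book.pdf: body.md meta-common.yaml meta-pdf.yaml
	pandoc meta-common.yaml meta-pdf.yaml body.md --toc -N -o $@

book.html: body.md meta-common.yaml meta-html.yaml
	pandoc meta-common.yaml meta-html.yaml body.md -t html -s -o $@
```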

In a very general assessment, make files might appear well suited to wrap one or more invocations of Pandoc into a general operation, since a make file represents an automatic and reproducible process that uses a set of source files to generate a set of output files. Assessment below the surface, unfortunately, reveals that such suitability is superficial if not illusory.

Broadly, the benefit currently realized by Pandoc users as a whole from make files or other wrapper tools, at least among solutions that have appeared to date in discussions, is quite limited. In contrast, a design tightly coupled to the features of Pandoc could support basic and advanced use cases with high flexibility and transparency.

Concept

Support for project files in Pandoc is proposed to resolve the difficulties under current review.

Succinctly, a project file would be construed as a plan of activity, represented in a file, for an invocation of Pandoc, able to include in one place a variety of different kinds of information currently given on the command line or in metadata fields. The object is to automate, by fully expressing in a clean structure, the complete set of actions important to the user with respect to one or more files comprising a source document. Such a set of actions would be able to include multiple output documents to be generated by a single invocation of Pandoc. A source document, meanwhile, rather than including details related to formatting or output type, would be needed only for its textual content, and possible document metadata, such as title, author, and citations.

A minimal set of features for project files might be as follows:

  • To support the description of various actions, or output targets, each including a particular combination of options, source files, filters, templates, output format type, and output file name.

  • To allow all such actions to be performed together by a trivial invocation:

    pandoc <project_file>
    
  • To allow a subset of such actions to be performed together, by only a slightly less trivial invocation, one naming the actions to be performed.

  • To support various cases, in which source files are given:

    • within the project file.
    • on the command line.
    • in another project file.
  • To use files structured in a meta-format familiar to Pandoc users, such as JSON, or, as many may prefer, YAML. Specifying a particular file extension, such as .pd, not simply using one for the meta-format, such as .yaml, would simplify invocation on the command line, and integration with other environments, such as desktops and editors.
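To make the feature list concrete, a project file might look roughly as follows. This is a sketch only; no such format exists in Pandoc, and every key name here is invented:

```yaml
# letter.pd -- hypothetical project file; all keys are invented
sources:
  - header.yaml
  - letter.md
targets:
  print:
    to: latex
    output: letter.pdf
    template: letter
  web:
    to: html
    output: letter.html
```

Under such a scheme, `pandoc letter.pd` would perform every target, and an invocation naming `print` would perform only that one.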

Ultimately, the most compelling case may be one that looks similar to the following:

pandoc letter.pd text=letter1.md
pandoc letter.pd text=letter2.md

In the above examples, each invocation refers to two files. One is the textual source, which varies for each invocation, and the other a project file, encapsulating the summation of all choices, except the textual source, relating to the larger ecosystem of options that determine Pandoc’s behavior. The user is saying, as though to a human assistant, “And, oh, Miss Pandoc, take a letter, and prepare it in the usual way”.

Remarks

Project files solve the problem of where to place the data Pandoc uses to determine how to produce output in a particular context. In the past, discussion focused on where among the available locations to place existing and new options, without addressing the broader question of whether additional locations might be made available.

The current tension over the division and overlap in the options given through metadata fields or the command line is elegantly resolved by introducing project files, where any option fits naturally and freely. Use of a project file allows the command line to be clear but for a few parameters the user finds useful in controlling each invocation, and allows the source files to be clear but for the textual content entering into processing.

With the project file nimbly carrying the weight from concerns that previously added burdens elsewhere, the user experience becomes portable, predictable, clean, direct, dynamic, and customized.

I regard it useful to consider whether the appetite exists for developing a feature of this type. Even if realization is infeasible or undesirable given current goals and resources, the ideas and distinctions exposed during this conversation may lead to the discovery of useful solutions within current limits.

@jgm jgm added the enhancement label Oct 5, 2019
@jgm
Owner

jgm commented Oct 6, 2019

Advantages of using Makefiles:

  1. If you have several build targets in the same directory, you can specify different options for each, and you don't have to remember which project file to use for which
  2. Make can tell what needs to be rebuilt and won't do unnecessary work
  3. Make can do complex multi-step builds, including components that are built with other programs than pandoc (graphviz, ...).
  4. We don't need to invent another ad hoc syntax

Disadvantages of makefiles:

  1. Not available by default on Windows.

I'd prefer to leave to 'make' the things it does so well, but there may be good reasons to allow users to define "bundles" of pandoc options (template, output format, other command-line options) that can be put either in the working directory or in the user data directory. I think this is close to what you're describing, though perhaps not exactly the same.

A simple implementation would involve a file like letter.pandoc:

--template letter -Vfontsize=11pt -f markdown+emoji

Invoke with pandoc --defaults letter myletter.md -o myletter.pdf
Perhaps it would be worth allowing something like

-o %.pdf

where % is calculated from the first input filename, so you don't have to specify the output file explicitly and could do pandoc --defaults letter myletter.md.
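The `%` substitution described above could be implemented along these lines. This is a sketch, not pandoc source, and the function name is invented:

```python
from pathlib import Path

def resolve_output(pattern: str, first_input: str) -> str:
    """Replace '%' in an output pattern with the stem of the first input file."""
    return pattern.replace("%", Path(first_input).stem)

# resolve_output("%.pdf", "myletter.md") yields "myletter.pdf"
```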

It would be easy to implement this and it wouldn't require YAML. A variant, using YAML, would be something like

template: letter
variables:
  fontsize: 11pt
to: markdown+emoji
output: "%.pdf"

but I don't really see too much advantage to this. Disadvantages: another format to learn, more rules to document, and a more complex implementation. (Note however that we currently have ToJSON/FromJSON instances for the Opt structure used to store the options + input and output filenames. So we could already read a JSON file in this format for free, and I believe we could also read the equivalent YAML. But then in addition to documenting the command line options, we'd have to document the format of this structure, and I just don't see a lot to be gained.)

@jgm
Owner

jgm commented Oct 6, 2019

By the way, here's the JSON representation of the default value of the Opt data structure. We could change this fairly easily to remove the opt prefix and use snake style rather than camel case.

{
  "optTabStop": 4,
  "optPreserveTabs": false,
  "optStandalone": false,
  "optReader": null,
  "optWriter": null,
  "optTableOfContents": false,
  "optShiftHeadingLevel": 0,
  "optBaseHeaderLevel": 1,
  "optTemplate": null,
  "optVariables": [],
  "optMetadata": [],
  ...
  "optStripComments": false
}

@jgm
Owner

jgm commented Oct 7, 2019

I've changed the JSON encoding to use e.g. strip-comments instead of optStripComments.
As a note to self, here's a little program that reads YAML and emits pandoc options (with the defaults):

{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.App
import Data.Aeson
import qualified Data.ByteString.Lazy as B
import qualified Data.YAML.Aeson as Y

main = do
  inp <- B.getContents
  let defaults = Y.encode1 defaultOpts
  case Y.decode1 (defaults <> inp) of
    Right (foo :: Opt) -> print foo
    Left err -> print err

@jgm
Owner

jgm commented Oct 7, 2019

One difficulty is that some of the fields of this option structure expect strings where the corresponding pandoc options take filenames, e.g. highlighting-style or include-in-header.

@brainchild0
Author

brainchild0 commented Oct 7, 2019

A great many wonderful benefits and conveniences could be realized through a comprehensive internal solution for managing projects. I understand that such a solution is not currently a priority.

Yet I disagree that Pandoc users, even ones who never use Windows, currently extract from make files a benefit comparable to what they might extract from the proposal.

I could write exhaustively on this discrepancy, but as I would sooner develop the ideas we share than argue over the ones we do not, let it suffice for now to observe the following:

  1. Although I have used make files for 20 years, accomplishing my objective of creating a group of files that build a set of specific targets from an arbitrary Markdown source took considerable effort, will be cumbersome to maintain moving forward, and does not perform all of the tasks that I would ideally want. (I would say that one of the files I created is a make file, but that is not strictly accurate, because it is rather a shell script that invokes a make file embedded as a here-document, to circumvent the limitation in make that the command is usually run from the same directory that contains the make file.)
  2. Of those who write books and articles, and who might benefit from the functionality of Pandoc, most have not learned to use a make tool and never will.

Looking forward from the present, I feel it would be of great benefit if Pandoc could process a file of this form:

infiles:
  - header.yaml
  - body.md
intype: markdown
outfile: result.pdf
outtype: latex
options:
  number-sections: yes
  standalone: yes
  documentclass: book
  classoptions:
    - oneside
  headerincludes: |
    \usepackage{indentfirst}

In this case, some of the fields appearing under "options" are familiar from how metadata blocks have been used. Others are adapted, similar to your JSON structure, from the command-line arguments. The relaxed layout of a file opens possibilities such as separate fields for items such as a list of extensions, in comparison to the markdown+emoji semantics used in your above example.

Currently the list of fields that affect a transformation is large and growing. It is mostly infeasible to support all of them on the CLI, as I think you have said yourself. And indeed, some at the moment are only supported in metadata fields. If a feature like the above is available, then it can become the preferred way to set these options. I see little need to maintain or to document an equally expansive list of CLI options, as no one, practically speaking, writes a command line with 20 options and then repeats exactly the same one the next day. Complicated, repeated operations should be described in files. CLI options can be limited to a shorter list of common ones, such as the column count for wrapping plain text, that would likely be used for one-time operations. For more specific cases, a single CLI option whose value is itself a field-and-value pair can be used, much as --metadata is currently used.

Supporting interpolation of the file's base name would be helpful, as would a variety of other features, including interpolation by metadata fields such as author or title. Such enhancements can be added progressively later without disrupting imminent design choices. In fact, the earlier-suggested concept of project files might encompass a wide variety of conveniences that could be considered one at a time once the basic functionality is available.

You commented about file names and strings being different types, which I fail to understand, because I think they are the same. But more generally, YAML and JSON have richer semantics than the CLI, so, separate from the concern of reusing existing functionality as much as possible, I would not find any systemic issue in using files to represent what currently is command line.

@jgm
Owner

jgm commented Oct 7, 2019

to circumvent the limitation in make that the command is usually run from the same directory that contains the make file

Makefiles have no such limitation. I've found it easy to create elegant Makefiles for constructing documents using pandoc. Perhaps if you've been struggling with this, you could post what you're trying to do and someone could help. Yes, Makefile is a syntax that must be learned, but the same would be true of "project files."

I see little need to maintain or to document an equally expansive list of CLI options, as no one, practically speaking, writes a command line with 20 options and then repeats exactly the same the next day.

Nobody types them out manually like this, no. But commands with 20 options are often used in the context of shell scripts, Makefiles, etc. It's quite handy to be able to do this without providing an external file with the instructions. And that's how nearly all standard unix tools work. I suspect it would be an unpopular suggestion to remove the current configurability of pandoc via command line options. But if this is kept and project files added, and the project files have an ad hoc syntax that is only sort of overlapping with the CLI options, then...it's one more thing to learn, one more thing to get confused about, one more thing to document, one more thing to maintain. This kind of complexity has a cost (especially for maintainers!).

@jgm
Owner

jgm commented Oct 7, 2019

As I said, I'm not against this proposal. I just think the most sensible way to implement it would be as a way to set default values for command-line options.

This would probably only be slightly different from what you outlined above. Note that all variables and metadata fields can already be set from the command line.

@jgm
Owner

jgm commented Oct 7, 2019

The following changes to Opt would be good to do first, if we provided a way to read in defaults for options in YAML or JSON form:

  • optHighlightingStyle - make this a plain string instead of a Style, and load the Style after option parsing.
  • manual FromJSON instances for LineEnding and the other enumerated types used for values in Opt, so that people can write eol: crlf rather than eol: CRLF and top-level-division: section rather than top-level-division: TopLevelSection.
    • LineEnding
    • TopLevelDivision
    • HTMLMathMethod
    • Verbosity
    • ReferenceLocation
    • WrapOption
    • ObfuscationMethod
    • CiteMethod
  • make optVariables and writerVariables a Context Text, so they can take structured values.
  • make optMetadata a Meta or MetaMap, so it can take structured values.
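A manual instance of the kind listed above might look roughly like this. This is a sketch, not pandoc source; it assumes the LineEnding type has constructors LF, CRLF, and Native:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch of a manual FromJSON instance accepting the lowercase,
-- command-line-style spellings (eol: crlf rather than eol: CRLF).
instance FromJSON LineEnding where
  parseJSON = withText "LineEnding" $ \t ->
    case t of
      "crlf"   -> pure CRLF
      "lf"     -> pure LF
      "native" -> pure Native
      _        -> fail "expected crlf, lf, or native"
```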

@brainchild0
Author

brainchild0 commented Oct 7, 2019

Makefiles have no such limitation. I've found it easy to create elegant Makefiles for constructing documents using pandoc. Perhaps if you've been struggling with this, you could post what you're trying to do and someone could help. Yes, Makefile is a syntax that must be learned, but the same would be true of "project files."

GNU Make defaults to running a make file found in the current working directory. Options such as -C and -f override this behavior, in part. Operations are still launched from the directory containing the make file. Make files generally do not "elegantly" handle the idiom of "take a bunch of inputs in location A and write outputs also in location A, and then do the same for inputs in locations B, C, and D".

Instead, the operation of a make file, in the base case, is tied to its location, in this case, separate files for each of four locations, even if everything is the same but for a few file names. This assumption can be overridden, but not cleanly, by boilerplate logic inside the file.

In the case of Pandoc, since some options are given as metadata fields, they must be referenced either in separate metadata files or as a sequence of --metadata options. One of the larger challenges when using Pandoc with a make file, conceptually simple though mechanically cumbersome, is weaving together exactly the right combination of all options for each output.

If you have some examples that you believe are elegant and easy, then maybe it would be more natural and efficient for you to share those examples, instead of my posting the example that I consider problematic.

Nobody types them out manually like this, no. But commands with 20 options are often used in the context of shell scripts, Makefiles, etc. It's quite handy to be able to do this without providing an external file with the instructions. And that's how nearly all standard unix tools work.

Yes and no. Surely it would be absurd if the cp command required the user to create a file with a source and target path in order to be used. But placing the definitions of complex, repeated tasks in external files is neither novel nor disparaged in a Unix environment. Shell scripts or make files that encapsulate long argument lists have limitations, especially in transparency and clarity of use, that are best resolved by external files read by an application directly. Historically, these files would often be configuration files in a system directory or static user directory. Such locations make sense for operations tied to the configuration of a particular site, such as administrative tasks, or when, as in the old days, the ratio of computer users to systems was in the double digits or higher. Considering a problem like document transformation in the modern environment, it becomes much more relevant to be able to create a portable file that describes an operation.

A project file that is application specific should not be considered contrary to Unix design patterns.

I suspect it would be an unpopular suggestion to remove the current configurability of pandoc via command line options. But if this is kept and project files added, and the project files have an ad hoc syntax that is only sort of overlapping with the CLI options, then...it's one more thing to learn, one more thing to get confused about, one more thing to document, one more thing to maintain. This kind of complexity has a cost (especially for maintainers!).

I agree with your conclusions, but only given your premise, which is different from mine, about what is being proposed. I do not propose by any measure that "the project files have an ad hoc syntax that is only sort of overlapping with the CLI options". The original analysis intended to demonstrate that currently, whatever the actual history or design, a general Pandoc operation is determined by what appear to be two collections of partially-overlapping, ad-hoc option sets, namely metadata fields and command-line options. The proposed idea would rather be a well-planned, extensible, flexible option set that is expressed in YAML (or JSON) in some way that is clear to understand and to document. Such a solution largely obviates the need for an expansive set of options either in metadata or the command line, though some support might be maintained for legacy, and simple operations should of course always be possible directly from the command line.

If you consider the proposal as moving the number of things for users to learn and developers to maintain from two to three, then it seems like a bad proposal without any doubt. But the actual idea is to move it from two to one, the single solution being chosen to meet all the needs not currently met either by each individually nor both together, and to meet even more needs besides.

As I said, I'm not against this proposal. I just think the most sensible way to implement it would be as a way to set default values for command-line options.

Are you referring to the idea that the YAML structure is defined so that each key in the table is simply one among the CLI options (without the -- prefix)? I think I would prefer a somewhat more comprehensive solution. The structure of a YAML file opens much greater possibilities than every value being a primitive, as on the CLI. Meanwhile, the problem remains of metadata fields currently doing the job of fields that should be in the new file type. Ideally, the allowed contents of the new file type should be a union of all the formatting options in either the CLI or metadata, combined cleanly into the same namespace, and utilizing the structured capabilities of the meta-format. Among the objectives should be that metadata blocks are only needed to provide actual metadata, like title and author, not formatting options, like LaTeX headers. These ideas, and their rationale, were central to the original submission, and I have not seen any comments since that push me away from them.

Having said as much, I don't have any problem with CLI options, when used, overriding or augmenting the values in the new file type. Source and target files for example might be easier to keep outside the file in certain cases.

@tarleb
Collaborator

tarleb commented Oct 7, 2019

FWIW, I still dream of having a pandoc equivalent to texlua, i.e. to have pandoc act as a Lua interpreter when called as pandoclua. That would be portable, scriptable, and not too difficult to use. But I won't have time to work on that anytime soon.

@jgm
Owner

jgm commented Oct 7, 2019

Are you referring to the idea that the YAML structure is defined so that each key in the table is simply one among the CLI options (without the -- prefix)? I think I would prefer a somewhat more comprehensive solution.

Yes, that's my idea, and I do understand you had in mind something more comprehensive. However, it's an important goal not to break existing workflows unnecessarily, so I favor a smaller and less breaking change. When I've got it worked out, I think you will see that it achieves most or all of your stated goals quite nicely (including the goal of not putting formatting related things in metadata).

By the way, this approach could also solve the underlying problem in #3732 , giving us a way to specify structured template variable values, which is currently impossible on the command line and leads to people using metadata for things that aren't metadata.

Also relevant: #4627

@brainchild0
Author

brainchild0 commented Oct 7, 2019

it's an important goal not to break existing workflows unnecessarily,

Just to be clear, I never suggested the imminent suspension of support for any command-line option, so as to break backward compatibility.

I only thought that because the new file type, unlike the CLI options, is of course not currently in active use, the namespace it encapsulates might be filled according to whatever design we see as best suited at this time.

jgm added a commit that referenced this issue Oct 8, 2019
HTMLMathMethod, CiteMethod, ObfuscationMethod, TrackChanges, WrapOption,
TopLevelDivision, ReferenceLocation, HTMLSlideVariant.

In each case we use lowercase (or hyphenated lowercase) for
constructors to line up more closely with command-line option
values.

This is a breaking change for those who manually decode or encode
JSON for these data types (e.g. for ReaderOptions or WriterOptions).

See #5790.
@brainchild0
Author

brainchild0 commented Oct 8, 2019

And yes, I agree about #3732 and #4627. The latter was mentioned in the original post. A wrapping application such as a CMS would be able to control an entire pandoc invocation by piping in a single dynamically generated data structure encoded as a YAML stream. This interface not only supports larger and more structured data, but also saves the developer the tiresome work of interpolating values into command-line arguments. Accepting a hyphen as the filename, to represent standard input, would offer a platform-independent way to support such use.

@brainchild0
Author

brainchild0 commented Oct 8, 2019

After reflecting on the concerns you raised earlier about the maintainability and transparency of distinct sets of options, which I may have dismissed too hastily, I offer a suggestion intended to preserve much of the benefit of my earlier thinking while avoiding bloat in the design or implementation.

It is slightly more complex than your approach, but once understood, I hope might be considered to carry an appropriate balance of elegance and manageability.

Consider the following steps:

  1. From the whole set of current CLI options, identify those relating to source and result file names and file types, filters, metadata fields, template variables, and administrative operations, such as printing version information or showing supported formats. Let this subset be called administrative CLI options.
  2. Let the remainder, the whole set of current CLI options minus the administrative CLI options, be called simple formatting options.
  3. Define a new set called structured formatting options, which can be empty for the time being.
  4. For the YAML table, define a set of fields called top-level fields. It will be constituted as such:
    1. Various fields will represent high-level I/O considerations such as source and result file names and types, filters, and so on.
    2. A further field will be called options. The value will be a table, for which each item is processed as such:
      1. If the key is a member of structured formatting options, process it according to its function.
      2. Otherwise, if the key is a member of simple formatting options, process it according to its function, as though it appeared on the command line, unless it is overridden by the arguments on the actual command line.
      3. Otherwise, add it to the table of template variables.
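A minimal sketch of the dispatch in steps 4.2.1 through 4.2.3 above. The membership of the two option sets here is a hypothetical placeholder, not pandoc's actual option list, chosen only to illustrate the flow.

```python
# Hypothetical set memberships, for illustration only.
STRUCTURED_FORMATTING_OPTIONS = {"extensions"}
SIMPLE_FORMATTING_OPTIONS = {"number-sections", "toc"}

def process_options_field(options, cli_options, variables):
    """Apply the 'options' table; actual CLI arguments win over simple options."""
    applied = {}
    for key, value in options.items():
        if key in STRUCTURED_FORMATTING_OPTIONS:
            applied[key] = value            # 4.2.1: process per its own function
        elif key in SIMPLE_FORMATTING_OPTIONS:
            if key not in cli_options:      # 4.2.2: the actual command line wins
                applied[key] = value
        else:
            variables[key] = value          # 4.2.3: fall through to variables
    return applied

variables = {}
applied = process_options_field(
    {"number-sections": True, "toc": False, "fontsize": "12pt"},
    {"toc": True},  # as if --toc appeared on the actual command line
    variables,
)
```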

Thus, four sets are defined:

  • Simple formatting options.
  • Structured formatting options.
  • Top-level fields.
  • Administrative CLI options.

The first will, at least for a while, be by far the largest set. Thus it represents most of the maintenance cost, which will be the same as before this set of enhancements was considered.

The final two have the same function, in different contexts. This overlap adds overhead, though the amount should be acceptable, as both sets are small. The rationale for the separation is that some CLI options have functions that are incoherent as top-level fields, and meanwhile the top-level fields can be designed from scratch as deemed appropriate within the context of a YAML structure.

When a new formatting option is added, it can be designated as simple or structured. Simple options are available on the CLI and in the YAML table, whereas structured ones are available only in the YAML table.

The main utility of the structured options is to facilitate more sophisticated functionality that does not express well in primitive values. However, the overall idea is still viable even if they are not included. In that case, there are only three sets.

I would suggest that it is feasible to document all four sets, and to explain their use, as well as to maintain them in the codebase. As no regression would occur, no immediate update to the documentation is needed.

The sixth post in this series shows an example of how the YAML block would appear with this design.

@jgm jgm added this to the 2.8 milestone Oct 9, 2019
@jgm
Owner

jgm commented Oct 10, 2019

I've made some progress. This now works.

pandoc --defaults project.yaml

where project.yaml looks like this:

input-files:
  - body.md
  - header.yaml
reader: markdown
output-file: result.pdf
writer: latex
number-sections: true
standalone: true
variables:
  documentclass: book
  classoption:
    - oneside
  fontsize: 12pt
  fontfamily: times
  geometry: 'margin=1in'
  header-includes: |
    \usepackage{indentfirst}

@jgm
Owner

jgm commented Oct 10, 2019

I like the suggestion of adding unknown fields at the top level to variables; we can easily add that.

@brainchild0
Author

It looks nice, and is a big step forward.

I like the suggestion of adding unknown fields at the top level to variables; we can easily add that.

Not sure what "unknown fields" means here. The suggestion I intended to convey was putting fields such as number-sections down under the same top-level field as the variables. In this case, variables might instead be called options, to express something more general. Then the question is whether the variables are kept in their own structure, moved one level down, or kept at the current level and mixed with the formatting options.

Originally I thought that standalone would also occupy the lower level, but now I am thinking that it has a very particular function and is best suited for the top level.

@brainchild0
Author

A few fine details to consider:

  • I am less enthusiastic about the term defaults than I am about many other possibilities. It carries the sense of some static source of system or user preferences. Project would be a good term in the case of comprehensive support for multiple targets. As this particular solution is more specific, I am thinking of some kind of word closer to the meaning of action.
  • Is standard input supported specifically?
  • Are paths resolved relative to the file, regardless of working directory? Is behavior uniform regardless of the original working directory?
  • It would be good to consider a specific extension, for example, .pd or .pda, to facilitate exchange or automation. Also, I know it breaks the model that unqualified files are read as input, but I believe a special case should be made for files that have the extension. It is tremendously convenient to be able to use the file with no other arguments.

@jgm
Owner

jgm commented Oct 10, 2019

  • "Defaults" makes sense given what this option currently does. It simply sets default values which can be overridden and extended by further command-line options. So, if you want to use this feature to define every detail for a project, as in the example above, you can. But you can also use it, e.g., to set a few default values that you frequently use, with the plan of specifying the rest on the command line, as in pandoc --defaults letter.yaml myletter.md -o myletter.pdf -Vsign. I'm open to suggestions, but "project" doesn't seem suitable for all the uses I'm envisioning.

  • stdin is supported; you specify - as input file. Similarly for stdout / output file. Note that if you simply leave out input-files it will default to stdin, and similarly for output-file.

  • The paths are interpreted relative to the working directory, as is standard with pandoc.

  • As for extensions, I'm open to suggestions, but currently I'm not requiring any particular extension. It would be nice to have a default extension so you can just do --defaults=letter. Also, I'd like eventually to allow the defaults to be put in the user data directory so you can reuse them in multiple projects. (That wouldn't make sense for those that specify in and output files, of course, but it would make sense for the kind of uses I envision.)

@brainchild0
Author

brainchild0 commented Oct 10, 2019

  • "Defaults" makes sense given what this option currently does. It simply sets default values which can be overridden and extended by further command-line options. So, if you want to use this feature to define every detail for a project, as in the example above, you can. But you can also use it, e.g., to set a few default values that you frequently use, with the plan of specifying the rest on the command line, as in pandoc --defaults letter.yaml myletter.md -o myletter.pdf -Vsign. I'm open to suggestions, but "project" doesn't seem suitable for all the uses I'm envisioning.

As I think I said, I agreed that project wouldn't be an accurate word for what is happening now, but I also think that defaults is not the best choice. It sounds like a set of user preferences, rather than an item associated with a particular document or group of documents.

  • The paths are interpreted relative to the working directory, as is standard with pandoc.

Might want to change CWD to directory containing file, if it is a real file.

  • As for extensions, I'm open to suggestions, but currently I'm not requiring any particular extension. It would be nice to have a default extension so you can just do --defaults=letter. Also, I'd like eventually to allow the defaults to be put in the user data directory so you can reuse them in multiple projects. (That wouldn't make sense for those that specify in and output files, of course, but it would make sense for the kind of uses I envision.)

Agree with idea for search path.

But also an extension helps simplify the command line:

$ pandoc document.pd

You wouldn't have to require the extension, just know what to do when you see it. You can still keep the verbose syntax for arbitrary names.

$ pandoc --defaults document.yaml
$ pandoc --defaults document.someverystrangename

And --defaults=letter could resolve to a file called letter.pd if found on the search path. This reduces the chance that a local file will disrupt the normal searching.

@jgm
Owner

jgm commented Oct 10, 2019

Might want to change CWD to directory containing file, if it is a real file.

I wouldn't want to do this by default, but we could consider adding a variable to allow you to specify paths relative to that directory. But note, there can be multiple input files, and they may live in different directories. All in all, I think this kind of thing is better left to scripts and tools like make.

But also an extension helps simplify the command line.

I see. So the suggestion is that the default treatment of a .pd file (or whatever) is to treat it as --defaults. Not sure what I think, I'd be curious what others have to say.

Currently the search procedure is as follows:

  • add .yaml extension if the filename lacks an extension
  • first look for it relative to working directory
  • if not found there, look in the defaults subdirectory of the user data directory (e.g. ~/.local/share/pandoc or ~/.pandoc).
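The lookup order just described can be sketched roughly as follows. The user data directory location varies by platform; this hard-codes one common choice for illustration, and the function name is invented.

```python
import os

def find_defaults_file(name, cwd=".", data_dir=os.path.expanduser("~/.pandoc")):
    # Add .yaml if the filename lacks an extension.
    if not os.path.splitext(name)[1]:
        name += ".yaml"
    # Look relative to the working directory first, then in the
    # defaults subdirectory of the user data directory.
    for base in (cwd, os.path.join(data_dir, "defaults")):
        candidate = os.path.join(base, name)
        if os.path.exists(candidate):
            return candidate
    return None  # caller reports an error
```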

@jgm
Owner

jgm commented Oct 10, 2019

I'd also be interested in suggestions about the option name. --defaults still seems accurate and descriptive to me. Your objection is

It sounds like a set of user preferences, rather than an item associated a particular document or group of documents.

But that's exactly what it is: a set of user preferences for the values you can normally specify on the command line. These preferences can include an input and output file, in which case you have "an item associated with a particular document," but they could also just be standalone: true if you always want standalone documents (to give an example).

Other possibilities:

pandoc --using letter
pandoc --with letter

@jgm
Owner

jgm commented Oct 10, 2019

Note also that -d is currently unused and could be a one-letter abbreviation

pandoc -d letter

@brainchild0
Author

brainchild0 commented Oct 11, 2019

Might want to change CWD to directory containing file, if it is a real file.

I wouldn't want to do this by default, but we could consider adding a variable to allow you to specify paths relative to that directory. But note, there can be multiple input files, and they may live in different directories. All in all, I think this kind of thing is better left to scripts and tools like make.

I was referring to the directory of the YAML file, not any document source file. Setting the CWD to this location ensures consistent behavior, creating more convenience for experienced users and fewer surprises for inexperienced ones. My idea is to move incrementally toward a model of "it just works".

I agree that adding some kind of template substitution for a variety of variables would eventually be useful. I just don't think that relying on the author of the YAML file to use the prefix consistently is the best way to give users easy and unobstructed access to reproducible behavior. If the prefix is used in certain file references but not all, and the CWD is not changed, a user might test the file not realizing that the observed success was simply due to the luck of having a particular CWD. Meanwhile, if the CWD is not changed, even the diligent user is burdened with needing to test that the file works equally well in any CWD.

Ideally the objective is to look for ways that users can test once, run anywhere.

I see. So the suggestion is that the default treatment of a .pd file (or whatever) is to treat it as --defaults. Not sure what I think, I'd be curious what others have to say.

My particular thinking right now leads me to suggest the following procedure for resolving the file:

  1. If the argument is given, and the value refers to a file that exists, use that file.
  2. Otherwise, if the argument is given, and the value contains only a single path element (e.g. no leading directory parts), then:
    1. Append the extension to the value, unless it already ends in the extension.
    2. Look for the file in some search path, (which for now can be hard-coded to a single location).
  3. Otherwise, complain and exit.

In addition to the above, there would be a preemptive check of the input file names for one bearing the particular extension. There are two options:

  1. Whenever a file on the input file list has the extension, remove it from the input file list and process it as though it were provided with the argument.
  2. Perform the special handling only if a file having the extension is the only value on the command line not preceded by an option name.

Also, a dedicated extension makes a file's type evident from a directory listing alone, and facilitates integration with shells and editors. If a desktop environment adopts a wrapper, then users could run the file simply by clicking in their shell, and editors could scan the local directory for files to run.

I'd also be interested in suggestions about the option name. --defaults still seems accurate and descriptive to me. Your objection is

It sounds like a set of user preferences, rather than an item associated a particular document or group of documents.

But that's exactly what it is: a set of user preferences for the values you can normally specify on the command line. These preferences can include an input and output file, in which case you have "an item associated with a particular document," but they could also just be standalone: true if you always want standalone documents (to give an example).

I suppose it can be used as such, but it generalizes well to many other contexts. Most options in a transformation engine are not determined by what a user prefers as a matter of personal taste universal to all contexts, but rather by what is appropriate for a document, in many cases determined in part by other members of a group.

The choice of key bindings in an editor, or background color on a desktop, is quite personal, and users want complete consistency across sessions usually without exception. Choices such as how many columns to use, what output format, which font size, and so on, might be choices that certain users repeat in some cases according to tastes, but are largely determined by the broader circumstances surrounding the document itself. A user who thinks that books and letters should have identical formatting is one whose judgment is questionable.

Moreover, once a transformation is defined to behave in a consistent way, sharing the definition becomes a compelling possibility.

Naturally, the same file can be used repeatedly for different documents of the same type. This case as you indicate is handled by giving the file names on the command line rather than within the file.

There are three general cases, which might see varied levels of use:

  1. User wants same options for every call to application, unless overridden by command-line arguments.
  2. User (or group) wants consistent results for set of documents, with different names.
  3. User (or group) wants consistent results for a single document, with some name.

To me, the label defaults does not capture the full breadth of generality and potential.

As I said before, I don't currently have a name that wildly excites me, but action seems to capture the concept to some degree. Another possibility is recipe, but again, I have yet to discover a term I really like.

Maybe with time nicer ideas will emerge, especially in a broader discussion.

@brainchild0
Author

By the way, I tried using this feature, and was generally quite happy with how much easier it was for me to manage documents.

At one point, commit ff1df24, I noticed that reader and writer changed to from and to. FWIW, I think the former are slightly nicer.

@alerque
Contributor

alerque commented Oct 11, 2019

@brainchild0 The from and to terminology is eminently more recognizable both from other programs and from Pandoc's CLI usage.

@brainchild0
Author

brainchild0 commented Oct 11, 2019

@brainchild0 The from and to terminology is eminently more recognizable both from other programs and from Pandoc's CLI usage.

Yes, they are used for the CLI arguments, which often carry informal semantics. I think that in the context of the file, nouns (and verbs) are nicer than prepositions. As discussed, it is infeasible and unnecessary to maintain two separate sets of names for all the formatting options, but for the few top-level fields it creates a nice effect to choose names that are appropriately descriptive. Remember that there is no CLI argument for input files, whereas from, used to indicate the source type, does not cleanly express the distinction between file name and format.

But this reminds me of an earlier comment, representing a less petty issue, about extensions perhaps being expressed separate from reader using the structured features of YAML:

extensions:
  reader:
    add:
      - empty_paragraphs
    drop:
      - raw_html

Could also consider the following:

reader:
  name: markdown
  add-ext: 
    - empty_paragraphs
  drop-ext:
    - raw_html

Then the shorthand could be:

reader: markdown+empty_paragraphs-raw_html

But the former is easier to edit in a file and to build programmatically.

Also possible, in case of not wanting any to be implicitly added:

reader:
  name: markdown
  use-only-ext: 
    - empty_paragraphs

Or:

reader:
  name: markdown
  drop-def-ext: yes
  add-ext: 
    - empty_paragraphs
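For illustration, the structured form proposed here could be normalized to pandoc's existing shorthand with a small function. The field names (name, add-ext, drop-ext) follow this comment's proposal and are not pandoc's; the function name is invented.

```python
def reader_shorthand(reader):
    # Accept either the existing shorthand string or the proposed table form,
    # and return the shorthand, e.g. "markdown+empty_paragraphs-raw_html".
    if isinstance(reader, str):
        return reader
    parts = [reader["name"]]
    parts += ["+" + ext for ext in reader.get("add-ext", [])]
    parts += ["-" + ext for ext in reader.get("drop-ext", [])]
    return "".join(parts)
```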

@denismaier
Copy link
Contributor

I agree, but this input file is not referenced in the YAML file, which is the case I consider with respect to resolution of relative paths. Of course files given on the CLI should be resolved relative to the original CWD.

I see. That's fine.

@denismaier
Contributor

Exactly, and wouldn't you want the same behavior regardless of where you were when you built the project?

In most cases, probably. But: let's say you have a defaults file letter.yaml stored in your data directory, and you specify an input file %.md so that pandoc converts a markdown file in your given working directory. I agree that your suggestion is reasonable in many, perhaps most, cases. But there might be others, so a way to disable it would be great.

@brainchild0
Author

Are you saying that the YAML file might represent some input file that has the same base name for all invocations but resides in some project-specific directory, so its path needs to be resolved per invocation?

This case would be an advanced one, eventually good to consider, but in my view only after resolving the fundamentals.

@denismaier
Contributor

Yes, exactly. In my user data directory I have a defaults file letter.yaml. There I specify

input-files:
  - "*.md"

In my project directory I have a file letter.md
Invoking pandoc -d letter in the project directory should run pandoc on letter.md with the options specified in letter.yaml.

@denismaier
Contributor

Oh, and we don't necessarily need the same base name. (Actually, I think this should probably work already. I'll need to test later...)

@jgm
Owner

jgm commented Oct 31, 2019

Validate the items in the file, at least the names of top-level entries, and fail gracefully with a useful message if an item name is not recognized.

This is already implemented.

Rather than the current term defaults, use one that conveys the meaning of a set of options and actions chosen for a certain document, or group of documents, and that can be distributed with the document source

I'm open to concrete suggestions, but I haven't heard one yet that seems better to me than 'defaults'.

the core issue is whether files referenced by a file should be resolved relative to that file or to the original CWD.

Does the (existing) ability to set resource-path in the defaults file help with this?
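For reference, setting resource-path in a defaults file looks like the following; the directory names here are illustrative. Pandoc searches the listed directories, in order, for images and other resources referenced with relative paths.

```yaml
resource-path:
  - .          # the working directory, searched first
  - assets     # illustrative project subdirectory
  - /usr/local/share/letterhead   # illustrative shared location
```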

Invoking pandoc -d letter in the project directory should run pandoc on letter.md with the options specified in letter.yaml.

Personally I think this kind of thing is a job for scripts or Makefiles. I'd rather give this --defaults feature the simple job of determining a set of default values for options.

Repository owner deleted a comment from denismaier Oct 31, 2019
Repository owner deleted a comment from bdo206 Oct 31, 2019
@jgm
Owner

jgm commented Oct 31, 2019

@denismaier - sorry, I deleted your comment by accident.

@denismaier
Contributor

Oh. No worries. My comment was just that I am fine with a simple version of the defaults option.

@brainchild0
Author

brainchild0 commented Oct 31, 2019

This [validation] is already implemented.

Ok, good; it seems the current implementation mostly works correctly. I found a few fringe cases with to and writer appearing together, which should produce an error. Instead it looks as though the latter is silently ignored. I haven't tested from and reader, but the same concern would apply.

Does the (existing) ability to set resource-path in the defaults file help with this?

I don't feel that resource-path serves a function comparable to that of the proposed change.

@jgm
Owner

jgm commented Nov 1, 2019

I don't feel that resource-path serves a function comparable to that of the proposed change.

Maybe you could give an example where it's not enough?

@brainchild0
Author

brainchild0 commented Nov 1, 2019

I don't feel that resource-path serves a function comparable to that of the proposed change.

Maybe you could give an example where it's not enough?

Perhaps I should first see that I understand what use you propose.

Suppose a user has files foo.md and foo.yaml in a directory. The goal is to be able to run a command giving the path of the latter file, either absolute or relative to the current directory, such that the former will be processed so as to create a third file, foo.html, in the same location as the other two. The input and output files are to be named in the YAML file, so that the YAML file is the only one named in the command.

What is the best way to use resource-path to achieve this effect?

@jgm
Owner

jgm commented Nov 1, 2019

OK, resource-path won't give you a default location for an output file, if that's what you're after. But I'm still not convinced this sort of thing is not better handled with scripts, Makefiles, and the like. No need to repeat that discussion (see above).

@jgm
Owner

jgm commented Nov 1, 2019

I've made a small change. Now you can specify --defaults multiple times. Also, --defaults no longer clobbers all option settings that appear before it on the command line (though of course it might clobber particular option settings if it specifies them).

With options like include-in-header, we should presumably behave as we do when these are specified on the command line, so that repeated instances add content to a list, rather than replacing all the previous content. That is not how it works at the moment.
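The accumulation semantics described here can be sketched as follows: when defaults are applied repeatedly, list-valued options extend the existing list, while scalar options replace the old value. The set of list-valued option names is illustrative, not exhaustive, and the function name is invented.

```python
# Illustrative subset of option names whose values accumulate.
LIST_OPTIONS = {"include-in-header", "include-before-body",
                "include-after-body", "filters", "metadata-files"}

def apply_defaults(current, new):
    """Merge a newly loaded defaults table into the current option settings."""
    merged = dict(current)
    for key, value in new.items():
        if key in LIST_OPTIONS:
            # Repeated instances add content to a list rather than replacing it.
            merged[key] = merged.get(key, []) + list(value)
        else:
            merged[key] = value  # scalars: the later setting wins
    return merged
```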

@jgm
Owner

jgm commented Nov 1, 2019

Another change: you can specify multiple files: --defaults=one.yaml,two.yaml.
You can also leave off .yaml: --defaults=one,two. You can also do --defaults=one --defaults=two, which has exactly the same effect.

@denismaier
Contributor

Awesome. How can I test this?

@jgm
Owner

jgm commented Nov 1, 2019

Compile from source, or get a nightly after tonight (see under the Actions tab).

@brainchild0
Author

brainchild0 commented Nov 1, 2019

OK, resource-path won't give you a default location for an output file, if that's what you're after.

I'd be curious to understand how close to this scenario resource-path can take us. If I were willing to compromise on the output location, can you show me how I might use resource-path?

But I'm still not convinced this sort of thing is not better handled with scripts, Makefiles, and the like. No need to repeat that discussion (see above).

Right. I'm not sure what to say that hasn't already been said. All I can think is that in this example we have two possibilities:

  • The user creates two files, defaults and make, the latter being used specifically for managing the locations of the input and output.
  • The user creates only a defaults file, which handles locations natively.

Since file locations are clearly important, the latter seems to be a huge benefit to the user, even at the cost of a small increase in code size in the application.

Edit: Setting aside the feeling that the proposed functionality belongs outside of Pandoc, what reasons, if any, do you identify for preferring the working directory for relative paths? Is the main issue the implementation and maintenance of resolving paths relative to the file location?

@brainchild0
Author

Another change: you can specify multiple files: --defaults=one.yaml,two.yaml

Commas can appear in file names, so I would avoid these nonstandard semantics.

@jgm
Owner

jgm commented Nov 2, 2019

Commas can appear in file names, so I would avoid these nonstandard semantics.

Well, that's true; maybe I'll revert this for now.

what reasons, if any, do you identify for preferring using the working directory for relative paths? Is the main issue implementation and maintenance of resolving the path relative to the file location?

For one thing, the defaults file might come from a central location (like the user data directory); we certainly wouldn't want to interpret paths relative to it in this case. Even when it's not coming from the user data directory, someone might put the default files in a common place.

@brainchild0
Author

brainchild0 commented Nov 2, 2019

For one thing, the defaults file might come from a central location (like the user data directory); we certainly wouldn't want to interpret paths relative to it in this case. Even when it's not coming from the user data directory, someone might put the default files in a common place.

I'm not sure that it would be so terrible even in the case of the file being stashed in some data directory. For example, if some options set for generating a company letter referred to an image file representing a corporate logo, then naturally the image would be near the options file.

But in whatever case the options files are kept separate from the document source, isn't the normal use that the source files would be provided outside those files?

It seems that the issue reduces to the question of in which cases the paths ought to be resolved relative to the current working directory.

@jgm
Owner

jgm commented Nov 3, 2019

Although there are some additional changes we may want to consider further at a later date, I'm content with the current state of this feature for 2.8, so I'm going to close this issue.
