Package needs to be adapted to handle --subsample -n X switches #4

Open
vsenderov opened this issue May 20, 2024 · 1 comment

Comments

@vsenderov

Right now only the default subsample size (1) is supported, and even that requires a specialized function. I have some of the code needed for this, but not all of it.

@vsenderov

After some analysis and private discussions, here are some thoughts:

Currently tpplc controls subsampling with two switches, --subsample and -n X:

  --subsample                        Whether to subsample the posterior
                                              distribution. Use in conjuction with -m
                                              smc-apf or smc-bpf and without
                                              --no-print-samples
  -n <subsample size>          The number of subsamples to draw if
                                             --subsample is selected. Default: 1.

The Python package has a generic way of processing tpplc arguments, but it cannot handle -n X because that switch has only a single leading minus sign. Also, if you subsample at the tpplc level, there is no point in subsampling again at the Python level, so the Python package needs a function like the following, which returns the whole sample that arrives from tpplc:

    def getsample(self):
        # Return every sample received from tpplc, without resampling.
        idx = range(len(self.nweights))
        return [self.samples[i] for i in idx]
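To make the distinction concrete, here is a minimal sketch of how a whole-sample accessor and a Python-level subsampler could sit side by side. The class name InferenceResult, the subsample method, and the assumption that nweights holds normalized weights are all hypothetical; only samples, nweights, and getsample appear in the snippet above.

```python
import random


class InferenceResult:
    """Hypothetical container for what tpplc returns (illustration only)."""

    def __init__(self, samples, nweights):
        self.samples = samples
        self.nweights = nweights  # assumed: normalized weights, one per sample

    def getsample(self):
        # Whole sample, unchanged: use this when tpplc already subsampled.
        idx = range(len(self.nweights))
        return [self.samples[i] for i in idx]

    def subsample(self, n=1, rng=None):
        # Python-level subsampling: draw n samples with replacement,
        # proportionally to the weights. Redundant if tpplc did it already.
        rng = rng or random.Random()
        return rng.choices(self.samples, weights=self.nweights, k=n)
```

With such a split, the caller would pick getsample when --subsample was passed to tpplc and subsample otherwise.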

I think the getsample function should be added in any case. For the number of subsamples, we could do one of the following:

a. Modify the base in the Python package to support -n X (with a single minus) through a custom argument.
b. Modify the CorePPL compiler to use a different syntax for subsampling, i.e. --subsample X. A bare --subsample would then be illegal, unless we default it to 1, but that makes parsing the command-line arguments slightly more involved, and I am not sure it is straightforward. The semantics would be: if you don't specify --subsample, the whole sample is printed; if you specify --subsample X, X samples are printed. If we don't go for the default of 1, the change to cppl will be fairly minor (I think).
c. Move the subsampling to the arguments of the compiled binary instead,

so that the invocation becomes

./out x.json NPARTS NSWEEPS X

where out is the compiled program and X is the number of subsamples.
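As an illustration of option (a), here is a minimal sketch of a generic argument builder with one special case for the single-minus switch. The function name build_tpplc_args and its signature are hypothetical and not taken from the Python package; only the switches --subsample and -n come from the tpplc help text quoted above.

```python
def build_tpplc_args(options, subsample_size=None):
    """Hypothetical helper: turn keyword options into a tpplc argv list.

    Long options are handled generically as "--name value"; the
    single-minus "-n X" switch is handled as a custom argument.
    """
    argv = []
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)  # boolean switch, no value
        elif value is False or value is None:
            continue  # switch not requested
        else:
            argv.extend([flag, str(value)])
    if subsample_size is not None:
        # -n only makes sense together with --subsample.
        if "--subsample" not in argv:
            argv.append("--subsample")
        argv.extend(["-n", str(subsample_size)])
    return argv
```

For example, build_tpplc_args({}, subsample_size=10) yields ["--subsample", "-n", "10"], so the caller never has to spell the single-minus switch by hand.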

This is quite involved, as it requires several changes (for now I am ignoring named arguments such as ./out --input x.json --samples 1000 --subsamples 10; I will come back to that later):

  1. The part of the tpplc compiler that generates the code for reading the input; IIRC this is lib-compile.mc, but that needs double-checking. This is probably not that hard; see the example below.
  2. Critically, after the generated binary reads the inputs, it needs to pass them on to the Infer framework, which does not understand them. This entails a change in the tpplc compiler (around lines 319-330), and the Infer framework also needs to be changed to understand the number of subsamples. Right now the last of these changes (changing the Infer framework) feels like too much for me to take on, as I don't have any free time and I would need to learn the Infer framework.

Now to the question of whether the executable's arguments can be changed to named arguments. Right now they are handled in lib-compile.mc:

let particles = if leqi (length argv) 2 then 10 else string2int (get argv 2)
let sweeps    = if leqi (length argv) 3 then 1 else string2int (get argv 3)
let input: JsonValue =
  if leqi (length argv) 1 then error "You must provide a data file!"
  else jsonParseExn (readFile (get argv 1))
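For readers less familiar with MCore, the positional handling above can be transcribed to Python and extended with a fourth argument X, as option (c) would require. This is only a sketch: the names RunArgs and parse_positional are hypothetical, and the default for the subsample count (equal to the number of particles) is an assumption, not existing behaviour.

```python
from typing import NamedTuple


class RunArgs(NamedTuple):
    data_path: str
    particles: int
    sweeps: int
    subsamples: int


def parse_positional(argv):
    """Mirror ./out x.json NPARTS NSWEEPS X; argv[0] is the program name."""
    if len(argv) <= 1:
        raise ValueError("You must provide a data file!")
    # Same defaults as the MCore snippet: 10 particles, 1 sweep.
    particles = int(argv[2]) if len(argv) > 2 else 10
    sweeps = int(argv[3]) if len(argv) > 3 else 1
    # Assumption: without X, keep the whole sample (one per particle).
    subsamples = int(argv[4]) if len(argv) > 4 else particles
    return RunArgs(argv[1], particles, sweeps, subsamples)
```

The parsing itself stays trivial; the hard part described earlier, passing the extra value on to the Infer framework, is not covered by this sketch.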

The executable doesn't use the argument parser that David wrote; its argument handling is fairly simple, as the code above shows. Perhaps it could be made to parse named arguments, but what would be the point if users already go through Python and the command line is there only for debugging? The same applies to command-line defaults such as ./out # Default input is stdin, default samples 1000, default subsamples is equal to the number of samples.
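For comparison, the named-argument interface with those defaults could be sketched in Python with argparse. This only illustrates the intended semantics; the real executable is generated MCore code, the function name parse_named is hypothetical, and using "-" to stand for stdin is an assumption.

```python
import argparse


def parse_named(argv):
    """Sketch of ./out --input x.json --samples 1000 --subsamples 10."""
    parser = argparse.ArgumentParser(prog="out")
    # Assumption: "-" stands for "read the JSON data from stdin".
    parser.add_argument("--input", default="-")
    parser.add_argument("--samples", type=int, default=1000)
    # None means "not given"; resolved to the number of samples below.
    parser.add_argument("--subsamples", type=int, default=None)
    args = parser.parse_args(argv)
    if args.subsamples is None:
        args.subsamples = args.samples
    return args
```

Calling parse_named([]) gives input "-", 1000 samples and 1000 subsamples, matching the defaults listed above.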

@kudlicka, could you comment on this a little, and on which path you think we should take?

PS: Speaking of debugging, we need to also discuss #5
