Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create Monte Carlo Workspace Algorithm #38341

Open
wants to merge 32 commits into
base: release-next
Choose a base branch
from

Conversation

Despiix
Copy link
Collaborator

@Despiix Despiix commented Oct 31, 2024

Description of work

Summary of work

To benchmark the Mantid fitting engine, we need a method to generate randomly distributed data based on specific fit functions or workspaces. This will allow us to compare the generated data with known true values, providing insights into the performance of the fitting engine. I implemented a Mantid algorithm that uses a Cumulative Distribution Function (CDF) approach. This algorithm:
  1. Takes a Workspace as Input: The input workspace defines the distribution’s characteristics.
  2. Generates a Random Distribution: Using the CDF of the input data, the algorithm samples random points, producing a distribution that mirrors the input.
  3. Ensures Reproducibility: The algorithm seeds the random number generator, ensuring reproducible results for consistent benchmarking.

Fixes #38196.

Report to: @thomashampson

Further detail of work

**Expected Outcome:**

This algorithm allows us to create random distributions based on known values, making it easier to assess the fitting engine’s accuracy and performance across different scenarios.

To test:

  1. Validate my tests
  2. Run the tests
  3. Verify algorithm does what it's expected to do.
  4. Verify that documentation is correct + Location of image is correct

Reviewer

Please comment on the points listed below (full description).
Your comments will be used as part of the gatekeeper process, so please comment clearly on what you have checked during your review. If changes are made to the PR during the review process then your final comment will be the most important for gatekeepers. In this comment you should make it clear why any earlier review is still valid, or confirm that all requested changes have been addressed.

Code Review

  • Is the code of an acceptable quality?
  • Does the code conform to the coding standards?
  • Are the unit tests small and test the class in isolation?
  • If there is GUI work does it follow the GUI standards?
  • If there are changes in the release notes then do they describe the changes appropriately?
  • Do the release notes conform to the release notes guide?

Functional Tests

  • Do changes function as described? Add comments below that describe the tests performed?
  • Do the changes handle unexpected situations, e.g. bad input?
  • Has the relevant (user and developer) documentation been added/updated?

Does everything look good? Mark the review as Approve. A member of @mantidproject/gatekeepers will take care of it.

Gatekeeper

If you need to request changes to a PR then please add a comment and set the review status to "Request changes". This will stop the PR from showing up in the list for other gatekeepers.

@Despiix Despiix added the ISIS Team: Core Issue and pull requests managed by the Core subteam at ISIS label Oct 31, 2024
@Despiix Despiix added this to the Release 6.12 milestone Oct 31, 2024
@Despiix Despiix self-assigned this Oct 31, 2024
@Despiix Despiix force-pushed the Create_Monte_Carlo_Ws branch from 95efe61 to 190679a Compare November 6, 2024 15:44
@Despiix Despiix closed this Nov 12, 2024
@Despiix Despiix force-pushed the Create_Monte_Carlo_Ws branch from 4e478ed to 005e651 Compare November 12, 2024 17:02
@Despiix Despiix reopened this Nov 12, 2024
@Despiix Despiix marked this pull request as ready for review November 12, 2024 17:29
@Despiix Despiix force-pushed the Create_Monte_Carlo_Ws branch from ffee360 to 685746e Compare November 13, 2024 13:34
Copy link
Contributor

@RichardWaiteSTFC RichardWaiteSTFC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for sticking my oar in here - I found this very interesting (I do a similar thing to generate random data in 3D Q for single-crystal peak integration testing).

I'm aware it may be a work in progress/unfinished, so please forgive me, but I had a couple of questions:

(1) What is the advantage of this over using numpy random number generation like so (assuming the bin-width is small enough you don't need to worry about variation in the underlying pdf over the width of a bin) ?

params =  np.array([2.5e4, 0.06,  0.015, 30000, 30,  50])

peak_func = FunctionFactory.Instance().createPeakFunction("BackToBackExponential")
[peak_func.setParameter(ipar, params[ipar]) for ipar in range(peak_func.nParams())]
comp_func = FunctionWrapper(peak_func) + FlatBackground(A0=params[-1])

# simuate data
np.random.seed(1)
x = np.linspace(ptrue[3]-350, ptrue[3]+500, 61)
y = np.random.poisson(comp_func(x))
e = np.sqrt(y)
ws = CreateWorkspace(DataX=x, DataY=y, DataE=e)

image

(2) This seems to only support single histogram workspaces, but it doesn't check the number of histograms in the input workspace (unless I missed it?) - would it just make sense to generalise it and add a (parallelised) loop over histograms?


int CreateMonteCarloWorkspace::computeNumberOfIterations(const Mantid::HistogramData::HistogramY &yData) {
double total_counts = std::accumulate(yData.begin(), yData.end(), 0.0);
return static_cast<int>(std::round(total_counts));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if this rounded to 0? I suppose you'd get 0s in the data, this might be confusing? Perhaps at least throw a warning?
Would it make more sense to add the number of MC events as an input parameter? Say for example you want to see how counting times/stats affect optimiser performance, I think to do this you would need to scale the input workspace which seems a bit clunky.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having the parameter to specify number of MC events is good, and the default could be the integral of the input.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commited the suggestion above, please have a look and let me know. The scaling function is not perfect, if the number of MC events is too little, the new workspace appears to be a flat line when overplotted onto the original workspace. I am open to any suggestions on ways to improve it.


MatrixWorkspace_sptr outputWs = WorkspaceFactory::Instance().create(instWs);
progress.report();
std::this_thread::sleep_for(std::chrono::milliseconds(100));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a stupid question, sorry, but why do you need these sleep statements?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a stupid question, I found it was the only way to get the progress bar to actually work. If I remove the sleep statements it doesn't work anymore. I believe it's because the code execution might complete so quickly that the progress bar doesn't have enough time to refresh or visually update its state.

If you happen to know a better way to implement it please feel free to let me know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might've have the progress bar a little wrong (although I'm no expert).

If your algorithm, for example, iterated over the spectra of an input workspace and performed some kind of operation with each, you would have the progress bar update after each spectra operation. I think that's the idea anyway, it's showing the progress of the main bulk of the computation. I think for you this would be in fillHistogramWithRandomData.

I think you can also set it to display text (e.g. "writing output"). Maybe you can have a look at some other algorithms and see how it's used there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the code for the progress bar, although personally I prefer the previous look, the new version does not need the sleep statements.

@Despiix
Copy link
Collaborator Author

Despiix commented Nov 18, 2024

Sorry for sticking my oar in here - I found this very interesting (I do a similar thing to generate random data in 3D Q for single-crystal peak integration testing).

I'm aware it may be a work in progress/unfinished, so please forgive me, but I had a couple of questions:

(1) What is the advantage of this over using numpy random number generation like so (assuming the bin-width is small enough you don't need to worry about variation in the underlying pdf over the width of a bin) ?

params =  np.array([2.5e4, 0.06,  0.015, 30000, 30,  50])

peak_func = FunctionFactory.Instance().createPeakFunction("BackToBackExponential")
[peak_func.setParameter(ipar, params[ipar]) for ipar in range(peak_func.nParams())]
comp_func = FunctionWrapper(peak_func) + FlatBackground(A0=params[-1])

# simuate data
np.random.seed(1)
x = np.linspace(ptrue[3]-350, ptrue[3]+500, 61)
y = np.random.poisson(comp_func(x))
e = np.sqrt(y)
ws = CreateWorkspace(DataX=x, DataY=y, DataE=e)

Thank you for taking the time to review my algorithm. To answer to your first question based on my understanding:

The main advantage of my approach is that it doesn't assume any specific underlying functional form for the data. By constructing a cumulative distribution function directly from the input data and sampling from it, we can capture the actual empirical distribution, including any irregularities or complexities that may not be well-represented by a predefined function. This is done to deal with data where the underlying distribution is unknown, complex, or doesn't fit standard models.

Please correct the above if I am wrong, to be honest I hadn't considered using numpy random number generation. Let me know if you still think it would be a better alternative.

docs/source/algorithms/CreateMonteCarloWorkspace-v1.rst Outdated Show resolved Hide resolved

MatrixWorkspace_sptr outputWs = WorkspaceFactory::Instance().create(instWs);
progress.report();
std::this_thread::sleep_for(std::chrono::milliseconds(100));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might've have the progress bar a little wrong (although I'm no expert).

If your algorithm, for example, iterated over the spectra of an input workspace and performed some kind of operation with each, you would have the progress bar update after each spectra operation. I think that's the idea anyway, it's showing the progress of the main bulk of the computation. I think for you this would be in fillHistogramWithRandomData.

I think you can also set it to display text (e.g. "writing output"). Maybe you can have a look at some other algorithms and see how it's used there.

Framework/Algorithms/src/CreateMonteCarloWorkspace.cpp Outdated Show resolved Hide resolved
Comment on lines 63 to 65
// The total counts should be less than numIterations because some random numbers are not counted
auto sumCounts = std::accumulate(outputY.begin(), outputY.end(), 0.0);
TS_ASSERT_LESS_THAN(sumCounts, numIterations);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know that this will always happen, is there a chance these two value could be equal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are equal by design in this algorithm, this is in contrast to the method of randomly generating data I posted, where the total counts itself would be Poisson distributed (I believe, I should check...) - I've asked @Despiix to check the distribution of counts in each bin from multiple simulations are Poisson distributed (or close enough).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having the parameter to specify number of MC events is good, and the default could be the integral of the input.

@thomashampson - what about making the number of MC events equal to a random number drawn from Poisson distribution with mean of integral of the input? This might help the Poisson-ness of the stats in each bin (if that makes sense...)

@Despiix Despiix force-pushed the Create_Monte_Carlo_Ws branch 2 times, most recently from c446801 to 491d7ce Compare November 21, 2024 13:50
@Despiix Despiix force-pushed the Create_Monte_Carlo_Ws branch from d876e9e to 5450e90 Compare November 29, 2024 14:06

//----------------------------------------------------------------------------------------------

Mantid::HistogramData::HistogramY CreateMonteCarloWorkspace::fillHistogramWithRandomData(const std::vector<double> &cdf,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether one could optionally output an event workspace (which is essentially what you are simulating), or produce an event workspace and then rebin to match the input workspace?

@sf1919
Copy link
Contributor

sf1919 commented Jan 14, 2025

Does this need to move to v6.13?

@Despiix
Copy link
Collaborator Author

Despiix commented Jan 14, 2025

Does this need to move to v6.13?

No, I am waiting for someone to review it. All the additional features will be added through a seperate pr.

Copy link
Contributor

@RichardWaiteSTFC RichardWaiteSTFC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing some of the points, in particular setting the errors.

This works well and the tests are good (checking reproducibility with the seed, respects number of MC events etc.).

I checked with the test workspace from the code I posted earlier,
image
I also checked that the variance of the simulated counts agrees with the average over ~5000 simulations so I think it does seem reasonable to set e = sqrt(y), at least for large counts. I haven't tested the validity of the Poisson distribution etc.
image

There are some small changes requested: perhaps some unnecessary normalisation done if the monte-carlo events are user specified?

I think there are some additional features I would like (support for workspaces with more than 1 spectrum, perhaps event workspaces as well) but these can be addressed in a separate PR as and when it becomes useful.

@@ -0,0 +1,44 @@
// Mantid Repository : https://github.com/mantidproject/mantid
//
// Copyright &copy; 2024 ISIS Rutherford Appleton Laboratory UKRI,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably 2025 now!

src/PolarizationCorrections/PolarizationEfficienciesWildes.cpp
src/PolarizationCorrections/PolarizerEfficiency.cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look unrelated to the PR, do you need to rebase?

Framework/Algorithms/src/CreateMonteCarloWorkspace.cpp Outdated Show resolved Hide resolved
Framework/Algorithms/test/CreateMonteCarloWorkspaceTest.h Outdated Show resolved Hide resolved
docs/source/algorithms/CreateMonteCarloWorkspace-v1.rst Outdated Show resolved Hide resolved
Despiix and others added 8 commits January 16, 2025 10:56
Co-authored-by: Jonathan Haigh <35813666+jhaigh0@users.noreply.github.com>
…errors are calculated as the sqrt of the counts.
Does not work if MC events are much less than data points in original workspace
@Despiix Despiix force-pushed the Create_Monte_Carlo_Ws branch from 5ef163a to f30f378 Compare January 16, 2025 14:56
Copy link
Contributor

@RichardWaiteSTFC RichardWaiteSTFC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, in particular renaming the .png and removing the unnecessary additional normalisation when the user supplies the number of events.

I know this PR has been going a while, I'm happy to approve now as I think this algorithm works well and is important to continue with fitbenchmarking work. There are a few nit-picking comments, but as there are a few other features I'd like added separately after this release I think these can be addressed later! What do you think @jclarkeSTFC, @thomashampson, @jhaigh0 and @sf1919 ?

Framework/Algorithms/src/CreateMonteCarloWorkspace.cpp Outdated Show resolved Hide resolved
* Determine how many iterations to use for MC sampling.
* If userMCEvents > 0, use that directly; otherwise use the integral of the input data.
*/
int CreateMonteCarloWorkspace::computeNumberOfIterations(const Mantid::HistogramData::HistogramY &yData,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistically I would prefer to have this if (userMCEvents > 0) in the exec and then this method would be called integrateYData or similar. But I'm happy to merge as is!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be fixed now

@RichardWaiteSTFC
Copy link
Contributor

RichardWaiteSTFC commented Jan 16, 2025

Oh just noticed the copyright is still 2024, is that correct or should it be 2025? Again v minor...

jhaigh0
jhaigh0 previously approved these changes Jan 16, 2025
@jclarkeSTFC jclarkeSTFC changed the base branch from main to release-next January 16, 2025 15:49
@jclarkeSTFC jclarkeSTFC dismissed stale reviews from jhaigh0 and RichardWaiteSTFC January 16, 2025 15:49

The base branch was changed.

Copy link
Contributor

@thomashampson thomashampson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would benefit from a better image for the documentation to showcase the algorithm.

@RichardWaiteSTFC
Copy link
Contributor

RichardWaiteSTFC commented Jan 16, 2025

I think we would benefit from a better image for the documentation to showcase the algorithm.

Feel free to use this if you want
image
or this if you prefer smooth input data
image

The script to generate it is here

# import mantid algorithms, numpy and matplotlib
from mantid.simpleapi import *
import matplotlib.pyplot as plt
import numpy as np
from mantid.api import FunctionFactory

func = FunctionWrapper(FunctionFactory.createInitialized("name=BackToBackExponential,I=25000,A=0.06,B=0.015,X0=30000,S=30;name=FlatBackground,A0=50"))
# create input workspace
x = np.linspace(29650.0, 30500.0, 201)
y = func(x)
e = np.sqrt(y)
ws = CreateWorkspace(DataX=x, DataY=y, DataE=e, UnitX="TOF")
# call algorithm
ws_mc = CreateMonteCarloWorkspace(InputWorkspace=ws, Seed=0)

fig, axes = plt.subplots(subplot_kw={'projection': 'mantid'})
axes.plot(ws, label='input', wkspIndex=0)
axes.plot(ws_mc, label='CreateMonteCarloWorkspace output', wkspIndex=0, alpha=0.75)
legend = axes.legend(fontsize=8.0).set_draggable(True).legend
fig.show()

Despiix and others added 3 commits January 17, 2025 09:34
func = FunctionWrapper(FunctionFactory.createInitialized("name=BackToBackExponential,I=25000,A=0.06,B=0.015,X0=30000,S=30;name=FlatBackground,A0=50"))
# create input workspace
x = np.linspace(29650.0, 30500.0, 201)
y = func(x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this script reproduces the other image I sent ( the smooth blue curve). If you want the noisy data then you need
y = np.random.poisson(func(x))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I thought I added the other image. I will fix it now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be fixed now

Copy link
Contributor

@RichardWaiteSTFC RichardWaiteSTFC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, I've tested and looks good!
Still got copyright 2024 here

// Copyright &copy; 2024 ISIS Rutherford Appleton Laboratory UKRI,

But test file has 2025 in the .cpp file - I don't know which one is right but I think they should be the same!
Other than that happy to approve!

@Despiix
Copy link
Collaborator Author

Despiix commented Jan 20, 2025

Thanks for the changes, I've tested and looks good! Still got copyright 2024 here

// Copyright &copy; 2024 ISIS Rutherford Appleton Laboratory UKRI,

But test file has 2025 in the .cpp file - I don't know which one is right but I think they should be the same!
Other than that happy to approve!

Apologies for that I must have missed it! It should be fixed now.

Copy link
Contributor

@RichardWaiteSTFC RichardWaiteSTFC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work well done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ISIS Team: Core Issue and pull requests managed by the Core subteam at ISIS
Projects
Status: Awaiting Merge
Development

Successfully merging this pull request may close these issues.

New algorithm for creating Monte Carlo data
6 participants