hashing functions/objects defined elsewhere #717
This shouldn't be tied to type annotations (apart from files), but to actual runtime types. Hashing the cloudpickle as a fallback for types we don't know how to hash seems reasonable.
Looking at the old function:

```python
def hash_function(obj):
    """Generate hash of object."""
    return sha256(str(obj).encode()).hexdigest()
```

I see how that was so successful. :-) I do think it would be good to use something more like cloudpickle than `str()`.
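The failure mode is easy to demonstrate (the `Opaque` class below is a hypothetical stand-in): any object whose default `str()` embeds its memory address hashes differently for every instance and every run, even when nothing about its state differs.

```python
from hashlib import sha256

class Opaque:
    pass  # default str() includes the id(), e.g. <__main__.Opaque object at 0x...>

def old_hash(obj):
    # the old hasher, reproduced from above
    return sha256(str(obj).encode()).hexdigest()

a, b = Opaque(), Opaque()
# two equivalent objects hash differently, because str() embeds the address
assert old_hash(a) != old_hash(b)
# builtins with stable, state-based reprs are fine, which is why it "mostly worked"
assert old_hash({"x": 1}) == old_hash({"x": 1})
```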
Oh, and finally, the fastest way to enable hashing on a currently-unhashable object would be:

```python
import cloudpickle as cp
from pydra.utils.hash import register_serializer, Cache

@register_serializer
def bytes_repr_Pipeline(obj: Pipeline, cache: Cache):
    yield cp.dumps(obj)
```
thanks @effigies for the register code, that helps in this particular instance. but a general pipeline could indeed use the cloudpickle. should i leave this open for adding a fallback option? ps. we also still haven't solved function hashing in general across putatively similar environments (say, the same versions of libraries installed in different operating system environments), but that's a different challenge.
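A fallback along those lines could look roughly like the sketch below. The helper names are invented, and stdlib `pickle` stands in for cloudpickle (which additionally handles lambdas and closures); pydra's real mechanism is the `register_serializer` hook shown above.

```python
import pickle  # cloudpickle in practice, for lambdas/closures/local classes
from hashlib import blake2b

def fallback_bytes_repr(obj):
    """Hypothetical catch-all serializer: type info plus pickled bytes."""
    yield type(obj).__module__.encode()
    yield type(obj).__name__.encode()
    yield pickle.dumps(obj)

def fallback_hash(obj):
    """Feed the byte chunks into a hash, mirroring how pydra consumes
    bytes_repr generators."""
    h = blake2b(digest_size=16)
    for chunk in fallback_bytes_repr(obj):
        h.update(chunk)
    return h.hexdigest()
```

This trades strictness for coverage: anything picklable gets a hash instead of an UnhashableError, at the cost of pickle's sensitivity to library versions across environments.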
actually the register code by itself doesn't work, as the object is in a dict and perhaps there is no recursion there:

```python
{'permute': True,
 'model': (Pipeline(steps=[('std', StandardScaler()),
                           ('MLPClassifier', MLPClassifier(alpha=1, max_iter=1000))]),
           [0, 2, 3], [1, 10, 12]),
 'gen_feature_importance': False,
 '_func': b'\x80\x05\x95-\x00\x00\x00\x00\x00\x00\x00\x8c\x0epydra_ml.tasks\x94\x8c\x16get_feature_importance\x94\x93\x94.'}
```

also, i just injected that register in pydra_ml rather than pydra, which i think is the right thing to do.
There should be recursion. We built it that way.
hmmm. this whole thing goes to:

```
(Pdb) u
> /Users/satra/software/nipype/pydra/pydra/utils/hash.py(63)hash_function()
-> return hash_object(obj).hex()
(Pdb) obj
{'permute': True,
 'model': (Pipeline(steps=[('std', StandardScaler()),
                           ('MLPClassifier', MLPClassifier(alpha=1, max_iter=1000))]),
           [0, 2, 3, 4, 5, ..., 565, 568],    <- long index lists elided for readability
           [1, 10, 12, 14, 15, ..., 566, 567]),
 'gen_feature_importance': False,
 '_func': b'\x80\x05\x95-\x00\x00\x00\x00\x00\x00\x00\x8c\x0epydra_ml.tasks\x94\x8c\x16get_feature_importance\x94\x93\x94.'}
```
this branch has the changes: nipype/pydra-ml#59, and i'm just running this test to work through the changes.

actually, nevermind. the env still seemed to have pydra 0.22. checking with 0.23.alpha now.

the issue persists.
I think we probably need to provide a better debugging experience when hashing fails. Has the registration worked properly, or is there an error inside the registered function, @satra?
at least in pdb, when i try to run the node, it errors out. since the Pipeline object is an input in other places, i do think the registration is working, i.e. i can make things not work by inserting something in the registration function.
The code is at Lines 66 to 78 in 0e3d33c.
@satra I can't reproduce. I'm getting an entirely different error on your PR:

pydra_ml/tests/test_classifier.py F
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> captured stderr >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
100%|██████████| 114/114 [00:00<00:00, 273.40it/s]
100%|██████████| 114/114 [00:00<00:00, 282.79it/s]
100%|██████████| 114/114 [00:00<00:00, 271.81it/s]
100%|██████████| 114/114 [00:00<00:00, 268.36it/s]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
tmpdir = local('/tmp/pytest-of-chris/pytest-10/test_classifier0')
def test_classifier(tmpdir):
clfs = [
("sklearn.neural_network", "MLPClassifier", {"alpha": 1, "max_iter": 1000}),
[
["sklearn.impute", "SimpleImputer"],
["sklearn.preprocessing", "StandardScaler"],
["sklearn.naive_bayes", "GaussianNB", {}],
],
]
csv_file = os.path.join(os.path.dirname(__file__), "data", "breast_cancer.csv")
inputs = {
"filename": csv_file,
"x_indices": range(10),
"target_vars": ("target",),
"group_var": None,
"n_splits": 2,
"test_size": 0.2,
"clf_info": clfs,
"permute": [True, False],
"gen_feature_importance": False,
"gen_permutation_importance": False,
"permutation_importance_n_repeats": 5,
"permutation_importance_scoring": "accuracy",
"gen_shap": True,
"nsamples": 15,
"l1_reg": "aic",
"plot_top_n_shap": 16,
"metrics": ["roc_auc_score", "accuracy_score"],
}
wf = gen_workflow(inputs, cache_dir=tmpdir)
> results = run_workflow(wf, "cf", {"n_procs": 1})
pydra_ml/tests/test_classifier.py:38:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pydra_ml/classifier.py:175: in run_workflow
sub(runnable=wf)
../pydra-tmp/pydra/engine/submitter.py:42: in __call__
self.loop.run_until_complete(self.submit_from_call(runnable, rerun))
../../../mambaforge/envs/pydra-dev/lib/python3.11/asyncio/base_events.py:653: in run_until_complete
return future.result()
../pydra-tmp/pydra/engine/submitter.py:71: in submit_from_call
await self.expand_runnable(runnable, wait=True, rerun=rerun)
../pydra-tmp/pydra/engine/submitter.py:128: in expand_runnable
await asyncio.gather(*futures)
../pydra-tmp/pydra/engine/helpers.py:586: in load_and_run_async
await task._run(submitter=submitter, rerun=rerun, **kwargs)
../pydra-tmp/pydra/engine/core.py:1237: in _run
result.output = self._collect_outputs()
../pydra-tmp/pydra/engine/core.py:1365: in _collect_outputs
val_out = val.get_value(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = LazyOutField(name='feature_importance', field='feature_importance', type=pydra.engine.specs.StateArray[typing.List[typing.Any]], splits=frozenset({(('ml_wf.clf_info',), ('ml_wf.permute',))}), cast_from=None)
wf = <pydra.engine.core.Workflow object at 0x7f4873b5efd0>, state_index = None
def get_value(
self, wf: "pydra.Workflow", state_index: ty.Optional[int] = None
) -> ty.Any:
"""Return the value of a lazy field.
Parameters
----------
wf : Workflow
the workflow the lazy field references
state_index : int, optional
the state index of the field to access
Returns
-------
value : Any
the resolved value of the lazy-field
"""
from ..utils.typing import TypeParser # pylint: disable=import-outside-toplevel
node = getattr(wf, self.name)
result = node.result(state_index=state_index)
if result is None:
> raise RuntimeError(
f"Could not find results of '{node.name}' node in a sub-directory "
f"named '{node.checksum}' in any of the cache locations:\n"
+ "\n".join(str(p) for p in set(node.cache_locations))
)
E RuntimeError: Could not find results of 'feature_importance' node in a sub-directory named 'FunctionTask_553735ecdd5564cc0b0913c68a4fa342' in any of the cache locations:
E /tmp/pytest-of-chris/pytest-10/test_classifier0
../pydra-tmp/pydra/engine/specs.py:1012: RuntimeError

Can move this comment to nipype/pydra-ml#59 if you'd prefer.
that's the error, and if you pdb it, move up one slot in the stack and try to run the node variable.
I really can't reproduce this:
Attempting to look at
thanks @effigies for trying. i also don't know why we are seeing different outcomes. i'll dig in some more tomorrow morning, but you are getting the error that it couldn't find the results, right? i get that too, and then when i try to run the node, it crashes. i also don't know why that error surfaces only with 0.23; if you have 0.22 + 0.6 that error doesn't surface. however, a lot has changed between 0.22 and 0.23.
Correct, but running
@satra - I can try to reproduce and debug as well, but can you write your python version? and you're running this locally on your osx, right?
i have tried with python 3.10 and 3.11 (same errors) on osx 14 (m1 chip).
@satra - do you expect
i don't think so. it's based on the clf_info (a split) and the split indices (a nested split).

so it has a random component?
I would say it has dynamic/generative components, more so than random. But there are random generators.
yes, of course, sorry, didn't think long enough about the workflow and your answer... for a moment I got confused about why my checksums keep changing...
Would it help debugging to add a random seed that could be fixed?
I think it's not using a random seed, and yet the checksum is still changing.
@tclose - I've created this branch for testing: https://github.com/djarecka/pydra-ml/tree/newpydra_test. I removed some tasks that are not needed to get the error and also simplified the splitters part. I kept the commented-out part just so it's easy to see what has been removed from the original code. The error I'm getting is:

I removed the connections to the following nodes, but kept this one. You can see that this task changes its hash not only between the time it is run and the time the results have to be collected; it is also different every time the workflow is run.
it's perhaps a function of how it gets checksummed.
yes, I'm testing in new locations. but just to be clear, I still keep other things in the pipeline... I'm wondering what the best solution for hashing the pipeline would be?
Just picking this up now: what do these Pipeline objects look like, @satra? When you say that they are supplied by the user, are they completely arbitrary? Where do the pipelines in the test come from? If they are arbitrary and have stochastic elements, then there isn't much that we can do, is there? We need the user to supply a hasher that can pick out the stable attributes and make a hash from them, don't we?
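A minimal sketch of such a user-supplied hasher: `get_params()` is sklearn's real API for extracting a Pipeline's configuration, but the helper name and the restriction to primitive-valued params are assumptions made for illustration.

```python
import json
from hashlib import blake2b

def stable_attrs_hash(params):
    """Hash only user-chosen stable attributes, serialized deterministically
    (sorted keys, so dict insertion order doesn't matter)."""
    payload = json.dumps(params, sort_keys=True, default=repr)
    return blake2b(payload.encode(), digest_size=16).hexdigest()

# For an sklearn Pipeline, one might filter its params to primitives, e.g.:
#   params = {k: v for k, v in pipeline.get_params().items()
#             if isinstance(v, (int, float, str, bool, type(None)))}
params = {"mlp__alpha": 1, "mlp__max_iter": 1000, "std__with_mean": True}
digest = stable_attrs_hash(params)
```

This sidesteps stochastic or run-to-run attributes entirely, at the cost of the user deciding which attributes actually determine the computation.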
@tclose - the original hash handled this case, hence the question of how to associate that hashing function with the type.
more generally, debugging improvements could help here.
I have implemented debugging improvements for this issue in #698, which detect that the hash actually changes during the execution of the task. I dug into it with the debugger and can confirm the behaviour with your Pipeline.

I have narrowed things down a bit and it is a bit strange. It turns out that the hash of a Pipeline with a SimpleImputer step isn't stable between deepcopies of the Pipeline object, i.e. `hash_function(obj) != hash_function(deepcopy(obj))`. When a task is run, a deepcopy of the inputs is stored away to guard against inner states of objects being updated while the task runs; however, this assumes that the hash is stable across deepcopies. What would cause a deepcopy to be different from the original?... (will keep digging)
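The check described here can be sketched roughly as follows; the helper names and the pickle-based stand-in hash are assumptions for illustration, not Pydra's actual #698 implementation.

```python
import pickle
from hashlib import blake2b

def input_hash(obj):
    # stand-in hash over the pickled bytes of an input value
    return blake2b(pickle.dumps(obj), digest_size=16).hexdigest()

def run_with_hash_check(func, inputs):
    """Record each input's hash before the task runs, then flag any
    input whose hash changed during execution."""
    before = {name: input_hash(val) for name, val in inputs.items()}
    result = func(**inputs)
    changed = [name for name in inputs if input_hash(inputs[name]) != before[name]]
    if changed:
        raise RuntimeError(f"inputs changed hash during execution: {changed}")
    return result
```

A task that mutates a mutable input (e.g. appends to a list it was passed) would trip the `RuntimeError`, pointing straight at the offending input name instead of a missing-results error much later.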
Actually, the deepcopy inequality is just because the Imputer classes don't implement an `__eq__` method, so equality falls back to object identity.

To work with these "poorly behaved" types in general, perhaps what we should do is replace the inputs with the deepcopied version for the task run, and then replace them with the originals afterwards. We could also check whether the deepcopied version is unequal to the original and raise a warning.
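The identity-fallback behaviour is easy to reproduce without sklearn; the `ImputerLike` class below is a hypothetical stand-in for an estimator that defines no `__eq__` of its own.

```python
from copy import deepcopy

class ImputerLike:
    """Stand-in for an estimator class without a custom __eq__."""
    def __init__(self, strategy="mean"):
        self.strategy = strategy

a = ImputerLike()
b = deepcopy(a)
# With no __eq__, Python compares by identity, so the deepcopy is
# "unequal" to the original even though its state is identical.
assert a != b
assert a.strategy == b.strategy
```

Any hash scheme that folds in `==` comparisons (or id-based caching of already-hashed objects) inherits this instability across deepcopies.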
In that branch I have changed the way this is handled; see Lines 271 to 314 in 1720ba6.
Actually, I couldn't help myself and had another look, and if I'm reading it right I think a straight copy should be sufficient instead of a deepcopy. So I have pushed those changes to my branch, and the test seems to get past that point and error somewhere else.
@tclose - thanks so much for digging into this. setting the deepcopy issue aside, the original hasher handled this case.

to the question of deepcopies inside pydra: we did not want the functions wrapped by tasks changing inputs, which many functions are prone to doing, especially for mutable things like dictionaries and lists. and since split/combine also operate on the inputs and could get arbitrary inputs and functions, deepcopies seemed safest to prevent the mutation. i.e. functions can do whatever they want, but what we send into a function cannot be mutated in pydra.
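The mutation hazard motivating those deepcopies can be shown in a few lines; the task function here is invented for illustration.

```python
from copy import deepcopy

def careless_task(cfg):
    # wrapped functions frequently mutate mutable inputs like dicts and lists
    cfg["ran"] = True
    return cfg

original = {"ran": False}
# hand the task a copy, so the caller's input (and therefore its hash,
# which keys the cache directory) stays unchanged
result = careless_task(deepcopy(original))
assert result["ran"] is True
assert original["ran"] is False
```

Without the copy, the input's hash would differ before and after execution, and the engine would later fail to find the results directory computed from the original checksum.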
Hi @satra, I have done that and created a PR onto your PR: nipype/pydra-ml#61. Both tests pass now. Note that the tests also pass using the cloudpickle bytes_repr if you use my #698 Pydra branch, once you add the other change in that PR.
thanks @tclose for helping debug this.

regarding the deepcopy changes, i'll leave it to you and other developers to review and add.

thanks @tclose for finding and fixing the issue!
this is with current master, after the hashing changes were incorporated.

in pydra_ml almost everything was marked as type Any, with the old hashing function figuring things out. i saw this as something a general scripter would do. i can go and do stricter type annotation (i have tried this too - see below), but i wouldn't expect a general user to do that when importing a function from an arbitrary library.

what's the minimal thing one can do to fix this on the user side, so that arbitrary functions could be imported in pydra? i suspect this may actually require changes to the hashing. an approach may be to pickle the object and generate the byte stream, which seems like a sensible fallback in a local setting instead of an UnhashableError.

in this particular case, the scikit-learn Pipeline object is being used, which is an arbitrarily nested object of objects. it's the input to this function. an equivalent consideration would be if someone decided to pass a pydra workflow as an input to a function. https://github.com/nipype/pydra-ml/blob/b58ad3d488857716df74c5917d5ad11729c25258/pydra_ml/tasks.py#L129