Make credentials a resolver #4320
Replies: 16 comments 2 replies
-
I like the idea in principle, but what would be a way for us to prevent hardcoding of credentials into the catalog directly? |
Beta Was this translation helpful? Give feedback.
-
Not sure of what you mean : for me the default credentials provided by kedro would do the exact same things as it does now (map a string in the catalog to an entry in credentials.yml) so I don't think there is a problem here. A user who likes playing with kedro's limits could eventually write his own resolver and eventually takes raw inputs directly in the catalog, but I don't think it's kedro's responsibility to try to circumvent all weird things users can do (and funnily, they already can write a custom resolver and use it in the catalog - my proposal would simply enable to make their dangerous logic the default, but they are clearly doing it on purpose). Do you have an example of a situation this proposal would enable which is not already possible and that we want to avoid by all means? |
Beta Was this translation helpful? Give feedback.
-
Hard agree. Our YAML-first approach to credentials is weird, given that there are many other methods that are arguably more prevalent, like environment variables, secrets vaults, JSON token files... Plus we have evidence that users struggle to understand the current approach #1646 (comment) |
Beta Was this translation helpful? Give feedback.
-
I think this is really elegant, I think there is probably a way where this is non-breaking and just an extension too |
Beta Was this translation helpful? Give feedback.
-
@Galileo-Galilei The way
and this:
Currently, the latter will not be taken by Kedro and it won't work, because we do the resolving after loading the config. This way we are preventing users from accidentally checking in secrets. If we go the resolver path (which I like personally), we need to make sure that my second example will error out and your example will only work. |
Beta Was this translation helpful? Give feedback.
-
This is a totally valid concern, I did not think about this edge case. It seems it should be possible to check if the credentials subkey is interpolated with [ |
Beta Was this translation helpful? Give feedback.
-
Originally posted by @MatthiasRoels in #1936 (comment) |
Beta Was this translation helpful? Give feedback.
-
O also think a resolver based way of doing creds would be amazing, I had to hack something together to make things work for my SQLFrame experiement. |
Beta Was this translation helpful? Give feedback.
-
Hi Everyone, As always, a lot of great thoughts & comments. 👍 If I may quickly had some poorly articulated concerns / suggestions that connects this issue to #2829: AWS Secrets bills on READ Our problem is compounded by the fact that, in the Looking very forward to reading your thoughts on this. |
Beta Was this translation helpful? Give feedback.
-
The suggestion make sense. We agree this is something that is useful, at the moment we are doing some re-design for DataCatalog, we will start looking at this once we finish the refactoring. |
Beta Was this translation helpful? Give feedback.
-
More evidence that our users are just using environment variables for credentials https://kedro.hall.community/support-lY6wDVhxGXNY/passing-credentials-to-a-node-in-kedro-DrfL3JQdGpPb (Related to #3979 cc @lordsoffallen) |
Beta Was this translation helpful? Give feedback.
-
Beta Was this translation helpful? Give feedback.
-
Credential management in Polars is changing 👀 see
lf = pl.scan_parquet(
"s3://...",
credential_provider=pl.CredentialProviderAWS(profile_name="..."),
) |
Beta Was this translation helpful? Give feedback.
-
Moving this to Discussions to continue the conversation there. |
Beta Was this translation helpful? Give feedback.
-
Hi, This would be very convenient, especially with dynamic credentials generated from Hooks, so I'll be able to pass a default value if the credential doesn't exists. Today I've found a way by adding a "get_credential_if_available" resolver in an "after_context_created" hook: class GetCredentialIfAvailableResolver:
@hook_impl
def after_context_created(self, context) -> None:
# This method is called after the Kedro context is created, before catalog creation
def custom_resolver(context, credential_name):
# return the credential if it exists, else unset the credential by returning None
try:
return context.config_loader["credentials"][credential_name]
except KeyError:
return None
# registering the resolver
OmegaConf.register_new_resolver("get_credential_if_available", lambda x: custom_resolver(context, x)) That I then uses in it my data catalog, like this: master_table:
type: pandas.ExcelDataset
filepath: "${runtime_params:mastertable-file-blob, /data/master_table.xlsx}"
#save_args:
credentials: ${get_credential_if_available:mastertable-file-blob_credentials} This is useful as my "mastertable-file-blob_credentials" is generated only if the passed parameter is an abfs path So now I don't get any error if the credential doesn't exists (so when I run the pipeline locally for developing and don't pass an abfs:// path with the kedro commandline parameter, that is just saving to local file to the path which is the default value). The hook for generating the dynamic credential if it helps understanding: class AzureBlobDefaultAzureCredentialsBlobPathsParametersCredentialsGenerator:
# This class automatically detects STRING values parameters which contains a blob url and annd credentials for them, with the same name than the parameter but appended with _credentials
# it uses Azure Default Credentials, wrapped to async for fsspec
@hook_impl
def after_context_created(self, context) -> None:
class AsyncDefaultAzureCredentialWrapper(AsyncTokenCredential):
"""
This class wraps the synchronous DefaultAzureCredential to provide an asynchronous interface.
It allows the use of DefaultAzureCredential with fsspec, which requires asynchronous credentials.
The close method is overridden to be a no-op to prevent issues when fsspec tries to close the credential.
Note from Francis Viallet: I posted the answer to the fsspec's git issue related to this: https://github.com/fsspec/adlfs/issues/431
"""
def __init__(self, **kwargs):
# Initialize the synchronous DefaultAzureCredential
self._credential = DefaultAzureCredential(**kwargs)
async def get_token(self, *scopes, **kwargs):
#Asynchronously get a token for the specified scopes: this method runs the synchronous get_token method in an executor to avoid blocking.
loop = asyncio.get_event_loop()
token = await loop.run_in_executor(None, self._credential.get_token, *scopes, **kwargs)
return AccessToken(token.token, token.expires_on)
async def close(self):
#No-op close method to prevent issues when fsspec tries to close the credential.
pass
# Create the async credential from the sync one, with the no-op close
credential = AsyncDefaultAzureCredentialWrapper(exclude_interactive_browser_credential=False)
# This method is called after the Kedro context is created, before catalog creation (which we need as we will use the credentials in the catalogs): https://docs.kedro.org/en/latest/hooks/introduction.html
# Filter parametetrs to get ones with azure datalake/blob fsspec's compliant path values only (pandas datasets uses fsspec derrière)
blob_paths_parameters = {
k: v for k, v in context.params.items()
if isinstance(v, str)
and (v.startswith('abfs://') or v.startswith('az://') or v.startswith('adl://'))
and ('.blob.core.windows.net/' in v or '.dfs.core.windows.net/' in v)
}
if len(blob_paths_parameters) > 0:
# Initialize the DefaultAzureCredential if needed (so if parameters containing blob path were found)
# We disable the exclude_interactive_browser_credential so when this class will ask user to authenticate in last resort if none of the previous methods works: useful when running the code on developer's Computer
#from azure.identity import DefaultAzureCredential # import it there because fsspec needs to uses the aio (async) version where others uses the regular one (https://github.com/fsspec/adlfs/issues/431)
#credential = DefaultAzureCredential(exclude_interactive_browser_credential=False)
# Iterate found parameters
for key, blob_path in blob_paths_parameters.items():
# extract account name from blob path
if '@' in blob_path:
account_name = blob_path.split('@')[-1].split('.')[0]
else:
url_parsed = urlparse(blob_path)
account_name = url_parsed.netloc.split('.')[0]
# Adding the credentials to the Kedro context configuration (will be added to setting.py to made them available to all the catalog/pipelines : from vmx_preprocessing.hooks import AzureSQLDefaultAzureCredentials, AzureSharepointDefaultAzureCredentials ; HOOKS = (AzureSQLDefaultAzureCredentials(), AzureSharepointDefaultAzureCredentials(),))
context.config_loader["credentials"] = {
**context.config_loader["credentials"],# Keep existing credentials in the context to not lose them when overwriting the credentials object
f'{key}_credentials' : {
'account_name': account_name,
'credential': credential,
'anon': False
}
} |
Beta Was this translation helpful? Give feedback.
-
Today I found something interesting:
kedro/kedro/io/catalog_config_resolver.py Line 18 in f6ecca5 ...same as ...but this is not transparent for users! I just helped a user that had a custom dataset and entries like these: # catalog.yml
ds:
type: my.custom.Dataset
url: db_url
# credentials.yml
db_url: https://... They expected this to work, but they were getting overwhelming errors essentially boiling down to the string They also had the perception that it was not okay to use environment variables. So in the end I saw two options: either they changed their catalog entries and custom dataset source code to abide by the (After which they asked, "why can't I do the same with keys stored in |
Beta Was this translation helpful? Give feedback.
-
Description
Make "credentials" a
OmegaConf
resolver, so injecting credentials in the catalog would look like :Context
This may look like a pure technical refactoring, but actually I am expecting this to be an actual improvement for users, e.g. :
settings.py
Overall, this would likely be a general improvement with some overlaps with what is described here #1646 and has never been properly adressed, despite a bunch of discussions around this topic.
Possible Implementation
Move the logic from the catalog and the
_get_credentials
and_resolve_credentials
:kedro/kedro/io/data_catalog.py
Lines 35 to 58 in 7237dc7
kedro/kedro/io/data_catalog.py
Lines 61 to 83 in 7237dc7
to a resolver.
Possible Alternatives
Do nothing and keep the situation as is.
Beta Was this translation helpful? Give feedback.
All reactions