-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bug]: "Exception in ASGI application" with usernames longer than 56 characters #39
Comments
Hello, yeah that's an issue, we can't even consider 64 characters as the ceiling. The user id is used to generate a table name with the format 'Vector_<userid>' btw. Chroma db is throwing the error that you see. |
Using a hash seems like a valid solution for a unique table name, but as you say, backward compatibility would be an issue. The clean solution would be a migration to hashes, but that makes downgrading impossible, of course... |
After looking at the code I get the impression that using a hash would have been the easier solution from the start. It removes the need for all the checks, which aren't feasible anyway, since nobody will format their usernames after the requirements of an addon: --- a/context_chat_backend/vectordb/__init__.py
+++ b/context_chat_backend/vectordb/__init__.py
@@ -1,5 +1,6 @@
import re
from importlib import import_module
+from hashlib import sha224
from .base import BaseVectorDB, MetadataFilter
@@ -25,25 +26,7 @@ def get_collection_name(user_id: str) -> str:
if user_id in user_id_cache:
return user_id_cache[user_id]
- if not re_user_id.match(user_id):
- raise AssertionError('Error: invalid user_id format, should consist of alphanumeric characters, hyphen, underscore, dot, and space only. Length should not exceed 56 characters.') # noqa: E501
-
- # should not end in a special character
- if user_id[-1] in '_.-@ ':
- raise AssertionError('Error: user_id should not end in a special character.')
-
- # replace space with double underscore
- user_id = user_id.replace(' ', '__')
- # replace consecutive dots with .n. (n is the number of dots)
- user_id = re.sub(r'\.{2,}', lambda m: f'.{len(m.group())}.', user_id)
- # replace @ with .at.
- user_id = user_id.replace('@', '.at.')
-
- # recheck length constraints
- if len(user_id) > 56:
- raise AssertionError(f'Error: length of cleaned up user_id should not exceed 56 characters, processed username: {user_id}.') # noqa: E501
-
- collection_name = f'Vector_{user_id}'
+ collection_name = f'Vector_{sha224(user_id).hexdigest()}'
user_id_cache[user_id] = collection_name
return collection_name Of course this affects get_user_id_from_collection(), but on first glance I can't see that it is used anywhere apart from testing. So to me it looks like this would be the better way to do it. And regarding backward compatibility, it is still better to break it now than later. The only remaining issue is the possibility of collisions, but that's only a theoretical problem and will be vastly more rare than the current issue. |
Hello, I discussed this issue with senior developers, they mention the same issues you do like the collision and maintenance of reverse lookup table for user ids. We think it would be nice to use this opportunity to change the schema to support ACLs (and common documents) which would prevent duplication of document chunks and embeddings for shared documents. We plan on making two collections. One for vectors and the other for ACLs. This is due to the fact that Chroma DB does not support list type of metadata and metadata filters don't match on partial strings afaik. We could drop Chroma and just use Weaviate but the latter does not support schema change and we would like to have that for now, given the dynamic nature of this software at this stage. We're looking into the finer details but that is the general direction we think would solve both the issues and be future proof for a while.
It is used to get all the users to delete a metadata filtered set of documents (https://github.com/nextcloud/context_chat_backend/blob/master/context_chat_backend/vectordb/base.py#L212, https://github.com/nextcloud/context_chat_backend/blob/master/context_chat_backend/vectordb/chroma.py#L41). (necessary for custom data provider support)
yup |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
Is there any update on the progress? I'd really love to do some testing ;). |
Hi, yes, we'll be working on this in the coming or so week. We'll keep you informed. |
Describe the bug
I'm using Keycloak for authentication to nextcloud, making the usernames a 64 character hex string. When I try to use the context chat, it fails with the error message
AssertionError: Error: invalid user_id format, should consist of alphanumeric characters, hyphen, underscore, dot, and space only. Length should not exceed 56 characters.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The bot should give an answer to the query
Server logs (if applicable)
Context Chat Backend logs (if applicable, from the docker container)
Screenshots
If applicable, add screenshots to help explain your problem.
Setup Details (please complete the following information):
Additional context
I'm not sure if changing the user IDs would help, since even email addresses can get longer than 56 characters.
The text was updated successfully, but these errors were encountered: