Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug]: "Exception in ASGI application" with usernames longer than 56 characters #39

Open
telepath opened this issue Apr 14, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@telepath
Copy link

Describe the bug
I'm using Keycloak for authentication to nextcloud, making the usernames a 64 character hex string. When I try to use the context chat, it fails with the error message
AssertionError: Error: invalid user_id format, should consist of alphanumeric characters, hyphen, underscore, dot, and space only. Length should not exceed 56 characters.

To Reproduce
Steps to reproduce the behavior:

  1. have context chat and context chat backend installed
  2. log into nextcloud with a user that has a username longer than 56 characterrs
  3. use the context chat
  4. check backend log for error

Expected behavior
The bot should give an answer to the query

Server logs (if applicable)

``` ```

Context Chat Backend logs (if applicable, from the docker container)

ERROR:    Exception in ASGI application                                                                                                                                                                                                                                        
Traceback (most recent call last):                                                                                                                                                                                                                                                     
  File "/usr/local/lib/python3.11/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi                                                                                                                                                                             
    result = await app(  # type: ignore[func-returns-value]                                                                                                                                                                                                                            
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                            
  File "/usr/local/lib/python3.11/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__                                                                                                                                                                             
    return await self.app(scope, receive, send)                                                                                                                                                                                                                                        
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                        
  File "/usr/local/lib/python3.11/dist-packages/uvicorn/middleware/message_logger.py", line 84, in __call__                                                                                                                                                                            
    raise exc from None                                                                                                                    
  File "/usr/local/lib/python3.11/dist-packages/uvicorn/middleware/message_logger.py", line 80, in __call__                                                                                                                                                                            
    await self.app(scope, inner_receive, inner_send)                                                                                       
  File "/usr/local/lib/python3.11/dist-packages/fastapi/applications.py", line 1115, in __call__                                                                                                                                                                                       
    await super().__call__(scope, receive, send)                                                                                           
  File "/usr/local/lib/python3.11/dist-packages/starlette/applications.py", line 122, in __call__                                                                                                                                                                                      
    await self.middleware_stack(scope, receive, send)                                                                                      
  File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/errors.py", line 184, in __call__                                                                                                                                                                                 
    raise exc                                                                                                                              
  File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/errors.py", line 162, in __call__                                                                                                                                                                                 
    await self.app(scope, receive, _send)                                                                                                  
  File "/app/context_chat_backend/ocs_utils.py", line 75, in __call__                                                                      
    await self.app(scope, receive, send)                                                                                                   
  File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__                                                                                                                                                                              
    raise exc                                                                                                                              
  File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__                                                                                                                                                                              
    await self.app(scope, receive, sender)                                                                                                 
  File "/usr/local/lib/python3.11/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__                                                                                                                                                                            
    raise e                                                                                                                                
  File "/usr/local/lib/python3.11/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__                                                                                                                                                                            
    await self.app(scope, receive, send)                                                                                                   
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 718, in __call__                                                                                                                                                                                           
    await route.handle(scope, receive, send)                                                                                               
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 276, in handle                                                                                                                                                                                             
    await self.app(scope, receive, send)                                                                                                                                                                                                                                               
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 66, in app                                                                                      
    response = await func(request)                                                                                                                                          
               ^^^^^^^^^^^^^^^^^^^                                                                                                                                          
  File "/usr/local/lib/python3.11/dist-packages/fastapi/routing.py", line 274, in app                                                                                       
    raw_response = await run_endpoint_function(                                                                                                                             
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                             
  File "/usr/local/lib/python3.11/dist-packages/fastapi/routing.py", line 193, in run_endpoint_function                                                                     
    return await run_in_threadpool(dependant.call, **values)                                                                                                                
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                
  File "/usr/local/lib/python3.11/dist-packages/starlette/concurrency.py", line 41, in run_in_threadpool                                                                    
    return await anyio.to_thread.run_sync(func, *args)                                                                                                                      
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                      
  File "/usr/local/lib/python3.11/dist-packages/anyio/to_thread.py", line 33, in run_sync                                                                                   
    return await get_asynclib().run_sync_in_worker_thread(                                                                                                                                                 
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                 
  File "/usr/local/lib/python3.11/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread                                                                                       
    return await future                                                                              
           ^^^^^^^^^^^^                                                                                                                                                                                    
  File "/usr/local/lib/python3.11/dist-packages/anyio/_backends/_asyncio.py", line 807, in run                                                                                                                                                                                         
    result = context.run(func, *args)                                                                                                                                                                                                                   
             ^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                      
  File "/app/context_chat_backend/utils.py", line 73, in wrapper                                                            
    return func(*args, **kwargs)                                                                                                           
           ^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                       
  File "/app/context_chat_backend/controller.py", line 200, in _                                                                           
    result = embed_sources(db, sources)                                                                                                                                                                                                                                                
             ^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                    
  File "/app/context_chat_backend/chain/ingest/injest.py", line 167, in embed_sources                                                                                                                                                                                                  
    return _process_sources(vectordb, sources_filtered)                                                                                    
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                
  File "/app/context_chat_backend/chain/ingest/injest.py", line 120, in _process_sources                                                                                                                                                                                               
    filtered_docs = _filter_documents(user_id, vectordb, documents)                                                                                                                                                                                                                    
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                        
  File "/app/context_chat_backend/chain/ingest/injest.py", line 36, in _filter_documents                                                                                                                                                                                               
    existing_objects = vectordb.get_objects_from_metadata(                                                                                 
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                 
  File "/app/context_chat_backend/vectordb/chroma.py", line 92, in get_objects_from_metadata                                                                                                                                                                                           
    self.setup_schema(user_id)                                       
  File "/app/context_chat_backend/vectordb/chroma.py", line 47, in setup_schema                                                            
    self.client.get_or_create_collection(get_collection_name(user_id))                                                                     
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                      
  File "/app/context_chat_backend/vectordb/__init__.py", line 29, in get_collection_name                                                                                                                                                                                               
    raise AssertionError('Error: invalid user_id format, should consist of alphanumeric characters, hyphen, underscore, dot, and space only. Length should not exceed 56 characters.')  # noqa: E501
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Screenshots
If applicable, add screenshots to help explain your problem.

Setup Details (please complete the following information):

  • Nextcloud Version: 28.0.4
  • AppAPI Version: 2.4.0
  • Context Chat PHP Version: 2.0.2
  • Context Chat Backend Version 2.0.1
  • Nextcloud deployment method: All-in-one
  • Context Chat Backend deployment method: occ command

Additional context
I'm not sure if changing the user IDs would help, since even email addresses can get longer than 56 characters.

@telepath telepath added the bug Something isn't working label Apr 14, 2024
@kyteinsky
Copy link
Contributor

Hello, yeah that's an issue, we can't even consider 64 characters as the ceiling.
The most crude solution that comes to mind could be to hash the user id with something like md5 to get a 32 char unique derivation of the user id. Maybe we can come up with a better solution since this would definitely not be ideal. If we try to maintain backward compatibility, we would be hashing only usernames that cross the length limit, messing the db up further.

The user id is used to generate a table name with the format 'Vector_<userid>' btw. Chroma db is throwing the error that you see.

@telepath
Copy link
Author

Using a hash seems like a valid solution for a unique table name, but as you say, backward compatibility would be an issue. The clean solution would be a migration to hashes, but that makes downgrading impossible, of course...
Without a hash you could have generic table names and a lookup table.
In any case, the current method seems to be only valid for specific cases, so some other solution needs to be found, backward compatible or not...

@telepath
Copy link
Author

telepath commented Apr 18, 2024

After looking at the code I get the impression that using a hash would have been the easier solution from the start. It removes the need for all the checks, which aren't feasible anyway, since nobody will format their usernames after the requirements of an addon:

--- a/context_chat_backend/vectordb/__init__.py
+++ b/context_chat_backend/vectordb/__init__.py
@@ -1,5 +1,6 @@
 import re
 from importlib import import_module
+from hashlib import sha224
 
 from .base import BaseVectorDB, MetadataFilter
 
@@ -25,25 +26,7 @@ def get_collection_name(user_id: str) -> str:
        if user_id in user_id_cache:
                return user_id_cache[user_id]
 
-       if not re_user_id.match(user_id):
-               raise AssertionError('Error: invalid user_id format, should consist of alphanumeric characters, hyphen, underscore, dot, and space only. Length should not exceed 56 characters.')  # noqa: E501
-
-       # should not end in a special character
-       if user_id[-1] in '_.-@ ':
-               raise AssertionError('Error: user_id should not end in a special character.')
-
-       # replace space with double underscore
-       user_id = user_id.replace(' ', '__')
-       # replace consecutive dots with .n. (n is the number of dots)
-       user_id = re.sub(r'\.{2,}', lambda m: f'.{len(m.group())}.', user_id)
-       # replace @ with .at.
-       user_id = user_id.replace('@', '.at.')
-
-       # recheck length constraints
-       if len(user_id) > 56:
-               raise AssertionError(f'Error: length of cleaned up user_id should not exceed 56 characters, processed username: {user_id}.')  # noqa: E501
-
-       collection_name = f'Vector_{user_id}'
+       collection_name = f'Vector_{sha224(user_id).hexdigest()}'
        user_id_cache[user_id] = collection_name
 
        return collection_name

Of course this affects get_user_id_from_collection(), but on first glance I can't see that it is used anywhere apart from testing.

So to me it looks like this would be the better way to do it. And regarding backward compatibility, it is still better to break it now than later.

The only remaining issue is the possibility of collisions, but that's only a theoretical problem and will be vastly more rare than the current issue.

@kyteinsky
Copy link
Contributor

Hello, I discussed this issue with senior developers, they mention the same issues you do like the collision and maintenance of reverse lookup table for user ids.

We think it would be nice to use this opportunity to change the schema to support ACLs (and common documents) which would prevent duplication of document chunks and embeddings for shared documents. We plan on making two collections. One for vectors and the other for ACLs. This is due to the fact that Chroma DB does not support list type of metadata and metadata filters don't match on partial strings afaik. We could drop Chroma and just use Weaviate but the latter does not support schema change and we would like to have that for now, given the dynamic nature of this software at this stage. We're looking into the finer details but that is the general direction we think would solve both the issues and be future proof for a while.

Of course this affects get_user_id_from_collection(), but on first glance I can't see that it is used anywhere apart from testing.

It is used to get all the users to delete a metadata filtered set of documents (https://github.com/nextcloud/context_chat_backend/blob/master/context_chat_backend/vectordb/base.py#L212, https://github.com/nextcloud/context_chat_backend/blob/master/context_chat_backend/vectordb/chroma.py#L41). (necessary for custom data provider support)

And regarding backward compatibility, it is still better to break it now than later.

yup

Copy link

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label May 25, 2024
@kyteinsky kyteinsky removed the stale label May 27, 2024
Copy link

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Jun 27, 2024
@kyteinsky kyteinsky removed the stale label Jun 27, 2024
Copy link

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Jul 28, 2024
@kyteinsky kyteinsky removed the stale label Jul 29, 2024
Copy link

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Aug 29, 2024
@kyteinsky kyteinsky removed the stale label Aug 29, 2024
@telepath
Copy link
Author

Is there any update on the progress? I'd really love to do some testing ;).

@kyteinsky
Copy link
Contributor

Hi, yes, we'll be working on this in the coming or so week. We'll keep you informed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants