
Langchain_demo: why is the output response a list instead of a str, causing an error #71

Open
zhangzhili1112 opened this issue Sep 11, 2024 · 1 comment

@zhangzhili1112

[Quoted context]
[[citation:1]]

```python
    # embed the chunks in batches
    batch_size = args.batch_size
    for i in tqdm(range(0, len(chunks), batch_size), desc="vectorizing"):
        try:
            vector_store.add_documents(chunks[i:i + batch_size])
        except Exception as e:
            print(f"Vectorization of a batch failed: {e}")

    # save the embedded vectors
    output_path = args.output_path
    os.makedirs(output_path, exist_ok=True)
    vector_store.save_local(output_path)
    print(f"Vectorization complete, saved to {output_path}")
```

[[citation:2]]

![](../resources/logo.jpeg)

[English](README.md) | [中文](README_zh.md)

## RAG Features

CodeGeeX4 supports RAG retrieval augmentation and is compatible with the Langchain framework, enabling project-level retrieval Q&A.

## Usage Tutorial

### 1. Install dependencies

```bash
cd langchain_demo
pip install -r requirements.txt
```

### 2. Configure the Embedding API Key

This project uses the Embedding API of the ZhipuAI open platform for vectorization. Please register first and obtain an API Key.

Then configure the API Key in models/embedding.py.

For details, see https://open.bigmodel.cn/dev/api#text_embedding
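
The repo's models/embedding.py is not quoted in this issue; a hypothetical sketch of what the configuration could look like, assuming the official zhipuai v2 SDK and its embedding-2 model (names not confirmed by this issue):

```python
from zhipuai import ZhipuAI

# Hypothetical: the actual variable names in models/embedding.py may differ.
API_KEY = "your-api-key"  # obtained from https://open.bigmodel.cn
client = ZhipuAI(api_key=API_KEY)

def embed(text: str) -> list[float]:
    # embedding-2 is ZhipuAI's text-embedding model at the time of writing.
    resp = client.embeddings.create(model="embedding-2", input=text)
    return resp.data[0].embedding
```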

### 3. Generate the vector data

```bash
python vectorize.py --workspace . --output_path vectors
```

```
>>> Vectorization complete, saved to vectors
```

[[citation:3]]
```python
def vectorize(files: list[str], args):
    # split file into chunks
    chunks = []
    for file in tqdm(files, desc="文件切分"):
        chunks.extend(split_into_chunks(file, args.chunk_size, args.overlap_size))

    # initialize the vector store
    vector_store = FAISS(
        embedding_function=embed_model,
        index=dependable_faiss_import().IndexFlatL2(embed_model.embedding_size),
        docstore=InMemoryDocstore(),
        index_to_docstore_id={},
    )
```
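
For intuition, `split_into_chunks` is referenced above but not shown in this excerpt. A hypothetical character-based chunker with the same `chunk_size`/`overlap_size` semantics (not the repo's actual implementation) could look like:

```python
from langchain_core.documents import Document

def split_into_chunks(path: str, chunk_size: int, overlap_size: int) -> list[Document]:
    # Hypothetical sketch: read the file and emit fixed-size windows that
    # overlap by overlap_size characters so context carries across chunks.
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    step = max(chunk_size - overlap_size, 1)
    return [
        Document(page_content=text[i:i + chunk_size], metadata={"source": path})
        for i in range(0, len(text), step)
    ]
```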

[[citation:4]]

```python
if __name__ == '__main__':
    args = parse_arguments()
    files = traverse(args.workspace)
    vectorize(files, args)
```
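
Assuming `parse_arguments` exposes the fields used above (`workspace`, `output_path`, `batch_size`, `chunk_size`, `overlap_size`) as CLI flags, a run with explicit chunking parameters might look like:

```bash
# Hypothetical flags, inferred from the args.* attributes used in vectorize()
python vectorize.py --workspace . --output_path vectors \
    --batch_size 16 --chunk_size 512 --overlap_size 64
```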

Q: How does this project implement file vectorization?

Why is the output response a list instead of a str? The result is as follows:

```python
{'name': 'This project implements file vectorization through the following steps:', 'content': '\n1. First, the project splits files into chunks according to the given parameters (such as batch_size and chunk_size). This is done by the split_into_chunks function, which splits a file into chunks based on the given chunk size and overlap size[[citation:3]].\n\n2. Next, the project initializes a vector store. The vector store uses the FAISS library, a library for efficient similarity search and clustering over large collections of N-dimensional vectors. Initializing the vector store involves specifying the embedding function, the index, and the docstore[[citation:3]].\n\n3. Then, the project adds the split chunks to the vector store via the vector_store.add_documents method. This method calls the embedding function to convert each chunk into a vector and adds those vectors to the vector store[[citation:1]].\n\n4. Finally, the project saves the vector store to the local file system via the vector_store.save_local method, which writes the store to the specified output path[[citation:1]].\n\nIn short, the project implements file vectorization by splitting files into chunks, converting each chunk into a vector, and storing those vectors in a vector store.'}
```

@zhangzhili1112 changed the title from "Langchain_demo" to "Langchain_demo: why is the output response a list instead of a str, causing an error" on Sep 11, 2024
@zhangzhili1112 (Author) commented Sep 11, 2024

The `"\n"` in `response` always causes the content to be split:
```python
import json
from copy import deepcopy

def process_response(self, output, history):
    content = ""
    history = deepcopy(history)
    for response in output.split("<|assistant|>"):
        # Everything before the first "\n" is treated as metadata
        # (e.g. a tool name); the remainder is the reply body.
        if "\n" in response:
            metadata, content = response.split("\n", maxsplit=1)
        else:
            metadata, content = "", response
        if not metadata.strip():
            content = content.strip()
            history.append({"role": "assistant", "metadata": metadata, "content": content})
            content = content.replace("[[训练时间]]", "2023年")
        else:
            history.append({"role": "assistant", "metadata": metadata, "content": content})
            if history[0]["role"] == "system" and "tools" in history[0]:
                parameters = json.loads(content)
                content = {"name": metadata.strip(), "parameters": parameters}
            else:
                # Whenever metadata is non-empty, content is returned as a
                # dict instead of a str -- this is the branch the issue hits.
                content = {"name": metadata.strip(), "content": content}
    return content, history
```
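
Downstream code that expects a plain string can normalize the return value. A minimal workaround sketch (a hypothetical helper, not part of the repo), assuming `content` is either a `str` or a dict shaped like the output above:

```python
def normalize_content(content) -> str:
    # Hypothetical helper: flatten the dict form produced by process_response
    # back into a single string before further processing.
    if isinstance(content, dict):
        return f"{content.get('name', '')}\n{content.get('content', '')}".strip()
    return content
```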
