fix: split text keep separator #7930

Merged: 1 commit, Sep 4, 2024

@Sumkor (Contributor) commented Sep 3, 2024

Checklist:

Important

Please review the checklist below before submitting your pull request.

  • Please open an issue before creating a PR or link to an existing issue
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

Description

Describe the big picture of your changes here to communicate to the maintainers why we should accept this pull request. If it fixes a bug or resolves a feature request, be sure to link to that issue. Close issue syntax: Fixes #<issue number>, see documentation for more details.

Fixes #7929

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update, included: Dify Document
  • Improvement, including but not limited to code refactoring, performance optimization, and UI/UX improvement
  • Dependency upgrade

Testing Instructions

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

from core.rag.splitter.fixed_text_splitter import EnhanceRecursiveCharacterTextSplitter

character_splitter = EnhanceRecursiveCharacterTextSplitter.from_encoder(
    chunk_size=512,
    chunk_overlap=50,
    separators=None,
    embedding_model_instance=None
)

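# The sample text is intentionally left in Chinese: the linked issue concerns
# chunks that begin with the Chinese full stop "。".
text = '''## 程序员和其他职业的区别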

如果从纯粹外行的角度去简单的分辨,程序员最大的特征就是薪资很高。  
  
比起其他行业,动辄月薪一两万,年薪二三十万的程序员确实太香了。在薪资较高的那些行业中,比起对身份和背景有要求的金融,对学历和资历有要求医学,程序员既不需要你有不俗的家世背景去拉资源,也不需要你付出动辄十几年的努力去苦苦熬经验,几乎是普通家庭的孩子想要追高薪资的最佳选择。  
  
从某种意义上说,程序员的“前途”,是一种“钱途“。  
  
很多人会有这样的疑惑:

凭什么程序员这么有钱途?

如果从入行的角度简单的观察一下,程序员和其他非技术部门的工作日常,你就会发现,程序员的工作难度是最高的。  
  
突发的 bug 需要立即修复,难以用现有技术和知识去解决的复杂的业务场景,永远在学习,永远在更迭最新的工具和技能树是程序员的常态。  
  
比起文职类的简单工作,或是流水线工人的按部就班,作为程序员,花上一天时间去解决一个莫名其妙的报错,你此前学过的所有可能都无法派上用场,面对这些情况也许是家常便饭。  
  
修改不起作用的挫败,业务必须要完成的压力,不停 battle 强人所难的需求,处理前人留下的”屎山“代码,所有的这些对身心的双重折磨,都造就了程序员为什么薪资这么高的原因。  
  
本质上就是,难度越高的,赏金就越多。  
  
如果你打算要成为程序员,以上这些你都要做好准备。'''


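# Split with an explicit separator list; avoid naming the result `list`,
# which would shadow the built-in type.
chunks = character_splitter._split_text(text, ["\n\n", "。", ". ", " ", ""])

# Print each chunk with its index; repr() makes leading characters easy to spot.
for index, chunk in enumerate(chunks):
    print(index, repr(chunk))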
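
For readers without a Dify checkout, the behavior named in the PR title ("split text keep separator") can be sketched in isolation. This is a minimal illustration, not the code changed in this PR: the helper name `split_keep_separator` and the sample sentence are hypothetical, and it only assumes that the separator should stay attached to the end of the piece it terminates, so that no piece begins with "。".

```python
import re
from typing import List


def split_keep_separator(text: str, separator: str) -> List[str]:
    """Split `text` on `separator`, keeping each separator at the END of the
    piece that precedes it (hypothetical helper, not Dify's implementation)."""
    # A capturing group makes re.split return the separators too:
    # [piece0, sep, piece1, sep, ..., last_piece]
    parts = re.split(f"({re.escape(separator)})", text)
    # Glue every separator (odd index) onto the piece before it (even index).
    pieces = [parts[i - 1] + parts[i] for i in range(1, len(parts), 2)]
    # The final element is the text after the last separator (may be empty).
    pieces.append(parts[-1])
    # Drop empty pieces, e.g. when the text ends with the separator.
    return [piece for piece in pieces if piece]


sample = "第一句。第二句。第三句"
chunks = split_keep_separator(sample, "。")
print(chunks)  # ['第一句。', '第二句。', '第三句']
```

Attaching the separator to the preceding piece keeps sentence-ending punctuation with its sentence; attaching it to the following piece instead would produce the symptom described in the linked issue title.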

@dosubot (bot) added the size:XS and 🐞 bug labels Sep 3, 2024
@crazywoola requested a review from JohnJyong September 3, 2024 10:38
@crazywoola (Member) left a comment

LGTM

@dosubot (bot) added the lgtm label Sep 4, 2024
@crazywoola merged commit 571415d into langgenius:main Sep 4, 2024
6 checks passed
mehrajagdish pushed a commit to Sbazar-GmbH/dify that referenced this pull request Sep 6, 2024
cuiks pushed a commit to cuiks/dify that referenced this pull request Sep 26, 2024
lau-td pushed a commit to heydevs-io/dify that referenced this pull request Oct 23, 2024
idonotknow pushed a commit to AceDataCloud/Dify that referenced this pull request Nov 16, 2024
Labels
  • 🐞 bug: Something isn't working
  • lgtm: This PR has been approved by a maintainer
  • size:XS: This PR changes 0-9 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Chinese chunk begins with a full-stop
2 participants