add mecab keywords handler for japanese #12311

Draft
wants to merge 9 commits into
base: main

@KMerdan (Contributor) commented Jan 2, 2025

Summary

Added Japanese keyword extraction support using the MeCab morphological analyzer. This lets Dify handle Japanese text better in RAG applications by providing accurate keyword extraction, compound word detection, and proper noun recognition.

Key features:

  • Japanese text keyword extraction using MeCab
  • Configurable part-of-speech weighting
  • Compound word detection (e.g., "自然言語処理", "機械学習")
  • Support for custom dictionaries
  • Comprehensive Japanese stopwords list
  • Mixed Japanese-English text handling
  • Reading normalization for different word forms

This implementation follows the same pattern as the existing Jieba implementation for Chinese, making it easy to integrate and maintain.
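
For readers unfamiliar with the approach, the part-of-speech weighting and compound word detection listed above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `POS_WEIGHTS`, `STOPWORDS`, `tokenize`, and `extract_keywords` are hypothetical names, the weights and the compound bonus are made up, and the POS labels assume an IPADIC-style dictionary where the first feature field is the part of speech.

```python
from collections import Counter

# Hypothetical weights keyed by the first IPADIC part-of-speech field.
POS_WEIGHTS = {"名詞": 1.0, "動詞": 0.5, "形容詞": 0.5}
# Tiny illustrative stopword set; the PR ships its own, larger list.
STOPWORDS = {"する", "ある", "いる", "こと", "もの"}


def tokenize(text):
    """Return (surface, pos) pairs; needs mecab-python3 and an installed dictionary."""
    import MeCab  # imported lazily so the scoring code below runs without MeCab

    tagger = MeCab.Tagger()
    node = tagger.parseToNode(text)
    tokens = []
    while node:
        if node.surface:  # BOS/EOS nodes have an empty surface
            tokens.append((node.surface, node.feature.split(",")[0]))
        node = node.next
    return tokens


def extract_keywords(tokens, top_k=5):
    """Score tokens by POS weight and join runs of adjacent nouns into compounds."""
    scores = Counter()
    run = []  # consecutive nouns that may form a compound word

    def flush():
        if len(run) > 1:
            # Assumed bonus: a compound scores the noun weight times its length.
            scores["".join(run)] += POS_WEIGHTS["名詞"] * len(run)
        run.clear()

    for surface, pos in tokens:
        if pos == "名詞" and surface not in STOPWORDS:
            run.append(surface)
        else:
            flush()
        weight = POS_WEIGHTS.get(pos, 0.0)
        if weight > 0 and surface not in STOPWORDS:
            scores[surface] += weight
    flush()
    return [word for word, _ in scores.most_common(top_k)]
```

Because `extract_keywords` takes pre-tokenized `(surface, pos)` pairs, it can be exercised without MeCab installed; with MeCab available, `extract_keywords(tokenize(text))` would chain the two steps.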

Dependencies:

  • MeCab and Python bindings (mecab-python3)
  • System dictionaries (mecab-ipadic-utf8 or similar)
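
Since both the Python bindings and a system dictionary must be present, a deployment might want to probe for them before enabling the handler. The sketch below is one possible check, not part of this PR; `mecab_available` is a hypothetical helper, and it assumes that recent mecab-python3 releases raise `RuntimeError` from `MeCab.Tagger()` when no dictionary can be found.

```python
import importlib.util


def mecab_available():
    """Return True if the MeCab bindings import and a Tagger can be constructed."""
    if importlib.util.find_spec("MeCab") is None:
        return False  # mecab-python3 is not installed
    import MeCab

    try:
        MeCab.Tagger()  # fails if no dictionary (e.g. mecab-ipadic-utf8) is found
    except RuntimeError:
        return False
    return True
```

A caller could fall back to another keyword handler when this returns `False`.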

Resolves #12204

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues
  • I've added tests for the MeCab implementation
  • I've updated the documentation
  • I ran dev/reformat and fixed all linting issues

Development

Successfully merging this pull request may close these issues.

Add Japanese Text keyword extractor such as MeCab/Janome