add mecab keywords handler for japanese #12311

KMerdan · 2025-01-02T10:35:07Z

Summary

Added Japanese keyword extraction using the MeCab morphological analyzer. This lets Dify handle Japanese text better in RAG applications through more accurate keyword extraction, compound word detection, and proper noun recognition.

Key features:

Japanese text keyword extraction using MeCab
Configurable part-of-speech weighting
Compound word detection (e.g., "自然言語処理", "機械学習")
Support for custom dictionaries
Comprehensive Japanese stopwords list
Mixed Japanese-English text handling
Reading normalization for different word forms
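
To make the weighting and compound-detection items concrete, here is a minimal sketch rather than the code in this PR. The weight values, the stopword set, and the function names (`merge_compounds`, `extract_keywords`, `tokenize_with_mecab`) are assumptions made for illustration. The scoring works on `(surface, part-of-speech)` pairs, so it can be read and run without MeCab installed. The optional tokenizer at the end assumes the default `surface<TAB>features` output of an ipadic-style dictionary.

```python
from collections import Counter

# Hypothetical part-of-speech weights; the PR does not list its actual values.
POS_WEIGHTS = {"名詞": 1.0, "動詞": 0.5, "形容詞": 0.5}
# Tiny illustrative stopword set, not the list shipped with the PR.
STOPWORDS = {"こと", "もの", "する", "ある"}


def merge_compounds(tokens):
    """Join runs of consecutive nouns into a single compound noun token."""
    merged, run = [], []
    for surface, pos in tokens:
        if pos == "名詞":
            run.append(surface)
            continue
        if run:
            merged.append(("".join(run), "名詞"))
            run = []
        merged.append((surface, pos))
    if run:
        merged.append(("".join(run), "名詞"))
    return merged


def extract_keywords(tokens, max_keywords=10):
    """Score tokens by part-of-speech weight and return the top keywords."""
    scores = Counter()
    for surface, pos in merge_compounds(tokens):
        weight = POS_WEIGHTS.get(pos, 0.0)
        if weight == 0.0 or surface in STOPWORDS:
            continue
        scores[surface] += weight
    return [word for word, _ in scores.most_common(max_keywords)]


def tokenize_with_mecab(text):
    """Optional: produce (surface, pos) pairs with mecab-python3, if installed."""
    import MeCab  # imported lazily so the scoring above works without MeCab

    tokens = []
    for line in MeCab.Tagger().parse(text).splitlines():
        if line == "EOS" or "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        tokens.append((surface, features.split(",")[0]))
    return tokens


sample = [
    ("自然", "名詞"), ("言語", "名詞"), ("処理", "名詞"), ("は", "助詞"),
    ("機械", "名詞"), ("学習", "名詞"), ("を", "助詞"), ("使う", "動詞"),
]
print(extract_keywords(sample))
```

With the hand-tokenized sample, the noun runs are merged first, so the result is `['自然言語処理', '機械学習', '使う']`; particles receive no weight and are dropped.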

This implementation follows the same pattern as the existing Jieba implementation for Chinese, making it easy to integrate and maintain.
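
As a rough picture of what "the same pattern" could mean, the sketch below shows a handler exposing a single `extract_keywords(text, max_keywords_per_chunk)` method that returns a set of keywords. The class name, the method signature, and the injected tokenizer are assumptions for illustration and are not taken from this PR or from Dify's source. The regex tokenizer is only a stand-in that splits on script boundaries so the example runs without MeCab; a real handler would call the analyzer instead.

```python
import re
from collections import Counter

# Stand-in tokenizer: ASCII runs, kanji/katakana runs, and hiragana runs.
# This is NOT morphological analysis; it only keeps the example self-contained.
_SCRIPT_RUNS = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff\u30a0-\u30ff]+|[\u3040-\u309f]+")


def script_run_tokenizer(text):
    return _SCRIPT_RUNS.findall(text)


class JapaneseKeywordHandlerSketch:
    """Hypothetical handler shape with a pluggable tokenizer."""

    def __init__(self, tokenizer=script_run_tokenizer, stopwords=None):
        self._tokenizer = tokenizer
        self._stopwords = set(stopwords or ())

    def extract_keywords(self, text, max_keywords_per_chunk=10):
        counts = Counter()
        for token in self._tokenizer(text):
            # Lowercase ASCII tokens so "Dify" and "dify" count as one keyword.
            if token.isascii():
                token = token.lower()
            if not token or token in self._stopwords:
                continue
            counts[token] += 1
        return {word for word, _ in counts.most_common(max_keywords_per_chunk)}


handler = JapaneseKeywordHandlerSketch(stopwords={"は", "を"})
print(handler.extract_keywords("Difyは自然言語処理をサポート"))
```

Keeping the tokenizer as a constructor argument means the same handler can be exercised in tests with a stub and backed by MeCab in production.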

Dependencies:

MeCab and Python bindings (mecab-python3)
System dictionaries (mecab-ipadic-utf8 or similar)
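
Because these dependencies live outside the Python package itself, a start-up check can help. The following is a small sketch with hypothetical function names. It only tests whether the `MeCab` module installed by mecab-python3 is importable; it does not verify that a system dictionary is present.

```python
import importlib.util


def mecab_available():
    """Return True if the MeCab Python module can be found on this system."""
    return importlib.util.find_spec("MeCab") is not None


def describe_mecab_setup():
    """Return a short human-readable status message."""
    if not mecab_available():
        return "MeCab bindings not found; install mecab-python3 and a system dictionary."
    return "MeCab bindings found."


print(describe_mecab_setup())
```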

Resolves #12204

Checklist

This change requires a documentation update, included: Dify Document
I understand that this PR may be closed in case there was no previous discussion or issues
I've added tests for the MeCab implementation
I've updated the documentation
I ran dev/reformat and fixed all linting issues


KMerdan added 9 commits January 2, 2025 18:50

add mecab keywords handler for japanese

81c5953

linting

77030d7

fix lint

610d069

improve the consistancy

75dd867

[WIP] before final test

4f5a4e7

add type annotation

2f6bfe8

[WIP]for test perpose only

7bdedbe

Revert "[WIP]for test perpose only"

6d7eb67

This reverts commit 7bdedbe.

[WIP] add type ignore

b94dca5
