feat(components): integrate Spacy NLP toolset into Langflow component… #4733

Status: Open, 1 commit to merge into base: main

raphaelchristi

SpaCy Components Integration

This PR integrates SpaCy's NLP capabilities into Langflow as a set of components for building text processing and analysis workflows.

🎯 Core Components

Language Model Management

  • SpacyModel
    • Base component for SpaCy language models
    • Supports 20+ languages, including English, German, French, and Spanish
    • Automatic model download and initialization
    • Multiple model sizes (sm, md, lg) per language
    • Configurable entity merging
    • Pipeline component management
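
As a rough illustration of the model handling listed above, the sketch below shows one way to load a pipeline with a download fallback, and to toggle entity merging as a pipeline component. It is not the PR's implementation: `load_or_download` is a hypothetical helper name, and the demonstration uses a blank English pipeline with a single hand-written pattern so that nothing has to be downloaded.

```python
import spacy
from spacy.cli import download
from spacy.language import Language


def load_or_download(name: str) -> Language:
    """Hypothetical helper: load an installed pipeline, downloading it if missing."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the pipeline package is not installed.
        download(name)
        return spacy.load(name)


# Demonstration on a blank pipeline (assumption: no trained model available).
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "New York"}])
nlp.add_pipe("merge_entities")  # collapses each entity span into one token

merged = [token.text for token in nlp("She moved to New York")]

# Pipeline component management: switch the merging step off temporarily.
with nlp.select_pipes(disable=["merge_entities"]):
    unmerged = [token.text for token in nlp("She moved to New York")]

print(nlp.pipe_names)
print(merged)
print(unmerged)
```

With a trained pipeline, a call such as `load_or_download("en_core_web_sm")` would take the place of `spacy.blank("en")`.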

Entity Processing

  • EntityRecognizer

    • Named Entity Recognition (NER)
    • Built-in entity types (PERSON, ORG, DATE, etc.)
    • Entity context extraction
    • Sentence-level entity tracking
    • Confidence scoring
    • Detailed entity metadata
  • EntityRuler

    • Pattern-based entity recognition
    • Custom rule definition
    • Regex pattern support
    • Phrase pattern matching
    • Entity pattern priorities
    • Rule-based entity labeling
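
The pattern types listed for EntityRuler can be sketched with spaCy's own `entity_ruler` pipe. This is a minimal example under stated assumptions, not the component's code: the labels `ORG` and `VERSION`, the patterns, and the sample sentence are made up for illustration, and a blank English pipeline stands in for a connected model.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Phrase pattern: matches the exact string.
    {"label": "ORG", "pattern": "Langflow"},
    # Token pattern using a regular expression on the lowercased token text.
    {"label": "VERSION", "pattern": [{"LOWER": {"REGEX": r"^v\d+$"}}]},
])

doc = nlp("Langflow adds spaCy v3 components.")

# Entity metadata: surface text, label, and character offsets.
entities = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]
print(entities)
```

The character offsets are the kind of metadata a downstream component could use to extract the context around each entity.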

Text Analysis

  • DependencyMatcher

    • Syntactic pattern matching
    • Relationship extraction
    • Subject-Verb-Object detection
    • Custom dependency rules
    • Active/Passive voice identification
    • Complex pattern definitions
  • TextCategorizer

    • Single-label classification (textcat)
    • Multi-label classification (textcat_multilabel)
    • Configurable threshold settings
    • Confidence scoring
    • Custom category management
    • Binary and multi-class support
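
For the TextCategorizer settings above, spaCy exposes category scores as a label-to-score dictionary on `doc.cats`. The sketch below shows how a configurable threshold might be applied to such scores. The helper names and the example scores are hypothetical, and no trained categorizer is involved.

```python
from typing import Dict, List


def labels_above_threshold(cats: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label case: keep every label whose score reaches the threshold."""
    kept = [label for label, score in cats.items() if score >= threshold]
    return sorted(kept, key=lambda label: cats[label], reverse=True)


def top_label(cats: Dict[str, float]) -> str:
    """Single-label case: pick the highest-scoring label."""
    return max(cats, key=cats.get)


# Made-up scores in the shape of doc.cats from a multi-label categorizer.
scores = {"POSITIVE": 0.82, "NEGATIVE": 0.11, "URGENT": 0.64}

print(labels_above_threshold(scores, threshold=0.5))
print(top_label(scores))
```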

Text Processing

  • Lemmatizer

    • Rule-based and lookup lemmatization
    • Custom abbreviation handling
    • Multiple lemmatization modes
    • Whitespace preservation
    • Part-of-speech aware lemmatization
    • Custom dictionary support
  • Sentencizer

    • Advanced sentence segmentation
    • RAG-optimized chunking
    • Automatic abbreviation detection
    • Custom punctuation rules
    • Quote-aware segmentation
    • Multi-language support
  • Tagger

    • Part-of-speech tagging (POS)
    • Fine-grained tags (TAG)
    • Dependency parsing (DEP)
    • Morphological analysis
    • Custom tag sets
    • Detailed token attributes
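
Of the text-processing components above, sentence segmentation is the easiest to try without a trained model. The sketch below uses spaCy's rule-based `sentencizer` pipe, first with its defaults and then with a custom punctuation list. The sample texts are invented, and the PR's Sentencizer component may configure the pipe differently.

```python
import spacy

# Default rule-based segmentation.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp(
    "The researchers were running multiple studies. "
    "The robots gathered the toys! "
    "Were they programmed to recognize objects?"
)
sentences = [sent.text for sent in doc.sents]

# Custom punctuation rules: also treat ';' as a sentence boundary.
nlp_custom = spacy.blank("en")
nlp_custom.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?", ";"]})
custom_sentences = [sent.text for sent in nlp_custom("Load the model; run the pipeline.").sents]

print(sentences)
print(custom_sentences)
```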

🔍 Example Flows

Lemmatizer Flow

Test text:

The researchers were running multiple groundbreaking studies while the automated 
systems continuously processed the incoming data. Children's toys scattered 
across the floor were quickly gathered by the cleaning robots, which had been 
programmed to recognize various objects.

Download Lemmatizer Flow JSON

Dependency Matcher Flow

Pattern Example:

[
    {
        "RIGHT_ID": "verb",
        "RIGHT_ATTRS": {"POS": "VERB"}
    },
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"}
    }
]
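
A pattern of this shape can be tried directly with spaCy's `DependencyMatcher`. The sketch below is illustrative only: to avoid needing a trained parser, it builds a three-word `Doc` with hand-assigned part-of-speech tags, heads, and dependency labels, which is an assumption made for the example rather than how the component obtains its parse.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse (assumption): "gather" is the root verb,
# "Robots" its subject, "toys" its object.
doc = Doc(
    nlp.vocab,
    words=["Robots", "gather", "toys"],
    pos=["NOUN", "VERB", "NOUN"],
    heads=[1, 1, 1],
    deps=["nsubj", "ROOT", "dobj"],
)

pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",  # "verb" is the immediate head of "subject"
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("VERB_SUBJECT", [pattern])

pairs = []
for match_id, token_ids in matcher(doc):
    # token_ids follow the order of the pattern: verb first, then subject.
    verb, subject = (doc[i] for i in token_ids)
    pairs.append((subject.text, verb.text))

print(pairs)
```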

Download Dependency Matcher Flow JSON

Sentencizer Flow

Features:

Text Categorizer Flow

Supports:

Tagger Flow

Tag types:

Entity Ruler Flow

Pattern types:

Entity Recognizer Flow

Entity types:

🛠️ Technical Details

Implementation Features

  • Full integration with Langflow's component architecture
  • Comprehensive error handling and validation
  • Efficient batch processing capabilities
  • Dynamic configuration options
  • Extensive type checking
  • Memory-efficient processing

📊 Sample Data

🔗 Related Resources

👥 Contributors

📃 License

  • MIT License (same as Langflow)

@dosubot dosubot bot added the size:XXL (this PR changes 1000+ lines, ignoring generated files) and enhancement labels Nov 20, 2024
@github-actions github-actions bot added enhancement and removed enhancement labels Nov 20, 2024


Copilot reviewed 5 out of 11 changed files in this pull request and generated no suggestions.

Files not reviewed (6)
  • src/backend/base/langflow/components/spacy/lemmatizer.py: Evaluated as low risk
  • src/frontend/src/utils/styleUtils.ts: Evaluated as low risk
  • src/frontend/src/constants/constants.ts: Evaluated as low risk
  • src/backend/base/langflow/components/spacy/__init__.py: Evaluated as low risk
  • src/backend/base/langflow/components/spacy/entity_ruler.py: Evaluated as low risk
  • src/backend/base/langflow/components/spacy/spacy_model.py: Evaluated as low risk
Comments skipped due to low confidence (3)

src/backend/base/langflow/components/spacy/entity_recognizer.py:42

  • Add a None check before the type check to avoid potential AttributeError.
if not isinstance(self.spacy_model, Language):

src/backend/base/langflow/components/spacy/tagger.py:94

  • The error message 'Invalid SpaCy model. Please connect a valid SpaCy Model component.' could be more specific by including the expected type or value.
raise ValueError("Invalid SpaCy model. Please connect a valid SpaCy Model component.")

src/backend/base/langflow/components/spacy/dependency_matcher.py:10

  • The class name SpacyPatternMatcher should be consistent with the naming convention used in other components, i.e., SpaCyPatternMatcher.
class SpacyPatternMatcher(Component):
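
The first two skipped comments both concern how the connected model is validated. One possible shape for that check is sketched below. `validate_spacy_model` is a hypothetical helper, not code from this PR. Note that `isinstance(None, Language)` evaluates to `False` without raising, so the explicit `None` branch mainly serves to give a more specific message.

```python
import spacy
from spacy.language import Language


def validate_spacy_model(model: object) -> Language:
    """Hypothetical helper: return the model if it is a spaCy Language object."""
    if model is None:
        raise ValueError(
            "No SpaCy model connected. Please connect a SpaCy Model component."
        )
    if not isinstance(model, Language):
        raise ValueError(
            "Invalid SpaCy model: expected spacy.language.Language, "
            f"got {type(model).__name__}."
        )
    return model


nlp = validate_spacy_model(spacy.blank("en"))
print(type(nlp).__name__)
```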