A powerful desktop application for extracting and analyzing content from web URLs. Built with Python and Tkinter, this tool provides a user-friendly interface for processing multiple URLs simultaneously, extracting key information, and saving results locally.
If you find this tool useful, please consider supporting its development by buying me a coffee. Your support helps me continue improving and maintaining the tool, and is greatly appreciated!
- **URL Processing**: Process multiple URLs simultaneously with a queue-based system
- **Content Analysis**: Extract titles, keywords, and summaries, and generate relevant hashtags
- **Progress Tracking**: Real-time status updates and progress monitoring for each task
- **Auto-save**: Automatically save processed content to local files
- **Task Management**: Pause, restart, or review completed tasks
- **Configurable Settings**: Customize the save directory and auto-save preferences
- Python 3.7 or higher
- tkinter (usually comes with Python)
- Ollama for running the LLaMA model
- Install Ollama based on your operating system:

  **Linux:**

  ```bash
  curl https://ollama.ai/install.sh | sh
  ```

  **macOS:**

  ```bash
  brew install ollama
  ```

  **Windows:** Download and run the installer from Ollama's official website
- Start the Ollama service:

  ```bash
  ollama serve
  ```
- Pull the LLaMA model:

  ```bash
  ollama pull llama2
  ```
- Verify the installation:

  ```bash
  ollama list
  ```

  You should see `llama2` in the list of available models.
- Configure the application to use Ollama:
  - The application is pre-configured to use "llama3.2" as the model name
  - Update the model name in the `ContentExtractor` initialization if you are using a different model version (for example, the `llama2` model pulled above)
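As a rough sketch of what that looks like (the real `ContentExtractor` constructor in this project may take different parameters; the names below are illustrative), switching models is just a matter of passing the right name:

```python
# Hypothetical sketch of the ContentExtractor initialization; the
# actual constructor signature in this project may differ.
class ContentExtractor:
    def __init__(self, model: str = "llama3.2"):
        # Name of the Ollama model used for content analysis
        self.model = model

# Pass the model you actually pulled, e.g. llama2:
extractor = ContentExtractor(model="llama2")
```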
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd content-extractor
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Ensure Ollama is running:
- Start Ollama service if not already running
- Verify the LLaMA model is available
- Start the application:

  ```bash
  python content_extractor_gui.py
  ```
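Before launching, you can confirm programmatically that the Ollama service is reachable. This sketch (the helper name is our own) queries Ollama's HTTP API at `GET /api/tags`, the endpoint behind `ollama list`:

```python
import json
import urllib.request

def list_ollama_models(base_url="http://localhost:11434"):
    """Return installed Ollama model names, or None if the service is unreachable."""
    try:
        # GET /api/tags is the endpoint behind `ollama list`
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except OSError:
        return None  # service not running or not reachable

models = list_ollama_models()
if models is None:
    print("Ollama is not running; start it with `ollama serve`.")
```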
- **Configure Settings:**
  - Click "Browse" to set your preferred project directory
  - Toggle the auto-save option as needed
- **Process URLs:**
  - Enter a URL in the input field
  - Click "Add URL" to start processing
  - Monitor progress in the tasks list
  - View results by clicking on completed tasks
- **Manage Tasks:**
  - Click on any task to view its details
  - Use the restart button (↻) to reprocess failed or completed tasks
  - Save results manually using the "Save Result" button if auto-save is disabled
- **Queued**: Task is waiting to be processed
- **Processing**: Currently extracting content
- **Completed**: Successfully processed
- **Error**: Failed to process (can be restarted)
Results are saved as JSON files with the following structure:
```json
{
  "title": "Article Title",
  "keywords": [
    "keyword1",
    "keyword2",
    ...
  ],
  "content_summary": "Brief summary of the content",
  "hashtags": [
    "#hashtag1",
    "#hashtag2",
    ...
  ],
  "full_article": "Complete article text"
}
```
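For downstream scripting, a saved result file can be read back with the standard `json` module. The field names below mirror the structure above; the values are illustrative:

```python
import json

# A result matching the structure above (values illustrative)
saved = json.dumps({
    "title": "Article Title",
    "keywords": ["keyword1", "keyword2"],
    "content_summary": "Brief summary of the content",
    "hashtags": ["#hashtag1", "#hashtag2"],
    "full_article": "Complete article text",
})

# Parse it back, as any consumer of the saved files would
result = json.loads(saved)
print(result["title"], result["hashtags"])
```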
The application stores its configuration in `content_extractor_config.json`:

```json
{
  "project_dir": "/path/to/save/directory",
  "auto_save": true
}
```
- **URLTask**: Manages individual URL processing tasks
  - Tracks status, progress, and results
  - Handles timing and error states
- **TaskPanel**: UI component for displaying task information
  - Real-time status updates
  - Progress bar
  - Duration tracking
  - Save path display
- **ContentExtractorGUI**: Main application interface
  - Manages the task queue and threading
  - Handles file I/O and configuration
  - Provides user interface controls
- Uses a queue-based system for task management
- Processes multiple URLs concurrently
- Maintains UI responsiveness with proper thread management
- Limits concurrent processing to prevent resource exhaustion
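The queue-plus-workers pattern described above can be sketched as follows (the worker count and names are assumptions for illustration, not the application's actual values):

```python
import queue
import threading

MAX_WORKERS = 3            # assumed cap on concurrent processing
tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        url = tasks.get()
        if url is None:    # sentinel tells this worker to exit
            tasks.task_done()
            break
        processed = f"processed:{url}"  # stand-in for real content extraction
        with results_lock:
            results.append(processed)
        tasks.task_done()

threads = [threading.Thread(target=worker, daemon=True) for _ in range(MAX_WORKERS)]
for t in threads:
    t.start()

for url in ["https://example.com/a", "https://example.com/b"]:
    tasks.put(url)
for _ in threads:
    tasks.put(None)        # one sentinel per worker
tasks.join()               # block until every queued task is processed
```

In a Tkinter application, workers should hand results back to the main thread (for example via `widget.after`) rather than updating widgets directly, since Tkinter widgets are not thread-safe.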
The application includes comprehensive error handling for:
- Invalid URLs
- Network issues
- Processing failures
- File system operations
- Configuration management
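As one illustration of the first item, invalid URLs can be rejected before a task is ever queued. This is a sketch (the application's actual validation may be stricter) using only the standard library:

```python
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    # Require an http(s) scheme and a network location (host)
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```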
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License.
For issues and feature requests, please:
- Check existing issues in the repository
- Create a new issue with detailed information
- Include steps to reproduce any bugs
- Built using Python and Tkinter
- Uses the LLaMA model, through Ollama, for content analysis