DataScribe is an intelligent AI agent designed to streamline data retrieval, extraction, and structuring. By harnessing the power of Large Language Models (LLMs) and automated web search capabilities, it enables users to extract actionable insights from datasets with minimal effort. Designed for efficiency, scalability, and user-friendliness, DataScribe is ideal for professionals handling large datasets or requiring quick access to structured information.
- **File Upload & Integration**
  - Upload datasets directly from CSV files.
  - Google Sheets Integration: Seamlessly connect to and interact with Google Sheets.
- **Custom Query Definition**
  - Define intuitive query templates for extracting data.
  - Advanced Query Templates: Extract multiple fields simultaneously, e.g., "Find the email and address for {company}."
- **Automated Information Retrieval**
  - LLM-Powered Extraction: Uses ChatGroq for LLM processing and the Serper API for web searches.
  - Retry Mechanism: Handles failed queries with robust retries for accurate results (see the sketch after this list).
- **Interactive Results Dashboard**
  - View extracted data in a clean, dynamic, and filterable table.
- **Export & Update Options**
  - Download results as CSV or update Google Sheets directly.
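To make the retrieval flow concrete, here is a minimal sketch of a template-driven lookup with retries. It is illustrative only: the helper names (`search_web`, `extract_field`), the `{company}` placeholder, and the Groq model name are assumptions; the project's actual logic lives in `funcs/llm.py` and uses LangChain agents.

```python
import os
import time
import requests
from langchain_groq import ChatGroq

SERPER_URL = "https://google.serper.dev/search"


def search_web(query: str) -> str:
    """Run a Serper search and flatten the organic results into plain text."""
    resp = requests.post(
        SERPER_URL,
        headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
        json={"q": query},
        timeout=30,
    )
    resp.raise_for_status()
    organic = resp.json().get("organic", [])
    return "\n".join(f"{r.get('title', '')}: {r.get('snippet', '')}" for r in organic)


def extract_field(entity: str, template: str, retries: int = 3) -> str:
    """Fill the user template for one entity, search the web, and let the LLM
    pull the requested fields out of the snippets. Retries on failure."""
    llm = ChatGroq(model="llama-3.1-8b-instant")  # model name is an assumption
    query = template.format(company=entity)       # e.g. "Find the email and address for {company}"
    for attempt in range(retries):
        try:
            context = search_web(query)
            reply = llm.invoke(
                f"Using only the search results below, answer concisely: {query}\n\n{context}"
            )
            return reply.content
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff before the next attempt
```

In the app, a loop along these lines would run over every entity in the selected column, with the answers collected into the results table.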
| Component     | Technologies                               |
|---------------|--------------------------------------------|
| Dashboard/UI  | Streamlit                                  |
| Data Handling | pandas, Google Sheets API (Auth0, gspread) |
| Search API    | Serper API, ScraperAPI                     |
| LLM API       | Groq API                                   |
| Backend       | Python                                     |
| Agents        | LangChain                                  |
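The exact, pinned dependencies live in `requirements.txt`. Purely as an illustration of how the stack above maps to Python packages, an unpinned list might look like the following; install from the repository's file, not from this sketch.

```
streamlit
pandas
gspread
google-auth
requests
python-dotenv
langchain
langchain-groq
```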
```
DataScribe/
├── app.py                     # Main application entry point
├── funcs/                     # Core functionalities
│   ├── googlesheet.py         # Google Sheets integration
│   ├── llm.py                 # LLM-based extraction and search
├── views/                     # UI components and layout
│   ├── home.py                # Home page and navigation
│   ├── upload_data.py         # File upload and data preprocessing
│   ├── define_query.py        # Query definition logic
│   ├── extract_information.py # Information extraction workflows
│   ├── view_and_download.py   # Result viewing and export functionalities
├── requirements.txt           # Dependency list
├── .env.sample                # Environment variable template
├── credentialsample.json      # Google API credentials template
├── README.md                  # Documentation
├── LICENSE                    # License information
```
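The `views/` modules map onto the tabs described under the usage steps below. As a rough idea of how such a layout is commonly wired together in Streamlit, consider the following sketch; it is a generic pattern rather than the project's actual `app.py`, and the per-module `render()` entry point is an assumption.

```python
# Generic multi-page routing pattern for a Streamlit app with a views/ package.
import streamlit as st

from views import (
    home,
    upload_data,
    define_query,
    extract_information,
    view_and_download,
)

PAGES = {
    "Home": home,
    "Upload Data": upload_data,
    "Define Query": define_query,
    "Extract Information": extract_information,
    "View & Download": view_and_download,
}

choice = st.sidebar.radio("Navigation", list(PAGES))
PAGES[choice].render()  # assumes each view module exposes a render() entry point
```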
- Python 3.9 or higher.
- Google API credentials for Sheets integration.
- **Clone the Repository**

  ```bash
  git clone https://github.com/sam22ridhi/DataScribe.git
  cd DataScribe
  ```

- **Install Dependencies**

  ```bash
  pip install -r requirements.txt
  ```

- **Set Up Environment Variables**
  - Copy the `.env.sample` file to `.env`:

    ```bash
    cp .env.sample .env
    ```

  - Add the required API keys to the `.env` file:

    ```
    GOOGLE_API_KEY=<your_google_api_key>
    SERPER_API_KEY=<your_serper_api_key>
    ```

- **Prepare Google API Credentials**

  Replace the content in `credentialsample.json` with your Google API credentials and save it as `credentials.json` (a quick sanity-check snippet follows these steps).

- **Run the Application**

  ```bash
  streamlit run app.py
  ```
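Once the keys and credentials are in place, a quick check like the one below can confirm everything is wired up; this is the snippet referenced in the credentials step. It is illustrative only and assumes `python-dotenv` is installed, that `credentials.json` holds service-account credentials, and that the sheet name is a placeholder you replace with your own.

```python
import os

import gspread
from dotenv import load_dotenv  # python-dotenv; reads the .env created above

# Load GOOGLE_API_KEY / SERPER_API_KEY from .env and confirm they are set.
load_dotenv()
for key in ("GOOGLE_API_KEY", "SERPER_API_KEY"):
    print(f"{key} set:", bool(os.getenv(key)))

# Authenticate against Google Sheets with the credentials saved as credentials.json.
gc = gspread.service_account(filename="credentials.json")
sheet = gc.open("My DataScribe Sheet")  # placeholder: use the name of your own sheet
print("Connected to sheet:", sheet.title)
```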
- **Access the Application**
  Open http://localhost:8501 in your browser.
- **Upload Data**
  Navigate to the Upload Data tab to import a CSV file or connect to Google Sheets.
- **Define Query**
  Use the Define Query tab to specify search templates. Select the column containing the entities and define the fields to extract.
- **Extract Information**
  Execute automated searches in the Extract Information tab to fetch structured data.
- **View & Download**
  Review the results in the View & Download tab, then export them as CSV or update Google Sheets directly (a sketch of this export step follows the list).
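The export step referenced in View & Download can be pictured roughly as follows. This is a hedged sketch, not the project's code: it assumes the extracted results sit in a pandas DataFrame (here under a made-up `results` session key) and that the target sheet name is a placeholder.

```python
import pandas as pd
import streamlit as st
import gspread

# Hypothetical: the extracted results held as a DataFrame in session state.
results = st.session_state.get("results", pd.DataFrame())

# Offer the table as a CSV download.
st.download_button(
    label="Download results as CSV",
    data=results.to_csv(index=False).encode("utf-8"),
    file_name="datascribe_results.csv",
    mime="text/csv",
)

# Or write the same table back to a Google Sheet.
if st.button("Update Google Sheet"):
    gc = gspread.service_account(filename="credentials.json")
    worksheet = gc.open("My DataScribe Sheet").sheet1  # placeholder sheet name
    worksheet.update([results.columns.tolist()] + results.values.tolist())
```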
Watch the 2-minute walkthrough showcasing:
- Overview of DataScribe's purpose and features.
- Key workflows, including upload, extraction, and export.
- Highlights of the codebase.
Try it out on Hugging Face (link).
Special thanks to Breakout AI and Kapil Mittal for the opportunity to demonstrate my skills through this project/assessment.
This project is licensed under the Apache License 2.0.
We welcome contributions!
- Fork the repository.
- Create a feature branch.
- Submit a pull request with a detailed description of changes.
For feedback or support: