
# Chat with Your Data using Spark and SlashML Text-to-SQL

This project demonstrates how to integrate Apache Spark with SlashML's Text-to-SQL model to enable natural language queries on your data. The demo includes a sample retail database with sales, products, and customers data that you can query using plain English.

```
Enter your question > What are the total sales by category?

Generated SQL Query:
SELECT
    p.category,
    SUM(s.total_amount) AS total_sales
FROM sales s
JOIN products p ON s.product_id = p.product_id
GROUP BY p.category
ORDER BY total_sales DESC;

Query Result:
+-----------+-----------+
|   category|total_sales|
+-----------+-----------+
|Electronics|  567890.45|
|  Furniture|  234567.89|
| Appliances|  123456.78|
+-----------+-----------+
```

## 🌟 Features

- Natural language to SQL query conversion using SlashML's Text-to-SQL model
- Interactive command-line interface for querying data
- Built-in sample retail database with:
  - Sales transactions
  - Product catalog
  - Customer information
- Real-time query generation and execution with Spark
- Data preview capabilities
- Schema visualization
- Comprehensive error handling
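At a high level, each question makes one round trip: the question and the schema are sent to the deployed Text-to-SQL endpoint, the returned SQL is run with `spark.sql`. A minimal sketch of the request/response plumbing is below; the payload shape and the `"sql"` response field are assumptions about the API, not its documented contract:

```python
import json


def build_payload(question: str, schema_str: str) -> dict:
    """Bundle the natural-language question with the schema so the
    Text-to-SQL model knows which tables and columns exist."""
    return {"question": question, "schema": schema_str}


def extract_sql(response_body: str) -> str:
    """Pull the generated SQL out of a JSON response body.
    The "sql" field name is an assumption about the response format."""
    return json.loads(response_body)["sql"].strip()
```

With `requests`, the round trip would then look roughly like `resp = requests.post(endpoint, json=build_payload(question, schema_str))` followed by `spark.sql(extract_sql(resp.text)).show()`.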

## 🚀 Getting Started

### Setting Up SlashML Text-to-SQL Model

1. Visit dashboard.slashml.com
2. Create an account or log in
3. Navigate to the Models section
4. Deploy a Text-to-SQL model
5. Copy the API endpoint URL

### Prerequisites

- Python 3.7+
- Apache Spark

Install the Python dependencies:

```shell
pip install pyspark requests
```

### Dataset Structure

The demo includes a sample retail database with three tables:

1. Sales Table

   ```sql
   CREATE TABLE sales (
       transaction_id INTEGER,
       product_id INTEGER,
       customer_id INTEGER,
       sale_date DATE,
       quantity INTEGER,
       unit_price DOUBLE,
       total_amount DOUBLE
   );
   ```

2. Products Table

   ```sql
   CREATE TABLE products (
       product_id INTEGER,
       product_name VARCHAR,
       category VARCHAR,
       supplier_id INTEGER
   );
   ```

3. Customers Table

   ```sql
   CREATE TABLE customers (
       customer_id INTEGER,
       customer_name VARCHAR,
       country VARCHAR,
       join_date DATE
   );
   ```

### Running the Demo

1. Clone this repository:

   ```shell
   git clone https://github.com/yourusername/spark-text-to-sql-demo.git
   cd spark-text-to-sql-demo
   ```

2. Run the demo with your SlashML API endpoint:

   ```shell
   python spark_sql_demo.py --api-endpoint "YOUR_SLASHML_API_ENDPOINT"
   ```

## 💡 Usage

### Available Commands

```
help    : Display help information and example questions
schema  : Show the database schema
preview : Show a sample data preview
exit    : Exit the program
```
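The command loop presumably treats anything that is not one of these keywords as a question for the model. A hypothetical sketch of that dispatch logic (the action names are illustrative; the real `spark_sql_demo.py` may differ):

```python
def dispatch(line: str) -> str:
    """Map a line of user input to an action name.
    Built-in commands are matched case-insensitively; anything else
    is treated as a natural-language question."""
    cmd = line.strip().lower()
    if cmd == "help":
        return "show_help"
    if cmd == "schema":
        return "show_schema"
    if cmd == "preview":
        return "show_preview"
    if cmd == "exit":
        return "quit"
    return "run_query"
```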

### Example Questions

The demo understands complex analytical questions such as:

- What are the total sales for each product category?
- Who are the top 5 customers by total purchase amount?
- What is the average order value by country?
- How many products were sold in each month of 2023?
- Which products have never been sold?

### Interactive Session Example

```
$ python spark_sql_demo.py
Initializing Spark session...
Creating sample data...

Available tables: sales, products, customers

Sample data preview:
Products:
+----------+------------+-----------+-----------+
|product_id|product_name|   category|supplier_id|
+----------+------------+-----------+-----------+
|         1|      Laptop|Electronics|        101|
|         2|  Smartphone|Electronics|        102|
|         3|  Desk Chair|  Furniture|        103|
+----------+------------+-----------+-----------+

Enter your question > What are the total sales by category?

Generated SQL Query:
SELECT
    p.category,
    SUM(s.total_amount) AS total_sales
FROM sales s
JOIN products p ON s.product_id = p.product_id
GROUP BY p.category
ORDER BY total_sales DESC;

Query Result:
+-----------+-----------+
|   category|total_sales|
+-----------+-----------+
|Electronics|  567890.45|
|  Furniture|  234567.89|
| Appliances|  123456.78|
+-----------+-----------+
```

## 🔧 Customizing the Demo

### Using Your Own Data

To use your own data, modify the `create_sample_data` method in the `SparkSQLDemo` class:

1. Define your schemas:

   ```python
   your_schema = StructType([
       StructField("column_name", DataType(), nullable=False),
       # Add more fields...
   ])
   ```

2. Create your data:

   ```python
   your_data = [
       (value1, value2, ...),
       # Add more records...
   ]
   ```

3. Create the DataFrame and register it as a temporary view:

   ```python
   your_df = self.spark.createDataFrame(your_data, your_schema)
   your_df.createOrReplaceTempView("your_table")
   ```

4. Update the schema string in `__init__` so the Text-to-SQL model sees your table:

   ```python
   self.schema_str = """
   CREATE TABLE your_table (
       column_name DATA_TYPE  -- add more columns...
   );
   """
   ```

## 🔍 Troubleshooting

1. **API Connection Issues**
   - Error: Cannot connect to the SQL generation API
   - Solution: Verify your API endpoint URL and internet connection

2. **Spark Installation**
   - Error: Spark not found
   - Solution: Ensure Spark is properly installed and `SPARK_HOME` is set

3. **Query Generation**
   - Error: Failed to generate SQL query
   - Solution: Make sure the question refers to tables and columns that exist in the schema
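For the API-connection case, it helps to fail fast with a clear message rather than let an exception bubble up. A hedged sketch using only the standard library (the endpoint path and payload shape are assumptions about the API):

```python
import json
import urllib.error
import urllib.request


def post_question(endpoint: str, question: str, timeout: float = 5.0):
    """POST the question to the Text-to-SQL endpoint. Returns the parsed
    JSON body, or None on any connection or decoding problem so the
    caller can print a friendly "cannot connect" message instead of a
    traceback."""
    body = json.dumps({"question": question}).encode("utf-8")
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, json.JSONDecodeError):
        return None
```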

## 📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

## ❓ Support