Add logic to parse docx files #26

Merged
merged 46 commits into from
Dec 8, 2024
9a9190c
create an improved chatbox item with widgets
nwaughachukwuma Dec 5, 2024
6b2767a
cleanup examples and example_card for use in chatbox_and_widget
nwaughachukwuma Dec 5, 2024
d94cc60
hide chatbox on the home page
nwaughachukwuma Dec 5, 2024
ac29351
simplify the landing page with chat_box_and_widget
nwaughachukwuma Dec 6, 2024
613fb04
enhance ChatBoxAndWidget with search validation and additional text
nwaughachukwuma Dec 6, 2024
26b8637
allow overriding tools slot in chat_box_and_widget
nwaughachukwuma Dec 6, 2024
5441345
add logic to upload and preview file attachments
nwaughachukwuma Dec 6, 2024
8e6b0aa
improve chat_box attachment and attachment preview components
nwaughachukwuma Dec 6, 2024
88fc35f
Merge remote-tracking branch 'origin/main' into allow-uploading-attac…
nwaughachukwuma Dec 6, 2024
15c5d0b
add endpoint to store file uploads
nwaughachukwuma Dec 6, 2024
ef63b94
use the correct response_model type
nwaughachukwuma Dec 6, 2024
460a380
refactor file upload handling to check for existing blobs and return …
nwaughachukwuma Dec 6, 2024
da9fdf1
refactor /store-file-upload endpoint to only accept single file
nwaughachukwuma Dec 6, 2024
1fb3834
refactor ChatBoxAttachment to manage uploaded files with a writable s…
nwaughachukwuma Dec 6, 2024
b9c690e
refactor ChatBoxAttachment and ChatBoxAndWidget to use attachments co…
nwaughachukwuma Dec 6, 2024
bbb41e9
render loading state and file_icon
nwaughachukwuma Dec 6, 2024
80ac879
add logic and endpoint to summarize custom sources
nwaughachukwuma Dec 6, 2024
3c94868
cleanup
nwaughachukwuma Dec 6, 2024
07a7b4b
add caching for attachments summary in chat endpoint and update syste…
nwaughachukwuma Dec 6, 2024
15e9faf
refactor attachment handling to use session-specific context and impr…
nwaughachukwuma Dec 6, 2024
26449de
fix bug and cleanup
nwaughachukwuma Dec 6, 2024
d10c2a5
keep local audio_sources in sync with db
nwaughachukwuma Dec 6, 2024
85087e7
save attachments in background tasks as link custom sources
nwaughachukwuma Dec 6, 2024
6336800
refactor custom source management to utilize FieldFilter for querying…
nwaughachukwuma Dec 6, 2024
e87e417
add GCS URL resolution and blob name extraction in storage manager
nwaughachukwuma Dec 7, 2024
e69c488
refactor audio source management and cleanup session context handling
nwaughachukwuma Dec 7, 2024
938750b
add retry decorator, create decorators dir
nwaughachukwuma Dec 7, 2024
a574075
use a backoff in get_signed_url_endpoint
nwaughachukwuma Dec 7, 2024
b4d37e6
use the new retry_decorator in get_signed_url_endpoint
nwaughachukwuma Dec 7, 2024
596ea23
default to openai for tts on dev env
nwaughachukwuma Dec 7, 2024
dbcc7e8
store file uploads as plain text when preserve is not required
nwaughachukwuma Dec 7, 2024
cc1141a
override loading message when generating first response with attachments
nwaughachukwuma Dec 7, 2024
1270122
add auto_resize logic to textarea
nwaughachukwuma Dec 7, 2024
97ab1d5
move attachment preview to the top
nwaughachukwuma Dec 7, 2024
8065da0
add loading state and spinner to ChatBoxAndWidget; update ChatContain…
nwaughachukwuma Dec 7, 2024
8028a22
move chatbox and widget into own file
nwaughachukwuma Dec 7, 2024
c85bff8
Merge remote-tracking branch 'origin/main' into use-chat-box-with-wid…
nwaughachukwuma Dec 7, 2024
8d0ee4a
Add disabled state to ChatBoxAttachment and adjust spinner size in Ch…
nwaughachukwuma Dec 7, 2024
9d51455
Refactor audiocast source generation and update related endpoints and…
nwaughachukwuma Dec 7, 2024
9cd8339
Update get_source_content function to handle unsupported content type…
nwaughachukwuma Dec 8, 2024
f9c8064
add python-docx and logic to read docx files
nwaughachukwuma Dec 8, 2024
f68a669
add a new spinner component
nwaughachukwuma Dec 8, 2024
80eed47
allow uploading .docx in custom_source_manager
nwaughachukwuma Dec 8, 2024
231adb1
Update supported file types in AddCustomSourceForm to include DOCX
nwaughachukwuma Dec 8, 2024
c4b0cd0
Add support for DOCX file uploads and content extraction
nwaughachukwuma Dec 8, 2024
d874610
Merge remote-tracking branch origin/main into add-logic-to-parse-docx…
nwaughachukwuma Dec 8, 2024
3 changes: 2 additions & 1 deletion api/requirements.txt
@@ -34,4 +34,5 @@ setuptools

 lxml
 beautifulsoup4
-async-web-search
+async-web-search
+python-docx
7 changes: 7 additions & 0 deletions api/src/utils/custom_sources/read_content.py
@@ -1,5 +1,6 @@
 from io import BytesIO

+from docx import Document
 from fastapi import UploadFile
 from pypdf import PdfReader

@@ -21,6 +22,10 @@ def _read_pdf(self, content: bytes) -> tuple[str, PdfReader]:
     def _read_txt(self, content: bytes) -> str:
         return content.decode()

+    def _read_docx(self, content: bytes) -> str:
+        doc = Document(BytesIO(content))
+        return "\n\n".join([p.text for p in doc.paragraphs])
+
     async def _read_file(self, file: UploadFile, preserve: bool):
         file_bytes = await file.read()

@@ -31,6 +36,8 @@ async def _read_file(self, file: UploadFile, preserve: bool):
             text_content, _ = self._read_pdf(file_bytes)
         elif file.content_type == "text/plain":
             text_content = self._read_txt(file_bytes)
+        elif file.content_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
+            text_content = self._read_docx(file_bytes)
         else:
             return BytesIO(file_bytes)

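For readers unfamiliar with the format, here is a rough stdlib-only sketch of the idea behind `_read_docx`: a .docx file is a ZIP archive whose main text lives in `word/document.xml`. This is not code from this PR and not how python-docx is implemented. `read_docx_paragraphs` and `make_demo_docx` are hypothetical names, and the sketch only concatenates `w:t` text runs of body-level paragraphs, ignoring tabs, line breaks, tables, headers, and footers.

```python
import zipfile
from io import BytesIO
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML main namespace used inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_paragraphs(content: bytes) -> str:
    """Join the text of every body-level paragraph with blank lines."""
    with zipfile.ZipFile(BytesIO(content)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    body = root.find(f"{{{W_NS}}}body")
    if body is None:
        return ""

    paragraphs = []
    # Only direct <w:p> children of <w:body> are visited, so paragraphs
    # nested inside tables (<w:tbl>) are skipped in this sketch.
    for paragraph in body.findall(f"{{{W_NS}}}p"):
        runs = paragraph.iter(f"{{{W_NS}}}t")
        paragraphs.append("".join(run.text or "" for run in runs))
    return "\n\n".join(paragraphs)


def make_demo_docx(paragraphs: list[str]) -> bytes:
    """Build an in-memory archive holding only word/document.xml.

    Hypothetical demo helper: the result is enough for the reader above,
    but it is not a complete file that Word would open.
    """
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    print(read_docx_paragraphs(make_demo_docx(["First paragraph", "Second paragraph"])))
```

As far as I know, `doc.paragraphs` in python-docx likewise lists only body-level paragraphs, so text inside tables would need separate handling (for example via `doc.tables`) if it should reach the extracted content.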
12 changes: 8 additions & 4 deletions api/src/utils/custom_sources/save_uploaded_sources.py
@@ -8,10 +8,9 @@
 TEN_MB = 10 * 1024 * 1024


-class UploadedFiles:
+class UploadedFiles(ReadContent):
     def __init__(self, session_id: str):
         self.session_id = session_id
-        self.content_reader = ReadContent()

     async def _extract_content(self, file: UploadFile):
         file_bytes = await file.read()
@@ -20,15 +19,20 @@ async def _extract_content(self, file: UploadFile):
             return None

         if file.content_type == "application/pdf":
-            text_content, pdf_reader = self.content_reader._read_pdf(file_bytes)
+            text_content, pdf_reader = self._read_pdf(file_bytes)

             metadata = {**(pdf_reader.metadata or {}), "pages": pdf_reader.get_num_pages()}
             content_type = "application/pdf"
         elif file.content_type == "text/plain":
-            text_content = self.content_reader._read_txt(file_bytes)
+            text_content = self._read_txt(file_bytes)

             metadata = {}
             content_type = "text/plain"
+        elif file.content_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
+            text_content = self._read_docx(file_bytes)
+
+            metadata = {}
+            content_type = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
         else:
             return None

7 changes: 5 additions & 2 deletions api/src/utils/summarize_custom_sources.py
@@ -30,7 +30,7 @@ def summarize_custom_sources_prompt(combined_content: str) -> str:
     """


-async def get_source_content(source_url: str) -> str:
+async def get_source_content(source_url: str):
     """
     Get the content of a source URL.
     """
@@ -45,8 +45,11 @@ async def get_source_content(source_url: str) -> str:
         text_content, _ = content_reader._read_pdf(content_byte)
     elif blob.content_type == "text/plain":
         text_content = content_reader._read_txt(content_byte)
+    elif blob.content_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
+        text_content = content_reader._read_docx(content_byte)
     else:
-        raise ValueError(f"Unsupported content type: {blob.content_type}")
+        print(f"Unsupported content type: {blob.content_type}")
+        return None

     return text_content

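The same content-type branching now appears in `_read_file`, `_extract_content`, and `get_source_content`. One possible way to keep the three in sync, sketched under assumptions, is a lookup table from MIME type to reader function that returns `None` for unsupported types, as `get_source_content` now does. `READERS`, `extract_text`, and `combine_sources` are hypothetical names that do not exist in this PR, and only a plain-text reader is registered so the sketch stays stdlib-only.

```python
from typing import Callable, Iterable, Optional

# A reader turns raw bytes into text.
Reader = Callable[[bytes], str]


def read_txt(content: bytes) -> str:
    # Same behavior as _read_txt in the diff: decode with the default UTF-8.
    return content.decode()


# Hypothetical registry. PDF and DOCX readers would be registered the same
# way, keyed by "application/pdf" and the wordprocessingml MIME type.
READERS: dict[str, Reader] = {
    "text/plain": read_txt,
}


def extract_text(content_type: Optional[str], content: bytes) -> Optional[str]:
    """Return the extracted text, or None when the content type is unsupported."""
    reader = READERS.get(content_type or "")
    if reader is None:
        return None
    return reader(content)


def combine_sources(items: Iterable[tuple[Optional[str], bytes]]) -> str:
    """Join the text of supported items and skip unsupported ones."""
    texts = (extract_text(content_type, content) for content_type, content in items)
    return "\n\n".join(text for text in texts if text is not None)
```

Because unsupported types yield `None` instead of raising, any caller that concatenates source contents has to filter those values out, which is what `combine_sources` illustrates.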
2 changes: 1 addition & 1 deletion app/src/lib/components/ChatBoxAndWidget.svelte
@@ -6,7 +6,7 @@
   import ChatBoxAttachment from './ChatBoxAttachment.svelte';
   import ChatBoxAttachmentPreview from './ChatBoxAttachmentPreview.svelte';
   import { getAttachmentsContext } from '@/stores/attachmentsContext.svelte';
-  import Spinner from './Spinner.svelte';
+  import Spinner from './Spinner2.svelte';

   export let searchTerm = '';
   export let loading = false;
4 changes: 2 additions & 2 deletions app/src/lib/components/ChatBoxAttachment.svelte
@@ -52,6 +52,7 @@
     const fileId = `${sessionId}_${slug(file.name)}`;

     const formData = new FormData();
+    // const newFile = new File([file], fileId, { type: file.type });
     formData.append('file', file);
     formData.append('filename', fileId);

@@ -84,10 +85,9 @@
     <PaperclipIcon class="h-5 w-5" />
   </Button>

-  <!-- TODO: add support for .docx -->
   <input
     type="file"
-    accept=".pdf,.txt"
+    accept=".pdf,.txt,.docx"
     multiple
     class="hidden"
     bind:this={fileInput}
4 changes: 2 additions & 2 deletions app/src/lib/components/ChatBoxAttachmentPreview.svelte
@@ -2,7 +2,7 @@
   import { FileIcon, XIcon } from 'lucide-svelte';
   import { Button } from './ui/button';
   import { getAttachmentsContext } from '@/stores/attachmentsContext.svelte';
-  import Spinner from './Spinner.svelte';
+  import Spinner from './Spinner2.svelte';

   const { removeUploadItem, sessionUploadItems$ } = getAttachmentsContext();

@@ -32,7 +32,7 @@
 </script>

 <div class="p-2 flex flex-wrap gap-2 bg-zinc-800/30">
-  {#each validItems as { file, id, loading } (id)}
+  {#each validItems as { file, id, loading }, ix (id + ix)}
     <div class="flex items-center w-56 gap-2 justify-between bg-zinc-700/30 rounded p-2">
       <div class="p-1">
         {#if loading}
33 changes: 33 additions & 0 deletions app/src/lib/components/Spinner2.svelte
@@ -0,0 +1,33 @@
+<script lang="ts" context="module">
+  export type SpinnerSize = 'small' | 'medium' | 'large';
+  const sizeClasses: Record<SpinnerSize, string> = {
+    small: 'w-4 h-4',
+    medium: 'w-6 h-6',
+    large: 'w-10 h-10'
+  };
+</script>
+
+<script lang="ts">
+  export let size: SpinnerSize = 'medium';
+  export let color = 'text-gray-600';
+</script>
+
+<div class="flex w-full h-full justify-center items-center">
+  <div class={`${sizeClasses[size]} ${color}`}>
+    <svg class="animate-spin" viewBox="0 0 24 24">
+      <path
+        fill="none"
+        stroke="currentColor"
+        stroke-width="2"
+        stroke-linecap="round"
+        d="M12 2C6.477 2 2 6.477 2 12c0 1.821.487 3.53 1.338 5"
+      />
+    </svg>
+  </div>
+</div>
+
+<style>
+  .animate-spin {
+    animation-duration: 0.5s;
+  }
+</style>
23 changes: 19 additions & 4 deletions app/src/lib/components/custom-source/AddCustomSourceForm.svelte
@@ -1,5 +1,20 @@
-<script context="module">
+<script lang="ts" context="module">
   const TEN_MB = 10 * 1024 * 1024;
+
+  function validFileType(file: File) {
+    // Supported file types and extensions
+    const supportedTypes = [
+      'application/pdf',
+      'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
+    ];
+    const supportedExtensions = ['.pdf', '.docx', '.txt'];
+
+    // Check MIME type
+    if (supportedTypes.includes(file.type)) return true;
+
+    // Fallback: Check file extension (case insensitive)
+    return supportedExtensions.some((ext) => file.name.toLowerCase().endsWith(ext));
+  }
 </script>

 <script lang="ts">
@@ -65,7 +80,7 @@
         continue;
       }

-      if (file.type !== 'application/pdf' && !file.name.endsWith('.txt')) {
+      if (!validFileType(file)) {
         toast.info(`Unsupported file type for ${file.name}. Skipping...`);
         continue;
       }
@@ -113,14 +128,14 @@
             type="file"
             class="hidden"
             multiple
-            accept=".pdf,.txt"
+            accept=".pdf,.txt,.docx"
             bind:this={fileuploadEl}
             on:change={(e) => handleFiles(e.currentTarget.files)}
           />
         </p>
       </div>
       <p class="text-sm text-gray-500">
-        Supported file types: PDF, .txt, Markdown. (Max size: 10MB)
+        Supported file types: PDF, TXT, DOCX, Markdown. (Max size: 10MB)
       </p>
     </div>
   </div>
6 changes: 5 additions & 1 deletion app/src/lib/db/db.customSources.ts
@@ -19,7 +19,11 @@ export type UploadSource = {

 export type Sources = (LinkSource | CopyPasteSource | UploadSource) & {
   id: string;
-  content_type: 'text/plain' | 'text/html' | 'application/pdf';
+  content_type:
+    | 'text/plain'
+    | 'text/html'
+    | 'application/pdf'
+    | 'application/vnd.openxmlformats-officedocument.wordprocessingml.document';
   content: string;
   title?: string;
   created_at?: string;