Fix handling of PurePosixPath in PDFReader #16612
Open
Description
Fixes bug #16602 (S3Reader: Failed to load files other than .txt using S3Reader).
ISSUE:
Failed to load file test-llamaindex/file1.docx with error: Attempt to open non key-like path: test-llamaindex\file1.docx. Skipping...
Failed to load file test-llamaindex/file2.pdf with error: RetryError[<Future at 0x1122b0b42b0 state=finished raised ValueError>]. Skipping...
Version
llama_index Version: 0.11.18, llama-index-readers-s3 Version: 0.2.0
Steps to Reproduce
This is the code used:
`from llama_index.readers.s3 import S3Reader
from llama_index.readers.file import PDFReader, DocxReader

reader = S3Reader(
    bucket="test-llamaindex",
    aws_access_id=AWS_ACCESS_KEY_ID,
    aws_access_secret=AWS_SECRET_ACCESS_KEY,
    aws_session_token=AWS_SESSION_TOKEN,
    recursive=True,
    required_exts=[".pdf", ".docx", ".txt"],
    file_extractor={".pdf": PDFReader(), ".docx": DocxReader()},
)
documents = reader.load_data()
documents`
DEBUGGED:
The issue is in the PDFReader class, in the following condition:
"""Parse file."""
if not isinstance(file, Path):
    file = Path(file)
My file was of type <class 'pathlib.PurePosixPath'>, which is not a subclass of Path, so the condition was true and Path(file) was called. As happens on Windows, that call changed the "/" separators in the file path to "\", which s3fs does not accept, and produced this error:
Attempt to open non key-like path: test-llamaindex\ShortStartStopMove.pdf
The error is raised at the line with fs.open(str(file), "rb") as fp:
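The behaviour described above can be reproduced on any platform with a minimal sketch. It assumes that Path resolves to a Windows-flavoured path on the affected machine, and uses PureWindowsPath as a stand-in for that; the key name is taken from the error log, and the variable names are illustrative only.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# The S3 key as it reaches the reader (type taken from the report above).
key = PurePosixPath("test-llamaindex/file2.pdf")

# PurePosixPath is not a subclass of Path, so `not isinstance(file, Path)`
# evaluates to True and the reader goes on to call Path(file).
guard_fires = not isinstance(key, Path)

# Assumption: on Windows, Path(...) yields a WindowsPath. PureWindowsPath
# gives the same string form without requiring a Windows machine.
windows_str = str(PureWindowsPath(str(key)))

# The original object keeps forward slashes, which is what s3fs expects.
posix_str = str(key)

print(guard_fires)
print(windows_str)
print(posix_str)
```

The second printed value contains a backslash separator, matching the "non key-like path" in the error message.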
SOLUTION:
Convert the file to a PurePosixPath, and only when it is neither a Path nor already a PurePosixPath:
"""Parse file."""
if not isinstance(file, Path) and not isinstance(file, PurePosixPath):
    file = PurePosixPath(file)
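A standalone sketch of the same guard may make the intent easier to review. The helper names normalize_file and to_fs_key are hypothetical and are not part of PDFReader; to_fs_key shows one possible way to always hand a forward-slash string to a remote filesystem, which goes slightly beyond what this PR changes.

```python
from pathlib import Path, PurePath, PurePosixPath, PureWindowsPath
from typing import Union


def normalize_file(file: Union[str, PurePath]) -> PurePath:
    """Hypothetical helper mirroring the guard proposed in this PR."""
    # Leave Path and PurePosixPath objects untouched; wrap anything else
    # (for example a plain string) in a PurePosixPath.
    if not isinstance(file, (Path, PurePosixPath)):
        file = PurePosixPath(file)
    return file


def to_fs_key(file: PurePath) -> str:
    """Hypothetical helper: forward-slash string form of any pure path."""
    return file.as_posix()


s3_key = PurePosixPath("test-llamaindex/file2.pdf")
unchanged = normalize_file(s3_key)
from_string = normalize_file("test-llamaindex/file2.pdf")
print(str(unchanged))
print(to_fs_key(PureWindowsPath("test-llamaindex\\file2.pdf")))
```

With this guard, a PurePosixPath coming from S3Reader passes through unchanged, so str(file) keeps its forward slashes.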
Fixes #16602
New Package?
PurePosixPath
Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?
Version Bump?
No
Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)
Type of Change
Please delete options that are not relevant.
How Has This Been Tested?
Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.
Suggested Checklist:
make format; make lint to appease the lint gods