MS Word document resulted from RTFEmbeddedObject.getData() byte array cannot be opened #118

Open
FabioRNT opened this issue Sep 3, 2019 · 5 comments


FabioRNT commented Sep 3, 2019

Hello, I'm trying to extract an MS Word file embedded in an RTF file by using RTFEmbeddedObject.getEmbeddedObjects(String file). The method returns a list with four instances, which is expected. When I check the resulting data array with Apache Tika, it reports the application/x-tika-msoffice MIME type, which seems correct.

However, when I open the resulting file in MS Word, it doesn't show the expected content. I'll attach both files to this issue.

Here's the code that I'm using:

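```java
// Extract the embedded objects from the RTF text.
List<List<RTFEmbeddedObject>> rtfl = RTFEmbeddedObject.getEmbeddedObjects(readLineByLine(file));

for (List<RTFEmbeddedObject> l : rtfl) {
    // Take the second entry of each inner list.
    // Note: every iteration writes to the same test.doc, so only the last one survives.
    byte[] data = l.get(1).getData();
    FileUtils.writeByteArrayToFile(new File("test.doc"), data);

    // Ask Tika to detect the MIME type from the extracted bytes.
    Tika t = new Tika();
    String s = t.detect(data);
    System.out.println("Mimetype: " + s);
}
```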

Attachments at:
rtfword.zip

Thanks in advance!
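
For anyone reproducing this, a quick sanity check that doesn't depend on Tika is to look for the OLE2 compound file signature (`D0 CF 11 E0 A1 B1 1A E1`) in the extracted bytes. The sketch below is only an illustration: the class and method names are hypothetical and not part of the library, and it assumes the data is already available as a byte array (for example, the result of `getData()`). An offset of 0 means the bytes start like a compound file; a positive offset would suggest extra bytes in front of it.

```java
final class Ole2SignatureCheck {
    // Signature found at the start of an OLE2 compound file (legacy .doc, .xls, ...).
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns the offset of the first occurrence of the signature, or -1 if absent. */
    static int indexOfSignature(byte[] data) {
        for (int i = 0; i + OLE2_MAGIC.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < OLE2_MAGIC.length; j++) {
                if (data[i + j] != OLE2_MAGIC[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up sample: four filler bytes followed by the signature.
        byte[] sample = new byte[4 + OLE2_MAGIC.length];
        System.arraycopy(OLE2_MAGIC, 0, sample, 4, OLE2_MAGIC.length);
        System.out.println("Signature offset: " + indexOfSignature(sample));
    }
}
```

Finding the signature only shows that the container looks right; it says nothing about whether the streams inside it form a valid Word document.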

Owner

joniles commented Sep 3, 2019

Just to confirm, is the test.doc included in the zip file the original file which was embedded in the RTF, or one you have extracted yourself?

Owner

joniles commented Sep 3, 2019

Also... if possible could you include the MPP file that the RTF came from?

Author

FabioRNT commented Sep 3, 2019

Hello, the test.doc file is the one that I extracted using the library. As for the RTF, it didn't come from an MPP file. It came from a database that exported OLE objects for me; I was able to convert them to RTF and access them as embedded objects.

Owner

joniles commented Sep 4, 2019

Thanks for the update. Do you have a way to get the original OLE object out of the database without going through the RTF export exercise you describe? I'd like to start with a "known good" file which MS Word can open, then compare that to what we're able to extract from the RTF.
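
The comparison described above could start with something as simple as finding the first byte at which the two files diverge. This is a minimal sketch, not code from the library: the class and method names are hypothetical, and it assumes both files have already been read into byte arrays.

```java
final class ByteDiff {
    /**
     * Returns the index of the first differing byte, the length of the
     * shorter array if one is a prefix of the other, or -1 if both are equal.
     */
    static int firstMismatch(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : n;
    }

    public static void main(String[] args) {
        // Made-up data standing in for the known good file and the extracted one.
        byte[] knownGood = {0x01, 0x02, 0x03, 0x04};
        byte[] extracted = {0x01, 0x02, 0x7F, 0x04};
        System.out.println("First mismatch at offset " + firstMismatch(knownGood, extracted));
    }
}
```

On Java 9 and later, `java.util.Arrays.mismatch(byte[], byte[])` gives the same result without a hand-written loop. The real inputs could be loaded with `Files.readAllBytes`, using whatever paths the two files are saved under.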

Author

FabioRNT commented Sep 4, 2019

I'll upload an original OLE file, but MS Word can't open it as-is. To open it, I have to add a header and convert it to RTF.
ole.zip
