This repository has been archived by the owner on Jun 15, 2023. It is now read-only.
Title detection heuristics #53
Comments
I just looked at the code, and the reference detection logic is also heuristic in nature (lines 14 to 22 in 9e6864c).
There's a Haskell-based tool built around renaming PDFs (https://github.com/2mol/pboy), but it has no title heuristics, just metadata.
I actually found a library that implements the heuristics I talked about (https://github.com/metebalci/pdftitle), and it works pretty well. So I've modified my original script to use this instead of KDE's metadump. Closing this, because I don't think it makes any sense to re-implement this functionality in pdfx.
Currently, the PDF title is retrieved directly from the metadata info, but most PDFs (like those from arXiv) don't actually have that metadata. We could add custom logic for when we detect an arXiv PDF, which is what #52 is about, or else we could add heuristic-based "guessing" of the title (say, from the text with the largest font on the first page). This will obviously not work everywhere. But it doesn't have to!
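To make the largest-font idea concrete, here is a minimal sketch. It assumes the first page's text has already been extracted into spans that carry a font size; `TextSpan`, `guess_title`, and the `tolerance` parameter are hypothetical names for illustration and are not part of pdfx or any other library mentioned here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TextSpan:
    """Hypothetical container for a run of text from the first page."""
    text: str
    font_size: float


def guess_title(spans: Iterable[TextSpan], tolerance: float = 0.1) -> Optional[str]:
    """Guess the title as the text set in the largest font on the page.

    Spans whose size is within `tolerance` of the largest size are joined
    in reading order, so a title wrapped over several lines stays together.
    Returns None when there is no non-empty text to choose from.
    """
    candidates = [s for s in spans if s.text.strip()]
    if not candidates:
        return None
    largest = max(s.font_size for s in candidates)
    parts = [
        s.text.strip()
        for s in candidates
        if largest - s.font_size <= tolerance
    ]
    return " ".join(parts)


# Made-up spans standing in for a real first page.
first_page = [
    TextSpan("Preprint, under review", 9.0),
    TextSpan("A Sample Paper", 17.2),
    TextSpan("About Heuristics", 17.2),
    TextSpan("Abstract", 12.0),
]
print(guess_title(first_page))
```

A guess like this will be wrong for pages where a logo, a journal banner, or a drop cap uses the largest font, which is the "will not work everywhere" caveat above.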
I have past experience with KDE's KFileMetaData, which used a similar heuristic, and it used to give good results. This was later removed, though (commit), because KDE as a distro has to make a lot of people happy.
If you're okay with a heuristic-based approach, I could take a stab at implementing this!
Use case: I would really like to have a script that auto-renames my PDFs with proper titles. I actually had a script that was based on KFileMetaData, but I've since moved on to Windows. https://github.com/dufferzafar/.scripts/blob/master/pdf-titles