Title detection heuristics #53

dufferzafar · 2021-11-16T20:58:37Z

Currently, the PDF title is read directly from the metadata info, but most PDFs (like arXiv papers) don't actually have that metadata set. We could add custom logic for PDFs that we detect as arXiv papers, which is what #52 is about, or we could add heuristic-based "guessing" of the title (say, from the text with the largest font on the first page). This obviously won't work everywhere, but it doesn't have to!
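To make the "largest font on the first page" idea concrete, here is a minimal sketch. It assumes some PDF library has already extracted the first page's text runs together with their font sizes; the `Span` type, the `guess_title` function, and the sample values are all hypothetical and not part of pdfx.

```python
from typing import Iterable, NamedTuple, Optional


class Span(NamedTuple):
    """A run of first-page text set in one font size (hypothetical structure)."""
    text: str
    font_size: float


def guess_title(spans: Iterable[Span], min_chars: int = 4) -> Optional[str]:
    """Guess the title as the text set in the largest font on the first page."""
    # Skip very short runs such as drop caps or stray symbols.
    candidates = [s for s in spans if len(s.text.strip()) >= min_chars]
    if not candidates:
        return None
    largest = max(s.font_size for s in candidates)
    # A title may wrap over several lines, so join every run at the largest size,
    # keeping the order in which the runs were given.
    parts = [s.text.strip() for s in candidates if abs(s.font_size - largest) < 0.1]
    return " ".join(parts)


# Made-up sample data standing in for what a PDF parser might return.
first_page = [
    Span("A Study of Title", 17.2),
    Span("Detection Heuristics", 17.2),
    Span("Jane Doe, John Roe", 10.0),
    Span("Abstract", 12.0),
]
print(guess_title(first_page))
```

As the issue says, this will not work everywhere: a journal banner or a large watermark on the first page would win over the real title.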

I have past experience with KDE's KFileMetaData, which used a similar heuristic and gave good results. It was later removed though (commit), because KDE as a distro has to make a lot of people happy.

If you're okay with a heuristic-based approach, I could take a stab at implementing this!

Use case: I would really like to have a script that auto-renames my PDFs with proper titles. I actually had a script based on KFileMetaData, but I've since moved to Windows: https://github.com/dufferzafar/.scripts/blob/master/pdf-titles
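For the renaming side of that use case, a rough sketch could look like the following. It is not the linked script; `title_to_filename` and `rename_pdf` are hypothetical helpers, and the character set is the one Windows rejects in file names.

```python
import re
from pathlib import Path

# Characters that Windows does not allow in file names, plus control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def title_to_filename(title: str, max_len: int = 120) -> str:
    """Turn a detected title into a file name that is safe on Windows."""
    cleaned = _FORBIDDEN.sub(" ", title)
    # Collapse whitespace runs and trim spaces and dots from both ends.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Re-trim after truncation, since Windows rejects trailing dots and spaces.
    return (cleaned[:max_len].rstrip(" .") or "untitled") + ".pdf"


def rename_pdf(path: Path, title: str, dry_run: bool = True) -> Path:
    """Return the target path; only rename when dry_run is False."""
    target = path.with_name(title_to_filename(title))
    # Never overwrite an existing file with the same title.
    if not dry_run and target != path and not target.exists():
        path.rename(target)
    return target


print(title_to_filename('Deep Learning: A "Survey"?'))
```

Defaulting to `dry_run=True` is a deliberate choice here: with a heuristic title, it seems safer to review the proposed names before touching any files.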

dufferzafar · 2021-11-16T21:04:01Z

I just looked at the code, and the reference detection logic is also heuristic in nature:

pdfx/pdfx/extractor.py

Lines 14 to 22 in 9e6864c

    
    # arXiv.org
    ARXIV_REGEX = r"""arxiv:\s?([^\s,]+)"""
    ARXIV_REGEX2 = r"""arxiv.org/abs/([^\s,]+)"""
    # DOI
    DOI_REGEX = r"""DOI:\s?([^\s,]+)"""
    # URL
    URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""  # noqa: E501
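A small illustration of why these patterns count as heuristics. The three short regexes are copied from the snippet above; the `find_ids` helper, the sample sentence, and the use of `re.IGNORECASE` are assumptions for this sketch rather than pdfx's actual calling code.

```python
import re

# Patterns copied from the extractor.py snippet quoted above.
ARXIV_REGEX = r"""arxiv:\s?([^\s,]+)"""
ARXIV_REGEX2 = r"""arxiv.org/abs/([^\s,]+)"""
DOI_REGEX = r"""DOI:\s?([^\s,]+)"""


def find_ids(text: str) -> dict:
    """Collect arXiv and DOI identifiers from plain text (hypothetical helper)."""
    flags = re.IGNORECASE  # assumption: matching should ignore case ("arXiv" vs "arxiv")
    return {
        "arxiv": re.findall(ARXIV_REGEX, text, flags)
        + re.findall(ARXIV_REGEX2, text, flags),
        "doi": re.findall(DOI_REGEX, text, flags),
    }


sample = (
    "See arXiv:1706.03762, also https://arxiv.org/abs/1810.04805 "
    "and DOI: 10.1000/xyz123."
)
print(find_ids(sample))
# {'arxiv': ['1706.03762', '1810.04805'], 'doi': ['10.1000/xyz123.']}
```

Note the trailing period captured in the DOI: the patterns stop only at whitespace or a comma, so they are "good enough" matching rather than strict parsing, which is the same trade-off a title heuristic would make.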

dufferzafar · 2021-11-16T21:05:59Z

There's a Haskell-based tool built around renaming PDFs, https://github.com/2mol/pboy, but it has no title heuristics, just metadata.

dufferzafar · 2021-11-18T20:59:50Z

I actually found a library that implements the heuristics that I talked about: https://github.com/metebalci/pdftitle

It works pretty well, so I've modified my original script to use it instead of KDE's metadump.

Closing this, because I don't think it makes any sense to re-implement this functionality in pdfx.

dufferzafar closed this as completed Nov 18, 2021
