Handling duplicated entries gracefully #3

Open

jrjhealey opened this issue Aug 13, 2018 · 3 comments

@jrjhealey

At present, if a bib entry is repeated, the whole process exits with no attempt to continue, skip over, or remedy the situation, meaning one then has to go hunting for the offending entry manually.

This can be replicated with:

python3 fixbibtex.py examples/duplicates.bib   # or mixedduplicates.bib which contains more than 1 simple repetition

(The example files are in my fork of the repo; a PR is to come.)

Invoking the script gives the user the following traceback:

Traceback (most recent call last):
  File "fixbibtex.py", line 250, in <module>
    cli()
  File "fixbibtex.py", line 246, in cli
    main(args.path)
  File "fixbibtex.py", line 229, in main
    newbib = loop.run_until_complete(fix_bibtex(path))
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/asyncio/base_events.py", line 468, in run_until_complete
    return future.result()
  File "fixbibtex.py", line 99, in fix_bibtex
    bib = parse_bibfile(path)
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/database/__init__.py", line 852, in parse_file
    return parser.parse_file(file)
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/database/input/__init__.py", line 51, in parse_file
    self.parse_stream(f)
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/database/input/bibtex.py", line 385, in parse_stream
    return self.parse_string(text)
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/database/input/bibtex.py", line 380, in parse_string
    self.process_entry(entry_type, *entry[1])
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/database/input/bibtex.py", line 347, in process_entry
    self.data.add_entry(key, entry)
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/database/__init__.py", line 150, in add_entry
    report_error(BibliographyDataError('repeated bibliograhpy entry: %s' % key))
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/errors.py", line 77, in report_error
    raise exception
pybtex.database.BibliographyDataError: repeated bibliograhpy entry: Hitchcock1986

Duplicated entries aren't strictly an issue for TeX itself (a warning is usually given, but I believe compilation continues without error and the duplicates are skipped), so it would be good if the script could handle this scenario gracefully, even if the solution is just to skip the entry and warn, as TeX does.
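
For illustration, here is a minimal sketch of the skip-and-warn idea using pybtex's non-strict error mode (pybtex.errors.set_strict_mode(False)), which, if I'm reading pybtex right, reports problems such as repeated keys as warnings instead of raising. The sample entries are made up and this is not the actual fixbibtex.py change:

# Sketch: parse a bib string containing a repeated key without aborting.
# In non-strict mode pybtex warns about the duplicate on stderr instead of
# raising BibliographyDataError, and parsing carries on.
import pybtex.errors
from pybtex.database import parse_string

pybtex.errors.set_strict_mode(False)  # warn instead of raising

bibtex = """
@article{Hitchcock1986, author = {Hitchcock, A.}, title = {First copy}, year = {1986}}
@article{Hitchcock1986, author = {Hitchcock, A.}, title = {Second copy}, year = {1986}}
"""

bib = parse_string(bibtex, "bibtex")  # warns about the repeated Hitchcock1986 key
print(list(bib.entries.keys()))       # only one Hitchcock1986 entry remains

Whether to flip pybtex's global strict flag or to de-duplicate keys explicitly before parsing is a separate design choice, of course.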

@jaimergp
Owner

I have added a new commit addressing duplicate keys. Let me know if this is enough.

Detecting duplicated entries (I mean, the same entry under different keys, even if the fields are not perfectly equal) could also be interesting to address, but that problem is not that trivial...

@jrjhealey
Author

Tried it on the toy example I made and it seems to handle it quite nicely! I also threw it at my ~400-citation PhD bib file, and it warned about a dozen duplicates. It seemed to go on to complete correctly thereafter, so I'd call this a success - cheers Jaime!

One other observation (not sure if this is intentional or not): it prints the repeated-entry warning twice, once right at the start and again just before the patched-file info.

I think the duplicate-entries-under-different-keys enhancement would be well worthwhile - but agreed, not easy. For my own thesis I grepped and string-replaced through my .tex files to find and fix the incorrect multiple keys at document compile time (though it would have been easier to resolve them upstream in the bib file); a rough sketch of that kind of key audit is included below.
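
For what it's worth, a rough Python sketch of that audit, assuming the keys appear in ordinary \cite/\citet/\citep commands; the regex and file layout are illustrative, and nothing here comes from fixbibtex.py:

# Sketch: count every citation key used in \cite-like commands under a
# directory, so stray or duplicated keys can be reconciled against the
# .bib file by hand.
import re
from collections import Counter
from pathlib import Path

CITE_RE = re.compile(r"\\cite[tp]?\*?(?:\[[^\]]*\])*\{([^}]*)\}")

def citation_keys(tex_dir="."):
    counts = Counter()
    for tex in Path(tex_dir).glob("**/*.tex"):
        for match in CITE_RE.finditer(tex.read_text(errors="ignore")):
            for key in match.group(1).split(","):
                counts[key.strip()] += 1
    return counts

if __name__ == "__main__":
    for key, n in sorted(citation_keys().items()):
        print(key, n)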

@jaimergp
Owner

jaimergp commented Jan 2, 2019

@jaimergp:

Detecting duplicated entries (I mean, same entries with different keys, even if not perfectly equal) could be also interesting to address, but that problem is not that trivial...

Maybe we can use MinHash or other LSH techniques to address this. These techniques could also be applied to compute similarity scores against search results. I have to look into that!
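
As a toy illustration of the MinHash idea (pure Python, character shingles; the example strings, shingle size, and number of hash functions are arbitrary assumptions, not a worked-out design):

# Toy MinHash sketch for spotting near-duplicate bib entries under different
# keys: each entry is reduced to a set of character shingles, the set is
# summarised by its minimum hash under several salted hash functions, and the
# fraction of matching minima approximates the Jaccard similarity.
import hashlib

def shingles(text, k=4):
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(shingle_set, num_hashes=64):
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingle_set
        ))
    return sig

def estimated_similarity(a, b):
    sig_a = minhash_signature(shingles(a))
    sig_b = minhash_signature(shingles(b))
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

entry_a = "Hitchcock, A. (1986) The Birds and bibliography hygiene. J. Ornithol."
entry_b = "A. Hitchcock. The Birds and bibliography hygiene. Journal of Ornithology, 1986."
print(f"estimated similarity: {estimated_similarity(entry_a, entry_b):.2f}")

Banding the signatures (the LSH part) would then let similar entries be bucketed together without comparing every pair.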
