Handling duplicated entries gracefully #3

Open

jrjhealey opened this issue Aug 13, 2018 · 3 comments

@jrjhealey

At present, if a bib entry is repeated, the whole process exits with no attempt to continue, skip over, or remedy the situation, meaning one then has to go hunting for the offending entry manually.

This can be replicated with:

python3 fixbibtex.py examples/duplicates.bib   # or mixedduplicates.bib which contains more than 1 simple repetition

(The example files are in my fork of the repo; a PR is to come.)

Invoking the script gives the user the following traceback:

Traceback (most recent call last):
  File "fixbibtex.py", line 250, in <module>
    cli()
  File "fixbibtex.py", line 246, in cli
    main(args.path)
  File "fixbibtex.py", line 229, in main
    newbib = loop.run_until_complete(fix_bibtex(path))
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/asyncio/base_events.py", line 468, in run_until_complete
    return future.result()
  File "fixbibtex.py", line 99, in fix_bibtex
    bib = parse_bibfile(path)
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/database/__init__.py", line 852, in parse_file
    return parser.parse_file(file)
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/database/input/__init__.py", line 51, in parse_file
    self.parse_stream(f)
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/database/input/bibtex.py", line 385, in parse_stream
    return self.parse_string(text)
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/database/input/bibtex.py", line 380, in parse_string
    self.process_entry(entry_type, *entry[1])
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/database/input/bibtex.py", line 347, in process_entry
    self.data.add_entry(key, entry)
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/database/__init__.py", line 150, in add_entry
    report_error(BibliographyDataError('repeated bibliograhpy entry: %s' % key))
  File "/Users/joehealey/Applications/miniconda3/lib/python3.6/site-packages/pybtex/errors.py", line 77, in report_error
    raise exception
pybtex.database.BibliographyDataError: repeated bibliograhpy entry: Hitchcock1986

Duplicated entries aren't strictly an issue for TeX itself (a warning is usually given, but I believe compilation continues without error and the duplicates are skipped), so it would be good if the script could handle this scenario gracefully, even if the solution is just to skip the entry and warn, as TeX does.
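
For illustration, here is a minimal sketch of the skip-and-warn idea using pybtex's non-strict error mode (pybtex.errors.set_strict_mode(False)), which, if I'm reading pybtex right, reports problems such as repeated keys as warnings instead of raising. The sample entries are made up and this is not the actual fixbibtex.py change:

# Sketch: parse a bib string containing a repeated key without aborting.
# In non-strict mode pybtex warns about the duplicate on stderr instead of
# raising BibliographyDataError, and parsing carries on.
import pybtex.errors
from pybtex.database import parse_string

pybtex.errors.set_strict_mode(False)  # warn instead of raising

bibtex = """
@article{Hitchcock1986, author = {Hitchcock, A.}, title = {First copy}, year = {1986}}
@article{Hitchcock1986, author = {Hitchcock, A.}, title = {Second copy}, year = {1986}}
"""

bib = parse_string(bibtex, "bibtex")  # warns about the repeated Hitchcock1986 key
print(list(bib.entries.keys()))       # only one Hitchcock1986 entry remains

Whether to flip pybtex's global strict flag or to de-duplicate keys explicitly before parsing is a separate design choice, of course.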

@jaimergp
Owner

I have added a new commit addressing duplicate keys. Let me know if this is enough.

Detecting duplicated entries (I mean, the same entry under different keys, even if the fields are not perfectly equal) could also be interesting to address, but that problem is not that trivial...

@jrjhealey
Author

Tried it on the toy example I made and it seems to handle it quite nicely! I also threw it at my ~400-citation PhD bib file, and it warned about a dozen duplicates. It seemed to go on to complete correctly thereafter, so I'd call this a success - cheers Jaime!

One other observation (not sure if this is intentional or not): it prints the repeated-entry warning twice, once right at the start and again just before the patched-file info.

I think the duplicate-entries-under-different-keys enhancement would be well worthwhile - but agreed, not easy. For my own thesis I grepped and string-replaced through my .tex files to find and fix the incorrect multiple keys at document compile time (though it would have been easier to resolve them upstream in the bib file); a rough sketch of that kind of key audit is included below.
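
For what it's worth, a rough Python sketch of that audit, assuming the keys appear in ordinary \cite/\citet/\citep commands; the regex and file layout are illustrative, and nothing here comes from fixbibtex.py:

# Sketch: count every citation key used in \cite-like commands under a
# directory, so stray or duplicated keys can be reconciled against the
# .bib file by hand.
import re
from collections import Counter
from pathlib import Path

CITE_RE = re.compile(r"\\cite[tp]?\*?(?:\[[^\]]*\])*\{([^}]*)\}")

def citation_keys(tex_dir="."):
    counts = Counter()
    for tex in Path(tex_dir).glob("**/*.tex"):
        for match in CITE_RE.finditer(tex.read_text(errors="ignore")):
            for key in match.group(1).split(","):
                counts[key.strip()] += 1
    return counts

if __name__ == "__main__":
    for key, n in sorted(citation_keys().items()):
        print(key, n)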

@jaimergp
Owner

jaimergp commented Jan 2, 2019

@jaimergp:

Detecting duplicated entries (I mean, same entries with different keys, even if not perfectly equal) could be also interesting to address, but that problem is not that trivial...

Maybe we can use MinHash or other LSH techniques to address this. These techniques could also be applied to compute similarity scores against search results. I have to look into that!
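
As a toy illustration of the MinHash idea (pure Python, character shingles; the example strings, shingle size, and number of hash functions are arbitrary assumptions, not a worked-out design):

# Toy MinHash sketch for spotting near-duplicate bib entries under different
# keys: each entry is reduced to a set of character shingles, the set is
# summarised by its minimum hash under several salted hash functions, and the
# fraction of matching minima approximates the Jaccard similarity.
import hashlib

def shingles(text, k=4):
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(shingle_set, num_hashes=64):
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingle_set
        ))
    return sig

def estimated_similarity(a, b):
    sig_a = minhash_signature(shingles(a))
    sig_b = minhash_signature(shingles(b))
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

entry_a = "Hitchcock, A. (1986) The Birds and bibliography hygiene. J. Ornithol."
entry_b = "A. Hitchcock. The Birds and bibliography hygiene. Journal of Ornithology, 1986."
print(f"estimated similarity: {estimated_similarity(entry_a, entry_b):.2f}")

Banding the signatures (the LSH part) would then let similar entries be bucketed together without comparing every pair.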
