Why not use the raw_csv hexdigest to detect duplicates? #134
Hi! First of all, thanks to everyone who's contributed to this project! It's been invaluable. Anyways. The change was introduced back in 2017 by mondjef, in a pull request intended to avoid false positives caused by confusing otherwise identical transactions that affect different source accounts; it unintentionally introduced a new class of its own.

So, using both:

```python
# self.md5sum = hashlib.md5(self.raw_csv.encode('utf-8')).hexdigest()
# self.md5sum = hashlib.md5(','.join(x.strip() for x in (self.date,self.desc,self.credit,self.debit,self.credit_account)).encode('utf-8')).hexdigest()
self.md5sum = hashlib.md5(','.join(x.strip() for x in (self.raw_csv,self.credit_account)).encode('utf-8')).hexdigest()
```

Or adjusting the dupe detection to check:

```python
# I don't usually write python, this might not work
# if (options.skip_dupes or options.confirm_dupes) and entry.md5sum in md5sum_hashes:
if (options.skip_dupes or options.confirm_dupes) and (entry.md5sum in md5sum_hashes) and (entry.raw_csv in csv_comments):
```

The first option makes the most sense, but I'm nagged by the lack of a migration path. It would be sweet to check whether the hash of a transaction matches the old MD5 construct and prompt to update it (sketched below). I don't know if it bothers anyone else, and it's still the right move toward a better piece of software, so I'll submit a pull request anyways. I don't have the time to do anything fancy. Just to be sure, there are no other fields like
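Something like this for the migration check, as a rough sketch (the helper names are mine, and `md5sum_hashes` is the set from the dupe-detection snippet above):

```python
import hashlib

def old_md5sum(raw_csv):
    # Original construct: hash of the raw CSV line alone.
    return hashlib.md5(raw_csv.encode('utf-8')).hexdigest()

def new_md5sum(raw_csv, credit_account):
    # Proposed construct: raw CSV line plus the source account.
    joined = ','.join(x.strip() for x in (raw_csv, credit_account))
    return hashlib.md5(joined.encode('utf-8')).hexdigest()

def maybe_migrate(entry, md5sum_hashes):
    # If the ledger only carries the old-style hash for this entry,
    # offer to rewrite it to the new construct.
    old = old_md5sum(entry.raw_csv)
    new = new_md5sum(entry.raw_csv, entry.credit_account)
    if old in md5sum_hashes and new not in md5sum_hashes:
        answer = input('md5sum {}... uses the old construct; update it? [y/N] '.format(old[:8]))
        if answer.strip().lower() == 'y':
            return new
    return None
```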
Update to quentinsf#100, see quentinsf#134. Note that this changes the checksums of currently recorded transactions, so they will be treated as unique. Appropriate adjustments have been made to the existing tests.
We should also consider a problem of another nature here: my bank has the unpleasant habit of changing the format of its CSV exports (even though it is the 7th largest bank in the world in some rankings). I added two options to control the checksum generation: the first is the list of fields to consider, the second is a regex that removes part of the desc field. It is this second option that I use to strip what the bank adds to the communication, which varies from one export to another. The modification is minimal; you can take a look at it here: https://github.com/areynoua/icsv2ledger/commit/3ca0ff5722083f41baabbbe34ff2af04ebc79dc1
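A minimal sketch of that idea, with illustrative option and function names rather than the exact ones from the linked commit:

```python
import hashlib
import re

def make_md5sum(fields, checksum_fields, desc_strip_regex=None):
    # fields: parsed CSV fields, e.g. {'date': ..., 'desc': ..., 'debit': ...}
    # checksum_fields: ordered list of field names to include in the hash
    # desc_strip_regex: optional pattern removed from desc before hashing,
    #   to drop the variable part the bank appends to the communication
    values = []
    for name in checksum_fields:
        value = fields[name].strip()
        if name == 'desc' and desc_strip_regex:
            value = re.sub(desc_strip_regex, '', value)
        values.append(value)
    return hashlib.md5(','.join(values).encode('utf-8')).hexdigest()

# Example: ignore a trailing export reference in desc.
fields = {'date': '2020-01-02', 'desc': 'GROCERY STORE ref:A1B2C3', 'debit': '42.00'}
print(make_md5sum(fields, ['date', 'desc', 'debit'], r'\s*ref:\w+$'))
```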
I was in too much of a hurry to share the change before I could test it (how awful!). The commit mentioned above contains errors, but they are easy to fix: https://github.com/areynoua/icsv2ledger/commit/34fb3e698741b5d336076ce7508283dcefe9c8dc
That is an important consideration, and a strong argument for basing the hash on portions of the extracted data after all. I just happen to work with an archive of monthly statements, which aren't affected by format changes quite as badly. Another pathological example is the not-so-unlikely case of two identical transactions with no fine-grained datetime or account-balance fields to differentiate them, which raises questions about the applicability of duplicate detection and the workflows users want supported.
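If one did want dupe detection to survive that case, one possibility (purely a sketch, not something icsv2ledger does) is to mix an occurrence counter into the digest, so identical lines within a single import stay distinguishable and re-importing the same file reproduces the same digests:

```python
import hashlib
from collections import Counter

seen = Counter()

def md5sum_with_index(raw_csv):
    # Disambiguate identical raw lines by how many times this exact
    # line has already been seen in the current import run.
    base = hashlib.md5(raw_csv.encode('utf-8')).hexdigest()
    seen[base] += 1
    return hashlib.md5('{},{}'.format(base, seen[base]).encode('utf-8')).hexdigest()

line = '02/01/2020,COFFEE SHOP,,3.50'
assert md5sum_with_index(line) != md5sum_with_index(line)
```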
I import CSV files which will produce duplicate postings (in terms of desc, credit, debit). With the --skip-dupes option set, these postings are ignored. This problem can be resolved by using the md5sum of the raw_csv line.
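For instance (a toy illustration with a made-up column layout), two rows that differ only in a column excluded from the field-based hash collide under it, but not under the raw_csv construct:

```python
import hashlib

def field_md5(date, desc, credit, debit):
    # Simplified field-based construct (the real one also mixes in
    # the credit account).
    joined = ','.join(x.strip() for x in (date, desc, credit, debit))
    return hashlib.md5(joined.encode('utf-8')).hexdigest()

def raw_md5(raw_csv):
    # The commented-out raw_csv construct.
    return hashlib.md5(raw_csv.encode('utf-8')).hexdigest()

# Two distinct lines: same date, desc and amount, different reference column.
row1 = '02/01/2020,COFFEE SHOP,,3.50,REF001'
row2 = '02/01/2020,COFFEE SHOP,,3.50,REF002'

f1, f2 = row1.split(','), row2.split(',')
assert field_md5(*f1[:4]) == field_md5(*f2[:4])  # seen as duplicates
assert raw_md5(row1) != raw_md5(row2)            # distinguished
```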
This observation is also noted in icsv2ledger/icsv2ledger.py, lines 558 to 560 at commit 55629c7, where the raw_csv variant is commented out.
Why not un-comment that line? Or provide an option to use the raw_csv input as a skip-dupes criterion?