Can it decode other languages, such as Chinese? #2

Open
l1lsl0th opened this issue Mar 22, 2021 · 11 comments

Comments

@l1lsl0th

I'm new to Python, but I think it's having an issue decoding Chinese. Maybe it needs encoding="utf-8"?

Error:
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 121: character maps to <undefined>

@robertmartin8
Owner

Did the error say which line was causing the problems? I don't think I've ever tried it on Chinese characters / Kanji etc.

Also, which Python version are you using?

@l1lsl0th
Author

I am running Python 3.8 but have tried 3.9 too. Thanks a bunch.

@aiturri

aiturri commented May 12, 2021

Hi! I have the same problem using your script with Python 3.7.6 and books in English and Spanish:

Traceback (most recent call last):
File "KindleClippings.py", line 116, in <module>
parse_clippings(source_file, destination)
File "KindleClippings.py", line 57, in parse_clippings
for highlight in f.read().split("=========="):
File "d:\Miniconda3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1589: character maps to <undefined>

Thanks
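
The traceback passes through `cp1252.py`, which suggests the file was decoded with the Windows default encoding rather than UTF-8. A minimal sketch of that failure mode follows, assuming the clippings file is UTF-8 and contains a curly closing quote (U+201D). The sample text is made up for illustration and is not taken from anyone's clippings.

```python
# Hypothetical sample: a highlight wrapped in curly quotes (U+201C, U+201D).
sample = "\u201cHola\u201d"
raw = sample.encode("utf-8")  # the bytes as they would sit in a UTF-8 file

# U+201D encodes to E2 80 9D in UTF-8, and 0x9d is unassigned in cp1252.
assert b"\x9d" in raw

try:
    # Roughly what open() does in text mode when the platform default is cp1252.
    raw.decode("cp1252")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Decoding the same bytes as UTF-8 round-trips cleanly.
decoded = raw.decode("utf-8")
```

If this is what is happening, the error depends on the platform default encoding rather than on the book's language, which would fit it showing up with Spanish and English text as well as Chinese.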

@robertmartin8
Owner

Hi @aiturri,

Thanks for raising this. Would it be possible for you to share the part of the clipping file that is causing the errors? I'd love to try and fix this but can't reproduce the error.

Best,
Robert

@aiturri

aiturri commented May 12, 2021

Sure!
Thanks in advance!

@robertmartin8
Owner

Hi @aiturri,

I've just pushed a potential fix. Can you download the script again and try?

Otherwise you can manually modify line 55 to specify an encoding.

    with open(source_file, "r", encoding="utf8") as f:

Let me know whether it works. For the record, the original script worked fine on my machine with your clippings file, so I couldn't verify the issue.

Best,
Robert

@aiturri

aiturri commented May 12, 2021

Hi @robertmartin8, I tried again, and it's still not working:

Traceback (most recent call last):
File "KindleClippings.py", line 116, in <module>
parse_clippings(source_file, destination)
File "KindleClippings.py", line 88, in parse_clippings
outfile.write(clipping_text + "\n\n...\n\n")
File "d:\Miniconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u03c3' in position 142: character maps to <undefined>

I will attach my original clippings file here so you can try, but I will delete it as soon as you download it. Please let me know so I can delete it (for privacy reasons!).

Thanks again!

@robertmartin8
Owner

@aiturri OK, I've downloaded it. Feel free to remove it.

@robertmartin8
Owner

@aiturri I still can't reproduce it; I can parse your file, accents and all. I think it's a Mac/Windows issue.

Can you try again? I forgot to add encoding="utf8" to a couple of the file opens.
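
The `UnicodeEncodeError` reported earlier in the thread was raised on the write side, so the output files need an explicit encoding too. Below is a rough sketch of the idea, not the script's actual code: the file name and sample text are hypothetical, and a temporary directory stands in for the real destination folder.

```python
import os
import tempfile

text = "\u03c3 (sigma) in a highlight"  # hypothetical clipping text

# cp1252 has no mapping for U+03C3, so encoding with it fails.
try:
    text.encode("cp1252")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# Passing encoding="utf8" to every open() avoids relying on the platform default.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "book.txt")  # hypothetical output file name
    with open(path, "w", encoding="utf8") as outfile:
        outfile.write(text + "\n\n...\n\n")
    with open(path, "r", encoding="utf8") as textfile:
        current_text = textfile.read()
```

On macOS the default text encoding is typically already UTF-8, which may be why the problem did not reproduce there.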

@aiturri

aiturri commented May 12, 2021

@robertmartin8

Traceback (most recent call last):
File "KindleClippings.py", line 117, in <module>
parse_clippings(source_file, destination)
File "KindleClippings.py", line 82, in parse_clippings
current_text = textfile.read()
File "d:\Miniconda3\lib\codecs.py", line 322, in decode
(result, consumed) = self.buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 472: invalid start byte

@robertmartin8
Owner

@aiturri OK, it seems this is related to a particular Windows encoding. Other people seem to have had the same issue.
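
One possible reading of the `0x92` error, not confirmed in this thread: in cp1252, byte 0x92 is the curly apostrophe (U+2019), so a file saved under cp1252 cannot be read back as UTF-8. A small sketch of that mismatch, using made-up sample text:

```python
# Hypothetical text containing a curly apostrophe, saved under cp1252.
raw = "don\u2019t".encode("cp1252")
assert raw == b"don\x92t"

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    # 0x92 has the bit pattern of a continuation byte, so it cannot start a character.
    utf8_error = exc

# Reading the bytes with the encoding they were written in recovers the text.
recovered = raw.decode("cp1252")
```

Under that assumption, a file written by an earlier run with the cp1252 default would need to be read with cp1252 or regenerated.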

(Please save your clippings file beforehand just in case)

I've added two fixes. The first just ignores the errors; have a go and see whether it works (the output might be garbled).
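
To illustrate why the output might be garbled, here is a short sketch assuming the fix uses Python's standard `errors="ignore"` handler (the thread does not show the actual change). The sample bytes are hypothetical. `open()` accepts the same `errors` parameter as `bytes.decode()`.

```python
# Hypothetical cp1252 bytes: "reader's note" with a curly apostrophe (0x92).
raw = b"reader\x92s note"

# errors="ignore" silently drops bytes that are not valid UTF-8.
lossy = raw.decode("utf-8", errors="ignore")

# errors="replace" leaves a visible U+FFFD marker where a byte was invalid.
marked = raw.decode("utf-8", errors="replace")
```

Here `lossy` loses the apostrophe entirely, while `marked` at least shows where something went missing.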

The second is a new argument to specify the encoding:

    python KindleClippings.py -encoding=cp1252

It might solve your problem?
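
For readers curious how such an option could be wired up, here is a hedged sketch using `argparse`. The parser, its default value, and the help text are assumptions for illustration only. The script's real argument handling is not shown in this thread and may differ.

```python
import argparse


def build_parser():
    # Hypothetical parser; the script's real argument handling may differ.
    parser = argparse.ArgumentParser(description="Parse Kindle clippings")
    parser.add_argument(
        "-encoding",
        default="utf8",
        help="encoding used to read the clippings file (for example cp1252)",
    )
    return parser


# Simulates running with "-encoding=cp1252" on the command line.
args = build_parser().parse_args(["-encoding=cp1252"])
source_encoding = args.encoding
```

The parsed value could then be passed as the `encoding` argument of `open()`.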


3 participants