I opened the file in 'rb' mode, and the output contains many undecoded bytes:
with open('/users/cheng/NLP/Data/finished_files/chunked/test_000.bin', 'rb') as file:
    for line in file:
        print(line)
b'R\x1e\x00\x00\x00\x00\x00\x00\n'
b'\xcf<\n'
b'\xf0\x02\n'
b'\x08abstract\x12\xe3\x02\n'
b'\xe0\x02\n'
b"\xdd\x02<s> marseille prosecutor says `` so far no videos were used in the crash investigation '' despite media reports . </s> <s> journalists at bild and paris match are `` very confident '' the video clip is real , an editor says . </s> <s> andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says . </s>\n"
b'\xd99\n'
b'\x07article\x12\xcd9\n'
b'\xca9\n'
Then I tried to process the files myself: I split the article and abstract and wrote them to separate files, but I got this error after processing most of the files:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
How can I get a clean article and abstract from these files?
@JunjieCheng This is the binary format that the TensorFlow code accepts for testing; it is preprocessed test data. The code reads binary data because the system can read it quickly. If you do not want to convert to binary, you can change the code to suit your needs, since it is openly available. Please do not ask what to change, as that part is for you to work out; if you run into any issue, ask here.
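For anyone hitting the same error, here is a minimal sketch of reading these files record by record instead of line by line. It assumes each record is an 8-byte little-endian length followed by a serialized tf.Example of that length, which is consistent with the first line of the output in the question (b'R\x1e\x00\x00\x00\x00\x00\x00' unpacks to 7762), and that the `article` and `abstract` features each hold a single byte string. Iterating with `for line in file` splits the data at every 0x0a byte, including ones that belong to length prefixes rather than text, so the fragments are not valid UTF-8 on their own; that is a likely cause of the UnicodeDecodeError, though I have not verified it on these files. `read_records` and `decode_example` are hypothetical helper names, and `decode_example` needs TensorFlow installed.

```python
import io
import struct


def read_records(stream):
    """Yield the raw payload of each length-prefixed record in a binary stream."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) < 8:
            raise ValueError("truncated length header")
        # Assumption: 8-byte little-endian signed length before every record.
        (length,) = struct.unpack("<q", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


def decode_example(payload):
    """Return (article, abstract) from one serialized tf.Example payload."""
    # Imported lazily so that read_records works without TensorFlow installed.
    from tensorflow.core.example import example_pb2

    example = example_pb2.Example.FromString(payload)
    features = example.features.feature
    # errors="replace" keeps going if a stored string is not valid UTF-8.
    article = features["article"].bytes_list.value[0].decode("utf-8", errors="replace")
    abstract = features["abstract"].bytes_list.value[0].decode("utf-8", errors="replace")
    return article, abstract


# Framing check on in-memory data; no real .bin file is needed for this part.
demo = io.BytesIO(struct.pack("<q", 3) + b"a\nb" + struct.pack("<q", 2) + b"cd")
print(list(read_records(demo)))
```

To use this on a real file, open it with 'rb', loop over `read_records(file)`, and pass each payload to `decode_example`. The abstract will still contain the `<s>` and `</s>` sentence markers shown in the output, so strip those if you want plain text.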