
Update make_datafiles.py #25

Open · wants to merge 1 commit into master
Conversation

the-black-knight-01

This removes the following Python 3 errors:

Error 1:

Traceback (most recent call last):
  File "make_datafiles.py", line 239, in <module>
    write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
  File "make_datafiles.py", line 154, in write_to_bin
    url_hashes = get_url_hashes(url_list)
  File "make_datafiles.py", line 106, in get_url_hashes
    return [hashhex(url) for url in url_list]
  File "make_datafiles.py", line 106, in <listcomp>
    return [hashhex(url) for url in url_list]
  File "make_datafiles.py", line 101, in hashhex
    h.update(s)
TypeError: Unicode-objects must be encoded before hashing
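In Python 3, `hashlib` accepts only bytes, so the URL string has to be encoded before hashing. A minimal sketch of the fix, assuming `hashhex` and `get_url_hashes` follow the structure shown in the traceback:

```python
import hashlib

def hashhex(s):
  """Return a hexadecimal-formatted SHA1 hash of the input string."""
  h = hashlib.sha1()
  h.update(s.encode('utf-8'))  # encode str -> bytes; Python 3 hashlib rejects str
  return h.hexdigest()

def get_url_hashes(url_list):
  return [hashhex(url) for url in url_list]
```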

Error 2:

PTBTokenizer tokenized 203071165 tokens at 1811476.32 tokens per second.
Stanford CoreNLP Tokenizer has finished.
Successfully finished tokenizing dailymail/stories/ to dm_stories_tokenized.

Making bin file for URLs listed in url_lists/all_test.txt...
Writing story 0 of 11490; 0.00 percent done
Traceback (most recent call last):
  File "make_datafiles.py", line 239, in <module>
    write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
  File "make_datafiles.py", line 184, in write_to_bin
    tf_example.features.feature['article'].bytes_list.value.extend([article])
TypeError: "marseille , france -lrb- cnn -rrb- the french prosecutor leading an investigation into the crash of has type str, but expected one of: bytes

Error 3:
The script uses Python 2 `print` statements; in Python 3, `print` is a built-in function and requires parentheses.
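For example (the format string below is illustrative, consistent with the "Writing story 0 of 11490" log line above):

```python
# Python 2 statement syntax (a SyntaxError under Python 3):
# print "Writing story %i of %i; %.2f percent done" % (idx, num_stories, idx * 100.0 / num_stories)

# Python 3 function syntax:
print("Writing story %i of %i; %.2f percent done" % (idx, num_stories, idx * 100.0 / num_stories))
```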
