
Update make_datafiles.py #25

Open · wants to merge 1 commit into master
Conversation

the-black-knight-01

This removes the following Python 3 errors:

Error 1:

Traceback (most recent call last):
  File "make_datafiles.py", line 239, in <module>
    write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
  File "make_datafiles.py", line 154, in write_to_bin
    url_hashes = get_url_hashes(url_list)
  File "make_datafiles.py", line 106, in get_url_hashes
    return [hashhex(url) for url in url_list]
  File "make_datafiles.py", line 106, in <listcomp>
    return [hashhex(url) for url in url_list]
  File "make_datafiles.py", line 101, in hashhex
    h.update(s)
TypeError: Unicode-objects must be encoded before hashing
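In Python 3, `hashlib` accepts only bytes, so the URL string has to be encoded before hashing. A minimal sketch of the fix, assuming `hashhex` and `get_url_hashes` follow the structure shown in the traceback:

```python
import hashlib

def hashhex(s):
  """Return a hexadecimal-formatted SHA1 hash of the input string."""
  h = hashlib.sha1()
  h.update(s.encode('utf-8'))  # encode str -> bytes; Python 3 hashlib rejects str
  return h.hexdigest()

def get_url_hashes(url_list):
  return [hashhex(url) for url in url_list]
```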

Error 2:

PTBTokenizer tokenized 203071165 tokens at 1811476.32 tokens per second.
Stanford CoreNLP Tokenizer has finished.
Successfully finished tokenizing dailymail/stories/ to dm_stories_tokenized.

Making bin file for URLs listed in url_lists/all_test.txt...
Writing story 0 of 11490; 0.00 percent done
Traceback (most recent call last):
  File "make_datafiles.py", line 239, in <module>
    write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
  File "make_datafiles.py", line 184, in write_to_bin
    tf_example.features.feature['article'].bytes_list.value.extend([article])
TypeError: "marseille , france -lrb- cnn -rrb- the french prosecutor leading an investigation into the crash of has type str, but expected one of: bytes

Error 3:
The script uses Python 2 `print` statements; in Python 3, `print` is a built-in function and requires parentheses.
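For example (the format string below is illustrative, consistent with the "Writing story 0 of 11490" log line above):

```python
# Python 2 statement syntax (a SyntaxError under Python 3):
# print "Writing story %i of %i; %.2f percent done" % (idx, num_stories, idx * 100.0 / num_stories)

# Python 3 function syntax:
print("Writing story %i of %i; %.2f percent done" % (idx, num_stories, idx * 100.0 / num_stories))
```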
