
error while running make_datafile.py #16

Open
97yogitha opened this issue Oct 28, 2017 · 11 comments

Comments

@97yogitha

97yogitha commented Oct 28, 2017

@abisee this is the error that I get when I run the command `python make_datafiles.py cnn/stories dailymail/stories`:

Preparing to tokenize cnn/stories to cnn_stories_tokenized...
Making list of files to tokenize...
Tokenizing 92579 files in cnn/stories and saving in cnn_stories_tokenized...
Exception in thread "main" java.io.IOException: Stream closed
	at java.io.BufferedWriter.ensureOpen(BufferedWriter.java:116)
	at java.io.BufferedWriter.write(BufferedWriter.java:221)
	at java.io.Writer.write(Writer.java:157)
	at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:505)
	at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:450)
	at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:813)
Stanford CoreNLP Tokenizer has finished.
Traceback (most recent call last):
  File "make_datafiles.py", line 235, in <module>
    tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir)
  File "make_datafiles.py", line 86, in tokenize_stories
    raise Exception("The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
Exception: The tokenized stories directory cnn_stories_tokenized contains 1 files, but it should contain the same number as cnn/stories (which has 92579 files). Was there an error during tokenization?
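(For context: the exception at the end is just a file-count sanity check that runs after the Java tokenizer exits, so the real failure is the earlier `java.io.IOException`. A rough sketch of what that check looks like, with approximate names, assuming it simply compares directory listings:)

```python
import os

def check_num_stories(stories_dir, tokenized_stories_dir):
    # Compare file counts before and after tokenization; a mismatch
    # means the tokenizer died partway through (as in the stack trace above).
    num_orig = len(os.listdir(stories_dir))
    num_tokenized = len(os.listdir(tokenized_stories_dir))
    if num_orig != num_tokenized:
        raise Exception(
            "The tokenized stories directory %s contains %i files, but it "
            "should contain the same number as %s (which has %i files). "
            "Was there an error during tokenization?"
            % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
```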
@JafferWilson

Please let me know: are you using stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar, or the 2017 one? This error mostly occurs when you are not using stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar. Please check.

@ibarrien

ibarrien commented Oct 30, 2017 via email

@JafferWilson

I have already created the processed files; you can use those without any issue. Here is the link: https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail
Use Python 2.7.

@97yogitha
Author

@JafferWilson Yes, I am using stanford-corenlp-full-2017-09-0/stanford-corenlp-3.8.0.jar. I will use the processed file.

@JafferWilson

@97yogitha No, do not use the 2017 one; use the 2016 version mentioned in the README file of the repository.

@IreneZihuiLi

@JafferWilson Thanks for the help. I used 3.7.0 from https://stanfordnlp.github.io/CoreNLP/history.html and it worked.

@Neuqmiao

Neuqmiao commented Dec 7, 2017

Thanks very much. I encountered this problem today with the newest version, 3.8.0; after I switched to 3.7.0, it worked.

@JafferWilson

Could someone please close this issue?

@Sharathnasa

@JafferWilson Could you help with running the neural network on our own data? How do we generate .bin files for our own articles?

I have a clear idea about tokenization, but what about the URL mapping? How is it done?
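(On the URL mapping: in the original CNN/Daily Mail pipeline, each story in the url_lists splits is matched to its file by the SHA-1 hex digest of the story's source URL. A minimal sketch of that hashing step, assuming the same scheme as make_datafiles.py; for your own articles you can skip the URL lists entirely and just enumerate your story filenames:)

```python
import hashlib

def hashhex(s):
    # SHA-1 hex digest of a URL string; the original pipeline uses this
    # digest as the tokenized story's filename.
    h = hashlib.sha1()
    h.update(s.encode("utf-8"))
    return h.hexdigest()
```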

@dondon2475848

Hi @Sharathnasa
You can clone the repository below:
https://github.com/dondon2475848/make_datafiles_for_pgn
Then run:

python make_datafiles.py  ./stories  ./output

It processes your test data into the binary format.
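(If you want to sanity-check the resulting .bin files: as far as I understand this pipeline, each record is a length-prefixed serialized tf.Example, i.e. an 8-byte native-endian length followed by that many payload bytes. A hedged sketch of a framing reader that needs no TensorFlow, just to confirm the files are well-formed:)

```python
import struct

def read_bin_records(path):
    # Yield each raw serialized tf.Example payload from a .bin file.
    # Framing assumed: 8-byte length ("q"), then that many payload bytes.
    with open(path, "rb") as f:
        while True:
            len_bytes = f.read(8)
            if not len_bytes:
                break  # clean end of file
            str_len = struct.unpack("q", len_bytes)[0]
            yield struct.unpack("%ds" % str_len, f.read(str_len))[0]
```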

@ARNABKUMARPAN

Check the subprocess.call(command) invocation: set the classpath with os.environ["CLASSPATH"] = 'stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar', then run again.
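(A sketch of that fix, assuming the 3.7.0 jar sits in the working directory and `mapping.txt` is the tokenizer's file list; the command mirrors how the script drives PTBTokenizer, but build it separately so you can inspect it before spawning Java:)

```python
import os
import subprocess

def build_tokenizer_command(mapping_file):
    # Pin CLASSPATH to the 3.7.0 jar so PTBTokenizer resolves from it,
    # not from a newer CoreNLP release on the path.
    os.environ["CLASSPATH"] = (
        "stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar")
    return ["java", "edu.stanford.nlp.process.PTBTokenizer",
            "-ioFileList", "-preserveLines", mapping_file]

# Then run it exactly as the script does:
# subprocess.call(build_tokenizer_command("mapping.txt"))
```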
