singgalang

An automatically generated named entity recognition (NER) dataset of 48K sentences.

The dataset conforms to the Stanford NER dataset format.

Four named entity classes are used:

"Person" for person names
"Place" for place names
"Organisation" for organization names
"O" for others
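
As a rough illustration of the format, here is a minimal Python sketch that reads a file in this layout and counts how many tokens carry each label. It assumes the usual Stanford NER column layout: one token per line, followed by whitespace (typically a tab) and its label, with a blank line between sentences. That layout, the helper names, and the sample lines are assumptions made for illustration, so check them against the actual dataset file and its .prop file.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences as lists of (token, label) pairs.

    Assumption: each non-blank line is "token<whitespace>label" and a
    blank line ends a sentence.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        parts = line.rsplit(None, 1)  # split off the last column as the label
        if len(parts) != 2:
            raise ValueError(f"expected 'token label', got: {line!r}")
        sentence.append((parts[0], parts[1]))
    if sentence:
        yield sentence


def count_labels(sentences: Iterable[Sentence]) -> Counter:
    """Count how many tokens carry each label."""
    return Counter(label for sent in sentences for _, label in sent)


# Made-up sample lines, not taken from the dataset.
sample = [
    "Budi\tPerson\n",
    "tinggal\tO\n",
    "di\tO\n",
    "Padang\tPlace\n",
    "\n",
]
print(count_labels(read_sentences(sample)))
```

To run it on the real data, pass an open file instead of `sample`, for example `open(path, encoding="utf-8")`, where `path` points to the downloaded dataset file.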

References

The dataset is free to use. If you publish a paper or other work that uses it, please cite these publications:

How to create an NER model using this dataset?

We suggest using the Stanford NER library.
The steps to create an NER model with the Stanford NER library are as follows:

  1. Download Stanford-NER.

  2. Download the dataset and its properties file (the file with the .prop extension).

  3. Use the Stanford NER classifier to create the model.
    For example:
    java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop singgalang.prop

    I recommend increasing the heap size so that you can train on the dataset on a computer with limited RAM. Add an option such as "-Xmx1024m" to the command, for example:

    java -Xmx1024m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop singgalang.prop

    If this still doesn't work, increase the number, for example to "-Xmx8000m". This works for me :)

    Let's say this step creates an NER model file named "idner-model-singgalang.ser.gz".

  4. Create or use a testing dataset. Let's say the file name is "testing.txt".

  5. Evaluate the NER model using the Stanford NER library.
    For example:
    java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier idner-model-singgalang.ser.gz -testFile testing.txt
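
To look at the results of step 5 label by label, the classifier's standard output can be saved to a file (for example by appending `> predictions.tsv` to the command) and scored separately. The Python sketch below assumes that output has three whitespace-separated columns per token (token, gold label, predicted label), which is how Stanford NER normally prints `-testFile` results. The file name `predictions.tsv`, the function name, and the sample lines are hypothetical. The scores are token-level, so they can differ from the entity-level scores that Stanford NER prints itself.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def token_scores(lines: Iterable[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} computed per token.

    Assumption: each scored line is "token gold predicted"; any line that
    does not have exactly three columns (such as a blank line) is skipped.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for raw in lines:
        parts = raw.split()
        if len(parts) != 3:
            continue
        _, gold, guess = parts
        if gold == guess:
            tp[gold] += 1
        else:
            fp[guess] += 1  # predicted this label, but it was wrong
            fn[gold] += 1   # missed the correct label

    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Made-up sample output, not produced by a real model.
sample = [
    "Budi\tPerson\tPerson",
    "ke\tO\tO",
    "Padang\tPlace\tO",
]
for label, (p, r, f1) in token_scores(sample).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

To score a real run, pass an open file instead of `sample`, for example `open("predictions.tsv", encoding="utf-8")`.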

Licence

You can use this dataset for free. You don't need our permission to use it. Please cite our paper if your work uses our data in your publication. Please note that you are not allowed to create a copy of this dataset and share it publicly in your own repository without our permission.

Contact

ika.alfina [at] cs.ui.ac.id