Raw dataset for Old Persian cuneiform

Dear contributors, please be aware that cuneiform script was used to write several different languages. The best known are Elamite, Babylonian, and Old Persian; this project works on Old Persian only. Below you can see the differences:

(Photo taken at the National Museum of Iran: the gold plate of King Darius)

Data structure:

/imagedata/
    /source/
        /king/
            source_king_001.jpg

    # example:
    /behistun/
        /darius_1/
            behistun_darius_1_001.jpg

/textdata/
    /eng_transcription_to_english/
        /metadata/
        eng_transcription_to_english_001.json

    /eng_transliteration_to_english/
        /metadata/
        eng_transliteration_to_english_001.json

    /single/
        /metadata/
        /eng_transliteration/
            eng_transliteration_001.json

              
    # "single" refers to text data that has no accompanying translation
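The image naming convention in the tree above can be sketched in Python. This is only an illustration: the function name `image_path` is hypothetical, and the three-digit zero padding and the index starting at 1 are assumptions inferred from the `001` in the example file name.

```python
from pathlib import PurePosixPath


def image_path(source: str, king: str, index: int, ext: str = "jpg") -> PurePosixPath:
    """Build imagedata/<source>/<king>/<source>_<king>_<NNN>.<ext>."""
    if index < 1:
        # Assumption: numbering starts at 001, as in the example above.
        raise ValueError("index is assumed to start at 1")
    # Assumption: the running number is zero-padded to three digits.
    filename = f"{source}_{king}_{index:03d}.{ext}"
    return PurePosixPath("imagedata") / source / king / filename


print(image_path("behistun", "darius_1", 1))
# prints: imagedata/behistun/darius_1/behistun_darius_1_001.jpg
```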

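Old Persian text is usually prepared for translation in one of two forms: transliteration (a sign-by-sign rendering of the cuneiform in Latin characters) and transcription (a normalized rendering of the words as they were read). Below you can see an example of the difference between them: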

Metadata

Each directory includes a "source.metadata.csv" file that describes its data.

Explanation of the metadata columns:

imagedata:

source: The source the data was taken from.

abbreviation: The name of the inscription.

location: The location where the inscription was originally discovered.

translation: 1 if a translation of the inscription is available, 0 if not.

collection: The place where the inscription is currently kept.

artifact_id: The artifact_id from the CDLI reference.

asset_number: The asset_number from the British Museum collection.

museum_number: The museum_number from the British Museum collection.

textdata:

abbreviation: The name of the inscription.

reference: The reference the data was taken from.

location: The location where the inscription was originally discovered.

image: 1 if an image of the inscription is available, 0 if not.

artifact_id: The artifact_id from the CDLI reference.
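As a rough illustration of how the imagedata columns might be used, the sketch below parses metadata text and keeps the rows that have a translation. It assumes the file is comma-separated and that its header names match the column names listed above. The sample rows are made-up placeholders, not real entries from the dataset, and the function names are hypothetical.

```python
import csv
import io

# Made-up rows using the imagedata columns described above; all values are placeholders.
SAMPLE_CSV = """source,abbreviation,location,translation,collection,artifact_id,asset_number,museum_number
example_source,ABC,example_site,1,example_museum,,,
example_source,XYZ,example_site,0,example_museum,,,
"""


def load_metadata(text: str) -> list[dict[str, str]]:
    """Parse metadata CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def with_translation(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep only the rows whose 'translation' flag is 1."""
    return [row for row in rows if row["translation"].strip() == "1"]


rows = load_metadata(SAMPLE_CSV)
translated = with_translation(rows)
print([row["abbreviation"] for row in translated])
# prints: ['ABC']
```

To read an actual file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to `csv.DictReader` directly.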

References

Livius.org
British Museum collection
Wikipedia
Cuneiform Digital Library Initiative (CDLI)
Book: Ralph Norman Sharp, The Inscriptions in Old Persian Cuneiform of the Achaemenian Emperors
Personal photography from the National Museum of Iran and Takht-e-Jamshid (Persepolis)

Data pipeline

In the first stage, an OCR model will convert Old Persian cuneiform into English transcription text. In the second stage, that transcription text will be the input to an NLP model or large language model (LLM), which will convert it into modern languages. The second-stage model therefore acts as a machine translation model.
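The two stages described above can be sketched as plain function composition. This is a minimal sketch, not the project's implementation: `run_pipeline`, `dummy_ocr`, and `dummy_translate` are hypothetical names, and the placeholder stages only stand in for the real OCR and translation models so that the example runs.

```python
from typing import Callable

# Stage 1 maps image bytes to transcription text;
# stage 2 maps transcription text to a modern-language translation.
OcrModel = Callable[[bytes], str]
Translator = Callable[[str], str]


def run_pipeline(image: bytes, ocr: OcrModel, translate: Translator) -> dict[str, str]:
    """Run both stages and return the intermediate and final text."""
    transcription = ocr(image)
    translation = translate(transcription)
    return {"transcription": transcription, "translation": translation}


# Placeholder stages (assumptions); real models would replace them.
def dummy_ocr(image: bytes) -> str:
    return f"<transcription of {len(image)} bytes>"


def dummy_translate(text: str) -> str:
    return f"<translation of {text}>"


result = run_pipeline(b"\x00\x01", dummy_ocr, dummy_translate)
print(result["translation"])
# prints: <translation of <transcription of 2 bytes>>
```

Returning the intermediate transcription alongside the translation makes it easier to check each stage separately.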

Glossary

Behistun:بیستون

Susa:شوش

Persepolis:پرسپولیس(تخت جمشید)

Elamite:ایلامی

Babylonian:بابِلی

Cyrus:کوروش

Xerxes:خشایار

Artaxerxes:اردشیر

𐎠𐎢𐎼𐎶𐏀𐎭𐎠:اهورامزدا

LICENSE

This repository is released under the CC BY-NC license; commercial use is prohibited.
