USX conversion (#157)
* update architecture.md

* implement granular rules for attributes

* Implement USX export: id, c, v, paragraphs

* Make Tag node part of AST in numbered markers

* note down the question on USX

* update test cases after including tag node in AST

* Implement USX conversion for all para-type markers, notes, char, nested, and attributed markers

* map the markers to their default attributes in USX conversion

* Fix types and add links in noted down questions

* Figure out how to use USX's rnc grammar for validation

* update grammar and tests to include table cell tags in AST

* Implement USX conversion for tables

* expose milestone and znamespace Tag nodes in AST

* Implement USX conversion for milestones

* include category value in the AST and update tests

* implement USX conversion for sidebars, cat and fig

* implement USX conversion for optbreak(\b)

* bug fixes in usx conversion

* update notes on grammar

* update note caller rule to allow sequence of any characters
kavitharaju authored Jul 14, 2022
1 parent 5271791 commit 897de5c
Showing 17 changed files with 1,767 additions and 240 deletions.
219 changes: 216 additions & 3 deletions python-usfm-parser/API guide for python usfm_grammar.ipynb
@@ -23,7 +23,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"id": "b3d034a2",
"metadata": {},
"outputs": [],
@@ -200,11 +200,224 @@
{
"cell_type": "code",
"execution_count": null,
"id": "818e36d9",
"id": "38a6d3fb",
"metadata": {},
"outputs": [],
"source": [
"import xml.etree.ElementTree as ET\n",
"\n",
"usx_elem = my_parser.toUSX()\n",
"usx_str = ET.tostring(usx_elem, encoding=\"unicode\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6359c463",
"metadata": {},
"outputs": [],
"source": [
"usx_str"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "295dae47",
"metadata": {},
"outputs": [],
"source": [
"!pip install lxml"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "583efddc",
"metadata": {},
"outputs": [],
"source": [
"!pip install rnc2rng"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2bd40ba2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'<usx version=\"3.0\"><book code=\"GEN\" style=\"id\" /><chapter number=\"1\" style=\"c\" sid=\"GEN 1\" /><para style=\"p\" /><chapter eid=\"GEN 1\" /></usx>'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sys\n",
"sys.path.append('/home/kavitha/Documents/PEG JS and USFM/usfm-grammar-v3/usfm-grammar/python-usfm-parser/ENV/lib/python3.8/site-packages')\n",
"\n",
"\n",
"from usfm_grammar import USFMParser, Filter\n",
"import xml.etree.ElementTree as ET\n",
"\n",
"input_usfm_str = open(\"origin.usfm\",\"r\", encoding='utf8').read()\n",
"my_parser = USFMParser(input_usfm_str)\n",
"\n",
"usx_elem = my_parser.toUSX()\n",
"usx_str = ET.tostring(usx_elem, encoding=\"unicode\")\n",
"\n",
"usx_str"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "a680a0b6",
"metadata": {},
"outputs": [],
"source": [
"from lxml import etree\n",
"with open(\"../schemas/usx.rnc\") as f:\n",
" usxrnc_doc = f.read()\n",
" relaxng = etree.RelaxNG.from_rnc_string(usxrnc_doc)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "0fac8a56",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"valid\n"
]
}
],
"source": [
"\n",
"\n",
"from io import StringIO \n",
"usx_f = StringIO(usx_str)\n",
"doc = etree.parse(usx_f)\n",
"if relaxng.validate(doc):\n",
" print(\"valid\")\n",
"else:\n",
" relaxng.assertValid(doc)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4e0c784",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 31,
"id": "1ea6bb28",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"valid\n"
]
}
],
"source": [
"origin_usx_str = '''<usx version=\"3.0\">\n",
" <book code=\"GEN\" style=\"id\" />\n",
" <para style=\"mt1\">MARK</para>\n",
" <chapter number=\"1\" style=\"c\" sid=\"GEN 1\" />\n",
" <para style=\"p\">\n",
" <verse number=\"1\" style=\"v\" sid=\"GEN 1:1\" />\n",
" verse one \n",
" <verse eid=\"GEN 1:1\" />\n",
" <verse number=\"2\" style=\"v\" sid=\"GEN 1:2\" />\n",
" verse two\n",
" <verse eid=\"GEN 1:2\" />\n",
" </para>\n",
" <chapter eid=\"GEN 1\" />\n",
"</usx>'''\n",
"usx_f = StringIO(origin_usx_str)\n",
"doc = etree.parse(usx_f)\n",
"if relaxng.validate(doc):\n",
" print(\"valid\")\n",
"else:\n",
" relaxng.assertValid(doc)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dc42b6af",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 32,
"id": "8d12593b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"valid\n"
]
}
],
"source": [
"empty_usx_str = '''<usx version=\"3.0.0\">\n",
" <book code=\"GEN\" style=\"id\" />\n",
" <chapter number=\"1\" style=\"c\" sid=\"GEN 1\" />\n",
" <para style=\"p\">\n",
" <verse number=\"1\" style=\"v\" altnumber=\"2\" pubnumber=\"B\" sid=\"GEN 1:22\" />\n",
" verse one\n",
" </para>\n",
" <chapter eid=\"GEN 1\" />\n",
"\n",
"</usx>'''\n",
"usx_f = StringIO(empty_usx_str)\n",
"doc = etree.parse(usx_f)\n",
"if relaxng.validate(doc):\n",
" print(\"valid\")\n",
"else:\n",
" relaxng.assertValid(doc)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "818e36d9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'(File (book (id (bookcode) (description))) (mtBlock (mt (numberedLevelMax4) (text))) (chapter (c (chapterNumber)) (paragraph (p))))'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_parser.AST"
"my_parser.toAST()"
]
},
{
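The notebook cells above serialize the element returned by `toUSX()` with `ET.tostring(..., encoding="unicode")` and then validate it against the USX RelaxNG schema. As a rough illustration of the shape of that output, the sketch below builds the same minimal USX tree by hand using only the standard library. `build_minimal_usx` and `unclosed_sids` are hypothetical helpers written for this example; they are not part of `usfm_grammar`, and the pairing check is only a lightweight sanity test, not a substitute for schema validation.

```python
import xml.etree.ElementTree as ET


def build_minimal_usx(book_code: str, chapter: str) -> ET.Element:
    """Build a USX tree with one book, one chapter and one empty paragraph.

    Hypothetical helper for illustration; it is not part of usfm_grammar.
    """
    root = ET.Element("usx", {"version": "3.0"})
    ET.SubElement(root, "book", {"code": book_code, "style": "id"})
    sid = f"{book_code} {chapter}"
    # The chapter start milestone carries the number, the style and a start id.
    ET.SubElement(root, "chapter", {"number": chapter, "style": "c", "sid": sid})
    ET.SubElement(root, "para", {"style": "p"})
    # The chapter end milestone carries only the matching end id.
    ET.SubElement(root, "chapter", {"eid": sid})
    return root


def unclosed_sids(root: ET.Element) -> list:
    """Return every sid value that has no element with the same eid."""
    sids = [el.get("sid") for el in root.iter() if el.get("sid") is not None]
    eids = {el.get("eid") for el in root.iter() if el.get("eid") is not None}
    return [sid for sid in sids if sid not in eids]


usx_elem = build_minimal_usx("GEN", "1")
usx_str = ET.tostring(usx_elem, encoding="unicode")
print(usx_str)
print(unclosed_sids(usx_elem))
```

On Python 3.8 or later, where ElementTree keeps attributes in insertion order, the serialized string should match the `text/plain` output shown for cell `2bd40ba2`. The pairing check is included because a start id without a matching end id, as in the `GEN 1:22` verse of cell `8d12593b`, still printed `valid` in that notebook run.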
21 changes: 21 additions & 0 deletions python-usfm-parser/sample.usfm
@@ -15,6 +15,12 @@
\li1 ബെന്യാമീൻ
\p
\v 4 ദാൻ, നഫ്താലി, ഗാദ്, ആശേർ.
\v 12-83 They presented their offerings in the following order:
\tr \th1 Day \th2 Tribe \th3 Leader
\tr \tcr1 1st \tc2 Judah \tc3 Nahshon son of Amminadab
\tr \tcr1 2nd \tc2 Issachar \tc3 Nethanel son of Zuar
\tr \tcr1 3rd \tc2 Zebulun \tc3 Eliab son of Helon
\p
\v 5 യാക്കോബിന്റെ സന്താനപരമ്പരകൾ എല്ലാം കൂടി എഴുപതു പേർ ആയിരുന്നു; യോസേഫ് മുമ്പെ തന്നെ ഈജിപ്റ്റിൽ ആയിരുന്നു. \w gracious|grace\w* and then a few words later \w gracious|lemma="grace" x-myattr="metadata"\w*
\c 2
\s1 A Prayer of Habakkuk
@@ -34,3 +40,18 @@ word for “living,” which is rendered in this context as “human beings.”\
was the mother of all human beings.
\v 21 And the \nd Lord\nd* God made clothes out of animal skins for Adam and his wife,
and he clothed them.
\qt-s |sid="qt_123" who="Pilate"\*“Are you the king of the Jews?”\qt-e |eid="qt_123"\*
\esb \cat History\cat*
\ms Fish and Fishing
\p In Jesus' time, fishing took place mostly on lake Galilee, because Jewish people
could not use many of the harbors along the coast of the Mediterranean Sea, since these
harbors were often controlled by unfriendly neighbors. The most common fish in the Lake
of Galilee were carp and catfish. \wj The Law of Moses \wj* allowed people to eat any fish with
fins and scales, but since catfish lack scales (as do eels and sharks) they were not to
be eaten (\xt Lev 11.9-12\xt*). Fish were also probably brought from Tyre and Sidon,
where they were dried and salted.
...
\p Among early Christians, the fish was a favorite image for Jesus, because the Greek
word for fish ( \tl ichthus\tl* ) consists of the first letters of the Greek words that
tell who Jesus is: \fig Christian Fish Image\fig*
\esbe
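The table rows added to `sample.usfm` use `\tr` for a row and `\th1`, `\tc2`, `\tcr1` and similar markers for cells. The sketch below shows one way such a row could be split into marker and text pairs with a regular expression. It is a toy illustration only: `split_table_row` and `CELL_PATTERN` are hypothetical names, the pattern ignores nested character markers and column spans, and it is not how the usfm-grammar parser itself tokenizes tables.

```python
import re

# Matches a table cell marker such as \th1, \thr2, \tc3 or \tcr1,
# followed by its text up to the next backslash.
CELL_PATTERN = re.compile(r"\\(t[hc]r?\d+)\s+([^\\]*)")


def split_table_row(line: str) -> list:
    """Split one USFM table row into (marker, text) pairs.

    Hypothetical helper for illustration; character markers inside a
    cell and column-span markers are not handled.
    """
    line = line.strip()
    if not line.startswith("\\tr"):
        raise ValueError("not a table row: " + line)
    # The row marker itself never matches CELL_PATTERN, so only cells are returned.
    return [(marker, text.strip()) for marker, text in CELL_PATTERN.findall(line)]


header = split_table_row(r"\tr \th1 Day \th2 Tribe \th3 Leader")
first_row = split_table_row(r"\tr \tcr1 1st \tc2 Judah \tc3 Nahshon son of Amminadab")
print(header)
print(first_row)
```

Keeping the marker next to each cell's text preserves the heading versus data distinction and the right-alignment hint (`tcr`), which a converter would need when emitting table cells.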