USX conversion (#157)

* update architecture.md
* implement granular rules for attributes
* Implement USX export: id, c, v, paragraphs
* Make Tag node part of AST in numbered markers
* note down the question on USX
* update test cases after including tag node in AST
* Implement USX conversion for all para type markers, notes, char, nested and attributed markers
* map the markers to their default attributes in USX conversion
* Fix types and add links in noted down questions
* Figure out how to use USX's rnc grammar for validation
* update grammar and tests to include table cell tags in AST
* Implement USX conversion for tables
* expose milestone and znamespace Tag nodes in AST
* Implement USX conversion for milestones
* include category value in the AST and update tests
* implement USX conversion for sidebars, cat and fig
* implement USX conversion for optbreak(\b)
* bug fixes in usx conversion
* update notes on grammar
* update note caller rule to allow sequence of any characters
Bridgeconn · Jul 14, 2022 · 897de5c
1 parent 5271791
commit 897de5c
Showing 17 changed files with 1,767 additions and 240 deletions.
diff --git a/python-usfm-parser/API guide for python usfm_grammar.ipynb b/python-usfm-parser/API guide for python usfm_grammar.ipynb
@@ -23,7 +23,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 2,
    "id": "b3d034a2",
    "metadata": {},
    "outputs": [],
@@ -200,11 +200,224 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "818e36d9",
+   "id": "38a6d3fb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import xml.etree.ElementTree as ET\n",
+    "\n",
+    "usx_elem = my_parser.toUSX()\n",
+    "usx_str = ET.tostring(usx_elem, encoding=\"unicode\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6359c463",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "usx_str"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "295dae47",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install lxml"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "583efddc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install rnc2rng"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "2bd40ba2",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'<usx version=\"3.0\"><book code=\"GEN\" style=\"id\" /><chapter number=\"1\" style=\"c\" sid=\"GEN 1\" /><para style=\"p\" /><chapter eid=\"GEN 1\" /></usx>'"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import sys\n",
+    "sys.path.append('/home/kavitha/Documents/PEG JS and USFM/usfm-grammar-v3/usfm-grammar/python-usfm-parser/ENV/lib/python3.8/site-packages')\n",
+    "\n",
+    "\n",
+    "from usfm_grammar import USFMParser, Filter\n",
+    "import xml.etree.ElementTree as ET\n",
+    "\n",
+    "input_usfm_str = open(\"origin.usfm\",\"r\", encoding='utf8').read()\n",
+    "my_parser = USFMParser(input_usfm_str)\n",
+    "\n",
+    "usx_elem = my_parser.toUSX()\n",
+    "usx_str = ET.tostring(usx_elem, encoding=\"unicode\")\n",
+    "\n",
+    "usx_str"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "id": "a680a0b6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from lxml import etree\n",
+    "with open(\"../schemas/usx.rnc\") as f:\n",
+    "    usxrnc_doc  = f.read()\n",
+    "    relaxng = etree.RelaxNG.from_rnc_string(usxrnc_doc)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "id": "0fac8a56",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "valid\n"
+     ]
+    }
+   ],
+   "source": [
+    "\n",
+    "\n",
+    "from io import StringIO \n",
+    "usx_f = StringIO(usx_str)\n",
+    "doc = etree.parse(usx_f)\n",
+    "if relaxng.validate(doc):\n",
+    "    print(\"valid\")\n",
+    "else:\n",
+    "    relaxng.assertValid(doc)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d4e0c784",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "id": "1ea6bb28",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "valid\n"
+     ]
+    }
+   ],
+   "source": [
+    "origin_usx_str = '''<usx version=\"3.0\">\n",
+    "  <book code=\"GEN\" style=\"id\" />\n",
+    "  <para style=\"mt1\">MARK</para>\n",
+    "  <chapter number=\"1\" style=\"c\" sid=\"GEN 1\" />\n",
+    "  <para style=\"p\">\n",
+    "    <verse number=\"1\" style=\"v\" sid=\"GEN 1:1\" />\n",
+    "    verse one \n",
+    "    <verse eid=\"GEN 1:1\" />\n",
+    "    <verse number=\"2\" style=\"v\" sid=\"GEN 1:2\" />\n",
+    "    verse two\n",
+    "    <verse eid=\"GEN 1:2\" />\n",
+    "  </para>\n",
+    "  <chapter eid=\"GEN 1\" />\n",
+    "</usx>'''\n",
+    "usx_f = StringIO(origin_usx_str)\n",
+    "doc = etree.parse(usx_f)\n",
+    "if relaxng.validate(doc):\n",
+    "    print(\"valid\")\n",
+    "else:\n",
+    "    relaxng.assertValid(doc)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dc42b6af",
    "metadata": {},
    "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "id": "8d12593b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "valid\n"
+     ]
+    }
+   ],
+   "source": [
+    "empty_usx_str = '''<usx version=\"3.0.0\">\n",
+    "  <book code=\"GEN\" style=\"id\" />\n",
+    "  <chapter number=\"1\" style=\"c\" sid=\"GEN 1\" />\n",
+    "  <para style=\"p\">\n",
+    "    <verse number=\"1\" style=\"v\" altnumber=\"2\" pubnumber=\"B\" sid=\"GEN 1:22\" />\n",
+    "    verse one\n",
+    "  </para>\n",
+    "  <chapter eid=\"GEN 1\" />\n",
+    "\n",
+    "</usx>'''\n",
+    "usx_f = StringIO(empty_usx_str)\n",
+    "doc = etree.parse(usx_f)\n",
+    "if relaxng.validate(doc):\n",
+    "    print(\"valid\")\n",
+    "else:\n",
+    "    relaxng.assertValid(doc)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "818e36d9",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'(File (book (id (bookcode) (description))) (mtBlock (mt (numberedLevelMax4) (text))) (chapter (c (chapterNumber)) (paragraph (p))))'"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
-    "my_parser.AST"
+    "my_parser.toAST()"
    ]
   },
   {

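Note on the notebook output above: the `toUSX()` cell returns an `xml.etree.ElementTree` element that is serialized with `ET.tostring(..., encoding="unicode")`. As a rough illustration of the shape of that output, the sketch below hand-builds the same skeleton (book, chapter start, empty paragraph, chapter end) using only the standard library. `build_minimal_usx` is a hypothetical helper written for this note; it is not part of `usfm_grammar` and says nothing about how the parser builds the tree internally.

```python
import xml.etree.ElementTree as ET


def build_minimal_usx(book_code: str, chapter: str) -> ET.Element:
    """Hypothetical helper (not part of usfm_grammar): hand-build a USX
    skeleton shaped like the serialized toUSX() output in the notebook."""
    root = ET.Element("usx", {"version": "3.0"})
    # \id marker becomes a <book> element carrying the book code.
    ET.SubElement(root, "book", {"code": book_code, "style": "id"})
    # The chapter start milestone carries a sid; the end milestone
    # repeats the same value as eid.
    sid = f"{book_code} {chapter}"
    ET.SubElement(root, "chapter", {"number": chapter, "style": "c", "sid": sid})
    # An empty \p paragraph.
    ET.SubElement(root, "para", {"style": "p"})
    ET.SubElement(root, "chapter", {"eid": sid})
    return root


usx_elem = build_minimal_usx("GEN", "1")
usx_str = ET.tostring(usx_elem, encoding="unicode")
print(usx_str)
```

The resulting string can be passed to the same `etree.parse(StringIO(usx_str))` and `relaxng.validate(doc)` steps used in the notebook cells, assuming `lxml` and `rnc2rng` are installed as in the `!pip install` cells.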
diff --git a/python-usfm-parser/sample.usfm b/python-usfm-parser/sample.usfm
@@ -15,6 +15,12 @@
 \li1 ബെന്യാമീൻ
 \p
 \v 4 ദാൻ, നഫ്താലി, ഗാദ്, ആശേർ.
+\v 12-83 They presented their offerings in the following order:
+\tr \th1 Day \th2 Tribe \th3 Leader
+\tr \tcr1 1st \tc2 Judah \tc3 Nahshon son of Amminadab
+\tr \tcr1 2nd \tc2 Issachar \tc3 Nethanel son of Zuar
+\tr \tcr1 3rd \tc2 Zebulun \tc3 Eliab son of Helon
+\p
 \v 5 യാക്കോബിന്റെ സന്താനപരമ്പരകൾ എല്ലാം കൂടി എഴുപതു പേർ ആയിരുന്നു; യോസേഫ് മുമ്പെ തന്നെ ഈജിപ്റ്റിൽ ആയിരുന്നു. \w gracious|grace\w* and then a few words later \w gracious|lemma="grace" x-myattr="metadata"\w*
 \c 2
 \s1 A Prayer of Habakkuk
@@ -34,3 +40,18 @@ word for “living,” which is rendered in this context as “human beings.”\
 was the mother of all human beings.
 \v 21 And the \nd Lord\nd* God made clothes out of animal skins for Adam and his wife,
 and he clothed them.
+\qt-s |sid="qt_123" who="Pilate"\*“Are you the king of the Jews?”\qt-e |eid="qt_123"\*
+\esb \cat History\cat*
+\ms Fish and Fishing
+\p In Jesus' time, fishing took place mostly on lake Galilee, because Jewish people
+could not use many of the harbors along the coast of the Mediterranean Sea, since these
+harbors were often controlled by unfriendly neighbors. The most common fish in the Lake
+of Galilee were carp and catfish. \wj The Law of Moses \wj* allowed people to eat any fish with
+fins and scales, but since catfish lack scales (as do eels and sharks) they were not to
+be eaten (\xt Lev 11.9-12\xt*). Fish were also probably brought from Tyre and Sidon,
+where they were dried and salted.
+...
+\p Among early Christians, the fish was a favorite image for Jesus, because the Greek
+word for fish ( \tl ichthus\tl* ) consists of the first letters of the Greek words that
+tell who Jesus is: \fig Christian Fish Image\fig*
+\esbe