Skip to content

celerometis/html2json

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 

Repository files navigation

html2json

html2json python this is recursive function to extract html node and child , and add to dictionary you can get html text to data and return dictionary

example:

data2 = '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]><html xmlns="http://www.w3.org/1999/xhtml" lang="en" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]><html xmlns="http://www.w3.org/1999/xhtml" lang="en" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html xmlns="http://www.w3.org/1999/xhtml" lang="en" class="no-js"> <!--<![endif]--><!--[if lt IE 9]><!--><script type="text/javascript" src="/static/nhsuk_shared/js/jquery-1.11.3.min.b091a47f6b91.js" charset="utf-8"></script><!--<![endif]--><head>    <!--[if IE 9]>    <meta http-equiv="X-UA-Compatible" content="IE=9">    <![endif]-->    <script>        (function (d) {            d.className = d.className.replace(/\bno-js\b/, "js-enabled")        })(document.documentElement)    </script>    <meta charset="utf-8"/>    <meta http-equiv="X-UA-Compatible" content="IE=edge"/>    <meta name="DC.Author" content="NHS website"/>    <meta name="DC.Publisher" content="Department of Health"/>    <meta name="DC.Format" scheme="IMT" content="text/html"/>    <meta name="DC.Subject" scheme="eGMS.IPSV" content="National Health Service (NHS), "/>'

and output of

print(dic)

is

{
  "html/0": {
    "xmlns": "http://www.w3.org/1999/xhtml",
    "lang": "en",
    "class": [
      "no-js",
      "lt-ie9",
      "lt-ie8",
      "lt-ie7"
    ]
  },
  "text/1": " ",
  "text/2": "endif]--><!--[if IE 7",
  "html/3": {
    "xmlns": "http://www.w3.org/1999/xhtml",
    "lang": "en",
    "class": [
      "no-js",
      "lt-ie9",
      "lt-ie8"
    ]
  },
  "text/4": " ",
  "text/5": "endif]--><!--[if IE 8",
  "html/6": {
    "xmlns": "http://www.w3.org/1999/xhtml",
    "lang": "en",
    "class": [
      "no-js",
      "lt-ie9"
    ]
  },
  "text/7": " ",
  "text/8": "endif]--><!--[if gt IE 8",
  "text/9": "><html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\" class=\"no-js\"> <!--<![endif]",
  "text/10": "[if lt IE 9]><!",
  "script/11": {
    "type": "text/javascript",
    "src": "/static/nhsuk_shared/js/jquery-1.11.3.min.b091a47f6b91.js",
    "charset": "utf-8"
  },
  "text/12": "<![endif]",
  "head/13": {},
  "text/14": " ",
  "text/15": "[if IE 9]>    <meta http-equiv=\"X-UA-Compatible\" content=\"IE=9\">    <![endif]",
  "text/16": " ",
  "script/17": {},
  "text/18": "        (function (d) {            d.className = d.className.replace(/\bno-js\b/, \"js-enabled\")        })(document.documentElement)    ",
  "text/19": " ",
  "meta/20": {
    "charset": "utf-8"
  },
  "text/21": " ",
  "meta/22": {
    "http-equiv": "X-UA-Compatible",
    "content": "IE=edge"
  },
  "text/23": " ",
  "meta/24": {
    "name": "DC.Author",
    "content": "NHS website"
  },
  "text/25": " ",
  "meta/26": {
    "name": "DC.Publisher",
    "content": "Department of Health"
  },
  "text/27": " ",
  "meta/28": {
    "name": "DC.Format",
    "scheme": "IMT",
    "content": "text/html"
  },
  "text/29": " ",
  "meta/30": {
    "name": "DC.Subject",
    "scheme": "eGMS.IPSV",
    "content": "National Health Service (NHS), "
  }
}

About

html2json python

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages