{"id":309,"date":"2020-10-01T04:57:00","date_gmt":"2020-10-01T04:57:00","guid":{"rendered":"https:\/\/molecularsciences.org\/content\/?p=309"},"modified":"2021-01-12T00:28:32","modified_gmt":"2021-01-12T05:28:32","slug":"php-code-to-scrape-tags-from-an-html-page","status":"publish","type":"post","link":"https:\/\/molecularsciences.org\/content\/php-code-to-scrape-tags-from-an-html-page\/","title":{"rendered":"PHP code to scrape tags from an HTML page"},"content":{"rendered":"\n<p>The following code uses PHP's DOMDocument class to extract anchor links and image sources from an HTML page.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ get the contents of your HTML page as a string\n$sText = file_get_contents('mypage.html');\n\n\/\/ create a DOM document\n$dom = new DOMDocument;\n\n\/\/ load the HTML into the DOM; @ suppresses parsing warnings\n@$dom->loadHTML($sText);\n\n\/\/ collect all anchor and image tags\n$aLinkTags = $dom->getElementsByTagName('a');\n$aImgTags = $dom->getElementsByTagName('img');\n\n\/\/ put the links in an array keyed by link text\n\/\/ (links that share the same text overwrite each other)\n$aLinks = array();\nforeach ($aLinkTags as $sLinkTag) {\n    $aLinks&#91;$sLinkTag->nodeValue] = $sLinkTag->getAttribute('href');\n}\nprint_r($aLinks);\n\n\/\/ put the image sources in an array\n\/\/ (img tags have no text content, so append instead of keying)\n$aImg = array();\nforeach ($aImgTags as $sImgTag) {\n    $aImg&#91;] = $sImgTag->getAttribute('src');\n}\nprint_r($aImg);<\/code><\/pre>\n\n\n\n<p>This code scrapes anchor links and image sources. It can be extended to other HTML tags by passing a different tag name to getElementsByTagName() and reading the relevant attribute.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The following code uses DOM to extract links. This code scrapes image and anchor links. 
This code can be extended to include other HTML tags.<\/p>\n","protected":false},"author":1,"featured_media":543,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[23],"tags":[119,24,108],"class_list":["post-309","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-php","tag-dom","tag-php","tag-xml"],"_links":{"self":[{"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/posts\/309","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/comments?post=309"}],"version-history":[{"count":2,"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/posts\/309\/revisions"}],"predecessor-version":[{"id":544,"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/posts\/309\/revisions\/544"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/media\/543"}],"wp:attachment":[{"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/media?parent=309"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/categories?post=309"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/molecularsciences.org\/content\/wp-json\/wp\/v2\/tags?post=309"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}