Monday, 13 August 2018

Convert A sitemap.xml File To An HTML Sitemap With PHP

I have already talked about converting a sitemap.xml file into a urllist.txt file, but what if you want to create an HTML sitemap? If you have a sitemap.xml file, you can use it to spider your site, scrape the contents of each page and populate the HTML file with that information.
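The following code does this. For every URL in the sitemap it fetches the page and looks for the title tag, the description meta tag and the first h2 tag. These items are then used to construct a segment of HTML for that page. URLs that point to PDF or text files are not fetched; they are just listed as plain links.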
<?php
$header = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>HTML Sitemap</title>
</head>
<body>';

// fetching every page takes a while, so raise the time limit
set_time_limit(400);

$currentElement = '';
$currentLoc = '';

$map = "<h1>HTML Sitemap</h1>"."\n";

// fetch one url from the sitemap and append its entry to $map
function parsePage($data)
{
    global $map;
    /*
    if you want to trap a certain file extension then use the syntax below...
    stripos($data, ".php")>0
    stripos($data, ".htm")>0
    stripos($data, ".asp")>0
    */
    if ( stripos($data, ".pdf") > 0 ) {
        // if the url is a pdf document.
        $map .= '<p><a href="'.$data.'">PDF document.</a></p>'."\n";
        $map .= '<p>A pdf document.</p>'."\n";
    } elseif ( stripos($data, ".txt") > 0 ) {
        // if the url is a text document
        $map .= '<p><a href="'.$data.'">Text document.</a></p>'."\n";
        $map .= '<p>A text document.</p>'."\n";
    } else {
        // try to open it anyway...
        // make sure that you can read the file (remote urls need allow_url_fopen)
        if ( $urlh = @fopen($data, 'rb') ) {
            $contents = '';
            // stream_get_contents() only exists from php 5 onwards
            if ( function_exists('stream_get_contents') ) {
                $contents = stream_get_contents($urlh);
            } else {
                while ( !feof($urlh) ) {
                    $contents .= fread($urlh, 8192);
                }
            }

            // find the title (empty string if there is no match)
            preg_match('/(?<=<title>)(.*?)(?=<\/title>)/is', $contents, $title);
            $title = isset($title[0]) ? trim($title[0]) : '';

            // find the first h2 tag
            $heading = array();
            preg_match('/(?<=<h2>)(.*?)(?=<\/h2>)/is', $contents, $heading);
            $heading = isset($heading[0]) ? trim(strip_tags($heading[0])) : '';

            $href = str_replace('&', '&amp;', $data);
            if ( strlen($title) > 0 && strlen($heading) > 0 ) {
                // print the title and h2 tag in combo
                $map .= '<p class="link"><a href="'.$href.'" title="'.$heading.'">'.$title.' - '.$heading.'</a></p>'."\n";
            } elseif ( strlen($title) > 0 ) {
                $map .= '<p class="link"><a href="'.$href.'" title="'.$title.'">'.$title.'</a></p>'."\n";
            } elseif ( strlen($heading) > 0 ) {
                $map .= '<p class="link"><a href="'.$href.'" title="'.$heading.'">'.$heading.'</a></p>'."\n";
            }

            // find description
            preg_match('/(?<=<meta\sname="description" content=")(.*?)(?="\s*\/?>)/is', $contents, $description);
            $description = isset($description[0]) ? trim($description[0]) : '';

            // print description
            if ( strlen($description) > 0 ) {
                $map .= '<p class="desc">'.$description.'</p>'."\n";
            }
            // close the file
            fclose($urlh);
        }
    }
}

/////////// XML PARSE FUNCTIONS HERE /////////////
// the start element function
function startElement($xmlParser, $name, $attribs)
{
    global $currentElement;
    $currentElement = $name;
}

// the end element function
function endElement($parser, $name)
{
    global $currentElement, $currentLoc;
    if ( $currentElement == 'loc' ) {
        // the url may be padded with whitespace inside the loc element
        parsePage(trim($currentLoc));
        $currentLoc = '';
    }
    $currentElement = '';
}

// the character data function
function characterData($parser, $data)
{
    global $currentElement, $currentLoc;
    // if the current element is loc then it will be a url
    if ( $currentElement == 'loc' ) {
        $currentLoc .= $data;
    }
}

// create parse object
$xml_parser = xml_parser_create();
// turn off case folding!
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, false);
// set start and end element functions
xml_set_element_handler($xml_parser, "startElement", "endElement");
// set character data function
xml_set_character_data_handler($xml_parser, "characterData");

// open xml file
if ( !($fp = fopen('sitemap.xml', "r")) ) {
    die("could not open XML input");
}

// read the file - print error if something went wrong.
while ( $data = fread($fp, 4096) ) {
    if ( !xml_parse($xml_parser, $data, feof($fp)) ) {
        die(sprintf("XML error: %s at line %d", xml_error_string(xml_get_error_code($xml_parser)), xml_get_current_line_number($xml_parser)));
    }
}

// close file
fclose($fp);

$footer = '</body>
</html>';

// write output to a file
$fp = fopen('sitemap.html', "w+");
fwrite($fp, $header.$map.$footer);
fclose($fp);

// print output
echo $header.$map.$footer;
The script prints the sitemap and also saves it to sitemap.html for later use. Saving it is essential because the script has to request every page listed in sitemap.xml, so it can take a long time to run.
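Because a full run is slow, one option is to serve the saved file and only rebuild it when sitemap.xml has changed. The following is a minimal sketch of that check, assuming the same file names as the script; the function name sitemapIsStale() is hypothetical and not part of the original script.

```php
<?php
// Hypothetical helper (not part of the original script): returns true when
// the saved HTML sitemap is missing or older than the XML sitemap.
function sitemapIsStale($xmlFile, $htmlFile)
{
    if ( !file_exists($htmlFile) ) {
        // nothing saved yet, so the sitemap has to be generated
        return true;
    }
    if ( !file_exists($xmlFile) ) {
        // nothing to rebuild from, so keep using the saved copy
        return false;
    }
    // compare the last modification times of the two files
    return filemtime($xmlFile) > filemtime($htmlFile);
}

// possible usage:
// if ( sitemapIsStale('sitemap.xml', 'sitemap.html') ) {
//     ... run the slow generation code ...
// } else {
//     readfile('sitemap.html');
// }
```

This only compares file modification times, so it will not notice pages whose content changed while sitemap.xml stayed the same.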
The sitemap script is fairly complicated and has gone through several versions since I first created it, so if you find any bugs or improvements then let me know and I will incorporate them.
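As one example of a possible improvement, the regular expressions only match tags written in one exact form, such as an h2 tag with no attributes. Below is a sketch of the same lookups done with PHP's DOM extension instead, assuming that extension is installed. The function name extractPageInfo() and the keys of the array it returns are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical alternative to the regex lookups in parsePage().
// Returns the title, first h2 and meta description found in an HTML string.
function extractPageInfo($html)
{
    $info = array('title' => '', 'heading' => '', 'description' => '');
    if ( trim($html) === '' ) {
        // loadHTML() complains about empty input, so stop early
        return $info;
    }

    $doc = new DOMDocument();
    // real pages are rarely valid markup, so collect parser warnings silently
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // first title element
    $titles = $doc->getElementsByTagName('title');
    if ( $titles->length > 0 ) {
        $info['title'] = trim($titles->item(0)->textContent);
    }

    // first h2 element, with any nested tags flattened to text
    $headings = $doc->getElementsByTagName('h2');
    if ( $headings->length > 0 ) {
        $info['heading'] = trim($headings->item(0)->textContent);
    }

    // first meta tag whose name attribute is "description" in any case
    foreach ( $doc->getElementsByTagName('meta') as $meta ) {
        if ( strtolower($meta->getAttribute('name')) == 'description' ) {
            $info['description'] = trim($meta->getAttribute('content'));
            break;
        }
    }
    return $info;
}
```

The values come back as decoded text, so they would need to go through htmlspecialchars() before being written into the sitemap HTML.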
