HTTP-SPIDER

The purpose of this library is to gather information about all files found on a web server. The spidering library provides a complete implementation of an HTTP crawler. The information it gathers is useful to a variety of NSE http scripts and encourages better-structured scripts, where the script logic is separated from the crawler.

Functionality

Example code:

  -- Set options and start the crawler
  options = {host = host, port = port}
  crawl(OPT_PATH, options)

  -- Iterate through the page list to find PHP files and send the attack vector
  pages = get_page_list()
  for k, pg in pairs(pages) do
    stdnse.print_debug("Page found: "..pg["uri"])
    if pg["ext"] == ".php" and pg["status"] == 200 then
      if launch_probe(options["host"], options["port"], pg["uri"]) then
        output_lns[#output_lns + 1] = "PHPSELF Cross Site Scripting POC: http://"..stdnse.get_hostname(host)..pg["uri"]..PHP_SELF_PROBE
      end
    end
  end

get_page_list()

Returns a table with the list of pages found and their information:

1. uri - Absolute URI of the page
2. status - HTTP status code
3. chksm - Checksum
4. ext - File extension of the page
5. type - Content-Type
6. content - Page content
7. forms - Boolean indicating the presence of forms

Proposed function list

1. crawl
2. get_href_links
3. check_redirects
4. get_page_list
5. get_img_files
6. get_js_files
7. get_css_files
8. is_absolute_url
9. is_link_malformed
10. download_page
11. is_link_crawlable
12. is_link_local
13. is_link_anchored

Page Content Caching

An important element of this spidering library is content caching. Page content caching refers to storing the content of the pages the crawler visits, and it is important to analyze how we are going to proceed with this:

No cache
The library would only build a table of the crawled pages without storing their content. This causes an additional request for every page whose content a script needs to read.

Caching
If we decide to cache the content of pages, we need to decide whether to store it in memory or in files. If we cache the content when we discover the link, we won't need to go back and perform additional requests.

File caching
We will need to implement a clean-up routine to delete all generated files at the end of script execution.

Memory and http-max-cache-size
If we decide to store the content in memory, we will need to remind users to increase the value of http-max-cache-size.

Threading

The spidering library will benefit from using threads, but we need to be careful not to waste any extra resources.

Template

Implement something like the brute library. Maybe
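
To make the template idea concrete, here is a minimal sketch of how a brute-style engine could be exposed to scripts. Everything in it is an assumption for illustration only: the spider.Engine class, its options, and the per-page callback are not part of the proposed function list above; the only established pieces used are the standard NSE action entry point, stdnse.format_output, and the page fields returned by the proposed get_page_list.

  -- Hypothetical usage, modeled on brute.Engine (all spider.* names are illustrative)
  action = function(host, port)
    local output = {}

    -- The engine would wrap crawl()/get_page_list() and handle threading internally
    local engine = spider.Engine:new(host, port, {maxdepth = 3, maxpagecount = 100})

    -- Per-page callback, invoked once for every page the crawler downloads
    engine.callback = function(page)
      if page["forms"] then
        output[#output + 1] = "Form found at "..page["uri"]
      end
    end

    engine:start()
    return stdnse.format_output(true, output)
  end

Structured this way, the library would own the thread pool and the caching policy discussed above, while scripts only describe what to do with each crawled page.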