HTTP-SPIDER

The purpose of this library is to gather information about all files found on a web server. The spidering library provides a complete implementation of an HTTP crawler. The information it gathers is useful to a variety of NSE http scripts and encourages better-structured scripts, where the script logic is separated from the crawler.

Functionality

Example code:

  -- Set options and start the crawler
  options = {host = host, port = port}
  crawl(OPT_PATH, options)

  -- Iterate through the page list to find PHP files and send the attack vector
  pages = get_page_list()
  for k, pg in pairs(pages) do
    stdnse.print_debug("Page found: "..pg["uri"])
    if pg["ext"] == ".php" and pg["status"] == 200 then
      if launch_probe(options["host"], options["port"], pg["uri"]) then
        output_lns[#output_lns + 1] = "PHPSELF Cross Site Scripting POC: http://"..stdnse.get_hostname(host)..pg["uri"]..PHP_SELF_PROBE
      end
    end
  end

get_page_list()

Returns a table with the list of pages found and their information:

1. uri - Absolute URI of the page
2. status - HTTP status code
3. chksm - Checksum
4. ext - File extension of the page
5. type - Content-Type
6. content - Page content
7. forms - Boolean indicating the presence of forms

Proposed function list

1. crawl
2. get_href_links
3. check_redirects
4. get_page_list
5. get_img_files
6. get_js_files
7. get_css_files
8. is_absolute_url
9. is_link_malformed
10. download_page
11. is_link_crawlable
12. is_link_local
13. is_link_anchored

Page Content Caching

An important element of this spidering library is content caching. Page content caching refers to storing the content of the pages the crawler visits, and it is important to analyze how we are going to proceed with this:

No cache
The library would only build a table of the crawled pages without storing their content. This causes an additional request for every page whose content a script needs to read.

Caching
If we decide to cache the content of pages, we need to decide whether to store it in memory or in files. If we cache the content when we discover the link, we won't need to go back and perform additional requests.

File caching
We will need to implement a clean-up routine to delete all generated files at the end of script execution.

Memory and http-max-cache-size
If we decide to store the content in memory, we will need to remind users to increase the value of http-max-cache-size.

Threading

The spidering library will benefit from using threads, but we need to be careful not to waste any extra resources.

Template

Implement something like the brute library. Maybe
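
To make the template idea concrete, here is a minimal sketch of how a brute-style engine could be exposed to scripts. Everything in it is an assumption for illustration only: the spider.Engine class, its options, and the per-page callback are not part of the proposed function list above; the only established pieces used are the standard NSE action entry point, stdnse.format_output, and the page fields returned by the proposed get_page_list.

  -- Hypothetical usage, modeled on brute.Engine (all spider.* names are illustrative)
  action = function(host, port)
    local output = {}

    -- The engine would wrap crawl()/get_page_list() and handle threading internally
    local engine = spider.Engine:new(host, port, {maxdepth = 3, maxpagecount = 100})

    -- Per-page callback, invoked once for every page the crawler downloads
    engine.callback = function(page)
      if page["forms"] then
        output[#output + 1] = "Form found at "..page["uri"]
      end
    end

    engine:start()
    return stdnse.format_output(true, output)
  end

Structured this way, the library would own the thread pool and the caching policy discussed above, while scripts only describe what to do with each crawled page.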