Spider: inverting the $banned_ext variable

djmick007 · March 22, 2010, 7:05

Hi, I'm currently working on a site that indexes RSS and XML feeds, and I could use a bit of help.

Basically, the code below collects URLs. The problem is that an option at the bottom of the code blocks the file types that should not be indexed.

$banned_ext = array
(
    ".xml",
    ".rss"
);

I would like to do the opposite: instead of blocking these extensions, I want the code to look for these extensions, and only these. Can you help me?


<?php
set_time_limit( 0 );
ini_set( "memory_limit", "128M" ); // a bare 128 would mean 128 bytes

//strip HTML and Javascript
function html2txt($document){
$search = array('@<script[^>]*?>.*?</script>@si',  // Strip out javascript
               '@<style[^>]*?>.*?</style>@siU',    // Strip style tags properly
               '@<[\/\!]*?[^<>]*?>@si',            // Strip out HTML tags
               '@<![\s\S]*?--[ \t\n\r]*>@'         // Strip multi-line comments including CDATA
);
$text = preg_replace($search, '', $document);
return $text;
}



// Simple recursive spider: records the URLs it crawls and skips banned extensions.



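class spider_man
{
    var $start;
    var $limit;
    var $cache;
    var $crawled;
    var $banned_ext;

    // __construct replaces the old class-named constructor, which PHP 8 no longer calls.
    function __construct( $url, $banned_ext, $limit )
    {
        $this->start = $url ;
        $this->banned_ext = $banned_ext ;
        $this->limit = $limit ;
        $this->crawled = array();

        // Only start crawling if the start URL can be opened.
        $handle = @fopen( $url, "r" );
        if( !$handle ) return;
        fclose( $handle );
        $this->_spider( $url );
    }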

    function _spider( $url )
    {
        // Fetch the page; give up on this branch if it cannot be read.
        $this->cache = @file_get_contents( urldecode( $url ) );
        if( !$this->cache ) return false;
        $this->crawled[] = urldecode( $url ) ;
        // Collect every absolute http(s) link found in an href attribute.
        preg_match_all( "#href=\"(https?://[&=a-zA-Z0-9-_./]+)\"#si", $this->cache, $links );
        foreach ( $links[1] as $hyperlink )
        {
            // Stop once the link budget is used up; parent calls see the same
            // counter and stop too, instead of running on with a negative limit.
            if( $this->limit <= 0 ) return;
            $this->limit--;

            if( $this->is_valid_ext( trim( $hyperlink ) ) and !$this->is_crawled( $hyperlink ) ) :
                $this->crawled[] = $hyperlink;
                echo "Crawling $hyperlink<br />\n";
                unset( $this->cache );
                $this->_spider( $hyperlink );
            endif;
        }
    }
    function is_valid_ext( $url )
    {   
        foreach( $this->banned_ext as $ext )
        {
            if( $ext == substr( $url, strlen($url) - strlen( $ext ) ) ) return false;
        }
        return true;
    }
    function is_crawled( $url )
    {
        return in_array( $url, $this->crawled );
    }
}

$banned_ext = array
(
    ".xml",
    ".rss"
);
$spider = new spider_man( 'http://www.google.com', $banned_ext, 10000 );
print_r( $spider->crawled );


?>

Edited on 23/03/2010 at 05:56

Sans-Nom · March 22, 2010, 9:22

Hello,

Could you give the topic a more explicit title?

Thanks in advance

Akkai · March 23, 2010, 9:27

To do the opposite, just look at your class. You have this method:


function is_valid_ext( $url )
{
    foreach( $this->banned_ext as $ext )
    {
        if( $ext == substr( $url, strlen($url) - strlen( $ext ) ) ) return false;
    }
    return true;
}

It checks the extension and returns false if that extension is in your array, so you just need to swap the return values:

function is_valid_ext( $url )
{
    foreach( $this->banned_ext as $ext )
    {
        if( $ext == substr( $url, strlen($url) - strlen( $ext ) ) ) return true;
    }
    return false;
}

There you go.
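One thing to keep in mind: since _spider() only follows links for which is_valid_ext() returns true, the inverted version follows feed links only and no longer visits the ordinary pages that link to them. A possible way around this, given here only as a rough sketch, is to keep the extension test separate from the decision to crawl. The helpers has_allowed_ext() and split_links() below are hypothetical names and are not part of the original class. The check also assumes that a query string after the extension and upper-case extensions should still count as a match.

```php
<?php
// Hypothetical helper, not part of the original class: returns true when the
// path of $url ends with one of the extensions in $allowed_ext.
function has_allowed_ext( $url, $allowed_ext )
{
    // Look only at the path so that "feed.xml?page=2" still matches.
    $path = parse_url( trim( $url ), PHP_URL_PATH );
    if( !is_string( $path ) ) return false; // no path, or malformed URL
    $path = strtolower( $path );

    foreach( $allowed_ext as $ext )
    {
        $ext = strtolower( $ext );
        if( substr( $path, -strlen( $ext ) ) === $ext ) return true;
    }
    return false;
}

// Hypothetical helper: splits a list of links into feeds to keep and
// other pages that the spider could still visit to find more feeds.
function split_links( $links, $allowed_ext )
{
    $feeds = array();
    $pages = array();
    foreach( $links as $link )
    {
        if( has_allowed_ext( $link, $allowed_ext ) ) $feeds[] = $link;
        else $pages[] = $link;
    }
    return array( $feeds, $pages );
}

// Example with made-up URLs.
$allowed_ext = array( ".xml", ".rss" );
list( $feeds, $pages ) = split_links(
    array( "http://example.com/news.rss", "http://example.com/about.html" ),
    $allowed_ext
);
```

In the spider, the links in $feeds would be the ones to record, while those in $pages could still be passed to _spider() so the crawl keeps going.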

Akkai · March 26, 2010, 5:36

Politeness and courtesy are signs of respect. When someone helps you, it is customary to say thank you, and it costs nothing.

djmick007 · March 29, 2010, 9:41

Sorry, Akkai. Thank you!