PHP Buddies,
What I am trying to do is learn to build a simple web crawler.
So at first, I will feed it a URL to start with.
It will then fetch that page and extract all the links into a single array.
Then it will fetch each of those linked pages and likewise extract all their links into a single array. It will do this until it reaches its maximum link depth.
Here is how I coded it:

PHP:

<?php
include('simple_html_dom.php');

$current_link_crawling_level = 0;
$link_crawling_level_max = 2;

if($current_link_crawling_level == $link_crawling_level_max)
{
    exit();
}
else
{
    $url = 'https://www.yahoo.com';
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    $html = curl_exec($curl);

    $current_link_crawling_level++;
    //to fetch all hyperlinks from the webpage
    $links = array();
    foreach($html->find('a') as $a)
    {
        $links[] = $a->href;
        echo "Value: $value<br />\n";
        print_r($links);

        $url = '$value';
        $curl = curl_init($value);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
        $html = curl_exec($curl);

        //to fetch all hyperlinks from the webpage
        $links = array();
        foreach($html->find('a') as $a)
        {
            $links[] = $a->href;
            echo "Value: $value<br />\n";
            print_r($links);
            $current_link_crawling_level++;
        }
    }
    echo "Value: $value<br />\n";
    print_r($links);
}
?>
I have a feeling I got confused and messed up the foreach loops. Nested too much. Is that the case? Hint at where I went wrong.
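For comparison, here is a minimal iterative sketch of the same fetch-extract-repeat idea, using a queue instead of nested loops. It assumes simple_html_dom's str_get_html() is available (since curl_exec() returns a plain string, not an object); the fetch_html() helper name is made up for the example, and relative links are not resolved:

PHP:

<?php
include('simple_html_dom.php');

// Hypothetical helper: download a page with cURL, then parse the
// returned string into a simple_html_dom object.
function fetch_html($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    $response = curl_exec($curl);
    curl_close($curl);
    // curl_exec() gives a plain string (or false), so it must be parsed
    // before find() can be called on it.
    return ($response === false) ? false : str_get_html($response);
}

$link_crawling_level_max = 2;
$queue = array(array('https://www.yahoo.com', 0)); // pairs of (url, depth)
$links = array();

while(!empty($queue))
{
    list($url, $depth) = array_shift($queue);
    if($depth >= $link_crawling_level_max)
    {
        continue; // do not descend past the maximum depth
    }
    $html = fetch_html($url);
    if($html === false)
    {
        continue; // skip pages that failed to download or parse
    }
    foreach($html->find('a') as $a)
    {
        $links[] = $a->href;
        $queue[] = array($a->href, $depth + 1); // crawl this link one level deeper
    }
}
print_r($links);
?>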
I am unable to test the script until I first sort out this error:
Fatal error: Uncaught Error: Call to a member function find() on string in C:\xampp\h
After that, I will be able to test it. Anyway, just from looking at the script, do you think I got it right or not?
Thanks
I just replaced:

PHP:

//$html = file_get_html('http://example.com');

with:

PHP:

$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$html = curl_exec($curl);
That is all!
That should not result in that error!
UPDATE:
I have been given this sample code just now, as a possible solution with str_get_html ...

PHP:

$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);
$html = str_get_html($response_string);

//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";
Gonna experiment with it.
Just sharing it here for other future newbies!
I am told:
"file_get_html is a special function from simple_html_dom library. If you open source code for simple_html_dom you will see that file_get_html() does a lot of things that your curl replacement does not. That's why you get your error."
Anyway, folks, I really don't want to be using this limited-capacity file_get_html(), so let's replace it with cURL. I tried my best at giving cURL a shot here. What about you? Care to show how to fix this thing?
I did a search on the PHP manual for str_get_html to be sure what the function does, but I am shown no results.
And so, I ask: just what does it do?
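From what I can tell, the manual shows nothing because str_get_html() is not a PHP built-in: like file_get_html(), it comes from the simple_html_dom library. It takes an HTML string and returns a parsed simple_html_dom object that find() can be called on:

PHP:

include('simple_html_dom.php');
// str_get_html() parses an HTML string into a simple_html_dom object.
$html = str_get_html('<p><a href="https://example.com">Example</a></p>');
echo $html->find('a', 0)->href; // prints: https://example.com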
PHP Buddies,
Look at these 2 updates. They both succeed in fetching the PHP manual page but fail to fetch the Yahoo homepage. Why is that?
The 2nd script is like the 1st except for one small change. Look at the commented-out parts in script 2 to see the difference; the added code comes right after the commented-out part.
SCRIPT 1

PHP:

<?php
//HALF WORKING
include('simple_html_dom.php');
$url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref'; // WORKS ON URL
//$url = 'https://yahoo.com'; // FAILS ON URL
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);
$html = str_get_html($response_string);
//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";
?>
SCRIPT 2

PHP:

<?php
//HALF WORKING
include('simple_html_dom.php');
$url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref'; // WORKS ON URL
//$url = 'https://yahoo.com'; // FAILS ON URL
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);
$html = str_get_html($response_string);
/*
//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";
*/
// Hide HTML warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument;
if($dom->loadHTML($html, LIBXML_NOWARNING)){
    // echo links and their anchor text
    echo '<pre>';
    echo "Link\tAnchor\n";
    foreach($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $anchor = $link->nodeValue;
        echo $href,"\t",$anchor,"\n";
    }
    echo '</pre>';
}else{
    echo "Failed to load html.";
}
?>
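My guess about the Yahoo failure (not verified): str_get_html() returns false when its input is empty or larger than simple_html_dom's MAX_FILE_SIZE constant (600000 bytes by default in the copies I have seen), and a big homepage like Yahoo's can exceed that while the php.net page stays under it. A diagnostic sketch that checks each step instead of assuming it worked:

PHP:

<?php
include('simple_html_dom.php');

$url = 'https://yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
// Some sites respond differently without a browser-like user agent.
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (crawler test)');
$response_string = curl_exec($curl);

if($response_string === false)
{
    // The download itself failed (DNS, TLS, redirect loop, ...).
    die('cURL error: ' . curl_error($curl));
}
echo 'Downloaded ' . strlen($response_string) . " bytes<br />\n";

$html = str_get_html($response_string);
if($html === false)
{
    // simple_html_dom rejects input above its MAX_FILE_SIZE limit,
    // which a large homepage can easily hit.
    die('str_get_html() returned false - page too large or empty?');
}
echo 'Found ' . count($html->find('a')) . " links<br />\n";
?>

Also, DOMDocument::loadHTML() in SCRIPT 2 expects a string, so passing it the raw $response_string rather than the str_get_html() result seems safer.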
Don't forget my previous post!
Cheers!