How to Extract URLs and Their Anchor Texts From Links on Any Desired Page?

sunny_pro

PHP buds,

Here's the code, which uses DOM to grab links from Google:

PHP:
    <?php

    # Use the Curl extension to query Google and get back a page of results
    $url = "http://www.google.com";
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $html = curl_exec($ch);
    curl_close($ch);

    # Create a DOM parser object
    $dom = new DOMDocument();

    # Parse the HTML from Google.
    # The @ before the method call suppresses any warnings that
    # loadHTML might throw because of invalid HTML in the page.
    @$dom->loadHTML($html);

    # Iterate over all the <a> tags
    foreach($dom->getElementsByTagName('a') as $link) {
            # Show the <a href>
            echo $link->getAttribute('href');
            echo "<br />";
    }

    ?>
It echoes results sort of like this:

https://www.google.com.com/webhp?tab=ww
http://www.google.com.com/imghp?hl=bn&tab=wi
http://maps.google.com.com/maps?hl=bn&tab=wl


Now I'd like to adapt the above code so that, again using DOM, it extracts the URLs and anchor texts of all links on any chosen webpage, no matter what format the links are in.
Formats such as:

<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>

The anchor text should sit underneath each extracted URL, with a blank line between listed items. For example:


http://stackoverflow.com<br>
A programmer's forum<br>
<br>
http://google.com<br>
A searchengine<br>
<br>
http://yahoo.com<br>
An Index<br>
<br>

And so on.
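
To make the requested output concrete, here is a minimal DOM-based sketch, not a definitive implementation. The helper names `extractLinks()` and `renderLinks()` are made up for illustration, and the sample markup is the three link formats listed above. It reads `href` with `getAttribute()` and the anchor text with the `textContent` property.

```php
<?php
// Hypothetical helper (not from the original post): returns [href, anchor text] pairs.
function extractLinks(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        if (!$a->hasAttribute('href')) {
            continue; // skip <a> elements that are not links
        }
        $links[] = [$a->getAttribute('href'), trim($a->textContent)];
    }
    return $links;
}

// Hypothetical helper: URL, then anchor text, then an empty line.
function renderLinks(array $links): string
{
    $out = '';
    foreach ($links as [$href, $text]) {
        $out .= htmlspecialchars($href) . "<br>\n"
              . htmlspecialchars($text) . "<br>\n"
              . "<br>\n";
    }
    return $out;
}

$sample = '<a href="http://example1.com">Test 1</a>'
        . '<a class="foo" id="bar" href="http://example2.com">Test 2</a>'
        . '<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

echo renderLinks(extractLinks($sample));
```

Because the parser works on attributes rather than on the raw text, the order of `class`, `id`, `onclick` and `href` inside the tag does not matter.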
I'd also appreciate a second version that produces the same result using cURL only (not using DOM).
This cURL attempt did not work as intended:

PHP:
    <?php

    $curl = curl_init('http://devshed.com/');
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);

    $page = curl_exec($curl);

    if(curl_errno($curl)) // check for execution errors
    {
        echo 'Scraper error: ' . curl_error($curl);
        exit;
    }

    curl_close($curl);

    # The pattern needs delimiters (~ here); quotes inside the single-quoted string are escaped.
    $regex = '~<\s*a\s+[^>]*href\s*=\s*["\']?([^"\' >]+)["\' >]~i';
    # preg_match() stops at the first link; $list[0] is the whole match, $list[1] the URL.
    if ( preg_match($regex, $page, $list) )
        echo $list[0];
    else
        print "Not found";

    ?>
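
For the regex sample, one possible sketch is below. It assumes quoted `href` values and non-nested links, and the helper name `extractLinksWithRegex()` is made up for illustration. `preg_match_all()` with `PREG_SET_ORDER` returns every link rather than only the first, and a third capture group picks up the anchor text.

```php
<?php
// Hypothetical helper: regex-based extraction of href values and anchor texts.
function extractLinksWithRegex(string $html): array
{
    // ~ is the delimiter; i = case-insensitive, s = dot also matches newlines.
    // Group 1 is the quote character, group 2 the href value,
    // group 3 the raw inner HTML of the link.
    $pattern = '~<a\s(?:[^>]*?\s)?href\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)</a>~is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        // Drop nested tags such as <b> and decode entities such as &amp;.
        $text = trim(html_entity_decode(strip_tags($m[3])));
        $links[] = [html_entity_decode($m[2]), $text];
    }
    return $links;
}

$regexSample = '<p><a href="http://example1.com">Test 1</a> '
             . '<a class="foo" id="bar" href=\'http://example2.com\'><b>Test</b> 2</a></p>';

foreach (extractLinksWithRegex($regexSample) as [$href, $text]) {
    echo htmlspecialchars($href), "<br>\n", htmlspecialchars($text), "<br>\n<br>\n";
}
```

Unquoted `href` values and malformed markup are not handled here, which is the usual reason a parser is preferred over a pattern for arbitrary pages.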
Any chance I can get this achieved with cURL (not using DOM) and without the regex? You know I dislike regex and prefer simplicity in coding.
Nevertheless, from your end, I'd like to see one sample with regex and another without. ;)
Oh, by the way, I'd really prefer not to use limited functions such as get_file() and the like. You know what I mean.
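
For the version with neither DOM nor regex, a rough sketch using only plain string functions (`stripos`, `strpos`, `substr`) could look like this. The helper name is made up, and the sketch assumes `href="..."` or `href='...'` with no spaces around the equals sign, so treat it as an illustration of the idea rather than a robust scraper.

```php
<?php
// Hypothetical helper: finds links with plain string functions only.
function extractLinksWithStringFunctions(string $html): array
{
    $links = [];
    $offset = 0;
    while (($start = stripos($html, '<a', $offset)) !== false) {
        $offset = $start + 2;
        // Require whitespace after "<a" so <abbr>, <article> etc. are skipped.
        if (!isset($html[$offset]) || !ctype_space($html[$offset])) {
            continue;
        }
        $tagEnd = strpos($html, '>', $offset);
        if ($tagEnd === false) {
            break; // unterminated opening tag
        }
        $close = stripos($html, '</a>', $tagEnd);
        if ($close === false) {
            break; // no closing </a> left in the document
        }

        // Opening tag without the final ">".
        $tag = substr($html, $start, $tagEnd - $start);
        $hrefPos = stripos($tag, 'href=');
        if ($hrefPos === false || !ctype_space($tag[$hrefPos - 1])) {
            continue; // no href, or something like data-href
        }
        $quote = $tag[$hrefPos + 5] ?? '';
        if ($quote !== '"' && $quote !== "'") {
            continue; // unquoted values are not handled in this sketch
        }
        $valueStart = $hrefPos + 6;
        $valueEnd = strpos($tag, $quote, $valueStart);
        if ($valueEnd === false) {
            continue;
        }

        $href = substr($tag, $valueStart, $valueEnd - $valueStart);
        $inner = substr($html, $tagEnd + 1, $close - $tagEnd - 1);
        $links[] = [
            html_entity_decode($href),
            trim(html_entity_decode(strip_tags($inner))),
        ];
        $offset = $close + 4; // continue after this </a>
    }
    return $links;
}
```

The number of special cases already needed here suggests that "no regex, no DOM" is not really the simpler route.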

Give us your best shots!

Take care!

Cheers!

As you know, the code in my original post was for scraping all links found on the Google homepage.
Thanks for the hint. I have worked on it, but I am facing a little problem.
The first foreach belongs to the original script that scrapes the links from the Google homepage.
I have now added two more foreach loops to scrape the outerhtml and innertext of each link, in the hope that one of the two would return the links' anchor texts.
But I get a blank page now.
Here is the code:
PHP:
<?php

# Use the cURL extension to fetch the Devshed forum page
$url = "http://forums.devshed.com/";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);

# Create a DOM parser object
$dom = new DOMDocument();

# Parse the HTML from Devshed Forum.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);

# Iterate over all the <a> tags
foreach($dom->getElementsByTagName('a') as $link) {
        # Show the <a href>
        echo $link->getAttribute('href');
        echo "<br />";
		foreach($dom->getElementsByTagName('p') as $outertml) {
			# Show the <outerhtml>
			echo $outerhtml->getAttribute('outerhtml');
			echo "<br />";
			foreach($dom->getElementsByTagName('p') as $innertext) {
				# Show the <innertext>
				echo $innertext->getAttribute('innertext');
				echo "<br />";
			}
	}
		
}
?>
I need the anchor text of each link on the line directly underneath its URL.
Where am I going wrong here?
I am trying to scrape the links from here:
http://forums.devshed.com/

Php Development
Perl programming
C Programming

and so on ...
I am only scraping for learning purposes.

The HTML looks like this:
<h2 class="demi-title"><a class="forums" href="http://forums.devshed.com/perl-programming-6/">Perl Programming</a> <span class="viewing instruction">(16 Viewing)</span></h2>
</div>
<p class="forumdescription">Perl Programming forum discussing coding in Perl, utilizing Perl modules, and other Perl-related topics. Perl, the Practical Extraction and Reporting Language, is the choice for many for parsing textual information.</p>
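
A note on the attempt above: `outerhtml` and `innertext` are not attributes of an element, so `getAttribute()` returns an empty string for them, and the loop variable `$outertml` does not match the `$outerhtml` used inside the loop. As a minimal sketch, assuming the goal is each link's visible text and markup, the `textContent` property and `DOMDocument::saveHTML($node)` are the nearest equivalents. The markup is the `<h2>` line quoted above.

```php
<?php
$html = '<h2 class="demi-title"><a class="forums" '
      . 'href="http://forums.devshed.com/perl-programming-6/">Perl Programming</a> '
      . '<span class="viewing instruction">(16 Viewing)</span></h2>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$rows = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $rows[] = [
        'href'  => $link->getAttribute('href'),
        'text'  => $link->textContent,     // nearest match to "innertext"
        'outer' => $dom->saveHTML($link),  // nearest match to "outerhtml"
    ];
}

// URL first, anchor text on the next line, then an empty line.
foreach ($rows as $row) {
    echo htmlspecialchars($row['href']), "<br />\n";
    echo htmlspecialchars($row['text']), "<br />\n<br />\n";
}
```

Only one loop over the `<a>` elements is needed, because the anchor text belongs to the same node as the `href`.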


This did not work either; I still see a blank page. Nothing is scraped or echoed:
PHP:
<?php

# Use the cURL extension to fetch the Devshed forum page
$url = "http://forums.devshed.com/";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);

# Create a DOM parser object
$dom = new DOMDocument();

# Parse the HTML from Devshed Forum.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);

# Iterate over all the <a> tags
foreach($dom->getElementsByTagName('a') as $link) {
        # Show the <a href>
        echo $link->getAttribute('href');
        echo "<br />";
		echo $link->nodeValue;		
}
?>
Regards! :)
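
One thing worth ruling out with a blank page is that `curl_exec()` returned `false` or an empty body, for instance because the request failed or a redirect was not followed. In that case there is nothing for the DOM loop to print. A hedged sketch of a fetch step that fails loudly instead is below. The helper name `fetchHtml()`, the user agent string and the timeout values are assumptions for illustration only.

```php
<?php
// Hypothetical helper: fetches a page and throws instead of returning nothing.
function fetchHtml(string $url, int $timeout = 5): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,         // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,         // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout * 2,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; link-extractor-sketch)',
    ]);
    $body   = curl_exec($ch);
    $error  = curl_error($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false) {
        throw new RuntimeException("cURL error: $error");
    }
    if ($status >= 400 || $body === '') {
        throw new RuntimeException("Unexpected response (HTTP $status)");
    }
    return $body;
}

// Usage (not executed here):
// try {
//     $html = fetchHtml('http://forums.devshed.com/');
// } catch (RuntimeException $e) {
//     echo 'Scraper error: ', $e->getMessage();
// }
```

Turning on `display_errors` while testing locally also helps, since a fatal error with error display off looks exactly like a blank page.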

OK, I tried another one:

PHP:
<?php
/*
Using PHP's DOM functions to
  fetch hyperlinks and their anchor text
*/
$dom = new DOMDocument;
$dom->loadHTML(file_get_contents('https://stackoverflow.com/questions/50381348/extract-urls-anchor-texts-from-links-on-a-webpage-fetched-by-php-or-curl/')); 
 
// echo Links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach($dom->getElementsByTagName('a') as $link) {
	$href = $link->getAttribute('href');
	$anchor = $link->nodeValue;
	echo $href,"\t",$anchor,"\n";
}
echo '</pre>';
?>
I get these errors:

Warning: DOMDocument::loadHTML(): Tag header invalid in Entity, line: 185 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 187 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 187 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 187 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 187 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag ellipse invalid in Entity, line: 187 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 187 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag main invalid in Entity, line: 190 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag section invalid in Entity, line: 191 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag header invalid in Entity, line: 192 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag article invalid in Entity, line: 195 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag footer invalid in Entity, line: 322 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag article invalid in Entity, line: 323 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 325 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 326 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 327 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 328 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 332 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 333 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 334 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 335 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 336 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 337 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag lineargradient invalid in Entity, line: 338 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 339 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 340 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 341 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 342 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 343 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 344 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 346 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag lineargradient invalid in Entity, line: 347 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 348 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 349 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 350 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 351 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 353 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag lineargradient invalid in Entity, line: 354 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 355 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 356 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 358 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag lineargradient invalid in Entity, line: 359 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 360 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 361 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 362 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 363 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag stop invalid in Entity, line: 364 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 366 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 367 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 368 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 369 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 372 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 374 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 375 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 379 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 380 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 384 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 385 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 389 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 390 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 394 in C:\xampp\htdocs\cURL\crawler.php on line 7

Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 395 in C:\xampp\htdocs\cURL\crawler.php on line 7
Link Anchor

//app.appsflyer.com/com.fiverr.fiverr?pid=blockpage

//www.linkedin.com/company/fiverr-com

//twitter.com/fiverr

//www.pinterest.com/fiverr

//www.facebook.com/fiverr

//www.instagram.com/fiverr


Why all these errors?

I would appreciate replies on my previous two posts as well, since they contain different code and I want to know why it is not working.
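
Regarding the `loadHTML()` warnings listed above: they come from the libxml2 HTML parser behind `DOMDocument`, which predates HTML5 and reports elements such as `<header>`, `<main>` and `<svg>` as invalid while still building the document tree. A minimal sketch, assuming the warnings are only noise, is to collect them with `libxml_use_internal_errors(true)` instead of printing them. The sample markup is invented, apart from the `//twitter.com/fiverr` URL taken from the output above.

```php
<?php
$html = '<!DOCTYPE html><html><body><header><nav>'
      . '<a href="//twitter.com/fiverr"><svg><path d="M0 0"/></svg></a>'
      . '<a href="https://example.com/">Example</a>'
      . '</nav></header></body></html>';

// Collect parser messages internally rather than emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$warnings = libxml_get_errors(); // available for logging if wanted
libxml_clear_errors();
libxml_use_internal_errors($previous);

$pairs = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $pairs[] = [$link->getAttribute('href'), trim($link->textContent)];
}

echo "<pre>";
echo "Link\tAnchor\n";
foreach ($pairs as [$href, $anchor]) {
    echo htmlspecialchars("$href\t$anchor"), "\n";
}
echo "</pre>";
```

The first link in the sample wraps only an icon, so its anchor text is empty. That would explain output where the URLs are listed but the anchor column stays blank.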
 