PHP Cookbook Reading Notes – Chapter 13: Web Automation

Fetching a URL with GET

There are three ways to fetch the contents of a URL:

  1. PHP's built-in file function file_get_contents()
  2. The cURL extension
  3. The HTTP_Request class from PEAR

// Method 1: file_get_contents()
$page = file_get_contents('http://www.example.com/robots.txt');

// Method 2: cURL
$c = curl_init('http://www.example.com/robots.txt');
// Return the response as a string instead of printing it
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($c);
curl_close($c);

// Method 3: PEAR HTTP_Request
require_once 'HTTP/Request.php';
$r = new HTTP_Request('http://www.example.com/robots.txt');
$r->sendRequest();
$page = $r->getResponseBody();

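The same approaches can fetch XML documents. Use http_build_query() to build a query string, put the credentials in the URL in the form username:password@hostname to access a protected page, and use cURL or PEAR's HTTP_Client class to follow redirects.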
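As a minimal sketch of the query-string step, the snippet below builds a GET URL with http_build_query(). The endpoint search.php and the parameter names are made up for illustration and do not come from the book.

```php
<?php
// Hypothetical endpoint and parameters, used only for illustration.
$base   = 'http://www.example.com/search.php';
$params = array('q' => 'php cookbook', 'page' => 2);

// http_build_query() URL-encodes each key and value and joins the pairs with '&'.
$url = $base . '?' . http_build_query($params);

print $url . "\n";
```

The resulting $url can be passed to any of the three fetching methods above.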

Fetching a URL with POST

Have PHP send a POST request and read the server's response.

// Method 1: file_get_contents() with a stream context
$url = 'http://www.example.com/submit.php';
$body = 'monkey=uncle&rhino=aunt';
// Declare the body encoding explicitly; PHP raises a notice when it is missing
$options = array('method' => 'POST',
                 'header' => 'Content-Type: application/x-www-form-urlencoded',
                 'content' => $body);
$context = stream_context_create(array('http' => $options));
print file_get_contents($url, false, $context);

// Method 2: cURL
$url = 'http://www.example.com/submit.php';
$body = 'monkey=uncle&rhino=aunt';
$c = curl_init($url);
curl_setopt($c, CURLOPT_POST, true);
curl_setopt($c, CURLOPT_POSTFIELDS, $body);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($c);
curl_close($c);

// Method 3: PEAR HTTP_Request
require 'HTTP/Request.php';
$url = 'http://www.example.com/submit.php';
$r = new HTTP_Request($url);
$r->setMethod(HTTP_REQUEST_METHOD_POST);
$r->addPostData('monkey','uncle');
$r->addPostData('rhino','aunt');
$r->sendRequest();
$page = $r->getResponseBody();

Fetching a URL with cookies

// Method 2: cURL
$c = curl_init('http://www.example.com/needs-cookies.php');
curl_setopt($c, CURLOPT_COOKIE, 'user=ellen; activity=swimming');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($c);
curl_close($c);

// Method 3: PEAR HTTP_Request
require 'HTTP/Request.php';
$r = new HTTP_Request('http://www.example.com/needs-cookies.php');
$r->addHeader('Cookie','user=ellen; activity=swimming');
$r->sendRequest();
$page = $r->getResponseBody();
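These notes skip a file_get_contents() variant for cookies. One plausible way to do it, shown here as a sketch and not necessarily the book's version, is to send the Cookie header through a stream context. The helper name pc_cookie_context is hypothetical.

```php
<?php
// Hypothetical helper: build a stream context that sends the given
// cookie string (for example 'user=ellen; activity=swimming').
function pc_cookie_context($cookie) {
    $options = array('method' => 'GET',
                     'header' => "Cookie: " . $cookie . "\r\n");
    return stream_context_create(array('http' => $options));
}

$context = pc_cookie_context('user=ellen; activity=swimming');

// Uncomment to perform the actual request:
// $page = file_get_contents('http://www.example.com/needs-cookies.php', false, $context);
```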

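Fetching a URL with custom headers

By modifying the request headers you can send a spoofed Referer or User-Agent when requesting the target URL. Many sites with hotlink protection check the Referer to decide whether to allow a download or access to a resource. This requires some background knowledge of HTTP headers.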
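The notes give no code for this section, so here is a rough sketch using cURL's CURLOPT_HTTPHEADER option. The function names pc_build_headers and pc_fetch_with_headers, the header values, and the image URL are assumptions made for illustration.

```php
<?php
// Turn array('Name' => 'value') into the 'Name: value' lines cURL expects.
function pc_build_headers(array $headers) {
    $lines = array();
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

// Fetch $url while sending the caller's own request headers.
function pc_fetch_with_headers($url, array $headers) {
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_HTTPHEADER, pc_build_headers($headers));
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    return curl_exec($c);   // false on failure
}

$headers = array('Referer'    => 'http://www.example.com/gallery.html',
                 'User-Agent' => 'Mozilla/5.0 (compatible; ExampleBot/1.0)');

// Uncomment to perform the actual request:
// $page = pc_fetch_with_headers('http://www.example.com/image.jpg', $headers);
```

cURL also has dedicated CURLOPT_REFERER and CURLOPT_USERAGENT options; the generic header list is used here because it covers any header.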

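Marking up a web page

The code below wraps selected words in a page with a span element while leaving anything inside HTML tags alone. With minor changes it can also replace sensitive keywords in a page, which is a very useful feature in China.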

$body = '
<p>I like pickles and herring.</p>

<a href="pickle.php"><img src="pickle.jpg"/>A pickle picture</a>

I have a herringbone-patterned toaster cozy.

<herring>Herring is not a real HTML element!</herring>
';

$words = array('pickle','herring');
$patterns = array();
$replacements = array();
foreach ($words as $i => $word) {
    // Pass the delimiter so that a '/' inside a word is escaped too
    $patterns[] = '/' . preg_quote($word, '/') .'/i';
    // \0 is the whole match, so the original capitalization is kept
    $replacements[] = "<span class='word'>\\0</span>";
}

// Split up the page into chunks delimited by a
// reasonable approximation of what an HTML element
// looks like.
$parts = preg_split("{(<(?:\"[^\"]*\"|'[^']*'|[^'\">])*>)}",
                    $body,
                    -1,  // Unlimited number of chunks
                    PREG_SPLIT_DELIM_CAPTURE);
foreach ($parts as $i => $part) {
    // Skip if this part is an HTML element
    if (isset($part[0]) && ($part[0] == '<')) { continue; }
    // Wrap the words with <span/>s
    $parts[$i] = preg_replace($patterns, $replacements, $part);
}

// Reconstruct the body
$body = implode('',$parts);

print $body;
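The keyword-replacement variant mentioned above could look roughly like this. It is a sketch, not code from the book: it reuses the same tag-splitting regular expression but masks each matched word with asterisks. The function name pc_mask_words is hypothetical.

```php
<?php
// Replace each listed word with asterisks, but only in text outside HTML tags.
function pc_mask_words($html, array $words) {
    $patterns = array();
    foreach ($words as $word) {
        $patterns[] = '/' . preg_quote($word, '/') . '/i';
    }
    // Same splitting approach as the markup example: tags become their own chunks.
    $parts = preg_split("{(<(?:\"[^\"]*\"|'[^']*'|[^'\">])*>)}",
                        $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        // Leave HTML elements (and their attributes) untouched
        if (isset($part[0]) && ($part[0] == '<')) { continue; }
        $parts[$i] = preg_replace_callback($patterns, function ($m) {
            return str_repeat('*', strlen($m[0]));
        }, $part);
    }
    return implode('', $parts);
}

print pc_mask_words('<a href="pickle.php">A pickle picture</a>', array('pickle')) . "\n";
```

The href attribute keeps its original value because the whole tag is skipped; only the visible text is masked.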

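Extracting all links from a page

This is another handy feature, useful when writing scrapers and similar programs.

An implementation that uses the tidy extension: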

$doc = new DOMDocument();
$opts = array('output-xml' => true,
              // Prevent DOMDocument from being confused about entities
              'numeric-entities' => true);
// tidy turns the possibly sloppy HTML into well-formed XML first
$doc->loadXML(tidy_repair_file('linklist.html',$opts));
$xpath = new DOMXPath($doc);
// Tell $xpath about the XHTML namespace
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
foreach ($xpath->query('//xhtml:a/@href') as $node) {
    $link = $node->nodeValue;
    print $link . "\n";
}

Extracting links with a regular expression:

$html = file_get_contents('linklist.html');
$links = pc_link_extractor($html);
foreach ($links as $link) {
    // $link[0] is the URL, $link[1] is the link text
    print $link[0] . "\n";
}

function pc_link_extractor($html) {
    $links = array();
    preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i',
                   $html,$matches,PREG_SET_ORDER);
    foreach($matches as $match) {
        $links[] = array($match[1],$match[2]);
    }
    return $links;
}
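When the tidy extension is not available, a different technique is to let DOMDocument::loadHTML() parse the markup directly. The following is a sketch of that alternative, not the book's code. The function name pc_dom_links and the sample HTML are assumptions for illustration.

```php
<?php
// Return array(url, text) pairs for every <a href="..."> in $html.
function pc_dom_links($html) {
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = array($a->getAttribute('href'), trim($a->textContent));
        }
    }
    return $links;
}

$html = '<p><a href="http://www.example.com/">Example</a> and '
      . '<a href="/about.html">About <b>us</b></a></p>';
$links = pc_dom_links($html);
foreach ($links as $link) {
    print $link[0] . "\n";
}
```

Unlike the regular expression, this version returns the link text without nested tags, because textContent flattens child elements.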

Converting text to HTML

The idea behind BBCode is very similar, which is why I am posting this code here.

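function pc_text2html($s) {
  $s = htmlentities($s);
  // split() was removed in PHP 7; explode() does the same job here
  $grafs = explode("\n\n",$s);
  for ($i = 0, $j = count($grafs); $i < $j; $i++) {
    // Turn http and ftp URLs into links
    $grafs[$i] = preg_replace('/((ht|f)tp:\/\/[^\s&]+)/',
                              '<a href="$1">$1</a>',$grafs[$i]);
    // Turn email addresses into mailto links ($0 is the whole address)
    $grafs[$i] = preg_replace('/[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}/i',
                              '<a href="mailto:$0">$0</a>',$grafs[$i]);
    // Wrap each chunk in a paragraph
    $grafs[$i] = '<p>'.$grafs[$i].'</p>';
  }
  return implode("\n\n",$grafs);
}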

Converting HTML to text

Ready-made code for this already exists: http://www.chuggnutt.com/html2text.php

Removing HTML and PHP tags

The strip_tags() function removes HTML and PHP tags from a string.
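A short illustration of strip_tags() follows, including its optional second argument that lists tags to keep. The sample string is made up for this sketch.

```php
<?php
// Sample input containing HTML tags and an embedded PHP block.
$html = '<p>Hello <b>World</b>!</p><?php echo 1; ?>';

// Remove every HTML and PHP tag.
$plain = strip_tags($html);

// Remove everything except <b> elements.
$bold_only = strip_tags($html, '<b>');

print $plain . "\n";      // Hello World!
print $bold_only . "\n";  // Hello <b>World</b>!
```

Attributes on the tags you keep are not removed, so strip_tags() with an allow list should not be treated as a complete defense against hostile markup.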
