A Brief Look at Web Crawlers

';

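// $login_url, $post_fields, $cookie_file and $cookie_url are expected to be set earlier in the script.
$ch = curl_init($login_url);
curl_setopt($ch, CURLOPT_HEADER, 0);               // do not include the response headers in the output
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);       // 0 prints the response directly; set to 1 to get it back as a string
curl_setopt($ch, CURLOPT_POST, 1);                 // send a POST request
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file); // save the cookies received at login to this file
curl_exec($ch);                                    // run the request
curl_close($ch);                                   // release the handle
echo "\n\n";

$ch2 = curl_init($cookie_url);
curl_setopt($ch2, CURLOPT_COOKIEFILE, $cookie_file); // send the saved cookies with this request
curl_exec($ch2);
curl_close($ch2);
echo "\n\n";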

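0x06. PHP code for fetching several pages in parallel

The code is as follows (curl_multi runs the transfers concurrently inside one process rather than in separate threads):

<?php
$ch1 = curl_init();
$ch2 = curl_init();
curl_setopt($ch1, CURLOPT_URL, "http://www.XX.com/");
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_URL, "http://www.00.com/");
curl_setopt($ch2, CURLOPT_HEADER, 0);
$mh = curl_multi_init();
curl_multi_add_handle($mh, $ch1);
curl_multi_add_handle($mh, $ch2);
// Drive both transfers until none of them is still running.
do {
    curl_multi_exec($mh, $flag);
    if ($flag > 0) {
        curl_multi_select($mh); // wait for network activity instead of spinning the CPU
    }
} while ($flag > 0);
curl_multi_remove_handle($mh, $ch1);
curl_multi_remove_handle($mh, $ch2);
curl_multi_close($mh);
?>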
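The listing above prints both pages straight to the output. If you would rather get the bodies back as strings, a minimal sketch could look like the following. The function name fetch_all and the $timeout parameter are my own choices for illustration, and error handling is left out on purpose.

```php
<?php
// Hypothetical helper: fetch several URLs concurrently and return their bodies.
// Keys of $urls are preserved in the result.
function fetch_all(array $urls, int $timeout = 10): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_HEADER, false);        // body only, no response headers
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);    // give up on a transfer after $timeout seconds
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Run all transfers; curl_multi_select() blocks until there is activity.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(10000); // select can fail on some systems; back off briefly
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    // The handles are released when they go out of scope.
    return $results;
}
```

Usage would be something like `$pages = fetch_all(['a' => 'http://www.XX.com/', 'b' => 'http://www.00.com/']);`, after which `$pages['a']` holds the first page.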

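0x07. Fetching and parsing a single page is very simple.

The file() function works for this too; each site needs its own regular expressions to match the content you want.
The code below fetches a page's title, keywords, description and content. The database operations (storing the scraped content in a database) are commented out.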

The code is as follows:

<?php
header('Content-Type: text/html; charset=gbk');
$url = "http://www.jj59.com/qingganwenzhang/093750.html";
$lines_array = file($url); // read the whole page into an array of lines (needs allow_url_fopen)
//var_dump($lines_array);
$lines_string = implode('', $lines_array); // a tag may span several lines, so join them into one string
// Each pattern runs once against the whole page; calling count() on a string is an error in PHP 8, so the loop is gone.
$title = $title2 = $title3 = $title4 = '';
if (preg_match("/<title>(.*)<\/title>/is", $lines_string, $match)) { // title
    $title = $match[0];
}
if (preg_match("/<meta[^>]*?name=['\"]?keywords['\"]?[^>]*?>/is", $lines_string, $keywords)) { // keywords
    $title2 = $keywords[0];
}
if (preg_match("/<meta[^>]*?name=['\"]?description['\"]?[^>]*?>/is", $lines_string, $description)) { // description
    $title3 = $description[0];
}
if (preg_match("/

(.*)

/is", $lines_string, $content)) { // body content
    $title4 = $content[0];
    //var_dump($title4);
}
// The offsets strip the surrounding markup, e.g. the 7 characters of "<title>" and the 31 of '<meta name="keywords" content="'.
$title = substr($title, 7, -19);
$title2 = substr($title2, 31, -4);
$title3 = substr($title3, 34, -4);
//$title4=substr($title4,34,-40);
echo "Title:" . $title . "\n";
echo "KeyWords:" . $title2 . "\n";
echo "Description:" . $title3 . "\n";
echo "Content:" . $title4 . "\n";
//$query="insert into insun4 values('$i','$title','$title2','$title3','$title4')";
//$result=mysql_query($query) or die("Query failed"); // run the query
?>
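The fixed substr() offsets above only work for one particular page layout. As a rough alternative, capture groups can pull the values out directly. This is only a sketch: the function name extract_page_meta is made up for illustration, and the meta pattern assumes the name attribute comes before content, which is not true of every site.

```php
<?php
// Hypothetical helper: extract title, keywords and description from an HTML string.
// Assumption: in each <meta> tag the name attribute appears before content.
function extract_page_meta(string $html): array
{
    $meta = ['title' => null, 'keywords' => null, 'description' => null];

    // Capture the text between <title> and </title>.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        $meta['title'] = trim($m[1]);
    }

    // Capture the quoted content="..." value of the matching <meta> tag.
    foreach (['keywords', 'description'] as $name) {
        $pattern = '/<meta[^>]*?name=["\']?' . $name . '["\']?[^>]*?content=(["\'])(.*?)\1/is';
        if (preg_match($pattern, $html, $m)) {
            $meta[$name] = trim($m[2]);
        }
    }
    return $meta;
}
```

You would call it on the joined page string, for example `$meta = extract_page_meta($lines_string);`, and read `$meta['title']` and friends. Fields that are not found stay null.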

The code is as follows:

<?php
header('Content-Type: text/html; charset=gbk');
error_reporting(E_CORE_ERROR); // report core errors only (hides notices and warnings)
$url = "http://yxmhero1989.blog.163.com";
$lines_array = file($url); // read the whole page into an array of lines (needs allow_url_fopen)
//var_dump($lines_array);
$lines_string = implode('', $lines_array); // a tag may span several lines, so join them into one string
// Each pattern runs once against the whole page; calling count() on a string is an error in PHP 8, so the loop is gone.
$title = $title2 = $title3 = '';
// eregi() was removed in PHP 7, so preg_match() with the i flag is used instead.
if (preg_match("/<title>(.*)<\/title>/is", $lines_string, $match)) { // title
    $title = $match[0];
}
if (preg_match("/<meta[^>]*?name=['\"]?keywords['\"]?[^>]*?>/is", $lines_string, $keywords)) { // keywords
    $title2 = $keywords[0];
}
if (preg_match("/<meta[^>]*?name=['\"]?description['\"]?[^>]*?>/is", $lines_string, $description)) { // description
    $title3 = $description[0];
}
$title = substr($title, 7, -19);
$title2 = substr($title2, 31, -3);
$title3 = substr($title3, 34, -5);

echo "Title:" . $title . "\n";
echo "KeyWords:" . $title2 . "\n";
echo "Description:" . $title3 . "\n";

//$query="insert into insun4 values('$i','$title','$title2','$title3')";
//$result=mysql_query($query) or die("Query failed"); // run the query
?>

Result:
Title:Minghacker is Insun - InSun
KeyWords:InSun,Minghacker is Insun,网易博客,网易,blog
Description:百度(baidu)分词算法分析,对网易博客日志标签功能的期盼,CDbConnection failed to open the DB connection: could not find driver,Warning: date() [function.date]: It is not safe to rely on the system's timezone settings.,浅谈Python web框架,谷歌是如何做代码审查的（Things Everyone Should Do: Code Review）,Failed to connect to mailserver at localhost port 25,PHP flock文件锁,针对$_SERVER[’PHP_SELF’]的跨站脚本攻击,安全跑路指南与安全跑路指南升级版,InSun的网易博客,凡你醉处你说过皆非他乡,爱好写词摄影和美女。专注编程和安全。应该可以发好人卡,可能时机没到

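0x08. Store the scraped data in a database, build a front end with CSS+DIV, and list the data there with pagination. I won't paste the code; here is a screenshot instead:

See how simple it is: crawl data from a few big sites into your database and serve it from your front end. That fills out your own content and cuts down on manual work, which is great. But I am not sure whether this counts as a violation. Probably only malicious competition does, which is why an anti-crawler mechanism is a must~~~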
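Since the code for 0x08 is not shown, here is a rough sketch of the store-then-paginate idea. It is not the original implementation: it uses PDO with an in-memory SQLite database so it runs without a server (the article's commented-out code targets MySQL), and the table name pages and both function names are made up for illustration.

```php
<?php
// Assumption: an in-memory SQLite database stands in for the real one.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE pages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT, keywords TEXT, description TEXT
)');

// Hypothetical helper: store one scraped page using a prepared statement.
function save_page(PDO $db, string $title, string $keywords, string $description): void
{
    $stmt = $db->prepare(
        'INSERT INTO pages (title, keywords, description) VALUES (:t, :k, :d)'
    );
    $stmt->execute([':t' => $title, ':k' => $keywords, ':d' => $description]);
}

// Hypothetical helper: return one page of results for the front end (pages start at 1).
function list_pages(PDO $db, int $pageNo, int $perPage = 10): array
{
    $pageNo = max(1, $pageNo);
    $stmt = $db->prepare('SELECT id, title FROM pages ORDER BY id LIMIT :limit OFFSET :offset');
    $stmt->bindValue(':limit', $perPage, PDO::PARAM_INT);
    $stmt->bindValue(':offset', ($pageNo - 1) * $perPage, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

The front end would then call something like `list_pages($db, (int) ($_GET['page'] ?? 1))` and render the rows. Prepared statements also avoid the quoting problems of building the INSERT by string interpolation.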

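0x09. A brief look at anti-crawler mechanisms:
①. The major search sites, industry-specific sites and heavily commercial sites presumably all have anti-crawler mechanisms by now, so a crawler has to forge its User-Agent and wait between requests (that is, add a random sleep).
②. robots.txt is a text file in the root directory of a site that defines which content web crawlers may or may not access. Yahoo also explains on its site how to use robots.txt to keep a site or specific pages from being collected and indexed by search engines.
③. When a crawler visits, the user-agent shows the name of the crawler, such as Googlebot or MSNBot. Each search engine has its own user-agent; the main crawlers seen in China are listed below.

Baidu (baidu.com): Baiduspider
Google (google.com): Googlebot
Yahoo (yahoo.com): Yahoo
Youdao (yodao.com): YodaoBot
Soso (soso.com): Sosospider/Sosoimagespider
Sogou (sogou.com): sogou
Microsoft (msn.com): msnbot

Write your own check: if a visiting crawler is not one of the above, deny access. That way you keep your search-engine traffic while stopping competitors from crawling your data at will, which reduces the company's losses.
④. Two important articles analyze anti-crawler strategies:
Web performance optimization (3): anti-crawler strategies
http://dynamiclu.iteye.com/blog/1044645
A brief analysis of anti-crawler strategies for Internet sites
http://robbin.iteye.com/blog/451014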
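
Points ① and ③ above can be sketched as two small helpers. This is only an illustration under a few assumptions of mine: the function names are hypothetical, "looks like a crawler" is guessed from the words bot, spider or crawl (or an empty user-agent), and the allow list is just the names from the list in ③. Since the user-agent can be forged, as ① points out, this check alone will not stop a determined crawler.

```php
<?php
// Crawler names taken from the list in point 3 (matched case-insensitively).
const ALLOWED_BOTS = [
    'Baiduspider', 'Googlebot', 'Yahoo', 'YodaoBot',
    'Sosospider', 'Sosoimagespider', 'sogou', 'msnbot',
];

// True if the user-agent contains one of the allowed crawler names.
function is_allowed_bot(string $userAgent): bool
{
    foreach (ALLOWED_BOTS as $name) {
        if (stripos($userAgent, $name) !== false) {
            return true;
        }
    }
    return false;
}

// Assumed heuristic: an empty user-agent or one mentioning bot/spider/crawl is a crawler.
function looks_like_crawler(string $userAgent): bool
{
    return $userAgent === '' || preg_match('/bot|spider|crawl/i', $userAgent) === 1;
}

// Server side (point 3): deny crawlers that are not on the allow list.
function should_deny(string $userAgent): bool
{
    return looks_like_crawler($userAgent) && !is_allowed_bot($userAgent);
}

// Crawler side (point 1): a random pause between requests, in microseconds.
function polite_delay_us(float $minSeconds = 1.0, float $maxSeconds = 3.0): int
{
    return random_int((int) ($minSeconds * 1000000), (int) ($maxSeconds * 1000000));
}
```

On the server you might answer with a 403 when `should_deny($_SERVER['HTTP_USER_AGENT'] ?? '')` is true. On the crawler side, `usleep(polite_delay_us());` between requests, together with `curl_setopt($ch, CURLOPT_USERAGENT, ...)`, covers what ① describes.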
