Web scraping returns only the table header, or the request requires verification

Problem description:

A beginner practicing web scraping on the Maoyan movie board ran into a case where the saved spreadsheet contained only an empty header row:

[Screenshot 1: the saved spreadsheet contains only the header row, no data]

The code being used for practice:

import requests
import bs4
from requests.exceptions import RequestException
import openpyxl


def get_one_page(url, headers):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    soup = bs4.BeautifulSoup(html, 'lxml')
    # Extract the movie titles
    movies = []
    targets = soup.find_all(class_='name')
    for each in targets:
        movies.append(each.get_text())
    # Extract the ratings
    scores = []
    targets = soup.find_all(class_='score')
    for each in targets:
        scores.append(each.get_text())
    # Extract the starring cast (second line of each .star block)
    star_message = []
    targets = soup.find_all(class_='star')
    for each in targets:
        star_message.append(each.get_text().split('\n')[1].strip())
        print(each.get_text().split('\n')[1].strip())
    # Extract the release dates
    play_time = []
    targets = soup.find_all(class_='releasetime')
    for each in targets:
        play_time.append(each.get_text())
    result = []
    length = len(movies)
    for j in range(length):
        result.append([movies[j], scores[j], star_message[j], play_time[j]])

    return result


def save_to_excel(result):
    wb = openpyxl.Workbook()
    ws = wb.active
    ws['A1'] = '电影名称'
    ws['B1'] = '评分'
    ws['C1'] = '主演'
    ws['D1'] = '上映时间'
    for item in result:
        ws.append(item)
    wb.save('猫眼电影TOP100.xlsx')


def main():
    result = []
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
    }
    for i in range(10):
        url = 'http://maoyan.com/board/4?offset=' + str(i * 10)
        html = get_one_page(url, headers)
        if html is None:  # the download failed; skip this page instead of crashing
            continue
        result.extend(parse_one_page(html))
    save_to_excel(result)


if __name__ == '__main__':
    main()

Source of the code:

https://blog.csdn.net/Waspvae/article/details/80617357?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.control&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.control

Cause analysis:

Maoyan has tightened its anti-scraping measures:
1. If our headers carry only a 'User-Agent', the request may get redirected to a verification page.
2. A scraper behaves very differently from an ordinary user: its request rate and total request count are far higher. The server then returns no usable data, so the program raises no error yet produces only a header row.

The quick diagnostic below shows which of the two cases you are hitting.
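To tell the cases apart, print what the server actually returned for one page. A minimal diagnostic sketch (the User-Agent is the one from the code above):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
}
response = requests.get('http://maoyan.com/board/4?offset=0', headers=headers)

# response.url is the final URL after any redirects: if it no longer points at
# /board/4, you were sent to the verification page (case 1). A 200 response
# whose body lacks the movie list matches case 2.
print(response.status_code)
print(response.url)
print(response.text[:300])  # peek at the start of the body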


Solutions:

1. If the error is the redirect to the verification page: add a Cookie entry to the headers:
[Screenshot 2]
Change the headers to:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    'Cookie': '__mta=208959789.1585106920033.1593509077842.1593509107607.47; _lxsdk_cuid=1710fbc224bc8-0048503dcb84eb-f313f6d-1a298c-1710fbc224cc8; mojo-uuid=bc73035186bc203e1e0a1a9d69cf0c8f; uuid_n_v=v1; uuid=010A4750BAB111EA977B252D9527D646FCA82B59C6B54FB3934C361D719643F2; _csrf=ab7e60b187089a5c797755f042abdbd14eed1760f8308dc455570ee9ea4edfa2; mojo-session-'
}

That should resolve it.
If the problem persists, change http to https in the line

url = 'http://maoyan.com/board/4?offset=' + str(i * 10)

Both changes appear together in the sketch below.
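Putting fix 1 together with a short pause between pages (which also addresses the request-frequency point from the cause analysis), the download loop could look like this sketch. It reuses get_one_page, parse_one_page and save_to_excel from the code above; the Cookie placeholder and the 1-second pause are assumptions you should adjust:

import time

result = []
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    # Placeholder -- paste the Cookie value copied from your own browser session
    'Cookie': 'PASTE_YOUR_OWN_COOKIE_HERE',
}
for i in range(10):
    url = 'https://maoyan.com/board/4?offset=' + str(i * 10)  # https instead of http
    html = get_one_page(url, headers)
    if html:
        result.extend(parse_one_page(html))
    time.sleep(1)  # assumed pause; slows the request rate to look less bot-like
save_to_excel(result)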
2. If you get only an empty header row, changing the IP you access from can resolve it.
For example, if you were on Wi-Fi, switch to a phone hotspot; a code-level alternative using a proxy is sketched below.
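Switching from Wi-Fi to a phone hotspot changes your IP by hand. If you prefer to change it in code, requests can also send the request through a proxy (this goes beyond the post's manual suggestion). A minimal sketch; the proxy address is a hypothetical placeholder for one you actually have access to:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
}
# Hypothetical proxy address -- replace it with a proxy you actually have
proxies = {
    'http': 'http://127.0.0.1:7890',
    'https': 'http://127.0.0.1:7890',
}
response = requests.get('https://maoyan.com/board/4?offset=0',
                        headers=headers, proxies=proxies)
print(response.status_code)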

The solutions above were worked out with the help of the following links:
1. https://blog.csdn.net/zhao1299002788/article/details/108558232?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-15.control&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-15.control
2. https://www.cnblogs.com/li-lou/p/13819783.html
3. https://blog.csdn.net/u011808596/article/details/108808717?utm_medium=distribute.pc_relevant_bbs_down.none-task-blog-baidujs-1.nonecase&depth_1-utm_source=distribute.pc_relevant_bbs_down.none-task-blog-baidujs-1.nonecase
