For environment setup, please search for instructions yourself (e.g. on Baidu).
In the previous section we scraped the detail-page links and names of every cryptocurrency listed on Feixiaohao (非小号). In this section we will scrape the basic information of each cryptocurrency.
Each coin's detail page contains a lot of information, including its name, market cap, circulating supply, price, supported exchanges, block explorers, and so on, as shown in the screenshot below:
The content inside the red box is what we will scrape today.
The data structure for the coin details is designed as follows:
import scrapy

class Coin(scrapy.Item):
    _id = scrapy.Field()
    english_name = scrapy.Field()
    short_name = scrapy.Field()
    chinese_name = scrapy.Field()
    exchanger_count = scrapy.Field()
    publish_time = scrapy.Field()
    publish_name = scrapy.Field()
    white_paper = scrapy.Field()
    website = scrapy.Field()                 # list: a coin may have several sites
    block_explorer = scrapy.Field()          # list: a coin may have several explorers
    is_token = scrapy.Field()
    ico_price = scrapy.Field()
    description = scrapy.Field()
    market_capitalization = scrapy.Field()   # circulating market cap
    publish_count = scrapy.Field()           # total supply
    market_count = scrapy.Field()            # circulating supply
    tx_count = scrapy.Field()                # 24h trading volume
    market_ranking = scrapy.Field()
    tx_ranking = scrapy.Field()
    price = scrapy.Field()
    time = scrapy.Field()
    lowest_price = scrapy.Field()
    highest_price = scrapy.Field()
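In a spider callback, a Coin item behaves like a dict: assign the scraped fields, then yield the item so the pipeline receives it. A quick sketch (the values here are made up purely for illustration):

coin = Coin()
coin['english_name'] = 'Bitcoin'   # hypothetical example values
coin['short_name'] = 'BTC'
coin['chinese_name'] = '比特币'
print(dict(coin))                  # items convert cleanly to plain dicts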
Add the following class to pipelines.py:
import pymongo

from CoinStudio.items import Coin, CoinUrl  # adjust to your project's items module


class MongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("localhost", 27017)
        db = client["CoinStudio"]
        self.CoinUrl = db["CoinUrl"]
        self.Coin = db["Coin"]

    def process_item(self, item, spider):
        """Check the item's type, handle it accordingly, then write it to the database."""
        if isinstance(item, Coin):
            try:
                count = self.Coin.count_documents({'english_name': item['english_name']})
                if count > 0:
                    self.Coin.replace_one({'english_name': item['english_name']}, dict(item))
                else:
                    self.Coin.insert_one(dict(item))
            except Exception as e:
                print(e)
        elif isinstance(item, CoinUrl):
            try:
                count = self.CoinUrl.count_documents({'name': item['name']})
                if count > 0:
                    self.CoinUrl.replace_one({'name': item['name']}, dict(item))
                else:
                    self.CoinUrl.insert_one(dict(item))
            except Exception as e:
                print(e)
        return item
To avoid writing duplicate records, we check whether the database already contains a matching document before writing: if it does, we update it; if not, we insert a new one.
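As a design note, MongoDB can collapse this check-then-write into a single call with an upsert, which also removes the window between the count and the write. A minimal sketch against the same collection and key:

# update_one with upsert=True inserts the document when no match exists,
# otherwise updates the matched document in place.
self.Coin.update_one(
    {'english_name': item['english_name']},   # match key
    {'$set': dict(item)},                     # fields to write
    upsert=True,
)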
Inspecting the page, we can see that all of a coin's information sits inside a div element whose id is baseInfo:
First, let's look at the market-related information, which lives in the first child div, whose class name is firstPart:
It contains the coin's current price, the 24-hour high and low, and the description. The full description, however, lives on a separate detail page:
The circulating market cap, circulating supply, and 24-hour trading volume can each be taken from their corresponding div:
Next, the coin's name, listed exchanges, white paper, website, and block explorer are all in a list further down the page:
The current price sits in the element with class coinprice. We locate that div directly and match it with a regular expression, as follows:
# imports used by the extraction snippets below (the original omits them):
import re
from datetime import datetime
from pytz import utc   # assumed source of `utc`; the original import is not shown

# current price
coin_price = selector.xpath('//div[@class="coinprice"]').extract()
# NOTE: the original regex lost its HTML tags when the article was extracted;
# capturing the text between ">" and "<" reconstructs its intent.
current_price = re.findall(r'>(.*?)<', coin_price[0], re.S)
if len(current_price) != 0:
    coin['price'] = current_price[0]
    coin['time'] = datetime.utcnow().replace(tzinfo=utc)
    print(coin['price'], ' ', coin['time'])
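Regexes over raw HTML are brittle; since we already have a Selector, the same value can be read with XPath's text() directly. An equivalent sketch using the coinprice class named above:

# alternative: read the text node directly instead of regex-matching markup
price_text = selector.xpath('//div[@class="coinprice"]/text()').extract_first()
if price_text:
    coin['price'] = price_text.strip()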
The lowest and highest prices are in the element with class lowHeight, and can likewise be extracted with a single regular expression:
# lowest price and highest price
low_height = selector.xpath('//div[@class="lowHeight"]').extract()
# NOTE: as above, the tags in the original pattern were lost; this reconstruction
# captures the two tag-delimited values (highest first, then lowest).
prices = re.findall(r'>(.*?)<.*?>(.*?)<', low_height[0], re.S)
if len(prices) != 0:
    coin['highest_price'] = prices[0][0]
    coin['lowest_price'] = prices[0][1]
    print(coin['highest_price'], ' ', coin['lowest_price'])
As mentioned above, the description has to be scraped from a separate page, so we first extract the link to that page, then fetch it and pull out the detail text:
import requests
from scrapy.selector import Selector

# description
desc = selector.xpath('//div[@class="des"]/a').extract()
# NOTE: the original pattern was lost during extraction; pulling the href
# attribute of the <a> tag is a reconstruction of its intent.
description = re.findall(r'href="(.*?)"', desc[0], re.S)
if len(description) != 0:
    desc_url = base_url + description[0]   # base_url: the site root, defined elsewhere
    print(desc_url)
    response = requests.get(desc_url)
    desc_selector = Selector(text=response.text)   # wrap the raw HTML for XPath
    desc_content = desc_selector.xpath('//div[@class="boxContain"]/div/p').extract()
    # self.tool.replace(): the author's helper that strips residual HTML tags
    coin['description'] = self.tool.replace(''.join(i.strip() for i in desc_content))
    print(coin['description'])
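A side note on the requests.get() call: blocking inside a Scrapy callback stalls the crawler. The idiomatic alternative is to yield a second Request and carry the half-filled item along in meta. A sketch, where parse_description is a hypothetical callback name:

def parse_detail(self, response):
    ...
    # hand the description page back to Scrapy's scheduler instead of blocking
    yield scrapy.Request(desc_url, callback=self.parse_description,
                         meta={'coin': coin})

def parse_description(self, response):
    coin = response.meta['coin']
    paragraphs = response.xpath('//div[@class="boxContain"]/div/p//text()').extract()
    coin['description'] = ''.join(p.strip() for p in paragraphs)
    yield coin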
The market figures all live in matching divs, so they can be matched and extracted in one pass:
# market
market = selector.xpath('//div[@id="baseInfo"]/div[@class="firstPart"]/div/div[@class="value"]').extract()
values = []
for value in market:
    # capture the text between the tags (the leading ">" was lost in the
    # original pattern and is restored here)
    market_value = re.findall(r'>(.*?)<', value, re.S)
    values.append(market_value[0])
if len(values) != 0:
    coin['market_capitalization'] = values[0]
    coin['market_count'] = values[1]
    coin['publish_count'] = values[2]
    coin['tx_count'] = values[3]
    print(coin['market_capitalization'], ' ', coin['market_count'], ' ',
          coin['publish_count'], ' ', coin['tx_count'])
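Note that the indexing above silently assumes the four value divs always appear in this order; zipping them against the field names states that assumption explicitly and degrades gracefully when fewer values come back:

# same positional assumption as above, stated explicitly
fields = ['market_capitalization', 'market_count', 'publish_count', 'tx_count']
for field, value in zip(fields, values):
    coin[field] = value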
The basic info in the list takes a bit more work: a coin can have more than one block explorer and more than one website, so those are stored as arrays. The relevant code:
# base info
items = selector.xpath('//div[@id="baseInfo"]/div[@class="secondPark"]/ul/li').extract()
for item in items:
    # NOTE: the original pattern lost its tags during extraction; this
    # reconstruction captures the label text and then the rest of the <li>
    # (which may contain <a> tags) as the raw value.
    base_info = re.findall(r'>\s*(.*?)\s*</span>(.*?)</li>', item, re.S)
    if len(base_info) != 0:
        if base_info[0][0] == '英文名:':
            coin['english_name'] = self.tool.replace(base_info[0][1]).strip()
            print(coin['english_name'])
        elif base_info[0][0] == '中文名:':
            coin['chinese_name'] = self.tool.replace(base_info[0][1]).strip()
            print(coin['chinese_name'])
        elif base_info[0][0] == '上架交易所:':
            coin['exchanger_count'] = self.tool.replace(base_info[0][1]).strip()
            print(coin['exchanger_count'])
        elif base_info[0][0] == '发行时间:':
            coin['publish_time'] = self.tool.replace(base_info[0][1]).strip()
            print(coin['publish_time'])
        elif base_info[0][0] == '白皮书:':
            coin['white_paper'] = self.tool.replace(base_info[0][1]).strip()
            print(coin['white_paper'])
        elif base_info[0][0] == '网站:':
            # a coin may list several official sites; collect every link
            # (the href pattern is a reconstruction, the original was lost)
            websites = re.findall(r'href="(.*?)"', base_info[0][1], re.S)
            if len(websites) != 0:
                office_websites = []
                for website in websites:
                    office_websites.append(self.tool.replace(website).strip())
                coin['website'] = office_websites
                print(coin['website'])
        elif base_info[0][0] == '区块站:':
            explorers = []
            block_explorers = re.findall(r'href="(.*?)"', base_info[0][1], re.S)
            if len(block_explorers) != 0:   # was `is not []`, which is always True
                for block_explorer in block_explorers:
                    explorers.append(self.tool.replace(block_explorer).strip())
                coin['block_explorer'] = explorers
                print(coin['block_explorer'])
        elif base_info[0][0] == '是否代币:':
            coin['is_token'] = self.tool.replace(base_info[0][1]).strip()
            print(coin['is_token'])
        elif base_info[0][0] == '众筹价格:':
            ico_price = re.findall(r'>(.*?)<', base_info[0][1], re.S)
            coin['ico_price'] = self.tool.replace(ico_price[0]).strip()
            print(coin['ico_price'])
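The if/elif ladder works, but every plain-text branch does the same thing; a label-to-field mapping keeps those branches flat, while the two link-valued labels (网站:, 区块站:) keep their own handling. A refactoring sketch under the same assumptions as above:

# map each plain-text label to its item field
LABEL_FIELDS = {
    '英文名:': 'english_name',
    '中文名:': 'chinese_name',
    '上架交易所:': 'exchanger_count',
    '发行时间:': 'publish_time',
    '白皮书:': 'white_paper',
    '是否代币:': 'is_token',
}
label, raw_value = base_info[0]
if label in LABEL_FIELDS:
    coin[LABEL_FIELDS[label]] = self.tool.replace(raw_value).strip()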
5. The scraping process
Disclaimer: I do not provide the data. If you need the data, please scrape it yourself.
If you have any questions, feel free to reach out via the personal official account in the left sidebar. Thanks!