Python multi-process crawler and multiple data-storage methods (Python Crawler in Practice, part 2)

1. Multi-process crawler

  For a crawler that has to fetch a relatively large amount of data, or when the processing demands are high, Python's multi-process or multi-thread mechanisms can be used. With multiple processes the work is spread over several CPU cores, whereas multiple threads are several cooperating threads of execution inside a single process (in CPython, because of the GIL, only one of them executes Python bytecode at any given moment). Python ships several modules for multi-processing and multi-threading; here the multiprocessing module is used to build a multi-process crawler. During testing it turned out that, because the site has an anti-crawler mechanism, the crawler starts to report errors when the number of URLs and processes grows too large.
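
Before the full crawler, here is a minimal, self-contained sketch of the mechanism itself: multiprocessing.Pool spreads calls to a function across a fixed number of worker processes. The fetch function, the fake URLs and the pool size below are placeholders for illustration only, not part of the crawler that follows:

from multiprocessing import Pool
import time

def fetch(url):
    # stand-in for a real HTTP request; it just simulates one second of work
    time.sleep(1)
    return url

if __name__ == "__main__":
    urls = ["page-%d" % i for i in range(1, 5)]
    pool = Pool(processes=2)           # two worker processes
    results = pool.map(fetch, urls)    # blocks until every URL has been handled
    pool.close()
    pool.join()
    print(results)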

2. The code

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import time 
import requests
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
 '''
 @params: url; fetch the HTML of the given web page
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 # default to an empty string so the regex step degrades gracefully if the request fails
 response = ''
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print(e)
 return response

def scrap_qiushi_info(url):
 '''
 @params: url; scrape the joke records from one listing page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # the vote count is assumed to be the number immediately followed by "好笑" in the markup
 laugh_counts = re.findall('<i class="number">(\d+)</i>\s*好笑',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # each joke body is assumed to sit inside a <span> within div.content
 contents = re.findall('<div class="content">\s*<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 time.sleep(1)
 return duanzi_list

def normal_scapper(url_lists):
 '''
 Driver function: crawl the URLs one by one with the plain single-process crawler
 '''
 begin_time = time.time()
 for url in url_lists:
  scrap_qiushi_info(url)
 end_time = time.time()
 print("The plain single-process crawler took %f seconds in total" % (end_time - begin_time))

def muti_process_scapper(url_lists,process_num=2):
 '''
 Driver for the multi-process crawler, using the multiprocessing module to fetch the pages
 '''
 begin_time = time.time()
 pool = Pool(processes=process_num)
 pool.map(scrap_qiushi_info,url_lists)
 end_time = time.time()
 print("Crawling with %d processes took %s seconds" % (process_num,(end_time - begin_time)))

def main():
 '''
 main() entry point: build the URL list with a list comprehension and call the crawler functions
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 normal_scapper(url_lists)
 muti_process_scapper(url_lists,process_num=2)


if __name__ == "__main__":
 main()
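
One caveat about the code above: every worker process created by Pool gets its own copy of the global duanzi_list, so items appended inside the children are never visible to the parent process. Below is a sketch of one way around this, relying only on the lists that scrap_qiushi_info already returns and letting pool.map collect them (note that, as written above, the function returns the worker's cumulative list, so for a clean merge it would ideally build and return a fresh local list per page):

def muti_process_scapper_collect(url_lists, process_num=2):
    # collect the per-call return values instead of relying on a shared global
    pool = Pool(processes=process_num)
    per_page_results = pool.map(scrap_qiushi_info, url_lists)
    pool.close()
    pool.join()
    merged = []
    for page in per_page_results:
        merged.extend(page)
    return merged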

3. Storing the crawled data in MongoDB

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import time 
import json
import requests
import pymongo
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
 '''
 @params: url; fetch the HTML of the given web page
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 # default to an empty string so the regex step degrades gracefully if the request fails
 response = ''
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print(e)
 return response

def scrap_qiushi_info(url):
 '''
 @params: url; scrape the joke records from one listing page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # the vote count is assumed to be the number immediately followed by "好笑" in the markup
 laugh_counts = re.findall('<i class="number">(\d+)</i>\s*好笑',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # each joke body is assumed to sit inside a <span> within div.content
 contents = re.findall('<div class="content">\s*<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_mongo(datas):
 '''
 @datas: list of dictionaries to insert into MongoDB; insert_one() writes one document per call
 '''
 client = pymongo.MongoClient('localhost',27017)
 duanzi = client['duanzi_db']
 duanzi_info = duanzi['duanzi_info']
 for data in datas:
  duanzi_info.insert_one(data)

def query_data_from_mongo():
 '''
 Query the data stored in MongoDB
 '''
 client = pymongo.MongoClient('localhost',27017)['duanzi_db']['duanzi_info']
 for data in client.find():
  print(data)
 print("%d documents found in total" % client.find().count())


def main():
 '''
 main() entry point: build the URL list with a list comprehension and call the crawler functions
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_mongo(duanzi_list)

if __name__ == "__main__":
 main()
 #query_data_from_mongo()
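
If the whole list is already in memory, pymongo can also store it with a single insert_many() call instead of one insert_one() per record. A short sketch, reusing the duanzi_db / duanzi_info names from the code above:

import pymongo

def write_into_mongo_bulk(datas):
    # one bulk write instead of a round trip per document
    collection = pymongo.MongoClient('localhost', 27017)['duanzi_db']['duanzi_info']
    if datas:
        collection.insert_many(datas)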

4. Inserting the data into MySQL

  The records collected by the crawler are inserted into the relational database MySQL for persistent storage. First, create the database and the table in MySQL as follows:

1. Create the database
MariaDB [(none)]> create database qiushi;
Query OK, 1 row affected (0.00 sec)

2. Select the database
MariaDB [(none)]> use qiushi;
Database changed

3. Create the table
MariaDB [qiushi]> create table qiushi_info(id int(32) unsigned primary key auto_increment,username varchar(64) not null,level int default 0,laugh_count int default 0,comment_count int default 0,content text default '')engine=InnoDB charset='UTF8';
Query OK, 0 rows affected, 1 warning (0.06 sec)

MariaDB [qiushi]> show create table qiushi_info;
+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table       | Create Table                                                                                                                                                                                                                                                                                            |
+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| qiushi_info | CREATE TABLE `qiushi_info` (
  `id` int(32) unsigned NOT NULL AUTO_INCREMENT,
  `username` varchar(64) NOT NULL,
  `level` int(11) DEFAULT '0',
  `laugh_count` int(11) DEFAULT '0',
  `comment_count` int(11) DEFAULT '0',
  `content` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

The code that writes the data into MySQL is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#blog:http://www.cnblogs.com/cloudlab/

import re
import time 
import pymysql
import requests

duanzi_list = []

def get_web_html(url):
 '''
 @params: url; fetch the HTML of the given web page
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 # default to an empty string so the regex step degrades gracefully if the request fails
 response = ''
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print(e)
 return response

def scrap_qiushi_info(url):
 '''
 @params: url; scrape the joke records from one listing page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # the vote count is assumed to be the number immediately followed by "好笑" in the markup
 laugh_counts = re.findall('<i class="number">(\d+)</i>\s*好笑',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # each joke body is assumed to sit inside a <span> within div.content
 contents = re.findall('<div class="content">\s*<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_mysql(datas):
 '''
 @params: datas; write the crawled records into the MySQL database
 '''
 try:
  conn = pymysql.connect(host='localhost',port=3306,user='root',password='',db='qiushi',charset='utf8')
  cursor = conn.cursor(pymysql.cursors.DictCursor)
  for data in datas:
   data_list = (data['username'],int(data['level']),int(data['laugh_count']),int(data['comment_count']),data['content'])
   # parameterized placeholders keep quotes inside the joke text from breaking the statement
   sql = "INSERT INTO qiushi_info(username,level,laugh_count,comment_count,content) VALUES(%s,%s,%s,%s,%s)"
   cursor.execute(sql,data_list)
  conn.commit()
  cursor.close()
  conn.close()
 except Exception as e:
  print(e)


def main():
 '''
 main() entry point: build the URL list with a list comprehension and call the crawler functions
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_mysql(duanzi_list)

if __name__ == "__main__":
 main()
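
For larger batches, pymysql can send every row in a single executemany() call followed by one commit. A sketch under the same assumptions as above (the qiushi database and qiushi_info table already exist):

import pymysql

def write_into_mysql_bulk(datas):
    # one executemany() round trip plus a single commit
    conn = pymysql.connect(host='localhost', port=3306, user='root',
                           password='', db='qiushi', charset='utf8')
    try:
        with conn.cursor() as cursor:
            sql = ("INSERT INTO qiushi_info"
                   "(username,level,laugh_count,comment_count,content) "
                   "VALUES(%s,%s,%s,%s,%s)")
            rows = [(d['username'], int(d['level']), int(d['laugh_count']),
                     int(d['comment_count']), d['content']) for d in datas]
            cursor.executemany(sql, rows)
        conn.commit()
    finally:
        conn.close()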

5. Writing the crawled data to a CSV file

  A CSV file is a comma-separated text format that can be read as plain text or opened in Excel; it is a very common way to store data. Here the crawled records are written to a CSV file.

The code that saves the data to a CSV file is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#blog:http://www.cnblogs.com/cloudlab/

import re
import csv
import time 
import requests

duanzi_list = []

def get_web_html(url):
 '''
 @params: url; fetch the HTML of the given web page
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 # default to an empty string so the regex step degrades gracefully if the request fails
 response = ''
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print(e)
 return response

def scrap_qiushi_info(url):
 '''
 @params: url; scrape the joke records from one listing page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # the vote count is assumed to be the number immediately followed by "好笑" in the markup
 laugh_counts = re.findall('<i class="number">(\d+)</i>\s*好笑',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # each joke body is assumed to sit inside a <span> within div.content
 contents = re.findall('<div class="content">\s*<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_csv(datas,filename):
 '''
 @datas: list of dictionaries to write to the CSV file
 @filename: path of the target CSV file
 '''
 with open(filename,'w') as f:
  writer = csv.writer(f)
  writer.writerow(('username','level','laugh_count','comment_count','content'))
  for data in datas:
   writer.writerow((data['username'],data['level'],data['laugh_count'],data['comment_count'],data['content']))

def main():
 '''
 main() entry point: build the URL list with a list comprehension and call the crawler functions
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_csv(duanzi_list,'/root/duanzi_info.csv')

if __name__ == "__main__":
 main()
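
Since every record is already a dictionary, csv.DictWriter is a natural alternative to writing the tuples by hand. A small sketch; the field names simply mirror the keys built in scrap_qiushi_info and the call is interchangeable with write_into_csv above:

import csv

def write_into_csv_dict(datas, filename):
    fieldnames = ['username', 'level', 'laugh_count', 'comment_count', 'content']
    with open(filename, 'w') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()      # header row taken from fieldnames
        writer.writerows(datas)   # one row per joke dictionary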

6. Writing the crawled data to a plain text file

#!/usr/bin/python
# -*- coding: utf-8 -*-
#blog:http://www.cnblogs.com/cloudlab/

import re
import csv
import time 
import requests

duanzi_list = []

def get_web_html(url):
 '''
 @params: url; fetch the HTML of the given web page
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 # default to an empty string so the regex step degrades gracefully if the request fails
 response = ''
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print(e)
 return response

def scrap_qiushi_info(url):
 '''
 @params: url; scrape the joke records from one listing page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # the vote count is assumed to be the number immediately followed by "好笑" in the markup
 laugh_counts = re.findall('<i class="number">(\d+)</i>\s*好笑',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # each joke body is assumed to sit inside a <span> within div.content
 contents = re.findall('<div class="content">\s*<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_files(datas,filename):
 '''
 Write the crawled records to a plain text file
 @datas: the records to write
 @filename: path of the target file
 '''
 print("Start writing the file...")
 with open(filename,'w') as f:
  f.write("username" + "\t" + "level" + "\t" + "laugh_count" + "\t" + "comment_count" + "\t" + "content" + "\n")
  for data in datas:
   f.write(data['username'] + "\t" + \
    data['level'] + "\t" + \
    data['laugh_count'] + "\t" + \
    data['comment_count'] + "\t" + \
    data['content'] + "\n" + "\n"
   )

def main():
 '''
 main() entry point: build the URL list with a list comprehension and call the crawler functions
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_files(duanzi_list,'/root/duanzi.txt')

if __name__ == "__main__":
 main()
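
A quick way to check the output is to read the tab-separated file back and print the first few records. A tiny sketch, assuming the /root/duanzi.txt path used in main():

def preview_duanzi_file(filename='/root/duanzi.txt', limit=5):
    # print the first few lines written by write_into_files()
    with open(filename) as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            print(line.rstrip('\n'))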

 


