
An Introduction to Four Common Basic Web-Scraping Methods in Python

Date: 2022-07-03 11:40:53

1. The urllib method

urllib is Python's built-in HTTP request library.

import urllib.request

# 1. Set the URL to fetch
url = 'http://www.baidu.com/'
# 2. Send a request to the target URL
response = urllib.request.urlopen(url)
# 3. Read the data
data = response.read()
# print(data)  # printing the raw data shows a bytes object with escape sequences
print(data.decode('utf-8'))  # decode() converts bytes in the given encoding to a string

# POST request
import urllib.parse
import urllib.request

url = 'http://www.iqianyue.com/mypost/'
# Build the form data to upload
postdata = urllib.parse.urlencode({
    'name': 'Jack',
    'pass': '123456'
}).encode('utf-8')  # convert the string to bytes
html = urllib.request.urlopen(url, data=postdata).read()
print(html)
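To see what the urlencode(...).encode(...) step actually produces, the following minimal sketch builds the same form body offline, without sending a request, and then decodes it back. The field names are the ones used in the POST example; nothing else is assumed.

```python
import urllib.parse

# Same form fields as in the POST example
form = {'name': 'Jack', 'pass': '123456'}

# urlencode() joins the pairs as key=value&key=value and percent-escapes them
query = urllib.parse.urlencode(form)
# urlopen() expects the request body as bytes, hence the encode() step
postdata = query.encode('utf-8')
print(postdata)

# parse_qs() reverses the encoding; each value comes back as a list
decoded = urllib.parse.parse_qs(postdata.decode('utf-8'))
print(decoded)
```

Passing data= to urlopen() is what turns the request into a POST; without it the same call sends a GET.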

# headers: for anti-scraping checks that inspect the request headers
import urllib.request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
request1 = urllib.request.Request('https://www.dianping.com/', headers=headers)  # the Request class builds a complete request
response1 = urllib.request.urlopen(request1).read()
print(response1.decode('utf-8'))
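A Request object only describes a request; nothing is sent until it is passed to urlopen(). The sketch below inspects such objects offline. The URL and the User-Agent string are placeholder values chosen for illustration, not values from the article.

```python
import urllib.request

# Hypothetical values for illustration; any URL and User-Agent string would do
headers = {'User-Agent': 'Mozilla/5.0 (example)'}
req = urllib.request.Request('https://www.example.com/', headers=headers)

# Nothing has been sent yet; these calls only read the request description
print(req.full_url)
print(req.get_method())              # GET, because no data was attached
print(req.get_header('User-agent'))  # urllib stores header names capitalized this way

# Attaching a bytes body switches the method to POST
post_req = urllib.request.Request('https://www.example.com/', data=b'name=Jack')
print(post_req.get_method())
```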

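# Timeout setting and exception handling
import urllib.request
import urllib.error

for i in range(20):
    try:
        # the timeout is deliberately tiny so that the error path is triggered
        response1 = urllib.request.urlopen('http://www.ibeifeng.com/', timeout=0.01)
        print('a')
    except urllib.error.URLError as e:
        print(e)
    except Exception as a:  # any other error; unlike BaseException, this doesn't swallow Ctrl+C
        print(a)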
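In practice a timeout is usually paired with a bounded number of retries instead of a fixed loop. The following is a minimal sketch of that idea, not part of urllib itself: fetch_with_retries and its parameters are hypothetical names, and the opener argument exists only so the function can be exercised without network access.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, retries=3, timeout=5.0, delay=1.0,
                       opener=urllib.request.urlopen):
    """Return the response body, retrying on network errors (hypothetical helper)."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            last_error = exc
            print(f'attempt {attempt} failed: {exc}')
            if attempt < retries:
                time.sleep(delay)  # wait before the next attempt
    # Every attempt failed: re-raise the last error for the caller to handle
    raise last_error
```

A call such as fetch_with_retries('http://www.ibeifeng.com/', timeout=3) would then make up to three attempts before giving up.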

2. The requests method

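Requests is an HTTP library written in Python on top of urllib3 and released under the Apache 2.0 license. Working with urllib directly is cumbersome; Requests is more convenient and saves a lot of work. It is one of the simplest and easiest-to-use HTTP libraries for Python, so it's the recommended choice for scraping. Requests isn't included in a default Python installation and has to be installed separately with pip (pip install requests).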

import requests
import chardet

# GET request
r = requests.get('https://www.taobao.com/')
# Print the raw bytes
# print(r.content)
# print(r.content.decode('utf-8'))  # decode manually
print(r.text)  # print the decoded text

# Detect the page encoding automatically; returns a dict
print(chardet.detect(r.content))
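chardet guesses the encoding statistically. When the candidate encodings are already known (for Chinese sites, typically UTF-8 or GBK), a simple fallback chain is often enough. The sketch below uses only the standard library; decode_page is a hypothetical helper name, and this is not how chardet works internally.

```python
def decode_page(raw, encodings=('utf-8', 'gbk')):
    """Decode bytes by trying each candidate encoding in order (hypothetical helper)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate and replace undecodable bytes
    return raw.decode(encodings[0], errors='replace'), encodings[0]


# GBK-encoded bytes are not valid UTF-8, so the helper falls through to 'gbk'
text, used = decode_page('中文'.encode('gbk'))
print(text, used)
```

With requests, the same helper could be applied to r.content instead of relying on r.text.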

# POST request: simulating a form login
import requests

# Build the data to submit to the page
data = {
    'name': 'Jack',
    'pass': '123456'
}
# Send the request with the login data
r = requests.post('http://www.iqianyue.com/mypost/', data=data)
print(r.text)  # print the response body
# Save the post-login HTML locally
with open('login.html', 'wb') as f:
    f.write(r.content)  # write the raw bytes

# headers: for anti-scraping checks that inspect the request headers
import requests

# Build the headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
r = requests.get('https://www.dianping.com/', headers=headers)
print(r.text)
print(r.status_code)  # check the status; 403 means the request was blocked

# cookies
# Skip the login step and fetch resources directly
import requests

# Start with an empty cookies dict
cookies = {}
with open('cookie.txt', 'r') as f:  # open the cookie file
    # Split the contents on ';' to get a list of pairs, then iterate over it
    # split() splits a string; strip() removes leading and trailing whitespace
    for line in f.read().split(';'):
        # maxsplit=1 splits each pair into exactly two parts
        name, value = line.strip().split('=', 1)
        # Add the pair to the cookies dict
        cookies[name] = value
r = requests.get('http://www.baidu.com', cookies=cookies)
data = r.text
with open('baidu.html', 'w', encoding='utf-8') as f1:
    f1.write(data)
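The parsing loop in the cookie example raises ValueError if the cookie string ends with a trailing ';' or contains an empty fragment, because such a fragment has no '=' to split on. A slightly more defensive version can be sketched as a standalone function; parse_cookie_string is a hypothetical name and the sample cookie values are made up.

```python
def parse_cookie_string(raw):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'} (hypothetical helper)."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if '=' not in pair:
            continue  # skip empty fragments such as the one after a trailing ';'
        # maxsplit=1 keeps any further '=' characters inside the value
        name, value = pair.split('=', 1)
        cookies[name] = value
    return cookies


# Made-up cookie string with a trailing ';' and a value that contains '='
sample = 'BAIDUID=abc123; token=x=y=z; '
print(parse_cookie_string(sample))
```

The resulting dict can be passed to requests.get(..., cookies=...) exactly as in the example.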

# Set a proxy (search the web for free proxy IPs)
# Works around sites that block your IP
import requests

proxies = {
    # 'scheme': 'scheme://ip:port' (the key must be lowercase)
    'http': 'http://222.83.160.37:61205'
}
req = requests.get('http://www.taobao.com/', proxies=proxies)
print(req.text)

# Set a timeout
import requests
from requests.exceptions import Timeout

try:
    response = requests.get('http://www.ibeifeng.com', timeout=0.01)
    print(response.status_code)
except Timeout:
    print('Request timed out!')

3. BS4: parsing with BeautifulSoup4

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')  # html.parser is Python's built-in parser
print(type(soup))
# Pretty-printed output
print(soup.prettify())

# 1. Get a tag (only the first matching tag is returned)
print(soup.p)      # the first p tag
print(soup.a)      # the first a tag
print(soup.title)  # the title tag

# 2. Get the text of a tag
print(soup.title.string)
print(soup.a.string)
print(soup.body.string)  # returns None when the tag has several children
print(soup.head.string)  # with a single child tag, returns that child's text

# 3. Get attributes
print(soup.a.attrs)  # returns a dict
print(soup.a['id'])  # value of one attribute

# 4. Child nodes
print(soup.p.contents)  # list of all child nodes
print(soup.p.children)  # iterator over the child nodes

# 5. Parent nodes
print(soup.p.parent)   # the parent of p, including everything inside it
print(soup.p.parents)  # generator over all ancestors of p

# 6. Sibling nodes (nodes on the same level)
# next_sibling and previous_sibling return the next and previous sibling
print(soup.a.next_sibling)
print(soup.a.previous_sibling)

# II. Searching the document tree
# 1. By tag name
# Find all a tags
res1 = soup.find_all('a')
print(res1)
# Find all a tags with class="sister"
# (class is a Python keyword, so using it as an argument name is a syntax error; use class_)
print(soup.find_all('a', class_='sister'))

# 2. By regular expression
# Find all tags whose name contains the letter d
res2 = soup.find_all(re.compile('d+'))
print(res2)

# 3. By list
# Find all title tags and a tags
res3 = soup.find_all(['title', 'a'])
print(res3)

# 4. By keyword argument
# Find the tags with id="link1"
res4 = soup.find_all(id='link1')
print(res4)

# 5. By text content (string= replaces the older text= argument)
res5 = soup.find_all(string='Tillie')
print(res5)
# Match text with a regular expression
res55 = soup.find_all(string=re.compile('Dormouse'))
print(res55)

# 6. Nested search
print(soup.find_all('p'))
# All a tags inside each p tag
for i in soup.find_all('p'):
    print(i.find_all('a'))

# III. CSS selectors
# 1. Select by tag name
res6 = soup.select('a')  # returns a list
print(res6)              # all a tags

# 2. Select by id (prefix the id with #)
print(soup.select('#link2'))

# 3. Select by class (prefix the class with .)
print(soup.select('.sister'))
print(soup.select('.sister')[2].get_text())  # get the text content

# 4. Attribute selector (a tags whose href equals the given value)
print(soup.select('a[href="http://example.com/elsie"]'))

# 5. Descendant selector (tags nested inside another tag)
print(soup.select('p a#link1'))

# 6. Selector list (tags matching either selector)
print(soup.select('a#link1,a#link2'))

# 7. Get the text of the selected tags
res7 = soup.select('p a.sister')
for i in res7:
    print(i.get_text())
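To get a feel for what find_all('a') does, it can help to see the same extraction written by hand. The sketch below does not use BeautifulSoup; it swaps in the standard library's html.parser module and collects the attributes and text of every a tag. LinkCollector and the HTML snippet are hypothetical, written only for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the attributes and text of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []        # one dict per finished <a> tag
        self._current = None   # the <a> tag currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._current = dict(attrs)
            self._current['text'] = ''

    def handle_data(self, data):
        # Text only counts while we are inside an <a> tag
        if self._current is not None:
            self._current['text'] += data

    def handle_endtag(self, tag):
        if tag == 'a' and self._current is not None:
            self.links.append(self._current)
            self._current = None


snippet = ('<p><a class="sister" id="link1">Elsie</a>, '
           '<a class="sister" id="link2">Lacie</a></p>')
collector = LinkCollector()
collector.feed(snippet)
collector.close()
print(collector.links)
```

BeautifulSoup builds a full tree on top of a parser like this one, which is what makes navigation (parent, siblings) and CSS selectors possible.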

# Exercise: scrape the 12 job titles on the 51job home page
from bs4 import BeautifulSoup
import requests

url = 'https://www.51job.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
html = requests.get(url, headers=headers)
data = html.content.decode('gbk')
soup = BeautifulSoup(data, 'html.parser')
# Get the span tags with class="at"
span = soup.find_all('span', class_='at')
# for i in span:
#     print(i.get_text())
# The same query with select() (CSS selector)
span1 = soup.select('span[class="at"]')
for m in span1:
    print(m.get_text())

4. XPath syntax

XPath is a language for locating information in XML documents. It can be used to traverse the elements and attributes of an XML document.

from lxml import etree

text = '''
<html>
<head>
    <title>Spring Festival Gala</title>
</head>
<body>
    <h1 name="title">Profile</h1>
    <div name="desc">
        <p name="name">Name: <span>Yue Yunpeng</span></p>
        <p name="addr">Address: Henan, China</p>
        <p name="info">Known for: Song of the Five Rings</p>
    </div>
'''
# Parse the text; etree.HTML() also fills in the missing closing tags
html = etree.HTML(text)
# result = etree.tostring(html)  # bytes
# print(result.decode('utf-8'))

# Find all p tags
p_x = html.xpath('//p')
print(p_x)
# Print the text of each p tag; .text only returns the tag's own text, not text inside child tags
for i in p_x:
    print(i.text)  # the <span> content is missing
# Better: string(.) returns all the text inside the tag
for i in p_x:
    print(i.xpath('string(.)'))
# Get the values of all name attributes
attr_name = html.xpath('//@name')
print(attr_name)
# Get all tags that have a name attribute
attr_name1 = html.xpath('//*[@name]')
print(attr_name1)
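The standard library's xml.etree.ElementTree supports a limited subset of XPath, which is enough to try the same kinds of queries without installing lxml. The sketch below is an approximation of the example, not a replacement for it: ElementTree needs well-formed XML, has no string(.) function and cannot select attributes with //@name, so those two steps are done in plain Python. The sample document is made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample document
doc = '''
<div name="desc">
  <p name="name">Name: <span>Jack</span></p>
  <p name="addr">Address: unknown</p>
</div>
'''
root = ET.fromstring(doc)

# .//p finds every <p> below the root, similar to //p in lxml
paragraphs = root.findall('.//p')

# .text stops at the first child tag, so the <span> content is missing
own_text = [p.text for p in paragraphs]
print(own_text)

# itertext() walks into child tags, similar to string(.) in lxml
full_text = [''.join(p.itertext()) for p in paragraphs]
print(full_text)

# .//*[@name] selects descendants with a name attribute (the root itself is not included)
names = [el.get('name') for el in root.findall('.//*[@name]')]
print(names)
```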
