An Introduction to Web Scraping Theory

1. If you don't disguise yourself, the server will mock you

import requests

response=requests.get("https://movie.douban.com/top250")
print(response)

If you try to build a scraper this naively, the server will tell you:

<Response [418]>

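Status code 418 ("I'm a teapot") started out as an April Fools' joke. It is defined in RFC 2324, a joke document describing the Hyper Text Coffee Pot Control Protocol (HTCPCP), and was never a serious part of HTTP. Here the server returns it because the request does not look like it came from a browser.

In plain language: "Are you an idiot?"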

2. How to avoid being mocked

import requests

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36 Edg/140.0.0.0"
}
response=requests.get("https://movie.douban.com/top250",headers=headers)
print(response)

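The trick is to use headers to disguise the request as a visit from a normal client. Getting the value is easy: press F12 in your browser, open the Network tab, and copy the User-Agent from the request headers of any request.

The output is now

<Response [200]>

which means the request succeeded.

print(response.text)

This gives you the page source.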
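To see why the first request was rejected, it can help to look at what requests sends when no headers are set. The sketch below builds the requests without sending them. The shortened browser_ua string is only an illustrative value, not one copied from a real browser.

```python
import requests

# User-Agent that requests sends when you do not set one yourself.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/"

# Shortened, illustrative browser string (assumption: use the one from your own browser).
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

session = requests.Session()
url = "https://movie.douban.com/top250"

# prepare_request() builds the request without sending it.
plain = session.prepare_request(requests.Request("GET", url))
disguised = session.prepare_request(
    requests.Request("GET", url, headers={"User-Agent": browser_ua})
)

print(plain.headers["User-Agent"])      # same as default_ua
print(disguised.headers["User-Agent"])  # the browser string
```

A server that filters on User-Agent can spot the default value immediately, which is why overriding it is usually the first step.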

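3. The requests library (the foundation of everything)

The code above is only the tip of the iceberg. To learn scraping properly, you need a solid foundation.

requests is a widely used HTTP library that makes it easy to send requests to websites, and it is the cornerstone of scraping.

Every request made through requests returns a Response object that holds the details of the response.

Some commonly used attributes and methods of the response are listed below.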

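import requests

url = 'the URL you want to request'  # placeholder: replace with a real address

# Send the request
x = requests.get(url)

print(x.text)  # page content as text
print(x.status_code)  # HTTP status code
print(x.reason)  # description of the status, e.g. OK
print(x.apparent_encoding)  # encoding guessed from the content, e.g. utf-8
print(x.json())  # body parsed as JSON (raises an error if the body is not valid JSON)
'''
{'name': '网站', 'num': 3, 'sites': [{'name': 'Google', 'info': ['Android', 'Google 搜索', 'Google 翻译']}, {'name': 'Runoob', 'info': ['菜鸟教程', '菜鸟工具', '菜鸟微信']}, {'name': 'Taobao', 'info': ['淘宝', '网购']}]}
'''
print(x.cookies)  # CookieJar object with the cookies sent by the server
print(x.headers)  # response headers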
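In practice, status_code and json() tend to be checked together, because a failed request or a non-JSON body would otherwise crash the script. The following is a minimal sketch of one way to wrap that. The helper name fetch_json and the 10-second default timeout are assumptions made for illustration, not part of requests.

```python
import requests


def fetch_json(url, timeout=10):
    """Return the parsed JSON body, or None if anything goes wrong."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
        return response.json()       # raises an error if the body is not JSON
    except (requests.RequestException, ValueError) as exc:
        print(f"request failed: {exc}")
        return None


# A URL without a scheme is rejected before any network traffic happens.
result = fetch_json("not-a-url")
print(result)
```

Returning None keeps the calling code simple, at the cost of hiding which kind of failure occurred. Re-raising or logging the exception may be a better fit for larger scrapers.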

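The requests methods are:

Method                        Description
delete(url, args)             Sends a DELETE request to the given url
get(url, params, args)        Sends a GET request to the given url
head(url, args)               Sends a HEAD request to the given url
patch(url, data, args)        Sends a PATCH request to the given url
post(url, data, json, args)   Sends a POST request to the given url
put(url, data, args)          Sends a PUT request to the given url
request(method, url, args)    Sends a request with the given method to the given url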

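get and request

import requests

kw = {'s': 'python教程'}

# Set the request headers
headers = {}

# params accepts a dict or string of query parameters; a dict is URL-encoded automatically
# ("url" is a placeholder: replace it with a real address)
response = requests.get("url", params=kw, headers=headers)
# Using request(): the first argument is the HTTP method
x = requests.request('get', 'url')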
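The automatic URL encoding of params can be inspected without sending anything, by preparing the request and printing its final URL. This is a small sketch, and example.com is a stand-in host chosen for illustration.

```python
import requests

kw = {"s": "python教程"}

# Build the request without sending it; example.com is a stand-in host.
prepared = requests.Request("GET", "https://example.com/search", params=kw).prepare()

print(prepared.url)
# https://example.com/search?s=python%E6%95%99%E7%A8%8B
```

The Chinese characters are percent-encoded as UTF-8 bytes, so there is no need to encode the query string by hand.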

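post

The post() method sends a POST request to the given url. Its general form is:

requests.post(url, data={key: value}, json={key: value}, args)

url is the url to request.
data is a dict, list of tuples, bytes, or file object to send to the given url.
json is a JSON-serializable object to send to the given url.
args stands for other parameters, such as cookies, headers, verify, and so on.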

# Import the requests package
import requests

# Form parameters named fname and lname
myobj = {'fname': 'RUNOOB','lname': 'Boy'}

# Send the request
x = requests.post('https://www.runoob.com/try/ajax/demo_post2.php', data = myobj)

# Print the page content
print(x.text)
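The difference between data and json is easiest to see by comparing the bodies they produce. The sketch below prepares two requests without sending them. The endpoint URL is hypothetical and is never contacted.

```python
import json

import requests

url = "https://example.com/submit"  # hypothetical endpoint, never contacted
myobj = {"fname": "RUNOOB", "lname": "Boy"}

# data= encodes the dict as an HTML form; json= serializes it as JSON.
as_form = requests.Request("POST", url, data=myobj).prepare()
as_json = requests.Request("POST", url, json=myobj).prepare()

print(as_form.headers["Content-Type"])  # application/x-www-form-urlencoded
print(as_form.body)                     # fname=RUNOOB&lname=Boy

print(as_json.headers["Content-Type"])  # application/json
print(json.loads(as_json.body))         # {'fname': 'RUNOOB', 'lname': 'Boy'}
```

Which one a server expects depends on the endpoint: classic HTML forms usually want data, while many JSON APIs want json.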

Additional request parameters

A request can carry additional parameters, such as request headers, query parameters, and a request body:

headers = {'User-Agent': 'Mozilla/5.0'}  # request headers
params = {'key1': 'value1', 'key2': 'value2'}  # query parameters: params appends a query string to the URL, typically for GET requests
data = {'username': 'example', 'password': '123456'}  # request body: data sends form data, typically for POST requests
response = requests.post('https://www.runoob.com', headers=headers, params=params, data=data)
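To check where each of these parameters ends up, the same call can be prepared instead of sent. This is only a sketch for inspection: it reuses the values above and makes no network request.

```python
import requests

headers = {"User-Agent": "Mozilla/5.0"}
params = {"key1": "value1", "key2": "value2"}
data = {"username": "example", "password": "123456"}

# Prepare the POST request without sending it.
prepared = requests.Request(
    "POST", "https://www.runoob.com", headers=headers, params=params, data=data
).prepare()

print(prepared.url)                    # https://www.runoob.com/?key1=value1&key2=value2
print(prepared.body)                   # username=example&password=123456
print(prepared.headers["User-Agent"])  # Mozilla/5.0
```

params lands in the URL, data lands in the body, and headers are sent alongside both.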

Example call using params and data

import requests

# Fetch weather information
base_url = 'https://api.weather.com/v3'
params = {
    'location': 'beijing',
    'apikey': 'your_api_key',
    'units': 'metric'
}
data = {
    'fields': 'temperature,humidity,windSpeed'
}

# params goes into the query string; json=data sends the dict as a JSON body
response = requests.post(f"{base_url}/weather/now", params=params, json=data)
weather_data = response.json()
print(f"Current temperature: {weather_data['temperature']}℃")

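4. How do you parse what you got back? The BeautifulSoup library

from bs4 import BeautifulSoup

Once the library is imported, you can start using it.

The HTML snippet below is used as the example throughout. It is a passage from Alice in Wonderland (referred to below as the "Alice" document):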

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Parsing this snippet with BeautifulSoup yields a BeautifulSoup object, which can print the document with standard indentation:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

A few simple ways to navigate the structured data:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Find the links of all <a> tags in the document:

for link in soup.find_all('a'):
    print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

Extract all the text from the document:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
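The lookups above are often combined to turn a page into structured records. As a rough sketch, the following collects the id, text, and href of every link with class "sister" from a shortened copy of the Alice document. The name sisters and the dict layout are choices made for this example only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the Alice document, so the example runs on its own.
html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with a trailing underscore) filters by CSS class.
sisters = [
    {"id": a.get("id"), "name": a.get_text(), "href": a.get("href")}
    for a in soup.find_all("a", class_="sister")
]

for row in sisters:
    print(row)
```

A list of dicts like this is easy to write to CSV or JSON later.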

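5. Putting together what we have learned

# How to get a page's title
from bs4 import BeautifulSoup
import requests

url = 'x'  # placeholder: replace with a real address
response = requests.get(url)
# Avoid garbled Chinese text
response.encoding = 'utf-8'
# Make sure the request succeeded
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find the <title> tag
    title_tag = soup.find('title')
    # Print the title text
    if title_tag:
        print(title_tag.get_text())
    else:
        print("No <title> tag found")
else:
    print("Request failed, status code:", response.status_code)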
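One possible refinement is to separate parsing from fetching, so that the parsing step can be tried on a plain HTML string without any network access. The helper name extract_title below is an assumption made for this sketch.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    if title_tag is None:
        return None
    return title_tag.get_text().strip()


with_title = extract_title(
    "<html><head><title>The Dormouse's story</title></head></html>"
)
without_title = extract_title("<p>no title here</p>")

print(with_title)
print(without_title)
```

In a scraper, the string passed in would be response.text from a successful request.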

Using the find and find_all methods

from bs4 import BeautifulSoup
import requests

# The site to scrape
url = 'https://www.baidu.com/'  # fetch the Baidu search engine home page

# Send an HTTP request to get the page content
response = requests.get(url)
# Avoid garbled Chinese text
response.encoding = 'utf-8'

# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(response.text, 'lxml')

# Find the first <a> tag (find returns None if there is no match)
first_link = soup.find('a')
print(first_link)
print("----------------------------")

# Get the href attribute of the first <a> tag
if first_link is not None:
    first_link_url = first_link.get('href')
    print(first_link_url)
print("----------------------------")

# Find all <a> tags
all_links = soup.find_all('a')
print(all_links)

# Find the <input> tag with id="su"
unique_input = soup.find('input', id='su')

# Get the value of the input element
if unique_input is not None:
    input_value = unique_input['value']
    print(input_value)