Python Web Scraping Full-Stack Tutorial - A Guide from Basics to Hands-On Practice

Overview

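In today's data-driven era, web scraping has become an important way to collect information from the web. 幽络源, a platform offering free source code and technical tutorials, presents this guide to web scraping with Python. It shows how to fetch web pages, extract data from them, and save the results, so that developers can gather web data efficiently.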

Main Content

1. Basic Scraping with the Requests Library

The Requests library is the standard tool for sending HTTP requests in Python, and it retrieves a page's HTML in a few lines:

import requests

URL = "https://www.geeksforgeeks.org/"
resp = requests.get(URL)

print("Status code:", resp.status_code)
print("\nResponse body:", resp.text)

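Key methods:
requests.get(URL): sends a GET request to the given URL
resp.status_code: the HTTP status code of the response (200 means success)
resp.text: the response body decoded as text, which here is the page's HTML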
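
The basic call above has no timeout and no error handling, so a stalled server can hang the script. A minimal sketch of a more defensive wrapper follows. The function name fetch_html and its session parameter are hypothetical and not part of the original tutorial; timeout= and raise_for_status() are standard Requests features.

```python
def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising on network or HTTP errors.

    `session` is a hypothetical injection point: anything with a
    requests-style .get(url, timeout=...) method works.
    """
    if session is None:
        # Imported lazily so a stand-in session can be passed without Requests.
        import requests
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text
```

A call such as fetch_html("https://www.geeksforgeeks.org/") would then either return the HTML or raise an exception the caller can handle.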

2. Handling JSON Data

Many modern websites expose API endpoints that return JSON, which Requests can parse directly:

import requests
URL = "http://api.open-notify.org/iss-now.json"

response = requests.get(URL)
if response.status_code == 200:
    data = response.json()
    print("International Space Station position data:", data)
else:
    print(f"Error: failed to fetch data. Status code: {response.status_code}")

The response.json() method parses the JSON response body into Python objects (a dictionary for this endpoint), ready for further processing.
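
Once the response is a dictionary, individual fields can be pulled out by key. The sketch below works on a hard-coded sample payload so that it runs without network access. The payload shape (an iss_position object holding latitude and longitude as strings) is an assumption about this endpoint, and parse_iss_position is a hypothetical helper name.

```python
import json

# Assumed sample of the endpoint's response; the values are made up.
SAMPLE_PAYLOAD = (
    '{"message": "success", "timestamp": 1700000000, '
    '"iss_position": {"latitude": "12.3456", "longitude": "-65.4321"}}'
)

def parse_iss_position(data):
    """Extract (latitude, longitude) as floats from the decoded payload."""
    position = data["iss_position"]
    # The coordinates are assumed to arrive as strings, so convert them.
    return float(position["latitude"]), float(position["longitude"])

data = json.loads(SAMPLE_PAYLOAD)  # same result type as response.json()
lat, lon = parse_iss_position(data)
print(f"ISS position: lat={lat}, lon={lon}")
```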

3. Downloading Images in Practice

Requests can also download image files from the web:

import requests

image_url = "https://media.geeksforgeeks.org/wp-content/uploads/20230505175603/100-Days-of-Machine-Learning.webp"
output_filename = "gfg_logo.webp"  # extension matches the WebP source image

response = requests.get(image_url)

if response.status_code == 200:
    with open(output_filename, "wb") as file:
        file.write(response.content)
    print(f"Image downloaded: {output_filename}")

Key points:
response.content holds the image as raw bytes
Open the output file in "wb" (binary write) mode
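
Because the bytes are written unchanged, the output file's extension should match the source format. One way to avoid a mismatch is to derive the filename from the URL itself. The helper below is a sketch using only the standard library; filename_from_url and its default argument are hypothetical names.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

def filename_from_url(url, default="download.bin"):
    """Return the last path segment of a URL, or `default` if there is none."""
    # urlparse(...).path drops the query string and fragment.
    path = unquote(urlparse(url).path)
    name = PurePosixPath(path).name
    return name or default

image_url = (
    "https://media.geeksforgeeks.org/wp-content/uploads/"
    "20230505175603/100-Days-of-Machine-Learning.webp"
)
print(filename_from_url(image_url))
```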

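4. Locating Elements Precisely with XPath

XPath is a powerful query language for selecting elements in a page, used here together with the lxml library: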

from lxml import etree
import requests

weather_url = "https://weather.com/en-IN/weather/today/l/60f76bec229c75a05ac18013521f7bfb52c75869637f3449105e9cb79738d492"

response = requests.get(weather_url)

if response.status_code == 200:
    dom = etree.HTML(response.text)
    elements = dom.xpath("//span[@data-testid='TemperatureValue' and contains(@class,'CurrentConditions')]")
    
    if elements:
        print(f"Current temperature: {elements[0].text}")
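
The live page's markup can change at any time, so it helps to try the selection logic on a fixed snippet first. The sketch below swaps in the standard library's xml.etree.ElementTree instead of lxml so that it runs offline. ElementTree supports only a subset of XPath (no contains()), so the class check is done in Python. The sample markup is invented for illustration and only imitates the attributes targeted above.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample; not the real page's markup.
SAMPLE_MARKUP = """<html><body>
  <div>
    <span data-testid="TemperatureValue" class="CurrentConditions--tempValue">31°</span>
  </div>
  <span data-testid="TemperatureValue" class="DailyContent--temp">27°</span>
</body></html>"""

def current_temperature(markup):
    """Return the text of the first matching temperature span, or None."""
    root = ET.fromstring(markup)
    # Attribute-equality predicates are within ElementTree's XPath subset.
    candidates = root.findall(".//span[@data-testid='TemperatureValue']")
    for span in candidates:
        # Stand-in for XPath's contains(@class, 'CurrentConditions').
        if "CurrentConditions" in span.get("class", ""):
            return span.text
    return None

print(current_temperature(SAMPLE_MARKUP))
```

Real HTML is often not well-formed XML, which is why the tutorial's etree.HTML from lxml remains the better choice for live pages.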

5. Extracting HTML Tables with Pandas

The pandas read_html function returns every table it finds on a page as a list of DataFrames:

import pandas as pd

url = "https://www.geeksforgeeks.org/html/html-tables/"
extracted_tables = pd.read_html(url)

if extracted_tables:
    for idx, table in enumerate(extracted_tables, 1):
        print(f"Table {idx}:")
        print(table)
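
To make clearer what table extraction involves, here is a dependency-free sketch built on the standard library's html.parser. It is not how pandas implements read_html; it is a simplified stand-in that collects cell text row by row and does not handle nested tables or colspan. The class name TableExtractor is hypothetical.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

sample = (
    "<table><tr><th>Name</th><th>Age</th></tr>"
    "<tr><td>Ann</td><td>30</td></tr></table>"
)
extractor = TableExtractor()
extractor.feed(sample)
print(extractor.tables)
```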

6. Hands-On Case: Word Frequency Count

Combine Requests with BeautifulSoup to count how often each word appears in a page's content:

import requests
from bs4 import BeautifulSoup
from collections import Counter

def start(url):
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, 'html.parser')
    wordlist = []

    for each_text in soup.find_all('div', {'class': 'entry-content'}):
        words = each_text.text.lower().split()
        wordlist.extend(words)
    
    clean_wordlist(wordlist)

def clean_wordlist(wordlist):
    clean_list = []
    symbols = "!@#$%^&*()_-+={[}]|\\;:\"<>?/.,"
    
    for word in wordlist:
        for symbol in symbols:
            word = word.replace(symbol, '')
        if len(word) > 0:
            clean_list.append(word)
    
    word_count = Counter(clean_list)
    top = word_count.most_common(10)
    
    print("Top 10 most frequent words:")
    for word, count in top:
        print(f'{word}: {count}')

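if __name__ == "__main__":
    # Assumption: the original omits the target URL. Substitute any page
    # whose main text sits inside a <div class="entry-content">.
    url = "https://www.geeksforgeeks.org/"
    start(url)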
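
The cleaning and counting steps can be tried in isolation on a plain string, without fetching a page. The sketch below is an alternative to the character-by-character symbol stripping above: it uses a regular expression to pull out runs of lowercase letters, digits, and apostrophes. The function name top_words is hypothetical.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most common words in `text` as (word, count) pairs."""
    # Lowercase first so that "The" and "the" count as the same word.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words).most_common(n)

sample_text = "The cat and the hat. The END!"
for word, count in top_words(sample_text, 3):
    print(f"{word}: {count}")
```

The regular expression drops punctuation in one pass, at the cost of also discarding non-ASCII letters; the pattern would need widening for non-English text.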

Conclusion

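Web scraping with Python is a powerful aid to data collection and analysis, and it is widely used in areas such as market research and content aggregation. This article has walked through the workflow from basic requests to processing the extracted data, and we hope it helps developers pick up this practical skill. In real projects, always follow each site's robots.txt rules and the applicable laws and regulations.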
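
Checking robots.txt can be automated with the standard library's urllib.robotparser. The sketch below parses a hard-coded sample file so that it runs offline. The rules, the example.com URLs, and the "my-crawler" user agent are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented sample rules; a real crawler would load the site's own file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

rules = build_parser(SAMPLE_ROBOTS_TXT)
print(rules.can_fetch("my-crawler", "https://example.com/articles/1"))
print(rules.can_fetch("my-crawler", "https://example.com/private/data"))
```

For a live site, the parser's set_url() and read() methods fetch and parse the robots.txt file directly instead of taking text.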

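If you need more technical help or want to share development experience, you are welcome to join our technical discussion QQ group: 307531422. 幽络源 will keep providing quality technical tutorials and project resources.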
