# News Crawler System - Guide to Adding a New Crawler

## Table of Contents

1. [Project Architecture Overview](#project-architecture-overview)
2. [Complete Workflow for Adding a New Crawler](#complete-workflow-for-adding-a-new-crawler)
3. [Detailed Implementation Steps](#detailed-implementation-steps)
4. [Example Code](#example-code)
5. [FAQ](#faq)

---

## Project Architecture Overview

### Core Components

```
crawler-module/
├── src/
│   ├── base/                    # Base-class layer
│   │   ├── crawler_base.py      # Crawler base classes
│   │   └── parser_base.py       # Parser base class
│   ├── crawlers/                # Crawler implementations
│   │   ├── netease/             # NetEase crawlers
│   │   ├── kr36/                # 36Kr crawlers
│   │   └── sina/                # Sina crawlers
│   ├── parsers/                 # Parser layer
│   │   ├── netease_parser.py
│   │   ├── kr36_parser.py
│   │   └── sina_parser.py
│   ├── utils/                   # Utility layer
│   │   ├── http_client.py       # HTTP client
│   │   ├── selenium_driver.py   # Selenium driver
│   │   └── logger.py            # Logging utilities
│   ├── database/                # Data layer
│   │   ├── models.py            # Data models
│   │   ├── repository.py        # Data access
│   │   └── connection.py        # Database connection
│   └── cli/                     # CLI entry point
│       └── main.py              # Command-line interface
└── config/
    ├── config.yaml              # Configuration file
    └── settings.py              # Configuration loader
```

### Architectural Design Patterns

1. **Base-class inheritance**: every crawler inherits from `DynamicCrawler` or `StaticCrawler` (see the sketch below)
2. **Parser separation**: the crawler collects article URLs from list pages; the parser parses the detail pages
3. **Configuration-driven**: crawler parameters are managed in a YAML configuration file
4. **Factory pattern**: the CLI creates crawler instances via dynamic import
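
Both base classes live in `src/base/crawler_base.py` and expose the same two hooks that concrete crawlers fill in. Purely as an illustration (not the project's actual code), the contract this guide relies on looks roughly like this:

```python
# Illustrative sketch only: the real base classes in src/base/crawler_base.py also
# handle config loading, logging, Selenium/HTTP setup, and persistence.
from abc import ABC, abstractmethod
from typing import List


class CrawlerSketch(ABC):
    """Shape of the contract that DynamicCrawler/StaticCrawler subclasses fill in."""

    def __init__(self, site_code: str, category_code: str):
        self.site_code = site_code
        self.category_code = category_code
        self.max_articles = 10  # assumed default; the project reads this from config.yaml

    @abstractmethod
    def _extract_article_urls(self, html: str) -> List[str]:
        """Extract article URLs from the list-page HTML (implemented per site)."""

    @abstractmethod
    def _fetch_articles(self, urls: List[str]) -> list:
        """Fetch each article URL and return parsed Article objects."""

    def crawl(self) -> list:
        """Template method: fetch the list page, extract URLs, then fetch articles."""
        html = self._fetch_list_page()  # provided by the real base class
        urls = self._extract_article_urls(html)
        return self._fetch_articles(urls)

    def _fetch_list_page(self) -> str:
        """Stub: DynamicCrawler uses Selenium here, StaticCrawler uses the HTTP client."""
        raise NotImplementedError
```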

---

## Complete Workflow for Adding a New Crawler

### Step Overview

```
1. Analyze the target website
   ↓
2. Create the crawler class file
   ↓
3. Create the parser class file
   ↓
4. Update the configuration file
   ↓
5. Register the crawler with the CLI
   ↓
6. Test and debug
```

---

## Detailed Implementation Steps

### Step 1: Analyze the Target Website

Before writing any code, gather the following information about the target website:

#### 1.1 Determine the Website Type

- **Static website**: content is present directly in the HTML; use `StaticCrawler`
- **Dynamic website**: content is loaded via JavaScript; use `DynamicCrawler` (a quick check is sketched below)
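
A quick way to tell the two apart is to fetch the page without a browser and check whether a headline you can see in the browser appears in the raw HTML. A minimal sketch (the URL and the headline are placeholders):

```python
# Quick static-vs-dynamic check (illustrative; URL and headline are placeholders).
# If a headline visible in the browser is missing from the raw HTML, the page is
# rendered by JavaScript and needs DynamicCrawler.
import requests

url = "https://www.example.com/tech"      # placeholder list-page URL
headline = "替换为页面上能看到的一条标题"    # replace with a headline visible in the browser

resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
resp.encoding = resp.apparent_encoding    # avoid mojibake on Chinese pages

if headline in resp.text:
    print("Content is in the raw HTML -> StaticCrawler")
else:
    print("Content is loaded by JavaScript -> DynamicCrawler")
```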

#### 1.2 Identify the Key Information

- List-page URL
- Rule for extracting article URLs (CSS selector)
- Structure of the article detail page
- Selectors for the title, publish time, author, and body text

#### 1.3 Identify the Category Information

- Category name (e.g. 科技, 娱乐, 财经)
- Category ID (must match the database)
- Category code (e.g. tech, entertainment, finance)

---

### Step 2: Create the Crawler Class File

#### 2.1 Create the Directory Structure

Suppose you are adding a website called `example` with a `tech` category:

```bash
# Create the site directory
mkdir src/crawlers/example

# Create __init__.py
touch src/crawlers/example/__init__.py
```

#### 2.2 Write the Crawler Class

Create the file `src/crawlers/example/tech.py`:

```python
"""
Example tech news crawler.
"""

from typing import List
from bs4 import BeautifulSoup

import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))

from base.crawler_base import DynamicCrawler, Article
from parsers.example_parser import ExampleParser


class TechCrawler(DynamicCrawler):
    """Example tech news crawler."""

    def _extract_article_urls(self, html: str) -> List[str]:
        """
        Extract the list of article URLs from the HTML.

        Args:
            html: HTML content of the list page

        Returns:
            List of article URLs
        """
        soup = BeautifulSoup(html, "lxml")
        urls = []

        # Write the selector to match the actual site structure
        news_items = soup.select("div.news-list div.news-item")

        for item in news_items:
            article_link = item.select_one("a.title")
            if article_link:
                href = article_link.get('href')
                if href:
                    # Handle relative paths
                    if href.startswith('/'):
                        href = f"https://www.example.com{href}"
                    urls.append(href)

        return urls

    def _fetch_articles(self, urls: List[str]) -> List[Article]:
        """
        Fetch article details.

        Args:
            urls: list of article URLs

        Returns:
            List of articles
        """
        articles = []
        parser = ExampleParser()

        for i, url in enumerate(urls[:self.max_articles]):
            try:
                article = parser.parse(url)
                article.category_id = self.category_id
                article.source = "Example"

                if not article.author:
                    article.author = "Example科技"

                if article.is_valid():
                    articles.append(article)
                    self.logger.info(f"[{i+1}/{len(urls)}] {article.title}")

            except Exception as e:
                self.logger.error(f"解析文章失败: {url} - {e}")  # "failed to parse article"
                continue

        return articles
```

#### 2.3 Notes on the Crawler Class

**Choosing the base class**:
- `DynamicCrawler`: uses Selenium; suited to dynamic websites
- `StaticCrawler`: uses requests; suited to static websites (see the sketch below)

**Methods you must implement**:
- `_extract_article_urls(html)`: extract the article URLs from the list page
- `_fetch_articles(urls)`: fetch the details of each article

**Available attributes**:
- `self.url`: list-page URL
- `self.category_id`: category ID
- `self.category_name`: category name
- `self.css_selector`: CSS selector to wait for when the page loads
- `self.max_articles`: maximum number of articles
- `self.http_client`: HTTP client (StaticCrawler)
- `self.driver`: Selenium driver (DynamicCrawler)
- `self.logger`: logger
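
The worked example in 2.2 targets a dynamic site. For a static site the same two hooks apply; a minimal sketch, assuming the base-class interface described above, and reusing the hypothetical `ExampleParser` from the running example (the selector is a placeholder):

```python
# Minimal sketch of a static-site crawler (illustrative; selectors are placeholders).
from typing import List
from bs4 import BeautifulSoup

from base.crawler_base import StaticCrawler, Article
from parsers.example_parser import ExampleParser


class StaticTechCrawler(StaticCrawler):
    """Hypothetical static-site variant of the tech crawler."""

    def _extract_article_urls(self, html: str) -> List[str]:
        soup = BeautifulSoup(html, "lxml")
        # Placeholder selector: adjust to the real list-page markup
        links = soup.select("ul.article-list li a[href]")
        return [a["href"] for a in links if a["href"].startswith("http")]

    def _fetch_articles(self, urls: List[str]) -> List[Article]:
        articles = []
        parser = ExampleParser()
        for url in urls[:self.max_articles]:
            try:
                article = parser.parse(url)
                article.category_id = self.category_id
                article.source = "Example"
                if article.is_valid():
                    articles.append(article)
            except Exception as e:
                self.logger.error(f"解析文章失败: {url} - {e}")  # "failed to parse article"
        return articles
```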

---

### Step 3: Create the Parser Class File

#### 3.1 Create the Parser File

Create the file `src/parsers/example_parser.py`:

```python
"""
Example article parser.
"""

import re
from bs4 import BeautifulSoup

import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from base.parser_base import BaseParser
from base.crawler_base import Article
from utils.http_client import HttpClient
from utils.logger import get_logger


class ExampleParser(BaseParser):
    """Example article parser."""

    def __init__(self):
        self.logger = get_logger(__name__)
        self.http_client = HttpClient()

    def parse(self, url: str) -> Article:
        """
        Parse an article detail page.

        Args:
            url: article URL

        Returns:
            Article object
        """
        # Fetch the page HTML
        html = self.http_client.get(url)
        soup = BeautifulSoup(html, "lxml")

        # Extract the title
        title = None
        title_tag = soup.select_one("h1.article-title")
        if title_tag:
            title = title_tag.get_text(strip=True)

        # Extract the publish time
        publish_time = None
        time_tag = soup.select_one("div.article-info span.publish-time")
        if time_tag:
            time_text = time_tag.get_text(strip=True)
            # Normalize the time format
            time_match = re.search(r"\d{4}-\d{2}-\d{2}", time_text)
            if time_match:
                publish_time = time_match.group()

        # Extract the author
        author = None
        author_tag = soup.select_one("div.article-info span.author")
        if author_tag:
            author = author_tag.get_text(strip=True)

        # Extract the body text
        content_lines = []
        article_body = soup.select_one("div.article-content")

        if article_body:
            # Remove unwanted tags
            for tag in article_body.select("script, style, iframe, .ad"):
                tag.decompose()

            # Extract paragraphs
            for p in article_body.find_all("p"):
                text = p.get_text(strip=True)
                if text:
                    content_lines.append(text)

        content = '\n'.join(content_lines)

        return Article(
            url=url,
            title=title,
            publish_time=publish_time,
            author=author,
            content=content,
        )
```

#### 3.2 Notes on the Parser Class

**Methods you must implement**:
- `parse(url)`: parse the article detail page and return an Article object

**Article object fields** (a sketch of the model follows this section):
- `url`: article URL (required)
- `title`: article title (required)
- `content`: article content (required)
- `publish_time`: publish time (optional)
- `author`: author (optional)
- `category_id`: category ID (set by the crawler)
- `source`: news source (set by the crawler)

**Available tools**:
- `self.http_client`: HTTP client
- `self.logger`: logger
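
The `Article` model is defined by the project (it is imported from `base.crawler_base` in the examples above). Purely for orientation, a hypothetical sketch of a model with the fields listed above and the `is_valid()` check the crawlers rely on; this is an assumption, not the project's actual class:

```python
# Illustrative sketch only: the real Article class lives in base/crawler_base.py.
# Field names follow the list above; is_valid() is assumed to check required fields.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    url: str                             # required
    title: Optional[str] = None          # required for a valid article
    content: Optional[str] = None        # required for a valid article
    publish_time: Optional[str] = None   # optional
    author: Optional[str] = None         # optional
    category_id: Optional[int] = None    # set by the crawler
    source: Optional[str] = None         # set by the crawler

    def is_valid(self) -> bool:
        """Assumed check: an article needs at least a URL, a title, and some content."""
        return bool(self.url and self.title and self.content)
```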

---

### Step 4: Update the Configuration File

Edit `config/config.yaml` and add the new website under the `sources` node:

```yaml
sources:
  # ... other website configurations ...

  example:
    base_url: "https://www.example.com"
    categories:
      tech:
        url: "https://www.example.com/tech"
        category_id: 4
        name: "科技"
        css_selector: "div.news-list"  # selector to wait for on the list page
      # More categories can be added
      entertainment:
        url: "https://www.example.com/entertainment"
        category_id: 1
        name: "娱乐"
        css_selector: "div.news-list"
```

#### Configuration Keys

| Key | Description | Example |
|-----|-------------|---------|
| `base_url` | Base URL of the website | `https://www.example.com` |
| `url` | List-page URL | `https://www.example.com/tech` |
| `category_id` | Category ID (must match the database) | `4` |
| `name` | Category name | `科技` |
| `css_selector` | Selector to wait for on the list page | `div.news-list` |
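
The project reads this file through `config/settings.py`. For reference only, a minimal sketch of how the section above could be loaded with PyYAML (paths and keys follow the example; the real loader may differ):

```python
# Minimal sketch of reading the config above with PyYAML; the project's own loader
# lives in config/settings.py, so treat this as illustrative only.
import yaml

with open("config/config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

tech_cfg = config["sources"]["example"]["categories"]["tech"]
print(tech_cfg["url"])           # https://www.example.com/tech
print(tech_cfg["category_id"])   # 4
print(tech_cfg["css_selector"])  # div.news-list
```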

#### Category ID Reference

According to the project documentation, the category IDs are as follows:

| ID | Category Name | Code |
|----|---------------|------|
| 1 | 娱乐 (Entertainment) | entertainment |
| 2 | 体育 (Sports) | sports |
| 3 | 财经 (Finance) | finance |
| 4 | 科技 (Tech) | tech |
| 5 | 军事 (Military) | war |
| 6 | 汽车 (Auto) | auto |
| 7 | 政务 (Government) | gov |
| 8 | 健康 (Health) | health |
| 9 | AI | ai |
| 10 | 教育 (Education) | education |

---

### Step 5: Register the Crawler with the CLI

Edit `src/cli/main.py` and add the new crawler to the `CRAWLER_CLASSES` dictionary:

```python
CRAWLER_CLASSES = {
    # ... other crawler entries ...

    'example': {
        'tech': ('crawlers.example.tech', 'TechCrawler'),
        'entertainment': ('crawlers.example.entertainment', 'EntertainmentCrawler'),
        # More categories can be added
    },
}
```

#### Registration Format

```python
'site_code': {
    'category_code': ('crawler module path', 'crawler class name'),
}
```

**Example** (a sketch of how the CLI can turn such an entry into an instance follows this list):
- `'example'`: site code (corresponds to `sources.example` in the config file)
- `'tech'`: category code (corresponds to `categories.tech` in the config file)
- `'crawlers.example.tech'`: module path (relative to the `src` directory)
- `'TechCrawler'`: crawler class name
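
Conceptually, the CLI's factory turns a registry entry into an instance via a dynamic import. The sketch below is illustrative only and may differ from the actual code in `src/cli/main.py`:

```python
# Illustrative sketch of the dynamic-import factory idea behind CRAWLER_CLASSES.
import importlib

CRAWLER_CLASSES = {
    'example': {'tech': ('crawlers.example.tech', 'TechCrawler')},
}

def create_crawler(site: str, category: str):
    """Look up the registry entry and instantiate the crawler class dynamically."""
    module_path, class_name = CRAWLER_CLASSES[site][category]
    module = importlib.import_module(module_path)  # e.g. crawlers.example.tech
    crawler_cls = getattr(module, class_name)      # e.g. TechCrawler
    return crawler_cls(site, category)             # matches TechCrawler('example', 'tech') in step 6.4

crawler = create_crawler('example', 'tech')
articles = crawler.crawl()
```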

---

### Step 6: Test and Debug

#### 6.1 Run a Single Crawler

```bash
# Change into the project directory
cd D:\tmp\write\news-classifier\crawler-module

# Run the new crawler
python -m src.cli.main example:tech

# Limit the number of articles
python -m src.cli.main example:tech --max 3
```

#### 6.2 List All Crawlers

```bash
python -m src.cli.main --list
```

You should see the newly added crawlers in the output (the header line `可用的爬虫:` means "available crawlers"):
```
可用的爬虫:
- netease:entertainment
- netease:tech
- kr36:ai
- example:tech
- example:entertainment
```

#### 6.3 Check the Logs

```bash
# Log file location
type logs\crawler.log
```

#### 6.4 Debugging Tips

**Enable debug mode**:
```bash
python -m src.cli.main example:tech --debug
```

**Test the parser manually**:
```python
from parsers.example_parser import ExampleParser

parser = ExampleParser()
article = parser.parse("https://www.example.com/article/123")
print(article.title)
print(article.content)
```

**Test the crawler manually**:
```python
from crawlers.example.tech import TechCrawler

crawler = TechCrawler('example', 'tech')
crawler.max_articles = 3
articles = crawler.crawl()

for article in articles:
    print(article.title)
```

---

## Example Code

### Complete Example: Adding a Sina Entertainment Crawler

Suppose we want to add an entertainment-category crawler for the Sina website:

#### 1. Create the Crawler Class

File: `src/crawlers/sina/entertainment.py`

```python
"""
Sina entertainment news crawler.
"""

from typing import List
from bs4 import BeautifulSoup

import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))

from base.crawler_base import DynamicCrawler, Article
from parsers.sina_parser import SinaEntertainmentParser


class EntertainmentCrawler(DynamicCrawler):
    """Sina entertainment news crawler."""

    def _extract_article_urls(self, html: str) -> List[str]:
        """Extract the list of article URLs from the HTML."""
        soup = BeautifulSoup(html, "lxml")
        urls = []

        # Selector for the Sina entertainment list page
        news_items = soup.select("div.feed_card.ty-feed-card-container div.cardlist-a__list div.ty-card.ty-card-type1")

        for item in news_items:
            article_link = item.select_one("a")
            if article_link:
                href = article_link.get('href')
                if href:
                    urls.append(href)

        return urls

    def _fetch_articles(self, urls: List[str]) -> List[Article]:
        """Fetch article details."""
        articles = []
        parser = SinaEntertainmentParser()

        for i, url in enumerate(urls[:self.max_articles]):
            try:
                article = parser.parse(url)
                article.category_id = self.category_id
                article.source = "新浪"

                if not article.author:
                    article.author = "新浪娱乐"

                if article.is_valid():
                    articles.append(article)
                    self.logger.info(f"[{i+1}/{len(urls)}] {article.title}")

            except Exception as e:
                self.logger.error(f"解析文章失败: {url} - {e}")  # "failed to parse article"
                continue

        return articles
```

#### 2. Create the Parser Class

File: `src/parsers/sina_parser.py` (append at the end of the file)

```python
class SinaEntertainmentParser(BaseParser):
    """Sina entertainment news parser."""

    def __init__(self):
        self.logger = get_logger(__name__)
        self.http_client = HttpClient()

    def parse(self, url: str) -> Article:
        """Parse a Sina article detail page."""
        html = self.http_client.get(url)
        soup = BeautifulSoup(html, "lxml")

        # Get the article title
        article_title_tag = soup.select_one("div.main-content h1.main-title")
        article_title = article_title_tag.get_text(strip=True) if article_title_tag else "未知标题"  # "unknown title"

        # Get the publish time
        time_tag = soup.select_one("div.main-content div.top-bar-wrap div.date-source span.date")
        publish_time = time_tag.get_text(strip=True) if time_tag else "1949-01-01 12:00:00"

        # Get the author
        author_tag = soup.select_one("div.main-content div.top-bar-wrap div.date-source a")
        author = author_tag.get_text(strip=True) if author_tag else "未知"  # "unknown"

        # Get the article body paragraphs
        article_div = soup.select_one("div.main-content div.article")
        if not article_div:
            raise ValueError("无法找到文章内容")  # "article content not found"

        paragraphs = article_div.find_all('p')
        content = '\n'.join(p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True))

        return Article(
            url=url,
            title=article_title,
            publish_time=publish_time,
            author=author,
            content=content,
        )
```

#### 3. Update the Configuration File

File: `config/config.yaml`

```yaml
sources:
  # ... other website configurations ...
  sina:
    base_url: "https://sina.com.cn"
    categories:
      auto:
        url: "https://auto.sina.com.cn/"
        category_id: 6
        name: "汽车"
        css_selector: "div.feed_card.ty-feed-card-container div.cardlist-a__list div.ty-card.ty-card-type1"
        detail_css_selector: "div.main-content"
      gov:
        url: "https://gov.sina.com.cn/"
        category_id: 7
        name: "政务"
        css_selector: "a[href]"
      entertainment:  # new
        url: "https://ent.sina.com.cn/"
        category_id: 1
        name: "娱乐"
        css_selector: "div.feed_card.ty-feed-card-container div.cardlist-a__list"
```

#### 4. Register the Crawler with the CLI

File: `src/cli/main.py`

```python
CRAWLER_CLASSES = {
    # ... other entries ...

    'sina': {
        'auto': ('crawlers.sina.auto', 'SinaAutoCrawler'),
        'gov': ('crawlers.sina.gov', 'SinaGovCrawler'),
        'entertainment': ('crawlers.sina.entertainment', 'EntertainmentCrawler'),  # new
    },
}
```

#### 5. Test Run

```bash
# Run the crawler
python -m src.cli.main sina:entertainment

# Test with a limited number of articles
python -m src.cli.main sina:entertainment --max 3
```

---

## FAQ

### Q1: How do I decide between DynamicCrawler and StaticCrawler?

**How to tell**:
1. View the page source in your browser (Ctrl+U)
2. If the source contains the full article list and content, use `StaticCrawler`
3. If the source contains very little and the content is loaded dynamically via JavaScript, use `DynamicCrawler`

**Examples**:
- NetEase news: the list page requires scrolling to load → `DynamicCrawler`
- A simple blog site: content is directly in the HTML → `StaticCrawler`

### Q2: How do I find the right CSS selector?

**Method 1: Use the browser developer tools**
1. Press F12 to open the developer tools
2. Use the element picker (Ctrl+Shift+C) and click the target element
3. In the Elements panel, right-click the element → Copy → Copy selector

**Method 2: Test with BeautifulSoup**
```python
from bs4 import BeautifulSoup

html = """<html>...</html>"""
soup = BeautifulSoup(html, "lxml")
elements = soup.select("div.news-list a")
print(len(elements))
```

### Q3: The crawler fails to run — how do I debug it?

**Steps**:
1. Check the log file: `logs\crawler.log`
2. Enable debug mode: `python -m src.cli.main example:tech --debug`
3. Manually verify that the URL is reachable
4. Check that the CSS selectors are correct
5. Check whether the site has anti-crawling measures (e.g. login required, CAPTCHA)

**Common errors**:
- `未找到新闻列表` ("news list not found"): wrong CSS selector
- `解析文章失败` ("failed to parse article"): malformed URL or the site structure has changed
- `HTTP请求失败` ("HTTP request failed"): network problem or anti-crawling block

### Q4: How do I handle relative URLs?

```python
href = article_link.get('href')

if href.startswith('/'):
    # Relative path: prepend the base URL
    base_url = "https://www.example.com"
    href = base_url + href
elif href.startswith('http'):
    # Absolute URL: use as-is
    pass
else:
    # Anything else: join with the current page's base path
    href = "https://www.example.com/" + href
```
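
Alternatively, the standard library's `urllib.parse.urljoin` covers all of these cases (including protocol-relative `//` links) in one call:

```python
# urljoin resolves relative, absolute, and protocol-relative links against the page URL.
from urllib.parse import urljoin

page_url = "https://www.example.com/tech"          # the list page the link came from

print(urljoin(page_url, "/article/123"))           # https://www.example.com/article/123
print(urljoin(page_url, "article/123"))            # https://www.example.com/article/123
print(urljoin(page_url, "https://other.com/a/1"))  # https://other.com/a/1
print(urljoin(page_url, "//img.example.com/a/1"))  # https://img.example.com/a/1
```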

### Q5: How do I handle inconsistent time formats?

```python
from datetime import datetime

def normalize_time(time_str):
    """Normalize a timestamp string to YYYY-MM-DD HH:MM:SS."""
    # Candidate time formats
    formats = [
        "%Y年%m月%d日 %H:%M",
        "%Y-%m-%d %H:%M:%S",
        "%Y/%m/%d %H:%M",
        "%Y.%m.%d %H:%M",
    ]

    for fmt in formats:
        try:
            dt = datetime.strptime(time_str, fmt)
            return dt.strftime("%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue

    # Fall back to a default value if nothing matches
    return "1949-01-01 12:00:00"
```
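
Using the helper defined above:

```python
print(normalize_time("2024年05月01日 08:30"))  # 2024-05-01 08:30:00
print(normalize_time("2024-05-01 08:30:00"))   # 2024-05-01 08:30:00
print(normalize_time("昨天 08:30"))             # 1949-01-01 12:00:00 (falls back to the default)
```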

### Q6: How do I extract clean body text?

```python
# Remove unwanted tags
for tag in article_body.select("script, style, iframe, .ad, .comment"):
    tag.decompose()

# Extract paragraphs
content_lines = []
for p in article_body.find_all("p"):
    text = p.get_text(strip=True)
    if text and len(text) > 10:  # skip very short paragraphs
        content_lines.append(text)

content = '\n'.join(content_lines)
```

### Q7: How are duplicate articles handled?

The system deduplicates automatically:
1. By URL
2. By content hash (`content_hash`); a sketch follows below
3. By using `INSERT IGNORE` statements to avoid duplicate inserts

**Checking for duplicates**:
```sql
SELECT url, COUNT(*) as count
FROM news
GROUP BY url
HAVING count > 1;
```
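
The exact hashing scheme is defined in the project's data layer and is not shown here. As an illustration only, a `content_hash` that fits the `varchar(64)` column in the `news` table (see Appendix A) could be a SHA-256 hex digest of the normalized title and body:

```python
# Illustrative content_hash computation; the project's actual scheme lives in
# src/database/. A SHA-256 hex digest fits the varchar(64) content_hash column.
import hashlib

def content_hash(title: str, content: str) -> str:
    normalized = (title or "").strip() + "\n" + (content or "").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

print(content_hash("标题", "正文内容"))  # 64-character hex string
```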

### Q8: How do I run all crawlers in one batch?

```bash
# Run every crawler
python -m src.cli.main --all

# Limit the number of articles per crawler
python -m src.cli.main --all --max 5
```

### Q9: How do I change the maximum number of articles?

**Method 1: Command-line argument**
```bash
python -m src.cli.main example:tech --max 20
```

**Method 2: Configuration file**
Edit `config/config.yaml`:
```yaml
crawlers:
  max_articles: 20  # change the global default
```

### Q10: The crawler runs slowly — how do I speed it up?

**Optimization strategies**:
1. Reduce `max_articles`
2. Tune `selenium.scroll_pause_time` (pause between scrolls)
3. Reduce `selenium.max_scroll_times` (maximum number of scrolls)
4. Use `StaticCrawler` instead of `DynamicCrawler` where possible

**Example configuration**:
```yaml
selenium:
  scroll_pause_time: 0.5  # shorter pause
  max_scroll_times: 3     # fewer scrolls
```

---

## Appendix

### A. News Table Schema

```sql
CREATE TABLE `news` (
  `id` int NOT NULL AUTO_INCREMENT,
  `url` varchar(500) NOT NULL COMMENT '文章URL',
  `title` varchar(500) NOT NULL COMMENT '文章标题',
  `content` text COMMENT '文章内容',
  `category_id` int NOT NULL COMMENT '分类ID',
  `publish_time` varchar(50) DEFAULT NULL COMMENT '发布时间',
  `author` varchar(100) DEFAULT NULL COMMENT '作者',
  `source` varchar(50) DEFAULT NULL COMMENT '新闻源',
  `content_hash` varchar(64) DEFAULT NULL COMMENT '内容哈希',
  `created_at` datetime DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间',
  PRIMARY KEY (`id`),
  UNIQUE KEY `url` (`url`),
  KEY `content_hash` (`content_hash`),
  KEY `category_id` (`category_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='新闻表';
```
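
For reference only, a sketch of how a repository layer could insert into this table with `INSERT IGNORE`. It assumes PyMySQL, which is not listed in `requirements.txt`, so treat the driver and the connection settings as placeholders; the project's actual data access lives in `src/database/repository.py`.

```python
# Illustrative INSERT IGNORE sketch against the news table above.
# Assumes PyMySQL; connection parameters and the row values are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="root", password="secret",
                       database="news", charset="utf8mb4")

sql = """
INSERT IGNORE INTO news
    (url, title, content, category_id, publish_time, author, source, content_hash)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
"""
row = ("https://www.example.com/article/123", "示例标题", "示例正文",
       4, "2024-05-01 08:30:00", "Example科技", "Example",
       "0" * 64)  # placeholder hash; see the content_hash sketch in Q7

try:
    with conn.cursor() as cursor:
        cursor.execute(sql, row)  # a duplicate url is silently ignored
    conn.commit()
finally:
    conn.close()
```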

### B. Category Table Schema

```sql
CREATE TABLE `news_category` (
  `id` int NOT NULL AUTO_INCREMENT,
  `name` varchar(50) NOT NULL COMMENT '分类名称',
  `code` varchar(50) NOT NULL COMMENT '分类代码',
  `description` varchar(200) DEFAULT NULL COMMENT '描述',
  `sort_order` int DEFAULT 0 COMMENT '排序',
  PRIMARY KEY (`id`),
  UNIQUE KEY `code` (`code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='新闻分类表';
```

### C. Common CSS Selector Examples

```python
# Select by ID
soup.select_one("#article-title")

# Select by class
soup.select_one(".article-title")
soup.select_one("div.article-title")

# Select by attribute
soup.select_one("a[href^='/article/']")

# Combined selector
soup.select_one("div.news-list div.news-item a.title")

# Direct-child selector
soup.select_one("div.main-content > div.article > p")

# Pseudo-class selector
soup.select_one("ul.news-list li:first-child a")
```

### D. Common BeautifulSoup Methods

```python
# Get text
element.get_text(strip=True)

# Get attributes
element.get('href')
element.get('class')

# Find a single element
soup.select_one("div.title")
soup.find("div", class_="title")

# Find multiple elements
soup.select("div.news-item")
soup.find_all("div", class_="news-item")

# Parent and child elements
parent = element.parent
children = element.children
```

### E. Project Dependencies

See `requirements.txt`:
```
requests>=2.31.0
beautifulsoup4>=4.12.0
lxml>=4.9.0
selenium>=4.15.0
PyYAML>=6.0
```

---

## Summary

The core steps for adding a new crawler:

1. ✅ Analyze the target website's structure
2. ✅ Create the crawler class (inherit from `DynamicCrawler` or `StaticCrawler`)
3. ✅ Create the parser class (inherit from `BaseParser`)
4. ✅ Update the configuration file (`config.yaml`)
5. ✅ Register the crawler with the CLI (`src/cli/main.py`)
6. ✅ Test the crawler

By following this guide, you can add any number of new websites and category crawlers to the news crawler system.

---

**Document version**: 1.0
**Last updated**: 2026-01-15
**Maintainer**: News Crawler Project Team