feat: add Tencent News crawlers for the military and auto categories

This commit is contained in:
shenjianZ 2026-01-15 13:34:44 +08:00
parent 543ce5ec0a
commit 144c9e082f
17 changed files with 1681 additions and 36 deletions

.gitignore

@@ -74,3 +74,5 @@ logs/
# Temporary
*.tmp
*.temp
ml-module/models/


@@ -125,3 +125,27 @@ sources:
        name: "汽车"
        css_selector: "div.feed_card.ty-feed-card-container div.cardlist-a__list div.ty-card.ty-card-type1"
        detail_css_selector: "div.main-content"
      gov:
        url: "https://gov.sina.com.cn/"
        category_id: 7
        name: "政务"
        css_selector: "a[href]"
  tencent:
    base_url: "https://new.qq.com"
    categories:
      auto:
        url: "https://new.qq.com/auto"
        category_id: 6
        name: "汽车"
        css_selector: ""
      war:
        url: "https://news.qq.com/ch/milite"
        category_id: 5
        name: "军事"
        css_selector: ""
      war_web:
        url: "https://news.qq.com/ch/milite"
        category_id: 5
        name: "军事(网页版)"
        css_selector: "div[id='channel-feed-area']"


@@ -0,0 +1,12 @@
Quick Reference: CSS Attribute Selector Operators
| Operator | Meaning | Example |
| ---- | ------ | ----------------- |
| `=` | exactly equal to | `[id="content"]` |
| `*=` | **contains** (substring) | `[id*="content"]` |
| `^=` | starts with | `[id^="content"]` |
| `$=` | ends with | `[id$="content"]` |
| `~=` | matches a whole word | `[class~="item"]` |
| `\|=` | prefix match (value or value-…) | `[lang\|="en"]` |
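For a quick, hands-on check of these operators, the snippet below runs them through BeautifulSoup's `select` against a small made-up HTML fragment; the tags, IDs, and classes in it are purely illustrative.
```python
from bs4 import BeautifulSoup

# Tiny made-up document for exercising the attribute-selector operators above.
html = """
<div id="main-content">
  <a class="item hot" href="/article/1">A</a>
  <a class="items" href="/video/2">B</a>
  <p lang="en-US">hello</p>
</div>
"""
soup = BeautifulSoup(html, "lxml")

print(soup.select("[id='main-content']"))   # =  : exact match
print(soup.select("[id*='content']"))       # *= : substring match
print(soup.select("a[href^='/article/']"))  # ^= : starts with
print(soup.select("a[href$='/2']"))         # $= : ends with
print(soup.select("a[class~='item']"))      # ~= : matches a whole word in the class list
print(soup.select("p[lang|='en']"))         # |= : matches "en" or "en-..."
```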


@@ -0,0 +1,929 @@
# News Crawler System - Guide to Adding a New Crawler
## Table of Contents
1. [Project Architecture Overview](#project-architecture-overview)
2. [Complete Workflow for Adding a New Crawler](#complete-workflow-for-adding-a-new-crawler)
3. [Detailed Implementation Steps](#detailed-implementation-steps)
4. [Example Code](#example-code)
5. [FAQ](#faq)
---
## Project Architecture Overview
### Core Components
```
crawler-module/
├── src/
│   ├── base/                    # base-class layer
│   │   ├── crawler_base.py      # crawler base class
│   │   └── parser_base.py       # parser base class
│   ├── crawlers/                # crawler implementations
│   │   ├── netease/             # NetEase crawlers
│   │   ├── kr36/                # 36Kr crawlers
│   │   └── sina/                # Sina crawlers
│   ├── parsers/                 # parser layer
│   │   ├── netease_parser.py
│   │   ├── kr36_parser.py
│   │   └── sina_parser.py
│   ├── utils/                   # utility layer
│   │   ├── http_client.py       # HTTP client
│   │   ├── selenium_driver.py   # Selenium driver
│   │   └── logger.py            # logging utility
│   ├── database/                # data layer
│   │   ├── models.py            # data models
│   │   ├── repository.py        # data access
│   │   └── connection.py        # database connection
│   └── cli/                     # CLI entry point
│       └── main.py              # command-line interface
└── config/
    ├── config.yaml              # configuration file
    └── settings.py              # configuration loader
```
### Architecture Design Patterns
1. **Base-class inheritance**: every crawler extends `DynamicCrawler` or `StaticCrawler`
2. **Separate parsers**: the crawler collects the article URL list, while the parser parses the detail pages
3. **Configuration-driven**: crawler parameters are managed through a YAML config file
4. **Factory pattern**: the CLI creates crawler instances via dynamic imports
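To make the factory pattern concrete: a registry of `(module path, class name)` pairs plus `importlib` is enough to build crawlers by name. The sketch below is illustrative and does not reproduce the project's actual `src/cli/main.py`; the registry shape mirrors the `CRAWLER_CLASSES` dict described later in this guide.
```python
import importlib

# Hypothetical registry in the same shape as CRAWLER_CLASSES:
# site code -> category code -> (module path, crawler class name)
CRAWLER_CLASSES = {
    "example": {
        "tech": ("crawlers.example.tech", "TechCrawler"),
    },
}

def create_crawler(source: str, category: str):
    """Look up the registry and instantiate the requested crawler dynamically."""
    module_path, class_name = CRAWLER_CLASSES[source][category]
    module = importlib.import_module(module_path)   # dynamic import
    crawler_cls = getattr(module, class_name)       # resolve the class
    # Crawlers in this guide take (source, category) in their constructor.
    return crawler_cls(source, category)

# Usage: articles = create_crawler("example", "tech").crawl()
```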
---
## Complete Workflow for Adding a New Crawler
### Step Overview
```
1. Analyze the target site
2. Create the crawler class file
3. Create the parser class file
4. Update the configuration file
5. Register the crawler with the CLI
6. Test and debug
```
---
## Detailed Implementation Steps
### Step 1: Analyze the Target Site
Before writing any code, work out the following about the target site:
#### 1.1 Determine the Site Type
- **Static site**: the content is present directly in the HTML; use `StaticCrawler`
- **Dynamic site**: the content is loaded via JavaScript; use `DynamicCrawler`
#### 1.2 Identify the Key Information
- The list-page URL
- The rule for extracting article URLs (a CSS selector)
- The structure of the article detail page
- Selectors for the title, publish time, author, and body text
#### 1.3 Identify the Category Information
- Category name (e.g. Tech, Entertainment, Finance)
- Category ID (must match the database)
- Category code (e.g. tech, entertainment, finance)
---
### Step 2: Create the Crawler Class File
#### 2.1 Create the Directory Structure
Suppose you are adding a site called `example` with a `tech` category:
```bash
# Create the site directory
mkdir src/crawlers/example
# Create __init__.py
touch src/crawlers/example/__init__.py
```
#### 2.2 Write the Crawler Class
Create the file `src/crawlers/example/tech.py`:
```python
"""
Example 科技新闻爬虫
"""
from typing import List
from bs4 import BeautifulSoup
import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
from base.crawler_base import DynamicCrawler, Article
from parsers.example_parser import ExampleParser
class TechCrawler(DynamicCrawler):
"""Example 科技新闻爬虫"""
def _extract_article_urls(self, html: str) -> List[str]:
"""
从HTML中提取文章URL列表
Args:
html: 页面HTML内容
Returns:
文章URL列表
"""
soup = BeautifulSoup(html, "lxml")
urls = []
# 根据实际网站结构编写选择器
news_items = soup.select("div.news-list div.news-item")
for item in news_items:
article_link = item.select_one("a.title")
if article_link:
href = article_link.get('href')
if href:
# 处理相对路径
if href.startswith('/'):
href = f"https://www.example.com{href}"
urls.append(href)
return urls
def _fetch_articles(self, urls: List[str]) -> List[Article]:
"""
爬取文章详情
Args:
urls: 文章URL列表
Returns:
文章列表
"""
articles = []
parser = ExampleParser()
for i, url in enumerate(urls[:self.max_articles]):
try:
article = parser.parse(url)
article.category_id = self.category_id
article.source = "Example"
if not article.author:
article.author = "Example科技"
if article.is_valid():
articles.append(article)
self.logger.info(f"[{i+1}/{len(urls)}] {article.title}")
except Exception as e:
self.logger.error(f"解析文章失败: {url} - {e}")
continue
return articles
```
#### 2.3 Notes on the Crawler Class
**Choosing the base class**:
- `DynamicCrawler`: uses Selenium; suited to dynamic sites
- `StaticCrawler`: uses requests; suited to static sites
**Methods you must implement**:
- `_extract_article_urls(html)`: extract article URLs from the list page
- `_fetch_articles(urls)`: fetch the details of each article
**Available attributes**:
- `self.url`: list-page URL
- `self.category_id`: category ID
- `self.category_name`: category name
- `self.css_selector`: CSS selector to wait for while the page loads
- `self.max_articles`: maximum number of articles
- `self.http_client`: HTTP client (StaticCrawler)
- `self.driver`: Selenium driver (DynamicCrawler)
- `self.logger`: logger
---
### Step 3: Create the Parser Class File
#### 3.1 Create the Parser File
Create the file `src/parsers/example_parser.py`:
```python
"""
Example 文章解析器
"""
import re
from bs4 import BeautifulSoup
import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from base.parser_base import BaseParser
from base.crawler_base import Article
from utils.http_client import HttpClient
from utils.logger import get_logger
class ExampleParser(BaseParser):
"""Example 文章解析器"""
def __init__(self):
self.logger = get_logger(__name__)
self.http_client = HttpClient()
def parse(self, url: str) -> Article:
"""
解析文章详情页
Args:
url: 文章URL
Returns:
文章对象
"""
# 获取页面 HTML
html = self.http_client.get(url)
soup = BeautifulSoup(html, "lxml")
# 提取标题
title = None
title_tag = soup.select_one("h1.article-title")
if title_tag:
title = title_tag.get_text(strip=True)
# 提取发布时间
publish_time = None
time_tag = soup.select_one("div.article-info span.publish-time")
if time_tag:
time_text = time_tag.get_text(strip=True)
# 标准化时间格式
time_match = re.search(r"\d{4}-\d{2}-\d{2}", time_text)
if time_match:
publish_time = time_match.group()
# 提取作者
author = None
author_tag = soup.select_one("div.article-info span.author")
if author_tag:
author = author_tag.get_text(strip=True)
# 提取正文内容
content_lines = []
article_body = soup.select_one("div.article-content")
if article_body:
# 移除不需要的标签
for tag in article_body.select("script, style, iframe, .ad"):
tag.decompose()
# 提取段落
for p in article_body.find_all("p"):
text = p.get_text(strip=True)
if text:
content_lines.append(text)
content = '\n'.join(content_lines)
return Article(
url=url,
title=title,
publish_time=publish_time,
author=author,
content=content,
)
```
#### 3.2 Notes on the Parser Class
**Methods you must implement**:
- `parse(url)`: parse the article detail page and return an Article object
**Article fields**:
- `url`: article URL (required)
- `title`: article title (required)
- `content`: article body (required)
- `publish_time`: publish time (optional)
- `author`: author (optional)
- `category_id`: category ID (set by the crawler)
- `source`: news source (set by the crawler)
**Available tools**:
- `self.http_client`: HTTP client
- `self.logger`: logger
---
### Step 4: Update the Configuration File
Edit `config/config.yaml` and add the new site under the `sources` node:
```yaml
sources:
# ... 其他网站配置 ...
example:
base_url: "https://www.example.com"
categories:
tech:
url: "https://www.example.com/tech"
category_id: 4
name: "科技"
css_selector: "div.news-list" # 列表页等待加载的选择器
# 可以添加更多分类
entertainment:
url: "https://www.example.com/entertainment"
category_id: 1
name: "娱乐"
css_selector: "div.news-list"
```
#### Configuration Keys
| Key | Description | Example |
|--------|------|------|
| `base_url` | Base URL of the site | `https://www.example.com` |
| `url` | List-page URL | `https://www.example.com/tech` |
| `category_id` | Category ID (must match the database) | `4` |
| `name` | Category display name | `科技` |
| `css_selector` | Selector to wait for on the list page | `div.news-list` |
#### Category ID Reference
According to the project documentation, the category IDs are:
| ID | Category | Code |
|----|----------|------|
| 1 | Entertainment | entertainment |
| 2 | Sports | sports |
| 3 | Finance | finance |
| 4 | Tech | tech |
| 5 | Military | war |
| 6 | Auto | auto |
| 7 | Government | gov |
| 8 | Health | health |
| 9 | AI | ai |
| 10 | Education | education |
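If you want this mapping available in code, for example to sanity-check a `category_id` before putting it into the config, a plain dict mirroring the table is enough. The dict below is illustrative; it is not an existing module in the project.
```python
# Illustrative mapping of category codes to database IDs (mirrors the table above).
CATEGORY_IDS = {
    "entertainment": 1,
    "sports": 2,
    "finance": 3,
    "tech": 4,
    "war": 5,
    "auto": 6,
    "gov": 7,
    "health": 8,
    "ai": 9,
    "education": 10,
}

assert CATEGORY_IDS["tech"] == 4  # the ID used in the example config above
```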
---
### Step 5: Register the Crawler with the CLI
Edit `src/cli/main.py` and add the new crawler to the `CRAWLER_CLASSES` dict:
```python
CRAWLER_CLASSES = {
# ... 其他爬虫配置 ...
'example': {
'tech': ('crawlers.example.tech', 'TechCrawler'),
'entertainment': ('crawlers.example.entertainment', 'EntertainmentCrawler'),
# 可以添加更多分类
},
}
```
#### Registration Format
```python
'<site code>': {
    '<category code>': ('<crawler module path>', '<CrawlerClassName>'),
}
```
**Example**:
- `'example'`: site code (matches `sources.example` in the config file)
- `'tech'`: category code (matches `categories.tech` in the config file)
- `'crawlers.example.tech'`: module path (relative to the `src` directory)
- `'TechCrawler'`: crawler class name
---
### Step 6: Test and Debug
#### 6.1 Run a Single Crawler
```bash
# 进入项目目录
cd D:\tmp\write\news-classifier\crawler-module
# 运行新爬虫
python -m src.cli.main example:tech
# 限制爬取数量
python -m src.cli.main example:tech --max 3
```
#### 6.2 List All Crawlers
```bash
python -m src.cli.main --list
```
You should see the newly added crawlers in the output:
```
可用的爬虫:
- netease:entertainment
- netease:tech
- kr36:ai
- example:tech
- example:entertainment
```
#### 6.3 Check the Logs
```bash
# Log file location
type logs\crawler.log
```
#### 6.4 Debugging Tips
**Enable debug mode**:
```bash
python -m src.cli.main example:tech --debug
```
**Test the parser by hand**:
```python
from parsers.example_parser import ExampleParser
parser = ExampleParser()
article = parser.parse("https://www.example.com/article/123")
print(article.title)
print(article.content)
```
**Test the crawler by hand**:
```python
from crawlers.example.tech import TechCrawler
crawler = TechCrawler('example', 'tech')
crawler.max_articles = 3
articles = crawler.crawl()
for article in articles:
    print(article.title)
```
---
## Example Code
### Complete Example: Adding a Sina Entertainment Crawler
Suppose we want to add an entertainment-category crawler for the Sina site:
#### 1. Create the Crawler Class
File: `src/crawlers/sina/entertainment.py`
```python
"""
新浪娱乐新闻爬虫
"""
from typing import List
from bs4 import BeautifulSoup
import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
from base.crawler_base import DynamicCrawler, Article
from parsers.sina_parser import SinaEntertainmentParser
class EntertainmentCrawler(DynamicCrawler):
"""新浪娱乐新闻爬虫"""
def _extract_article_urls(self, html: str) -> List[str]:
"""从HTML中提取文章URL列表"""
soup = BeautifulSoup(html, "lxml")
urls = []
# 新浪娱乐列表页选择器
news_items = soup.select("div.feed_card.ty-feed-card-container div.cardlist-a__list div.ty-card.ty-card-type1")
for item in news_items:
article_link = item.select_one("a")
if article_link:
href = article_link.get('href')
if href:
urls.append(href)
return urls
def _fetch_articles(self, urls: List[str]) -> List[Article]:
"""爬取文章详情"""
articles = []
parser = SinaEntertainmentParser()
for i, url in enumerate(urls[:self.max_articles]):
try:
article = parser.parse(url)
article.category_id = self.category_id
article.source = "新浪"
if not article.author:
article.author = "新浪娱乐"
if article.is_valid():
articles.append(article)
self.logger.info(f"[{i+1}/{len(urls)}] {article.title}")
except Exception as e:
self.logger.error(f"解析文章失败: {url} - {e}")
continue
return articles
```
#### 2. Create the Parser Class
File: `src/parsers/sina_parser.py` (append at the end of the file)
```python
class SinaEntertainmentParser(BaseParser):
"""新浪网娱乐新闻解析器"""
def __init__(self):
self.logger = get_logger(__name__)
self.http_client = HttpClient()
def parse(self, url: str) -> Article:
"""解析新浪网文章详情页"""
html = self.http_client.get(url)
soup = BeautifulSoup(html, "lxml")
# 获取文章标题
article_title_tag = soup.select_one("div.main-content h1.main-title")
article_title = article_title_tag.get_text(strip=True) if article_title_tag else "未知标题"
# 获取文章发布时间
time_tag = soup.select_one("div.main-content div.top-bar-wrap div.date-source span.date")
publish_time = time_tag.get_text(strip=True) if time_tag else "1949-01-01 12:00:00"
# 获取文章作者
author_tag = soup.select_one("div.main-content div.top-bar-wrap div.date-source a")
author = author_tag.get_text(strip=True) if author_tag else "未知"
# 获取文章正文段落
article_div = soup.select_one("div.main-content div.article")
if not article_div:
raise ValueError("无法找到文章内容")
paragraphs = article_div.find_all('p')
content = '\n'.join(p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True))
return Article(
url=url,
title=article_title,
publish_time=publish_time,
author=author,
content=content,
)
```
#### 3. Update the Configuration File
File: `config/config.yaml`
```yaml
sina:
base_url: "https://sina.com.cn"
categories:
auto:
url: "https://auto.sina.com.cn/"
category_id: 6
name: "汽车"
css_selector: "div.feed_card.ty-feed-card-container div.cardlist-a__list div.ty-card.ty-card-type1"
detail_css_selector: "div.main-content"
gov:
url: "https://gov.sina.com.cn/"
category_id: 7
name: "政务"
css_selector: "a[href]"
entertainment: # 新增
url: "https://ent.sina.com.cn/"
category_id: 1
name: "娱乐"
css_selector: "div.feed_card.ty-feed-card-container div.cardlist-a__list"
```
#### 4. Register the Crawler with the CLI
File: `src/cli/main.py`
```python
CRAWLER_CLASSES = {
# ... 其他配置 ...
'sina': {
'auto': ('crawlers.sina.auto', 'SinaAutoCrawler'),
'gov': ('crawlers.sina.gov', 'SinaGovCrawler'),
'entertainment': ('crawlers.sina.entertainment', 'EntertainmentCrawler'), # 新增
},
}
```
#### 5. Run and Test
```bash
# Run the crawler
python -m src.cli.main sina:entertainment
# Test with a limited number of articles
python -m src.cli.main sina:entertainment --max 3
```
---
## FAQ
### Q1: How do I decide between DynamicCrawler and StaticCrawler?
**How to decide**:
1. View the page source in the browser (Ctrl+U)
2. If the source already contains the full article list and content, use `StaticCrawler`
3. If the source contains very little and the content is loaded dynamically via JavaScript, use `DynamicCrawler` (a quick programmatic check is sketched after the examples below)
**Examples**:
- NetEase News: the list page loads more items as you scroll → `DynamicCrawler`
- A simple blog site: the content sits directly in the HTML → `StaticCrawler`
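A quick programmatic version of this check: fetch the list page with plain `requests` (no JavaScript executed) and see whether the selector you plan to use already matches anything. The URL and selector below are placeholders, not real project values.
```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.example.com/tech"           # placeholder list-page URL
LIST_SELECTOR = "div.news-list div.news-item"  # placeholder selector

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
resp = requests.get(URL, headers=headers, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "lxml")
items = soup.select(LIST_SELECTOR)
print(f"matched {len(items)} items in the static HTML")
# Plenty of matches -> the raw HTML already has the list; StaticCrawler is likely enough.
# Zero matches      -> the list is injected by JavaScript; use DynamicCrawler.
```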
### Q2: How do I find the right CSS selector?
**Method 1: Browser developer tools**
1. Press F12 to open the developer tools
2. Use the element picker (Ctrl+Shift+C) and click the target element
3. In the Elements panel, right-click the element → Copy → Copy selector
**Method 2: Test with BeautifulSoup**
```python
from bs4 import BeautifulSoup
html = """<html>...</html>"""
soup = BeautifulSoup(html, "lxml")
elements = soup.select("div.news-list a")
print(len(elements))
```
### Q3: The crawler fails at runtime; how do I debug it?
**Steps**:
1. Check the log file: `logs\crawler.log`
2. Enable debug mode: `python -m src.cli.main example:tech --debug`
3. Manually verify that the URL is reachable
4. Check that the CSS selectors are correct
5. Check whether the site has anti-crawling measures (login walls, CAPTCHAs, etc.)
**Common errors**:
- `未找到新闻列表` (news list not found): wrong CSS selector
- `解析文章失败` (failed to parse article): malformed URL or the site structure has changed
- `HTTP请求失败` (HTTP request failed): network problem or anti-crawling block
### Q4: How do I handle relative URLs?
```python
href = article_link.get('href')
if href.startswith('/'):
# 相对路径,拼接基础 URL
base_url = "https://www.example.com"
href = base_url + href
elif href.startswith('http'):
# 绝对路径,直接使用
pass
else:
# 其他情况,拼接当前页面的基础路径
href = "https://www.example.com/" + href
```
### Q5: How do I handle inconsistent time formats?
```python
import re
from datetime import datetime
def normalize_time(time_str):
"""标准化时间格式"""
# 定义多种时间格式
formats = [
"%Y年%m月%d日 %H:%M",
"%Y-%m-%d %H:%M:%S",
"%Y/%m/%d %H:%M",
"%Y.%m.%d %H:%M",
]
for fmt in formats:
try:
dt = datetime.strptime(time_str, fmt)
return dt.strftime("%Y-%m-%d %H:%M:%S")
except:
continue
# 如果都不匹配,返回默认值
return "1949-01-01 12:00:00"
```
### Q6: How do I extract clean body text?
```python
# 移除不需要的标签
for tag in article_body.select("script, style, iframe, .ad, .comment"):
tag.decompose()
# 提取段落
content_lines = []
for p in article_body.find_all("p"):
text = p.get_text(strip=True)
if text and len(text) > 10: # 过滤太短的段落
content_lines.append(text)
content = '\n'.join(content_lines)
```
### Q7: How are duplicate articles handled?
The system deduplicates automatically:
1. By URL
2. By content hash (`content_hash`); a sketch of how such a hash can be computed follows this list
3. With `INSERT IGNORE` statements to avoid duplicate inserts
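As an illustration of the content-hash idea, a hash can be computed from the normalized title and body and compared against the `content_hash` column before inserting. The exact fields and hash function used by the project's repository layer are assumptions here.
```python
import hashlib

def compute_content_hash(title: str, content: str) -> str:
    """Illustrative content hash: SHA-256 over the normalized title and body.
    The real repository may hash different fields; this is an assumption."""
    normalized = (title.strip() + "\n" + content.strip()).encode("utf-8")
    return hashlib.sha256(normalized).hexdigest()  # 64 hex chars, fits varchar(64)

# An article whose hash already exists in the news table can be skipped,
# on top of the UNIQUE KEY on url and the INSERT IGNORE at the SQL level.
```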
**Checking for duplicates**:
```sql
SELECT url, COUNT(*) as count
FROM news
GROUP BY url
HAVING count > 1;
```
### Q8: How do I run all crawlers in one batch?
```bash
# Run all crawlers
python -m src.cli.main --all
# Limit the number of articles per crawler
python -m src.cli.main --all --max 5
```
### Q9: How do I change the maximum number of articles to crawl?
**Method 1: Command-line argument**
```bash
python -m src.cli.main example:tech --max 20
```
**Method 2: Configuration file**
Edit `config/config.yaml`:
```yaml
crawlers:
  max_articles: 20  # change the global default
```
### Q10: The crawler runs slowly; how can I speed it up?
**Optimization strategies**:
1. Reduce `max_articles`
2. Tune `selenium.scroll_pause_time` (pause between scrolls)
3. Reduce `selenium.max_scroll_times` (maximum number of scrolls)
4. Use `StaticCrawler` instead of `DynamicCrawler` where possible
**Example configuration**:
```yaml
selenium:
  scroll_pause_time: 0.5  # shorter pause
  max_scroll_times: 3     # fewer scrolls
```
---
## Appendix
### A. `news` Table Schema
```sql
CREATE TABLE `news` (
`id` int NOT NULL AUTO_INCREMENT,
`url` varchar(500) NOT NULL COMMENT '文章URL',
`title` varchar(500) NOT NULL COMMENT '文章标题',
`content` text COMMENT '文章内容',
`category_id` int NOT NULL COMMENT '分类ID',
`publish_time` varchar(50) DEFAULT NULL COMMENT '发布时间',
`author` varchar(100) DEFAULT NULL COMMENT '作者',
`source` varchar(50) DEFAULT NULL COMMENT '新闻源',
`content_hash` varchar(64) DEFAULT NULL COMMENT '内容哈希',
`created_at` datetime DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间',
PRIMARY KEY (`id`),
UNIQUE KEY `url` (`url`),
KEY `content_hash` (`content_hash`),
KEY `category_id` (`category_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='新闻表';
```
### B. `news_category` Table Schema
```sql
CREATE TABLE `news_category` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(50) NOT NULL COMMENT '分类名称',
`code` varchar(50) NOT NULL COMMENT '分类代码',
`description` varchar(200) DEFAULT NULL COMMENT '描述',
`sort_order` int DEFAULT 0 COMMENT '排序',
PRIMARY KEY (`id`),
UNIQUE KEY `code` (`code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='新闻分类表';
```
### C. Common CSS Selector Examples
```python
# Select by ID
soup.select_one("#article-title")
# Select by class
soup.select_one(".article-title")
soup.select_one("div.article-title")
# Select by attribute
soup.select_one("a[href^='/article/']")
# Combined selectors
soup.select_one("div.news-list div.news-item a.title")
# Direct-child (multi-level) selectors
soup.select_one("div.main-content > div.article > p")
# Pseudo-class selectors
soup.select_one("ul.news-list li:first-child a")
```
### D. Common BeautifulSoup Methods
```python
# Get text
element.get_text(strip=True)
# Get attributes
element.get('href')
element.get('class')
# Find a single element
soup.select_one("div.title")
soup.find("div", class_="title")
# Find multiple elements
soup.select("div.news-item")
soup.find_all("div", class_="news-item")
# Parent and child elements
parent = element.parent
children = element.children
```
### E. Project Dependencies
See `requirements.txt`:
```
requests>=2.31.0
beautifulsoup4>=4.12.0
lxml>=4.9.0
selenium>=4.15.0
PyYAML>=6.0
```
---
## Summary
The core steps for adding a new crawler:
1. ✅ Analyze the structure of the target site
2. ✅ Create the crawler class (inherit from `DynamicCrawler` or `StaticCrawler`)
3. ✅ Create the parser class (inherit from `BaseParser`)
4. ✅ Update the configuration file (`config.yaml`)
5. ✅ Register the crawler with the CLI (`src/cli/main.py`)
6. ✅ Test it end to end
By following this guide, you can add any number of new sites and category crawlers to the news crawler system.
---
**Document version**: 1.0
**Last updated**: 2026-01-15
**Maintainer**: News Crawler Project Team


@@ -1,33 +0,0 @@
This is the 36Kr code for crawling health-related news.
```python
import requests
from bs4 import BeautifulSoup
import re
URL = "https://www.36kr.com/search/articles/%E5%81%A5%E5%BA%B7"
TARGET_URL = "https://www.36kr.com"
headers = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
)
}
resp = requests.get(URL,headers=headers,timeout=10)
resp.raise_for_status()
resp.encoding = "utf-8"
# print(resp.text)
with open("example/example-11.html","r",encoding="utf-8") as f:
html = f.read()
# soup = BeautifulSoup(resp.text,"lxml")
soup = BeautifulSoup(html,"lxml")
li_list = soup.select("div.kr-layout div.kr-layout-main div.kr-layout-content div.kr-search-result-list ul.kr-search-result-list-main > li")
for item in li_list:
a = item.select_one("div.kr-shadow-content a")
href = TARGET_URL+ a.get("href")
print(href)
```


@@ -34,6 +34,12 @@ CRAWLER_CLASSES = {
    },
    'sina': {
        'auto': ('crawlers.sina.auto', 'SinaAutoCrawler'),
        'gov': ('crawlers.sina.gov', 'SinaGovCrawler'),
    },
    'tencent': {
        'auto': ('crawlers.tencent.auto', 'AutoCrawler'),
        'war': ('crawlers.tencent.war', 'WarCrawler'),
        'war_web': ('crawlers.tencent.war_web', 'WarWebCrawler'),
    },
}
@@ -62,6 +68,11 @@ def list_crawlers() -> List[str]:
    for category in sina_categories.keys():
        crawlers.append(f"sina:{category}")
    # 腾讯爬虫
    tencent_categories = config.get('sources.tencent.categories', {})
    for category in tencent_categories.keys():
        crawlers.append(f"tencent:{category}")
    return crawlers


@@ -0,0 +1,68 @@
"""
新浪政务新闻爬虫
"""
from typing import List
from bs4 import BeautifulSoup
import re
import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
from base.crawler_base import StaticCrawler, Article
from parsers.sina_parser import SinaAutoParser
class SinaGovCrawler(StaticCrawler):
"""新浪政务新闻爬虫"""
def _extract_article_urls(self, html: str) -> List[str]:
"""从HTML中提取文章URL列表"""
soup = BeautifulSoup(html, "lxml")
urls = []
# 遍历所有带href的a标签
for a in soup.select("a[href]"):
href = a.get("href")
if not href:
continue
# 补全 // 开头的链接
if href.startswith("//"):
href = "https:" + href
# 正则匹配真正的新闻正文页
# 格式: https://news.sina.com.cn/xxx/2024-01-14/doc-xxxxxxxxx.shtml
if re.search(r"^https://news\.sina\.com\.cn/.+/\d{4}-\d{2}-\d{2}/doc-.*\.shtml$", href):
title = a.get_text(strip=True)
if title: # 有标题,这才是新闻
urls.append(href)
# 去重
return list(dict.fromkeys(urls))
def _fetch_articles(self, urls: List[str]) -> List[Article]:
"""爬取文章详情"""
articles = []
parser = SinaAutoParser()
for i, url in enumerate(urls[:self.max_articles]):
try:
article = parser.parse(url)
article.category_id = self.category_id
article.source = "新浪"
if not article.author:
article.author = "新浪政务"
if article.is_valid():
articles.append(article)
self.logger.info(f"[{i+1}/{len(urls)}] {article.title}")
except Exception as e:
self.logger.error(f"解析文章失败: {url} - {e}")
continue
return articles


@@ -0,0 +1,2 @@
# 腾讯新闻爬虫


@@ -0,0 +1,210 @@
"""
腾讯汽车新闻爬虫
"""
import time
import random
import hashlib
from typing import List
import requests
import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
from base.crawler_base import StaticCrawler, Article
from parsers.tencent_parser import TencentParser
class AutoCrawler(StaticCrawler):
"""腾讯汽车新闻爬虫"""
def __init__(self, source: str, category: str):
super().__init__(source, category)
# 腾讯API配置
self.api_url = "https://i.news.qq.com/web_feed/getPCList"
self.channel_id = "news_news_auto" # 汽车频道
self.seen_ids = set()
self.item_count = 20 # 每页固定请求20条
def _generate_trace_id(self):
"""生成trace_id"""
random_str = str(random.random()) + str(time.time())
return "0_" + hashlib.md5(random_str.encode()).hexdigest()[:12]
def crawl(self) -> List[Article]:
"""
执行爬取任务重写基类方法以支持API接口
Returns:
文章列表
"""
self.logger.info(f"开始爬取腾讯{self.category_name}新闻")
try:
# 生成设备ID
device_id = self._generate_trace_id()
# 获取文章URL列表
article_urls = self._fetch_article_urls_from_api(device_id)
self.logger.info(f"找到 {len(article_urls)} 篇文章")
# 爬取文章详情
articles = self._fetch_articles(article_urls)
self.logger.info(f"成功爬取 {len(articles)} 篇文章")
return articles
except Exception as e:
self.logger.error(f"爬取失败: {e}", exc_info=True)
return []
finally:
self._cleanup()
def _fetch_article_urls_from_api(self, device_id: str) -> List[str]:
"""
从API获取文章URL列表
Args:
device_id: 设备ID
Returns:
文章URL列表
"""
urls = []
# 根据 max_articles 动态计算需要抓取的页数
# 每页20条向上取整
import math
max_pages = math.ceil(self.max_articles / self.item_count)
self.logger.info(f"根据 max_articles={self.max_articles},计算需要抓取 {max_pages}")
for flush_num in range(max_pages):
payload = {
"base_req": {"from": "pc"},
"forward": "1",
"qimei36": device_id,
"device_id": device_id,
"flush_num": flush_num + 1,
"channel_id": self.channel_id,
"item_count": self.item_count,
"is_local_chlid": "0"
}
try:
headers = {
"User-Agent": self.http_client.session.headers.get("User-Agent"),
"Referer": "https://new.qq.com/",
"Origin": "https://new.qq.com",
"Content-Type": "application/json"
}
response = requests.post(
self.api_url,
headers=headers,
json=payload,
timeout=10
)
if response.status_code == 200:
data = response.json()
if data.get("code") == 0 and "data" in data:
news_list = data["data"]
if not news_list:
self.logger.info("没有更多数据了")
break
# 提取URL
for item in news_list:
news_id = item.get("id")
# 去重
if news_id in self.seen_ids:
continue
self.seen_ids.add(news_id)
# 过滤视频新闻articletype == "4"
article_type = item.get("articletype")
if article_type == "4":
continue
# 提取URL
url = item.get("link_info", {}).get("url")
if url:
urls.append(url)
# 如果已经获取到足够的文章数量,提前终止
if len(urls) >= self.max_articles:
self.logger.info(f"已获取 {len(urls)} 篇文章,达到目标数量,停止抓取")
break
# 如果外层循环也需要终止
if len(urls) >= self.max_articles:
break
else:
self.logger.warning(f"接口返回错误: {data.get('message')}")
else:
self.logger.warning(f"HTTP请求失败: {response.status_code}")
except Exception as e:
self.logger.error(f"获取API数据失败: {e}")
# 延迟,避免请求过快
time.sleep(random.uniform(1, 2))
return urls
def _fetch_page(self) -> str:
"""
获取页面HTML腾讯爬虫不使用此方法
Returns:
空字符串
"""
return ""
def _extract_article_urls(self, html: str) -> List[str]:
"""
从HTML中提取文章URL列表腾讯爬虫不使用此方法
Args:
html: 页面HTML内容
Returns:
空列表
"""
return []
def _fetch_articles(self, urls: List[str]) -> List[Article]:
"""
爬取文章详情
Args:
urls: 文章URL列表
Returns:
文章列表
"""
articles = []
parser = TencentParser()
for i, url in enumerate(urls[:self.max_articles]):
try:
article = parser.parse(url)
article.category_id = self.category_id
article.source = "腾讯"
if not article.author:
article.author = "腾讯汽车"
if article.is_valid():
articles.append(article)
self.logger.info(f"[{i+1}/{len(urls)}] {article.title}")
except Exception as e:
self.logger.error(f"解析文章失败: {url} - {e}")
continue
return articles


@@ -0,0 +1,211 @@
"""
腾讯军事新闻爬虫API版
使用腾讯新闻 API 接口获取数据性能更好
"""
import time
import random
import hashlib
from typing import List
import requests
import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
from base.crawler_base import StaticCrawler, Article
from parsers.tencent_parser import TencentParser
class WarCrawler(StaticCrawler):
"""腾讯军事新闻爬虫API版"""
def __init__(self, source: str, category: str):
super().__init__(source, category)
# 腾讯API配置
self.api_url = "https://i.news.qq.com/web_feed/getPCList"
self.channel_id = "news_news_mil" # 军事频道
self.seen_ids = set()
self.item_count = 20 # 每页固定请求20条
def _generate_trace_id(self):
"""生成trace_id"""
random_str = str(random.random()) + str(time.time())
return "0_" + hashlib.md5(random_str.encode()).hexdigest()[:12]
def crawl(self) -> List[Article]:
"""
执行爬取任务重写基类方法以支持API接口
Returns:
文章列表
"""
self.logger.info(f"开始爬取腾讯{self.category_name}新闻")
try:
# 生成设备ID
device_id = self._generate_trace_id()
# 获取文章URL列表
article_urls = self._fetch_article_urls_from_api(device_id)
self.logger.info(f"找到 {len(article_urls)} 篇文章")
# 爬取文章详情
articles = self._fetch_articles(article_urls)
self.logger.info(f"成功爬取 {len(articles)} 篇文章")
return articles
except Exception as e:
self.logger.error(f"爬取失败: {e}", exc_info=True)
return []
finally:
self._cleanup()
def _fetch_article_urls_from_api(self, device_id: str) -> List[str]:
"""
从API获取文章URL列表
Args:
device_id: 设备ID
Returns:
文章URL列表
"""
urls = []
# 根据 max_articles 动态计算需要抓取的页数
# 每页20条向上取整
import math
max_pages = math.ceil(self.max_articles / self.item_count)
self.logger.info(f"根据 max_articles={self.max_articles},计算需要抓取 {max_pages}")
for flush_num in range(max_pages):
payload = {
"base_req": {"from": "pc"},
"forward": "1",
"qimei36": device_id,
"device_id": device_id,
"flush_num": flush_num + 1,
"channel_id": self.channel_id,
"item_count": self.item_count,
"is_local_chlid": "0"
}
try:
headers = {
"User-Agent": self.http_client.session.headers.get("User-Agent"),
"Referer": "https://new.qq.com/",
"Origin": "https://new.qq.com",
"Content-Type": "application/json"
}
response = requests.post(
self.api_url,
headers=headers,
json=payload,
timeout=10
)
if response.status_code == 200:
data = response.json()
if data.get("code") == 0 and "data" in data:
news_list = data["data"]
if not news_list:
self.logger.info("没有更多数据了")
break
# 提取URL
for item in news_list:
news_id = item.get("id")
# 去重
if news_id in self.seen_ids:
continue
self.seen_ids.add(news_id)
# 过滤视频新闻articletype == "4"
article_type = item.get("articletype")
if article_type == "4":
continue
# 提取URL
url = item.get("link_info", {}).get("url")
if url:
urls.append(url)
# 如果已经获取到足够的文章数量,提前终止
if len(urls) >= self.max_articles:
self.logger.info(f"已获取 {len(urls)} 篇文章,达到目标数量,停止抓取")
break
# 如果外层循环也需要终止
if len(urls) >= self.max_articles:
break
else:
self.logger.warning(f"接口返回错误: {data.get('message')}")
else:
self.logger.warning(f"HTTP请求失败: {response.status_code}")
except Exception as e:
self.logger.error(f"获取API数据失败: {e}")
# 延迟,避免请求过快
time.sleep(random.uniform(1, 2))
return urls
def _fetch_page(self) -> str:
"""
获取页面HTML腾讯爬虫不使用此方法
Returns:
空字符串
"""
return ""
def _extract_article_urls(self, html: str) -> List[str]:
"""
从HTML中提取文章URL列表腾讯爬虫不使用此方法
Args:
html: 页面HTML内容
Returns:
空列表
"""
return []
def _fetch_articles(self, urls: List[str]) -> List[Article]:
"""
爬取文章详情
Args:
urls: 文章URL列表
Returns:
文章列表
"""
articles = []
parser = TencentParser()
for i, url in enumerate(urls[:self.max_articles]):
try:
article = parser.parse(url)
article.category_id = self.category_id
article.source = "腾讯"
if not article.author:
article.author = "腾讯军事"
if article.is_valid():
articles.append(article)
self.logger.info(f"[{i+1}/{len(urls)}] {article.title}")
except Exception as e:
self.logger.error(f"解析文章失败: {url} - {e}")
continue
return articles


@@ -0,0 +1,79 @@
"""
腾讯军事新闻爬虫网页版
使用 Selenium 动态加载页面适用于网页抓取
"""
from typing import List
from bs4 import BeautifulSoup
import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
from base.crawler_base import DynamicCrawler, Article
from parsers.tencent_parser import TencentParser
class WarWebCrawler(DynamicCrawler):
"""腾讯军事新闻爬虫(网页版)"""
def _extract_article_urls(self, html: str) -> List[str]:
"""
从HTML中提取文章URL列表
Args:
html: 页面HTML内容
Returns:
文章URL列表
"""
soup = BeautifulSoup(html, "lxml")
urls = []
# 选择军事频道的文章列表
# dt-params*='article_type=0' 过滤掉视频新闻
div_list = soup.select("div[id='channel-feed-area'] div.channel-feed-list div.channel-feed-item[dt-params*='article_type=0']")
for div in div_list:
article_link = div.select_one("a.article-title")
if article_link:
href = article_link.get('href')
if href:
# 处理相对路径
if href.startswith('/'):
href = f"https://news.qq.com{href}"
urls.append(href)
return urls
def _fetch_articles(self, urls: List[str]) -> List[Article]:
"""
爬取文章详情
Args:
urls: 文章URL列表
Returns:
文章列表
"""
articles = []
parser = TencentParser()
for i, url in enumerate(urls[:self.max_articles]):
try:
article = parser.parse(url)
article.category_id = self.category_id
article.source = "腾讯"
if not article.author:
article.author = "腾讯军事"
if article.is_valid():
articles.append(article)
self.logger.info(f"[{i+1}/{len(urls)}] {article.title}")
except Exception as e:
self.logger.error(f"解析文章失败: {url} - {e}")
continue
return articles


@@ -0,0 +1,79 @@
"""
腾讯新闻文章解析器
"""
from bs4 import BeautifulSoup
import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from base.parser_base import BaseParser
from base.crawler_base import Article
from utils.http_client import HttpClient
from utils.logger import get_logger
class TencentParser(BaseParser):
"""腾讯新闻文章解析器"""
def __init__(self):
self.logger = get_logger(__name__)
self.http_client = HttpClient()
def parse(self, url: str) -> Article:
"""
解析腾讯新闻文章详情页
Args:
url: 文章URL
Returns:
文章对象
"""
# 获取页面HTML
html = self.http_client.get(url)
soup = BeautifulSoup(html, "lxml")
# 提取标题
title = None
title_tag = soup.select_one("div.content-left div.content-article > h1")
if title_tag:
title = title_tag.get_text(strip=True)
# 提取作者
author = None
author_tag = soup.select_one("div.content-left div.content-article div.article-author div.media-info a p")
if author_tag:
author = author_tag.get_text(strip=True)
# 提取发布时间
publish_time = None
time_tag = soup.select_one("div.content-left div.content-article div.article-author div.media-info p.media-meta > span")
if time_tag:
publish_time = time_tag.get_text(strip=True)
# 提取正文内容
content_lines = []
content_tag = soup.select_one("div.content-left div.content-article #article-content div.rich_media_content")
if content_tag:
# 提取段落,跳过只包含图片的段落
for p in content_tag.find_all("p"):
# 跳过只包含图片的 p
if p.find("img"):
continue
text = p.get_text(strip=True)
if text:
content_lines.append(text)
content = '\n'.join(content_lines)
return Article(
url=url,
title=title,
publish_time=publish_time,
author=author,
content=content,
)


@@ -0,0 +1,31 @@
This is working reference code for crawling the military category from the Tencent News site.
Note that the article-detail parsing code for Tencent News is shared and is not shown here; just use tencent_parser.py.
Also note that the list page needs dynamic loading: inherit from DynamicCrawler, and there is no need to override _fetch_page().
```python
import requests
from bs4 import BeautifulSoup
URL = "https://news.qq.com/ch/milite"
headers = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
)
}
resp = requests.get(URL,headers=headers,timeout=10)
resp.raise_for_status()
resp.encoding = "utf-8"
# print(resp.text)
# with open("example/example-13.html","r",encoding="utf-8") as f:
# html = f.read()
soup = BeautifulSoup(resp.text,"lxml")
# soup = BeautifulSoup(html,"lxml")
div_list = soup.select("div[id='channel-feed-area'] div.channel-feed-list div.channel-feed-item[dt-params*='article_type=0']")
for div in div_list:
href = div.select_one("a.article-title").get("href")
print(href)
```


@@ -232,8 +232,28 @@ if __name__ == '__main__':
        classifier.save_model('../../models/traditional')
        # 测试预测
-        test_title = "华为发布新款折叠屏手机"
+        test_title = "高效办成一件事”"
-        test_content = "华为今天正式发布了新一代折叠屏手机,搭载最新麒麟芯片..."
+        test_content = """
国务院办公厅12日对外发布高效办成一件事2026年度第一批重点事项清单并印发通知指出统筹线上和线下政务服务渠道因地制宜推进新一批重点事项落实落细持续优化已推出重点事项服务推动政务服务从能办好办易办转变
此次清单包含13项内容既涵盖科技型企业创新政策扶持知识产权保护举办体育赛事活动等赋能发展的大事也包含育儿补贴申领灵活就业参保等贴近生活的小事更有外籍来华人员办理电话卡海船开航等服务开放的新事可以说清单聚焦的都是市场主体和群众关切和我们每一个人都息息相关
去年7月国办印发关于健全高效办成一件事重点事项常态化推进机制的意见要求加强重点事项清单管理推动重点事项常态化实施拓展高效办成一件事应用领域
让政务服务好办易办背后涉及的改革并不轻松一件事往往横跨多个部门多个系统牵涉数据共享流程重塑和权责厘清正因如此高效办成一件事并不是简单做减法而是对行政体系协同能力的一次系统性考验能否打破条块分割能否让规则围绕真实需求变动能否把制度设计真正落到操作层面
在这个过程中很多地方都做出了积极努力比如北京市推进个人创业一件事按照实际诉求进行流程优化再造最终实现一表申请一次办结还能用一张表单同时办理创业担保贷款和一次性创业补贴两项业务上海由市政府办公厅统筹各类一件事部门形成多个工作小组推动一件事落地见实效
这些改革从群众和企业的呼声出发直面办事过程中的堵点与不便通过打破部门壁垒推动协同和数据共享将分散环节重新整合让服务真正落到具体场景中体现出可感知的效率提升也让办好易办一件事逐渐从制度设计走向实际运行
截至目前国家层面高效办成一件事重点事项清单已推出五批共55项便民惠企清单持续扩容政务服务改革红利持续释放流程被压缩材料被整合时间成本不断降低
这一改革最终通向的是办事效率和群众满意度的持续提升对企业而言是政策兑现和经营活动更可预期对个人而言是办事体验更稳定更省心每一次分散的便利叠加起来改革的价值就不再停留在抽象的表述中而是转化为日常生活和经济运行中看得见摸得着的获得感也是在为经济社会的发展从细微处巩固基础积蓄动能
政务服务改革永远在路上从更长的时间维度看高效办成一件事所指向的不只是推动某一件某一批事项的完成而是要带动政府治理能力整体提升助力高质量发展
随着社会需求变化新业态不断出现新的堵点断点还可能不断出现改革也必须随之滚动推进动态优化只有保持这种不断拆解问题不断修补细节的耐心政务服务才能持续进阶在回应现实需求中走向更加成熟与稳健
"""
        result = classifier.predict(test_title, test_content)
        print("\n测试预测结果:", result)
    else:


@@ -154,7 +154,7 @@ if __name__ == '__main__':
    loader = DataLoader(db_url)
    # 从数据库加载数据并保存到本地
-    data = loader.update_local_data(limit=800)
+    data = loader.update_local_data(limit=2000)
    if data is not None:
        print(f"成功加载数据,共 {len(data)} 条记录")