feat: 新增爬虫统计功能、多爬虫支持及腾讯财经API爬虫

主要更新:

1. 新增统计展示功能
   - 添加 CrawlerStats 数据类,记录爬取/插入/重复数量
   - run_crawler() 返回详细统计信息而非简单布尔值
   - 新增 display_stats() 函数,支持单个/汇总两种展示格式
   - 自动按数据源分组展示统计信息

2. CLI支持多爬虫运行
   - 修改 crawler 参数支持多个值(nargs='*')
   - 支持三种运行方式:单个爬虫、多个爬虫、--all全部爬虫
   - 自动识别单个/多个场景并切换展示格式

3. 新增腾讯财经API爬虫
   - 创建 src/crawlers/tencent/finance.py
   - 使用腾讯新闻 API 接口,性能优于Selenium爬虫
   - channel_id: news_news_finance
   - 支持 API 分页和去重

4. 更新配置和文档
   - config.yaml 新增腾讯财经分类配置(category_id: 3)
   - 更新《添加新爬虫指南》v2.0,包含API爬虫示例和统计功能说明
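附:一段示意用法(基于本次提交对 src/cli/main.py 的改动;导入路径为假设,以实际工程结构为准):

```python
# 示意run_crawler() 现在返回 CrawlerStats 对象,而非布尔值
from cli.main import run_crawler, display_stats  # 假设的导入路径,以实际工程结构为准

stats = run_crawler("tencent", "finance", max_articles=5)
print(stats.crawler_name, stats.success, stats.inserted_count)

# display_stats() 既接受单个 CrawlerStats也接受列表自动切换展示格式
display_stats(stats)
display_stats([stats])
```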

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
shenjianZ 2026-01-17 09:02:41 +08:00
parent 5eb92268ec
commit 4cb71256e6
4 changed files with 654 additions and 52 deletions

config/config.yaml

@@ -164,3 +164,19 @@ sources:
        category_id: 4
        name: "科技"
        css_selector: ""
      entertainment:
        url: "https://news.qq.com/ch/ent"
        category_id: 1
        name: "娱乐"
        css_selector: ""
      finance:
        url: "https://news.qq.com/ch/finance"
        category_id: 3
        name: "财经"
        css_selector: ""
      ai:
        url: "https://i.news.qq.com/gw/pc_search/result"
        category_id: 9
        name: "AI"
        css_selector: ""
        # 注意:此分类通过搜索接口获取数据,而非正常的分类列表接口

《添加新爬虫指南》

@@ -49,6 +49,20 @@ crawler-module/

3. **配置驱动模式**: 通过 YAML 配置文件管理爬虫参数
4. **工厂模式**: CLI 通过动态导入创建爬虫实例
### 爬虫类型
| 类型 | 基类 | 适用场景 | 依赖 | 示例 |
|------|------|----------|------|------|
| **API爬虫** | `StaticCrawler` | 有数据API接口 | requests | 腾讯科技/财经 |
| **静态爬虫** | `StaticCrawler` | HTML直接渲染 | requests | 简单网站 |
| **动态爬虫** | `DynamicCrawler` | JS动态加载 | Selenium | 网易/36氪 |
### 新增功能v2.0

- **多爬虫运行**: 支持同时运行多个指定爬虫
- **统计展示**: 自动展示爬取数、插入数、重复数
- **分组统计**: 按数据源分组展示汇总信息

---

## 添加新爬虫的完整流程
@@ -77,7 +91,8 @@ crawler-module/

在编写代码之前,需要分析目标网站的以下信息:

#### 1.1 确定网站类型和爬虫方式

- **API接口**: 网站提供数据API优先选择性能最好可参考下面的示意代码快速验证
- **静态网站**: 内容直接在 HTML 中,使用 `StaticCrawler`
- **动态网站**: 内容通过 JavaScript 加载,使用 `DynamicCrawler`
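判断时可以先在浏览器开发者工具的 Network 面板中寻找返回 JSON 的请求,再用下面这类示意代码快速验证(仅为演示思路URL 取自本文的腾讯接口,请求参数为简化后的假设):

```python
# 示意:探测候选接口是否直接返回 JSON若是优先走 API 爬虫路线
import requests

candidate = "https://i.news.qq.com/web_feed/getPCList"  # 本文示例中的腾讯接口
resp = requests.post(candidate, json={"base_req": {"from": "pc"}}, timeout=10)  # 简化的假设参数
print(resp.status_code, resp.headers.get("Content-Type"))

# 能拿到 application/json 且含业务数据 -> API 爬虫StaticCrawler + 重写 crawl()
# 只能拿到已渲染好的 HTML            -> 静态爬虫StaticCrawler
# HTML 里缺正文、依赖 JS 渲染        -> 动态爬虫DynamicCrawler + Selenium
```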
@@ -195,8 +210,26 @@ class TechCrawler(DynamicCrawler):

#### 2.3 爬虫类说明

**继承基类选择**:
- `DynamicCrawler`: 使用 Selenium适合动态网站需滚动加载
- `StaticCrawler`: 使用 requests适合静态网站或API接口
**三种实现方式**:
1. **API爬虫**(推荐,性能最好)
- 继承 `StaticCrawler`
- 重写 `crawl()` 方法
- 直接调用API接口获取数据
- 参考:`src/crawlers/tencent/tech.py`
2. **静态爬虫**
- 继承 `StaticCrawler`
- 实现 `_extract_article_urls()` 和 `_fetch_articles()`
- 使用 BeautifulSoup 解析HTML可参考列表后的最小骨架示例
3. **动态爬虫**
- 继承 `DynamicCrawler`
- 实现 `_extract_article_urls()` 和 `_fetch_articles()`
- 使用 Selenium 自动化浏览器
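下面给出方式2的最小骨架仅为示意CSS 选择器与 `self.parser` 等成员为假设,实际命名以项目基类和解析器为准):

```python
# 示意:静态爬虫骨架,按约定实现两个方法即可接入现有流程
from typing import List

from bs4 import BeautifulSoup

from base.crawler_base import StaticCrawler, Article


class ExampleStaticCrawler(StaticCrawler):
    """示意用静态爬虫:列表页提取 URL再逐篇解析"""

    def _extract_article_urls(self, html: str) -> List[str]:
        """从列表页 HTML 提取文章 URL"""
        soup = BeautifulSoup(html, "html.parser")
        links = soup.select("a.article-link")  # 假设的选择器,实际应取自 config.yaml 的 css_selector
        return [a["href"] for a in links if a.get("href")]

    def _fetch_articles(self, urls: List[str]) -> List[Article]:
        """逐篇抓取并解析文章"""
        articles: List[Article] = []
        for url in urls[:self.max_articles]:
            try:
                article = self.parser.parse(url)  # 假设解析器实例已在 __init__ 中创建
                if article.is_valid():
                    articles.append(article)
            except Exception as e:
                self.logger.error(f"解析文章失败: {url} - {e}")
        return articles
```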
**必须实现的方法**:
- `_extract_article_urls(html)`: 从列表页提取文章 URL
@@ -414,26 +447,81 @@ CRAWLER_CLASSES = {

### 步骤 6: 测试和调试

#### 6.1 运行爬虫
**方式1单个爬虫**
```bash
# 进入项目目录
cd D:\tmp\write\news-classifier\crawler-module
# 运行新爬虫
python -m src.cli.main example:tech
# 限制爬取数量
python -m src.cli.main example:tech --max 3
```

**方式2多个爬虫**v2.0新增)
```bash
python -m src.cli.main example:tech example:finance netease:tech --max 5
```
**方式3所有爬虫**
```bash
python -m src.cli.main --all
python -m src.cli.main --all --max 5
```
#### 6.2 统计信息展示v2.0新增)
**单个爬虫输出:**
```
============================================================
爬虫统计: example:tech
============================================================
状态: [成功]
爬取数量: 10 篇
插入数量: 8 条
重复数量: 2 条
============================================================
```
**多个爬虫输出:**
```
================================================================================
爬虫任务汇总统计
================================================================================
【EXAMPLE】
--------------------------------------------------------------------------------
分类 状态 爬取 插入 重复
--------------------------------------------------------------------------------
tech [成功] 10 8 2
finance [成功] 10 9 1
--------------------------------------------------------------------------------
小计 2/2 成功 20 17 3
【NETEASE】
--------------------------------------------------------------------------------
分类 状态 爬取 插入 重复
--------------------------------------------------------------------------------
tech [成功] 10 10 0
--------------------------------------------------------------------------------
小计 1/1 成功 10 10 0
================================================================================
总计统计
================================================================================
总爬虫数: 3
成功数: 3
失败数: 0
总爬取: 30 篇
总插入: 27 条
总重复: 3 条
================================================================================
```
#### 6.3 列出所有爬虫
```bash
python -m src.cli.main --list
```

输出:
```
可用的爬虫:
- netease:entertainment
@@ -443,11 +531,13 @@ python -m src.cli.main --list
- example:entertainment
```

#### 6.4 查看日志

```bash
# 日志文件位置
type logs\crawler.log

# Linux/Mac:
tail -f logs/crawler.log
```
#### 6.5 调试技巧
@@ -483,6 +573,126 @@ for article in articles:

## 示例代码
### 示例1API爬虫腾讯财经
**适用场景**: 网站提供数据API接口性能最好
**爬虫类**: `src/crawlers/tencent/finance.py`
```python
"""
腾讯财经新闻爬虫API版
"""
import time
import random
import hashlib
from typing import List

import requests

from base.crawler_base import StaticCrawler, Article
from parsers.tencent_parser import TencentParser


class FinanceCrawler(StaticCrawler):
    """腾讯财经新闻爬虫API版"""

    def __init__(self, source: str, category: str):
        super().__init__(source, category)
        # 腾讯API配置
        self.api_url = "https://i.news.qq.com/web_feed/getPCList"
        self.channel_id = "news_news_finance"  # 财经频道ID
        self.seen_ids = set()
        self.item_count = 20  # 每页固定请求20条

    def crawl(self) -> List[Article]:
        """执行爬取任务重写基类方法以支持API接口"""
        self.logger.info(f"开始爬取腾讯{self.category_name}新闻")
        try:
            device_id = self._generate_trace_id()
            article_urls = self._fetch_article_urls_from_api(device_id)
            self.logger.info(f"找到 {len(article_urls)} 篇文章")

            articles = self._fetch_articles(article_urls)
            self.logger.info(f"成功爬取 {len(articles)} 篇文章")
            return articles
        except Exception as e:
            self.logger.error(f"爬取失败: {e}", exc_info=True)
            return []
        finally:
            self._cleanup()

    def _fetch_article_urls_from_api(self, device_id: str) -> List[str]:
        """从API获取文章URL列表"""
        urls = []
        import math
        max_pages = math.ceil(self.max_articles / self.item_count)

        for flush_num in range(max_pages):
            payload = {
                "base_req": {"from": "pc"},
                "forward": "1",
                "qimei36": device_id,
                "device_id": device_id,
                "flush_num": flush_num + 1,
                "channel_id": self.channel_id,
                "item_count": self.item_count,
                "is_local_chlid": "0"
            }
            try:
                response = requests.post(self.api_url, json=payload, timeout=10)
                if response.status_code == 200:
                    data = response.json()
                    if data.get("code") == 0 and "data" in data:
                        news_list = data["data"]
                        for item in news_list:
                            news_id = item.get("id")
                            if news_id not in self.seen_ids:
                                self.seen_ids.add(news_id)
                                url = item.get("link_info", {}).get("url")
                                if url:
                                    urls.append(url)
                            if len(urls) >= self.max_articles:
                                break
                if len(urls) >= self.max_articles:
                    break
            except Exception as e:
                self.logger.error(f"获取API数据失败: {e}")

            time.sleep(random.uniform(1, 2))

        return urls

    def _generate_trace_id(self):
        """生成trace_id"""
        random_str = str(random.random()) + str(time.time())
        return "0_" + hashlib.md5(random_str.encode()).hexdigest()[:12]
```
**配置**: `config/config.yaml`
```yaml
tencent:
  categories:
    finance:
      url: "https://news.qq.com/ch/finance"
      category_id: 3
      name: "财经"
      css_selector: ""
```
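这些配置字段的读取方式可参考下面的示意代码(仅为演示,假设 `tencent` 位于顶层 `sources:` 之下,与上文 config.yaml 的 diff 上下文一致;实际加载由配置模块完成):

```python
# 示意:读取 config.yaml 中的腾讯财经分类配置(字段与上文一致)
import yaml

with open("config/config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

finance_cfg = config["sources"]["tencent"]["categories"]["finance"]
print(finance_cfg["category_id"], finance_cfg["name"])  # 预期输出: 3 财经
```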
**运行**:
```bash
python -m src.cli.main tencent:finance --max 5
```
---
### 示例2动态爬虫网易科技
### 完整示例:添加新浪娱乐爬虫

假设我们要为新浪网站添加娱乐分类爬虫:
@@ -772,16 +982,25 @@ GROUP BY url
HAVING count > 1;
```

### Q8: 如何批量运行爬虫?
**v2.0支持三种运行方式:**
```bash
# 1. 单个爬虫
python -m src.cli.main tencent:finance --max 5

# 2. 指定多个爬虫(跨数据源)
python -m src.cli.main tencent:finance tencent:tech netease:tech --max 3

# 3. 所有爬虫
python -m src.cli.main --all --max 5
```
**统计功能自动启用:**
- 单个爬虫:显示简明统计
- 多个爬虫:显示按数据源分组的汇总统计
### Q9: 如何修改最大爬取数量?

**方法 1: 命令行参数**
@@ -913,17 +1132,21 @@ PyYAML>=6.0

添加新爬虫的核心步骤:

1. ✅ 分析目标网站结构确定使用API/静态/动态方式)
2. ✅ 创建爬虫类(继承 `DynamicCrawler` 或 `StaticCrawler`
3. ✅ 创建解析器类(继承 `BaseParser`
4. ✅ 更新配置文件(`config.yaml`
5. ✅ 注册爬虫到 CLI`src/cli/main.py`
6. ✅ 测试运行(单个/多个/全部)

遵循本指南,您可以为新闻爬虫系统添加任意数量的新网站和分类爬虫。
---

**文档版本**: 2.0
**最后更新**: 2026-01-17
**维护者**: 新闻爬虫项目组
**版本更新说明:**
- v2.0: 新增API爬虫类型、多爬虫支持、统计展示功能
- v1.0: 初始版本

src/cli/main.py

@@ -5,7 +5,9 @@
import argparse
import sys
from typing import List, Optional, Union
from dataclasses import dataclass
from collections import defaultdict
import os

sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
@@ -43,10 +45,29 @@ CRAWLER_CLASSES = {
        'health': ('crawlers.tencent.health', 'HealthCrawler'),
        'house': ('crawlers.tencent.house', 'HouseCrawler'),
        'tech': ('crawlers.tencent.tech', 'TechCrawler'),
        'entertainment': ('crawlers.tencent.entertainment', 'EntertainmentCrawler'),
        'finance': ('crawlers.tencent.finance', 'FinanceCrawler'),
        'ai': ('crawlers.tencent.ai', 'SearchAICrawler'),
    },
}


@dataclass
class CrawlerStats:
    """单个爬虫的统计信息"""
    source: str
    category: str
    success: bool
    crawled_count: int    # 爬取的文章数
    inserted_count: int   # 插入成功的文章数
    duplicate_count: int  # 重复的文章数
    error: Optional[str] = None

    @property
    def crawler_name(self) -> str:
        return f"{self.source}:{self.category}"


def init_logging():
    """初始化日志系统"""
    Logger.get_logger("news-crawler")
@@ -104,7 +125,7 @@ def get_crawler_class(source: str, category: str):
    return getattr(module, class_name)


def run_crawler(source: str, category: str, max_articles: int = None) -> CrawlerStats:
    """
    运行指定爬虫
@@ -114,7 +135,7 @@ def run_crawler(source: str, category: str, max_articles: int = None) -> bool:
        max_articles: 最大文章数

    Returns:
        CrawlerStats: 统计信息对象
    """
    logger = get_logger(__name__)
@@ -128,13 +149,22 @@ def run_crawler(source: str, category: str, max_articles: int = None) -> bool:
        # 覆盖最大文章数
        if max_articles:
            crawler.max_articles = max_articles

        articles = crawler.crawl()
        crawled_count = len(articles)

        if not articles:
            logger.warning(f"未爬取到任何文章")
            return CrawlerStats(
                source=source,
                category=category,
                success=False,
                crawled_count=0,
                inserted_count=0,
                duplicate_count=0,
                error="未爬取到任何文章"
            )

        # 转换为数据模型
        news_list = [
@@ -151,16 +181,119 @@ def run_crawler(source: str, category: str, max_articles: int = None) -> bool:
            if article.is_valid()
        ]
        valid_count = len(news_list)

        # 保存到数据库
        repository = NewsRepository()
        inserted_count = repository.save_news(news_list)
        duplicate_count = valid_count - inserted_count
        success = inserted_count > 0

        logger.info(f"任务完成,爬取 {crawled_count} 篇,保存 {inserted_count} 条")
        return CrawlerStats(
            source=source,
            category=category,
            success=success,
            crawled_count=crawled_count,
            inserted_count=inserted_count,
            duplicate_count=duplicate_count
        )

    except Exception as e:
        logger.error(f"运行爬虫失败: {e}", exc_info=True)
        return CrawlerStats(
            source=source,
            category=category,
            success=False,
            crawled_count=0,
            inserted_count=0,
            duplicate_count=0,
            error=str(e)
        )


def display_stats(stats: Union[CrawlerStats, List[CrawlerStats]]):
    """
    展示统计信息

    Args:
        stats: 单个统计对象或统计列表
    """
    if isinstance(stats, CrawlerStats):
        # 单个爬虫的统计信息
        print("\n" + "="*60)
        print(f"爬虫统计: {stats.crawler_name}")
        print("="*60)
        print(f"状态: {'[成功]' if stats.success else '[失败]'}")
        print(f"爬取数量: {stats.crawled_count} 篇")
        print(f"插入数量: {stats.inserted_count} 条")
        if stats.duplicate_count > 0:
            print(f"重复数量: {stats.duplicate_count} 条")
        if stats.error:
            print(f"错误信息: {stats.error}")
        print("="*60 + "\n")

    elif isinstance(stats, list) and len(stats) > 0:
        # 多个爬虫的统计信息(汇总)
        print("\n" + "="*80)
        print("爬虫任务汇总统计")
        print("="*80)

        # 按数据源分组
        grouped = defaultdict(list)
        for stat in stats:
            grouped[stat.source].append(stat)

        # 按数据源展示
        for source, source_stats in grouped.items():
            print(f"\n{source.upper()}")
            print("-"*80)
            # 表头
            print(f"{'分类':<12} {'状态':<8} {'爬取':<8} {'插入':<8} {'重复':<8}")
            print("-"*80)

            total_crawled = 0
            total_inserted = 0
            total_duplicate = 0
            success_count = 0

            for stat in source_stats:
                status = "[成功]" if stat.success else "[失败]"
                print(f"{stat.category:<12} {status:<8} "
                      f"{stat.crawled_count:<8} {stat.inserted_count:<8} "
                      f"{stat.duplicate_count:<8}")
                total_crawled += stat.crawled_count
                total_inserted += stat.inserted_count
                total_duplicate += stat.duplicate_count
                if stat.success:
                    success_count += 1

            # 汇总行
            print("-"*80)
            print(f"{'小计':<12} {success_count}/{len(source_stats)} 成功 "
                  f" {total_crawled:<8} {total_inserted:<8} {total_duplicate:<8}")

        # 总计
        print("\n" + "="*80)
        print("总计统计")
        print("="*80)
        total_crawled_all = sum(s.crawled_count for s in stats)
        total_inserted_all = sum(s.inserted_count for s in stats)
        total_duplicate_all = sum(s.duplicate_count for s in stats)
        success_count_all = sum(1 for s in stats if s.success)
        print(f"总爬虫数: {len(stats)}")
        print(f"成功数: {success_count_all}")
        print(f"失败数: {len(stats) - success_count_all}")
        print(f"总爬取: {total_crawled_all} 篇")
        print(f"总插入: {total_inserted_all} 条")
        print(f"总重复: {total_duplicate_all} 条")
        print("="*80 + "\n")


def main():
@@ -170,18 +303,18 @@ def main():
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
示例:
  %(prog)s --list                               # 列出所有可用爬虫
  %(prog)s netease:tech                         # 爬取单个网易科技新闻
  %(prog)s netease:tech kr36:ai tencent:tech    # 爬取多个指定的爬虫
  %(prog)s --all                                # 运行所有爬虫
  %(prog)s netease:tech --max 5                 # 爬取5篇网易科技新闻
        """
    )

    parser.add_argument(
        'crawler',
        nargs='*',
        help='爬虫名称 (格式: source:category),可指定多个'
    )

    parser.add_argument(
@@ -225,32 +358,51 @@ def main():
    if args.all:
        logger.info("运行所有爬虫...")
        crawlers = list_crawlers()
        all_stats = []

        for crawler_name in crawlers:
            source, category = crawler_name.split(':')
            logger.info(f"正在运行 {crawler_name}...")
            stats = run_crawler(source, category, args.max)
            all_stats.append(stats)

        # 展示汇总统计
        display_stats(all_stats)

        # 返回码全部成功返回0否则返回1
        return 0 if all(s.success for s in all_stats) else 1

    # 处理爬虫列表(支持单个或多个)
    if not args.crawler:
        parser.print_help()
        return 1

    # 验证爬虫格式
    crawler_list = []
    for crawler_name in args.crawler:
        try:
            source, category = crawler_name.split(':')
            crawler_list.append((source, category))
        except ValueError:
            logger.error(f"爬虫名称格式错误: '{crawler_name}',应为 'source:category'")
            return 1

    # 运行爬虫并收集统计
    all_stats = []
    for source, category in crawler_list:
        crawler_name = f"{source}:{category}"
        logger.info(f"正在运行 {crawler_name}...")
        stats = run_crawler(source, category, args.max)
        all_stats.append(stats)

    # 展示统计(单个或汇总)
    if len(all_stats) == 1:
        display_stats(all_stats[0])
    else:
        display_stats(all_stats)

    # 返回码全部成功返回0否则返回1
    return 0 if all(s.success for s in all_stats) else 1


if __name__ == "__main__":

src/crawlers/tencent/finance.py

@@ -0,0 +1,211 @@
"""
腾讯财经新闻爬虫API版
使用腾讯新闻 API 接口获取数据性能更好
"""
import time
import random
import hashlib
from typing import List
import requests
import sys
import os

sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))

from base.crawler_base import StaticCrawler, Article
from parsers.tencent_parser import TencentParser


class FinanceCrawler(StaticCrawler):
    """腾讯财经新闻爬虫API版"""

    def __init__(self, source: str, category: str):
        super().__init__(source, category)
        # 腾讯API配置
        self.api_url = "https://i.news.qq.com/web_feed/getPCList"
        self.channel_id = "news_news_finance"  # 财经频道
        self.seen_ids = set()
        self.item_count = 20  # 每页固定请求20条

    def _generate_trace_id(self):
        """生成trace_id"""
        random_str = str(random.random()) + str(time.time())
        return "0_" + hashlib.md5(random_str.encode()).hexdigest()[:12]

    def crawl(self) -> List[Article]:
        """
        执行爬取任务重写基类方法以支持API接口

        Returns:
            文章列表
        """
        self.logger.info(f"开始爬取腾讯{self.category_name}新闻")

        try:
            # 生成设备ID
            device_id = self._generate_trace_id()

            # 获取文章URL列表
            article_urls = self._fetch_article_urls_from_api(device_id)
            self.logger.info(f"找到 {len(article_urls)} 篇文章")

            # 爬取文章详情
            articles = self._fetch_articles(article_urls)
            self.logger.info(f"成功爬取 {len(articles)} 篇文章")

            return articles
        except Exception as e:
            self.logger.error(f"爬取失败: {e}", exc_info=True)
            return []
        finally:
            self._cleanup()

    def _fetch_article_urls_from_api(self, device_id: str) -> List[str]:
        """
        从API获取文章URL列表

        Args:
            device_id: 设备ID

        Returns:
            文章URL列表
        """
        urls = []

        # 根据 max_articles 动态计算需要抓取的页数
        # 每页20条向上取整
        import math
        max_pages = math.ceil(self.max_articles / self.item_count)
        self.logger.info(f"根据 max_articles={self.max_articles},计算需要抓取 {max_pages} 页")

        for flush_num in range(max_pages):
            payload = {
                "base_req": {"from": "pc"},
                "forward": "1",
                "qimei36": device_id,
                "device_id": device_id,
                "flush_num": flush_num + 1,
                "channel_id": self.channel_id,
                "item_count": self.item_count,
                "is_local_chlid": "0"
            }

            try:
                headers = {
                    "User-Agent": self.http_client.session.headers.get("User-Agent"),
                    "Referer": "https://new.qq.com/",
                    "Origin": "https://new.qq.com",
                    "Content-Type": "application/json"
                }

                response = requests.post(
                    self.api_url,
                    headers=headers,
                    json=payload,
                    timeout=10
                )

                if response.status_code == 200:
                    data = response.json()

                    if data.get("code") == 0 and "data" in data:
                        news_list = data["data"]

                        if not news_list:
                            self.logger.info("没有更多数据了")
                            break

                        # 提取URL
                        for item in news_list:
                            news_id = item.get("id")

                            # 去重
                            if news_id in self.seen_ids:
                                continue
                            self.seen_ids.add(news_id)

                            # 过滤视频新闻articletype == "4"
                            article_type = item.get("articletype")
                            if article_type == "4":
                                continue

                            # 提取URL
                            url = item.get("link_info", {}).get("url")
                            if url:
                                urls.append(url)

                            # 如果已经获取到足够的文章数量,提前终止
                            if len(urls) >= self.max_articles:
                                self.logger.info(f"已获取 {len(urls)} 篇文章,达到目标数量,停止抓取")
                                break

                        # 如果外层循环也需要终止
                        if len(urls) >= self.max_articles:
                            break
                    else:
                        self.logger.warning(f"接口返回错误: {data.get('message')}")
                else:
                    self.logger.warning(f"HTTP请求失败: {response.status_code}")

            except Exception as e:
                self.logger.error(f"获取API数据失败: {e}")

            # 延迟,避免请求过快
            time.sleep(random.uniform(1, 2))

        return urls

    def _fetch_page(self) -> str:
        """
        获取页面HTML腾讯爬虫不使用此方法

        Returns:
            空字符串
        """
        return ""

    def _extract_article_urls(self, html: str) -> List[str]:
        """
        从HTML中提取文章URL列表腾讯爬虫不使用此方法

        Args:
            html: 页面HTML内容

        Returns:
            空列表
        """
        return []

    def _fetch_articles(self, urls: List[str]) -> List[Article]:
        """
        爬取文章详情

        Args:
            urls: 文章URL列表

        Returns:
            文章列表
        """
        articles = []
        parser = TencentParser()

        for i, url in enumerate(urls[:self.max_articles]):
            try:
                article = parser.parse(url)
                article.category_id = self.category_id
                article.source = "腾讯"

                if not article.author:
                    article.author = "腾讯财经"

                if article.is_valid():
                    articles.append(article)
                    self.logger.info(f"[{i+1}/{len(urls)}] {article.title}")
            except Exception as e:
                self.logger.error(f"解析文章失败: {url} - {e}")
                continue

        return articles