Optimize docs documentation
parent f05234847b
commit 73fee7d713
@@ -10,7 +10,9 @@
       "Bash(Select-Object:*)",
       "Bash(powershell:*)",
       "Bash(python:*)",
-      "Bash(move:*)"
+      "Bash(move:*)",
+      "Bash(srcdir:*)",
+      "Bash(cd:*)"
     ]
   }
 }
@@ -1,8 +1,8 @@
 spring:
   datasource:
-    url: jdbc:mysql://${DB_HOST:43.143.145.172}:${DB_PORT:3306}/${DB_NAME:db_spring_1}?useSSL=false&serverTimezone=UTC&characterEncoding=UTF-8
+    url: jdbc:mysql://${DB_HOST:localhost}:${DB_PORT:3306}/${DB_NAME:news}?useSSL=false&serverTimezone=UTC&characterEncoding=UTF-8
     username: ${DB_USER:root}
-    password: ${DB_PASS:kyff145972}
+    password: ${DB_PASS:root}
     driver-class-name: com.mysql.cj.jdbc.Driver
 
   jpa:
1206  docs/模块开发任务清单.md  (file diff suppressed because it is too large)
287   docs/爬虫模块说明.md
@@ -1,287 +0,0 @@
# News Crawler Module Guide

## Module Overview

The crawler module automatically fetches news from major news sites and stores it in the database after cleaning, deduplication, and classification.

## File Structure

```
backend/src/main/java/com/newsclassifier/
├── config/
│   ├── CrawlerConfig.java           # Crawler configuration (RestTemplate)
│   ├── CrawlerProperties.java       # Crawler properties binding
│   └── AsyncConfig.java             # Async task configuration
├── controller/
│   └── CrawlerController.java       # Crawler API controller
├── service/
│   ├── CrawlerService.java          # Crawler service interface
│   └── impl/
│       └── CrawlerServiceImpl.java  # Crawler service implementation
├── crawler/
│   ├── HtmlParser.java              # HTML parser
│   ├── DataCleaner.java             # Data cleaning utility
│   └── DuplicationService.java      # Deduplication service
├── scheduler/
│   └── CrawlerScheduledTasks.java   # Scheduled tasks
└── dto/
    ├── CrawledNewsDTO.java          # Crawled news data
    └── CrawlerReportDTO.java        # Crawler report
```

## Core Features

### 1. HTML Parser (HtmlParser)

**Purpose**: uses Jsoup to parse HTML and extract the news title, content, link, and other fields.

**Methods**:
- `parseNewsList()`: parses a news list page
- `parseNewsDetail()`: parses a news detail page
- `parseDateTime()`: parses dates and times in a variety of formats

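The parser source isn't reproduced here; below is a minimal sketch of what a Jsoup-based `parseNewsList()` and `parseDateTime()` might look like. The method signatures, selector parameters, and date patterns are illustrative assumptions, not the actual implementation.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HtmlParserSketch {

    // Example patterns only; the real parser may know more formats.
    private static final List<DateTimeFormatter> DATE_FORMATS = List.of(
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm"),
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"),
            DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm"));

    /** Extracts (title, absolute link) pairs from a list page using configured selectors. */
    public List<String[]> parseNewsList(String html, String baseUri,
                                        String containerSel, String titleSel, String linkSel) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String[]> items = new ArrayList<>();
        for (Element item : doc.select(containerSel)) {
            Element title = item.selectFirst(titleSel);
            Element link = item.selectFirst(linkSel);
            if (title == null || link == null) {
                continue; // skip entries that don't match the expected structure
            }
            // "abs:href" resolves relative links against the base URI
            items.add(new String[] { title.text(), link.attr("abs:href") });
        }
        return items;
    }

    /** Tries each known pattern in turn; returns null when none matches. */
    public LocalDateTime parseDateTime(String raw) {
        for (DateTimeFormatter format : DATE_FORMATS) {
            try {
                return LocalDateTime.parse(raw.trim(), format);
            } catch (DateTimeParseException ignored) {
                // fall through to the next pattern
            }
        }
        return null;
    }
}
```
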
### 2. Data Cleaner (DataCleaner)

**Purpose**: cleans and normalizes crawled news data.

**Cleaning steps**:
- Strip HTML tags
- Remove special control characters
- Normalize whitespace
- Generate a summary (first 200 characters)
- Validate data completeness

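A compact sketch of the rules listed above, assuming Jsoup is already on the classpath; the class and method names are illustrative.

```java
import org.jsoup.Jsoup;

public class DataCleanerSketch {

    /** Strips tags and control characters, then collapses runs of whitespace. */
    public String clean(String rawHtml) {
        String text = Jsoup.parse(rawHtml).text();      // drop HTML tags
        text = text.replaceAll("\\p{Cntrl}", " ");      // drop control characters
        return text.replaceAll("\\s+", " ").trim();     // normalize whitespace
    }

    /** First 200 characters of the cleaned content, per the summary rule above. */
    public String summarize(String cleaned) {
        return cleaned.length() <= 200 ? cleaned : cleaned.substring(0, 200);
    }

    /** Minimal completeness check: a record needs at least a title and content. */
    public boolean isValid(String title, String content) {
        return title != null && !title.isBlank()
                && content != null && !content.isBlank();
    }
}
```
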
### 3. Deduplication Service (DuplicationService)

**Purpose**: detects and filters duplicate news.

**Deduplication strategies**:
- URL: check whether the `source_url` already exists
- Title: check whether the `title` already exists
- Similarity: Levenshtein distance between titles

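The URL and title checks are plain existence queries; the similarity check is the interesting part. A self-contained sketch of Levenshtein-based similarity follows; the 0.9 threshold in the comment is an assumed example, not the module's actual setting.

```java
public class DuplicationSketch {

    /**
     * Similarity in [0,1] from Levenshtein edit distance; 1.0 means identical.
     * Example policy: treat titles with similarity(a, b) >= 0.9 as duplicates.
     */
    public static double similarity(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) d[a.length()][b.length()] / max;
    }
}
```
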
### 4. Crawler Service (CrawlerService)

**Purpose**: orchestrates the whole crawl pipeline.

**Pipeline**:
```
Scheduled trigger → load news sources → crawl in parallel → parse HTML → clean data
→ deduplicate → classify text → save to database → refresh cache → build report
```

## Configuration

### application.yml

```yaml
crawler:
  # Enable/disable the crawler
  enabled: true

  # Cron expression (runs every 30 minutes)
  cron: "0 */30 * * * ?"

  # User-Agent
  user-agent: "Mozilla/5.0 ..."

  # Connect timeout (ms)
  connect-timeout: 10000

  # Read timeout (ms)
  read-timeout: 30000

  # News sources
  sources:
    - name: Source name
      url: https://example.com/news
      enabled: true
      encoding: UTF-8
      delay: 2000                  # delay between requests (ms)
      selector:
        container: ".news-list"    # list container selector
        title: ".title"            # title selector
        link: "a"                  # link selector
        content: ".content"        # content selector
        publish-time: ".time"      # publish time selector
        author: ".author"          # author selector
        source: ".source"          # source selector
```

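The YAML above is bound into `CrawlerProperties.java`. A plausible shape for that class, assuming Spring Boot's relaxed binding; the field names and the `Map` representation of `selector` are guesses, not the actual source.

```java
import java.util.List;
import java.util.Map;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;

@Component
@ConfigurationProperties(prefix = "crawler")
public class CrawlerProperties {
    private boolean enabled;
    private String cron;
    private String userAgent;        // bound from user-agent
    private int connectTimeout;      // bound from connect-timeout
    private int readTimeout;         // bound from read-timeout
    private List<Source> sources;

    public static class Source {
        private String name;
        private String url;
        private boolean enabled;
        private String encoding;
        private long delay;
        private Map<String, String> selector; // container, title, link, ...
        // getters and setters are required for binding; omitted for brevity
    }
    // getters and setters are required for binding; omitted for brevity
}
```
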
### Adding a news source, step by step

1. **Pick the source**: choose the news site to crawl.

2. **Inspect the page structure**: use the browser developer tools to examine the HTML.

3. **Write the CSS selectors**:
   ```html
   <!-- Example HTML structure -->
   <div class="news-list">
     <div class="news-item">
       <a href="/news/123" class="title">News title</a>
       <p class="content">News summary</p>
       <span class="time">2024-12-24 10:00</span>
     </div>
   </div>
   ```

   ```yaml
   selector:
     container: ".news-list .news-item"
     title: ".title"
     link: "a"
     content: ".content"
     publish-time: ".time"
   ```

4. **Verify**: test via the manual trigger endpoint (see the API section below).

## API Endpoints

### 1. Trigger a crawl manually

```http
POST /api/crawler/execute
Authorization: Bearer {token}
```

**Response**:
```json
{
  "code": 200,
  "message": "Crawler task completed",
  "data": {
    "startTime": "2025-12-24T10:00:00",
    "endTime": "2025-12-24T10:05:00",
    "duration": 300000,
    "totalSuccess": 50,
    "totalFailed": 2,
    "totalSkipped": 5,
    "sourceStatsMap": {
      "36kr": {
        "sourceName": "36kr",
        "successCount": 30,
        "failedCount": 1,
        "skippedCount": 2
      }
    }
  }
}
```

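The endpoint above maps naturally onto a small controller method. A hedged sketch only: `ApiResponse` is a hypothetical wrapper standing in for whatever envelope produces the `code`/`message`/`data` shape, and `crawlAll()` is an assumed service method.

```java
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/crawler")
public class CrawlerControllerSketch {

    private final CrawlerService crawlerService;

    public CrawlerControllerSketch(CrawlerService crawlerService) {
        this.crawlerService = crawlerService;
    }

    /** Runs a full crawl and returns the report shown above. */
    @PostMapping("/execute")
    public ApiResponse<CrawlerReportDTO> execute() {
        CrawlerReportDTO report = crawlerService.crawlAll();      // assumed entry point
        return ApiResponse.ok("Crawler task completed", report);  // hypothetical wrapper
    }
}
```
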
### 2. Crawl a specific source

```http
POST /api/crawler/crawl/{sourceName}
Authorization: Bearer {token}
```

### 3. Get crawler configuration

```http
GET /api/crawler/config
Authorization: Bearer {token}
```

### 4. Update crawler status

```http
PUT /api/crawler/status?enabled=true
Authorization: Bearer {token}
```

## Scheduled Task

By default the crawler runs every 30 minutes; change the schedule in `application.yml`:

```yaml
crawler:
  # Cron format: second minute hour day-of-month month day-of-week
  cron: "0 */30 * * * ?"          # every 30 minutes
  # cron: "0 0 */2 * * ?"         # every 2 hours
  # cron: "0 0 8,12,18 * * ?"     # daily at 08:00, 12:00, and 18:00
```

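A minimal sketch of how `CrawlerScheduledTasks.java` might wire the cron expression in, assuming `@EnableScheduling` is active on the application and that `crawlAll()` / `isEnabled()` exist as shown (both are assumptions):

```java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class CrawlerScheduledTasksSketch {

    private final CrawlerService crawlerService;
    private final CrawlerProperties properties;

    public CrawlerScheduledTasksSketch(CrawlerService crawlerService,
                                       CrawlerProperties properties) {
        this.crawlerService = crawlerService;
        this.properties = properties;
    }

    /** Fires on the cron expression from application.yml, with the 30-minute default. */
    @Scheduled(cron = "${crawler.cron:0 */30 * * * ?}")
    public void runCrawler() {
        if (!properties.isEnabled()) {
            return; // crawler switched off via crawler.enabled
        }
        crawlerService.crawlAll(); // assumed service entry point
    }
}
```
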
## Log Output

```
2025-12-24 10:00:00 [crawler-async-1] INFO - Starting crawler task...
2025-12-24 10:00:01 [crawler-async-1] INFO - Enabled news sources: 2
2025-12-24 10:00:02 [crawler-async-1] INFO - Crawling source: 36kr
2025-12-24 10:00:03 [crawler-async-1] DEBUG - Parsed 30 articles from 36kr
2025-12-24 10:00:04 [crawler-async-1] DEBUG - Saved article: xxxxx
2025-12-24 10:00:05 [crawler-async-1] INFO - Source 36kr done - success: 28, failed: 1, skipped: 1
2025-12-24 10:05:00 [crawler-async-1] INFO - Crawler task done - success: 50, failed: 2, skipped: 5, elapsed: 300000ms
```

## FAQ

### 1. What if crawling fails?

Check the following:
- Is the target site reachable?
- Are the CSS selectors correct?
- Do requests need extra headers (Referer, Cookie)?
- Is there anti-crawling protection (adjust the User-Agent or delay)?

### 2. How do I debug CSS selectors?

Use the browser developer tools (a standalone Jsoup check is also sketched after this list):
1. Open the developer tools with F12
2. Pick an element with Ctrl+Shift+C
3. Right-click → Copy → Copy selector

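For a quick check outside the browser, a throwaway Jsoup `main` can print what a selector matches. The URL and selectors below are placeholders for your own values.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDebug {
    public static void main(String[] args) throws Exception {
        // Fetch the page once and print what each candidate selector matches.
        Document doc = Jsoup.connect("https://example.com/news")   // placeholder URL
                .userAgent("Mozilla/5.0")
                .timeout(10_000)
                .get();
        for (Element item : doc.select(".news-list .news-item")) { // selector under test
            System.out.println("title: " + item.select(".title").text());
            System.out.println("link : " + item.select("a").attr("abs:href"));
        }
    }
}
```
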
### 3. How do I add a new news source?

Add an entry under `crawler.sources` in `application.yml`:

```yaml
- name: New source
  url: https://new-source.com
  enabled: true
  delay: 2000
  selector:
    title: ".news-title"
    link: "a"
    content: ".news-content"
```

### 4. What if the crawler hurts performance?

- Increase the `delay` between requests
- Reduce the number of `enabled` sources
- Shrink the thread pool (AsyncConfig.java), as sketched below

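A sketch of the kind of executor bean `AsyncConfig.java` would expose. The bean name and pool sizes are illustrative; the `crawler-async-` prefix matches the thread name in the log output above.

```java
import java.util.concurrent.Executor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncConfigSketch {

    /** Smaller pools reduce crawl parallelism and the load on both host and targets. */
    @Bean("crawlerExecutor")
    public Executor crawlerExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(2);   // tune down if the crawler starves the app
        executor.setMaxPoolSize(4);
        executor.setQueueCapacity(100);
        executor.setThreadNamePrefix("crawler-async-");
        executor.initialize();
        return executor;
    }
}
```
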
## Caveats

1. **Respect robots.txt**: check each target site's crawling rules
2. **Set sensible delays**: avoid putting pressure on target sites
3. **Mind copyright**: crawled content is for learning purposes only
4. **Maintain regularly**: update the CSS selectors when a site's structure changes

## Possible Extensions

### 1. More parsing strategies

CSS selectors are used today; candidates to add:
- XPath
- Regular expressions
- Custom parsing rules

### 2. Anti-anti-crawling measures

- Random User-Agent (see the sketch after this list)
- Proxy IP pool
- Cookie pool
- CAPTCHA solving

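For the random User-Agent idea, a tiny pool is enough to start with; the truncated strings below are placeholders in the same style as the config example above.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class UserAgentPool {
    private static final List<String> AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
            "Mozilla/5.0 (X11; Linux x86_64) ...");

    /** Pick a different User-Agent per request to look less like a fixed client. */
    public static String random() {
        return AGENTS.get(ThreadLocalRandom.current().nextInt(AGENTS.size()));
    }
}
```
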
### 3. Distributed crawling

Use a message queue to coordinate crawling across multiple instances.

---

**Author**: 张俊恒
**Version**: v1.0
**Last updated**: 2025-12-24