# News Crawler System API Documentation

## Overview

The news crawler system provides a complete Web API covering crawler management, task monitoring, and data preview.

**Base URL**: `http://localhost:8000`

**API version**: v1

**Content type**: `application/json`

---

## Quick Start

### 1. Health Check

```http
GET /health
```

**Response example**:
```json
{
  "status": "healthy",
  "app_name": "News Crawler",
  "version": "2.0.0"
}
```
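
For a quick scripted check, here is a minimal Python sketch (it assumes the service is running locally on the default port):

```python
import requests

# Probe the health endpoint before doing anything else.
resp = requests.get("http://localhost:8000/health", timeout=5)
resp.raise_for_status()
print(resp.json())  # e.g. {"status": "healthy", "app_name": "News Crawler", "version": "2.0.0"}
```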

### 2. Interactive Documentation

- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc

---

## API Endpoints

### Crawler Management

#### 1. List All Available Crawlers

Returns the list of all crawlers available in the system.

```http
GET /api/v1/crawlers
```

**Response example**:
```json
[
  {
    "name": "netease:tech",
    "source": "netease",
    "category": "tech",
    "category_name": "科技",
    "url": "https://tech.163.com/"
  },
  {
    "name": "kr36:ai",
    "source": "kr36",
    "category": "ai",
    "category_name": "AI",
    "url": "https://www.36kr.com/information/AI/"
  }
]
```

**Supported news sources**:
- `netease` - NetEase (entertainment, tech, sports, finance, auto, government affairs, health, military)
- `kr36` - 36Kr (AI, health)
- `sina` - Sina (auto, government affairs)
- `tencent` - Tencent (auto, military, health, real estate, tech, entertainment, finance, AI)
- `souhu` - Sohu (real estate)
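
Since crawler names follow the `source:category` pattern, a run list can be built straight from the catalogue; a small sketch using the endpoint above:

```python
import requests

# Collect every crawler name for one source, e.g. tencent.
crawlers = requests.get("http://localhost:8000/api/v1/crawlers").json()
tencent_crawlers = [c["name"] for c in crawlers if c["source"] == "tencent"]
print(tencent_crawlers)  # e.g. ["tencent:auto", "tencent:tech", ...]
```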

---

#### 2. Run Crawlers

Starts the specified crawler tasks; multiple crawlers can run in a single task.

```http
POST /api/v1/crawlers/run
Content-Type: application/json
```

**Request body**:
```json
{
  "crawlers": ["netease:tech", "kr36:ai"],
  "max_articles": 10
}
```

**Parameters**:
| Parameter | Type | Required | Description |
|------|------|------|------|
| crawlers | array[string] | Yes | List of crawlers, in `source:category` format |
| max_articles | integer | No | Maximum number of articles to crawl; omit for no limit |

**Response example**:
```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "message": "已启动 2 个爬虫任务"
}
```

**cURL example**:
```bash
curl -X POST "http://localhost:8000/api/v1/crawlers/run" \
  -H "Content-Type: application/json" \
  -d '{
    "crawlers": ["netease:tech"],
    "max_articles": 5
  }'
```

---

#### 3. Cancel a Running Task

Cancels a crawler task that is currently running.

```http
DELETE /api/v1/crawlers/{task_id}
```

**Path parameters**:
| Parameter | Type | Description |
|------|------|------|
| task_id | string | Task ID |

**Response example**:
```json
{
  "status": "cancelled",
  "message": "任务 550e8400-e29b-41d4-a716-446655440000 已取消"
}
```

**cURL example**:
```bash
curl -X DELETE "http://localhost:8000/api/v1/crawlers/550e8400-e29b-41d4-a716-446655440000"
```

---

### Task Management

#### 4. List Task History

Returns the execution history of crawler tasks, with pagination and status filtering.

```http
GET /api/v1/tasks?page=1&size=20&status=completed
```

**Query parameters**:
| Parameter | Type | Required | Default | Description |
|------|------|------|--------|------|
| page | integer | No | 1 | Page number (starting from 1) |
| size | integer | No | 20 | Page size (1-100) |
| status | string | No | - | Filter by status: pending/running/completed/failed/cancelled |

**Response example**:
```json
{
  "data": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "status": "completed",
      "crawlers": ["netease:tech", "kr36:ai"],
      "max_articles": 10,
      "stats": {
        "crawled_count": 20,
        "inserted_count": 18,
        "duplicate_count": 2
      },
      "error_message": null,
      "started_at": "2024-01-20T10:30:00",
      "completed_at": "2024-01-20T10:35:00",
      "created_at": "2024-01-20T10:30:00"
    }
  ],
  "total": 1,
  "page": 1,
  "size": 20
}
```

**cURL examples**:
```bash
# List all tasks
curl "http://localhost:8000/api/v1/tasks"

# List only completed tasks
curl "http://localhost:8000/api/v1/tasks?status=completed&page=1&size=10"
```

---

#### 5. Get Task Details

Returns the details of a single task.

```http
GET /api/v1/tasks/{task_id}
```

**Path parameters**:
| Parameter | Type | Description |
|------|------|------|
| task_id | string | Task ID |

**Response example**:
```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "crawlers": ["netease:tech"],
  "max_articles": 10,
  "stats": {
    "crawled_count": 10,
    "inserted_count": 8,
    "duplicate_count": 2
  },
  "error_message": null,
  "started_at": "2024-01-20T10:30:00",
  "completed_at": "2024-01-20T10:32:00",
  "created_at": "2024-01-20T10:30:00"
}
```

**cURL example**:
```bash
curl "http://localhost:8000/api/v1/tasks/550e8400-e29b-41d4-a716-446655440000"
```

---

### Data Preview

#### 6. List News Articles

Returns crawled news data, with pagination and filtering.

```http
GET /api/v1/news?page=1&size=20&source=netease&category_id=4
```

**Query parameters**:
| Parameter | Type | Required | Default | Description |
|------|------|------|--------|------|
| page | integer | No | 1 | Page number |
| size | integer | No | 20 | Page size (1-100) |
| source | string | No | - | Filter by news source |
| category_id | integer | No | - | Filter by category ID |

**Response example**:
```json
{
  "data": [
    {
      "id": 1,
      "url": "https://example.com/news/123",
      "title": "示例新闻标题",
      "content": "新闻内容...",
      "category_id": 4,
      "source": "netease",
      "publish_time": "2024-01-20T10:00:00",
      "author": "张三",
      "created_at": "2024-01-20T10:05:00"
    }
  ],
  "total": 100,
  "page": 1,
  "size": 20
}
```

**cURL examples**:
```bash
# List all news
curl "http://localhost:8000/api/v1/news"

# List NetEase tech news
curl "http://localhost:8000/api/v1/news?source=netease&category_id=4"
```

---

#### 7. Get News Details

Returns the full content of a single news article.

```http
GET /api/v1/news/{news_id}
```

**Status**: `501` (not yet implemented)

---

#### 8. Delete News

Deletes the specified news record.

```http
DELETE /api/v1/news/{news_id}
```

**Status**: `501` (not yet implemented)

---

### Scheduled Tasks

#### 9. Scheduled Task Management

APIs for creating and managing crawler tasks that run on a recurring schedule.

```http
GET /api/v1/schedules                  # List scheduled tasks
POST /api/v1/schedules                 # Create a scheduled task
PUT /api/v1/schedules/{id}             # Update a scheduled task
DELETE /api/v1/schedules/{id}          # Delete a scheduled task
POST /api/v1/schedules/{id}/toggle     # Enable/disable a scheduled task
```

**Status**: `501` (not yet implemented)

---

## WebSocket Interface

### Real-Time Log Stream

Receive crawler task log output in real time over WebSocket.

```javascript
// Open the WebSocket connection
const ws = new WebSocket('ws://localhost:8000/ws/logs/{task_id}');

// Connection established
ws.onopen = () => {
  console.log('Connected to log stream');
};

// Receive messages
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.type === 'log') {
    // Log entry
    console.log(`[${data.data.level}] ${data.data.message}`);
  } else if (data.type === 'completed') {
    // Task finished
    console.log('Log stream ended');
  }
};

// Connection closed
ws.onclose = () => {
  console.log('Connection closed');
};

// Error handling
ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};
```

**Message formats**:

1. **Connection established**:
```json
{
  "type": "connected",
  "message": "已连接到任务 xxx 的日志流"
}
```

2. **Log entry**:
```json
{
  "type": "log",
  "data": {
    "level": "INFO",
    "message": "正在运行爬虫: netease:tech",
    "timestamp": "2024-01-20T10:30:15.123456"
  }
}
```

3. **Completion**:
```json
{
  "type": "completed",
  "message": "日志流结束"
}
```

4. **Error**:
```json
{
  "type": "error",
  "message": "错误描述"
}
```

---

## Task Statuses

A task can be in one of the following states:

| Status | Description |
|------|------|
| `pending` | Task created, waiting to run |
| `running` | Task is currently executing |
| `completed` | Task finished successfully |
| `failed` | Task execution failed |
| `cancelled` | Task was cancelled |

**Stats fields**:
- `crawled_count` - Number of articles successfully crawled
- `inserted_count` - Number of articles successfully inserted into the database
- `duplicate_count` - Number of duplicate articles (not inserted)
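
A typical client polls the task detail endpoint until the task reaches a terminal state, then reads the stats; a minimal sketch (the helper name and polling interval are illustrative, not part of the API):

```python
import time

import requests

BASE_URL = "http://localhost:8000"
TERMINAL_STATES = {"completed", "failed", "cancelled"}

def wait_for_task(task_id: str, interval: float = 2.0) -> dict:
    """Poll GET /api/v1/tasks/{task_id} until the task stops running."""
    while True:
        task = requests.get(f"{BASE_URL}/api/v1/tasks/{task_id}").json()
        if task["status"] in TERMINAL_STATES:
            return task
        time.sleep(interval)

task = wait_for_task("550e8400-e29b-41d4-a716-446655440000")
print(task["stats"])  # {"crawled_count": ..., "inserted_count": ..., "duplicate_count": ...}
```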

---

## Error Responses

When a request fails, the API returns an appropriate HTTP status code and an error message.

### HTTP Status Codes

| Status code | Description |
|--------|------|
| 200 | Success |
| 400 | Invalid request parameters |
| 404 | Resource not found |
| 500 | Internal server error |
| 501 | Not implemented |

### Error Response Format

```json
{
  "detail": "错误描述信息"
}
```

**Example**:
```json
{
  "detail": "爬虫名称格式错误: 'invalid_crawler',应为 'source:category'"
}
```
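
A client can branch on the status code and surface the `detail` field; a minimal sketch, assuming the error shape shown above:

```python
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/crawlers/run",
    json={"crawlers": ["invalid_crawler"]},  # deliberately malformed: missing ':'
)
if resp.status_code != 200:
    # Error responses carry a human-readable message in "detail".
    print(f"HTTP {resp.status_code}: {resp.json().get('detail')}")
```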

---

## Usage Examples

### Python Example

```python
import requests

BASE_URL = "http://localhost:8000"

# 1. List all available crawlers
response = requests.get(f"{BASE_URL}/api/v1/crawlers")
crawlers = response.json()
print(f"Available crawlers: {len(crawlers)}")

# 2. Run crawlers
task_request = {
    "crawlers": ["netease:tech"],
    "max_articles": 5
}
response = requests.post(
    f"{BASE_URL}/api/v1/crawlers/run",
    json=task_request
)
task = response.json()
task_id = task["task_id"]
print(f"Task ID: {task_id}")

# 3. Check task status
response = requests.get(f"{BASE_URL}/api/v1/tasks/{task_id}")
task_status = response.json()
print(f"Task status: {task_status['status']}")

# 4. Get task history
response = requests.get(
    f"{BASE_URL}/api/v1/tasks",
    params={"status": "completed", "page": 1, "size": 10}
)
history = response.json()
print(f"Completed tasks: {history['total']}")
```

### JavaScript Example

```javascript
const BASE_URL = 'http://localhost:8000';

// 1. List all available crawlers
async function getCrawlers() {
  const response = await fetch(`${BASE_URL}/api/v1/crawlers`);
  const crawlers = await response.json();
  console.log('Available crawlers:', crawlers);
  return crawlers;
}

// 2. Run crawlers
async function runCrawlers(crawlerList, maxArticles) {
  const response = await fetch(`${BASE_URL}/api/v1/crawlers/run`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      crawlers: crawlerList,
      max_articles: maxArticles
    })
  });
  const task = await response.json();
  console.log('Task ID:', task.task_id);
  return task.task_id;
}

// 3. Stream live logs over WebSocket
function connectLogStream(taskId) {
  const ws = new WebSocket(`ws://localhost:8000/ws/logs/${taskId}`);

  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.type === 'log') {
      console.log(`[${data.data.level}] ${data.data.message}`);
    }
  };

  return ws;
}

// Usage
(async () => {
  const crawlers = await getCrawlers();
  const taskId = await runCrawlers(['netease:tech'], 5);
  connectLogStream(taskId);
})();
```

---

## Category ID Reference

| ID | Category name | Keyword |
|----|---------|------|
| 1 | 娱乐 | entertainment |
| 2 | 体育 | sports |
| 3 | 财经 | money/finance |
| 4 | 科技 | tech |
| 5 | 军事 | war |
| 6 | 汽车 | auto |
| 7 | 政务 | gov |
| 8 | 健康 | health |
| 9 | AI | ai |
| 10 | 房产 | house |
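
For example, combining the table with the news endpoint to fetch AI articles (category ID 9):

```python
import requests

# category_id 9 corresponds to AI in the table above.
resp = requests.get(
    "http://localhost:8000/api/v1/news",
    params={"category_id": 9, "page": 1, "size": 20},
)
print(resp.json()["total"])
```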

---

## Notes

1. **Concurrency limit**: At most 3 crawler tasks can run at the same time
2. **Timeouts**: Page load timeout defaults to 60 seconds, script timeout to 30 seconds
3. **Deduplication**: Duplicate news is filtered automatically, by URL and content hash (see the sketch below)
4. **Log retention**: At most 1000 log entries are kept in memory
5. **WebSocket**: Use SSL/TLS-encrypted WebSocket connections in production
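
An illustrative sketch of the URL-plus-content-hash idea behind the dedup note; the system's actual hashing scheme is not documented here, so every detail below is an assumption:

```python
import hashlib

def news_fingerprint(url: str, content: str) -> str:
    """Hypothetical fingerprint combining URL and content for dedup checks.
    The real system may hash URL and content separately or use another digest."""
    return hashlib.sha256(f"{url}\n{content}".encode("utf-8")).hexdigest()

seen: set[str] = set()

def is_duplicate(url: str, content: str) -> bool:
    """Return True if this article was already seen in this process."""
    fp = news_fingerprint(url, content)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```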

---

## Changelog

### v2.0.0 (2024-01-20)
- ✅ Added Web API
- ✅ Asynchronous crawler execution
- ✅ Real-time log streaming over WebSocket
- ✅ Task history
- 🚧 Scheduled tasks (in development)
- 🚧 Data preview (in development)

---

## Support

For questions or suggestions, see the project documentation or contact the development team.

**Online API docs**: http://localhost:8000/docs