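# News Crawler System API Documentation

## Overview

The news crawler system exposes a Web API for crawler management, task monitoring, and data preview.

**Base URL**: `http://localhost:8000`

**API version**: v1

**Content type**: `application/json`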

---

## Quick Start

### 1. Health Check

```http
GET /health
```

**Example response**:
```json
{
  "status": "healthy",
  "app_name": "News Crawler",
  "version": "2.0.0"
}
```

### 2. Interactive Documentation

- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc

---

## API Endpoints

### Crawler Management

#### 1. List Available Crawlers

Returns every crawler available in the system.

```http
GET /api/v1/crawlers
```

**Example response**:
```json
[
  {
    "name": "netease:tech",
    "source": "netease",
    "category": "tech",
    "category_name": "科技",
    "url": "https://tech.163.com/"
  },
  {
    "name": "kr36:ai",
    "source": "kr36",
    "category": "ai",
    "category_name": "AI",
    "url": "https://www.36kr.com/information/AI/"
  }
]
```

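**Supported news sources**:
- `netease` - NetEase (entertainment, tech, sports, finance, auto, government, health, military)
- `kr36` - 36Kr (AI, health)
- `sina` - Sina (auto, government)
- `tencent` - Tencent (auto, military, health, real estate, tech, entertainment, finance, AI)
- `souhu` - Sohu (real estate)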

---

#### 2. Run Crawlers

Starts a task for the specified crawlers. Multiple crawlers can be run at the same time.

```http
POST /api/v1/crawlers/run
Content-Type: application/json
```

**Request body**:
```json
{
  "crawlers": ["netease:tech", "kr36:ai"],
  "max_articles": 10
}
```

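**Parameters**:
| Parameter | Type | Required | Description |
|------|------|------|------|
| crawlers | array[string] | Yes | List of crawlers, each in `source:category` format |
| max_articles | integer | No | Maximum number of articles to crawl; omit for no limit |

**Example response**: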
```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "message": "已启动 2 个爬虫任务"
}
```

**cURL example**:
```bash
curl -X POST "http://localhost:8000/api/v1/crawlers/run" \
  -H "Content-Type: application/json" \
  -d '{
    "crawlers": ["netease:tech"],
    "max_articles": 5
  }'
```

---

#### 3. Cancel a Running Task

Cancels a crawler task that is currently running.

```http
DELETE /api/v1/crawlers/{task_id}
```

**Path parameters**:
| Parameter | Type | Description |
|------|------|------|
| task_id | string | Task ID |

**Example response**:
```json
{
  "status": "cancelled",
  "message": "任务 550e8400-e29b-41d4-a716-446655440000 已取消"
}
```

**cURL example**:
```bash
curl -X DELETE "http://localhost:8000/api/v1/crawlers/550e8400-e29b-41d4-a716-446655440000"
```

---

### Task Management

#### 4. List Task History

Returns the execution history of crawler tasks, with support for pagination and status filtering.

```http
GET /api/v1/tasks?page=1&size=20&status=completed
```

**Query parameters**:
| Parameter | Type | Required | Default | Description |
|------|------|------|--------|------|
| page | integer | No | 1 | Page number (starting from 1) |
| size | integer | No | 20 | Items per page (1-100) |
| status | string | No | - | Status filter: `pending`, `running`, `completed`, `failed`, or `cancelled` |

**Example response**:
```json
{
  "data": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "status": "completed",
      "crawlers": ["netease:tech", "kr36:ai"],
      "max_articles": 10,
      "stats": {
        "crawled_count": 20,
        "inserted_count": 18,
        "duplicate_count": 2
      },
      "error_message": null,
      "started_at": "2024-01-20T10:30:00",
      "completed_at": "2024-01-20T10:35:00",
      "created_at": "2024-01-20T10:30:00"
    }
  ],
  "total": 1,
  "page": 1,
  "size": 20
}
```

**cURL example**:
```bash
# Fetch all tasks
curl "http://localhost:8000/api/v1/tasks"

# Fetch completed tasks only
curl "http://localhost:8000/api/v1/tasks?status=completed&page=1&size=10"
```
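
To read the whole history rather than a single page, a client can walk the pages until `total` is exhausted. The sketch below assumes that `total` counts all matching tasks across pages, which the sample response suggests but does not confirm. `iter_tasks` and the injected `fetch_page` callable are hypothetical names, used so that the paging logic stays independent of any HTTP library.

```python
import math
from typing import Callable, Iterator


def iter_tasks(fetch_page: Callable[[int, int], dict], size: int = 20) -> Iterator[dict]:
    """Yield every task by walking the paginated /api/v1/tasks response.

    fetch_page(page, size) is supplied by the caller and must return the
    decoded JSON body ({"data": [...], "total": ..., "page": ..., "size": ...}).
    """
    page = 1
    while True:
        body = fetch_page(page, size)
        yield from body["data"]
        # Assumption: total is the number of matching tasks across all pages.
        last_page = max(1, math.ceil(body["total"] / size))
        # Stop on the last page, or early if the server returns an empty page.
        if page >= last_page or not body["data"]:
            break
        page += 1
```

With `requests`, `fetch_page` could be as small as `lambda page, size: requests.get(url, params={"page": page, "size": size}).json()`.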

---

#### 5. Get Task Details

Returns the details of a single task.

```http
GET /api/v1/tasks/{task_id}
```

**Path parameters**:
| Parameter | Type | Description |
|------|------|------|
| task_id | string | Task ID |

**Example response**:
```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "crawlers": ["netease:tech"],
  "max_articles": 10,
  "stats": {
    "crawled_count": 10,
    "inserted_count": 8,
    "duplicate_count": 2
  },
  "error_message": null,
  "started_at": "2024-01-20T10:30:00",
  "completed_at": "2024-01-20T10:32:00",
  "created_at": "2024-01-20T10:30:00"
}
```

**cURL example**:
```bash
curl "http://localhost:8000/api/v1/tasks/550e8400-e29b-41d4-a716-446655440000"
```
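
Because crawler tasks run asynchronously, a client that does not use the WebSocket log stream typically polls this endpoint until the task finishes. The following is one possible sketch: `wait_for_task` and the injected `get_task` callable are hypothetical, and the polling interval and timeout defaults are arbitrary choices, not values defined by the API.

```python
import time
from typing import Callable

# Statuses after which a task no longer changes (from the task status table).
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_task(get_task: Callable[[str], dict], task_id: str,
                  interval: float = 2.0, timeout: float = 600.0) -> dict:
    """Poll a task until it reaches a terminal status or the timeout expires.

    get_task(task_id) is supplied by the caller and must return the decoded
    JSON body of GET /api/v1/tasks/{task_id}.
    """
    deadline = time.monotonic() + timeout
    while True:
        task = get_task(task_id)
        if task["status"] in TERMINAL_STATUSES:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {task['status']} after {timeout}s")
        time.sleep(interval)
```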

---

### Data Preview

#### 6. List News

Returns crawled news items, with support for pagination and filtering.

```http
GET /api/v1/news?page=1&size=20&source=netease&category_id=4
```

**Query parameters**:
| Parameter | Type | Required | Default | Description |
|------|------|------|--------|------|
| page | integer | No | 1 | Page number |
| size | integer | No | 20 | Items per page (1-100) |
| source | string | No | - | News source filter |
| category_id | integer | No | - | Category ID filter |

**Example response**:
```json
{
  "data": [
    {
      "id": 1,
      "url": "https://example.com/news/123",
      "title": "示例新闻标题",
      "content": "新闻内容...",
      "category_id": 4,
      "source": "netease",
      "publish_time": "2024-01-20T10:00:00",
      "author": "张三",
      "created_at": "2024-01-20T10:05:00"
    }
  ],
  "total": 100,
  "page": 1,
  "size": 20
}
```

**cURL example**:
```bash
# Fetch all news
curl "http://localhost:8000/api/v1/news"

# Fetch NetEase tech news
curl "http://localhost:8000/api/v1/news?source=netease&category_id=4"
```

---

#### 7. Get News Details

Returns the full content of a single news item.

```http
GET /api/v1/news/{news_id}
```

**Status**: `501` (under development)

---

#### 8. Delete News

Deletes the specified news record.

```http
DELETE /api/v1/news/{news_id}
```

**Status**: `501` (under development)

---

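### Scheduled Tasks

#### 9. Scheduled Task Management

Endpoints for creating and managing crawler tasks that run periodically.

```http
GET    /api/v1/schedules           # List scheduled tasks
POST   /api/v1/schedules           # Create a scheduled task
PUT    /api/v1/schedules/{id}      # Update a scheduled task
DELETE /api/v1/schedules/{id}      # Delete a scheduled task
POST   /api/v1/schedules/{id}/toggle  # Enable or disable a scheduled task
```

**Status**: `501` (under development)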

---

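## WebSocket Interface

### Real-Time Log Stream

Receive the log output of a crawler task in real time over WebSocket.

```javascript
// Open the WebSocket connection.
// Replace taskId with the task_id returned by POST /api/v1/crawlers/run.
const taskId = '550e8400-e29b-41d4-a716-446655440000';
const ws = new WebSocket(`ws://localhost:8000/ws/logs/${taskId}`);

// Connection established
ws.onopen = () => {
  console.log('Connected to log stream');
};

// Incoming messages
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.type === 'log') {
    // Log entry
    console.log(`[${data.data.level}] ${data.data.message}`);
  } else if (data.type === 'completed') {
    // Task finished
    console.log('Log stream ended');
  } else if (data.type === 'error') {
    // Error reported by the server
    console.error('Log stream error:', data.message);
  }
};

// Connection closed
ws.onclose = () => {
  console.log('Connection closed');
};

// Error handling
ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};
```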

**Message formats**:

1. **Connection established**:
```json
{
  "type": "connected",
  "message": "已连接到任务 xxx 的日志流"
}
```

2. **Log message**:
```json
{
  "type": "log",
  "data": {
    "level": "INFO",
    "message": "正在运行爬虫: netease:tech",
    "timestamp": "2024-01-20T10:30:15.123456"
  }
}
```

3. **Completion message**:
```json
{
  "type": "completed",
  "message": "日志流结束"
}
```

4. **Error message**:
```json
{
  "type": "error",
  "message": "error description"
}
```
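
The four message types above can be handled by a single dispatch function. The sketch below is written in Python so that the parsing can be tested without a live connection. `handle_log_message` is a hypothetical helper. Treating an `error` message as the end of the stream is an assumption, since this document does not say whether the server closes the connection after an error.

```python
import json


def handle_log_message(raw: str) -> tuple[str, bool]:
    """Turn one WebSocket text frame into (display_line, stream_finished)."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "log":
        entry = msg["data"]
        return f"{entry['timestamp']} [{entry['level']}] {entry['message']}", False
    if kind == "connected":
        return msg["message"], False
    if kind == "completed":
        return msg["message"], True
    if kind == "error":
        # Assumption: an error message means no further log lines will follow.
        return f"ERROR: {msg['message']}", True
    # Unknown types are surfaced rather than silently dropped.
    return f"unrecognized message: {raw}", False
```

A receive loop would call this function for each frame and stop reading once the second element of the result is `True`.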

---

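## Task Statuses

A task is in one of the following states:

| Status | Description |
|------|------|
| `pending` | The task has been created and is waiting to run |
| `running` | The task is currently running |
| `completed` | The task finished successfully |
| `failed` | The task failed |
| `cancelled` | The task was cancelled |

**Statistics fields**:
- `crawled_count` - number of articles crawled successfully
- `inserted_count` - number of articles inserted into the database
- `duplicate_count` - number of duplicate articles (not inserted)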

---

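## Error Responses

When a request fails, the API returns the corresponding HTTP status code together with an error message.

### HTTP Status Codes

| Status code | Description |
|--------|------|
| 200 | Request succeeded |
| 400 | Invalid request parameters |
| 404 | Resource not found |
| 500 | Internal server error |
| 501 | Not implemented |

### Error Response Format

```json
{
  "detail": "error description"
}
```

**Example**: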
```json
{
  "detail": "爬虫名称格式错误: 'invalid_crawler'，应为 'source:category'"
}
```
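
On the client side, the `detail` field can be turned into an exception so that failures are not mistaken for data. The sketch below is one way to do this. `ApiError` and `check_response` are hypothetical names, and the fallback to the raw body covers responses that do not follow the documented `{"detail": ...}` shape.

```python
import json


class ApiError(Exception):
    """Raised for non-2xx responses; carries the status code and detail text."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(f"HTTP {status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail


def check_response(status_code: int, body_text: str) -> None:
    """Raise ApiError when the status code signals a failure."""
    if 200 <= status_code < 300:
        return
    try:
        payload = json.loads(body_text)
        detail = payload.get("detail") if isinstance(payload, dict) else None
    except json.JSONDecodeError:
        detail = None
    # Fall back to the raw body if it is not the documented {"detail": ...} shape.
    raise ApiError(status_code, str(detail) if detail is not None else body_text)
```

With `requests`, this would be called as `check_response(response.status_code, response.text)` before reading `response.json()`.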

---

## Usage Examples

### Python Example

```python
import requests

BASE_URL = "http://localhost:8000"

# 1. List all available crawlers
response = requests.get(f"{BASE_URL}/api/v1/crawlers")
crawlers = response.json()
print(f"Available crawlers: {len(crawlers)}")

# 2. Run a crawler
task_request = {
    "crawlers": ["netease:tech"],
    "max_articles": 5
}
response = requests.post(
    f"{BASE_URL}/api/v1/crawlers/run",
    json=task_request
)
task = response.json()
task_id = task["task_id"]
print(f"Task ID: {task_id}")

# 3. Query the task status
response = requests.get(f"{BASE_URL}/api/v1/tasks/{task_id}")
task_status = response.json()
print(f"Task status: {task_status['status']}")

# 4. Fetch the task history
response = requests.get(
    f"{BASE_URL}/api/v1/tasks",
    params={"status": "completed", "page": 1, "size": 10}
)
history = response.json()
print(f"Completed tasks: {history['total']}")
```

### JavaScript Example

```javascript
const BASE_URL = 'http://localhost:8000';

// 1. List all available crawlers
async function getCrawlers() {
  const response = await fetch(`${BASE_URL}/api/v1/crawlers`);
  const crawlers = await response.json();
  console.log('Available crawlers:', crawlers);
  return crawlers;
}

// 2. Run crawlers
async function runCrawlers(crawlerList, maxArticles) {
  const response = await fetch(`${BASE_URL}/api/v1/crawlers/run`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      crawlers: crawlerList,
      max_articles: maxArticles
    })
  });
  const task = await response.json();
  console.log('Task ID:', task.task_id);
  return task.task_id;
}

// 3. Connect to the WebSocket to follow the logs in real time
function connectLogStream(taskId) {
  const ws = new WebSocket(`ws://localhost:8000/ws/logs/${taskId}`);

  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.type === 'log') {
      console.log(`[${data.data.level}] ${data.data.message}`);
    }
  };

  return ws;
}

// Usage
(async () => {
  const crawlers = await getCrawlers();
  const taskId = await runCrawlers(['netease:tech'], 5);
  connectLogStream(taskId);
})();
```

---

## Category ID Reference

| ID | Category | Category key |
|----|---------|------|
| 1  | Entertainment | entertainment |
| 2  | Sports | sports |
| 3  | Finance | money/finance |
| 4  | Tech | tech |
| 5  | Military | war |
| 6  | Auto | auto |
| 7  | Government | gov |
| 8  | Health | health |
| 9  | AI | ai |
| 10 | Real estate | house |
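
The table can be used to derive the `/api/v1/news` filter that corresponds to a crawler name. The sketch below assumes that the category part of a crawler name (for example `tech` in `netease:tech`) uses the identifiers in the third column. That holds for the `tech` and `ai` samples shown earlier but is not confirmed for every source. `CATEGORY_IDS` and `news_filter` are hypothetical names.

```python
# Mapping taken from the category ID table; "money" and "finance" both map to 3.
CATEGORY_IDS = {
    "entertainment": 1, "sports": 2, "money": 3, "finance": 3, "tech": 4,
    "war": 5, "auto": 6, "gov": 7, "health": 8, "ai": 9, "house": 10,
}


def news_filter(crawler_name: str) -> dict:
    """Translate a crawler name such as 'netease:tech' into /api/v1/news query params."""
    source, _, category = crawler_name.partition(":")
    if category not in CATEGORY_IDS:
        raise KeyError(f"unknown category key {category!r}")
    return {"source": source, "category_id": CATEGORY_IDS[category]}
```

For example, `news_filter("netease:tech")` yields the same `source=netease&category_id=4` filter used in the cURL example for the news list.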

---

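## Notes

1. **Concurrency limit**: at most 3 crawler tasks can run at the same time
2. **Timeouts**: by default, page loads time out after 60 seconds and scripts after 30 seconds
3. **Deduplication**: duplicate news items are filtered out automatically, based on URL and content hash
4. **Log retention**: at most 1000 log entries are kept in memory
5. **WebSocket**: in production, encrypt WebSocket connections with SSL/TLS (`wss://`)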
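
To illustrate the deduplication note, the sketch below shows one way that filtering by URL and content hash can work. It is not the system's actual implementation: the `Deduplicator` class is hypothetical, and the use of SHA-256 over whitespace-normalized text is an assumption, since this document does not name the hash function or any normalization step.

```python
import hashlib


class Deduplicator:
    """Track seen articles by URL and by a hash of their content."""

    def __init__(self) -> None:
        self._seen_urls: set[str] = set()
        self._seen_hashes: set[str] = set()

    def is_duplicate(self, url: str, content: str) -> bool:
        """Return True if the URL or content was seen before; otherwise record both."""
        # Assumption: SHA-256 over whitespace-normalized text; the real system may differ.
        normalized = " ".join(content.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if url in self._seen_urls or digest in self._seen_hashes:
            return True
        self._seen_urls.add(url)
        self._seen_hashes.add(digest)
        return False
```

Checking both keys catches the same article republished under a different URL as well as the same URL crawled twice, which is consistent with `duplicate_count` being reported separately from `inserted_count`.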

---

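## Changelog

### v2.0.0 (2024-01-20)
- ✅ Added Web API endpoints
- ✅ Asynchronous crawler execution
- ✅ Real-time log streaming over WebSocket
- ✅ Task history
- 🚧 Scheduled tasks (in development)
- 🚧 Data preview (in development)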

---

## Support

For questions or suggestions, see the project documentation or contact the development team.

**Online API documentation**: http://localhost:8000/docs