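News Crawler System API Documentation

Overview

The News Crawler system exposes a Web API for crawler management, task monitoring, and data preview.

Base URL: http://localhost:8000

API version: v1

Content type: application/json

Quick Start

1. Health Check

GET /health

Example response: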

{
  "status": "healthy",
  "app_name": "News Crawler",
  "version": "2.0.0"
}

2. View the Interactive Documentation

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

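API Endpoints

Crawler Management

1. List All Available Crawlers

Returns every crawler available in the system.

GET /api/v1/crawlers

Example response (`category_name` is the Chinese display name of the category):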

[
  {
    "name": "netease:tech",
    "source": "netease",
    "category": "tech",
    "category_name": "科技",
    "url": "https://tech.163.com/"
  },
  {
    "name": "kr36:ai",
    "source": "kr36",
    "category": "ai",
    "category_name": "AI",
    "url": "https://www.36kr.com/information/AI/"
  }
]

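Supported news sources:

netease - NetEase (entertainment, tech, sports, finance, auto, government, health, military)
kr36 - 36Kr (AI, health)
sina - Sina (auto, government)
tencent - Tencent (auto, military, health, real estate, tech, entertainment, finance, AI)
souhu - Sohu (real estate)

2. Run Crawlers

Starts the specified crawler tasks. Several crawlers can be started in a single request.

POST /api/v1/crawlers/run
Content-Type: application/json

Request body: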

{
  "crawlers": ["netease:tech", "kr36:ai"],
  "max_articles": 10
}

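Parameters:

Parameter	Type	Required	Description
crawlers	array[string]	Yes	List of crawlers, each in the form `source:category`
max_articles	integer	No	Maximum number of articles to crawl; omit for no limit

Example response (the message reads "Started 2 crawler tasks"):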

{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "message": "已启动 2 个爬虫任务"
}

cURL example:

curl -X POST "http://localhost:8000/api/v1/crawlers/run" \
  -H "Content-Type: application/json" \
  -d '{
    "crawlers": ["netease:tech"],
    "max_articles": 5
  }'
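
The request body described above could be assembled and checked on the client before it is sent. The following is a minimal sketch, not the server's own validation: the helper name `build_run_request`, the regular expression for crawler names, and the rules that the list must be non-empty and `max_articles` positive are all assumptions made for illustration.

```python
import re

# Assumed shape of a crawler name: lowercase letters, digits, or underscores
# on each side of a single colon (for example "netease:tech").
_CRAWLER_NAME = re.compile(r"^[a-z0-9_]+:[a-z0-9_]+$")


def build_run_request(crawlers, max_articles=None):
    """Return a JSON-serializable body for POST /api/v1/crawlers/run."""
    if not crawlers:
        raise ValueError("at least one crawler is required")
    for name in crawlers:
        if not _CRAWLER_NAME.match(name):
            raise ValueError(f"bad crawler name {name!r}, expected 'source:category'")
    body = {"crawlers": list(crawlers)}
    # max_articles is optional; leaving it out means "no limit".
    if max_articles is not None:
        if max_articles < 1:
            raise ValueError("max_articles must be a positive integer")
        body["max_articles"] = max_articles
    return body
```

The returned dictionary can be passed as the `json=` argument of `requests.post`, mirroring the cURL call above.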

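3. Cancel a Running Task

Cancels a crawler task that is currently running.

DELETE /api/v1/crawlers/{task_id}

Path parameters:

Parameter	Type	Description
task_id	string	Task ID

Example response (the message reads "Task 550e8400-e29b-41d4-a716-446655440000 has been cancelled"):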

{
  "status": "cancelled",
  "message": "任务 550e8400-e29b-41d4-a716-446655440000 已取消"
}

cURL example:

curl -X DELETE "http://localhost:8000/api/v1/crawlers/550e8400-e29b-41d4-a716-446655440000"

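Task Management

4. Get the Task History List

Returns the execution history of crawler tasks, with pagination and status filtering.

GET /api/v1/tasks?page=1&size=20&status=completed

Query parameters:

Parameter	Type	Required	Default	Description
page	integer	No	1	Page number (starting at 1)
size	integer	No	20	Items per page (1-100)
status	string	No	-	Status filter: pending/running/completed/failed/cancelled

Example response: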

{
  "data": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "status": "completed",
      "crawlers": ["netease:tech", "kr36:ai"],
      "max_articles": 10,
      "stats": {
        "crawled_count": 20,
        "inserted_count": 18,
        "duplicate_count": 2
      },
      "error_message": null,
      "started_at": "2024-01-20T10:30:00",
      "completed_at": "2024-01-20T10:35:00",
      "created_at": "2024-01-20T10:30:00"
    }
  ],
  "total": 1,
  "page": 1,
  "size": 20
}

cURL example:

# Get all tasks
curl "http://localhost:8000/api/v1/tasks"

# Get only completed tasks
curl "http://localhost:8000/api/v1/tasks?status=completed&page=1&size=10"
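
Because the list response carries `total`, `page`, and `size`, a client can walk all pages until it has seen `total` items. The sketch below assumes a hypothetical `fetch_page` callable that takes the query parameters as a dictionary and returns the decoded JSON response; it is not part of the API itself.

```python
def iter_tasks(fetch_page, size=20, status=None):
    """Yield every task by requesting pages until `total` items have been seen."""
    if not 1 <= size <= 100:
        raise ValueError("size must be between 1 and 100")
    page, seen = 1, 0
    while True:
        params = {"page": page, "size": size}
        if status is not None:
            params["status"] = status
        result = fetch_page(params)
        items = result["data"]
        yield from items
        seen += len(items)
        # Also stop on an empty page, so a shrinking total cannot loop forever.
        if not items or seen >= result["total"]:
            break
        page += 1
```

With `requests`, `fetch_page` could be `lambda p: requests.get("http://localhost:8000/api/v1/tasks", params=p).json()`.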

5. Get Task Details

Returns the details of a single task.

GET /api/v1/tasks/{task_id}

Path parameters:

Parameter	Type	Description
task_id	string	Task ID

Example response:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "crawlers": ["netease:tech"],
  "max_articles": 10,
  "stats": {
    "crawled_count": 10,
    "inserted_count": 8,
    "duplicate_count": 2
  },
  "error_message": null,
  "started_at": "2024-01-20T10:30:00",
  "completed_at": "2024-01-20T10:32:00",
  "created_at": "2024-01-20T10:30:00"
}

cURL example:

curl "http://localhost:8000/api/v1/tasks/550e8400-e29b-41d4-a716-446655440000"
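
A common use of this endpoint is to poll it until the task stops running. The sketch below is one possible approach, assuming that `completed`, `failed`, and `cancelled` are the final statuses; `get_task` is a hypothetical callable that returns the decoded task detail for a given ID, and the default interval and timeout are arbitrary.

```python
import time

# Statuses after which the task is assumed not to change any more.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_task(get_task, task_id, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll the task until it reaches a terminal status, then return its detail dict."""
    deadline = time.monotonic() + timeout
    while True:
        task = get_task(task_id)
        if task["status"] in TERMINAL_STATUSES:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} is still {task['status']} after {timeout}s")
        sleep(interval)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; for live log output, the WebSocket interface is the better fit.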

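Data Preview

6. Get the News List

Returns crawled news data, with pagination and filtering.

GET /api/v1/news?page=1&size=20&source=netease&category_id=4

Query parameters:

Parameter	Type	Required	Default	Description
page	integer	No	1	Page number
size	integer	No	20	Items per page (1-100)
source	string	No	-	News source filter
category_id	integer	No	-	Category ID filter

Example response: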

{
  "data": [
    {
      "id": 1,
      "url": "https://example.com/news/123",
      "title": "示例新闻标题",
      "content": "新闻内容...",
      "category_id": 4,
      "source": "netease",
      "publish_time": "2024-01-20T10:00:00",
      "author": "张三",
      "created_at": "2024-01-20T10:05:00"
    }
  ],
  "total": 100,
  "page": 1,
  "size": 20
}

cURL example:

# Get all news
curl "http://localhost:8000/api/v1/news"

# Get NetEase tech news
curl "http://localhost:8000/api/v1/news?source=netease&category_id=4"
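
The query string for this endpoint can be built so that unset filters are simply left out. This is a small sketch using only the standard library; the function name `news_list_url` is hypothetical, and rejecting `page` values below 1 is an assumption carried over from the task list endpoint.

```python
from urllib.parse import urlencode


def news_list_url(base_url, page=1, size=20, source=None, category_id=None):
    """Build the GET /api/v1/news URL, omitting filters that are not set."""
    if page < 1:
        raise ValueError("page must be 1 or greater")
    if not 1 <= size <= 100:
        raise ValueError("size must be between 1 and 100")
    params = {"page": page, "size": size}
    if source is not None:
        params["source"] = source
    if category_id is not None:
        params["category_id"] = category_id
    return f"{base_url}/api/v1/news?{urlencode(params)}"
```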

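7. Get News Details

Returns the full content of a single news item.

GET /api/v1/news/{news_id}

Status: 501 (under development)

8. Delete News

Deletes the specified news record.

DELETE /api/v1/news/{news_id}

Status: 501 (under development)

Scheduled Tasks

9. Scheduled Task Management

Endpoints for creating and managing crawler tasks that run periodically.

GET    /api/v1/schedules           # List scheduled tasks
POST   /api/v1/schedules           # Create a scheduled task
PUT    /api/v1/schedules/{id}      # Update a scheduled task
DELETE /api/v1/schedules/{id}      # Delete a scheduled task
POST   /api/v1/schedules/{id}/toggle  # Enable or disable a scheduled task

Status: 501 (under development)

WebSocket Interface

Real-Time Log Stream

Receive a crawler task's log output in real time over WebSocket.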

// Open the WebSocket connection (replace {task_id} with a real task ID)
const ws = new WebSocket('ws://localhost:8000/ws/logs/{task_id}');

// Connection established
ws.onopen = () => {
  console.log('Connected to the log stream');
};

// Incoming messages
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.type === 'log') {
    // Log entry
    console.log(`[${data.data.level}] ${data.data.message}`);
  } else if (data.type === 'completed') {
    // Task finished
    console.log('Log stream ended');
  }
};

// Connection closed
ws.onclose = () => {
  console.log('Connection closed');
};

// Error handling
ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

Message formats:

Connection established (the message reads "Connected to the log stream of task xxx"):

{
  "type": "connected",
  "message": "已连接到任务 xxx 的日志流"
}

Log message (the message reads "Running crawler: netease:tech"):

{
  "type": "log",
  "data": {
    "level": "INFO",
    "message": "正在运行爬虫: netease:tech",
    "timestamp": "2024-01-20T10:30:15.123456"
  }
}

Completion message (the message reads "Log stream ended"):

{
  "type": "completed",
  "message": "日志流结束"
}

Error message (the message holds the error description):

{
  "type": "error",
  "message": "错误描述"
}
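
The four message types above lend themselves to a small dispatcher on the client side. The following sketch turns one raw frame into a display line plus a flag saying whether the stream is over; the function name is hypothetical, and treating an `error` message as the end of the stream is an assumption, since the document does not say whether the connection stays open afterwards.

```python
import json


def handle_log_message(raw):
    """Return (text_to_display, stream_finished) for one WebSocket text frame."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "log":
        entry = msg["data"]
        return f"{entry['timestamp']} [{entry['level']}] {entry['message']}", False
    if kind == "connected":
        return msg["message"], False
    if kind == "completed":
        return msg["message"], True
    if kind == "error":
        # Assumption: an error message ends the stream.
        return f"ERROR: {msg['message']}", True
    # Unknown message types are ignored rather than treated as fatal.
    return None, False
```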

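Task Statuses

A task is in one of the following statuses:

Status	Description
`pending`	The task has been created and is waiting to run
`running`	The task is running
`completed`	The task finished successfully
`failed`	The task failed
`cancelled`	The task was cancelled

Statistics fields:

crawled_count - number of articles crawled successfully
inserted_count - number of articles inserted into the database
duplicate_count - number of duplicate articles (not inserted)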

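Error Responses

When a request fails, the API returns the corresponding HTTP status code together with an error message.

HTTP Status Codes

Status code	Description
200	Request succeeded
400	Invalid request parameters
404	Resource not found
500	Internal server error
501	Not implemented

Error Response Format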

{
  "detail": "错误描述信息"
}

Example (the detail reads "Malformed crawler name: 'invalid_crawler', expected 'source:category'"):

{
  "detail": "爬虫名称格式错误: 'invalid_crawler'，应为 'source:category'"
}
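
On the client, the `detail` field can be lifted into an exception so callers do not have to inspect status codes by hand. This is a sketch under the assumption that 200 is the only success code, as in the table above; `ApiError` and `parse_response` are hypothetical names, not part of the API.

```python
import json


class ApiError(Exception):
    """Raised for any non-200 response; carries the status code and detail text."""

    def __init__(self, status_code, detail):
        super().__init__(f"HTTP {status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail


def parse_response(status_code, body_text):
    """Return the decoded JSON body for a 200 response, otherwise raise ApiError."""
    try:
        payload = json.loads(body_text) if body_text else None
    except json.JSONDecodeError:
        payload = None
    if status_code == 200:
        return payload
    detail = payload.get("detail") if isinstance(payload, dict) else None
    # Fall back to the raw body when the error is not in the documented format.
    raise ApiError(status_code, detail or body_text)
```

With `requests`, this would be called as `parse_response(response.status_code, response.text)`.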

Usage Examples

Python Example

import requests

BASE_URL = "http://localhost:8000"

# 1. List all available crawlers
response = requests.get(f"{BASE_URL}/api/v1/crawlers")
response.raise_for_status()
crawlers = response.json()
print(f"Available crawlers: {len(crawlers)}")

# 2. Run a crawler
task_request = {
    "crawlers": ["netease:tech"],
    "max_articles": 5
}
response = requests.post(
    f"{BASE_URL}/api/v1/crawlers/run",
    json=task_request
)
response.raise_for_status()
task = response.json()
task_id = task["task_id"]
print(f"Task ID: {task_id}")

# 3. Query the task status
response = requests.get(f"{BASE_URL}/api/v1/tasks/{task_id}")
response.raise_for_status()
task_status = response.json()
print(f"Task status: {task_status['status']}")

# 4. Get the task history
response = requests.get(
    f"{BASE_URL}/api/v1/tasks",
    params={"status": "completed", "page": 1, "size": 10}
)
response.raise_for_status()
history = response.json()
print(f"Completed tasks: {history['total']}")

JavaScript Example

const BASE_URL = 'http://localhost:8000';

// 1. List all available crawlers
async function getCrawlers() {
  const response = await fetch(`${BASE_URL}/api/v1/crawlers`);
  const crawlers = await response.json();
  console.log('Available crawlers:', crawlers);
  return crawlers;
}

// 2. Run crawlers
async function runCrawlers(crawlerList, maxArticles) {
  const response = await fetch(`${BASE_URL}/api/v1/crawlers/run`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      crawlers: crawlerList,
      max_articles: maxArticles
    })
  });
  const task = await response.json();
  console.log('Task ID:', task.task_id);
  return task.task_id;
}

// 3. Connect to the WebSocket to watch live logs
function connectLogStream(taskId) {
  const ws = new WebSocket(`ws://localhost:8000/ws/logs/${taskId}`);

  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.type === 'log') {
      console.log(`[${data.data.level}] ${data.data.message}`);
    }
  };

  return ws;
}

// Usage
(async () => {
  const crawlers = await getCrawlers();
  const taskId = await runCrawlers(['netease:tech'], 5);
  connectLogStream(taskId);
})();

Category ID Reference

ID	Category	Identifier
1	Entertainment	entertainment
2	Sports	sports
3	Finance	money/finance
4	Tech	tech
5	Military	war
6	Auto	auto
7	Government	gov
8	Health	health
9	AI	ai
10	Real estate	house
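
The table above can be mirrored in a lookup so that a crawler name such as `netease:tech` resolves to the `category_id` used by the news list filter. This sketch assumes that the category part of a crawler name always matches one of the identifiers in the table, which the document suggests but does not state; the names `CATEGORY_ALIASES`, `SLUG_TO_ID`, and `category_id_for` are made up for illustration.

```python
# Identifiers per category ID, copied from the table; ID 3 lists two identifiers.
CATEGORY_ALIASES = {
    1: ("entertainment",),
    2: ("sports",),
    3: ("money", "finance"),
    4: ("tech",),
    5: ("war",),
    6: ("auto",),
    7: ("gov",),
    8: ("health",),
    9: ("ai",),
    10: ("house",),
}

# Reverse index: identifier -> category ID.
SLUG_TO_ID = {slug: cid for cid, slugs in CATEGORY_ALIASES.items() for slug in slugs}


def category_id_for(crawler_name):
    """Map a 'source:category' crawler name to its numeric category ID."""
    _, _, slug = crawler_name.partition(":")
    if slug not in SLUG_TO_ID:
        raise KeyError(f"unknown category in crawler name {crawler_name!r}")
    return SLUG_TO_ID[slug]
```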

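Notes

Concurrency limit: at most 3 crawler tasks run at the same time
Timeouts: by default, page loads time out after 60 seconds and scripts after 30 seconds
Deduplication: duplicate news items are filtered out automatically (by URL and content hash)
Log retention: at most 1000 log entries are kept in memory
WebSocket: in production, use SSL/TLS-encrypted WebSocket connections (wss://)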
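
To make the deduplication note concrete, here is one way filtering by URL and content hash could work. It is an illustration only: the document does not say which hash function or storage the system uses, so SHA-256 and the in-memory sets below are assumptions, and `Deduplicator` is a hypothetical name.

```python
import hashlib


class Deduplicator:
    """Reject an article if its URL or its content hash has been seen before."""

    def __init__(self):
        self._urls = set()
        self._hashes = set()

    def is_new(self, url, content):
        # Assumption: SHA-256 over the UTF-8 encoded article text.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if url in self._urls or digest in self._hashes:
            return False
        self._urls.add(url)
        self._hashes.add(digest)
        return True
```

Checking both keys catches the same article republished under a different URL as well as a revisited URL whose text has changed.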

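Changelog

v2.0.0 (2024-01-20)

✅ Added the Web API
✅ Asynchronous crawler execution
✅ Real-time log push over WebSocket
✅ Task history
🚧 Scheduled tasks (in development)
🚧 Data preview (in development)

Technical Support

For questions or suggestions, consult the project documentation or contact the development team.

Online API documentation: http://localhost:8000/docs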
