news-classifier/crawler-module/tencent-war.txt

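This is working code for crawling news in the military category from Tencent News.
Note that the code for parsing Tencent News article detail pages is generic and is not given here; use tencent_parser.py.
Note that this channel relies on dynamic loading: inherit from DynamicCrawler. There is no need to override _fetch_page().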
```python
import requests
from bs4 import BeautifulSoup

URL = "https://news.qq.com/ch/milite"
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

# Fetch the military channel page.
resp = requests.get(URL, headers=headers, timeout=10)
resp.raise_for_status()
resp.encoding = "utf-8"
# print(resp.text)

# Alternative: parse a saved copy of the page instead of fetching it.
# with open("example/example-13.html", "r", encoding="utf-8") as f:
#     html = f.read()

soup = BeautifulSoup(resp.text, "lxml")
# soup = BeautifulSoup(html, "lxml")

# Keep only feed items whose dt-params contains article_type=0.
div_list = soup.select(
    "div[id='channel-feed-area'] div.channel-feed-list "
    "div.channel-feed-item[dt-params*='article_type=0']"
)
for div in div_list:
    link = div.select_one("a.article-title")
    if link is None:  # skip items without a title link
        continue
    print(link.get("href"))
```
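
The selector above keeps only feed items whose dt-params attribute contains the substring article_type=0, then takes the first a.article-title link in each item. Below is a minimal sketch of the same filtering using only the Python standard library, which can help when checking the rule against a saved page without bs4 or lxml installed. FeedLinkParser, extract_article_links, and SAMPLE_HTML are hypothetical names and made-up data, not part of the project. The sketch simplifies the selector: it does not check the channel-feed-area and channel-feed-list ancestors.

```python
from html.parser import HTMLParser


class FeedLinkParser(HTMLParser):
    """Collect the first a.article-title href inside each matching feed item."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._div_depth = 0       # current <div> nesting depth
        self._item_depth = None   # depth at which a matching feed item opened
        self._found = False       # True once this item has yielded a link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            self._div_depth += 1
            if (self._item_depth is None
                    and "channel-feed-item" in classes
                    and "article_type=0" in (attrs.get("dt-params") or "")):
                self._item_depth = self._div_depth
                self._found = False
        elif tag == "a" and self._item_depth is not None and not self._found:
            if "article-title" in classes and attrs.get("href"):
                self.links.append(attrs["href"])
                self._found = True

    def handle_endtag(self, tag):
        if tag == "div":
            if self._div_depth == self._item_depth:
                self._item_depth = None  # left the matching feed item
            self._div_depth -= 1


def extract_article_links(html):
    """Return article hrefs from feed items with article_type=0."""
    parser = FeedLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup for illustration only; the real page may differ.
SAMPLE_HTML = """
<div id="channel-feed-area"><div class="channel-feed-list">
  <div class="channel-feed-item" dt-params="article_id=a1&amp;article_type=0">
    <div><a class="article-title" href="https://example.com/a1">A1</a></div>
  </div>
  <div class="channel-feed-item" dt-params="article_id=v1&amp;article_type=4">
    <a class="article-title" href="https://example.com/v1">V1</a>
  </div>
</div></div>
"""

if __name__ == "__main__":
    for href in extract_article_links(SAMPLE_HTML):
        print(href)
```

On the sample markup only the first item passes the filter, because the second item's dt-params carries a different article_type value.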