Python Tencent News crawler

Crawls news stories and images from the Tencent News homepage.

Modules used

  • requests to request and download web pages
  • beautifulsoup to parse the pages
  • re to process strings

header spoofing

  • Makes your requests look like normal browser traffic (a minimal sketch follows)
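For example, a minimal sketch of attaching a spoofed User-Agent, using the same header values as the full script at the end of this post:

import requests

# Pretend to be a desktop Chrome browser so the server treats
# the request as ordinary traffic.
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9",
}
resp = requests.get("http://news.qq.com/", headers=header)
print(resp.status_code)  # 200 if the page was fetched successfully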

proxies

  • Routes requests through a free proxy node, which is safer (sketch below)
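A minimal sketch of routing requests through a proxy; the node address is the one from the script below, and free nodes like this go stale quickly, so substitute a live one:

import requests

# Route traffic through a free proxy node so the target site sees
# the proxy's address instead of ours; map each scheme you use.
proxies = {
    'http': '223.203.0.14:8000',
    'https': '223.203.0.14:8000',
}
resp = requests.get("http://news.qq.com/", proxies=proxies)
print(resp.status_code)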

beautifulsoup

  • Uses beautifulsoup to pinpoint exactly where the target data sits in the page (sketch below)
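A sketch of the lookup pattern used in the full script: find_all narrows the page down to the story blocks by class, and find pulls the headline link out of each block. The class names are those of the Tencent News homepage markup at the time of writing and may change:

import requests
from bs4 import BeautifulSoup

html = requests.get("http://news.qq.com/").text
soup = BeautifulSoup(html, "lxml")

# Each story on the homepage lives in a div of class Q-tpList;
# the headline is an <a> tag of class linkto inside it.
for block in soup.find_all('div', attrs={'class': 'Q-tpList'}):
    link = block.find('a', attrs={'class': 'linkto'})
    if link is not None:
        print(link.text)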

requests

  • Uses requests with the header and proxies to download the page
  • Uses the response's .content attribute to save the raw image data (sketch below)
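A sketch of saving an image via .content, which holds the raw bytes of the response body (unlike .text, which decodes it as a string). The image URL here is a hypothetical placeholder:

import requests

# .content keeps the undecoded bytes of the body, which is what
# a binary file like a JPEG needs.
img_url = "http://example.com/some-image.jpg"  # hypothetical URL
r = requests.get(img_url)
with open("some-image.jpg", "wb") as f:
    f.write(r.content)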
#!/usr/bin/python
# -*- coding:utf-8 -*-
import os
import re

import requests
from bs4 import BeautifulSoup

# Spoofed browser headers so the request looks like normal traffic.
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9"}

# Free proxy node; these go stale quickly, so substitute a live one.
proxies = {
    'https': '223.203.0.14:8000'
}

url = "http://news.qq.com/"

# Download the homepage HTML with the spoofed headers and the proxy.
news = requests.get(url, headers=header, proxies=proxies).text

soup = BeautifulSoup(news, "lxml")

# Make sure the output directory exists before writing images into it.
os.makedirs('./img', exist_ok=True)

try:
    # Each story block on the homepage is a div of class Q-tpList.
    for house in soup.find_all('div', attrs={'class': 'Q-tpList'}):
        name = house.find('a', attrs={'class': 'linkto'})
        img = house.find('img', attrs={'class': 'picto'})
        if name is None or img is None:
            continue

        # Protocol-relative URLs (//...) need a scheme prepended.
        if re.match(r'^//', img['src']):
            img['src'] = 'http:' + img['src']

        # .content holds the raw bytes of the image response.
        r = requests.get(img['src'])
        with open('./img/{}.jpg'.format(name.text), 'wb') as f:
            f.write(r.content)
except Exception as e:
    print(e)