Posted on 13 Jun 2014
平台
WINDOWS 8.1 x64
工具
python 2.7.7
模块
urllib2
chardet
获取页面内容
import urllib2
url = 'http://www.bqg5200.com/quanbu.html'
c = urllib2.urlopen(url, timeout=120).read()
获取页面编码
import chardet
chardit1 = chardet.detect(c)
charset = chardit1['encoding']
根据页面编码对抓取内容进行编码转换
if(charset == 'utf-8' or charset == 'UTF-8'):
s = c.encode('utf-8')
else:
ch = c.decode('gbk', 'ignore')
s = ch.encode('utf-8')
print s
正则提取信息
import re
url_pattern = re.compile(r'<li><a href="([^\"]+)">([^\<]+)<\/a>/([^\<]+)</li>')
results=re.findall(url_pattern,c)
re.compile是将字符串编译为用于python正则式的模式,字符前的r表示是纯字符,这样就不需要对元字符进行两次转义。re.findall返回的是字符串中符合url_pattern的列表,由于在url_pattern中使用了子表达式,所以results存储的就是子表达式所匹配的内容,即与之间的内容。
输出
for a in results:
print "%-20s -\t%s" %(a[1],a[2])
全部代码
# coding: UTF-8
import urllib2
import chardet
import re
#url = 'http://www.baidu.com/s?wd=cloga'
url = 'http://www.bqg5200.com/quanbu.html'
c = urllib2.urlopen(url, timeout=120).read()
chardit1 = chardet.detect(c)
charset = chardit1['encoding']
if(charset == 'utf-8' or charset == 'UTF-8'):
s = c.encode('utf-8')
else:
ch = c.decode('gbk', 'ignore')
s = ch.encode('utf-8')
##<li><a href="http://www.bqg5200.com/0_549/">剑凌九重天</a>/轩辕亮</li>
url_pattern = re.compile(r'<li><a href="([^\"]+)">([^\<]+)<\/a>/([^\<]+)</li>')
results=re.findall(url_pattern,c)
for a in results:
print "%-20s -\t%s" %(a[1],a[2])
相关
chardet安装
python setup.py install