What Python web crawlers are good for, and how to write one

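Before crawling a website, you should get a sense of its size and structure. The site's own robots.txt and Sitemap files usually help, as do external tools such as Google Search and WHOIS.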

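1. Check robots.txt

This file tells a crawler which restrictions apply when crawling the site, and it sometimes gives clues about the site's structure. It typically looks like one of the following examples.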

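Example 1: block all search engines from every part of the site

User-agent: *
Disallow: /

Example 2: allow all robots (an empty "/robots.txt" file has the same effect)

User-agent: *
Disallow:

Example 3: block one specific search engine

User-agent: BadBot
Disallow: /

Example 4: allow only one specific search engine

User-agent: Baiduspider
Disallow:

User-agent: *
Disallow: /

Example 5: whatever the user agent, leave a 5-second delay between requests; /trap is a disallowed link, and a crawler that requests it will have its IP banned

User-agent: *
Crawl-delay: 5
Disallow: /trap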
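A crawler can check these rules in code before it requests a page. The following is a minimal sketch using the standard library's urllib.robotparser, fed with the rules from the last example above (5-second crawl delay, /trap disallowed). The user agent name "MyCrawler" and the example.com URLs are hypothetical; a real crawler would first download the site's robots.txt instead of using a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Rules from the last example above, hard-coded so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /trap
"""


def load_rules(robots_text):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "MyCrawler" and example.com are placeholders, not names from the article.
allowed_home = rules.can_fetch("MyCrawler", "http://example.com/index.html")
allowed_trap = rules.can_fetch("MyCrawler", "http://example.com/trap")
delay = rules.crawl_delay("MyCrawler")

print(allowed_home, allowed_trap, delay)
```

A polite crawler would skip any URL for which can_fetch returns False and sleep for the returned delay between requests.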

2. Check the Sitemap

The Sitemap file helps a crawler locate the site's most recent content.
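As a rough illustration, the sketch below reads a Sitemap in the standard sitemaps.org XML format and lists its URLs with the most recently modified first. The sample document and the example.com URLs are made up for this example, and the sort assumes lastmod values are ISO-style dates that order correctly as strings.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical Sitemap content; a real crawler would download this file.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc><lastmod>2018-01-01</lastmod></url>
  <url><loc>http://example.com/b</loc><lastmod>2018-02-09</lastmod></url>
</urlset>
"""


def sitemap_entries(sitemap_xml):
    """Return (lastmod, loc) pairs, newest first; missing lastmod sorts last."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url in root.iter(SITEMAP_NS + "url"):
        loc = url.findtext(SITEMAP_NS + "loc", default="").strip()
        lastmod = url.findtext(SITEMAP_NS + "lastmod", default="").strip()
        if loc:
            entries.append((lastmod, loc))
    return sorted(entries, reverse=True)


for lastmod, loc in sitemap_entries(SAMPLE_SITEMAP):
    print(lastmod, loc)
```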

3. Identify the technologies the site uses

Use the builtwith module, which can be installed with pip:

pip install builtwith

builtwith takes the site's URL as its argument, downloads the page, analyzes it, and returns the technologies the site uses.

>>> import builtwith
>>> builtwith.parse('https://www.csdn.net/')
{'web-servers': ['OpenResty', 'Nginx'], 'programming-languages': ['Lua'], 'javascript-frameworks': ['Modernizr', 'jQuery'], 'web-frameworks': ['Twitter Bootstrap']}

4. Find the site's owner

The WHOIS protocol can be used to look up who registered a domain. Python has a library that wraps this protocol:

pip install python-whois

>>> import whois
>>> whois.whois('https://www.csdn.net/')
{'domain_name': 'CSDN.NET',
 'registrar': 'NETWORK SOLUTIONS, LLC.',
 'whois_server': 'whois.networksolutions.com',
 'referral_url': None,
 'updated_date': [datetime.datetime(2017, 3, 10, 0, 52, 46), datetime.datetime(2018, 2, 9, 1, 43, 52)],
 'creation_date': datetime.datetime(1999, 3, 11, 5, 0),
 'expiration_date': datetime.datetime(2020, 3, 11, 4, 0),
 'name_servers': ['NS3.DNSV3.COM', 'NS4.DNSV3.COM'],
 'status': 'clientTransferProhibited https://icann.org/epp#clientTransferProhibited',
 'emails': ['abuse@web.com', 'Jiangtao@CSDN.NET'],
 'dnssec': 'unsigned',
 'name': 'Beijing Chuangxin Lezhi Co.ltd',
 'org': 'Beijing Chuangxin Lezhi Co.ltd',
 'address': 'B3-2-1 ZHaowei Industry Park',
 'city': '等待的雨',
 'state': 'Beijing',
 'zipcode': '100016',
 'country': 'CN'}
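In the result above, updated_date is a list of datetimes while creation_date is a single datetime, so code that consumes these fields probably has to handle both shapes. Here is a small sketch of one way to do that. The helper name latest is hypothetical, and the record is a plain dict copied from the output above rather than a live WHOIS lookup.

```python
from datetime import datetime


def latest(value):
    """Return the most recent datetime from a field that may be a single
    datetime, a list of datetimes, or None."""
    if value is None:
        return None
    if isinstance(value, (list, tuple)):
        return max(value) if value else None
    return value


# Date fields copied from the WHOIS output shown above.
record = {
    "updated_date": [
        datetime(2017, 3, 10, 0, 52, 46),
        datetime(2018, 2, 9, 1, 43, 52),
    ],
    "creation_date": datetime(1999, 3, 11, 5, 0),
    "expiration_date": datetime(2020, 3, 11, 4, 0),
}

last_updated = latest(record["updated_date"])
created = latest(record["creation_date"])
print(last_updated, created)
```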
