Web Crawler Notes 3

Table of Contents

I. Requests Library Exercises

1. Searching keywords on Baidu and 360

2. Downloading an image and saving it locally

II. Information Extraction for Web Crawlers: the Beautiful Soup Library

1. Installing Beautiful Soup

2. Getting page source with Beautiful Soup

3. BeautifulSoup usage format

4. Basic elements of Beautiful Soup

    Beautiful Soup parsers

    Basic element classes of Beautiful Soup

5. Traversing HTML content with bs4

    Downward traversal of the tag tree

    Upward traversal of the tag tree

    Sibling traversal of the tag tree

Summary

6. Formatted HTML output with bs4


I. Requests Library Exercises

 

  • raise_for_status(): if the returned status code is 200, no exception is raised; otherwise an exception is raised
  • Check whether the page can be accessed before every crawl (see the sketch below)
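
A minimal sketch of this check-before-crawl pattern, assuming a generic helper; the function name getHTMLText and the 30-second timeout are illustrative, not fixed by these notes:

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)     # illustrative timeout
        r.raise_for_status()                  # raises an exception if the status code is not 200
        r.encoding = r.apparent_encoding      # guess the encoding from the content
        return r.text
    except requests.RequestException:
        return "Crawl failed"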

1. Searching keywords on Baidu and 360

  • Baidu keyword search: http://www.baidu.com/s?wd=keyword
  • 360 keyword search: http://www.so.com/s?q=keyword
import requests
kv={'wd':'Python'}
r=requests.get("http://www.baidu.com/s",params=kv)
r.status_code
>>>200
r.request.url
>>>'http://www.baidu.com/s?wd=Python'
print(r.request.url)
>>>http://www.baidu.com/s?wd=Python
print(r.text[1000:2000])

When the response is very long, printing r.text in full may freeze IDLE, so it is better to restrict the output to a slice, as above. A 360 search works the same way; see the sketch below.
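
A minimal sketch of the equivalent request against 360 search, assuming only the q parameter listed above; the keyword is illustrative:

import requests
kv = {'q': 'Python'}                       # 360 uses the q parameter instead of wd
r = requests.get("http://www.so.com/s", params=kv)
print(r.status_code)
print(r.request.url)                       # e.g. http://www.so.com/s?q=Python
print(r.text[1000:2000])                   # print only a slice to avoid overwhelming the shell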

2. Downloading an image and saving it locally

  • Consider every situation that might occur (a hardened variant follows the code below)

import requests
import os

root = 'E://pictures//'
url = 'https://cj.jj20.com/2020/down.html?picurl=/up/allimg/tp03/1Z9211U233AA-0.jpg'
path = root + url.split('/')[-1]      # use the last segment of the URL as the file name
try:
    if not os.path.exists(root):
        os.mkdir(root)                # create the target directory if it is missing
    if not os.path.exists(path):
        r = requests.get(url=url)
        with open(path, 'wb') as f:   # the with statement closes the file automatically
            f.write(r.content)
            print("File saved successfully")
    else:
        print("File already exists")
except:
    print("Crawl failed")

II. Information Extraction for Web Crawlers: the Beautiful Soup Library

1. Installing Beautiful Soup

pip install beautifulsoup4

Beautiful Soup is a library for parsing HTML and XML documents.

2. Getting page source with Beautiful Soup

from bs4 import BeautifulSoup
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # html.parser is the built-in HTML parser used to interpret the markup
print(soup.prettify())  # print the formatted source code

Success: BeautifulSoup parsed the demo page.

3. BeautifulSoup usage format

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>data</p>', 'html.parser')
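
The constructor also accepts an open file object instead of a string; a minimal sketch, assuming a local file named demo.html exists:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('demo.html', encoding='utf-8'), 'html.parser')  # parse from a file handle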

4. Basic elements of Beautiful Soup

Beautiful Soup parsers
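
A quick sketch of the parser names the BeautifulSoup constructor accepts; html.parser ships with Python, while lxml and html5lib are optional third-party installs (pip install lxml, pip install html5lib):

from bs4 import BeautifulSoup

demo = '<html><body><p class="title">data</p></body></html>'
soup1 = BeautifulSoup(demo, 'html.parser')   # Python's built-in HTML parser
soup2 = BeautifulSoup(demo, 'lxml')          # lxml's HTML parser, fast
soup3 = BeautifulSoup(demo, 'xml')           # lxml's XML parser
soup4 = BeautifulSoup(demo, 'html5lib')      # most lenient, browser-like parsing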

Basic element classes of Beautiful Soup
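
The basic element types are:

  • Tag: a pair of opening and closing tags with their content, e.g. soup.a
  • Name: the tag's name as a string, <tag>.name
  • Attributes: the tag's attributes as a dict, <tag>.attrs
  • NavigableString: the text inside a tag, <tag>.string
  • Comment: a special kind of string for comments, <!-- ... -->

The session below demonstrates the first three on the demo page; a sketch of the last two follows it.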

from bs4 import BeautifulSoup
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
soup.title
>>><title>This is a python demo page</title>
tag=soup.a  # only the first <a> tag is returned
tag
>>><a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> 

soup.a.parent.name
>>>'p'
soup.a.name
>>>'a'

soup.a.parent.parent.name
>>>'body'
tag=soup.a
tag.attrs
>>>{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
tag.attrs['href']
>>>'http://www.icourse163.org/course/BIT-268001'
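
Continuing the same session, a small sketch of the NavigableString and Comment types; the extra markup string is illustrative:

soup.a.string                     # NavigableString: the text between the tags
>>>'Basic Python'
newsoup = BeautifulSoup('<b><!--This is a comment--></b><p>This is not a comment</p>', 'html.parser')
newsoup.b.string                  # a comment is also returned by .string ...
>>>'This is a comment'
type(newsoup.b.string)            # ... but its type is Comment
>>><class 'bs4.element.Comment'>
type(newsoup.p.string)
>>><class 'bs4.element.NavigableString'>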

5. Traversing HTML content with bs4

Downward traversal of the tag tree

soup.head.contents
>>>[<title>This is a python demo page</title>]
soup.body.contents
>>>['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
len(soup.body.contents)
>>>5
soup.body.contents[1]
>>><p class="title"><b>The demo python introduces several python courses.</b></p>

# you can also iterate over the children with a loop
for child in soup.body.children:
    print(child)
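
Besides .contents and .children, a tag can also be walked recursively; a minimal sketch using .descendants, which yields every node below a tag rather than only its direct children:

for child in soup.body.descendants:   # descendants: children, grandchildren, and so on
    print(child)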

Upward traversal of the tag tree

for parent in soup.a.parents:
    if parent is None:  # the iteration reaches soup itself, and soup's parent is empty, hence the check
        print(parent)
    else:
        print(parent.name)

>>>        
p
body
html
[document]

Sibling traversal of the tag tree

  • Sibling traversal happens among the nodes that share the same parent
  • The next node returned by sibling traversal is not necessarily a Tag; it may be plain text such as ' and ' or '\n' (see the filtering sketch after the loop below)

soup.a.next_sibling
>>>' and '
soup.a.next_sibling.next_sibling
>>><a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a> 

  • Iterating over the preceding sibling nodes (in a loop)

for sibling in soup.a.previous_siblings:
    print(sibling)
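
Since sibling traversal also yields string nodes such as ' and ' or '\n', an isinstance check can keep only real tags; a minimal sketch:

from bs4.element import Tag

for sibling in soup.a.next_siblings:
    if isinstance(sibling, Tag):          # skip NavigableString nodes such as ' and '
        print(sibling.name, sibling.get('href'))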

Summary
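
A brief recap of the traversal attributes:

  • Downward: .contents, .children, .descendants
  • Upward: .parent, .parents
  • Sibling: .next_sibling, .previous_sibling, .next_siblings, .previous_siblings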

6. Formatted HTML output with bs4

  • print(soup.prettify())
  • print(soup.a.prettify())
    >>><a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
     Basic Python
    </a>
  • soup.a.prettify()
    >>>'<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n Basic Python\n</a>\n'

prettify() inserts a newline ('\n') after each tag and its content, which makes the HTML easier to read.
