Parsing Web Pages with Python BeautifulSoup4
Published: 2019-06-07


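from bs4 import BeautifulSoup as BS
import re

# Sample document. The markup here is an assumption: it follows the familiar
# "three sisters" sample, with the first link's text placed inside an HTML
# comment so that .text and .string behave differently. The example.com URLs
# are placeholders.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BS(html, 'html.parser')

for i in soup.find_all('a'):
    print('i.text:', i.text)      # str; text inside an HTML comment is not included
    print('i.string:', i.string)  # NavigableString object; commented-out text is returned too (as a Comment)

print('soup.head.contents:', soup.head.contents, type(soup.head.contents))
print('soup.head.children:', soup.head.children, type(soup.head.children))
print('soup.body.contents:', soup.body.contents)  # list of direct children
print('soup.body.children:', soup.body.children)  # iterator over direct children
for i in soup.body.children:
    print(i)

print('All descendants:')
for i in soup.body.descendants:
    print(i)

print('soup.body.string:', soup.body.string)    # None when the tag has more than one child
print('soup.body.strings:', soup.body.strings)  # generator over every string under the tag
print('soup.body.stripped_strings:', soup.body.stripped_strings)  # strips surrounding whitespace, skips whitespace-only strings
print('Body strings with whitespace stripped:')
for i in soup.body.stripped_strings:
    print(i)

print('soup.a.parent:', soup.a.parent)
# Note: text nodes and bare newlines (\n) can also be the previous or next sibling.
print('soup.a.next_sibling:', soup.a.next_sibling)
print('soup.a.previous_sibling:', soup.a.previous_sibling)
print('soup.a.next_element:', soup.a.next_element)  # next parsed element, not necessarily a sibling
print('soup.a.previous_element:', soup.a.previous_element)

print('All following siblings:\n')
for i in soup.a.next_siblings:
    print(i)
print('soup.a.next_elements[1]:', list(soup.a.next_elements)[1])

print('***********find_all*****')
print(soup.find_all('a'))

print('Using a regular expression:')
print(soup.find_all(re.compile(r'^title')))  # the regex is matched against tag names

print('Matching with a list:')
print(soup.find_all(['a', 'b']))

print('Matching with a function, similar to filter:')
def func(tag):
    # Keep tags that have a class attribute and whose name starts with "a".
    return tag.has_attr('class') and re.search(r'^a', tag.name) is not None
print(soup.find_all(func))

# The same sample document is parsed again for the attribute and text searches.
soup = BS(html, 'html.parser')

print('Searching by attribute value:')
print(soup.find_all(id='link1'))
print(soup.find_all('a', id='link1'))
print(soup.find_all(id='link2', href=re.compile(r'laci')))  # every call returns a list
print(soup.find_all(class_='story'))  # note the trailing underscore: class is a Python keyword
print(soup.find_all(attrs={'class': 'sister'}))

# The text argument matches string content; Beautiful Soup 4.4.0 and later also accept it as string=.
print('Searching by element content with the text argument:')
print(soup.find_all(text='Tillie'))
print(soup.find_all(text=['Tillie', 'Lacie']))  # returns the matching strings, not the tags
print(soup.find_all(text=re.compile(r'ormous')))

print('Going from a matched string up to its enclosing elements:')
print(soup.find_all(text=re.compile(r'ormous'))[1].parent.parent)

# Limit the number of results.
print('limit:')
print(soup.find_all('a', limit=2))

print('Searching direct children only:')
print(soup.body.find_all('a', limit=2, recursive=False))  # recursive=False checks direct children only
print(soup.body.find_all(class_='story', recursive=False))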
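The difference between .text and .string noted in the comments above is easiest to see on a link whose only content is an HTML comment. The following is a minimal sketch, assuming a small made-up document; SAMPLE and the variable names are illustrative and do not come from the original post.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup (an assumption, not the post's document).
SAMPLE = """<html><head><title>Demo</title></head>
<body>
<p class="story">
<a id="link1"><!-- Elsie --></a>
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>
</p>
</body></html>"""

soup = BeautifulSoup(SAMPLE, "html.parser")
first_link = soup.find("a", id="link1")

link_text = first_link.text      # plain str; comment content is left out
link_string = first_link.string  # the single child node, here a Comment
is_comment = isinstance(link_string, Comment)

body_children = soup.body.contents  # a list of direct children
body_iter = soup.body.children      # an iterator over the same children

# Strings under <body>, stripped, with whitespace-only strings skipped.
stripped = list(soup.body.stripped_strings)

print(repr(link_text), repr(link_string), is_comment)
print(stripped)
```

Checking isinstance(node, Comment) is one way to tell real link text apart from commented-out text before using the result of .string.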

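The sibling and element navigation used above can be surprising because whitespace and punctuation between tags are text nodes too. Here is a short sketch, again on an assumed one-paragraph sample (SAMPLE and all variable names are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for illustration only).
SAMPLE = (
    '<p class="story">Once '
    '<a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a>; the end.</p>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")
first = soup.a

parent_name = first.parent.name    # the enclosing <p>
prev_sib = first.previous_sibling  # text node before the link
next_sib = first.next_sibling      # text node after the link, not the next <a>
next_el = first.next_element       # first thing parsed after the opening tag: the link's own text

# find_next_siblings skips the text nodes and returns only matching tags.
following_ids = [tag["id"] for tag in first.find_next_siblings("a")]

print(parent_name, repr(prev_sib), repr(next_sib), repr(next_el), following_ids)
```

When only tags are wanted, find_next_siblings (or find_next_sibling) tends to be more convenient than stepping through next_sibling by hand.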
 

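To summarize the find_all filters exercised above, the sketch below collects one call per filter kind and reduces each result to names or ids so the outcome is easy to compare. The sample document, the example.com URLs, and the helper has_class_and_a_name are assumptions made for this illustration; string= is used in place of text=, which requires Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the post's exact document).
SAMPLE = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Three sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(SAMPLE, "html.parser")


def has_class_and_a_name(tag):
    """Function filter: tags with a class attribute whose name starts with 'a'."""
    return tag.has_attr("class") and tag.name.startswith("a")


by_name = soup.find_all("a")                      # exact tag name
by_regex = soup.find_all(re.compile(r"^t"))       # regex against tag names
by_list = soup.find_all(["a", "b"])               # any name in the list
by_func = soup.find_all(has_class_and_a_name)     # custom predicate
by_attrs = soup.find_all(id="link2", href=re.compile(r"laci"))
by_class = soup.find_all(class_="story")          # class_ avoids the keyword
by_string = soup.find_all(string=["Tillie", "Lacie"])  # returns strings, not tags
limited = soup.find_all("a", limit=2)             # stop after two matches
direct_links = soup.body.find_all("a", recursive=False)         # <a> is nested in <p>
direct_story = soup.body.find_all(class_="story", recursive=False)

print([t.name for t in by_list])
print([str(s) for s in by_string])
print(len(direct_links), len(direct_story))
```

In this sample the links sit inside a paragraph, so the recursive=False search for links on the body finds nothing, while the two story paragraphs are direct children and are found.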
Reposted from: https://www.cnblogs.com/xiaoxiao075/p/10925489.html
