Getting started
This article uses the web page http://reeoo.com as its running example, as shown in the figure below.
Initializing the BeautifulSoup object
Pass a document to the BeautifulSoup constructor to get a document object. Here the document is fetched by requesting the URL.
Printed result:
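The initialization step can be sketched roughly as follows. This is a minimal sketch, not the article's original code: `SAMPLE_HTML`, `make_soup`, and `fetch_soup` are hypothetical names introduced for illustration, the inline HTML string only stands in for the downloaded page so the example runs offline, and Python's built-in `html.parser` is assumed as the parser.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<html>
  <head><title>Reeoo - web design inspiration and website gallery</title></head>
  <body><p>placeholder body</p></body>
</html>
"""


def make_soup(markup):
    """Build a document object from an HTML string or bytes."""
    return BeautifulSoup(markup, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and parse it (needs network access; not called here)."""
    with urlopen(url, timeout=timeout) as response:
        return make_soup(response.read())


soup = make_soup(SAMPLE_HTML)
print(soup.title)
```

With network access, `fetch_soup("http://reeoo.com")` would build the same kind of object from the live page, although the markup served there today may differ from what the article shows.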
&lt;title&gt;Reeoo - web design inspiration and website gallery&lt;/title&gt;
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);"><span style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;font-weight: 700;">Name</span></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">The name attribute of a Tag object gives the tag's name.</p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">print(tag.name)</p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);"># title</p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);"><span style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;font-weight: 700;">Attributes</span></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">A tag can carry many attributes, such as id and class. Tag attributes are read and modified the same way as dictionary entries.</p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">For example, the article tag that wraps the thumbnail area of the page:</p>
<p><img class="" data-ratio="0.990625" referrerpolicy="no-referrer" data- referrerpolicy="no-referrer" src="http://mmbiz.qpic.cn/mmbiz_jpg/pOTh2wdMWXpe5Qr0J9mMnb3DQZ6icnnMcfMCnuqnvs7VubqTiaJfrO9dTnEllSgo6h2akFBcMaCeGjVWkXibtrzAQ/640?wx_fmt=jpeg" data-type="jpeg" data-w="640" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;border-style: none;display: block;margin: 10px auto;"></p>
<p class="pgc-img-caption" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;text-align: center;font-size: 12px;color: rgb(119, 119, 119);line-height: 16px;"><br></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);"><span style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;font-weight: 700;">Strings inside a tag</span></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">Use the .string attribute to get the string contained in a tag.</p>
<p><img class="" data-ratio="0.509375" referrerpolicy="no-referrer" data- referrerpolicy="no-referrer" src="http://mmbiz.qpic.cn/mmbiz_jpg/pOTh2wdMWXpe5Qr0J9mMnb3DQZ6icnnMcuALsYYrf2WuCpRzVoicKq1r1K4nrXMskrCwmh5tQ8w21d7e6RH8lw7g/640?wx_fmt=jpeg" data-type="jpeg" data-w="640" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;border-style: none;display: block;margin: 10px auto;"></p>
<p class="pgc-img-caption" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;text-align: center;font-size: 12px;color: rgb(119, 119, 119);line-height: 16px;"><br></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">As shown in the figure below:</p>
<p><img class="" data-ratio="0.503125" referrerpolicy="no-referrer" data- referrerpolicy="no-referrer" src="http://mmbiz.qpic.cn/mmbiz_jpg/pOTh2wdMWXpe5Qr0J9mMnb3DQZ6icnnMcaicaeOoeicpHh13akK3nOhCC3QO2zfau53oZ8jWXHbOheoXvN2CX2QEg/640?wx_fmt=jpeg" data-type="jpeg" data-w="640" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;border-style: none;display: block;margin: 10px auto;"></p>
<p class="pgc-img-caption" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;text-align: center;font-size: 12px;color: rgb(119, 119, 119);line-height: 16px;"><br></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">We want to get the li elements inside the article tag.</p>
<p><img class="" data-ratio="0.9609375" referrerpolicy="no-referrer" data- referrerpolicy="no-referrer" src="http://mmbiz.qpic.cn/mmbiz_jpg/pOTh2wdMWXpe5Qr0J9mMnb3DQZ6icnnMc1BbRqlsUXsKZGKN6horvibDX9FoLhGZhQv8ibxvx86yPenp5uqiay7X9w/640?wx_fmt=jpeg" data-type="jpeg" data-w="640" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;border-style: none;display: block;margin: 10px auto;"></p>
<p class="pgc-img-caption" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;text-align: center;font-size: 12px;color: rgb(119, 119, 119);line-height: 16px;"><br></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">Printing contents shows that the list contains not only the li tags but also the newline characters '\n'.</p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">A tag's .children generator lets you loop over its child nodes.</p>
<p><img class="" data-ratio="0.9953125" referrerpolicy="no-referrer" data- referrerpolicy="no-referrer" src="http://mmbiz.qpic.cn/mmbiz_jpg/pOTh2wdMWXpe5Qr0J9mMnb3DQZ6icnnMcrD0sQQrb0nt1cE7ONPfShPoiadOqpPYs18iasTrFpsRo0HBd1IAXp0Rg/640?wx_fmt=jpeg" data-type="jpeg" data-w="640" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;border-style: none;display: block;margin: 10px auto;"></p>
<p class="pgc-img-caption" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;text-align: center;font-size: 12px;color: rgb(119, 119, 119);line-height: 16px;"><br></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);"><span style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;font-weight: 700;">Searching the document tree</span></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">Searching the tree-structured document for specific elements is the most common operation when scraping.</p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);"><span style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;font-weight: 700;">find_all()</span></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">find_all(name, attrs, recursive, string, **kwargs)</p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);"><span style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;font-weight: 700;">The name parameter</span></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">Finds all tags whose name is name.</p>
<p><img class="" data-ratio="1.153125" referrerpolicy="no-referrer" data- referrerpolicy="no-referrer" src="http://mmbiz.qpic.cn/mmbiz_jpg/pOTh2wdMWXpe5Qr0J9mMnb3DQZ6icnnMcCz6bqLsNUMw0Rl4gWdIadcblRn3EnRJtH4JMUgny5zp8WibCyBJTrXg/640?wx_fmt=jpeg" data-type="jpeg" data-w="640" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;border-style: none;display: block;margin: 10px auto;"></p>
<p class="pgc-img-caption" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;text-align: center;font-size: 12px;color: rgb(119, 119, 119);line-height: 16px;"><br></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">The value of a named attribute argument can be a string, a regular expression, a list, or True/False.</p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);"><span style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;font-weight: 700;">True/False</span></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">Whether the given attribute is present.</p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">Search for all tags that have a target attribute:</p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">soup.find_all(target=True)</p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">Search for all tags without a target attribute. (Look closely and the results still show tags with target. Those are child tags nested inside tags that lack target, so keep this in mind.)</p>
<p><img class="" data-ratio="0.95" referrerpolicy="no-referrer" data- referrerpolicy="no-referrer" src="http://mmbiz.qpic.cn/mmbiz_jpg/pOTh2wdMWXpe5Qr0J9mMnb3DQZ6icnnMcrNRRUffRY3mEx9PibTk378of0rmq9AGHj6XuibaRDOk3Ticibia55KwWoOg/640?wx_fmt=jpeg" data-type="jpeg" data-w="640" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;border-style: none;display: block;margin: 10px auto;"></p>
<p class="pgc-img-caption" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;text-align: center;font-size: 12px;color: rgb(119, 119, 119);line-height: 16px;"><br></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);"><span style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;font-weight: 700;">The attrs parameter</span></p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">Pass a dictionary to search for tags with the corresponding attributes. This goes some way toward solving the problem mentioned above, that certain attributes cannot be used as keyword arguments.</p>
<p style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;margin-top: 16px;margin-bottom: 16px;color: rgb(34, 34, 34);font-family: 'PingFang SC', 'Hiragino Sans GB', 'Microsoft YaHei', 'WenQuanYi Micro Hei', 'Helvetica Neue', Arial, sans-serif;font-size: 16px;font-variant-ligatures: normal;orphans: 2;white-space: normal;widows: 2;background-color: rgb(255, 255, 255);">For example, search for tags that have a data-original attribute:</p>
<p><img class="" data-ratio="0.6484375" referrerpolicy="no-referrer" data- referrerpolicy="no-referrer" src="http://mmbiz.qpic.cn/mmbiz_jpg/pOTh2wdMWXpe5Qr0J9mMnb3DQZ6icnnMcoL5I8Fiaq87WYBibLyYyZxoaMxT9I8F910LrfanrkQ06ZPfthFY4lNrQ/640?wx_fmt=jpeg" data-type="jpeg" data-w="640" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;border-style: none;display: block;margin: 10px auto;"></p>
<p class="pgc-img-caption" style="box-sizing: border-box;-webkit-tap-highlight-color: transparent;text-align: center;font-size: 12px;color: rgb(119, 119, 119);line-height: 16px;">