
Extracting text from HTML in Python

Date: 2022-06-19 08:44:15

When working on NLP problems, you sometimes need to obtain a large corpus of text. The internet is the largest source of text, but unfortunately, extracting text from arbitrary HTML pages is a hard and painful task. Suppose we need to extract the full text from a variety of web pages and strip all HTML tags. The usual default solution is the get_text method from the BeautifulSoup package, used here with lxml as the underlying parser. It is a well-tested solution, but it can be very slow when processing tens of thousands of HTML documents. By replacing BeautifulSoup with selectolax, you can get a 5-30x speedup almost for free! Here is a simple benchmark that parses 10,000 HTML pages from commoncrawl (https://commoncrawl.org/):

    # coding: utf-8

    from time import time

    import warc
    from bs4 import BeautifulSoup
    from selectolax.parser import HTMLParser


    def get_text_bs(html):
        # Parse with BeautifulSoup, using lxml as the parser backend.
        tree = BeautifulSoup(html, 'lxml')

        body = tree.body
        if body is None:
            return None

        # Drop <script> and <style> elements together with their contents.
        for tag in body.select('script'):
            tag.decompose()
        for tag in body.select('style'):
            tag.decompose()

        text = body.get_text(separator='\n')
        return text


    def get_text_selectolax(html):
        tree = HTMLParser(html)

        if tree.body is None:
            return None

        # Drop <script> and <style> elements together with their contents.
        for tag in tree.css('script'):
            tag.decompose()
        for tag in tree.css('style'):
            tag.decompose()

        text = tree.body.text(separator='\n')
        return text


    def read_doc(record, parser=get_text_selectolax):
        url = record.url
        text = None

        if url:
            payload = record.payload.read()
            # The HTTP headers and the HTML body are separated by a blank line.
            header, html = payload.split(b'\r\n\r\n', maxsplit=1)
            html = html.strip()

            if len(html) > 0:
                text = parser(html)

        return url, text


    def process_warc(file_name, parser, limit=10000):
        warc_file = warc.open(file_name, 'rb')
        t0 = time()
        n_documents = 0
        for i, record in enumerate(warc_file):
            url, doc = read_doc(record, parser)

            # Skip records without a URL or without extractable text.
            if not doc or not url:
                continue

            n_documents += 1

            if i > limit:
                break

        warc_file.close()
        print('Parser: %s' % parser.__name__)
        print('Parsing took %s seconds and produced %s documents\n' % (time() - t0, n_documents))

    >>> ! wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
    >>> file_name = 'CC-MAIN-20180116070444-20180116090444-00000.warc.gz'
    >>> process_warc(file_name, get_text_selectolax, 10000)
    Parser: get_text_selectolax
    Parsing took 16.170367002487183 seconds and produced 3317 documents
    >>> process_warc(file_name, get_text_bs, 10000)
    Parser: get_text_bs
    Parsing took 432.6902508735657 seconds and produced 3283 documents

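Clearly, this is not the best way to benchmark something, but it gives an idea: selectolax can be up to about 30 times faster than the BeautifulSoup/lxml combination (roughly 27 times in this run: 432.7 s versus 16.2 s). selectolax is best suited to stripping HTML down to plain text.

I have more than 10,000 HTML snippets that need to be indexed into Elasticsearch as plain text. (Elasticsearch has an html_strip character filter, but it is not what I want or need in this context.) It turns out that stripping HTML to plain text at this scale is actually quite inefficient. So what is the most efficient way?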

PyQuery

    from pyquery import PyQuery as pq

    text = pq(html).text()

selectolax

    from selectolax.parser import HTMLParser

    text = HTMLParser(html).text()

Regular expression

    import re

    # Non-greedy match of anything that looks like a tag.
    clean_regex = re.compile(r'<.*?>')
    text = clean_regex.sub('', html)

Results

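I wrote a script to measure the time; it iterates over 10,000 files containing HTML snippets. Note! These snippets are not complete <html> documents (with <head>, <body> and so on), just small pieces of HTML. The average size is 10,314 bytes (the median is 5,138 bytes). The results are as follows: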

    pyquery      SUM: 18.61 seconds   MEAN: 1.8633 ms   MEDIAN: 1.0554 ms
    selectolax   SUM:  3.08 seconds   MEAN: 0.3149 ms   MEDIAN: 0.1621 ms
    regex        SUM:  1.64 seconds   MEAN: 0.1613 ms   MEDIAN: 0.0881 ms

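I have run this many times and the results are very stable. The takeaway: selectolax is about 6 times faster than PyQuery (18.61 s versus 3.08 s in total).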
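The timing script itself is not shown in the article. The following is a minimal sketch of how SUM, MEAN and MEDIAN figures of this kind could be collected, not the author's actual script. The names `strip_regex` and `benchmark` are hypothetical, and the in-memory `documents` list is stand-in data replacing the 10,000 snippet files read from disk.

```python
import re
import statistics
import time

clean_regex = re.compile(r'<.*?>')


def strip_regex(html):
    """Remove anything that looks like a tag; entities are left untouched."""
    return clean_regex.sub('', html)


def benchmark(func, documents):
    """Time func on each document; return (sum in s, mean in ms, median in ms)."""
    timings = []
    for html in documents:
        t0 = time.perf_counter()
        func(html)
        timings.append(time.perf_counter() - t0)
    return (
        sum(timings),
        statistics.mean(timings) * 1000,
        statistics.median(timings) * 1000,
    )


# Hypothetical stand-in data (assumption): the original run used real files.
documents = ['<p>Snippet <b>%d</b></p>' % i for i in range(1000)]

total, mean_ms, median_ms = benchmark(strip_regex, documents)
print('regex SUM: %.2f seconds MEAN: %.4f ms MEDIAN: %.4f ms'
      % (total, mean_ms, median_ms))
```

Passing a PyQuery-based or selectolax-based function to `benchmark` instead of `strip_regex` would give comparable rows for the other two approaches, provided every function sees the same documents.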

Are regular expressions good enough? Really?

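For the most basic HTML blobs, a regular expression might work well. However, if the HTML is <p>Foo &amp; Bar</p>, I expect the plain-text conversion to be Foo & Bar, not Foo &amp; Bar. More importantly, PyQuery and selectolax support something very specific that matters for my use case: before going any further, I need to remove certain tags (and their contents). For example: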

    <h4 class='warning'>This should get stripped.</h4>
    <p>Please keep.</p>
    <div style='display: none'>This should also get stripped.</div>

A regular expression can never do this.
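Both limitations are easy to reproduce with the standard library alone. The sketch below applies the tag-stripping pattern to the two examples above. The function name `strip_tags_regex` is hypothetical, and `html.unescape` is introduced here only to show what entity decoding would look like; it is not part of the original regex approach.

```python
import re
from html import unescape

clean_regex = re.compile(r'<.*?>')


def strip_tags_regex(html):
    """Delete tag-like substrings; element contents and entities stay as-is."""
    return clean_regex.sub('', html)


# Limitation 1: entities are not decoded by the regex.
raw = strip_tags_regex('<p>Foo &amp; Bar</p>')
print(raw)            # Foo &amp; Bar
print(unescape(raw))  # Foo & Bar

# Limitation 2: only the tags disappear, the unwanted contents survive.
snippet = (
    "<h4 class='warning'>This should get stripped.</h4>"
    "<p>Please keep.</p>"
    "<div style='display: none'>This should also get stripped.</div>"
)
print(strip_tags_regex(snippet))
# This should get stripped.Please keep.This should also get stripped.
```

Entity decoding can be bolted on afterwards, but dropping an element together with its contents requires knowing where the element ends, which is what a real parser provides.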

Version 2.0

My requirements may change, but basically I want to remove certain tags, for example <div class="warning">, <div class="hidden"> and <div style="display: none">. So let's implement that:

PyQuery

    import re

    from pyquery import PyQuery as pq

    # Matches an inline style that hides the element.
    _display_none_regex = re.compile(r'display:\s*none')

    doc = pq(html)
    # Remove elements selected by class, together with their contents.
    doc.remove('div.warning, div.hidden')
    for div in doc('div[style]').items():
        style_value = div.attr('style')
        if _display_none_regex.search(style_value):
            div.remove()
    text = doc.text()

selectolax

    import re

    from selectolax.parser import HTMLParser

    # Matches an inline style that hides the element.
    _display_none_regex = re.compile(r'display:\s*none')

    tree = HTMLParser(html)
    # Remove elements selected by class, together with their contents.
    for tag in tree.css('div.warning, div.hidden'):
        tag.decompose()
    for tag in tree.css('div[style]'):
        style_value = tag.attributes['style']
        if style_value and _display_none_regex.search(style_value):
            tag.decompose()
    text = tree.body.text()

This actually works. When I now run the same benchmark on the 10,000 snippets, the new results are:

    pyquery      SUM: 21.70 seconds   MEAN: 2.1701 ms   MEDIAN: 1.3989 ms
    selectolax   SUM:  3.59 seconds   MEAN: 0.3589 ms   MEDIAN: 0.2184 ms
    regex        Skip

Again, selectolax beats PyQuery by a factor of about 6.
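Both version 2.0 implementations decide whether a div is hidden by searching its style attribute for display, a colon, optional whitespace, then none. The sketch below exercises that check on its own, using only the standard library. The helper name `is_hidden` and the sample style strings are hypothetical and serve only as illustration.

```python
import re

_display_none_regex = re.compile(r'display:\s*none')


def is_hidden(style_value):
    """Return True if an inline style string contains display:none."""
    return bool(style_value and _display_none_regex.search(style_value))


samples = [
    'display: none',
    'display:none',
    'color: red; display:   none',
    'display: block',
    'display : none',  # whitespace before the colon is not covered
    None,              # attribute present but without a value
]
for style in samples:
    print(repr(style), is_hidden(style))
```

The pattern is deliberately simple: it is case-sensitive and does not tolerate whitespace before the colon, so markup written that way would need a looser expression.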

Conclusion

Regular expressions are fast but limited in what they can do. The efficiency of selectolax is impressive.
