HTML in Python (2): lxml

Published May 17, 2023

lxml parses HTML and stores the result as a tree of elements.

Installation

conda install lxml

Example (parsing from a text string)

import lxml.html


# HTML text
text = """<html>
<head><title>タイトル</title>
<body>
<p> hello world </p>
<p> 二行目 </p>
<div>
子要素
</div>
</body>
</html>

"""

ret = lxml.html.fromstring( text )

for itr in ret:  # iterating over an element yields its direct children

    print(itr.tag)  # access the tag name

    if len(itr):  # len() of an element is the number of child elements

        print("{")
        for i in itr:
            print(i.tag, "[", i.text, "]")
        print("}")

Output:

head
{
title [ タイトル ]
}
body
{
p [ hello world ]
p [ 二行目 ]
div [
子要素
]
}
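The loop above only descends two levels. As a rough sketch of how the same iteration could be extended to any depth, the following uses a recursive helper. The name `walk` and the sample HTML are made up for illustration and are not part of lxml. It assumes the document contains only ordinary elements (no comments or processing instructions).

```python
import lxml.html

# Hypothetical sample document, slightly deeper than the one above
html_text = """<html>
<head><title>title</title></head>
<body>
<p> hello world </p>
<div>
<p> nested </p>
</div>
</body>
</html>"""


def walk(element, depth=0, lines=None):
    """Collect one indented line per element, visiting children recursively."""
    if lines is None:
        lines = []
    # element.text is None when the element has no leading text
    text = (element.text or "").strip()
    lines.append("  " * depth + f"{element.tag} [{text}]")
    for child in element:  # same iteration as in the example above
        walk(child, depth + 1, lines)
    return lines


root = lxml.html.fromstring(html_text)
tree_lines = walk(root)
print("\n".join(tree_lines))
```

Each level of nesting adds two spaces of indentation, so the structure of the tree is visible directly in the printed lines.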

Fetching HTML from the internet by URL

lxml.html.parse accepts a URL:

lxml.html.parse('http://www.suzulang.com/')

However, it apparently does not support HTTPS.

lxml.html.parse('https://www.suzulang.com/') # fails

Fetching the data with urllib

Use urllib.request to fetch the HTML, then pass it to fromstring.

# lxml does not support https.
# html.parse( /* only http URLs can be passed here */ )
# Use urllib.request.urlopen to fetch the content over https and feed that in.
# from urllib import urlopen # apparently urllib2 in Python 2
import urllib.request


urldata = urllib.request.urlopen('https://suzulang.com/')

text = urldata.read()  # bytes; this can be passed to lxml.html.fromstring
print( text )
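The listing above stops at printing the downloaded bytes. As a minimal sketch of the remaining step, the following passes such bytes to fromstring and reads a few values from the tree. The helper names `fetch_bytes` and `title_and_links` are made up for illustration. So that it runs without a network connection, the demonstration uses stand-in bytes in place of a real download.

```python
import urllib.request

import lxml.html


def fetch_bytes(url, timeout=10):
    """Download the raw response body; urllib handles both http and https."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()  # bytes, not str


def title_and_links(html_bytes):
    """Parse downloaded bytes and return the <title> text and all href values."""
    root = lxml.html.fromstring(html_bytes)
    title = root.findtext(".//title")
    links = [a.get("href") for a in root.iter("a") if a.get("href")]
    return title, links


# Offline demonstration: stand-in bytes instead of a real download
sample = (
    b"<html><head><title>demo</title></head>"
    b"<body><a href='/a'>A</a><a href='/b'>B</a></body></html>"
)
title, links = title_and_links(sample)
print(title, links)
```

For a live page, the two helpers would be combined as `title_and_links(fetch_bytes('https://suzulang.com/'))`.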
