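How to parse an HTML page to fetch HTML tables in Python?

Problem

You need to extract the HTML tables from a web page.

Introduction

The internet and the World Wide Web (WWW) are the most prominent sources of information today. There is so much information available that picking the right content from so many options is difficult. Most of that information can be retrieved over HTTP.

These operations can also be performed programmatically, so that information is retrieved and processed automatically.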

Python's standard library includes an HTTP client that can do this, but the requests module makes retrieving web page content much easier.
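For comparison, here is a minimal sketch of the standard-library route using urllib.request. The function name fetch_html, the User-Agent string and the 10-second timeout are assumptions made for this illustration, not part of the original walkthrough.

```python
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page with the standard library and return it as text."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    request = Request(url, headers={"User-Agent": "table-fetch-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html(site_url) would return the page source as a string. With requests, the decoding and connection handling shown here are done for you, which is why the rest of this post uses it.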

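In this post, we will see how to parse an HTML page and extract the HTML tables embedded in it.

How to do it..

1. We will use the requests, pandas, beautifulsoup4 and tabulate packages. If any of them are missing, install them on your system. If you are unsure, run pip freeze to check.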

import requests
import pandas as pd
from tabulate import tabulate
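As an alternative to reading through the pip freeze output, the check can be sketched in Python with importlib.metadata (Python 3.8 or later). The helper name missing_packages and the REQUIRED list are assumptions for this example.

```python
from importlib import metadata

# Distribution names as they would be passed to "pip install".
REQUIRED = ["requests", "pandas", "beautifulsoup4", "tabulate"]


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Running print(missing_packages(REQUIRED)) lists whatever still needs to be installed. Note that pandas.read_html also needs an HTML parser backend such as lxml, or beautifulsoup4 together with html5lib.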

2. We will use https://www.tutorialspoint.com/python/python_basic_operators.htm as the page to examine, and print all the HTML tables embedded in it.

# set the site url
site_url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"

3. Send a request to the server and check the response.

# Make a request to the server
response = requests.get(site_url)

# Check the response
print(f"*** The response for {site_url} is {response.status_code}")
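The status code printed above is just a number. One way to turn it into something readable is the standard library's http.HTTPStatus enum. The helper below is a sketch; describe_status and its outcome labels are names chosen for this example.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in HTTPStatus
        phrase = "Unknown"

    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "informational or non-standard"
    return f"{code} {phrase} ({outcome})"
```

For the request above, describe_status(response.status_code) would be expected to give "200 OK (success)" when the page loads normally.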

4. A response code of 200 means the server responded successfully. Next, we check the request headers, the response headers and the first 100 characters of the text returned by the server.

# Check the request headers
print(f"*** Printing the request headers - \n {response.request.headers} ")

# Check the response headers
print(f"*** Printing the response headers - \n {response.headers} ")

# check the content of the results
print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")

Output

*** Printing the request headers -
{'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
*** Printing the response headers -
{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '213246', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Tue, 20 Oct 2020 09:45:18 GMT', 'Expires': 'Thu, 19 Nov 2020 09:45:18 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
*** Accessing the first 100/37624 characters -

<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>
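The response headers above include a Content-Type value of 'text/html; charset=UTF-8', which says both what kind of document came back and how its bytes are encoded. A small sketch of splitting that value apart with the standard library follows; the function name parse_content_type is an assumption for this example.

```python
from email.message import Message


def parse_content_type(header_value: str):
    """Split a Content-Type header into (media type, charset or None)."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None if absent.
    return message.get_content_type(), message.get_content_charset()


print(parse_content_type("text/html; charset=UTF-8"))
```

Checking that the media type is text/html before handing the body to a parser is a cheap guard against trying to parse an image or a JSON payload as HTML.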

5. Next, we parse the HTML with BeautifulSoup.

# Parse the HTML pages

from bs4 import BeautifulSoup
tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
print(f"*** The title of the page is - {tutorialpoints_page.title}")

# You can extract the page title as string as well
print(f"*** The title of the page is - {tutorialpoints_page.title.string}")
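The same title lookup can be tried offline on a small HTML string, which makes it easy to see what happens when a page has no title at all. SAMPLE_HTML and page_title are made-up names for this sketch.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document, not the real tutorial page.
SAMPLE_HTML = """<html><head><title>Sample page</title></head>
<body><h2>Heading</h2><p>Text</p></body></html>"""


def page_title(html):
    """Return the page title as a plain string, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the document has no <title> element.
    if soup.title is None or soup.title.string is None:
        return None
    return str(soup.title.string)


print(page_title(SAMPLE_HTML))
print(page_title("<p>no title here</p>"))
```

Accessing .string directly on a missing title would raise an AttributeError, so a guard like this is worth having when the pages being scraped are not under your control.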

6. Most tables have a heading defined in an h2, h3, h4, h5 or h6 tag. We first identify these tags and then pick up the HTML tables that follow the identified tag. This logic uses find, sibling and find_next_siblings, as shown below.

from io import StringIO

# Print all the h2 elements
print(tutorialpoints_page.find_all('h2'))

# Find the first heading tag (h2 to h6) on the page
tags = tutorialpoints_page.find(lambda elm: elm.name in ("h2", "h3", "h4", "h5", "h6"))

# Every table that follows the heading at the same level is one of its siblings
for sibling in tags.find_next_siblings():
    if sibling.name == "table":
        my_table = sibling
        # Recent pandas versions expect a file-like object rather than a raw HTML string
        df = pd.read_html(StringIO(str(my_table)))
        print(tabulate(df[0], headers='keys', tablefmt='psql'))
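The sibling walk can be tried without a network connection on a small HTML string. The sketch below is a variation on the step above rather than the same logic: it pairs each table with the nearest heading before it, and it reads the cells with BeautifulSoup instead of pandas. SAMPLE_HTML, table_rows and tables_by_heading are names invented for this example.

```python
from bs4 import BeautifulSoup

HEADING_TAGS = ["h2", "h3", "h4", "h5", "h6"]

# A tiny stand-in document, not the real tutorial page.
SAMPLE_HTML = """
<h2>Arithmetic</h2>
<p>Intro text</p>
<table>
  <tr><th>Operator</th><th>Description</th></tr>
  <tr><td>+</td><td>Addition</td></tr>
</table>
<h2>Logical</h2>
<table>
  <tr><th>Operator</th><th>Description</th></tr>
  <tr><td>and</td><td>Logical AND</td></tr>
</table>
"""


def table_rows(table):
    """Return the table as a list of rows, each a list of cell texts."""
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in table.find_all("tr")
    ]


def tables_by_heading(html):
    """Map each heading's text to the tables that follow it as siblings."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for heading in soup.find_all(HEADING_TAGS):
        for sibling in heading.find_next_siblings():
            if sibling.name in HEADING_TAGS:
                break  # the next heading starts a new section
            if sibling.name == "table":
                key = heading.get_text(strip=True)
                result.setdefault(key, []).append(table_rows(sibling))
    return result


for title, tables in tables_by_heading(SAMPLE_HTML).items():
    print(title, tables)
```

Stopping at the next heading keeps each table attached to its own section. This only works when headings and tables sit at the same level of the document, which is also what the find_next_siblings approach above relies on.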

Complete code

7. Now we put everything together.

# STEP1 : Download the page required
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

# set the site url
site_url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"

# Make a request to the server
response = requests.get(site_url)

# Check the response
print(f"*** The response for {site_url} is {response.status_code}")

# Check the request headers
print(f"*** Printing the request headers - \n {response.request.headers} ")

# Check the response headers
print(f"*** Printing the response headers - \n {response.headers} ")

# check the content of the results
print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")

# Parse the HTML pages
tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
print(f"*** The title of the page is - {tutorialpoints_page.title}")

# You can extract the page title as string as well
print(f"*** The title of the page is - {tutorialpoints_page.title.string}")

# Print all the h2 elements
print(f"{tutorialpoints_page.find_all('h2')}")

# Find the first heading tag (h2 to h6), then read every table that follows it as a sibling
tags = tutorialpoints_page.find(lambda elm: elm.name in ("h2", "h3", "h4", "h5", "h6"))
for sibling in tags.find_next_siblings():
    if sibling.name == "table":
        my_table = sibling
        # Recent pandas versions expect a file-like object rather than a raw HTML string
        df = pd.read_html(StringIO(str(my_table)))
        print(df)

Output

*** The response for https://www.tutorialspoint.com/python/python_basic_operators.htm is 200
*** Printing the request headers -
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
*** Printing the response headers -
{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '558841', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Sat, 24 Oct 2020 09:45:13 GMT', 'Expires': 'Mon, 23 Nov 2020 09:45:13 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
*** Accessing the first 100/37624 characters -

<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>
*** The title of the page is - <title>Python - Basic Operators - Tutorialspoint</title>
*** The title of the page is - Python - Basic Operators - Tutorialspoint
[<h2>Types of Operator</h2>, <h2>Python Arithmetic Operators</h2>, <h2>Python Comparison Operators</h2>, <h2>Python Assignment Operators</h2>, <h2>Python Bitwise Operators</h2>, <h2>Python Logical Operators</h2>, <h2>Python Membership Operators</h2>, <h2>Python Identity Operators</h2>, <h2>Python Operators Precedence</h2>]
[ Operator Description \
0 + Addition Adds values on either side of the operator.
1 - Subtraction Subtracts right hand operand from left hand op...
2 * Multiplication Multiplies values on either side of the operator
3 / Division Divides left hand operand by right hand operand
4 % Modulus Divides left hand operand by right hand operan...
5 ** Exponent Performs exponential (power) calculation on op...
6 // Floor Division - The division of operands wher...

Example

0 a + b = 30
1 a – b = -10
2 a * b = 200
3 b / a = 2
4 b % a = 0
5 a**b =10 to the power 20
6 9//2 = 4 and 9.0//2.0 = 4.0, -11//3 = -4, -11.... ]
[ Operator Description \
0 == If the values of two operands are equal, then ...
1 != If values of two operands are not equal, then ...
2 <> If values of two operands are not equal, then ...
3 > If the value of left operand is greater than t...
4 < If the value of left operand is less than the ...
5 >= If the value of left operand is greater than o...
6 <= If the value of left operand is less than or e...

Example

0 (a == b) is not true.
1 (a != b) is true.
2 (a <> b) is true. This is similar to != operator.
3 (a > b) is not true.
4 (a < b) is true.
5 (a >= b) is not true.
6 (a <= b) is true. ]
[ Operator Description \
0 = Assigns values from right side operands to lef...
1 += Add AND It adds right operand to the left operand and ...
2 -= Subtract AND It subtracts right operand from the left opera...
3 *= Multiply AND It multiplies right operand with the left oper...
4 /= Divide AND It divides left operand with the right operand...
5 %= Modulus AND It takes modulus using two operands and assign...
6 **= Exponent AND Performs exponential (power) calculation on op...
7 //= Floor Division It performs floor division on operators and as...

Example

0 c = a + b assigns value of a + b into c
1 c += a is equivalent to c = c + a
2 c -= a is equivalent to c = c - a
3 c *= a is equivalent to c = c * a
4 c /= a is equivalent to c = c / a
5 c %= a is equivalent to c = c % a
6 c **= a is equivalent to c = c ** a
7 c //= a is equivalent to c = c // a ]
[ Operator \
0 & Binary AND
1 | Binary OR
2 ^ Binary XOR
3 ~ Binary Ones Complement
4 << Binary Left Shift
5 >> Binary Right Shift

Description \
0 Operator copies a bit to the result if it exis...
1 It copies a bit if it exists in either operand.
2 It copies the bit if it is set in one operand ...
3 It is unary and has the effect of 'flipping' b...
4 The left operands value is moved left by the n...
5 The left operands value is moved right by the ...

Example

0 (a & b) (means 0000 1100)
1 (a | b) = 61 (means 0011 1101)
2 (a ^ b) = 49 (means 0011 0001)
3 (~a ) = -61 (means 1100 0011 in 2's complement...
4 a << 2 = 240 (means 1111 0000)
5 a >> 2 = 15 (means 0000 1111) ]
[ Operator Description \
0 and Logical AND If both the operands are true then condition b...
1 or Logical OR If any of the two operands are non-zero then c...
2 not Logical NOT Used to reverse the logical state of its operand.

Example
0 (a and b) is true.
1 (a or b) is true.
2 Not(a and b) is false. ]
[ Operator Description \
0 in Evaluates to true if it finds a variable in th...
1 not in Evaluates to true if it does not finds a varia...

Example

0 x in y, here in results in a 1 if x is a membe...
1 x not in y, here not in results in a 1 if x is... ]
[ Operator Description \
0 is Evaluates to true if the variables on either s...
1 is not Evaluates to false if the variables on either ...

Example

0 x is y, here is results in 1 if id(x) equals i...
1 x is not y, here is not results in 1 if id(x) ... ]
[ Sr.No. Operator & Description
0 1 ** Exponentiation (raise to the power)
1 2 ~ + - Complement, unary plus and minus (method...
2 3 * / % // Multiply, divide, modulo and floor di...
3 4 + - Addition and subtraction
4 5 >> << Right and left bitwise shift
5 6 & Bitwise 'AND'
6 7 ^ | Bitwise exclusive `OR' and regular `OR'
7 8 <= < > >= Comparison operators
8 9 <> == != Equality operators
9 10 = %= /= //= -= += *= **= Assignment operators
10 11 is is not Identity operators
11 12 in not in]
