PythonでExcelファイルのデータをチャンクで処理するにはどうすればよいですか？

はじめに

世界はExcelに支配されているようです。私はデータエンジニアリングの仕事に驚いています。同僚の何人が意思決定のための重要なツールとしてExcelを使用しているのかを見てきました。私はMSOfficeとそのExcelスプレッドシートの大ファンではありませんが、大きなExcelスプレッドシートを効果的に処理するための巧妙なトリックを紹介します。

その方法..

プログラムに直接参加する前に、ExcelスプレッドシートをPandasで処理するためのいくつかの基本を理解しましょう。

1.インストール。先に進み、openpyxlとxlwtをインストールします。インストールされているかどうかわからない場合は、Pythonターミナルからpipfreezeまたはpiplistを使用して、利用可能なパッケージだけではありません。

まず、データのタプルを渡してExcelスプレッドシートを作成し、次にデータをpandasデータフレームにロードします。最後に、データフレームデータを新しいワークブックに書き込みます。

import xlsxwriter
import pandas as pd

2.小さなデータでExcelスプレッドシートを作成します。辞書データをExcelスプレッドシートに書き込む小さな関数があります。すべてのコードロジックは、各ステップで定義されます。

# Function : write_data_to_files
def write_data_to_files(inp_data, inp_file_name):
"""
function : create a csv file with the data passed to this code
args : inp_data : tuple data to be written to the target file
file_name : target file name to store the data
return : none
assumption : File to be created and this code are in same directory.
"""
print(f" *** Writing the data to - {inp_file_name}")

# Create a Workbook.
workbook = xlsxwriter.Workbook(inp_file_name)

# add a worksheet.
worksheet = workbook.add_worksheet()

# Start from the first cell. Rows and columns are zero indexed.
row = 0
col = 0

# read the input data and write them in rows and columns
for player, titles in inp_data:
worksheet.write(row, col, player)
worksheet.write(row, col + 1, titles)
row += 1

# close the workbook.
workbook.close()
print(f" *** Completed writing the data to - {inp_file_name}")

# Function : excel_functions_with_pandas
def excel_functions_with_pandas(inp_file_name):
"""
function : Quick overview of functions you can apply on excel with pandas
args : inp_file_name : input excel spread sheet.
return : none
assumption : Input excel spreadsheet and this code are in same directory.
"""
data = pd.read_excel(inp_file_name)

# print top 2 rows
print(f" *** Displaying top 2 rows of - {inp_file_name} \n {data.head()} ")

# look at the data types
print(f" *** Displaying info about {inp_file_name} - {data.info()}")

# Create a new spreadsheet "Sheet2" and write data into it.
new_players_info = pd.DataFrame(data=[
{"players": "new Roger Federer", "titles": 20},
{"players": "new Rafael Nadal", "titles": 20},
{"players": "new Novak Djokovic", "titles": 17},
{"players": "new Andy Murray", "titles": 3}], columns=["players", "titles"])

new_data = pd.ExcelWriter(inp_file_name)
new_players_info.to_excel(new_data, sheet_name="Sheet2")
if __name__ == '__main__':
# Define your file name and data
file_name = "temporary_file.xlsx"

# tuple data for storage
file_data = (['player', 'titles'], ['Federer', 20], ['Nadal', 20], ['Djokovic', 17], ['Murray', 3])

# write the file_data to file_name
# write_data_to_files(file_data, file_name)

# # Read excel file into pandas and apply functions.
# excel_functions_with_pandas(file_name)

if __name__ == '__main__':
# Define your file name and data
file_name = "temporary_file.xlsx"

# tuple data for storage
file_data = (['player', 'titles'], ['Federer', 20], ['Nadal', 20], ['Djokovic', 17], ['Murray', 3])

# write the file_data to file_name
# write_data_to_files(file_data, file_name)

# # Read excel file into pandas and apply functions.
# excel_functions_with_pandas(file_name)

出力

*** Writing the data to - temporary_file.xlsx
*** Completed writing the data to - temporary_file.xlsx
*** Displaying top 2 rows of - temporary_file.xlsx
player titles
0 Federer 20
1 Nadal 20
2 Djokovic 17
3 Murray 3
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 player 4 non-null object
1 titles 4 non-null int64
dtypes: int64(1), object(1)
memory usage: 192.0+ bytes
*** Displaying info about temporary_file.xlsx - None

大きなcsvファイルを処理する場合、チャンクを含めてチャンクで処理するためのオプションがかなりありますが、Excelスプレッドシートの場合、Pandasはデフォルトでチャンクオプションを提供していません。

したがって、Excelスプレッドシートをチャンクで処理する場合は、以下のプログラムが非常に役立ちます。

例

def global_excel_to_db_chunks(file_name, nrows):
"""
function : handle excel spreadsheets in chunks
args : inp_file_name : input excel spread sheet.
return : none
assumption : Input excel spreadsheet and this code are in same directory.
"""
chunks = []
i_chunk = 0

# The first row is the header. We have already read it, so we skip it.
skiprows = 1
df_header = pd.read_excel(file_name, nrows=1)

while True:
df_chunk = pd.read_excel(
file_name, nrows=nrows, skiprows=skiprows, header=None)
skiprows += nrows

# When there is no data, we know we can break out of the loop.
if not df_chunk.shape[0]:
break
else:
print(
f" ** Reading chunk number {i_chunk} with {df_chunk.shape[0]} Rows")
# print(f" *** Reading chunk {i_chunk} ({df_chunk.shape[0]} rows)")
chunks.append(df_chunk)
i_chunk += 1

df_chunks = pd.concat(chunks)

# Rename the columns to concatenate the chunks with the header.
columns = {i: col for i, col in enumerate(df_header.columns.tolist())}
df_chunks.rename(columns=columns, inplace=True)
df = pd.concat([df_header, df_chunks])

print(f' *** Reading is Completed in chunks...')

if __name__ == '__main__':
print(f" *** Gathering & Displaying Stats on the excel spreadsheet ***")
file_name = 'Sample-sales-data-excel.xls'
stats = pd.read_excel(file_name)
print(f" ** Total rows in the spreadsheet are - {len(stats.index)} Rows")

# process the excel file in chunks of 1000 rows at a time.
global_excel_to_db_chunks(file_name, 1000)

*** Gathering & Displaying Stats on the excel spreadsheet ***
** Total rows in the spreadsheet are - 9994 Rows
** Reading chunk number 0 with 1000 Rows
** Reading chunk number 1 with 1000 Rows
** Reading chunk number 2 with 1000 Rows
** Reading chunk number 3 with 1000 Rows
** Reading chunk number 4 with 1000 Rows
** Reading chunk number 5 with 1000 Rows
** Reading chunk number 6 with 1000 Rows
** Reading chunk number 7 with 1000 Rows
** Reading chunk number 8 with 1000 Rows
** Reading chunk number 9 with 994 Rows
*** Reading is Completed in chunks...

出力

*** Gathering & Displaying Stats on the excel spreadsheet ***
** Total rows in the spreadsheet are - 9994 Rows
** Reading chunk number 0 with 1000 Rows
** Reading chunk number 1 with 1000 Rows
** Reading chunk number 2 with 1000 Rows
** Reading chunk number 3 with 1000 Rows
** Reading chunk number 4 with 1000 Rows
** Reading chunk number 5 with 1000 Rows
** Reading chunk number 6 with 1000 Rows
** Reading chunk number 7 with 1000 Rows
** Reading chunk number 8 with 1000 Rows
** Reading chunk number 9 with 994 Rows
*** Reading is Completed in chunks...

Pythonのゼロ値に対して行と列の各要素をゼロに変換するプログラム

PythonでZIPFILEモジュールを使用してファイルを圧縮する方法。

Excel でデータベースを作成する方法 (簡単な手順)
Excel で簡単なデータベースを作成する方法がわかりませんか?この記事では、わずか 7 で Excel でデータベースを作成する方法を紹介します。簡単な手順 MS Access は、データベースとして使用するには複雑なツールだと思いますか?そのため、Excel はそのための優れたツールです。テクニックを学びましょう。次の練習用ワークブックをダウンロードしてください。トピックをより明確に理解するのに役立ちます。 Excel でデータベースを作成するための 8 つのステップ Excel ワークブックを正しく設計すれば、簡単にデータベースとして使用できます。主なポイントは、ワークブ
Excel で質的データを分析する方法 (簡単な手順)
質的データの分析方法を知る方法を探していますエクセルで ?それなら、これはあなたにぴったりの記事です。データ数えることができず、数値で説明するのが難しい場合は、データ質的です .この定性を収集できますデータフォーカスグループディスカッション、詳細なインタビュー、文の補完、単語の連想、カジュアルな会話などから。 Excel で質的データを分析するための 8 つのステップ私たちのアプローチを示すために、調査アンケートから 3 つの回答を得ました。こちら XYZ カフェです町のはずれにあり、学生がたむろすることもあります。 3 つの質問は次のとおりです。まず、リッカート