使用pandas做vlookup

Allen Li
6 min readFeb 26, 2021

--

工作需求常常載入大量資料做比對,以excel來vlookup超過數十萬筆資料實在痛苦@@,工作之餘來順便紀錄如何針對表格式資料簡單做處理。

先讓我們載入所需套件與資料:

import pandas as pdtxn_raw = pd.read_csv('Txn.csv', encoding='utf-8', low_memory=False) 
r_raw = pd.read_csv('Refund.csv', encoding='utf-8', low_memory=False)
v_raw = pd.read_csv('Void.csv', encoding='utf-8', low_memory=False)
inv_raw = pd.read_csv('Inv.csv', encoding='utf-8', low_memory=False)
item_raw = pd.read_csv('item1.csv', encoding='utf-8', low_memory=False)

接著針對第一個df移除不需要的"欄位",並新增一欄作為標記,此部分概念為先新增欄位再填入值(apply)。最後在對後續將進行核對的key column做資料型態一致化,如缺少該步驟,後續做concat or merge會出現空值,也就是v不到東西的情況。

txn = txn_raw.drop(['GlobalTxnID', 'stock_cost', 'soh_qty',
'price', 'discount', 'voucher_used'], axis=1)
txn['type'] = 0
def col_filling(row):
if pd.notnull(row['store']):
row['type'] = 'SalesItem'
else:
row['type'] = 0
return row
txn = txn.apply(col_filling, axis=1)
txn['item_code'] = txn['item_code'].astype(str).str.strip()

針對第二個df做處理,依序為

  1. 移除無關欄位
  2. 移除目標欄位為空值的列(註記第2種寫法是補上inplace讓該df直接被更新而無須宣告)
  3. 新增並填入新資料
  4. 關鍵欄位資料型態一致化
r = r_raw.drop(columns = ['RsGlobalTxnID', 'stock_cost', 'soh_qty',
'price', 'voucher_used'])
r = r.dropna(axis=0, how='any', subset=["item_code"])
# r.dropna(axis=0, how='any', subset=["item_code"], inplace=True)
r['type'] = 0
def col_filling_r(row):
if pd.notnull(row['store']):
row['type'] = 'R'
else:
row['type'] = 0
return row
r = r.apply(col_filling_r, axis=1)
r['item_code'] = r['item_code'].astype(str).str.strip()

其他df處理相同,這邊紀錄遇到欄位裡面有想排除的值作法,先設立不等是讓內層series為布林值,再帶入外層df篩選條件中。如下例,我不想要欄位中有出現v_type跟99的資料出現。(這邊v_type會先設成一遮罩供後續使用)

另外依據指定欄位移除重複列並直接更新該df

waste = v['type'] != 'v_type'inv = inv[inv['MediaType'] != 99]inv.drop_duplicates(subset=['store', 'transaction_time'], keep='first', ignore_index=True, inplace=True)v.rename(columns={'void_type':'type'}, inplace = True)v['type'].replace({'Transaction' : 'VTransaction', 'Item' : 'VItem'}, inplace=True)

接著,因txn、r、v三個df欄位數量與名稱皆相同,故使用concat做垂直合併,欄位順序會自動reindex。接著再以該df執行merge來達到vlookup效果。這邊要注意key df 與target df位置與主表單left。這邊共執行兩次,分別對單一欄位與多欄位進行比對合併。

combine = [txn, r, v[waste]]df = pd.concat(combine, ignore_index=True)df2 = pd.merge(df, item, how='left', left_on='item_code', right_on='item_code')df3 = pd.merge(df2, inv, how='left', left_on=['store', 'transaction_time'],
right_on=['storez', 'transaction_time'])

# removed A=* and B=*. based condition trans to index
# new_df = new_df[empty].drop(new_df[empty].loc[new_df[empty][(new_df['交易型態'] == 'SalesItem') & (new_df['數量'] == 0)].index])delsi = new_df[empty][(new_df['交易型態'] == 'SalesItem') & (new_df['數量'] == 0)].indexnew_df = new_df[empty].drop(delsi)

需注意的是進行merge時,會把target df的所有欄位都帶進來,如不想在主表單新增太多欄位時,記得先對raw data進行非相關欄位移除。

此部分案例順道提及了簡單的移除指定欄位與缺失值、條件篩選、新增填入、取代值、合併表單,於此做個筆記~~

--

--

Allen Li
Allen Li

Written by Allen Li

Hi, I am an auditor in retail industry and here is a book to record the path that I have been through on studying things such as programing and exercising.

No responses yet