使用pandas做vlookup

6 min readFeb 26, 2021

工作需求常常載入大量資料做比對，以excel來vlookup超過數十萬筆資料實在痛苦@@，工作之餘來順便紀錄如何針對表格式資料簡單做處理。

先讓我們載入所需套件與資料:

import pandas as pdtxn_raw = pd.read_csv('Txn.csv', encoding='utf-8', low_memory=False) 
r_raw = pd.read_csv('Refund.csv', encoding='utf-8', low_memory=False)
v_raw = pd.read_csv('Void.csv', encoding='utf-8', low_memory=False)
inv_raw = pd.read_csv('Inv.csv', encoding='utf-8', low_memory=False)
item_raw = pd.read_csv('item1.csv', encoding='utf-8', low_memory=False)

接著針對第一個df移除不需要的"欄位"，並新增一欄作為標記，此部分概念為先新增欄位再填入值(apply)。最後在對後續將進行核對的key column做資料型態一致化，如缺少該步驟，後續做concat or merge會出現空值，也就是v不到東西的情況。

txn = txn_raw.drop(['GlobalTxnID', 'stock_cost', 'soh_qty',
                    'price', 'discount', 'voucher_used'], axis=1)txn['type'] = 0 
def col_filling(row):
    if pd.notnull(row['store']):
        row['type'] = 'SalesItem'
    else:
        row['type'] = 0
    return row
txn = txn.apply(col_filling, axis=1)txn['item_code'] = txn['item_code'].astype(str).str.strip()

針對第二個df做處理，依序為

移除無關欄位
移除目標欄位為空值的列(註記第2種寫法是補上inplace讓該df直接被更新而無須宣告)
新增並填入新資料
關鍵欄位資料型態一致化

r = r_raw.drop(columns = ['RsGlobalTxnID', 'stock_cost', 'soh_qty',
                                    'price', 'voucher_used'])r = r.dropna(axis=0, how='any', subset=["item_code"])
# r.dropna(axis=0, how='any', subset=["item_code"], inplace=True)r['type'] = 0 
def col_filling_r(row):
    if pd.notnull(row['store']):
        row['type'] = 'R'
    else:
        row['type'] = 0
    return row
r = r.apply(col_filling_r, axis=1)r['item_code'] = r['item_code'].astype(str).str.strip()

其他df處理相同，這邊紀錄遇到欄位裡面有想排除的值作法，先設立不等是讓內層series為布林值，再帶入外層df篩選條件中。如下例，我不想要欄位中有出現v_type跟99的資料出現。(這邊v_type會先設成一遮罩供後續使用)

另外依據指定欄位移除重複列並直接更新該df

waste = v['type'] != 'v_type'inv = inv[inv['MediaType'] != 99]inv.drop_duplicates(subset=['store', 'transaction_time'], keep='first', ignore_index=True, inplace=True)v.rename(columns={'void_type':'type'}, inplace = True)v['type'].replace({'Transaction' : 'VTransaction', 'Item' : 'VItem'}, inplace=True)

接著，因txn、r、v三個df欄位數量與名稱皆相同，故使用concat做垂直合併，欄位順序會自動reindex。接著再以該df執行merge來達到vlookup效果。這邊要注意key df 與target df位置與主表單left。這邊共執行兩次，分別對單一欄位與多欄位進行比對合併。

combine = [txn, r, v[waste]]df = pd.concat(combine, ignore_index=True)df2 = pd.merge(df, item, how='left', left_on='item_code', right_on='item_code')df3 = pd.merge(df2, inv, how='left', left_on=['store', 'transaction_time'],
                right_on=['storez', 'transaction_time'])
# removed A=* and B=*. based condition trans to index# new_df = new_df[empty].drop(new_df[empty].loc[new_df[empty][(new_df['交易型態'] == 'SalesItem') & (new_df['數量'] == 0)].index])delsi = new_df[empty][(new_df['交易型態'] == 'SalesItem') & (new_df['數量'] == 0)].indexnew_df = new_df[empty].drop(delsi)

需注意的是進行merge時，會把target df的所有欄位都帶進來，如不想在主表單新增太多欄位時，記得先對raw data進行非相關欄位移除。

此部分案例順道提及了簡單的移除指定欄位與缺失值、條件篩選、新增填入、取代值、合併表單，於此做個筆記~~

使用pandas做vlookup

Written by Allen Li

No responses yet