DataFrame.shift(periods=1, freq=None, axis=0, fill_value=NoDefault.no_default)

shift๋ฉ”์„œ๋“œ๋Š” ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ์˜ ๋ฐ์ดํ„ฐ๋‚˜ ์ธ๋ฑ์Šค๋ฅผ ์›ํ•˜๋Š” ๊ธฐ๊ฐ„๋งŒํผ ์‰ฌํ”„ํŠธ ํ•˜๋Š” ๋ฉ”์„œ๋“œ ์ž…๋‹ˆ๋‹ค.
freq ์ธ์ˆ˜๋ฅผ ์ž…๋ ฅํ•˜์ง€ ์•Š์œผ๋ฉด ๋ฐ์ดํ„ฐ๊ฐ€ ์ด๋™ํ•˜๊ณ , ์ธ์ˆ˜๊ฐ’์„ ์ž…๋ ฅํ•˜๊ฒŒ๋˜๋ฉด ์ธ๋ฑ์Šค๊ฐ€ freq๊ฐ’ ๋งŒํผ ์ด๋™ํ•˜๊ฒŒ๋ฉ๋‹ˆ๋‹ค.

 

์‚ฌ์šฉ๋ฒ•

df.shift(periods=1, freq=None, axis=0, fill_value=NoDefault.no_default)
periods : ์ด๋™ํ•  ๊ธฐ๊ฐ„์ž…๋‹ˆ๋‹ค.
freq : ์ž…๋ ฅ ํ•  ๊ฒฝ์šฐ ์ธ๋ฑ์Šค๊ฐ€ ์ด๋™ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. Y, M, D, H, T, S ๋‚˜ Timestamp, 'Infer'๋“ฑ์ด ์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
fill_value : shift๋กœ ์ธํ•ด ์ƒ๊ธด ๊ฒฐ์ธก์น˜๋ฅผ ๋Œ€์ฒดํ•  ๊ฐ’์ž…๋‹ˆ๋‹ค.

 

์˜ˆ์‹œ

๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๋ถˆ๋Ÿฌ์˜ต๋‹ˆ๋‹ค.

import pandas as pd
df = pd.read_csv("../input/bigdatacertificationkr/basic2.csv")
df

df.shift(1)

ํ–‰์ด ๋’ค๋กœ ํ•œ ์นธ ๋ฐ€๋ฆฐ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

df.shift(-1)

ํ–‰์ด ์•ž์œผ๋กœ ํ•œ ์นธ ๋ฐ€๋ฆฐ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (๋งˆ์ง€๋ง‰ ์ค„ NaN์œผ๋กœ ํ™•์ธ)

 

df['previous_PV'] = df['PV'].shift(1)
df.head()

ํŠน์ • ์ปฌ๋Ÿผ์„ shift ํ•˜์—ฌ ์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 

 

์ค‘๋ณตํ–‰ ํ™•์ธ (duplicated)

DataFrame.duplicated(subset=None, keep='first')

duplicated : ์ค‘๋ณต๋˜๋Š” ํ–‰์„ ํ™•์ธ
ํ–‰์˜ ๋ชจ๋“  ์š”์†Œ๊ฐ€ ๋™์ผํ•œ ํ–‰์ด ์ด๋ฏธ ์กด์žฌํ• ๊ฒฝ์šฐ ํ•ด๋‹น ํ–‰์€ True๋กœ ๋ฐ˜ํ™˜

df.duplicated(subset=None, keep='first')
subset : ํŠน์ • ์—ด๋งŒ์„ ๋Œ€์ƒ์œผ๋กœ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. list์˜ ์‚ฌ์šฉ๋„ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
keep : {first : ์œ„๋ถ€ํ„ฐ ๊ฒ€์‚ฌ / last : ์•„๋ž˜๋ถ€ํ„ฐ ๊ฒ€์‚ฌ} ๊ฒ€์‚ฌ ์ˆœ์„œ๋ฅผ ์ •ํ•ฉ๋‹ˆ๋‹ค. first์ผ ๊ฒฝ์šฐ ์œ„๋ถ€ํ„ฐ ํ™•์ธํ•ด์„œ ์ค‘๋ณตํ–‰์ด ๋‚˜์˜ค๋ฉด True๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋ฉฐ, last์ผ ๊ฒฝ์šฐ ์•„๋ž˜๋ถ€ํ„ฐ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.

 

์˜ˆ์‹œ

df = pd.DataFrame({'Num':[1, 2, 1, 2, 2, 3], 
                   'Alphabet':['a', 'b', 'a', 'b', 'a', 'b']})
df

 

keep='first'์ด๋ฉฐ ์œ„์—์„œ๋ถ€ํ„ฐ ํ–‰์„ ํ™•์ธํ•˜์—ฌ ์ค‘๋ณต์ธ ํ–‰์ด ๋‚˜์˜ค๋ฉด True๋ฅผ ๋ฐ˜ํ™˜

print(df.duplicated(keep='first'))

keep='last'์ผ ๊ฒฝ์šฐ ์•„๋ž˜๋ถ€ํ„ฐ ํ–‰์„ ํ™•์ธํ•˜์—ฌ ์ค‘๋ณต์ธ ํ–‰์ด ๋‚˜์˜ค๋ฉด True๋ฅผ ๋ฐ˜ํ™˜

print(df.duplicated(keep='last'))

 ์•„๋ž˜ ํ–‰ ๋ถ€ํ„ฐ ํ™•์ธ ํ•˜์˜€์œผ๋ฏ€๋กœ 0,1๋ฒˆ์งธ ํ–‰ ๋ฐ์ดํ„ฐ๊ฐ€ ์•„๋ž˜ 2,3๋ฒˆ์งธ ํ–‰์˜ ์ค‘๋ณต ๋ฐ์ดํ„ฐ๊ฐ€ ๋œ๋‹ค. 

 

 

 

subset์œผ๋กœ ํŠน์ • ์—ด๋งŒ ํ™•์ธ

print(df.duplicated(subset=['Alphabet']))

 


์ค‘๋ณตํ–‰ ์ œ๊ฑฐ (drop_duplicates)

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

drop_duplicates : ๋‚ด์šฉ์ด ์ค‘๋ณต๋˜๋Š” ํ–‰์„ ์ œ๊ฑฐํ•˜๋Š” ๋ฉ”์„œ๋“œ

df.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
subset : ์ค‘๋ณต๊ฐ’์„ ๊ฒ€์‚ฌํ•  ์—ด ์ž…๋‹ˆ๋‹ค. ๊ธฐ๋ณธ์ ์œผ๋กœ ๋ชจ๋“  ์—ด์„ ๊ฒ€์‚ฌํ•ฉ๋‹ˆ๋‹ค.
keep : {first / last} ์ค‘๋ณต์ œ๊ฑฐ๋ฅผํ• ๋•Œ ๋‚จ๊ธธ ํ–‰์ž…๋‹ˆ๋‹ค. first๋ฉด ์ฒซ๊ฐ’์„ ๋‚จ๊ธฐ๊ณ  last๋ฉด ๋งˆ์ง€๋ง‰ ๊ฐ’์„ ๋‚จ๊น๋‹ˆ๋‹ค.
inplace : ์›๋ณธ์„ ๋ณ€๊ฒฝํ• ์ง€์˜ ์—ฌ๋ถ€์ž…๋‹ˆ๋‹ค.
ignore_index : ์›๋ž˜ index๋ฅผ ๋ฌด์‹œํ• ์ง€ ์—ฌ๋ถ€์ž…๋‹ˆ๋‹ค. True์ผ ๊ฒฝ์šฐ 0,1,2, ... , n์œผ๋กœ ๋ถ€์—ฌ๋ฉ๋‹ˆ๋‹ค.

 

์˜ˆ์‹œ

df = pd.DataFrame({'Num':[1, 2, 1, 2, 2, 3], 
                   'Alphabet':['a', 'b', 'a', 'b', 'a', 'b']})
df

 

df.drop_duplicates()

0,1๋ฒˆ์งธ ํ–‰๊ณผ ์ค‘๋ณต๋œ 2,3๋ฒˆ์žฌ ํ–‰ ์ œ๊ฑฐ

 

 

 

 

subset์— ํŠน์ • ์ปฌ๋Ÿผ๋ช…๋งŒ ์ž…๋ ฅํ•  ๊ฒฝ์šฐ, ํ•ด๋‹น ์—ด์—๋Œ€ํ•ด์„œ๋งŒ ์ค‘๋ณต๊ฐ’ ๊ฒ€์‚ฌ๋ฅผ ์ˆ˜ํ–‰

df.drop_duplicates(subset='Num')

 

keep์ธ์ˆ˜๋ฅผ ํ†ตํ•ด ๋‚จ๊ธธ ํ–‰ ์„ ํƒ

keep='first'์ธ ๊ฒฝ์šฐ ์ฒ˜์Œ ๊ฐ’์„ ๋‚จ๊น๋‹ˆ๋‹ค. (๊ธฐ๋ณธ๊ฐ’)

df.drop_duplicates(subset='Num', keep='first')

keep='last'์ธ ๊ฒฝ์šฐ ๋งˆ์ง€๋ง‰ ๊ฐ’์„ ๋‚จ๊น๋‹ˆ๋‹ค.

df.drop_duplicates(subset='Num', keep='last')

 

์ถ”๊ฐ€๋กœ ignore_index=True๋กœ ํ•  ๊ฒฝ์šฐ ๊ฒฐ๊ณผ๊ฐ’์˜ ์ธ๋ฑ์Šค๋ฅผ 0, 1, 2, ... , n์œผ๋กœ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.

df.drop_duplicates(subset='Num', keep='last',ignore_index=True)

์œ„์˜ ๊ฒฐ๊ณผ์™€ ๋™์ผํ•˜์ง€๋งŒ ์ธ๋ฑ์Šค๊ฐ€ ๋ฐ”๋€ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

 

 

inplace ์ธ์ˆ˜์˜ ์‚ฌ์šฉ
Pandas ๊ณตํ†ต์‚ฌํ•ญ์œผ๋กœ inplace์ธ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ ์›๋ณธ์— ๋ณ€๊ฒฝ์ด ์ ์šฉ

df.drop_duplicates(subset='Num',inplace=True)
print(df)

 

 

 

์‹œ๊ฐ„ ๋ฐ์ดํ„ฐ๋ฅผ DataFrame์œผ๋กœ ์ฒ˜๋ฆฌํ•  ๋•Œ Pandas resample() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์‹œ๊ฐ„์„ ์žฌ์กฐ์ • ํ•  ์ˆ˜ ์žˆ๋‹ค.

import pandas as pd
df = pd.read_csv("../input/bigdatacertificationkr/basic2.csv", parse_dates=['Date'], index_col=0)

์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ๋ถˆ๋Ÿฌ์˜ต๋‹ˆ๋‹ค.

์ฃผ ๋‹จ์œ„์˜ ์ปฌ๋Ÿผ ํ•ฉ๊ณ„๋ฅผ ๊ตฌํ•ด๋ณด์ž!  

resample('W') ์‚ฌ์šฉ

df_w = df.resample('W').sum()
df_w

Date ์ปฌ๋Ÿผ์„ ๋ณด๋ฉด ์ฃผ ๋‹จ์œ„๋กœ ๋ณ€๊ฒฝ ๋œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

์›” ๋‹จ์œ„์˜ ์ปฌ๋Ÿผ ํ•ฉ๊ณ„๋ฅผ ๊ตฌํ•ด๋ณด์ž.

resample('M') ์‚ฌ์šฉ

df_m = df.resample('M').sum()
df_m

Date ์ปฌ๋Ÿผ์„ ๋ณด๋ฉด ์›”(๋งˆ์ง€๋ง‰๋‚  ๊ธฐ์ค€) ๋‹จ์œ„๋กœ ๋ณ€๊ฒฝ ๋œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

resample('MS') ์‚ฌ์šฉ

df_ms = df.resample('MS').sum()
df_ms

Date ์ปฌ๋Ÿผ์„ ๋ณด๋ฉด ์›”(์‹œ์ž‘ ์ผ ๊ธฐ์ค€) ๋‹จ์œ„๋กœ ๋ณ€๊ฒฝ ๋œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

์ฃผ๊ธฐ ๋‹จ์œ„ ์ •๋ฆฌ

๋‚ ์งœ ์˜คํ”„์…‹๋นˆ๋„ ๋ฌธ์ž์—ด์„ค๋ช…

์ฐธ๊ณ : https://pandas.pydata.org/pandas-docs/stable/user_guide/window.html

DateOffset ์—†์Œ ์ผ๋ฐ˜ ์˜คํ”„์…‹ ํด๋ž˜์Šค, ๊ธฐ๋ณธ๊ฐ’์€ ์ ˆ๋Œ€ 24์‹œ๊ฐ„์ž…๋‹ˆ๋‹ค.
BDay๋˜๋Š”BusinessDay 'B' ์˜์—…์ผ(ํ‰์ผ)
CDay๋˜๋Š”CustomBusinessDay 'C' ๋งž์ถค ์˜์—…์ผ
Week 'W' ์ผ์ฃผ์ผ, ์„ ํƒ์ ์œผ๋กœ ์š”์ผ์— ๊ณ ์ •
WeekOfMonth 'WOM' ๋งค์›” y๋ฒˆ์งธ ์ฃผ์˜ x๋ฒˆ์งธ ๋‚ 
LastWeekOfMonth 'LWOM' ๋งค์›” ๋งˆ์ง€๋ง‰ ์ฃผ์˜ x๋ฒˆ์งธ ๋‚ 
MonthEnd 'M' ์—ญ์›”๋ง
MonthBegin 'MS' ์—ญ์›” ์‹œ์ž‘
BMonthEnd๋˜๋Š”BusinessMonthEnd 'BM' ์˜์—… ์›”๋ง
BMonthBegin๋˜๋Š”BusinessMonthBegin 'BMS' ์˜์—…์›” ์‹œ์ž‘
CBMonthEnd๋˜๋Š”CustomBusinessMonthEnd 'CBM' ๋งž์ถคํ˜• ์˜์—… ์›”๋ง
CBMonthBegin๋˜๋Š”CustomBusinessMonthBegin 'CBMS' ๋งž์ถคํ˜• ๋น„์ฆˆ๋‹ˆ์Šค ์›” ์‹œ์ž‘
SemiMonthEnd 'SM' 15์ผ(๋˜๋Š” ๋‹ค๋ฅธ day_of_month) ๋ฐ ์—ญ์›”๋ง
SemiMonthBegin 'SMS' 15์ผ(๋˜๋Š” ๋‹ค๋ฅธ day_of_month) ๋ฐ ์—ญ์›” ์‹œ์ž‘
QuarterEnd 'Q' ๋‹ฌ๋ ฅ ๋ถ„๊ธฐ๋ง
QuarterBegin 'QS' ๋‹ฌ๋ ฅ ๋ถ„๊ธฐ ์‹œ์ž‘
BQuarterEnd 'BQ ์‚ฌ์—… ๋ถ„๊ธฐ๋ง
BQuarterBegin 'BQS' ์‚ฌ์—… ๋ถ„๊ธฐ ์‹œ์ž‘
FY5253Quarter 'REQ' ์†Œ๋งค(์ผ๋ช… 52-53์ฃผ) ๋ถ„๊ธฐ
YearEnd 'A' ๋‹ฌ๋ ฅ ์—ฐ๋ง
YearBegin 'AS'๋˜๋Š”'BYS' ์—ญ๋…„ ์‹œ์ž‘
BYearEnd 'BA' ์‚ฌ์—…์—ฐ๋ง
BYearBegin 'BAS' ์‚ฌ์—… ์—ฐ๋„ ์‹œ์ž‘
FY5253 'RE' ์†Œ๋งค(์ผ๋ช… 52-53์ฃผ) ์—ฐ๋„
Easter ์—†์Œ ๋ถ€ํ™œ์ ˆ
BusinessHour 'BH' ์˜์—…์‹œ๊ฐ„
CustomBusinessHour 'CBH' ๋งž์ถค ์—…๋ฌด ์‹œ๊ฐ„
Day 'D' ์ ˆ๋Œ€์ ์ธ ํ•˜๋ฃจ
Hour 'H' ํ•œ ์‹œ๊ฐ„
Minute 'T'๋˜๋Š”'min' 1๋ถ„
Second 'S' ์ผ์ดˆ
Milli 'L'๋˜๋Š”'ms' 1๋ฐ€๋ฆฌ์ดˆ
Micro 'U'๋˜๋Š”'us' 1๋งˆ์ดํฌ๋กœ์ดˆ
Nano 'N' 1๋‚˜๋…ธ์ดˆ

 

 

pd.merge๋Š” ๊ณตํ†ต์˜ ์—ด์„ ๊ธฐ์ค€์œผ๋กœ ๋‘ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ํ•ฉ์ณ์ค€๋‹ค.

import pandas as pd


# ๊ธฐ์ค€์—ด ์ด๋ฆ„์ด ๊ฐ™์„ ๋•Œ
pd.merge(left, right, on = '๊ธฐ์ค€์—ด', how = '์กฐ์ธ๋ฐฉ์‹')

# ๊ธฐ์ค€์—ด ์ด๋ฆ„์ด ๋‹ค๋ฅผ ๋•Œ
pd.merge(left, right, left_on = '์™ผ์ชฝ ์—ด', right_on = '์˜ค๋ฅธ์ชฝ ์—ด', how = '์กฐ์ธ๋ฐฉ์‹')

left: ์™ผ์ชฝ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„

right: ์˜ค๋ฅธ์ชฝ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„

on: (๋‘ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ๊ธฐ์ค€์—ด ์ด๋ฆ„์ด ๊ฐ™์„ ๋•Œ) ๊ธฐ์ค€์—ด

how: ์กฐ์ธ ๋ฐฉ์‹('left', 'right', 'inner', 'outer') ๊ธฐ๋ณธ๊ฐ’์€ 'inner'

 

left_on: ๊ธฐ์ค€์—ด ์ด๋ฆ„์ด ๋‹ค๋ฅผ ๋•Œ, ์™ผ์ชฝ ๊ธฐ์ค€์—ด

right_on: ๊ธฐ์ค€์—ด ์ด๋ฆ„์ด ๋‹ค๋ฅผ ๋•Œ, ์˜ค๋ฅธ์ชฝ ๊ธฐ์ค€์—ด


์˜ˆ์‹œ

# ์˜ˆ์‹œ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ
students = pd.DataFrame({'Num':[4, 14, 7, 11, 10], 'students':['์ฑ„์น˜์ˆ˜', '์ •๋Œ€๋งŒ', '์†กํƒœ์„ญ', '์„œํƒœ์›…', '๊ฐ•๋ฐฑํ˜ธ']})
grade = pd.DataFrame({'Num':[14, 11, 12], 'Grade':['A', 'B', 'C']})

students๊ณผ grade๋Š” Num๋ฅผ ๊ณตํ†ต์˜ ์—ด๋กœ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค.

Num๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ๋ณ‘ํ•ฉํ•ด๋ณด์ž.

1. how = JOIN ๋ฐฉ์‹

 

Left Join: ์™ผ์ชฝ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ๊ธฐ์ค€์œผ๋กœ ํ•œ๋‹ค. ์˜ค๋ฅธ์ชฝ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์— ์—†๋Š” ๊ฐ’์€ NaN์œผ๋กœ ๋‚˜ํƒ€๋‚œ๋‹ค.

pd.merge(left, right, on = 'col', how = 'left')

 

Right Join: ์˜ค๋ฅธ์ชฝ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ๊ธฐ์ค€์œผ๋กœ ์กฐ์ธํ•œ๋‹ค. ์™ผ์ชฝ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์— ์—†๋Š” ๊ฐ’์€ NaN์œผ๋กœ ๋‚˜ํƒ€๋‚ธ๋‹ค.

pd.merge(left, right, on = 'col', how = 'right')

 

Inner Join: ๊ต์ง‘ํ•ฉ์„ ์˜๋ฏธํ•œ๋‹ค. ์–‘์ชฝ์— ๊ณตํ†ต์œผ๋กœ ์žˆ๋Š” ๊ฐ’๋งŒ ๋‚˜ํƒ€๋‚ธ๋‹ค.

pd.merge(left, right, on = 'col', how = 'inner')

 

Outer Join: ๋ชจ๋“  ๊ฐ’์ด ๋‚˜ํƒ€๋‚˜๋„๋ก ํ•œ๋‹ค. ์™ผ์ชฝ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„๊ณผ ์˜ค๋ฅธ์ชฝ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์— ์—†๋Š” ๊ฐ’๋“ค์€ NaN์œผ๋กœ ๋‚˜ํƒ€๋‚œ๋‹ค.

pd.merge(left, right, on = 'col', how = 'outer')

 

 

2. on = '๊ธฐ์ค€์—ด'

๊ณตํ†ต์ด ๋˜๋Š” ๊ธฐ์ค€ ์—ด์ด ์—ฌ๋Ÿฌ ๊ฐœ์ผ ๋•Œ

pd.merge(left, right, on = ['col1', 'col2'])

 

๋‘ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ์—ด ์ด๋ฆ„์ด ๋‹ค๋ฅผ ๋•Œ

์œ„ ์˜ˆ์‹œ์ฒ˜๋Ÿผ ์–‘์ชฝ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ merge์˜ ๊ธฐ์ค€์ด ๋˜๋Š” ์—ด์˜ ์ด๋ฆ„์ด ๊ฐ™๋‹ค๋ฉด on = '๊ธฐ์ค€์—ด'๋กœ ์ง€์ •ํ•ด์ฃผ๋ฉด ๋œ๋‹ค. ํ•˜์ง€๋งŒ ์–‘์ชฝ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ ๊ธฐ์ค€์ด ๋˜๋Š” ์—ด์˜ ์ด๋ฆ„์ด ๋‹ค๋ฅด๋‹ค๋ฉด ๊ฐ๊ฐ left_on = '์™ผ์ชฝ ์—ด', right_on = '์˜ค๋ฅธ์ชฝ ์—ด'๋กœ ์ง€์ •ํ•˜๋ฉด ๋œ๋‹ค.

pd.merge(left, right, left_on = '์™ผ์ชฝ ์—ด', right_on = '์˜ค๋ฅธ์ชฝ ์—ด')

 

 

pd.to_datetime['column'] ์ด์šฉํ•ด์„œ ๋ฐ”๊พธ๊ธฐ

import pandas as pd
df = pd.read_csv('../input/bigdatacertificationkr/basic2.csv')
print(df.head())

df.info()

pd.to_datetime ์ด์šฉํ•˜์—ฌ ๋ณ€๊ฒฝ

df['Date'] = pd.to_datetime(df['Date'])
df.info()

df.info()๋ฅผ ํ†ตํ•˜์—ฌ Date ์นผ๋Ÿผ์˜ Dtype์ด objectd์—์„œ datetime64[ns]๋กœ ๋ฐ”๋€ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

read_csv๋กœ ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ฌ ๋•Œ parse_dates ์˜ต์…˜์œผ๋กœ datetime์œผ๋กœ ๋ฐ”๋กœ ์ง€์ •ํ•˜๊ธฐ

df = pd.read_csv("../input/bigdatacertificationkr/basic2.csv", parse_dates=['Date'])
df.info()

Date ์นผ๋Ÿผ์ด ๋ฐ”๋กœ datetime64[ns]๋กœ ๋ฐ”๋€Œ์–ด์ ธ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถˆ๋Ÿฌ์™”์Œ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. 

DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')

reset_index:  ์„ค์ • ์ธ๋ฑ์Šค๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  ๊ธฐ๋ณธ ์ธ๋ฑ์Šค(0,1,2, ... , n)์œผ๋กœ ๋ณ€๊ฒฝ

  • level : Multi Index์˜ ๊ฒฝ์šฐ ์ œ๊ฑฐํ•  ์ธ๋ฑ์Šค์˜ ๋ ˆ๋ฒจ์„ ์„ค์ •ํ• ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ธฐ๋ณธ์ ์œผ๋กœ ๋ชจ๋“  ๋ ˆ๋ฒจ์ž…๋‹ˆ๋‹ค.
  • drop : ์ œ๊ฑฐํ•œ ์ธ๋ฑ์Šค๋ฅผ ์—ด์— ์ถ”๊ฐ€ํ• ์ง€ ์—ฌ๋ถ€์ž…๋‹ˆ๋‹ค. ๊ธฐ๋ณธ๊ฐ’์€ False๋กœ ์ œ๊ฑฐ๋œ ์ธ๋ฑ์Šค๋Š” ์—ด๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค.
  • inplace : pandas ๊ณตํ†ต ์ธ์ˆ˜๋กœ, ์›๋ณธ์„ ๋ณ€๊ฒฝํ• ์ง€ ์—ฌ๋ถ€ ์ž…๋‹ˆ๋‹ค.
  • col_level / col_fill : Multi Index์˜ ๊ฒฝ์šฐ ์ œ๊ฑฐ๋œ ์ธ๋ฑ์Šค๋ฅผ ์—ด์— ์ถ”๊ฐ€ํ•  ๋•Œ ๋ ˆ๋ฒจ๊ณผ ์—ด์˜ ์ด๋ฆ„์„ ์„ค์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    col_level์„ ํ†ตํ•ด ๋ ˆ๋ฒจ์„ ์„ค์ •ํ•˜๊ณ , col_fill์„ํ†ตํ•ด ํ•ด๋‹น ์—ด์˜ ์ด๋ฆ„์„ ์ •ํ•ฉ๋‹ˆ๋‹ค.

 

์˜ˆ์‹œ

๋จผ์ € 3x2 ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ƒ์„ฑ

import pandas as pd 

df = pd.DataFrame([[1,2],[3,4],[5,6]],['row1','row2','row3'],['col1','col2'])
print(df)
      col1  col2
row1     1     2
row2     3     4
row3     5     6

 

๊ธฐ๋ณธ์ ์ธ ์‚ฌ์šฉ๋ฒ•(+drop, inplace)

๊ธฐ๋ณธ์ ์œผ๋กœ ์•„๋ฌด ์ธ์ˆ˜ ์—†์ด ์‚ฌ์šฉํ•˜๊ฒŒ ๋˜๋ฉด, ๋ชจ๋“  ๋ ˆ๋ฒจ์— ๋Œ€ํ•ด ์ธ๋ฑ์Šค๊ฐ€ ์ œ๊ฑฐ๋˜๋ฉฐ, ์—ด์— ์ถ”๊ฐ€๋ฉ๋‹ˆ๋‹ค.

print(df.reset_index())
  index  col1  col2
0  row1     1     2
1  row2     3     4
2  row3     5     6

์ฒซ๋ฒˆ์งธ ์—ด 0,1,2 ์ถ”๊ฐ€  / row1 ์—ด ์ปฌ๋Ÿผ๋ช… index๋กœ ๋ฐ”๋€œ

drop = True์ธ ๊ฒฝ์šฐ ์—ด์ด ๋ฐ–์œผ๋กœ ๊บผ๋‚ด์ง€๋Š”๊ฒŒ ์•„๋‹ˆ๋ผ ์™„์ „ํžˆ ์‚ญ์ œ๋ฉ๋‹ˆ๋‹ค.

print(df.reset_index(drop=True))
   col1  col2
0     1     2
1     3     4
2     5     6

index ์ปฌ๋Ÿผ ์‚ญ์ œ / 0 1 2 ์ˆœ์„œ๋กœ๋งŒ ๋‚จ์Œ

inplace=True์ธ ๊ฒฝ์šฐ ์›๋ณธ์ด ๋ณ€๊ฒฝ๋˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

df.reset_index(inplace=True)
print(df)
  index  col1  col2
0  row1     1     2
1  row2     3     4
2  row3     5     6

 

 

Multi Index ์˜ˆ์‹œ

 4x4์งœ๋ฆฌ Multi Index ์ƒ์„ฑ

idx = [['IDX1','IDX1','IDX2','IDX2'],['row1','row2','row3','row4']]
col = [['COL1','COL1','COL2','COL2'],['val1','val2','val3','val4']]
data = [[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]]
df2 = pd.DataFrame(data,idx,col)
print(df2)
          COL1      COL2     
          val1 val2 val3 val4
IDX1 row1    1    2    3    4
     row2    5    6    7    8
IDX2 row3    9   10   11   12
     row4   13   14   15   16

 

๊ธฐ๋ณธ์ ์ธ ์‚ฌ์šฉ๋ฒ•(+level)
Multi Index์˜ ๊ฒฝ์šฐ level์„ ์„ค์ •ํ•ด์คŒ์œผ๋กœ์„œ ์ œ๊ฑฐํ•  ์ธ๋ฑ์Šค์˜ ๋ ˆ๋ฒจ์„ ์„ ํƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
level์„ ์ž…๋ ฅํ•˜์ง€ ์•Š์€ ๊ฒฝ์šฐ ๋ชจ๋“  index์— ๋Œ€ํ•ด์„œ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค.

print(df2.reset_index())
  level_0 level_1 COL1      COL2     
                  val1 val2 val3 val4
0    IDX1    row1    1    2    3    4
1    IDX1    row2    5    6    7    8
2    IDX2    row3    9   10   11   12
3    IDX2    row4   13   14   15   16

 

level=0์ธ ๊ฒฝ์šฐ

print(df2.reset_index(level=0))

 

     level_0 COL1      COL2     
             val1 val2 val3 val4
row1    IDX1    1    2    3    4
row2    IDX1    5    6    7    8
row3    IDX2    9   10   11   12
row4    IDX2   13   14   15   16

 

level=1์ธ ๊ฒฝ์šฐ

print(df2.reset_index(level=1))
     level_1 COL1      COL2     
             val1 val2 val3 val4
IDX1    row1    1    2    3    4
IDX1    row2    5    6    7    8
IDX2    row3    9   10   11   12
IDX2    row4   13   14   15   16

 

col_fill / col_level์˜ ์‚ฌ์šฉ
col_fill์„ ์ด์šฉํ•˜์—ฌ ์—ด๋กœ ๋ณ€๊ฒฝ๋˜๋Š” ์ธ๋ฑ์Šค์˜ ์—ด ์ด๋ฆ„์„ ์„ค์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. COL0์œผ๋กœ ์„ค์ •ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

print(df2.reset_index(level=1, col_fill='COL0'))
     level_1 COL1      COL2     
        COL0 val1 val2 val3 val4
IDX1    row1    1    2    3    4
IDX1    row2    5    6    7    8
IDX2    row3    9   10   11   12
IDX2    row4   13   14   15   16

col_level์„ ์ด์šฉํ•ด ์—ด ์ด๋ฆ„์˜ ๋ ˆ๋ฒจ์„ ์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. COL0์„ ๋‹ค๋ฅธ ์—ด์ด๋ฆ„์— ๋งž๊ฒŒ LEVEL์„ ๋ณ€๊ฒฝํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

print(df2.reset_index(level=1,col_fill='col0',col_level=1))
        col0 COL1      COL2     
     level_1 val1 val2 val3 val4
IDX1    row1    1    2    3    4
IDX1    row2    5    6    7    8
IDX2    row3    9   10   11   12
IDX2    row4   13   14   15   16

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

fillna ๋ฉ”์„œ๋“œ : DataFrame์—์„œ ๊ฒฐ์ธก๊ฐ’์„ ์›ํ•˜๋Š” ๊ฐ’์œผ๋กœ ๋ณ€๊ฒฝํ•˜๋Š” ๋ฉ”์„œ๋“œ์ž…๋‹ˆ๋‹ค.

  • value : ๊ฒฐ์ธก๊ฐ’์„ ๋Œ€์ฒดํ•  ๊ฐ’์ž…๋‹ˆ๋‹ค. dictํ˜•ํƒœ๋กœ๋„ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
  • method : ๊ฒฐ์ธก๊ฐ’์„ ๋ณ€๊ฒฝํ•  ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. bfill๋กœ ํ• ๊ฒฝ์šฐ ๊ฒฐ์ธก๊ฐ’์„ ๋ฐ”๋กœ ์•„๋ž˜ ๊ฐ’๊ณผ ๋™์ผํ•˜๊ฒŒ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค.ffill๋กœ ํ•  ๊ฒฝ์šฐ ๊ฒฐ์ธก๊ฐ’์„ ๋ฐ”๋กœ ์œ„ ๊ฐ’๊ณผ ๋™์ผํ•˜๊ฒŒ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค.
  • axis : {0 : index / 1 : columns} fillna ๋ฉ”์„œ๋“œ๋ฅผ ์ ์šฉํ•  ๋ ˆ์ด๋ธ”์ž…๋‹ˆ๋‹ค.
  • inplace : ์›๋ณธ์„ ๋ณ€๊ฒฝํ• ์ง€ ์—ฌ๋ถ€์ž…๋‹ˆ๋‹ค. True์ผ ๊ฒฝ์šฐ ์›๋ณธ์„ ๋ณ€๊ฒฝํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
  • limit : ๊ฒฐ์ธก๊ฐ’์„ ๋ณ€๊ฒฝํ•  ํšŸ์ˆ˜์ž…๋‹ˆ๋‹ค. ์œ„์—์„œ๋ถ€ํ„ฐ limit๋กœ ์ง€์ •๋œ ๊ฐฏ์ˆ˜๋งŒํผ๋งŒ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค.
  • downcast : ๋‹ค์šด์บ์ŠคํŠธํ• ์ง€ ์—ฌ๋ถ€์ž…๋‹ˆ๋‹ค. downcast='infer'์ผ ๊ฒฝ์šฐ float64๋ฅผ int64๋กœ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค.

 

์˜ˆ์‹œ

๋จผ์ € 5x5 ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ƒ์„ฑ

col  = ['col1','col2','col3','col4','col5']
row  = ['row1','row2','row3','row4','row5']
na = np.nan
data = [[na, 2,na, 4,na],
        [ 6, 7,na, 9,na],
        [11,na,na,14,15],
        [na,17,na,na,20],
        [na,22,na,na,25]]
df = pd.DataFrame(data,row,col)
print(df)
      col1  col2  col3  col4  col5
row1   NaN   2.0   NaN   4.0   NaN
row2   6.0   7.0   NaN   9.0   NaN
row3  11.0   NaN   NaN  14.0  15.0
row4   NaN  17.0   NaN   NaN  20.0
row5   NaN  22.0   NaN   NaN  25.0

 

value์˜ ํ˜•์‹์— ๋”ฐ๋ฅธ ์‚ฌ์šฉ

value๊ฐ€ ์ˆซ์ž๋‚˜ ๋ฌธ์ž์ผ ๊ฒฝ์šฐ ๊ทธ๋Œ€๋กœ ๊ฒฐ์ธก๊ฐ’์„ ๋Œ€์ฒดํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ๋Š” A๋กœ ๋ฐ”๊ฟ”๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

print(df.fillna('A'))
      col1  col2 col3  col4  col5
row1     A   2.0    A   4.0     A
row2   6.0   7.0    A   9.0     A
row3  11.0     A    A  14.0  15.0
row4     A  17.0    A     A  20.0
row5     A  22.0    A     A  25.0

dictํ˜•ํƒœ๋กœ ์ž…๋ ฅํ•  ๊ฒฝ์šฐ ๊ฐ๊ฐ ๋ ˆ์ด๋ธ”๊ฐ’์— ๋Œ€ํ•ด ์›ํ•˜๋Š” ๊ฐ’์œผ๋กœ์˜ ๋ณ€๊ฒฝ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

dict = {'col1':'A', 'col2':'B', 'col3':'C', 'col4':'D', 'col5':'E'}
print(df.fillna(value=dict))

# print(df.fillna(dict))๋„ ๊ฐ€๋Šฅ
      col1  col2 col3  col4  col5
row1     A   2.0    C   4.0     E
row2   6.0   7.0    C   9.0     E
row3  11.0     B    C  14.0  15.0
row4     A  17.0    C     D  20.0
row5     A  22.0    C     D  25.0

 

method์ธ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ

method์ธ์ˆ˜์— bfill์„ ์ž…๋ ฅํ•  ๊ฒฝ์šฐ ๊ฒฐ์ธก๊ฐ’์ด ๋ฐ”๋กœ ์•„๋ž˜๊ฐ’๊ณผ ๋™์ผํ•˜๊ฒŒ ์„ค์ •๋ฉ๋‹ˆ๋‹ค.

โ€ป df.backfill( )์ด๋‚˜ df.bfill( )๊ณผ ์™„์ „ํžˆ ๋™์ผํ•œ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰

print(df.fillna(method='bfill'))
      col1  col2  col3  col4  col5
row1   6.0   2.0   NaN   4.0  15.0
row2   6.0   7.0   NaN   9.0  15.0
row3  11.0  17.0   NaN  14.0  15.0
row4   NaN  17.0   NaN   NaN  20.0
row5   NaN  22.0   NaN   NaN  25.0

 

method์ธ์ˆ˜์— ffill์„ ์ž…๋ ฅํ•  ๊ฒฝ์šฐ ๊ฒฐ์ธก๊ฐ’์ด ๋ฐ”๋กœ ์œ„๊ฐ’๊ณผ ๋™์ผํ•˜๊ฒŒ ์„ค์ •๋ฉ๋‹ˆ๋‹ค.

โ€ป df.pad( )๋‚˜ df.ffill( )๊ณผ ์™„์ „ํžˆ ๋™์ผํ•œ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

print(df.fillna(method='ffill'))
      col1  col2  col3  col4  col5
row1   NaN   2.0   NaN   4.0   NaN
row2   6.0   7.0   NaN   9.0   NaN
row3  11.0   7.0   NaN  14.0  15.0
row4  11.0  17.0   NaN  14.0  20.0
row5  11.0  22.0   NaN  14.0  25.0

 

limit์ธ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ

limit์ธ์ˆ˜๋Š” ๊ฐ ๋ ˆ์ด๋ธ”๊ฐ’์— ๋Œ€ํ•ด์„œ ๊ฒฐ์ธก์น˜ ๋ณ€๊ฒฝ์„ ์ˆ˜ํ–‰ํ•  ํšŸ์ˆ˜์ž…๋‹ˆ๋‹ค. ํ–‰ ๊ธฐ์ค€์ผ๊ฒฝ์šฐ ์™ผ์ชฝ๋ถ€ํ„ฐ, ์—ด ๊ธฐ์ค€์ผ ๊ฒฝ์šฐ ์œ„์—์„œ๋ถ€ํ„ฐ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

print(df.fillna('A', limit=2))
       col1  col2 col3  col4  col5
row1     A   2.0    A   4.0     A
row2   6.0   7.0    A   9.0     A
row3  11.0     A  NaN  14.0  15.0
row4     A  17.0  NaN     A  20.0
row5   NaN  22.0  NaN     A  25.0

 

downcast์ธ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ

downcast ์ธ์ˆ˜๋ฅผ 'infer'๋กœ ์„ค์ •ํ•จ์œผ๋กœ์จ float64ํ˜•ํƒœ๋ฅผ int64ํ˜•ํƒœ๋กœ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

print(df.fillna(0, downcast='infer'))
      col1  col2  col3  col4  col5
row1     0     2     0     4     0
row2     6     7     0     9     0
row3    11     0     0    14    15
row4     0    17     0     0    20
row5     0    22     0     0    25

 

inplace๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ

๋‹ค๋ฅธ ํŒŒ์ด์ฌ ๊ฐ์ฒด์—์„œ์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ inplace๋Š” ์›๋ณธ์„ ๋ฎ์–ด์”Œ์šฐ๋Š” ๊ธฐ๋Šฅ๊ณผ ์œ ์‚ฌํ•œ ๊ธฐ๋Šฅ์„ ํ•ฉ๋‹ˆ๋‹ค.
์ฆ‰ df.fillna(0, inplace=True) ๋Š” df=df.fillna(0)๊ณผ ๋™์ผํ•œ ๊ธฐ๋Šฅ์„ ํ•ฉ๋‹ˆ๋‹ค.

df.fillna('A', inplace=True)
print(df)
      col1  col2 col3  col4  col5
row1     A   2.0    A   4.0     A
row2   6.0   7.0    A   9.0     A
row3  11.0     A    A  14.0  15.0
row4     A  17.0    A     A  20.0
row5     A  22.0    A     A  25.0

 

df.cumsum(axis=None, skipna=True, args, kwargs) ๋ˆ„์ ํ•ฉ
df.cumprod(axis=None, skipna=True, args, kwargs) ๋ˆ„์ ๊ณฑ


axis : ๋ˆ„์ ํ•ฉ/๋ˆ„์ ๊ณฑ์„ ๊ตฌํ•  ์ถ•์„ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค.
skipna : ๊ฒฐ์ธก์น˜๋ฅผ ๋ฌด์‹œํ• ์ง€ ์—ฌ๋ถ€ ์ž…๋‹ˆ๋‹ค

์˜ˆ์‹œ

df = pd.DataFrame({'col1':[2,-2,4,5,6,8],'col2':[3,4,np.NaN,7,4,5]})
print(df)
   col1  col2
0     2   3.0
1    -2   4.0
2     4   NaN
3     5   7.0
4     6   4.0
5     8   5.0

 

๋ˆ„์ ํ•ฉ cumsum() 

print(df.cumsum())
 col1  col2
0     2   3.0
1     0   7.0
2     4   NaN
3     9  14.0
4    15  18.0
5    23  23.0

 

๋ˆ„์ ๊ณฑ cumprod()

print(df.cumprod())
   col1    col2
0     2     3.0
1    -4    12.0
2   -16     NaN
3   -80    84.0
4  -480   336.0
5 -3840  1680.0

 

skipna ์ธ์ˆ˜์˜ ์‚ฌ์šฉ

print(df.cumsum(skipna=False))
   col1  col2
0     2   3.0
1     0   7.0
2     4   NaN # NaN ๋“ฑ์žฅ๋ถ€ํ„ฐ ๊ณ„์‚ฐํ•  ์ˆ˜ ์—†์œผ๋ฏ€๋กœ NaN ๋ฐ˜ํ™˜
3     9   NaN
4    15   NaN
5    23   NaN

+ Recent posts