Checking for duplicate rows (duplicated)

DataFrame.duplicated(subset=None, keep='first')

duplicated : checks which rows are duplicates.
If a row whose elements are all identical has already appeared, the later row is returned as True.

df.duplicated(subset=None, keep='first')
subset : restricts the check to specific columns. A list of column names is also accepted.
keep : {first : check from the top / last : check from the bottom} sets the direction of the check. With first, rows are scanned from the top, so the first occurrence is False and every later repeat is True; with last, rows are scanned from the bottom, so the last occurrence is False and the earlier repeats are True.

 

Example

import pandas as pd

df = pd.DataFrame({'Num': [1, 2, 1, 2, 2, 3],
                   'Alphabet': ['a', 'b', 'a', 'b', 'a', 'b']})
df

 

With keep='first', rows are checked from the top and True is returned for each row that repeats an earlier one.

print(df.duplicated(keep='first'))

With keep='last', rows are checked from the bottom and True is returned for each row that repeats a later one.

print(df.duplicated(keep='last'))

Because the check ran from the bottom, rows 0 and 1 are flagged as duplicates of rows 2 and 3 below them.
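To make the difference between the two keep settings concrete, here is a minimal sketch on the same example frame. The names first_mask, last_mask, and flagged_rows are only illustrative, and the boolean-indexing step is an addition not shown in the original post.

```python
import pandas as pd

df = pd.DataFrame({'Num': [1, 2, 1, 2, 2, 3],
                   'Alphabet': ['a', 'b', 'a', 'b', 'a', 'b']})

# keep='first': the first occurrence stays False, later repeats become True.
first_mask = df.duplicated(keep='first')
# keep='last': the last occurrence stays False, earlier repeats become True.
last_mask = df.duplicated(keep='last')

print(first_mask.tolist())  # [False, False, True, True, False, False]
print(last_mask.tolist())   # [True, True, False, False, False, False]

# The mask can be used for boolean indexing to look at the flagged rows.
flagged_rows = df[first_mask]
print(flagged_rows.index.tolist())  # [2, 3]
```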

 

 

 

Checking only specific columns with subset

print(df.duplicated(subset=['Alphabet']))
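As a rough illustration of how subset changes the result, the sketch below compares a single-column check with a check over both columns. The variable names and the duplicate count at the end are illustrative additions, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'Num': [1, 2, 1, 2, 2, 3],
                   'Alphabet': ['a', 'b', 'a', 'b', 'a', 'b']})

# Only the 'Alphabet' column is compared, so every repeated letter is flagged.
by_alphabet = df.duplicated(subset=['Alphabet'])
print(by_alphabet.tolist())  # [False, False, True, True, True, True]

# Listing every column gives the same mask as the default subset=None.
by_both = df.duplicated(subset=['Num', 'Alphabet'])
print(by_both.equals(df.duplicated()))  # True

# Summing the boolean mask counts the flagged rows.
print(int(by_alphabet.sum()))  # 4
```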

 


Removing duplicate rows (drop_duplicates)

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

drop_duplicates : a method that removes rows whose contents are duplicated.

df.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
subset : the columns to check for duplicate values. By default, all columns are checked.
keep : {first / last} which row to keep when duplicates are removed. first keeps the first occurrence and last keeps the last occurrence.
inplace : whether to modify the original object.
ignore_index : whether to discard the original index. If True, the result is relabeled 0, 1, 2, ..., n-1.

 

Example

df = pd.DataFrame({'Num':[1, 2, 1, 2, 2, 3], 
                   'Alphabet':['a', 'b', 'a', 'b', 'a', 'b']})
df

 

df.drop_duplicates()

Rows 2 and 3, which duplicate rows 0 and 1, are removed.
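One way to see how this connects to duplicated is that dropping duplicates gives the same rows as filtering with the negated mask. The sketch below checks this on the example frame; deduped and manual are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Num': [1, 2, 1, 2, 2, 3],
                   'Alphabet': ['a', 'b', 'a', 'b', 'a', 'b']})

# Default call: all columns are compared and the first occurrence is kept.
deduped = df.drop_duplicates()
print(deduped.index.tolist())  # [0, 1, 4, 5]

# Same rows obtained by keeping everything that duplicated() does not flag.
manual = df[~df.duplicated()]
print(deduped.equals(manual))  # True
```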

 

 

 

 

If only a specific column name is passed to subset, the duplicate check is performed on that column alone.

df.drop_duplicates(subset='Num')

 

Choosing which row to keep with the keep argument

keep='first' keeps the first occurrence. (default)

df.drop_duplicates(subset='Num', keep='first')

keep='last' keeps the last occurrence.

df.drop_duplicates(subset='Num', keep='last')

 

In addition, setting ignore_index=True relabels the index of the result as 0, 1, 2, ..., n-1.

df.drop_duplicates(subset='Num', keep='last', ignore_index=True)

The result is the same as above, but you can see that the index has changed.
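The index change can be checked directly. The sketch below compares the result with and without ignore_index, and also against reset_index(drop=True), which is shown here only as an assumed equivalent way to get the same relabeling. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Num': [1, 2, 1, 2, 2, 3],
                   'Alphabet': ['a', 'b', 'a', 'b', 'a', 'b']})

# Without ignore_index the surviving rows keep their original labels.
kept_last = df.drop_duplicates(subset='Num', keep='last')
print(kept_last.index.tolist())  # [2, 4, 5]

# With ignore_index=True the same rows are relabeled from 0.
renumbered = df.drop_duplicates(subset='Num', keep='last', ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]

# Resetting the index afterwards gives the same frame.
print(renumbered.equals(kept_last.reset_index(drop=True)))  # True
```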

 

 

 

Using the inplace argument
As with other pandas methods, passing inplace=True applies the change to the original object.

df.drop_duplicates(subset='Num', inplace=True)
print(df)
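For comparison, the sketch below contrasts inplace=True with simply assigning the returned frame to a new name. The names original, modified_in_place, and reassigned are hypothetical, and the copy is made only so that both approaches can be shown side by side.

```python
import pandas as pd

original = pd.DataFrame({'Num': [1, 2, 1, 2, 2, 3],
                         'Alphabet': ['a', 'b', 'a', 'b', 'a', 'b']})

# inplace=True modifies the object it is called on.
modified_in_place = original.copy()
modified_in_place.drop_duplicates(subset='Num', inplace=True)
print(modified_in_place.index.tolist())  # [0, 1, 5]

# Without inplace, a new frame is returned and the original is untouched.
reassigned = original.drop_duplicates(subset='Num')
print(len(original))                          # 6
print(modified_in_place.equals(reassigned))   # True
```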

 

 

 
