
Python skills for Data Analysis

Missing value processing.

In the real world, data sets contain a ton of missing values, and they distort the variables. So we need to handle them appropriately.

 

Here is the sample data set.

 

df.head()

 

 

First of all, we can get a rough count of the missing values in each column.

df.isnull().sum()
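
To make the output of that count concrete, here is a minimal sketch on a tiny made-up frame. The values are hypothetical; only the column names are borrowed from the data set used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real data set from this post
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "Fence": ["MnPrv", np.nan, "GdWo", "MnPrv"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull() marks every cell True/False; sum() counts the True cells per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Each entry of the result is the number of NaN cells in that column, so a column with no missing values shows 0.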

 

 

We can also visualize it. This is my go-to way.

import pandas as pd
import matplotlib.pyplot as plt

# Number of missing values per column, and the share of rows they represent
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().mean() * 100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

# Plot the 20 columns with the highest share of missing values
percent_data = percent.head(20)
percent_data.plot(kind="bar", figsize=(8, 6), fontsize=10)
plt.xlabel("Columns", fontsize=20)
plt.ylabel("Missing (%)", fontsize=20)
plt.title("Total Missing Value (%)", fontsize=20)

Wow, we've got a lot to do!
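
If you prefer a table to a chart, the same numbers can be collected in one frame. This is only a sketch on made-up data: `toy` and `summary` are names I picked for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real data set
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "Fence": ["MnPrv", np.nan, np.nan, "GdWo"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# mean() of the boolean mask is the fraction of missing cells per column
summary = pd.DataFrame({
    "Total": toy.isnull().sum(),
    "Percent": toy.isnull().mean() * 100,
})

# Keep only the columns that actually have missing values, worst first
summary = summary[summary["Total"] > 0].sort_values("Total", ascending=False)
print(summary)
```

Filtering on `Total > 0` keeps the table short when most columns are complete.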

 

Drop

We can simply drop the rows that contain NaN (missing values).

df = df.dropna(subset=['PoolQC'])

We should not forget what this means. We actually dropped every row that has NaN in PoolQC, so the values those rows held in all the other columns are lost too. Let's see.

Haha... Only the handful of rows with a PoolQC value are left, so just do not use it here.
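
One way to check the damage before committing is to compare the row counts. Here is a minimal sketch on made-up data (the frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: PoolQC is almost entirely missing
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd", np.nan],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

rows_before = len(toy)
# dropna returns a new frame, so toy itself is left untouched here
rows_after = len(toy.dropna(subset=["PoolQC"]))
print(f"{rows_before} rows before, {rows_after} rows after")
```

Because the result is not assigned back, nothing is lost if the count turns out to be too small.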

 

 

Another way is to drop the column.

df = df.drop(['PoolQC'], axis=1)

We just dropped the PoolQC column. It had only 7 values out of 1500, so I think dropping the column is the better way here.
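
When many columns are mostly empty, the same idea can be applied with a cut-off instead of naming each column. This is a sketch only: the 50% threshold (`max_missing`) is an assumption of mine, not a rule, and the data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, "Gd"],
    "Fence": ["MnPrv", np.nan, "GdWo", "MnPrv", "GdWo"],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

max_missing = 0.5  # assumed cut-off: drop columns with more than 50% NaN

# Fraction of missing cells per column
missing_ratio = toy.isnull().mean()
cols_to_drop = missing_ratio[missing_ratio > max_missing].index

reduced = toy.drop(columns=cols_to_drop)
print(list(reduced.columns))
```

The right cut-off depends on the data set, so it is worth checking which columns it would remove before dropping them.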

 

Fill the value manually.

If we can find the missing values ourselves (by googling, or because we already know them), we can fill them in manually, as long as there are not too many.

 

Let's say the row with id 1 has NaN in the Fence column, and I know it was originally 'MnPrv'. Then we can fill it like this.

df.loc[1,'Fence'] = 'MnPrv'

 

It gets quite hard if there are a lot of them, so I only do this when the values are important.
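
If there are a few known values rather than just one, keeping them in a dict makes the fix easier to review. A minimal sketch, where `known_values` and its contents are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data indexed by id
toy = pd.DataFrame(
    {"Fence": [np.nan, np.nan, "GdWo", np.nan]},
    index=[1, 2, 3, 4],
)

# Hypothetical corrections looked up by hand: index label -> known value
known_values = {1: "MnPrv", 4: "GdPrv"}

for row_label, value in known_values.items():
    toy.loc[row_label, "Fence"] = value

print(toy)
```

Rows that are not in the dict, like id 2 here, stay NaN and still need another strategy.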

 

mode()

The mode() function returns the most common value in a column. It is what I usually use to fill NaN.

mode = df['Fence'].mode()[0]
df['Fence'] = df['Fence'].fillna(mode)

Before / After

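mode() returns only the most frequent value, and [0] picks the first one when several values tie for the top. To fill with the second or third most common value instead, take it from df['Fence'].value_counts().index.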
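
Here is a small sketch of how mode() and value_counts() behave, using a made-up Series (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical values: MnPrv appears 3 times, GdWo twice, GdPrv once
fence = pd.Series(["MnPrv", "GdWo", "MnPrv", np.nan, "GdPrv", "GdWo", "MnPrv"])

# mode() returns the most frequent value(s); more than one entry only on ties
modes = fence.mode()
most_common = modes[0]

# value_counts() ranks every value by frequency (NaN is excluded by default)
ranked = fence.value_counts()
second_most_common = ranked.index[1]

# Fill the missing entries with the most common value
filled = fence.fillna(most_common)
print(most_common, second_most_common)
```

So value_counts() is the one to index into when you want something other than the top value.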

 

 

replace()

It replaces NaN with whatever value you set.

import numpy as np
df['Fence'] = df['Fence'].replace(np.nan, 'MM')

I replaced NaN with 'MM'.
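
For this particular use, replace() and fillna() with a fixed value should give the same result. A quick sketch on a made-up Series to check that:

```python
import numpy as np
import pandas as pd

# Hypothetical values
fence = pd.Series(["MnPrv", np.nan, "GdWo", np.nan])

via_replace = fence.replace(np.nan, "MM")
via_fillna = fence.fillna("MM")

print(via_replace.tolist())
```

Both calls return a new Series here, so the result has to be assigned back to the column to keep it.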

 

 

fillna() 

fillna() fills NaN with a specific value or by a method.

The first method is 'ffill', which means 'forward fill': each NaN is filled with the value from the row above.

The other method is 'bfill' (also spelled 'backfill'), which copies the value from the row below.

 

# Recent pandas versions deprecate fillna(method='ffill') in favor of ffill() / bfill()
df['FireplaceQu'] = df['FireplaceQu'].ffill()

 

 

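The downside is that a NaN can remain: with 'ffill', a NaN in the very first row has no value above it to copy (and with 'bfill', the same goes for the last row).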
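
A minimal sketch of that leftover NaN, and one possible way around it by chaining a back fill after the forward fill. The Series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the first row is missing
quality = pd.Series([np.nan, "TA", np.nan, "Gd", np.nan])

# Forward fill: the first row has nothing above it, so it stays NaN
forward = quality.ffill()

# Forward fill, then back fill to pick up the leftover first row
both = quality.ffill().bfill()

print(forward.isnull().sum(), both.isnull().sum())
```

Whether copying a neighboring row makes sense at all depends on the row order having some meaning, so treat this as an option rather than a default.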

 

notna()

The last one is quite similar to dropping the rows with NaN. As I said, we need to consider this seriously, because we may lose values in other columns that are very important.

But it can be useful in some cases, so I want to show it!

df = df[df['MiscFeature'].notna()]

 

It keeps only the rows that have a real (non-NaN) value in that column. We do not even have to call dropna().
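
To see that this filter ends up in the same place as dropping rows, here is a small sketch comparing the two on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "MiscFeature": ["Shed", np.nan, np.nan, "Gar2"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Boolean mask: True where MiscFeature has a real value
kept_by_mask = toy[toy["MiscFeature"].notna()]

# Same rows, selected by dropping the NaN rows instead
kept_by_dropna = toy.dropna(subset=["MiscFeature"])

print(kept_by_mask)
```

The mask version is handy when you want to combine it with other conditions in the same filter.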

 

 

Thanks for coming!!!
